Pre-trained language models and their applications
Pre-trained language models have achieved striking success in natural language
processing (NLP), leading to a paradigm shift from supervised learning to pre-training …
Position information in transformers: An overview
Transformers are arguably the main workhorse in recent natural language processing
research. By definition, a Transformer is invariant with respect to reordering of the input …
Lost in the middle: How language models use long contexts
While recent language models have the ability to take long contexts as input, relatively little
is known about how well they use longer context. We analyze the performance of language …
HyenaDNA: Long-range genomic sequence modeling at single nucleotide resolution
Genomic (DNA) sequences encode an enormous amount of information for gene regulation
and protein synthesis. Similar to natural language models, researchers have proposed …
EfficientNetV2: Smaller models and faster training
This paper introduces EfficientNetV2, a new family of convolutional networks that have faster
training speed and better parameter efficiency than previous models. To develop these …
MegaByte: Predicting million-byte sequences with multiscale transformers
Autoregressive transformers are spectacular models for short sequences but scale poorly to
long sequences such as high-resolution images, podcasts, code, or books. We propose …
Efficient methods for natural language processing: A survey
Recent work in natural language processing (NLP) has yielded appealing results from
scaling model parameters and training data; however, using only scale to improve …
Stabilizing transformer training by preventing attention entropy collapse
Training stability is of great importance to Transformers. In this work, we investigate the
training dynamics of Transformers by examining the evolution of the attention layers. In …
Efficient large scale language modeling with mixtures of experts
Mixture of Experts layers (MoEs) enable efficient scaling of language models through
conditional computation. This paper presents a detailed empirical study of how …
Findings of the BabyLM Challenge: Sample-efficient pretraining on developmentally plausible corpora
Children can acquire language from less than 100 million words of input. Large language
models are far less data-efficient: they typically require 3 or 4 orders of magnitude more data …