Pre-trained language models and their applications

H Wang, J Li, H Wu, E Hovy, Y Sun - Engineering, 2023 - Elsevier
Pre-trained language models have achieved striking success in natural language
processing (NLP), leading to a paradigm shift from supervised learning to pre-training …

Position information in transformers: An overview

P Dufter, M Schmitt, H Schütze - Computational Linguistics, 2022 - direct.mit.edu
Transformers are arguably the main workhorse in recent natural language processing
research. By definition, a Transformer is invariant with respect to reordering of the input …

Lost in the middle: How language models use long contexts

NF Liu, K Lin, J Hewitt, A Paranjape… - Transactions of the …, 2024 - direct.mit.edu
While recent language models have the ability to take long contexts as input, relatively little
is known about how well they use longer context. We analyze the performance of language …

HyenaDNA: Long-range genomic sequence modeling at single nucleotide resolution

E Nguyen, M Poli, M Faizi, A Thomas… - Advances in neural …, 2024 - proceedings.neurips.cc
Genomic (DNA) sequences encode an enormous amount of information for gene regulation
and protein synthesis. Similar to natural language models, researchers have proposed …

EfficientNetV2: Smaller models and faster training

M Tan, Q Le - International conference on machine learning, 2021 - proceedings.mlr.press
This paper introduces EfficientNetV2, a new family of convolutional networks that have faster
training speed and better parameter efficiency than previous models. To develop these …

Megabyte: Predicting million-byte sequences with multiscale transformers

L Yu, D Simig, C Flaherty… - Advances in …, 2023 - proceedings.neurips.cc
Autoregressive transformers are spectacular models for short sequences but scale poorly to
long sequences such as high-resolution images, podcasts, code, or books. We propose …

Efficient methods for natural language processing: A survey

M Treviso, JU Lee, T Ji, B Aken, Q Cao… - Transactions of the …, 2023 - direct.mit.edu
Recent work in natural language processing (NLP) has yielded appealing results from
scaling model parameters and training data; however, using only scale to improve …

Stabilizing transformer training by preventing attention entropy collapse

S Zhai, T Likhomanenko, E Littwin… - International …, 2023 - proceedings.mlr.press
Training stability is of great importance to Transformers. In this work, we investigate the
training dynamics of Transformers by examining the evolution of the attention layers. In …

Efficient large scale language modeling with mixtures of experts

M Artetxe, S Bhosale, N Goyal, T Mihaylov… - arXiv preprint arXiv …, 2021 - arxiv.org
Mixture of Experts layers (MoEs) enable efficient scaling of language models through
conditional computation. This paper presents a detailed empirical study of how …

Findings of the BabyLM Challenge: Sample-efficient pretraining on developmentally plausible corpora

A Warstadt, A Mueller, L Choshen… - … of the BabyLM …, 2023 - research-collection.ethz.ch
Children can acquire language from less than 100 million words of input. Large language
models are far less data-efficient: they typically require 3 or 4 orders of magnitude more data …