Pre-trained language models and their applications
Pre-trained language models have achieved striking success in natural language
processing (NLP), leading to a paradigm shift from supervised learning to pre-training …
Position information in transformers: An overview
Transformers are arguably the main workhorse in recent natural language processing
research. By definition, a Transformer is invariant with respect to reordering of the input …
Lost in the middle: How language models use long contexts
While recent language models have the ability to take long contexts as input, relatively little
is known about how well they use longer context. We analyze the performance of language …
HyenaDNA: Long-range genomic sequence modeling at single nucleotide resolution
Genomic (DNA) sequences encode an enormous amount of information for gene regulation
and protein synthesis. Similar to natural language models, researchers have proposed …
EfficientNetV2: Smaller models and faster training
This paper introduces EfficientNetV2, a new family of convolutional networks that have faster
training speed and better parameter efficiency than previous models. To develop these …
MegaByte: Predicting million-byte sequences with multiscale transformers
Autoregressive transformers are spectacular models for short sequences but scale poorly to
long sequences such as high-resolution images, podcasts, code, or books. We propose …
Efficient methods for natural language processing: A survey
Recent work in natural language processing (NLP) has yielded appealing results from
scaling model parameters and training data; however, using only scale to improve …
Stabilizing transformer training by preventing attention entropy collapse
Training stability is of great importance to Transformers. In this work, we investigate the
training dynamics of Transformers by examining the evolution of the attention layers. In …
Efficient large scale language modeling with mixtures of experts
Mixture of Experts layers (MoEs) enable efficient scaling of language models through
conditional computation. This paper presents a detailed empirical study of how …
Findings of the BabyLM Challenge: Sample-efficient pretraining on developmentally plausible corpora
Children can acquire language from less than 100 million words of input. Large language
models are far less data-efficient: they typically require 3 or 4 orders of magnitude more data …