Big Bird: Transformers for longer sequences

M Zaheer, G Guruganesh, KA Dubey… - Advances in neural …, 2020 - proceedings.neurips.cc
Transformer-based models, such as BERT, have been one of the most successful deep
learning models for NLP. Unfortunately, one of their core limitations is the quadratic …
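
The snippet ends before the method, but BigBird is known to replace dense self-attention with a sparse pattern that combines a local window, a handful of global tokens, and a few random links per query. Below is a minimal sketch of that kind of mask; the window size, token counts, and uniform random links are illustrative rather than the paper's block-sparse implementation.

```python
import torch

def bigbird_style_mask(seq_len, window=3, num_global=2, num_random=2, seed=0):
    # True = attention allowed. Sizes and the uniform random links are
    # illustrative, not the paper's exact block-sparse construction.
    gen = torch.Generator().manual_seed(seed)
    idx = torch.arange(seq_len)
    mask = (idx[:, None] - idx[None, :]).abs() <= window      # local window
    mask[:num_global, :] = True        # global tokens read everything
    mask[:, :num_global] = True        # everything reads the global tokens
    rand = torch.randint(0, seq_len, (seq_len, num_random), generator=gen)
    mask[idx[:, None], rand] = True    # a few random links per query
    return mask

print(bigbird_style_mask(seq_len=16).float().mean())   # fraction of (q, k) pairs kept
```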

Linformer: Self-attention with linear complexity

S Wang, BZ Li, M Khabsa, H Fang, H Ma - arXiv preprint arXiv:2006.04768, 2020 - arxiv.org
Large transformer models have shown extraordinary success in achieving state-of-the-art
results in many natural language processing applications. However, training and deploying …
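
The abstract is truncated, but Linformer's core idea is to project the keys and values along the sequence dimension down to a fixed length k with learned linear maps, so the attention map is n x k rather than n x n. A single-head sketch under that reading; dimensions, initialization, and names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinformerSelfAttention(nn.Module):
    """Single-head sketch: compress keys/values along the sequence axis
    from length n to k, so the attention map is n x k instead of n x n."""
    def __init__(self, d_model, seq_len, k=64):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.kv = nn.Linear(d_model, 2 * d_model)
        # Learned length-wise projections E, F: (n -> k).
        self.E = nn.Parameter(torch.randn(k, seq_len) / seq_len ** 0.5)
        self.F = nn.Parameter(torch.randn(k, seq_len) / seq_len ** 0.5)
        self.scale = d_model ** -0.5

    def forward(self, x):                              # x: (batch, n, d_model)
        q = self.q(x)
        k, v = self.kv(x).chunk(2, dim=-1)
        k = torch.einsum('kn,bnd->bkd', self.E, k)     # (batch, k, d)
        v = torch.einsum('kn,bnd->bkd', self.F, v)     # (batch, k, d)
        attn = F.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return attn @ v                                # (batch, n, d_model)

x = torch.randn(2, 128, 32)
print(LinformerSelfAttention(d_model=32, seq_len=128)(x).shape)  # (2, 128, 32)
```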

Random feature attention

H Peng, N Pappas, D Yogatama, R Schwartz… - arXiv preprint arXiv …, 2021 - arxiv.org
Transformers are state-of-the-art models for a variety of sequence modeling tasks. At their
core is an attention function which models pairwise interactions between the inputs at every …
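
Random feature attention approximates the softmax kernel with random features so that attention can be computed in time linear in the sequence length. The sketch below uses trigonometric random features for the Gaussian kernel with l2-normalized queries and keys and omits the paper's gating; the feature count and the small clamp are illustrative choices.

```python
import torch
import torch.nn.functional as F

def random_feature_map(x, w):
    # Random Fourier features for the Gaussian kernel:
    # phi(x) . phi(y) ~= exp(-||x - y||^2 / 2) for w ~ N(0, I).
    proj = x @ w.T                                       # (..., n, D)
    return torch.cat([proj.sin(), proj.cos()], dim=-1) / w.shape[0] ** 0.5

def rfa_attention(q, k, v, num_features=128):
    # With q, k l2-normalized, exp(q . k) is proportional to the Gaussian
    # kernel, so softmax attention can be approximated in linear time by
    # rearranging (phi(Q) phi(K)^T) V into phi(Q) (phi(K)^T V).
    w = torch.randn(num_features, q.shape[-1])
    q, k = F.normalize(q, dim=-1), F.normalize(k, dim=-1)
    phi_q, phi_k = random_feature_map(q, w), random_feature_map(k, w)
    kv = torch.einsum('bnr,bnd->brd', phi_k, v)          # (batch, 2D, d)
    z = 1.0 / torch.einsum('bnr,br->bn', phi_q, phi_k.sum(dim=1)).clamp(min=1e-6)
    return torch.einsum('bnr,brd,bn->bnd', phi_q, kv, z)

q = k = v = torch.randn(2, 256, 64)
print(rfa_attention(q, k, v).shape)                      # torch.Size([2, 256, 64])
```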

LongT5: Efficient text-to-text transformer for long sequences

M Guo, J Ainslie, D Uthus, S Ontanon, J Ni… - arXiv preprint arXiv …, 2021 - arxiv.org
Recent work has shown that either (1) increasing the input length or (2) increasing model
size can improve the performance of Transformer-based neural models. In this paper, we …

LongNet: Scaling transformers to 1,000,000,000 tokens

J Ding, S Ma, L Dong, X Zhang, S Huang… - arXiv preprint arXiv …, 2023 - arxiv.org
Scaling sequence length has become a critical demand in the era of large language models.
However, existing methods struggle with either computational complexity or model …
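
The snippet stops at the motivation; LongNet's mechanism is dilated attention, which splits the sequence into segments and attends densely only among every r-th position inside each segment, mixing several (segment, dilation) patterns. The sketch below shows a single pattern with illustrative sizes.

```python
import torch
import torch.nn.functional as F

def dilated_attention(q, k, v, segment=16, dilation=2):
    # One (segment, dilation) pattern of LongNet-style dilated attention:
    # the sequence is split into segments, every `dilation`-th position is
    # kept, and dense attention runs only among the kept positions of each
    # segment. LongNet mixes several such patterns; this shows just one.
    b, n, d = q.shape
    q = q.view(b, n // segment, segment, d)[:, :, ::dilation]
    k = k.view(b, n // segment, segment, d)[:, :, ::dilation]
    v = v.view(b, n // segment, segment, d)[:, :, ::dilation]
    attn = F.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)
    return attn @ v        # (batch, n/segment, segment/dilation, d)

x = torch.randn(1, 64, 32)
print(dilated_attention(x, x, x).shape)   # torch.Size([1, 4, 8, 32])
```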

Luna: Linear unified nested attention

X Ma, X Kong, S Wang, C Zhou, J May… - Advances in …, 2021 - proceedings.neurips.cc
The quadratic computational and memory complexities of the Transformer's attention
mechanism have limited its scalability for modeling long sequences. In this paper, we …
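
Luna introduces an extra learned sequence of fixed length l and replaces quadratic self-attention with two nested attentions: the short sequence packs the n inputs, and the inputs then attend over the l packed vectors, so both steps are linear in n. A minimal sketch without the paper's projections or normalization; the length l and dimensions are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def attend(q, k, v):
    # Plain scaled dot-product attention used for both nested steps.
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ v

class LunaAttention(nn.Module):
    """Two nested attentions through a short learned sequence p of length l:
    'pack' (p attends over the n inputs), then 'unpack' (the inputs attend
    over the l packed vectors)."""
    def __init__(self, d_model, num_p=16):
        super().__init__()
        self.p = nn.Parameter(torch.randn(num_p, d_model))

    def forward(self, x):                        # x: (batch, n, d)
        p = self.p.expand(x.shape[0], -1, -1)    # (batch, l, d)
        packed = attend(p, x, x)                 # (batch, l, d)
        return attend(x, packed, packed)         # (batch, n, d)

x = torch.randn(2, 512, 64)
print(LunaAttention(d_model=64)(x).shape)        # torch.Size([2, 512, 64])
```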

GMAT: Global memory augmentation for transformers

A Gupta, J Berant - arXiv preprint arXiv:2006.03274, 2020 - arxiv.org
Transformer-based models have become ubiquitous in natural language processing thanks
to their large capacity, innate parallelism and high performance. The contextualizing …
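
GMAT augments the input with a small set of global memory tokens that can read and be read by every position, while the remaining tokens keep a sparser (here, local) pattern. A sketch of that mask structure with illustrative sizes:

```python
import torch

def gmat_style_mask(seq_len, num_mem=8, window=4):
    # Mask for `num_mem` global memory tokens followed by `seq_len` input
    # tokens: memory attends everywhere and is attended by everyone, while
    # input tokens otherwise only see a local window. True = allowed.
    total = num_mem + seq_len
    mask = torch.zeros(total, total, dtype=torch.bool)
    mask[:num_mem, :] = True           # memory reads the whole sequence
    mask[:, :num_mem] = True           # every token reads the memory
    idx = torch.arange(seq_len)
    local = (idx[:, None] - idx[None, :]).abs() <= window
    mask[num_mem:, num_mem:] = local   # input tokens: local window only
    return mask

print(gmat_style_mask(seq_len=12).float().mean())   # fraction of pairs kept
```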

Do transformers need deep long-range memory?

JW Rae, A Razavi - arXiv preprint arXiv:2007.03356, 2020 - arxiv.org
Deep attention models have advanced the modelling of sequential data across many
domains. For language modelling in particular, the Transformer-XL--a Transformer …

Memory transformer

MS Burtsev, Y Kuratov, A Peganov… - arXiv preprint arXiv …, 2020 - arxiv.org
Transformer-based models have achieved state-of-the-art results in many natural language
processing tasks. The self-attention architecture allows the transformer to combine information …
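
The memory transformer idea is to prepend a handful of learned [mem] vectors to the token embeddings and process the extended sequence with an otherwise standard Transformer encoder. A minimal sketch with illustrative hyperparameters:

```python
import torch
import torch.nn as nn

class MemoryTransformerEncoder(nn.Module):
    """Sketch of the memory-token idea: a small set of learned [mem] vectors
    is prepended to the token embeddings, processed by an ordinary
    Transformer encoder, and stripped from the output."""
    def __init__(self, d_model=64, num_mem=10, nhead=4, num_layers=2):
        super().__init__()
        self.mem = nn.Parameter(torch.randn(num_mem, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.num_mem = num_mem

    def forward(self, x):                                # x: (batch, n, d)
        mem = self.mem.expand(x.shape[0], -1, -1)        # (batch, m, d)
        out = self.encoder(torch.cat([mem, x], dim=1))   # (batch, m + n, d)
        return out[:, self.num_mem:]                     # token outputs only

x = torch.randn(2, 32, 64)
print(MemoryTransformerEncoder()(x).shape)               # torch.Size([2, 32, 64])
```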

Mega: Moving average equipped gated attention

X Ma, C Zhou, X Kong, J He, L Gui, G Neubig… - arXiv preprint arXiv …, 2022 - arxiv.org
The design choices in the Transformer attention mechanism, including weak inductive bias
and quadratic computational complexity, have limited its application for modeling long …
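
Mega combines a learnable, damped exponential moving average with single-head gated attention. The sketch below is a heavily simplified reading: a per-channel damped EMA supplies the recency bias, and its output drives an attention step whose result is gated against the input; the paper's multi-dimensional EMA and exact gating functions are not reproduced.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimplifiedMegaLayer(nn.Module):
    """Much-simplified sketch of the Mega idea: a learnable damped EMA over
    the sequence injects a local/recency inductive bias, and its output
    drives a single-head attention whose result is gated against the input."""
    def __init__(self, d_model):
        super().__init__()
        self.alpha = nn.Parameter(torch.full((d_model,), 0.5))  # EMA decay logits
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.gate = nn.Linear(d_model, d_model)

    def ema(self, x):                              # x: (batch, n, d)
        a = self.alpha.sigmoid()
        out, state = [], torch.zeros_like(x[:, 0])
        for t in range(x.shape[1]):                # damped recurrent average
            state = a * x[:, t] + (1 - a) * state
            out.append(state)
        return torch.stack(out, dim=1)

    def forward(self, x):
        h = self.ema(x)
        q, k, v = self.qkv(h).chunk(3, dim=-1)
        attn = F.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        g = torch.sigmoid(self.gate(h))
        return g * (attn @ v) + (1 - g) * x        # gated residual mix

x = torch.randn(2, 64, 32)
print(SimplifiedMegaLayer(32)(x).shape)            # torch.Size([2, 64, 32])
```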