Big Bird: Transformers for longer sequences
Transformer-based models, such as BERT, have been among the most successful deep
learning models for NLP. Unfortunately, one of their core limitations is the quadratic …
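The quadratic limitation these abstracts refer to comes from full softmax attention, which scores every pair of positions in the input. As a rough, self-contained NumPy sketch (illustrative only, not code from any of the listed papers), the n x n score matrix is where the quadratic time and memory cost arises:

    import numpy as np

    def softmax_attention(Q, K, V):
        # Q, K, V have shape (n, d). The score matrix S has shape (n, n):
        # one entry per pair of positions, hence O(n^2) time and memory.
        d = Q.shape[-1]
        S = Q @ K.T / np.sqrt(d)                      # (n, n) pairwise scores
        P = np.exp(S - S.max(axis=-1, keepdims=True))
        P = P / P.sum(axis=-1, keepdims=True)         # row-wise softmax
        return P @ V                                  # (n, d) output

    # Toy usage: 6 tokens, 4-dimensional representations, self-attention.
    rng = np.random.default_rng(0)
    X = rng.standard_normal((6, 4))
    print(softmax_attention(X, X, X).shape)           # (6, 4)

Sparse-attention models such as Big Bird keep only a subset of these pairwise scores (local, random, and a few global positions), so the cost grows roughly linearly with sequence length.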
Linformer: Self-attention with linear complexity
Large transformer models have shown extraordinary success in achieving state-of-the-art
results in many natural language processing applications. However, training and deploying …
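Linformer's linear complexity comes from compressing the keys and values along the sequence dimension with learned low-rank projections. The sketch below shows that general idea only; the projection matrices here are random stand-ins for the learned ones, and the real model adds heads, masking, and parameter sharing:

    import numpy as np

    def softmax(x):
        x = x - x.max(axis=-1, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=-1, keepdims=True)

    def low_rank_attention(Q, K, V, E, F):
        # E, F have shape (k, n): they project the n keys/values down to
        # k "summary" rows, so the score matrix is (n, k) instead of (n, n).
        d = Q.shape[-1]
        K_proj, V_proj = E @ K, F @ V                 # (k, d) each
        S = Q @ K_proj.T / np.sqrt(d)                 # (n, k): linear in n
        return softmax(S) @ V_proj                    # (n, d) output

    n, d, k = 8, 4, 2
    rng = np.random.default_rng(0)
    X = rng.standard_normal((n, d))
    E = rng.standard_normal((k, n)) / np.sqrt(n)
    F = rng.standard_normal((k, n)) / np.sqrt(n)
    print(low_rank_attention(X, X, X, E, F).shape)    # (8, 4)

For a fixed projection size k, both time and memory then scale as O(n*k) rather than O(n^2).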
Random feature attention
Transformers are state-of-the-art models for a variety of sequence modeling tasks. At their
core is an attention function which models pairwise interactions between the inputs at every …
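A different route to linear complexity, which random-feature methods take, is to replace the softmax kernel exp(q . k) with a dot product of feature maps phi(q) . phi(k); the attention output can then be computed by aggregating phi(K)^T V once and reusing it for every query. The sketch below uses simple positive random features as a stand-in for the paper's exact feature map, so treat it as an illustration of the linear-time idea only (the 1/sqrt(d) scaling is assumed to be folded into Q and K beforehand):

    import numpy as np

    def random_features(X, W):
        # phi(x) = exp(W^T x - ||x||^2 / 2) / sqrt(m), chosen so that on
        # average phi(q) . phi(k) approximates exp(q . k).
        m = W.shape[1]
        return np.exp(X @ W - (X ** 2).sum(-1, keepdims=True) / 2) / np.sqrt(m)

    def linear_attention(Q, K, V, W):
        # The (n, n) score matrix is never formed: phi(K)^T V is an (m, d)
        # summary reused for every query, so the cost is linear in n.
        Qf, Kf = random_features(Q, W), random_features(K, W)  # (n, m)
        KV = Kf.T @ V                                 # (m, d)
        Z = Kf.sum(axis=0)                            # (m,) normalizers
        return (Qf @ KV) / (Qf @ Z)[:, None]          # (n, d) output

    n, d, m = 8, 4, 64
    rng = np.random.default_rng(0)
    X = rng.standard_normal((n, d))
    W = rng.standard_normal((d, m))
    print(linear_attention(X, X, X, W).shape)         # (8, 4)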
LongT5: Efficient text-to-text transformer for long sequences
Recent work has shown that either (1) increasing the input length or (2) increasing model
size can improve the performance of Transformer-based neural models. In this paper, we …
LongNet: Scaling transformers to 1,000,000,000 tokens
Scaling sequence length has become a critical demand in the era of large language models.
However, existing methods struggle with either computational complexity or model …
Luna: Linear unified nested attention
The quadratic computational and memory complexities of the Transformer's attention
mechanism have limited its scalability for modeling long sequences. In this paper, we …
GMAT: Global memory augmentation for transformers
Transformer-based models have become ubiquitous in natural language processing thanks
to their large capacity, innate parallelism and high performance. The contextualizing …
Memory transformer
MS Burtsev, Y Kuratov, A Peganov… - arXiv preprint arXiv …, 2020 - arxiv.org
Transformer-based models have achieved state-of-the-art results in many natural language
processing tasks. The self-attention architecture allows the transformer to combine information …
Mega: Moving average equipped gated attention
The design choices in the Transformer attention mechanism, including weak inductive bias
and quadratic computational complexity, have limited its application for modeling long …