Big Bird: Transformers for longer sequences

M Zaheer, G Guruganesh, KA Dubey… - Advances in neural …, 2020 - proceedings.neurips.cc
Transformer-based models, such as BERT, have been one of the most successful deep
learning models for NLP. Unfortunately, one of their core limitations is the quadratic …
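
The snippet ends before the method, but BigBird is known to replace dense self-attention with a sparse pattern that combines a local window, a handful of global tokens, and a few random links per query. Below is a minimal sketch of that kind of mask; the window size, token counts, and uniform random links are illustrative rather than the paper's block-sparse implementation.

```python
import torch

def bigbird_style_mask(seq_len, window=3, num_global=2, num_random=2, seed=0):
    # True = attention allowed. Sizes and the uniform random links are
    # illustrative, not the paper's exact block-sparse construction.
    gen = torch.Generator().manual_seed(seed)
    idx = torch.arange(seq_len)
    mask = (idx[:, None] - idx[None, :]).abs() <= window      # local window
    mask[:num_global, :] = True        # global tokens read everything
    mask[:, :num_global] = True        # everything reads the global tokens
    rand = torch.randint(0, seq_len, (seq_len, num_random), generator=gen)
    mask[idx[:, None], rand] = True    # a few random links per query
    return mask

print(bigbird_style_mask(seq_len=16).float().mean())   # fraction of (q, k) pairs kept
```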

Linformer: Self-attention with linear complexity

S Wang, BZ Li, M Khabsa, H Fang, H Ma - arXiv preprint arXiv:2006.04768, 2020 - arxiv.org
Large transformer models have shown extraordinary success in achieving state-of-the-art
results in many natural language processing applications. However, training and deploying …
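
The abstract is truncated, but Linformer's core idea is to project the keys and values along the sequence dimension down to a fixed length k with learned linear maps, so the attention map is n x k rather than n x n. A single-head sketch under that reading; dimensions, initialization, and names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinformerSelfAttention(nn.Module):
    """Single-head sketch: compress keys/values along the sequence axis
    from length n to k, so the attention map is n x k instead of n x n."""
    def __init__(self, d_model, seq_len, k=64):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.kv = nn.Linear(d_model, 2 * d_model)
        # Learned length-wise projections E, F: (n -> k).
        self.E = nn.Parameter(torch.randn(k, seq_len) / seq_len ** 0.5)
        self.F = nn.Parameter(torch.randn(k, seq_len) / seq_len ** 0.5)
        self.scale = d_model ** -0.5

    def forward(self, x):                              # x: (batch, n, d_model)
        q = self.q(x)
        k, v = self.kv(x).chunk(2, dim=-1)
        k = torch.einsum('kn,bnd->bkd', self.E, k)     # (batch, k, d)
        v = torch.einsum('kn,bnd->bkd', self.F, v)     # (batch, k, d)
        attn = F.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return attn @ v                                # (batch, n, d_model)

x = torch.randn(2, 128, 32)
print(LinformerSelfAttention(d_model=32, seq_len=128)(x).shape)  # (2, 128, 32)
```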

Random feature attention

H Peng, N Pappas, D Yogatama, R Schwartz… - arXiv preprint arXiv …, 2021 - arxiv.org
Transformers are state-of-the-art models for a variety of sequence modeling tasks. At their
core is an attention function which models pairwise interactions between the inputs at every …
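
Random feature attention approximates the softmax kernel with random features so that attention can be computed in time linear in the sequence length. The sketch below uses trigonometric random features for the Gaussian kernel with l2-normalized queries and keys and omits the paper's gating; the feature count and the small clamp are illustrative choices.

```python
import torch
import torch.nn.functional as F

def random_feature_map(x, w):
    # Random Fourier features for the Gaussian kernel:
    # phi(x) . phi(y) ~= exp(-||x - y||^2 / 2) for w ~ N(0, I).
    proj = x @ w.T                                       # (..., n, D)
    return torch.cat([proj.sin(), proj.cos()], dim=-1) / w.shape[0] ** 0.5

def rfa_attention(q, k, v, num_features=128):
    # With q, k l2-normalized, exp(q . k) is proportional to the Gaussian
    # kernel, so softmax attention can be approximated in linear time by
    # rearranging (phi(Q) phi(K)^T) V into phi(Q) (phi(K)^T V).
    w = torch.randn(num_features, q.shape[-1])
    q, k = F.normalize(q, dim=-1), F.normalize(k, dim=-1)
    phi_q, phi_k = random_feature_map(q, w), random_feature_map(k, w)
    kv = torch.einsum('bnr,bnd->brd', phi_k, v)          # (batch, 2D, d)
    z = 1.0 / torch.einsum('bnr,br->bn', phi_q, phi_k.sum(dim=1)).clamp(min=1e-6)
    return torch.einsum('bnr,brd,bn->bnd', phi_q, kv, z)

q = k = v = torch.randn(2, 256, 64)
print(rfa_attention(q, k, v).shape)                      # torch.Size([2, 256, 64])
```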

LongT5: Efficient text-to-text transformer for long sequences

M Guo, J Ainslie, D Uthus, S Ontanon, J Ni… - arXiv preprint arXiv …, 2021 - arxiv.org
Recent work has shown that either (1) increasing the input length or (2) increasing model
size can improve the performance of Transformer-based neural models. In this paper, we …

LongNet: Scaling transformers to 1,000,000,000 tokens

J Ding, S Ma, L Dong, X Zhang, S Huang… - arXiv preprint arXiv …, 2023 - arxiv.org
Scaling sequence length has become a critical demand in the era of large language models.
However, existing methods struggle with either computational complexity or model …
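
The snippet stops at the motivation; LongNet's mechanism is dilated attention, which splits the sequence into segments and attends densely only among every r-th position inside each segment, mixing several (segment, dilation) patterns. The sketch below shows a single pattern with illustrative sizes.

```python
import torch
import torch.nn.functional as F

def dilated_attention(q, k, v, segment=16, dilation=2):
    # One (segment, dilation) pattern of LongNet-style dilated attention:
    # the sequence is split into segments, every `dilation`-th position is
    # kept, and dense attention runs only among the kept positions of each
    # segment. LongNet mixes several such patterns; this shows just one.
    b, n, d = q.shape
    q = q.view(b, n // segment, segment, d)[:, :, ::dilation]
    k = k.view(b, n // segment, segment, d)[:, :, ::dilation]
    v = v.view(b, n // segment, segment, d)[:, :, ::dilation]
    attn = F.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)
    return attn @ v        # (batch, n/segment, segment/dilation, d)

x = torch.randn(1, 64, 32)
print(dilated_attention(x, x, x).shape)   # torch.Size([1, 4, 8, 32])
```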

Luna: Linear unified nested attention

X Ma, X Kong, S Wang, C Zhou, J May… - Advances in …, 2021 - proceedings.neurips.cc
The quadratic computational and memory complexities of the Transformer's attention
mechanism have limited its scalability for modeling long sequences. In this paper, we …
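
Luna introduces an extra learned sequence of fixed length l and replaces quadratic self-attention with two nested attentions: the short sequence packs the n inputs, and the inputs then attend over the l packed vectors, so both steps are linear in n. A minimal sketch without the paper's projections or normalization; the length l and dimensions are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def attend(q, k, v):
    # Plain scaled dot-product attention used for both nested steps.
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ v

class LunaAttention(nn.Module):
    """Two nested attentions through a short learned sequence p of length l:
    'pack' (p attends over the n inputs), then 'unpack' (the inputs attend
    over the l packed vectors)."""
    def __init__(self, d_model, num_p=16):
        super().__init__()
        self.p = nn.Parameter(torch.randn(num_p, d_model))

    def forward(self, x):                        # x: (batch, n, d)
        p = self.p.expand(x.shape[0], -1, -1)    # (batch, l, d)
        packed = attend(p, x, x)                 # (batch, l, d)
        return attend(x, packed, packed)         # (batch, n, d)

x = torch.randn(2, 512, 64)
print(LunaAttention(d_model=64)(x).shape)        # torch.Size([2, 512, 64])
```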

GMAT: Global memory augmentation for transformers

A Gupta, J Berant - arXiv preprint arXiv:2006.03274, 2020 - arxiv.org
Transformer-based models have become ubiquitous in natural language processing thanks
to their large capacity, innate parallelism and high performance. The contextualizing …
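
GMAT augments the input with a small set of global memory tokens that can read and be read by every position, while the remaining tokens keep a sparser (here, local) pattern. A sketch of that mask structure with illustrative sizes:

```python
import torch

def gmat_style_mask(seq_len, num_mem=8, window=4):
    # Mask for `num_mem` global memory tokens followed by `seq_len` input
    # tokens: memory attends everywhere and is attended by everyone, while
    # input tokens otherwise only see a local window. True = allowed.
    total = num_mem + seq_len
    mask = torch.zeros(total, total, dtype=torch.bool)
    mask[:num_mem, :] = True           # memory reads the whole sequence
    mask[:, :num_mem] = True           # every token reads the memory
    idx = torch.arange(seq_len)
    local = (idx[:, None] - idx[None, :]).abs() <= window
    mask[num_mem:, num_mem:] = local   # input tokens: local window only
    return mask

print(gmat_style_mask(seq_len=12).float().mean())   # fraction of pairs kept
```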

Do transformers need deep long-range memory?

JW Rae, A Razavi - arXiv preprint arXiv:2007.03356, 2020 - arxiv.org
Deep attention models have advanced the modelling of sequential data across many
domains. For language modelling in particular, the Transformer-XL--a Transformer …

Memory transformer

MS Burtsev, Y Kuratov, A Peganov… - arXiv preprint arXiv …, 2020 - arxiv.org
Transformer-based models have achieved state-of-the-art results in many natural language
processing tasks. The self-attention architecture allows the transformer to combine information …
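
The memory transformer idea is to prepend a handful of learned [mem] vectors to the token embeddings and process the extended sequence with an otherwise standard Transformer encoder. A minimal sketch with illustrative hyperparameters:

```python
import torch
import torch.nn as nn

class MemoryTransformerEncoder(nn.Module):
    """Sketch of the memory-token idea: a small set of learned [mem] vectors
    is prepended to the token embeddings, processed by an ordinary
    Transformer encoder, and stripped from the output."""
    def __init__(self, d_model=64, num_mem=10, nhead=4, num_layers=2):
        super().__init__()
        self.mem = nn.Parameter(torch.randn(num_mem, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.num_mem = num_mem

    def forward(self, x):                                # x: (batch, n, d)
        mem = self.mem.expand(x.shape[0], -1, -1)        # (batch, m, d)
        out = self.encoder(torch.cat([mem, x], dim=1))   # (batch, m + n, d)
        return out[:, self.num_mem:]                     # token outputs only

x = torch.randn(2, 32, 64)
print(MemoryTransformerEncoder()(x).shape)               # torch.Size([2, 32, 64])
```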

Mega: Moving average equipped gated attention

X Ma, C Zhou, X Kong, J He, L Gui, G Neubig… - arXiv preprint arXiv …, 2022 - arxiv.org
The design choices in the Transformer attention mechanism, including weak inductive bias
and quadratic computational complexity, have limited its application for modeling long …
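
Mega combines a learnable, damped exponential moving average with single-head gated attention. The sketch below is a heavily simplified reading: a per-channel damped EMA supplies the recency bias, and its output drives an attention step whose result is gated against the input; the paper's multi-dimensional EMA and exact gating functions are not reproduced.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimplifiedMegaLayer(nn.Module):
    """Much-simplified sketch of the Mega idea: a learnable damped EMA over
    the sequence injects a local/recency inductive bias, and its output
    drives a single-head attention whose result is gated against the input."""
    def __init__(self, d_model):
        super().__init__()
        self.alpha = nn.Parameter(torch.full((d_model,), 0.5))  # EMA decay logits
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.gate = nn.Linear(d_model, d_model)

    def ema(self, x):                              # x: (batch, n, d)
        a = self.alpha.sigmoid()
        out, state = [], torch.zeros_like(x[:, 0])
        for t in range(x.shape[1]):                # damped recurrent average
            state = a * x[:, t] + (1 - a) * state
            out.append(state)
        return torch.stack(out, dim=1)

    def forward(self, x):
        h = self.ema(x)
        q, k, v = self.qkv(h).chunk(3, dim=-1)
        attn = F.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        g = torch.sigmoid(self.gate(h))
        return g * (attn @ v) + (1 - g) * x        # gated residual mix

x = torch.randn(2, 64, 32)
print(SimplifiedMegaLayer(32)(x).shape)            # torch.Size([2, 64, 32])
```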