Transformers learn shortcuts to automata

B Liu, JT Ash, S Goel, A Krishnamurthy… - arXiv preprint arXiv …, 2022 - arxiv.org
Algorithmic reasoning requires capabilities which are most naturally understood through
recurrent models of computation, like the Turing machine. However, Transformer models …

SOFT: Softmax-free transformer with linear complexity

J Lu, J Yao, J Zhang, X Zhu, H Xu… - Advances in …, 2021 - proceedings.neurips.cc
Vision transformers (ViTs) have pushed the state-of-the-art for various visual recognition
tasks by patch-wise image tokenization followed by self-attention. However, the employment …
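
The snippet summarizes the ViT recipe that SOFT starts from: cut the image into patches, embed each patch as a token, and run softmax self-attention over the token sequence. The NumPy sketch below illustrates that baseline pipeline under assumed shapes (224x224 input, 16x16 patches, a single head, random projections); it is not the SOFT model itself, whose point is removing the softmax.

```python
import numpy as np

rng = np.random.default_rng(0)

def patchify(img, p=16):
    """Split an (H, W, C) image into flattened non-overlapping p x p patches."""
    H, W, C = img.shape
    h, w = H // p, W // p
    x = img[: h * p, : w * p].reshape(h, p, w, p, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(h * w, p * p * C)

def self_attention(x, Wq, Wk, Wv):
    """Single-head softmax self-attention over a token sequence x of shape (N, d)."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])            # input-specific attention logits
    attn = np.exp(scores - scores.max(-1, keepdims=True))
    attn /= attn.sum(-1, keepdims=True)                # row-wise softmax
    return attn @ v

img = rng.standard_normal((224, 224, 3))
d = 64
W_embed = rng.standard_normal((16 * 16 * 3, d)) * 0.02  # patch -> token projection (illustrative)
tokens = patchify(img) @ W_embed                        # (196, 64) token sequence
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.02 for _ in range(3))
out = self_attention(tokens, Wq, Wk, Wv)
print(out.shape)                                        # (196, 64)
```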

Gated linear attention transformers with hardware-efficient training

S Yang, B Wang, Y Shen, R Panda, Y Kim - arXiv preprint arXiv …, 2023 - arxiv.org
Transformers with linear attention allow for efficient parallel training but can simultaneously
be formulated as an RNN with 2D (matrix-valued) hidden states, thus enjoying linear (with …
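
The dual view the snippet refers to is that linear (non-softmax) attention can be computed as a recurrence whose hidden state is a d_k x d_v matrix, so each decoding step costs the same regardless of sequence length. The sketch below uses a scalar per-step gate as a simplified stand-in for the paper's gating, and shows only this recurrent inference form, not the hardware-efficient parallel training algorithm.

```python
import numpy as np

def gated_linear_attention(q, k, v, g):
    """q, k: (T, d_k); v: (T, d_v); g: (T,) per-step gates in (0, 1)."""
    T, d_k = q.shape
    d_v = v.shape[1]
    S = np.zeros((d_k, d_v))                  # matrix-valued hidden state
    out = np.empty((T, d_v))
    for t in range(T):
        S = g[t] * S + np.outer(k[t], v[t])   # decay old state, add new key-value outer product
        out[t] = q[t] @ S                     # read out the state with the current query
    return out

rng = np.random.default_rng(0)
T, d_k, d_v = 8, 4, 4
q = rng.standard_normal((T, d_k))
k = rng.standard_normal((T, d_k))
v = rng.standard_normal((T, d_v))
g = np.full(T, 0.9)                           # constant scalar gate, for illustration only
print(gated_linear_attention(q, k, v, g).shape)   # (8, 4)
```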

Sequencer: Deep LSTM for image classification

Y Tatsunami, M Taki - Advances in Neural Information …, 2022 - proceedings.neurips.cc
In recent computer vision research, the advent of the Vision Transformer (ViT) has rapidly
revolutionized various architectural design efforts: ViT achieved state-of-the-art image …

Batch prompting: Efficient inference with large language model APIs

Z Cheng, J Kasai, T Yu - arXiv preprint arXiv:2301.08721, 2023 - arxiv.org
Performing inference on large volumes of samples with large language models (LLMs) can
be computationally and financially costly in industry and real-world use. We propose batch …
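
The batch prompting idea the abstract introduces is to pack several samples into a single prompt so that one LLM API call answers all of them, amortizing the per-call cost. A rough illustration follows; the prompt template and the `call_llm` client are placeholders, not the paper's exact format or any real API.

```python
def build_batch_prompt(questions):
    """Pack several questions into one prompt (template is an assumption, not the paper's)."""
    lines = ["Answer each question. Prefix the i-th answer with 'A[i]:'.", ""]
    for i, q in enumerate(questions, 1):
        lines.append(f"Q[{i}]: {q}")
    return "\n".join(lines)

def parse_batch_answers(text, n):
    """Recover per-question answers from response lines of the form 'A[i]: ...'."""
    answers = {}
    for line in text.splitlines():
        if line.startswith("A[") and "]:" in line:
            idx, ans = line[2:].split("]:", 1)
            if idx.isdigit():
                answers[int(idx)] = ans.strip()
    return [answers.get(i, "") for i in range(1, n + 1)]

questions = ["What is 2 + 2?", "What is the capital of France?"]
prompt = build_batch_prompt(questions)
# response = call_llm(prompt)                  # hypothetical client: one request for all questions
# print(parse_batch_answers(response, len(questions)))
```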

Is Mamba capable of in-context learning?

R Grazzi, J Siems, S Schrodi, T Brox… - arXiv preprint arXiv …, 2024 - arxiv.org
This work provides empirical evidence that Mamba, a newly proposed selective structured
state space model, has in-context learning (ICL) capabilities similar to those of transformers. We …

Fast nearest neighbor machine translation

Y Meng, X Li, X Zheng, F Wu, X Sun, T Zhang… - arXiv preprint arXiv …, 2021 - arxiv.org
Though nearest neighbor Machine Translation (kNN-MT; Khandelwal et al., 2020) has proved to introduce
significant performance boosts over standard neural MT systems, it …
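
For context, kNN-MT (Khandelwal et al., 2020) augments each decoding step by retrieving the k datastore entries nearest to the decoder state and interpolating the resulting token distribution with the base model's. The sketch below shows that vanilla step, which this paper then accelerates; the datastore size, k, temperature, and mixing weight are illustrative.

```python
import numpy as np

def knn_mt_step(p_mt, hidden, keys, values, vocab_size, k=4, temp=10.0, lam=0.5):
    """p_mt: (V,) model distribution; hidden: (d,) decoder state;
    keys: (N, d) datastore representations; values: (N,) their target token ids."""
    d2 = np.sum((keys - hidden) ** 2, axis=1)     # squared L2 distance to every datastore key
    nn = np.argsort(d2)[:k]                       # indices of the k nearest neighbors
    w = np.exp(-d2[nn] / temp)
    w /= w.sum()
    p_knn = np.zeros(vocab_size)
    np.add.at(p_knn, values[nn], w)               # aggregate neighbor weight per target token
    return lam * p_knn + (1.0 - lam) * p_mt       # interpolate retrieval and model distributions

rng = np.random.default_rng(0)
V, d, N = 100, 8, 1000
p_mt = np.full(V, 1.0 / V)
hidden = rng.standard_normal(d)
keys = rng.standard_normal((N, d))
values = rng.integers(0, V, size=N)
p = knn_mt_step(p_mt, hidden, keys, values, V)
print(round(p.sum(), 6))                          # 1.0: still a valid distribution
```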

How much does attention actually attend? Questioning the importance of attention in pretrained transformers

M Hassid, H Peng, D Rotem, J Kasai, I Montero… - arXiv preprint arXiv …, 2022 - arxiv.org
The attention mechanism is considered the backbone of the widely-used Transformer
architecture. It contextualizes the input by computing input-specific attention matrices. We …

Bidimensional leaderboards: Generate and evaluate language hand in hand

J Kasai, K Sakaguchi, RL Bras, L Dunagan… - arXiv preprint arXiv …, 2021 - arxiv.org
Natural language processing researchers have identified limitations of evaluation
methodology for generation tasks, with new questions raised about the validity of automatic …

Simple linear attention language models balance the recall-throughput tradeoff

S Arora, S Eyuboglu, M Zhang, A Timalsina… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent work has shown that attention-based language models excel at recall, the ability to
ground generations in tokens previously seen in context. However, the efficiency of attention …