Transformers learn shortcuts to automata

B Liu, JT Ash, S Goel, A Krishnamurthy… - arXiv preprint arXiv …, 2022 - arxiv.org
Algorithmic reasoning requires capabilities which are most naturally understood through
recurrent models of computation, like the Turing machine. However, Transformer models …

SOFT: Softmax-free transformer with linear complexity

J Lu, J Yao, J Zhang, X Zhu, H Xu… - Advances in …, 2021 - proceedings.neurips.cc
Vision transformers (ViTs) have pushed the state-of-the-art for various visual recognition
tasks by patch-wise image tokenization followed by self-attention. However, the employment …
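
The snippet summarizes the ViT recipe that SOFT starts from: cut the image into patches, embed each patch as a token, and run softmax self-attention over the token sequence. The NumPy sketch below illustrates that baseline pipeline under assumed shapes (224x224 input, 16x16 patches, a single head, random projections); it is not the SOFT model itself, whose point is removing the softmax.

```python
import numpy as np

rng = np.random.default_rng(0)

def patchify(img, p=16):
    """Split an (H, W, C) image into flattened non-overlapping p x p patches."""
    H, W, C = img.shape
    h, w = H // p, W // p
    x = img[: h * p, : w * p].reshape(h, p, w, p, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(h * w, p * p * C)

def self_attention(x, Wq, Wk, Wv):
    """Single-head softmax self-attention over a token sequence x of shape (N, d)."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])            # input-specific attention logits
    attn = np.exp(scores - scores.max(-1, keepdims=True))
    attn /= attn.sum(-1, keepdims=True)                # row-wise softmax
    return attn @ v

img = rng.standard_normal((224, 224, 3))
d = 64
W_embed = rng.standard_normal((16 * 16 * 3, d)) * 0.02  # patch -> token projection (illustrative)
tokens = patchify(img) @ W_embed                        # (196, 64) token sequence
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.02 for _ in range(3))
out = self_attention(tokens, Wq, Wk, Wv)
print(out.shape)                                        # (196, 64)
```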

Gated linear attention transformers with hardware-efficient training

S Yang, B Wang, Y Shen, R Panda, Y Kim - arXiv preprint arXiv …, 2023 - arxiv.org
Transformers with linear attention allow for efficient parallel training but can simultaneously
be formulated as an RNN with 2D (matrix-valued) hidden states, thus enjoying linear (with …
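
The dual view the snippet refers to is that linear (non-softmax) attention can be computed as a recurrence whose hidden state is a d_k x d_v matrix, so each decoding step costs the same regardless of sequence length. The sketch below uses a scalar per-step gate as a simplified stand-in for the paper's gating, and shows only this recurrent inference form, not the hardware-efficient parallel training algorithm.

```python
import numpy as np

def gated_linear_attention(q, k, v, g):
    """q, k: (T, d_k); v: (T, d_v); g: (T,) per-step gates in (0, 1)."""
    T, d_k = q.shape
    d_v = v.shape[1]
    S = np.zeros((d_k, d_v))                  # matrix-valued hidden state
    out = np.empty((T, d_v))
    for t in range(T):
        S = g[t] * S + np.outer(k[t], v[t])   # decay old state, add new key-value outer product
        out[t] = q[t] @ S                     # read out the state with the current query
    return out

rng = np.random.default_rng(0)
T, d_k, d_v = 8, 4, 4
q = rng.standard_normal((T, d_k))
k = rng.standard_normal((T, d_k))
v = rng.standard_normal((T, d_v))
g = np.full(T, 0.9)                           # constant scalar gate, for illustration only
print(gated_linear_attention(q, k, v, g).shape)   # (8, 4)
```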

Sequencer: Deep LSTM for image classification

Y Tatsunami, M Taki - Advances in Neural Information …, 2022 - proceedings.neurips.cc
In recent computer vision research, the advent of the Vision Transformer (ViT) has rapidly
revolutionized various architectural design efforts: ViT achieved state-of-the-art image …

Batch prompting: Efficient inference with large language model APIs

Z Cheng, J Kasai, T Yu - arXiv preprint arXiv:2301.08721, 2023 - arxiv.org
Performing inference on large volumes of samples with large language models (LLMs) can
be computationally and financially costly in industry and real-world use. We propose batch …
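
The batch prompting idea the abstract introduces is to pack several samples into a single prompt so that one LLM API call answers all of them, amortizing the per-call cost. A rough illustration follows; the prompt template and the `call_llm` client are placeholders, not the paper's exact format or any real API.

```python
def build_batch_prompt(questions):
    """Pack several questions into one prompt (template is an assumption, not the paper's)."""
    lines = ["Answer each question. Prefix the i-th answer with 'A[i]:'.", ""]
    for i, q in enumerate(questions, 1):
        lines.append(f"Q[{i}]: {q}")
    return "\n".join(lines)

def parse_batch_answers(text, n):
    """Recover per-question answers from response lines of the form 'A[i]: ...'."""
    answers = {}
    for line in text.splitlines():
        if line.startswith("A[") and "]:" in line:
            idx, ans = line[2:].split("]:", 1)
            if idx.isdigit():
                answers[int(idx)] = ans.strip()
    return [answers.get(i, "") for i in range(1, n + 1)]

questions = ["What is 2 + 2?", "What is the capital of France?"]
prompt = build_batch_prompt(questions)
# response = call_llm(prompt)                  # hypothetical client: one request for all questions
# print(parse_batch_answers(response, len(questions)))
```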

Is Mamba capable of in-context learning?

R Grazzi, J Siems, S Schrodi, T Brox… - arXiv preprint arXiv …, 2024 - arxiv.org
This work provides empirical evidence that Mamba, a newly proposed selective structured
state space model, has in-context learning (ICL) capabilities similar to those of transformers. We …

Fast nearest neighbor machine translation

Y Meng, X Li, X Zheng, F Wu, X Sun, T Zhang… - arXiv preprint arXiv …, 2021 - arxiv.org
Though nearest neighbor Machine Translation (kNN-MT; Khandelwal et al., 2020) has proved to introduce
significant performance boosts over standard neural MT systems, it …
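
For context, kNN-MT (Khandelwal et al., 2020) augments each decoding step by retrieving the k datastore entries nearest to the decoder state and interpolating the resulting token distribution with the base model's. The sketch below shows that vanilla step, which this paper then accelerates; the datastore size, k, temperature, and mixing weight are illustrative.

```python
import numpy as np

def knn_mt_step(p_mt, hidden, keys, values, vocab_size, k=4, temp=10.0, lam=0.5):
    """p_mt: (V,) model distribution; hidden: (d,) decoder state;
    keys: (N, d) datastore representations; values: (N,) their target token ids."""
    d2 = np.sum((keys - hidden) ** 2, axis=1)     # squared L2 distance to every datastore key
    nn = np.argsort(d2)[:k]                       # indices of the k nearest neighbors
    w = np.exp(-d2[nn] / temp)
    w /= w.sum()
    p_knn = np.zeros(vocab_size)
    np.add.at(p_knn, values[nn], w)               # aggregate neighbor weight per target token
    return lam * p_knn + (1.0 - lam) * p_mt       # interpolate retrieval and model distributions

rng = np.random.default_rng(0)
V, d, N = 100, 8, 1000
p_mt = np.full(V, 1.0 / V)
hidden = rng.standard_normal(d)
keys = rng.standard_normal((N, d))
values = rng.integers(0, V, size=N)
p = knn_mt_step(p_mt, hidden, keys, values, V)
print(round(p.sum(), 6))                          # 1.0: still a valid distribution
```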

How much does attention actually attend? Questioning the importance of attention in pretrained transformers

M Hassid, H Peng, D Rotem, J Kasai, I Montero… - arXiv preprint arXiv …, 2022 - arxiv.org
The attention mechanism is considered the backbone of the widely-used Transformer
architecture. It contextualizes the input by computing input-specific attention matrices. We …

Bidimensional leaderboards: Generate and evaluate language hand in hand

J Kasai, K Sakaguchi, RL Bras, L Dunagan… - arXiv preprint arXiv …, 2021 - arxiv.org
Natural language processing researchers have identified limitations of evaluation
methodology for generation tasks, with new questions raised about the validity of automatic …

Simple linear attention language models balance the recall-throughput tradeoff

S Arora, S Eyuboglu, M Zhang, A Timalsina… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent work has shown that attention-based language models excel at recall, the ability to
ground generations in tokens previously seen in context. However, the efficiency of attention …