Transformers learn shortcuts to automata
Algorithmic reasoning requires capabilities which are most naturally understood through
recurrent models of computation, like the Turing machine. However, Transformer models …
SOFT: Softmax-free transformer with linear complexity
Vision transformers (ViTs) have pushed the state-of-the-art for various visual recognition
tasks by patch-wise image tokenization followed by self-attention. However, the employment …
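The patch-wise image tokenization mentioned here amounts to cutting the image into fixed-size patches and flattening each into a token vector before self-attention; a minimal sketch with illustrative shapes (not SOFT's actual configuration):

```python
import numpy as np

def patchify(image, patch_size=16):
    """Split an (H, W, C) image into flattened patch tokens.

    Returns an array of shape (num_patches, patch_size * patch_size * C),
    i.e. one token vector per patch, ready for a linear projection and
    self-attention. Shapes are illustrative, not the paper's configuration.
    """
    H, W, C = image.shape
    assert H % patch_size == 0 and W % patch_size == 0
    patches = image.reshape(H // patch_size, patch_size,
                            W // patch_size, patch_size, C)
    patches = patches.transpose(0, 2, 1, 3, 4)      # (nH, nW, p, p, C)
    return patches.reshape(-1, patch_size * patch_size * C)

tokens = patchify(np.random.rand(224, 224, 3))      # shape (196, 768)
```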
Gated linear attention transformers with hardware-efficient training
Transformers with linear attention allow for efficient parallel training but can simultaneously
be formulated as an RNN with 2D (matrix-valued) hidden states, thus enjoying linear (with …
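The recurrent formulation mentioned in this snippet can be written out directly: a running sum of key-value outer products serves as the 2D (matrix-valued) hidden state, updated once per token. A minimal, unnormalized and ungated sketch (the paper's gated variant additionally applies a learned decay to the state):

```python
import numpy as np

def linear_attention_recurrent(Q, K, V):
    """Causal linear attention computed as an RNN.

    S is the 2D (d_k x d_v) hidden state: a running sum of outer products
    k_t v_t^T. Each output is o_t = S_t^T q_t, so per-step cost is O(d_k*d_v)
    and total cost is linear in sequence length. Normalization and the
    learned gates of gated linear attention are omitted for brevity.
    """
    T, d_k = Q.shape
    d_v = V.shape[1]
    S = np.zeros((d_k, d_v))
    outputs = np.empty((T, d_v))
    for t in range(T):
        S = S + np.outer(K[t], V[t])   # state update (a gate would decay S here)
        outputs[t] = S.T @ Q[t]        # read-out for token t
    return outputs

T, d = 8, 4
Q, K, V = (np.random.rand(T, d) for _ in range(3))
out = linear_attention_recurrent(Q, K, V)
```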
Sequencer: Deep LSTM for image classification
Y Tatsunami, M Taki - Advances in Neural Information …, 2022 - proceedings.neurips.cc
In recent computer vision research, the advent of the Vision Transformer (ViT) has rapidly
revolutionized various architectural design efforts: ViT achieved state-of-the-art image …
Batch prompting: Efficient inference with large language model APIs
Performing inference on large volumes of samples with large language models (LLMs) can
be computationally and financially costly in industry and real-world use. We propose batch …
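The technique named in this title, packing several samples into one prompt so a single API call serves them all, can be sketched as follows; the prompt template, answer format, and the call_llm client are assumptions for illustration, not the paper's exact setup:

```python
def build_batch_prompt(questions):
    """Pack several questions into one prompt so that a single LLM API call
    answers all of them. The template is illustrative, not the paper's."""
    lines = ["Answer each question below. Format each answer as 'A[i]: <answer>'."]
    for i, q in enumerate(questions, start=1):
        lines.append(f"Q[{i}]: {q}")
    return "\n".join(lines)

def parse_batch_answers(response, n):
    """Split one batched response back into per-question answers,
    assuming the model followed the 'A[i]:' convention requested above."""
    answers = {}
    for i in range(1, n + 1):
        marker, nxt = f"A[{i}]:", f"A[{i + 1}]:"
        start = response.find(marker)
        if start == -1:
            continue
        end = response.find(nxt)
        answers[i] = response[start + len(marker): None if end == -1 else end].strip()
    return answers

# Hypothetical usage with some call_llm() client (not a real API here):
# response = call_llm(build_batch_prompt(["2+2?", "Capital of France?"]))
# print(parse_batch_answers(response, 2))
```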
Is Mamba capable of in-context learning?
This work provides empirical evidence that Mamba, a newly proposed selective structured
state space model, has similar in-context learning (ICL) capabilities as transformers. We …
Fast nearest neighbor machine translation
Though nearest neighbor Machine Translation (kNN-MT) (Khandelwal et al., 2020) has proved
to introduce significant performance boosts over standard neural MT systems, it …
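As background for this entry, kNN-MT (Khandelwal et al., 2020) retrieves nearest neighbors of the decoder's hidden state from a datastore of (hidden state, target token) pairs and interpolates their distribution with the base model's. A rough sketch using a brute-force search, the linear scan whose cost fast variants aim to cut; shapes, the temperature, and the interpolation weight are illustrative:

```python
import numpy as np

def knn_mt_step(query, keys, values, p_mt, vocab_size, k=8, lam=0.5, temp=10.0):
    """One decoding step of brute-force kNN-MT.

    query:  decoder hidden state at this step, shape (d,)
    keys:   datastore hidden states, shape (N, d)
    values: datastore target-token ids, shape (N,)
    p_mt:   base MT model's next-token distribution, shape (vocab_size,)
    Returns the interpolation lam * p_knn + (1 - lam) * p_mt. The full linear
    scan over the datastore is shown for clarity; it is the bottleneck that
    fast kNN-MT variants try to avoid.
    """
    dists = np.sum((keys - query) ** 2, axis=1)      # squared L2 distances
    nn = np.argsort(dists)[:k]                       # k nearest entries
    weights = np.exp(-dists[nn] / temp)
    weights /= weights.sum()
    p_knn = np.zeros(vocab_size)
    np.add.at(p_knn, values[nn], weights)            # aggregate weights per token id
    return lam * p_knn + (1 - lam) * p_mt
```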
How much does attention actually attend? Questioning the importance of attention in pretrained transformers
The attention mechanism is considered the backbone of the widely-used Transformer
architecture. It contextualizes the input by computing input-specific attention matrices. We …
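The "input-specific attention matrices" referred to here are just the softmax-normalized query-key scores recomputed for each input; a minimal single-head sketch for reference:

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Single-head scaled dot-product attention.

    The attention matrix A depends on the input itself (via Q and K),
    which is the input-specific contextualization the abstract refers to.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # (T, T) similarity scores
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)                 # row-wise softmax
    return A @ V                                       # contextualized outputs
```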
Bidimensional leaderboards: Generate and evaluate language hand in hand
Natural language processing researchers have identified limitations of evaluation
methodology for generation tasks, with new questions raised about the validity of automatic …
Simple linear attention language models balance the recall-throughput tradeoff
Recent work has shown that attention-based language models excel at recall, the ability to
ground generations in tokens previously seen in context. However, the efficiency of attention …