In-context language learning: Architectures and algorithms

E Akyürek, B Wang, Y Kim, J Andreas - arXiv preprint arXiv:2401.12973, 2024 - arxiv.org
Large-scale neural language models exhibit a remarkable capacity for in-context learning
(ICL): they can infer novel functions from datasets provided as input. Most of our current …

Can Mamba learn how to learn? A comparative study on in-context learning tasks

J Park, J Park, Z Xiong, N Lee, J Cho, S Oymak… - arXiv preprint arXiv …, 2024 - arxiv.org
State-space models (SSMs), such as Mamba (Gu & Dao, 2023), have been proposed as
alternatives to Transformer networks in language modeling, by incorporating gating …

Data engineering for scaling language models to 128k context

Y Fu, R Panda, X Niu, X Yue, H Hajishirzi, Y Kim… - arXiv preprint arXiv …, 2024 - arxiv.org
We study the continual pretraining recipe for scaling language models' context lengths to
128K, with a focus on data engineering. We hypothesize that long context modeling, in …

Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality

T Dao, A Gu - arXiv preprint arXiv:2405.21060, 2024 - arxiv.org
While Transformers have been the main architecture behind deep learning's success in
language modeling, state-space models (SSMs) such as Mamba have recently been shown …

Is Mamba capable of in-context learning?

R Grazzi, J Siems, S Schrodi, T Brox… - arXiv preprint arXiv …, 2024 - arxiv.org
This work provides empirical evidence that Mamba, a newly proposed selective structured
state space model, has similar in-context learning (ICL) capabilities as transformers. We …

xLSTM: Extended Long Short-Term Memory

M Beck, K Pöppel, M Spanring, A Auer… - arXiv preprint arXiv …, 2024 - arxiv.org
In the 1990s, the constant error carousel and gating were introduced as the central ideas of
the Long Short-Term Memory (LSTM). Since then, LSTMs have stood the test of time and …

DenseMamba: State space models with dense hidden connection for efficient large language models

W He, K Han, Y Tang, C Wang, Y Yang, T Guo… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) face a daunting challenge due to the excessive
computational and memory requirements of the commonly used Transformer architecture …

Learning to (learn at test time): RNNs with expressive hidden states

Y Sun, X Li, K Dalal, J Xu, A Vikram, G Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
Self-attention performs well in long context but has quadratic complexity. Existing RNN
layers have linear complexity, but their performance in long context is limited by the …

Simple linear attention language models balance the recall-throughput tradeoff

S Arora, S Eyuboglu, M Zhang, A Timalsina… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent work has shown that attention-based language models excel at recall, the ability to
ground generations in tokens previously seen in context. However, the efficiency of attention …

State space model for new-generation network alternative to Transformers: A survey

X Wang, S Wang, Y Ding, Y Li, W Wu, Y Rong… - arXiv preprint arXiv …, 2024 - arxiv.org
In the post-deep learning era, the Transformer architecture has demonstrated its powerful
performance across pre-trained big models and various downstream tasks. However, the …