In-context language learning: Architectures and algorithms

E Akyürek, B Wang, Y Kim, J Andreas - arXiv preprint arXiv:2401.12973, 2024 - arxiv.org
Large-scale neural language models exhibit a remarkable capacity for in-context learning
(ICL): they can infer novel functions from datasets provided as input. Most of our current …

Can Mamba learn how to learn? A comparative study on in-context learning tasks

J Park, J Park, Z Xiong, N Lee, J Cho, S Oymak… - arXiv preprint arXiv …, 2024 - arxiv.org
State-space models (SSMs), such as Mamba (Gu & Dao, 2023), have been proposed as
alternatives to Transformer networks in language modeling, by incorporating gating …

Data engineering for scaling language models to 128k context

Y Fu, R Panda, X Niu, X Yue, H Hajishirzi, Y Kim… - arXiv preprint arXiv …, 2024 - arxiv.org
We study the continual pretraining recipe for scaling language models' context lengths to
128K, with a focus on data engineering. We hypothesize that long context modeling, in …

Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality

T Dao, A Gu - arXiv preprint arXiv:2405.21060, 2024 - arxiv.org
While Transformers have been the main architecture behind deep learning's success in
language modeling, state-space models (SSMs) such as Mamba have recently been shown …

Is Mamba capable of in-context learning?

R Grazzi, J Siems, S Schrodi, T Brox… - arXiv preprint arXiv …, 2024 - arxiv.org
This work provides empirical evidence that Mamba, a newly proposed selective structured
state space model, has similar in-context learning (ICL) capabilities as transformers. We …

xLSTM: Extended Long Short-Term Memory

M Beck, K Pöppel, M Spanring, A Auer… - arXiv preprint arXiv …, 2024 - arxiv.org
In the 1990s, the constant error carousel and gating were introduced as the central ideas of
the Long Short-Term Memory (LSTM). Since then, LSTMs have stood the test of time and …

DenseMamba: State space models with dense hidden connection for efficient large language models

W He, K Han, Y Tang, C Wang, Y Yang, T Guo… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) face a daunting challenge due to the excessive
computational and memory requirements of the commonly used Transformer architecture …

Learning to (learn at test time): RNNs with expressive hidden states

Y Sun, X Li, K Dalal, J Xu, A Vikram, G Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
Self-attention performs well in long context but has quadratic complexity. Existing RNN
layers have linear complexity, but their performance in long context is limited by the …

Simple linear attention language models balance the recall-throughput tradeoff

S Arora, S Eyuboglu, M Zhang, A Timalsina… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent work has shown that attention-based language models excel at recall, the ability to
ground generations in tokens previously seen in context. However, the efficiency of attention …

State space model for new-generation network alternative to Transformers: A survey

X Wang, S Wang, Y Ding, Y Li, W Wu, Y Rong… - arXiv preprint arXiv …, 2024 - arxiv.org
In the post-deep learning era, the Transformer architecture has demonstrated its powerful
performance across pre-trained big models and various downstream tasks. However, the …