In-context language learning: Architectures and algorithms
Large-scale neural language models exhibit a remarkable capacity for in-context learning
(ICL): they can infer novel functions from datasets provided as input. Most of our current …
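To make the ICL setting concrete, here is a minimal sketch of such a prompt: the model sees input-output pairs of an unseen function and must infer it with no weight updates. The task and formatting are illustrative, not this paper's benchmark.

```python
# Minimal sketch of an in-context learning (ICL) prompt: the model must
# infer the underlying function (here, string reversal) from the examples
# alone -- no parameters are updated.
examples = [("abc", "cba"), ("hello", "olleh"), ("mamba", "abmam")]
query = "state"

prompt = "\n".join(f"Input: {x}\nOutput: {y}" for x, y in examples)
prompt += f"\nInput: {query}\nOutput:"
print(prompt)
# A model with ICL ability should continue with "etats".
```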
Can Mamba learn how to learn? A comparative study on in-context learning tasks
State-space models (SSMs), such as Mamba (Gu & Dao, 2023), have been proposed as
alternatives to Transformer networks in language modeling, by incorporating gating …
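The gating the snippet mentions can be seen in a minimal selective-SSM recurrence, where the transition and input parameters depend on the current token. The parameterization below is a simplified sketch, not Mamba's actual layer.

```python
import numpy as np

def selective_ssm(x, d_state=4, seed=0):
    """Sketch of a selective (input-dependent) SSM recurrence in the spirit
    of Mamba: per-token parameters let the state gate what it remembers.
    Shapes and parameterization are simplified for illustration."""
    rng = np.random.default_rng(seed)
    d = x.shape[-1]
    W_delta = rng.normal(size=(d,)) * 0.1    # controls per-token step size
    B = rng.normal(size=(d_state, d)) * 0.1
    C = rng.normal(size=(d, d_state)) * 0.1
    A = -np.ones(d_state)                    # stable diagonal transition

    h, ys = np.zeros(d_state), []
    for x_t in x:                            # recurrent, linear-time scan
        delta = np.log1p(np.exp(W_delta @ x_t))        # softplus > 0
        h = np.exp(delta * A) * h + delta * (B @ x_t)  # input-dependent gate
        ys.append(C @ h)
    return np.stack(ys)

y = selective_ssm(np.random.default_rng(1).normal(size=(6, 8)))
print(y.shape)  # (6, 8)
```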
Data engineering for scaling language models to 128k context
We study the continual pretraining recipe for scaling language models' context lengths to
128K, with a focus on data engineering. We hypothesize that long context modeling, in …
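The snippet truncates before the recipe itself. As one common ingredient of such data-engineering recipes, here is a hedged sketch of length-based upsampling; the threshold and boost factor are assumptions, not the paper's values.

```python
import random

def upsample_long_docs(docs, min_len=32_000, boost=4, seed=0):
    """Illustrative sketch (not the paper's exact recipe): oversample
    documents longer than `min_len` tokens by `boost`x so long-range
    dependencies appear often enough during continual pretraining."""
    rng = random.Random(seed)
    weighted = []
    for doc in docs:
        copies = boost if doc["n_tokens"] >= min_len else 1
        weighted.extend([doc] * copies)
    rng.shuffle(weighted)
    return weighted

corpus = [{"id": i, "n_tokens": n}
          for i, n in enumerate([512, 40_000, 2_048, 64_000])]
print([d["id"] for d in upsample_long_docs(corpus)])
```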
Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality
While Transformers have been the main architecture behind deep learning's success in
language modeling, state-space models (SSMs) such as Mamba have recently been shown …
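The duality the title refers to can be seen in miniature: a linear recurrence is exactly multiplication by a lower-triangular (semiseparable) matrix, i.e., an attention-like matrix form. A minimal numpy check, with scalar state and constant decay as simplifications:

```python
import numpy as np

# A linear SSM recurrence  h_t = a * h_{t-1} + b_t,  y_t = h_t  equals
# y = M @ b with M[t, s] = a**(t - s) for s <= t (a semiseparable mask).
T, a = 5, 0.9
b = np.random.default_rng(0).normal(size=T)

# Recurrent (linear-time) computation
h, y_rec = 0.0, []
for t in range(T):
    h = a * h + b[t]
    y_rec.append(h)
y_rec = np.array(y_rec)

# Matrix ("attention-like") computation
M = np.tril(a ** (np.arange(T)[:, None] - np.arange(T)[None, :]))
y_mat = M @ b

assert np.allclose(y_rec, y_mat)
```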
Is Mamba capable of in-context learning?
This work provides empirical evidence that Mamba, a newly proposed selective structured
state space model, has similar in-context learning (ICL) capabilities as transformers. We …
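Comparisons in this literature typically run on synthetic function-learning tasks: each prompt presents a fresh random function as (x, f(x)) pairs and the model must predict f on a query input. A sketch of one such construction; the details are illustrative, not necessarily this paper's exact protocol.

```python
import numpy as np

def make_linear_icl_prompt(n_examples=8, dim=4, seed=0):
    """Sketch of an in-context linear-regression task: a new weight vector
    is sampled per prompt, so the function must be inferred from context."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=dim)                  # function sampled per prompt
    xs = rng.normal(size=(n_examples + 1, dim))
    ys = xs @ w
    context = [(xs[i], ys[i]) for i in range(n_examples)]
    return context, xs[-1], ys[-1]            # in-context pairs, query, target

context, x_query, y_true = make_linear_icl_prompt()
print(len(context), x_query.shape, y_true)
```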
xLSTM: Extended Long Short-Term Memory
In the 1990s, the constant error carousel and gating were introduced as the central ideas of
the Long Short-Term Memory (LSTM). Since then, LSTMs have stood the test of time and …
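Both ideas named in the snippet are visible in a minimal LSTM cell: the additive cell-state update is the constant error carousel, and the sigmoids i, f, o are the gates. The stacked parameter packing below is a common but illustrative choice.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell(x_t, h_prev, c_prev, W, U, b):
    """Minimal classic LSTM cell. The cell state c is updated additively
    (f * c_prev + i * g), letting gradients flow across many steps."""
    z = W @ x_t + U @ h_prev + b             # all four gates in one matmul
    i, f, o, g = np.split(z, 4)
    i, f, o, g = sigmoid(i), sigmoid(f), sigmoid(o), np.tanh(g)
    c = f * c_prev + i * g                   # additive "error carousel" update
    h = o * np.tanh(c)
    return h, c

d, d_in = 3, 2
rng = np.random.default_rng(0)
W = rng.normal(size=(4 * d, d_in))
U = rng.normal(size=(4 * d, d))
h, c = lstm_cell(rng.normal(size=d_in), np.zeros(d), np.zeros(d), W, U, np.zeros(4 * d))
print(h.shape, c.shape)  # (3,) (3,)
```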
DenseMamba: State space models with dense hidden connection for efficient large language models
Large language models (LLMs) face a daunting challenge due to the excessive
computational and memory requirements of the commonly used Transformer architecture …
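A miniature of what "dense hidden connection" could look like across stacked layers, assuming a simple averaged-sum fusion; the fusion rule here is an assumption for illustration, not the paper's mechanism.

```python
import numpy as np

def dense_hidden_stack(x, n_layers=4, seed=0):
    """Sketch: hidden states from shallower layers are collected and fused
    into deeper layers, preserving information that plain stacking can lose.
    The tanh block is a stand-in for an SSM layer; the fusion is assumed."""
    rng = np.random.default_rng(seed)
    d = x.shape[-1]
    h, states = x, []
    for _ in range(n_layers):
        W = rng.normal(size=(d, d)) / np.sqrt(d)
        h = np.tanh(h @ W)                   # stand-in for an SSM block
        if states:                           # dense connection to earlier states
            h = h + sum(states) / len(states)
        states.append(h)
    return h

out = dense_hidden_stack(np.random.default_rng(1).normal(size=(6, 8)))
print(out.shape)  # (6, 8)
```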
Learning to (learn at test time): RNNs with expressive hidden states
Self-attention performs well in long context but has quadratic complexity. Existing RNN
layers have linear complexity, but their performance in long context is limited by the …
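The core idea is an RNN whose hidden state is itself a small model, updated by a gradient step on each incoming token. A minimal sketch; the self-supervised loss and output rule below are simplified assumptions, not the paper's exact design.

```python
import numpy as np

def ttt_linear(xs, lr=0.1):
    """Sketch of test-time training: the hidden state is a linear map W,
    and the state update is one gradient step on a reconstruction loss
    ||W x - x||^2 for the current token."""
    d = xs.shape[-1]
    W = np.zeros((d, d))                     # hidden state = model weights
    outputs = []
    for x in xs:
        grad = 2 * np.outer(W @ x - x, x)    # gradient of ||W x - x||^2
        W = W - lr * grad                    # state update = learning step
        outputs.append(W @ x)                # output = apply updated model
    return np.stack(outputs)

y = ttt_linear(np.random.default_rng(0).normal(size=(5, 4)))
print(y.shape)  # (5, 4)
```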
Simple linear attention language models balance the recall-throughput tradeoff
Recent work has shown that attention-based language models excel at recall, the ability to
ground generations in tokens previously seen in context. However, the efficiency of attention …
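The tradeoff is visible in causal linear attention itself: replacing softmax with a feature map lets attention run as a constant-size running state (high throughput) at some cost in recall precision. A sketch; the feature map is an illustrative choice, not a specific paper's.

```python
import numpy as np

def linear_attention(q, k, v, phi=lambda z: np.maximum(z, 0) + 1e-6):
    """Causal linear attention with a constant-size recurrent state:
    S accumulates phi(k) v^T and z accumulates phi(k) for normalization."""
    d_k, d_v = k.shape[-1], v.shape[-1]
    S = np.zeros((d_k, d_v))                 # running sum of phi(k) v^T
    z = np.zeros(d_k)                        # running normalizer
    ys = []
    for q_t, k_t, v_t in zip(q, k, v):
        S += np.outer(phi(k_t), v_t)
        z += phi(k_t)
        ys.append(phi(q_t) @ S / (phi(q_t) @ z))
    return np.stack(ys)

rng = np.random.default_rng(0)
y = linear_attention(*(rng.normal(size=(7, 4)) for _ in range(3)))
print(y.shape)  # (7, 4)
```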
State space model for new-generation network alternative to transformers: A survey
In the post-deep learning era, the Transformer architecture has demonstrated its powerful
performance across pre-trained big models and various downstream tasks. However, the …