Megabyte: Predicting million-byte sequences with multiscale transformers

L Yu, D Simig, C Flaherty… - Advances in …, 2024 - proceedings.neurips.cc
Autoregressive transformers are spectacular models for short sequences but scale poorly to
long sequences such as high-resolution images, podcasts, code, or books. We proposed …

Transformers learn shortcuts to automata

B Liu, JT Ash, S Goel, A Krishnamurthy… - arXiv preprint arXiv …, 2022 - arxiv.org
Algorithmic reasoning requires capabilities which are most naturally understood through
recurrent models of computation, like the Turing machine. However, Transformer models …

Looped transformers as programmable computers

A Giannou, S Rajput, J Sohn, K Lee… - International …, 2023 - proceedings.mlr.press
We present a framework for using transformer networks as universal computers by
programming them with specific weights and placing them in a loop. Our input sequence …

Long range language modeling via gated state spaces

H Mehta, A Gupta, A Cutkosky, B Neyshabur - arXiv preprint arXiv …, 2022 - arxiv.org
State space models have been shown to be effective at modeling long-range dependencies,
especially on sequence classification tasks. In this work we focus on autoregressive …

A length-extrapolatable transformer

Y Sun, L Dong, B Patra, S Ma, S Huang… - arXiv preprint arXiv …, 2022 - arxiv.org
Position modeling plays a critical role in Transformers. In this paper, we focus on length
extrapolation, i.e., training on short texts while evaluating longer sequences. We define …

Mamba-360: Survey of state space models as transformer alternative for long sequence modelling: Methods, applications, and challenges

BN Patro, VS Agneeswaran - arXiv preprint arXiv:2404.16112, 2024 - arxiv.org
Sequence modeling is a crucial area across various domains, including Natural Language
Processing (NLP), speech recognition, time series forecasting, music generation, and …

Scaling transformer to 1M tokens and beyond with RMT

A Bulatov, Y Kuratov, Y Kapushev… - arXiv preprint arXiv …, 2023 - arxiv.org
A major limitation for the broader scope of problems solvable by transformers is the
quadratic scaling of computational complexity with input size. In this study, we investigate …

The What, Why, and How of Context Length Extension Techniques in Large Language Models--A Detailed Survey

S Pawar, SM Tonmoy, SM Zaman, V Jain… - arXiv preprint arXiv …, 2024 - arxiv.org
The advent of Large Language Models (LLMs) represents a notable breakthrough in Natural
Language Processing (NLP), contributing to substantial progress in both text …

Block-state transformers

J Pilault, M Fathi, O Firat, C Pal… - Advances in Neural …, 2024 - proceedings.neurips.cc
State space models (SSMs) have shown impressive results on tasks that require modeling
long-range dependencies and efficiently scale to long sequences owing to their …

Efficient large language models: A survey

Z Wan, X Wang, C Liu, S Alam, Y Zheng… - arXiv preprint arXiv …, 2023 - researchgate.net
Large Language Models (LLMs) have demonstrated remarkable capabilities in
important tasks such as natural language understanding, language generation, and …