Megabyte: Predicting million-byte sequences with multiscale transformers
Autoregressive transformers are spectacular models for short sequences but scale poorly to
long sequences such as high-resolution images, podcasts, code, or books. We propose …
Transformers learn shortcuts to automata
Algorithmic reasoning requires capabilities which are most naturally understood through
recurrent models of computation, like the Turing machine. However, Transformer models …
Looped transformers as programmable computers
We present a framework for using transformer networks as universal computers by
programming them with specific weights and placing them in a loop. Our input sequence …
Long range language modeling via gated state spaces
State space models have been shown to be effective at modeling long range dependencies,
especially on sequence classification tasks. In this work we focus on autoregressive …
A length-extrapolatable transformer
Position modeling plays a critical role in Transformers. In this paper, we focus on length
extrapolation, i.e., training on short texts while evaluating longer sequences. We define …
Mamba-360: Survey of state space models as transformer alternative for long sequence modelling: Methods, applications, and challenges
BN Patro, VS Agneeswaran - arXiv preprint arXiv:2404.16112, 2024 - arxiv.org
Sequence modeling is a crucial area across various domains, including Natural Language
Processing (NLP), speech recognition, time series forecasting, music generation, and …
Scaling transformer to 1M tokens and beyond with RMT
A major limitation for the broader scope of problems solvable by transformers is the
quadratic scaling of computational complexity with input size. In this study, we investigate …
The What, Why, and How of Context Length Extension Techniques in Large Language Models--A Detailed Survey
The advent of Large Language Models (LLMs) represents a notable breakthrough in Natural
Language Processing (NLP), contributing to substantial progress in both text …
Block-state transformers
State space models (SSMs) have shown impressive results on tasks that require modeling
long-range dependencies and efficiently scale to long sequences owing to their …
Efficient large language models: A survey
Large Language Models (LLMs) have demonstrated remarkable capabilities in
important tasks such as natural language understanding, language generation, and …