GPT3.int8(): 8-bit matrix multiplication for transformers at scale

T Dettmers, M Lewis, Y Belkada… - Advances in Neural …, 2022 - proceedings.neurips.cc
Large language models have been widely adopted but require significant GPU memory for
inference. We develop a procedure for Int8 matrix multiplication for feed-forward and …

A simple and effective pruning approach for large language models

M Sun, Z Liu, A Bair, JZ Kolter - arXiv preprint arXiv:2306.11695, 2023 - arxiv.org
As their size increases, Large Language Models (LLMs) are natural candidates for network
pruning methods: approaches that drop a subset of network weights while striving to …

The case for 4-bit precision: k-bit inference scaling laws

T Dettmers, L Zettlemoyer - International Conference on …, 2023 - proceedings.mlr.press
Quantization methods reduce the number of bits required to represent each parameter in a
model, trading accuracy for smaller memory footprints and inference latencies. However, the …

The impact of positional encoding on length generalization in transformers

A Kazemnejad, I Padhi… - Advances in …, 2024 - proceedings.neurips.cc
Length generalization, the ability to generalize from small training context sizes to larger
ones, is a critical challenge in the development of Transformer-based language models …

Outlier suppression: Pushing the limit of low-bit transformer language models

X Wei, Y Zhang, X Zhang, R Gong… - Advances in …, 2022 - proceedings.neurips.cc
The Transformer architecture has become a fundamental element of widespread natural
language processing (NLP) models. With the trend toward large NLP models, the increasing …

All bark and no bite: Rogue dimensions in transformer language models obscure representational quality

W Timkey, M Van Schijndel - arXiv preprint arXiv:2109.04404, 2021 - arxiv.org
Similarity measures are a vital tool for understanding how language models represent and
process language. Standard representational similarity measures such as cosine similarity …

Intriguing properties of quantization at scale

A Ahmadian, S Dash, H Chen… - Advances in …, 2023 - proceedings.neurips.cc
"Emergent properties" has been widely adopted as a term to describe behavior not present
in smaller models but observed in larger models (Wei et al., 2022a). Recent work suggests …

Feature-learning networks are consistent across widths at realistic scales

N Vyas, A Atanasov, B Bordelon… - Advances in …, 2024 - proceedings.neurips.cc
We study the effect of width on the dynamics of feature-learning neural networks across a
variety of architectures and datasets. Early in training, wide neural networks trained on …

BERT busters: Outlier dimensions that disrupt transformers

O Kovaleva, S Kulshreshtha, A Rogers… - arXiv preprint arXiv …, 2021 - arxiv.org
Multiple studies have shown that Transformers are remarkably robust to pruning. Contrary to
this received wisdom, we demonstrate that pre-trained Transformer encoders are …

Measuring the mixing of contextual information in the transformer

J Ferrando, GI Gállego, MR Costa-Jussà - arXiv preprint arXiv:2203.04212, 2022 - arxiv.org
The Transformer architecture aggregates input information through the self-attention
mechanism, but there is no clear understanding of how this information is mixed across the …