GPT3.int8(): 8-bit matrix multiplication for transformers at scale
Large language models have been widely adopted but require significant GPU memory for
inference. We develop a procedure for Int8 matrix multiplication for feed-forward and …
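For context only, a minimal sketch of plain row-wise absmax Int8 quantization and matrix multiplication in NumPy. The helper names are made up here, and the paper's actual procedure additionally keeps outlier feature dimensions in higher precision, which this sketch omits:

```python
import numpy as np

def absmax_quantize_int8(x):
    """Row-wise absmax quantization to int8: scale each row into [-127, 127]."""
    scale = 127.0 / np.maximum(np.abs(x).max(axis=1, keepdims=True), 1e-8)
    return np.round(x * scale).astype(np.int8), scale

def int8_matmul(a, b):
    """Quantize both operands, multiply with int32 accumulation, then dequantize."""
    qa, sa = absmax_quantize_int8(a)                    # (m, k), per-row scales
    qb, sb = absmax_quantize_int8(b.T)                  # quantize b per output column
    acc = qa.astype(np.int32) @ qb.astype(np.int32).T   # integer matmul
    return acc / (sa * sb.T)                            # undo both scales

a = np.random.randn(4, 64).astype(np.float32)
b = np.random.randn(64, 8).astype(np.float32)
print(np.abs(int8_matmul(a, b) - a @ b).max())          # small quantization error
```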
A simple and effective pruning approach for large language models
As their size increases, Large Language Models (LLMs) are natural candidates for network
pruning methods: approaches that drop a subset of network weights while striving to …
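As a baseline illustration only, a plain magnitude-pruning sketch in NumPy (the paper itself scores weights using input-activation statistics rather than weight magnitude alone; the function name here is hypothetical):

```python
import numpy as np

def prune_by_magnitude(w, sparsity=0.5):
    """Zero out the smallest-magnitude fraction of weights (plain magnitude pruning)."""
    k = int(w.size * sparsity)
    if k == 0:
        return w.copy()
    threshold = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    return np.where(np.abs(w) <= threshold, 0.0, w)

w = np.random.randn(128, 128).astype(np.float32)
w_pruned = prune_by_magnitude(w, sparsity=0.5)
print((w_pruned == 0).mean())  # roughly 0.5
```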
The case for 4-bit precision: k-bit inference scaling laws
T Dettmers, L Zettlemoyer - International Conference on …, 2023 - proceedings.mlr.press
Quantization methods reduce the number of bits required to represent each parameter in a
model, trading accuracy for smaller memory footprints and inference latencies. However, the …
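A rough sketch of the underlying bits-versus-error trade-off, assuming a plain uniform absmax quantizer (the paper evaluates a range of k-bit data types and schemes, none of which are reproduced here; the function name is hypothetical):

```python
import numpy as np

def quantize_k_bit(x, bits=4):
    """Uniform absmax quantization to a signed k-bit grid, returned dequantized."""
    levels = 2 ** (bits - 1) - 1                       # e.g. 7 levels each side for 4-bit
    scale = levels / np.maximum(np.abs(x).max(), 1e-8)
    return np.round(x * scale) / scale

x = np.random.randn(1024).astype(np.float32)
for bits in (8, 4, 2):
    err = np.abs(quantize_k_bit(x, bits) - x).mean()
    print(bits, "bits -> mean abs error", err)
```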
The impact of positional encoding on length generalization in transformers
A Kazemnejad, I Padhi… - Advances in …, 2024 - proceedings.neurips.cc
Length generalization, the ability to generalize from small training context sizes to larger
ones, is a critical challenge in the development of Transformer-based language models …
Outlier suppression: Pushing the limit of low-bit transformer language models
Transformer architecture has become the fundamental element of the widespread natural
language processing (NLP) models. With the trends of large NLP models, the increasing …
All bark and no bite: Rogue dimensions in transformer language models obscure representational quality
W Timkey, M Van Schijndel - arXiv preprint arXiv:2109.04404, 2021 - arxiv.org
Similarity measures are a vital tool for understanding how language models represent and
process language. Standard representational similarity measures such as cosine similarity …
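A toy example of the effect this entry describes: a single shared high-magnitude ("rogue") dimension dominates the cosine similarity of two otherwise random vectors (illustrative sketch only, not the paper's analysis):

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)
u, v = rng.standard_normal(768), rng.standard_normal(768)
print(cosine(u, v))            # near 0 for independent random vectors

# Inject one shared high-magnitude ("rogue") dimension into both vectors.
u[42] += 100.0
v[42] += 100.0
print(cosine(u, v))            # now close to 1, driven by a single dimension
```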
Intriguing properties of quantization at scale
Emergent properties have been widely adopted as a term to describe behavior not present
in smaller models but observed in larger models (Wei et al., 2022a). Recent work suggests …
Feature-learning networks are consistent across widths at realistic scales
We study the effect of width on the dynamics of feature-learning neural networks across a
variety of architectures and datasets. Early in training, wide neural networks trained on …
BERT busters: Outlier dimensions that disrupt transformers
Multiple studies have shown that Transformers are remarkably robust to pruning. Contrary to
this received wisdom, we demonstrate that pre-trained Transformer encoders are …
Measuring the mixing of contextual information in the transformer
The Transformer architecture aggregates input information through the self-attention
mechanism, but there is no clear understanding of how this information is mixed across the …
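A minimal sketch of attention weights viewed as a token-mixing matrix (single head, NumPy, hypothetical helper name); the paper proposes a more refined measure of mixing than the raw attention matrix shown here:

```python
import numpy as np

def attention_mixing(x, wq, wk):
    """Softmax attention weights as a row-stochastic token-mixing matrix."""
    scores = (x @ wq) @ (x @ wk).T / np.sqrt(wq.shape[1])
    scores -= scores.max(axis=1, keepdims=True)        # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)      # each row sums to 1
    return weights                                     # weights[i, j]: how much token j feeds token i

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 16))                       # 5 tokens, hidden size 16
wq, wk = rng.standard_normal((16, 16)), rng.standard_normal((16, 16))
print(attention_mixing(x, wq, wk).sum(axis=1))         # all ones
```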