GPT3.int8(): 8-bit matrix multiplication for transformers at scale
Large language models have been widely adopted but require significant GPU memory for
inference. We develop a procedure for Int8 matrix multiplication for feed-forward and …
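For context only, a minimal sketch of plain row-wise absmax Int8 quantization and matrix multiplication in NumPy. The helper names are made up here, and the paper's actual procedure additionally keeps outlier feature dimensions in higher precision, which this sketch omits:

```python
import numpy as np

def absmax_quantize_int8(x):
    """Row-wise absmax quantization to int8: scale each row into [-127, 127]."""
    scale = 127.0 / np.maximum(np.abs(x).max(axis=1, keepdims=True), 1e-8)
    return np.round(x * scale).astype(np.int8), scale

def int8_matmul(a, b):
    """Quantize both operands, multiply with int32 accumulation, then dequantize."""
    qa, sa = absmax_quantize_int8(a)                    # (m, k), per-row scales
    qb, sb = absmax_quantize_int8(b.T)                  # quantize b per output column
    acc = qa.astype(np.int32) @ qb.astype(np.int32).T   # integer matmul
    return acc / (sa * sb.T)                            # undo both scales

a = np.random.randn(4, 64).astype(np.float32)
b = np.random.randn(64, 8).astype(np.float32)
print(np.abs(int8_matmul(a, b) - a @ b).max())          # small quantization error
```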
A simple and effective pruning approach for large language models
As their size increases, Large Language Models (LLMs) are natural candidates for network
pruning methods: approaches that drop a subset of network weights while striving to …
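As a baseline illustration only, a plain magnitude-pruning sketch in NumPy (the paper itself scores weights using input-activation statistics rather than weight magnitude alone; the function name here is hypothetical):

```python
import numpy as np

def prune_by_magnitude(w, sparsity=0.5):
    """Zero out the smallest-magnitude fraction of weights (plain magnitude pruning)."""
    k = int(w.size * sparsity)
    if k == 0:
        return w.copy()
    threshold = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    return np.where(np.abs(w) <= threshold, 0.0, w)

w = np.random.randn(128, 128).astype(np.float32)
w_pruned = prune_by_magnitude(w, sparsity=0.5)
print((w_pruned == 0).mean())  # roughly 0.5
```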
The case for 4-bit precision: k-bit inference scaling laws
T Dettmers, L Zettlemoyer - International Conference on …, 2023 - proceedings.mlr.press
Quantization methods reduce the number of bits required to represent each parameter in a
model, trading accuracy for smaller memory footprints and inference latencies. However, the …
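A rough sketch of the underlying bits-versus-error trade-off, assuming a plain uniform absmax quantizer (the paper evaluates a range of k-bit data types and schemes, none of which are reproduced here; the function name is hypothetical):

```python
import numpy as np

def quantize_k_bit(x, bits=4):
    """Uniform absmax quantization to a signed k-bit grid, returned dequantized."""
    levels = 2 ** (bits - 1) - 1                       # e.g. 7 levels each side for 4-bit
    scale = levels / np.maximum(np.abs(x).max(), 1e-8)
    return np.round(x * scale) / scale

x = np.random.randn(1024).astype(np.float32)
for bits in (8, 4, 2):
    err = np.abs(quantize_k_bit(x, bits) - x).mean()
    print(bits, "bits -> mean abs error", err)
```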
The impact of positional encoding on length generalization in transformers
A Kazemnejad, I Padhi… - Advances in …, 2024 - proceedings.neurips.cc
Length generalization, the ability to generalize from small training context sizes to larger
ones, is a critical challenge in the development of Transformer-based language models …
Outlier suppression: Pushing the limit of low-bit transformer language models
Transformer architecture has become the fundamental element of the widespread natural
language processing (NLP) models. With the trends of large NLP models, the increasing …
All bark and no bite: Rogue dimensions in transformer language models obscure representational quality
W Timkey, M Van Schijndel - arXiv preprint arXiv:2109.04404, 2021 - arxiv.org
Similarity measures are a vital tool for understanding how language models represent and
process language. Standard representational similarity measures such as cosine similarity …
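A toy example of the effect this entry describes: a single shared high-magnitude ("rogue") dimension dominates the cosine similarity of two otherwise random vectors (illustrative sketch only, not the paper's analysis):

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)
u, v = rng.standard_normal(768), rng.standard_normal(768)
print(cosine(u, v))            # near 0 for independent random vectors

# Inject one shared high-magnitude ("rogue") dimension into both vectors.
u[42] += 100.0
v[42] += 100.0
print(cosine(u, v))            # now close to 1, driven by a single dimension
```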
Intriguing properties of quantization at scale
Emergent properties have been widely adopted as a term to describe behavior not present
in smaller models but observed in larger models (Wei et al., 2022a). Recent work suggests …
Feature-learning networks are consistent across widths at realistic scales
We study the effect of width on the dynamics of feature-learning neural networks across a
variety of architectures and datasets. Early in training, wide neural networks trained on …
BERT busters: Outlier dimensions that disrupt transformers
Multiple studies have shown that Transformers are remarkably robust to pruning. Contrary to
this received wisdom, we demonstrate that pre-trained Transformer encoders are …
Measuring the mixing of contextual information in the transformer
The Transformer architecture aggregates input information through the self-attention
mechanism, but there is no clear understanding of how this information is mixed across the …
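A minimal sketch of attention weights viewed as a token-mixing matrix (single head, NumPy, hypothetical helper name); the paper proposes a more refined measure of mixing than the raw attention matrix shown here:

```python
import numpy as np

def attention_mixing(x, wq, wk):
    """Softmax attention weights as a row-stochastic token-mixing matrix."""
    scores = (x @ wq) @ (x @ wk).T / np.sqrt(wq.shape[1])
    scores -= scores.max(axis=1, keepdims=True)        # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)      # each row sums to 1
    return weights                                     # weights[i, j]: how much token j feeds token i

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 16))                       # 5 tokens, hidden size 16
wq, wk = rng.standard_normal((16, 16)), rng.standard_normal((16, 16))
print(attention_mixing(x, wq, wk).sum(axis=1))         # all ones
```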