AMMUS: A survey of transformer-based pretrained models in natural language processing

KS Kalyan, A Rajasekharan, S Sangeetha - arXiv preprint arXiv …, 2021 - arxiv.org
Transformer-based pretrained language models (T-PTLMs) have achieved great success in
almost every NLP task. The evolution of these models started with GPT and BERT. These …

Pre-trained language models for text generation: A survey

J Li, T Tang, WX Zhao, JY Nie, JR Wen - ACM Computing Surveys, 2024 - dl.acm.org
Text Generation aims to produce plausible and readable text in human language from input
data. The resurgence of deep learning has greatly advanced this field, in particular, with the …

ZeroQuant: Efficient and affordable post-training quantization for large-scale transformers

Z Yao, R Yazdani Aminabadi… - Advances in …, 2022 - proceedings.neurips.cc
How to efficiently serve ever-larger trained natural language models in practice has become
exceptionally challenging even for powerful cloud servers due to their prohibitive …
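
For context, a minimal sketch of symmetric per-tensor INT8 post-training quantization in NumPy; this is generic PTQ arithmetic, not the ZeroQuant method itself (which adds group-wise weight and token-wise activation quantization):

import numpy as np

def quantize_int8(w):
    # Symmetric per-tensor scheme: one scale maps the largest magnitude to 127.
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Recover an approximation of the original float weights.
    return q.astype(np.float32) * scale

w = np.random.randn(4, 1024).astype(np.float32)
q, s = quantize_int8(w)
print("max abs rounding error:", np.abs(w - dequantize(q, s)).max())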

LLM-QAT: Data-free quantization aware training for large language models

Z Liu, B Oguz, C Zhao, E Chang, P Stock… - arXiv preprint arXiv …, 2023 - arxiv.org
Several post-training quantization methods have been applied to large language models
(LLMs), and have been shown to perform well down to 8-bits. We find that these methods …
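
As a companion sketch, generic quantization-aware training with a straight-through estimator in PyTorch; illustrative only, not the data-free distillation pipeline of LLM-QAT:

import torch

class FakeQuant(torch.autograd.Function):
    @staticmethod
    def forward(ctx, w, num_bits=8):
        # Quantize-dequantize in the forward pass so the loss sees the rounding error.
        qmax = 2 ** (num_bits - 1) - 1
        scale = w.abs().max() / qmax
        return torch.clamp(torch.round(w / scale), -qmax, qmax) * scale

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through estimator: treat rounding as identity so gradients
        # reach the underlying full-precision weights.
        return grad_output, None

w = torch.randn(16, 16, requires_grad=True)
loss = (FakeQuant.apply(w) ** 2).sum()
loss.backward()
print(w.grad.abs().mean())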

Post-training quantization for vision transformer

Z Liu, Y Wang, K Han, W Zhang… - Advances in Neural …, 2021 - proceedings.neurips.cc
Recently, transformers have achieved remarkable performance on a variety of computer vision
applications. Compared with mainstream convolutional neural networks, vision transformers …

A primer in BERTology: What we know about how BERT works

A Rogers, O Kovaleva, A Rumshisky - Transactions of the Association …, 2021 - direct.mit.edu
Transformer-based models have pushed the state of the art in many areas of NLP, but our
understanding of what is behind their success is still limited. This paper is the first survey of …

I-BERT: Integer-only BERT quantization

S Kim, A Gholami, Z Yao… - … on machine learning, 2021 - proceedings.mlr.press
Transformer-based models, like BERT and RoBERTa, have achieved state-of-the-art results
in many Natural Language Processing tasks. However, their memory footprint, inference …
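
To illustrate the integer-only idea, a sketch of an INT8 linear layer with INT32 accumulation; the rescaling below uses a floating-point multiplier for brevity, whereas a fully integer-only pipeline such as I-BERT replaces it, along with GELU, Softmax, and LayerNorm, with integer approximations:

import numpy as np

def int8_linear(x_q, w_q, s_x, s_w, s_y):
    # INT8 activations and weights, exact INT32 accumulation for the matmul.
    acc = x_q.astype(np.int32) @ w_q.astype(np.int32).T
    # Requantize the accumulator onto the output grid; integer-only kernels
    # would use a fixed-point (dyadic) multiplier here instead of a float.
    y = np.round(acc * (s_x * s_w / s_y))
    return np.clip(y, -127, 127).astype(np.int8)

x_q = np.random.randint(-127, 128, size=(2, 64), dtype=np.int8)
w_q = np.random.randint(-127, 128, size=(32, 64), dtype=np.int8)
print(int8_linear(x_q, w_q, s_x=0.02, s_w=0.01, s_y=1.0).shape)  # (2, 32)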

SqueezeLLM: Dense-and-sparse quantization

S Kim, C Hooper, A Gholami, Z Dong, X Li… - arXiv preprint arXiv …, 2023 - arxiv.org
Generative Large Language Models (LLMs) have demonstrated remarkable results for a
wide range of tasks. However, deploying these models for inference has been a significant …
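
As a rough illustration of the dense-and-sparse idea, the sketch below keeps a small fraction of outlier weights in full precision and uniformly quantizes the rest to 4 bits; SqueezeLLM additionally uses sensitivity-based non-uniform quantization, which is omitted here:

import numpy as np

def dense_sparse_split(w, outlier_frac=0.005, num_bits=4):
    # Largest-magnitude weights are kept exactly, as a sparse matrix.
    thresh = np.quantile(np.abs(w), 1.0 - outlier_frac)
    outliers = np.where(np.abs(w) > thresh, w, 0.0)
    dense = w - outliers
    # The remaining dense part is quantized uniformly to num_bits.
    qmax = 2 ** (num_bits - 1) - 1
    scale = np.abs(dense).max() / qmax
    dense_q = np.clip(np.round(dense / scale), -qmax, qmax)
    return dense_q, scale, outliers

w = np.random.randn(512, 512).astype(np.float32)
dense_q, scale, outliers = dense_sparse_split(w)
recon = dense_q * scale + outliers
print("relative error:", np.linalg.norm(w - recon) / np.linalg.norm(w))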

A fast post-training pruning framework for transformers

W Kwon, S Kim, MW Mahoney… - Advances in …, 2022 - proceedings.neurips.cc
Pruning is an effective way to reduce the huge inference cost of Transformer models.
However, prior work on pruning Transformers requires retraining the models. This can add …
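
For orientation, a sketch of structured, retraining-free pruning of feed-forward neurons; a plain L2-magnitude score stands in here for the Fisher-information-based mask search used in the cited framework:

import numpy as np

def prune_ffn_neurons(w_in, w_out, keep_ratio=0.5):
    # w_in: (hidden, d_model) projects into the FFN hidden layer;
    # w_out: (d_model, hidden) projects back out of it.
    scores = np.linalg.norm(w_in, axis=1)    # per-neuron importance proxy
    k = int(keep_ratio * len(scores))
    keep = np.sort(np.argsort(scores)[-k:])  # indices of neurons to retain
    return w_in[keep], w_out[:, keep]

w_in, w_out = np.random.randn(3072, 768), np.random.randn(768, 3072)
w_in_p, w_out_p = prune_ffn_neurons(w_in, w_out)
print(w_in_p.shape, w_out_p.shape)  # (1536, 768) (768, 1536)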

Efficient methods for natural language processing: A survey

M Treviso, JU Lee, T Ji, B Aken, Q Cao… - Transactions of the …, 2023 - direct.mit.edu
Recent work in natural language processing (NLP) has yielded appealing results from
scaling model parameters and training data; however, using only scale to improve …