A survey of large language models

WX Zhao, K Zhou, J Li, T Tang, X Wang, Y Hou… - arXiv preprint arXiv …, 2023 - arxiv.org
Language is essentially a complex, intricate system of human expressions governed by
grammatical rules. It poses a significant challenge to develop capable AI algorithms for …

A simple and effective pruning approach for large language models

M Sun, Z Liu, A Bair, JZ Kolter - arXiv preprint arXiv:2306.11695, 2023 - arxiv.org
As their size increases, Large Language Models (LLMs) are natural candidates for network
pruning methods: approaches that drop a subset of network weights while striving to …
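
To make the pruning idea concrete, below is a minimal NumPy sketch of a weight-magnitude-times-activation-norm score of the kind this paper proposes; the exact criterion, the calibration setup, and the per-row comparison groups here are illustrative assumptions, not the paper's reference implementation.

```python
import numpy as np

def prune_layer(W, X, sparsity=0.5):
    """Prune a linear layer's weights W (out x in) using calibration
    activations X (tokens x in). Score each weight by |W_ij| times the
    L2 norm of its input feature, then drop the lowest-scoring weights
    within each output row (an assumed grouping, for illustration)."""
    act_norm = np.linalg.norm(X, axis=0)       # (in,) norm per input feature
    score = np.abs(W) * act_norm               # (out, in) importance scores
    k = int(W.shape[1] * sparsity)             # weights to drop per row
    prune_idx = np.argsort(score, axis=1)[:, :k]
    mask = np.ones_like(W, dtype=bool)
    np.put_along_axis(mask, prune_idx, False, axis=1)
    return W * mask

# usage: a toy layer with 4 outputs, 8 inputs, 16 calibration tokens
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))
X = rng.normal(size=(16, 8))
W_sparse = prune_layer(W, X, sparsity=0.5)     # half the weights per row zeroed
```
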

H2O: Heavy-hitter oracle for efficient generative inference of large language models

Z Zhang, Y Sheng, T Zhou, T Chen… - Advances in …, 2024 - proceedings.neurips.cc
Large Language Models (LLMs), despite their recent impressive accomplishments,
are notably cost-prohibitive to deploy, particularly for applications involving long-content …
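
The "heavy-hitter" idea in the title can be illustrated with a toy KV-cache eviction policy: keep the newest tokens plus the older tokens with the largest accumulated attention mass. The sketch below, including its function name and budget/recency parameters, is a hypothetical reading of the title, not the paper's algorithm.

```python
import numpy as np

def h2_keep_indices(acc_attn, budget, recent_window):
    """Pick which KV-cache positions to keep under a fixed budget:
    the `recent_window` newest tokens are always kept, and the rest
    of the budget goes to 'heavy hitters', i.e. older tokens with the
    largest accumulated attention scores."""
    n = len(acc_attn)
    if n <= budget:
        return list(range(n))
    recent = list(range(n - recent_window, n))
    n_heavy = max(0, budget - recent_window)
    older = np.argsort(acc_attn[: n - recent_window])[::-1][:n_heavy]
    return sorted(older.tolist() + recent)

# usage: 10 cached tokens, budget of 6, always keep the 3 newest
scores = np.array([5.0, 0.1, 3.2, 0.4, 2.7, 0.2, 1.1, 0.3, 0.6, 0.9])
print(h2_keep_indices(scores, budget=6, recent_window=3))
# keeps positions 7, 8, 9 (recent) plus the highest-scoring older ones: 0, 2, 4
```
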

A survey on model compression for large language models

X Zhu, J Li, Y Liu, C Ma, W Wang - arXiv preprint arXiv:2308.07633, 2023 - arxiv.org
Large Language Models (LLMs) have revolutionized natural language processing tasks with
remarkable success. However, their formidable size and computational demands present …

OmniQuant: Omnidirectionally calibrated quantization for large language models

W Shao, M Chen, Z Zhang, P Xu, L Zhao, Z Li… - arXiv preprint arXiv …, 2023 - arxiv.org
Large language models (LLMs) have revolutionized natural language processing tasks.
However, their practical deployment is hindered by their immense memory and computation …

Memory-efficient fine-tuning of compressed large language models via sub-4-bit integer quantization

J Kim, JH Lee, S Kim, J Park, KM Yoo… - Advances in Neural …, 2024 - proceedings.neurips.cc
Large language models (LLMs) face challenges in fine-tuning and deployment due to
their high memory demands and computational costs. While parameter-efficient fine-tuning …

A survey on transformer compression

Y Tang, Y Wang, J Guo, Z Tu, K Han, H Hu… - arXiv preprint arXiv …, 2024 - arxiv.org
Large models based on the Transformer architecture play increasingly vital roles in artificial
intelligence, particularly within the realms of natural language processing (NLP) and …

Outlier suppression+: Accurate quantization of large language models by equivalent and optimal shifting and scaling

X Wei, Y Zhang, Y Li, X Zhang, R Gong, J Guo… - arXiv preprint arXiv …, 2023 - arxiv.org
Post-training quantization (PTQ) of transformer language models faces significant
challenges due to the existence of detrimental outliers in activations. We observe that these …
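
The "equivalent shifting and scaling" in the title can be shown in a few lines: a per-channel shift z and scale s applied to activations can be folded into the next layer's weights and bias so that the layer's output is unchanged, while the transformed activations are far friendlier to quantize. The particular choice of z and s below (channel mean, post-shift max magnitude) is a placeholder assumption; the paper derives optimal values.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 8)); X[:, 3] += 20.0   # channel 3 is an outlier channel
W = rng.normal(size=(8, 4)); b = rng.normal(size=4)

# Per-channel shift z and scale s. Placeholders here: channel mean and
# post-shift max magnitude; the paper solves for optimal values.
z = X.mean(axis=0)
s = np.abs(X - z).max(axis=0)

X_hat = (X - z) / s                 # easier-to-quantize activations
W_hat = s[:, None] * W              # scale folded into the weights
b_hat = b + z @ W                   # shift folded into the bias

# the transformed layer computes exactly the same output
assert np.allclose(X @ W + b, X_hat @ W_hat + b_hat)
```
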

Quantizable transformers: Removing outliers by helping attention heads do nothing

Y Bondarenko, M Nagel… - Advances in Neural …, 2024 - proceedings.neurips.cc
Transformer models have been widely adopted across many domains in recent years, and
large language models in particular have advanced the field of AI significantly. Due to their …

QA-LoRA: Quantization-aware low-rank adaptation of large language models

Y Xu, L Xie, X Gu, X Chen, H Chang, H Zhang… - arXiv preprint arXiv …, 2023 - arxiv.org
Recent years have witnessed rapid development of large language models (LLMs).
Despite their strong ability on many language-understanding tasks, the heavy computational …