A survey on model compression for large language models

X Zhu, J Li, Y Liu, C Ma, W Wang - Transactions of the Association for …, 2024 - direct.mit.edu
Large Language Models (LLMs) have successfully transformed natural language processing
tasks. Yet, their large size and high computational needs pose challenges for …

EfficientQAT: Efficient quantization-aware training for large language models

M Chen, W Shao, P Xu, J Wang, P Gao… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) are crucial in modern natural language processing and
artificial intelligence. However, they face challenges in managing their significant memory …

Ladder: Enabling Efficient Low-Precision Deep Learning Computing through Hardware-aware Tensor Transformation

L Wang, L Ma, S Cao, Q Zhang, J Xue, Y Shi… - … USENIX Symposium on …, 2024 - usenix.org
The increasing demand for improving deep learning model performance has led to a
paradigm shift in supporting low-precision computation to harness the robustness of deep …

A survey of low-bit large language models: Basics, systems, and algorithms

R Gong, Y Ding, Z Wang, C Lv, X Zheng, J Du… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) have achieved remarkable advancements in natural
language processing, showcasing exceptional performance across various tasks. However …

Model quantization and hardware acceleration for vision transformers: A comprehensive survey

D Du, G Gong, X Chu - arXiv preprint arXiv:2405.00314, 2024 - arxiv.org
Vision Transformers (ViTs) have recently garnered considerable attention, emerging as a
promising alternative to convolutional neural networks (CNNs) in several vision-related …

LLMC: Benchmarking large language model quantization with a versatile compression toolkit

R Gong, Y Yong, S Gu, Y Huang, C Lv… - Proceedings of the …, 2024 - aclanthology.org
Recent advancements in large language models (LLMs) are propelling us toward artificial
general intelligence with their remarkable emergent abilities and reasoning capabilities …

Scalable MatMul-free Language Modeling

RJ Zhu, Y Zhang, E Sifferman, T Sheaves… - arXiv preprint arXiv …, 2024 - arxiv.org
Matrix multiplication (MatMul) typically dominates the overall computational cost of large
language models (LLMs). This cost only grows as LLMs scale to larger embedding …

LPZero: Language model zero-cost proxy search from zero

P Dong, L Li, X Liu, Z Tang, X Liu, Q Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
Despite its outstanding performance, Neural Architecture Search (NAS) is criticized for its
massive computational cost. Recently, Zero-shot NAS has emerged as a promising approach by …

STBLLM: Breaking the 1-Bit Barrier with Structured Binary LLMs

P Dong, L Li, Y Zhong, D Du, R Fan, Y Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
In this paper, we present the first structural binarization method for LLM compression to less
than 1-bit precision. Although LLMs have achieved remarkable performance, their memory …

LUT Tensor Core: Lookup Table Enables Efficient Low-Bit LLM Inference Acceleration

Z Mo, L Wang, J Wei, Z Zeng, S Cao, L Ma… - arXiv preprint arXiv …, 2024 - arxiv.org
As large language model (LLM) inference demands ever-greater resources, there is a rapidly
growing trend of using low-bit weights to shrink memory usage and boost inference …