LLM inference unveiled: Survey and roofline model insights

Z Yuan, Y Shang, Y Zhou, Z Dong, Z Zhou… - arXiv preprint arXiv …, 2024 - arxiv.org
The field of efficient Large Language Model (LLM) inference is rapidly evolving, presenting a
unique blend of opportunities and challenges. Although the field has expanded and is …

Beyond efficiency: A systematic survey of resource-efficient large language models

G Bai, Z Chai, C Ling, S Wang, J Lu, N Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
The burgeoning field of Large Language Models (LLMs), exemplified by sophisticated
models like OpenAI's ChatGPT, represents a significant advancement in artificial …

A survey on efficient inference for large language models

Z Zhou, X Ning, K Hong, T Fu, J Xu, S Li, Y Lou… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) have attracted extensive attention due to their remarkable
performance across various tasks. However, the substantial computational and memory …

Inference Optimizations for Large Language Models: Effects, Challenges, and Practical Considerations

L Donisch, S Schacht, C Lanquillon - arXiv preprint arXiv:2408.03130, 2024 - arxiv.org
Large language models are ubiquitous in natural language processing because they can
adapt to new tasks without retraining. However, their sheer scale and complexity present …

Survey of different large language model architectures: Trends, benchmarks, and challenges

M Shao, A Basit, R Karri, M Shafique - IEEE Access, 2024 - ieeexplore.ieee.org
Large Language Models (LLMs) represent a class of deep learning models adept at
understanding natural language and generating coherent text in response to prompts or …

Model compression and efficient inference for large language models: A survey

W Wang, W Chen, Y Luo, Y Long, Z Lin… - arXiv preprint arXiv …, 2024 - arxiv.org
Transformer-based large language models have achieved tremendous success. However,
the significant memory and computational costs incurred during the inference process make …

Agile-Quant: Activation-Guided Quantization for Faster Inference of LLMs on the Edge

X Shen, P Dong, L Lu, Z Kong, Z Li, M Lin… - Proceedings of the …, 2024 - ojs.aaai.org
Large Language Models (LLMs) stand out for their impressive performance in intricate
language modeling tasks. However, their demanding computational and memory needs …

HotaQ: Hardware Oriented Token Adaptive Quantization for Large Language Models

X Shen, Z Han, L Lu, Z Kong, P Dong… - … on Computer-Aided …, 2024 - ieeexplore.ieee.org
Large Language Models (LLMs) have become popular and are widely used in creative ways
because of their powerful capabilities. However, the substantial model size and complexity …

What Makes Quantization for Large Language Model Hard? An Empirical Study from the Lens of Perturbation

Z Gong, J Liu, J Wang, X Cai, D Zhao… - Proceedings of the AAAI …, 2024 - ojs.aaai.org
Quantization has emerged as a promising technique for improving the memory and
computational efficiency of large language models (LLMs). Though the trade-off between …

Q-Hitter: A Better Token Oracle for Efficient LLM Inference via Sparse-Quantized KV Cache

Z Zhang, S Liu, R Chen, B Kailkhura… - Proceedings of …, 2024 - proceedings.mlsys.org
This paper focuses on addressing the substantial memory footprints and bandwidth costs
associated with the deployment of Large Language Models (LLMs). LLMs, characterized by …