Ladder: Enabling Efficient Low-Precision Deep Learning Computing through Hardware-aware Tensor Transformation

L Wang, L Ma, S Cao, Q Zhang, J Xue, Y Shi… - … USENIX Symposium on …, 2024 - usenix.org
The increasing demand for improving deep learning model performance has led to a
paradigm shift in supporting low-precision computation to harness the robustness of deep …

BitDistiller: Unleashing the potential of sub-4-bit LLMs via self-distillation

D Du, Y Zhang, S Cao, J Guo, T Cao, X Chu… - arXiv preprint arXiv …, 2024 - arxiv.org
The upscaling of Large Language Models (LLMs) has yielded impressive advances in
natural language processing, yet it also poses significant deployment challenges. Weight …

HQ-DiT: Efficient diffusion transformer with FP4 hybrid quantization

W Liu, SQ Zhang - arXiv preprint arXiv:2405.19751, 2024 - arxiv.org
Diffusion Transformers (DiTs) have recently gained substantial attention in both industrial
and academic fields for their superior visual generation capabilities, outperforming …

DAQ: Density-Aware Post-Training Weight-Only Quantization for LLMs

Y Luo, L Chen - arXiv preprint arXiv:2410.12187, 2024 - arxiv.org
Large language models (LLMs) excel in various tasks but face deployment challenges due
to hardware constraints. We propose density-aware post-training weight-only quantization …