Ladder: Enabling Efficient Low-Precision Deep Learning Computing through Hardware-aware Tensor Transformation
The increasing demand for improving deep learning model performance has led to a
paradigm shift in supporting low-precision computation to harness the robustness of deep …
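For context, a minimal sketch of the general idea of mapping custom low-precision types onto hardware-native storage; the int4-in-uint8 packing layout and the function names below are illustrative assumptions, not Ladder's actual tensor transformations or API.

```python
# Illustrative sketch only: packing signed int4 values into hardware-native uint8
# storage, a generic instance of the low-precision "tensor transformation" idea.
import numpy as np

def pack_int4(values: np.ndarray) -> np.ndarray:
    """Pack pairs of signed int4 values (range [-8, 7]) into uint8 bytes."""
    assert values.size % 2 == 0, "need an even number of int4 values to pack"
    u = (values.astype(np.int8) & 0x0F).astype(np.uint8)  # two's-complement nibbles
    lo, hi = u[0::2], u[1::2]
    return (hi << 4) | lo

def unpack_int4(packed: np.ndarray) -> np.ndarray:
    """Recover the signed int4 values from packed uint8 bytes."""
    lo = (packed & 0x0F).astype(np.int8)
    hi = (packed >> 4).astype(np.int8)
    # Sign-extend the 4-bit nibbles back to int8.
    lo = np.where(lo > 7, lo - 16, lo)
    hi = np.where(hi > 7, hi - 16, hi)
    out = np.empty(packed.size * 2, dtype=np.int8)
    out[0::2], out[1::2] = lo, hi
    return out

w = np.array([-8, 7, 3, -2], dtype=np.int8)
assert np.array_equal(unpack_int4(pack_int4(w)), w)
```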
BitDistiller: Unleashing the Potential of Sub-4-Bit LLMs via Self-Distillation
The upscaling of Large Language Models (LLMs) has yielded impressive advances in
natural language processing, yet it also poses significant deployment challenges. Weight …
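The "self-distillation" in the title refers, broadly, to a full-precision model supervising its own quantized copy. Below is a generic sketch of such a distillation loss; the temperature, loss form, and names are assumptions and do not reproduce BitDistiller's actual objective or training recipe.

```python
# Generic sketch: KL divergence between a full-precision "teacher" and its
# quantized "student" on temperature-softened output distributions.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def distillation_kl(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) averaged over token positions."""
    p = softmax(teacher_logits / temperature)  # teacher: full-precision weights
    q = softmax(student_logits / temperature)  # student: quantized weights
    return float(np.mean(np.sum(p * (np.log(p + 1e-9) - np.log(q + 1e-9)), axis=-1)))

rng = np.random.default_rng(0)
t = rng.normal(size=(4, 32000))                # 4 token positions over a vocab
s = t + rng.normal(scale=0.1, size=t.shape)    # quantized model drifts slightly
print(distillation_kl(t, s))
```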
HQ-DiT: Efficient Diffusion Transformer with FP4 Hybrid Quantization
Diffusion Transformers (DiTs) have recently gained substantial attention in both industrial
and academic fields for their superior visual generation capabilities, outperforming …
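As background on the FP4 part of the title: the E2M1 (FP4) format can represent only a handful of magnitudes, so quantization amounts to scaling a tensor and snapping each value to that small grid. The sketch below shows plain round-to-nearest FP4 quantization with a single per-tensor scale; HQ-DiT's hybrid scheme (which tensors get FP4 and how scales are chosen) is not reproduced here.

```python
# Sketch of round-to-nearest FP4 (E2M1) quantization with one per-tensor scale.
import numpy as np

# Representable magnitudes of the E2M1 (FP4) format; the full grid is symmetric.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
FP4_VALUES = np.concatenate([-FP4_GRID[::-1], FP4_GRID])

def quantize_fp4(x: np.ndarray):
    """Scale so the largest magnitude maps to 6.0 (FP4's max), then snap to the grid."""
    amax = float(np.abs(x).max())
    scale = amax / 6.0 if amax > 0 else 1.0
    idx = np.abs(x[..., None] / scale - FP4_VALUES).argmin(axis=-1)
    return FP4_VALUES[idx], scale

def dequantize_fp4(q: np.ndarray, scale: float) -> np.ndarray:
    return q * scale

x = np.random.default_rng(1).normal(size=(8, 8)).astype(np.float32)
q, s = quantize_fp4(x)
print("max abs error:", np.abs(dequantize_fp4(q, s) - x).max())
```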
DAQ: Density-Aware Post-Training Weight-Only Quantization For LLMs
Y Luo, L Chen - arXiv preprint arXiv:2410.12187, 2024 - arxiv.org
Large language models (LLMs) excel in various tasks but face deployment challenges due
to hardware constraints. We propose density-aware post-training weight-only quantization …
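For reference, weight-only post-training quantization in its generic per-group form looks like the sketch below; DAQ's density-aware grouping is not implemented here, and the group size and asymmetric scheme are illustrative choices, not the paper's settings.

```python
# Generic per-group weight-only INT4 quantization/dequantization sketch.
import numpy as np

def quantize_weight_group(w: np.ndarray, group_size: int = 128, n_bits: int = 4):
    """Asymmetric per-group quantization of a (rows, cols) weight matrix."""
    rows, cols = w.shape
    assert cols % group_size == 0
    g = w.reshape(rows, cols // group_size, group_size)
    w_min = g.min(axis=-1, keepdims=True)
    w_max = g.max(axis=-1, keepdims=True)
    qmax = 2 ** n_bits - 1                              # 15 for INT4
    scale = np.maximum(w_max - w_min, 1e-8) / qmax
    q = np.clip(np.round((g - w_min) / scale), 0, qmax).astype(np.uint8)
    return q, scale, w_min

def dequantize_weight_group(q, scale, w_min, shape):
    return (q.astype(np.float32) * scale + w_min).reshape(shape)

w = np.random.default_rng(2).normal(size=(256, 512)).astype(np.float32)
q, s, z = quantize_weight_group(w)
w_hat = dequantize_weight_group(q, s, z, w.shape)
print("mean abs error:", np.abs(w - w_hat).mean())
```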