Ladder: Enabling Efficient Low-Precision Deep Learning Computing through Hardware-aware Tensor Transformation

L Wang, L Ma, S Cao, Q Zhang, J Xue, Y Shi… - … USENIX Symposium on …, 2024 - usenix.org
The increasing demand for improving deep learning model performance has led to a
paradigm shift in supporting low-precision computation to harness the robustness of deep …
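Low-bit formats such as 4-bit integers typically have no native storage type, so they are packed into wider types before hardware can operate on them. The sketch below is a minimal, generic illustration of that packing step in NumPy; it is an assumption-based example, not Ladder's actual tensor transformation.

```python
import numpy as np

def pack_int4(values: np.ndarray) -> np.ndarray:
    """Pack signed 4-bit integers (range [-8, 7]) two per byte into uint8."""
    assert values.size % 2 == 0, "pad to an even length before packing"
    v = (values.astype(np.int8) & 0x0F).astype(np.uint8)
    return (v[0::2] | (v[1::2] << 4)).astype(np.uint8)

def unpack_int4(packed: np.ndarray) -> np.ndarray:
    """Recover signed 4-bit integers from their packed uint8 storage."""
    lo = (packed & 0x0F).astype(np.int8)
    hi = ((packed >> 4) & 0x0F).astype(np.int8)
    # Sign-extend the 4-bit values.
    lo = np.where(lo > 7, lo - 16, lo)
    hi = np.where(hi > 7, hi - 16, hi)
    out = np.empty(packed.size * 2, dtype=np.int8)
    out[0::2], out[1::2] = lo, hi
    return out

weights = np.array([-8, 7, 3, -1, 0, 5, -4, 2], dtype=np.int8)
assert np.array_equal(unpack_int4(pack_int4(weights)), weights)
```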

Numerical Accuracy Matters: Applications of Machine Learned Potential Energy Surfaces

S Käser, M Meuwly - The Journal of Physical Chemistry Letters, 2024 - ACS Publications
The role of numerical accuracy in training and evaluating neural network-based potential
energy surfaces is examined for different experimental observables. For observables that …
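As a self-contained reminder (not taken from the paper) of why numerical precision can matter when many small contributions are accumulated, the following snippet compares float32 and float64 summation of a large number of small terms.

```python
import numpy as np

rng = np.random.default_rng(0)
terms = rng.normal(scale=1e-4, size=10_000_000)   # many small contributions

sum64 = np.sum(terms, dtype=np.float64)
sum32 = float(np.sum(terms.astype(np.float32), dtype=np.float32))

print(f"float64 sum: {sum64:.10f}")
print(f"float32 sum: {sum32:.10f}")
print(f"absolute difference: {abs(sum64 - sum32):.2e}")
```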

Achieving Peak Performance for Large Language Models: A Systematic Review

ZRK Rostam, S Szénási, G Kertész - IEEE Access, 2024 - ieeexplore.ieee.org
In recent years, large language models (LLMs) have achieved remarkable success in
natural language processing (NLP). LLMs require an extremely large number of parameters to …

Accurate Block Quantization in LLMs with Outliers

N Trukhanov, I Soloveychik - arXiv preprint arXiv:2403.20137, 2024 - arxiv.org
The demand for inference on extremely large-scale LLMs has grown enormously in recent
months, making evident the colossal shortage of dedicated hardware capable of …
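To make the outlier problem concrete, the sketch below shows plain absmax block quantization, where a single large value in a block inflates the shared scale and wipes out resolution for the remaining elements. It is a generic illustration of the problem this entry targets, not the paper's proposed method.

```python
import numpy as np

def quantize_block_int8(block: np.ndarray):
    """Absmax block quantization: one FP scale shared by the whole block."""
    scale = np.max(np.abs(block)) / 127.0
    q = np.clip(np.round(block / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

# A block of small weights plus one outlier: the outlier inflates the scale,
# so the remaining values land on only a few integer levels.
block = np.array([0.01, -0.02, 0.015, 0.03, 12.0], dtype=np.float32)
q, scale = quantize_block_int8(block)
print("scale:", scale)
print("reconstruction error:", np.abs(dequantize(q, scale) - block))
```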

ScalingFilter: Assessing Data Quality through Inverse Utilization of Scaling Laws

R Li, Y Wei, M Zhang, N Yu, H Hu, H Peng - arXiv preprint arXiv …, 2024 - arxiv.org
High-quality data is crucial for the pre-training performance of large language models.
Unfortunately, existing quality filtering methods rely on a known high-quality dataset as …

Efficient Training of Large Language Models on Distributed Infrastructures: A Survey

J Duan, S Zhang, Z Wang, L Jiang, W Qu, Q Hu… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) like GPT and LLaMA are revolutionizing the AI industry with
their sophisticated capabilities. Training these models requires vast GPU clusters and …

Jetfire: Efficient and Accurate Transformer Pretraining with INT8 Data Flow and Per-Block Quantization

H Xi, Y Chen, K Zhao, K Zheng, J Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
Pretraining transformers is generally time-consuming. Fully quantized training (FQT) is a
promising approach to speed up pretraining. However, most FQT methods adopt a quantize …
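The snippet below sketches the generic quantize-compute-dequantize pattern for an INT8 matrix multiplication with int32 accumulation. For simplicity it uses per-row and per-column scales rather than a true 2D per-block scheme, so treat it as an illustrative assumption rather than Jetfire's actual data flow.

```python
import numpy as np

def quantize_rows_int8(x: np.ndarray):
    """Per-row absmax quantization: one scale per row of the matrix."""
    scales = np.max(np.abs(x), axis=1, keepdims=True) / 127.0
    scales = np.maximum(scales, 1e-8)           # avoid division by zero
    q = np.clip(np.round(x / scales), -127, 127).astype(np.int8)
    return q, scales

def int8_matmul(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Quantize -> integer matmul -> dequantize (a: M x K, b: K x N)."""
    qa, sa = quantize_rows_int8(a)              # per-row scales of a
    qb, sb = quantize_rows_int8(b.T)            # per-column scales of b
    acc = qa.astype(np.int32) @ qb.T.astype(np.int32)   # int32 accumulation
    return acc.astype(np.float32) * sa * sb.T   # rescale back to float

a = np.random.randn(4, 8).astype(np.float32)
b = np.random.randn(8, 3).astype(np.float32)
print(np.max(np.abs(int8_matmul(a, b) - a @ b)))  # small quantization error
```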

Exploring Quantization for Efficient Pre-Training of Transformer Language Models

K Chitsaz, Q Fournier, G Mordido… - arXiv preprint arXiv …, 2024 - arxiv.org
The increasing scale of Transformer models has led to an increase in their pre-training
computational requirements. While quantization has proven to be effective after pre-training …
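A common way to keep quantization inside the pre-training graph is fake quantization with a straight-through estimator, sketched below in PyTorch. This is a generic quantization-aware-training pattern, not necessarily the scheme explored in the paper.

```python
import torch

class FakeQuantSTE(torch.autograd.Function):
    """Symmetric int8 fake quantization with a straight-through gradient."""

    @staticmethod
    def forward(ctx, x):
        scale = x.abs().max().clamp(min=1e-8) / 127.0
        return torch.clamp(torch.round(x / scale), -127, 127) * scale

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through estimator: treat rounding as the identity.
        return grad_output

w = torch.randn(16, 16, requires_grad=True)
x = torch.randn(4, 16)
y = x @ FakeQuantSTE.apply(w)   # weights are quantized in the forward pass
y.sum().backward()              # gradients flow to the full-precision weights
print(w.grad.shape)
```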

LoCo: Low-Bit Communication Adaptor for Large-scale Model Training

X Xie, Z Lin, KC Toh, P Zhou - arXiv preprint arXiv:2407.04480, 2024 - arxiv.org
To efficiently train large-scale models, low-bit gradient communication compresses full-
precision gradients on local GPU nodes into low-precision ones for higher gradient …
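As a generic picture of what low-bit gradient communication involves, the sketch below quantizes a local gradient to int8 with a single scale and carries the rounding error into the next step via an error-feedback buffer. The error-feedback detail is an assumption for illustration, not a claim about LoCo's exact algorithm.

```python
import numpy as np

def compress_gradient(grad: np.ndarray, error_buffer: np.ndarray):
    """Quantize a gradient to int8 for communication, carrying the rounding
    error forward to the next step (error feedback)."""
    corrected = grad + error_buffer
    scale = max(np.max(np.abs(corrected)) / 127.0, 1e-12)
    q = np.clip(np.round(corrected / scale), -127, 127).astype(np.int8)
    new_error = corrected - q.astype(np.float32) * scale
    return q, scale, new_error

def decompress(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

grad = np.random.randn(1024).astype(np.float32) * 1e-3
err = np.zeros_like(grad)
q, scale, err = compress_gradient(grad, err)   # send q (1 byte/entry) + scale
print("bytes per entry:", q.itemsize, "max rounding error:", np.max(np.abs(err)))
```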

Towards Federated Learning with On-device Training and Communication in 8-bit Floating Point

B Wang, A Berg, DAE Acar, C Zhou - arXiv preprint arXiv:2407.02610, 2024 - arxiv.org
Recent work has shown that 8-bit floating point (FP8) can be used for efficiently training
neural networks with reduced computational overhead compared to training in FP32/FP16 …
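The snippet below gives a rough, not bit-exact simulation of casting a client update to an FP8 (E4M3-like) grid by rounding the mantissa to about 3 bits and clamping to the E4M3 maximum of 448. It only illustrates the kind of precision FP8 offers for on-device training and communication; it is not the paper's training or aggregation procedure.

```python
import numpy as np

def simulate_fp8_e4m3(x: np.ndarray) -> np.ndarray:
    """Rough FP8 (E4M3-like) simulation: keep ~3 mantissa bits and clamp to
    the E4M3 maximum magnitude (448). Not bit-exact; for illustration only."""
    x = np.asarray(x, dtype=np.float32)
    mant, exp = np.frexp(x)                 # x = mant * 2**exp, |mant| in [0.5, 1)
    mant = np.round(mant * 16.0) / 16.0     # 1 implicit + 3 explicit mantissa bits
    y = np.ldexp(mant, exp)
    return np.clip(y, -448.0, 448.0).astype(np.float32)

update = np.random.randn(1000).astype(np.float32) * 0.01
fp8_update = simulate_fp8_e4m3(update)      # what a client would transmit
rel_err = np.abs(fp8_update - update) / (np.abs(update) + 1e-12)
print("mean relative error:", np.mean(rel_err))
```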