Scaling laws for sparsely-connected foundation models

E Frantar, C Riquelme, N Houlsby, D Alistarh… - arXiv preprint arXiv …, 2023 - arxiv.org
We explore the impact of parameter sparsity on the scaling behavior of Transformers trained
on massive datasets (i.e., "foundation models"), in both vision and language domains. In this …

MaskLLM: Learnable semi-structured sparsity for large language models

G Fang, H Yin, S Muralidharan, G Heinrich… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) are distinguished by their massive parameter counts, which
typically result in significant redundancy. This work introduces MaskLLM, a learnable …

Lookahead: An inference acceleration framework for large language model with lossless generation accuracy

Y Zhao, Z Xie, C Liang, C Zhuang, J Gu - Proceedings of the 30th ACM …, 2024 - dl.acm.org
As Large Language Models (LLMs) have made significant advancements across various
tasks, such as question answering, translation, text summarization, and dialogue systems …

Effective Interplay between Sparsity and Quantization: From Theory to Practice

SB Harma, A Chakraborty, E Kostenok… - arXiv preprint arXiv …, 2024 - arxiv.org
The increasing size of deep neural networks necessitates effective model compression to
improve computational efficiency and reduce their memory footprint. Sparsity and …

ELSA: Exploiting Layer-wise N:M Sparsity for Vision Transformer Acceleration

NC Huang, CC Chang, WC Lin… - Proceedings of the …, 2024 - openaccess.thecvf.com
N:M sparsity is an emerging model compression method supported by a growing number of
accelerators to speed up sparse matrix multiplication in deep neural networks. Most existing …
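
For context, N:M sparsity constrains every group of M consecutive weights to contain at most N nonzeros (2:4 being the pattern supported by current sparse tensor cores). The snippet below is a minimal magnitude-based sketch of that constraint; the function name and the choice to group along the flattened weight array are illustrative assumptions, not any specific paper's method.

import numpy as np

def nm_prune(weights: np.ndarray, n: int = 2, m: int = 4) -> np.ndarray:
    # Zero out all but the n largest-magnitude entries in every
    # consecutive group of m elements (illustrative sketch only).
    groups = weights.reshape(-1, m)
    keep = np.argsort(-np.abs(groups), axis=1)[:, :n]   # indices of the n largest magnitudes
    mask = np.zeros_like(groups, dtype=bool)
    np.put_along_axis(mask, keep, True, axis=1)
    return (groups * mask).reshape(weights.shape)

w = np.random.randn(4, 8)
print(nm_prune(w))   # every group of 4 entries now has exactly 2 nonzeros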

SLoPe: Double-Pruned Sparse Plus Lazy Low-Rank Adapter Pretraining of LLMs

M Mozaffari, A Yazdanbakhsh, Z Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
We propose SLoPe, a Double-Pruned Sparse Plus Lazy Low-rank Adapter Pretraining
method for LLMs that improves the accuracy of sparse LLMs while accelerating their …

Progressive Gradient Flow for Robust N:M Sparsity Training in Transformers

AR Bambhaniya, A Yazdanbakhsh… - arXiv preprint arXiv …, 2024 - arxiv.org
N:M structured sparsity has garnered significant interest due to its relatively modest
overhead and improved efficiency. Additionally, this form of sparsity holds considerable …

Beyond 2:4: exploring V:N:M sparsity for efficient transformer inference on GPUs

K Zhao, T Yuan, H Bao, Z Su, C Gao, Z Sun… - arXiv preprint arXiv …, 2024 - arxiv.org
To date, 2:4 sparsity has stood as the only sparse pattern that can be accelerated using
sparse tensor cores on GPUs. In practice, 2:4 sparsity often delivers low actual speedups …
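
As a usage note, the 2:4 acceleration referred to here is exposed in PyTorch through a prototype semi-structured sparsity API; the sketch below assumes torch.sparse.to_sparse_semi_structured (available in recent releases) and a CUDA GPU with sparse tensor cores, with shapes chosen to meet the kernel's alignment requirements.

import torch
from torch.sparse import to_sparse_semi_structured

# A half-precision matrix that already satisfies the 2:4 pattern
# (two zeros in every group of four along each row).
A = torch.tensor([[0, 0, 1, 1]]).tile(128, 64).half().cuda()
B = torch.rand(256, 128).half().cuda()

A_sparse = to_sparse_semi_structured(A)   # compressed 2:4 representation
y = torch.mm(A_sparse, B)                 # dispatches to sparse tensor core kernels
assert torch.allclose(y, torch.mm(A, B), atol=1e-3)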

Complementary Sparsity: Accelerating Sparse CNNs with High Accuracy on General-Purpose Computing Platforms

K Zhao, Y Tan, K Han, T Hu, H Chen… - … on Machine Learning …, 2023 - openreview.net
Model sparsity is a promising approach to reducing parameters or FLOPs of convolutional
neural networks (CNNs). Compared to unstructured or coarse-grained structured sparsity …

S-STE: Continuous Pruning Function for Efficient 2:4 Sparse Pre-training

Y Hu, J Zhu, J Chen - arXiv preprint arXiv:2409.09099, 2024 - arxiv.org
Training deep neural networks (DNNs) is costly. Fortunately, Nvidia Ampere and Hopper
GPUs can perform matrix multiplications up to twice as fast as their dense equivalents by …
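
On the training side, a common baseline in this line of work is hard 2:4 magnitude pruning combined with a straight-through estimator (STE), where the mask is applied in the forward pass and gradients flow to the dense weights unchanged. The sketch below illustrates that conventional baseline, not the continuous pruning function proposed by S-STE.

import torch

class STE24(torch.autograd.Function):
    # Hard 2:4 magnitude pruning with a straight-through estimator:
    # the forward pass applies the mask, the backward pass returns the
    # incoming gradient unchanged. (Baseline sketch, not the S-STE method.)

    @staticmethod
    def forward(ctx, w):
        groups = w.reshape(-1, 4)
        idx = groups.abs().topk(2, dim=1).indices        # keep the 2 largest per group of 4
        mask = torch.zeros_like(groups, dtype=torch.bool)
        mask.scatter_(1, idx, True)
        return (groups * mask).reshape(w.shape)

    @staticmethod
    def backward(ctx, grad_out):
        return grad_out                                  # straight-through gradient

w = torch.randn(16, 16, requires_grad=True)
STE24.apply(w).sum().backward()                          # dense gradient reaches w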