ZeroQuant: Efficient and affordable post-training quantization for large-scale transformers

Z Yao, R Yazdani Aminabadi… - Advances in …, 2022 - proceedings.neurips.cc
How to efficiently serve ever-larger trained natural language models in practice has become
exceptionally challenging even for powerful cloud servers due to their prohibitive …

Structured pruning learns compact and accurate models

M Xia, Z Zhong, D Chen - arXiv preprint arXiv:2204.00408, 2022 - arxiv.org
The growing size of neural language models has led to increased attention in model
compression. The two predominant approaches are pruning, which gradually removes …

A fast post-training pruning framework for transformers

W Kwon, S Kim, MW Mahoney… - Advances in …, 2022 - proceedings.neurips.cc
Pruning is an effective way to reduce the huge inference cost of Transformer models.
However, prior work on pruning Transformers requires retraining the models. This can add …

Width & depth pruning for vision transformers

F Yu, K Huang, M Wang, Y Cheng, W Chu… - Proceedings of the AAAI …, 2022 - ojs.aaai.org
Transformer models have demonstrated their promising potential and achieved excellent
performance on a series of computer vision tasks. However, the huge computational cost of …

Learned token pruning for transformers

S Kim, S Shen, D Thorsley, A Gholami… - Proceedings of the 28th …, 2022 - dl.acm.org
Efficient deployment of transformer models in practice is challenging due to their inference
cost including memory footprint, latency, and power consumption, which scales quadratically …

XTC: Extreme compression for pre-trained transformers made simple and efficient

X Wu, Z Yao, M Zhang, C Li… - Advances in Neural …, 2022 - proceedings.neurips.cc
Extreme compression, particularly ultra-low bit precision (binary/ternary) quantization, has
been proposed to fit large NLP models on resource-constrained devices. However, to …

Gradient-free structured pruning with unlabeled data

A Nova, H Dai, D Schuurmans - International Conference on …, 2023 - proceedings.mlr.press
Large Language Models (LLMs) have achieved great success in solving difficult
tasks across many domains, but such success comes with a high computation cost, and …

SwiftPruner: Reinforced evolutionary pruning for efficient ad relevance

LL Zhang, Y Homma, Y Wang, M Wu, M Yang… - Proceedings of the 31st …, 2022 - dl.acm.org
Ad relevance modeling plays a critical role in online advertising systems including Microsoft
Bing. To leverage powerful transformers like BERT in this low-latency setting, many existing …

Efficient label-free pruning and retraining for Text-VQA Transformers

SC Poh, CS Chan, CK Lim - Pattern Recognition Letters, 2024 - Elsevier
Recent advancements in Scene Text Visual Question Answering (Text-VQA)
employ autoregressive Transformers, showing improved performance with larger models …

PruMUX: Augmenting Data Multiplexing with Model Compression

Y Su, V Murahari, K Narasimhan, K Li - arXiv preprint arXiv:2305.14706, 2023 - arxiv.org
As language models increase in size by the day, methods for efficient inference are critical to
leveraging their capabilities for various applications. Prior work has investigated techniques …