ZeroQuant: Efficient and affordable post-training quantization for large-scale transformers
Z Yao, R Yazdani Aminabadi… - Advances in …, 2022 - proceedings.neurips.cc
How to efficiently serve ever-larger trained natural language models in practice has become
exceptionally challenging even for powerful cloud servers due to their prohibitive …
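
A minimal sketch of the post-training idea, assuming nothing beyond a trained weight matrix: quantize weights per group to int8 with a symmetric scale. The group size and shapes are illustrative; ZeroQuant's full recipe (token-wise activation quantization, layer-by-layer knowledge distillation, fused kernels) is not shown.

```python
import torch

def quantize_int8_groupwise(w: torch.Tensor, group_size: int = 128):
    """Symmetric per-group int8 quantization of a 2-D weight matrix.
    Assumes group_size divides the column count."""
    rows, cols = w.shape
    groups = w.reshape(rows, cols // group_size, group_size)
    scale = (groups.abs().amax(dim=-1, keepdim=True) / 127.0).clamp_min(1e-8)
    q = torch.clamp((groups / scale).round(), -128, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return (q.to(torch.float32) * scale).reshape(q.shape[0], -1)

w = torch.randn(768, 768)
q, s = quantize_int8_groupwise(w)
print((w - dequantize(q, s)).abs().max())  # small per-group rounding error
```
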
Structured pruning learns compact and accurate models
The growing size of neural language models has led to increased attention to model
compression. The two predominant approaches are pruning, which gradually removes …
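
For a picture of what structured removal looks like, the sketch below scores attention heads with a simple L2-norm heuristic and keeps the top-k. The heuristic and all shapes are stand-ins; the paper learns coarse- and fine-grained pruning masks jointly with a distillation objective rather than using a fixed score.

```python
import torch

def head_importance(w_o: torch.Tensor, num_heads: int) -> torch.Tensor:
    """Toy importance: L2 norm of each head's slice of the attention
    output projection. A stand-in heuristic, not the paper's method."""
    head_dim = w_o.shape[0] // num_heads
    slices = w_o.reshape(num_heads, head_dim, -1)
    return slices.pow(2).sum(dim=(1, 2)).sqrt()

w_o = torch.randn(768, 768)            # 12 heads x 64 dims -> 768 model dims
scores = head_importance(w_o, num_heads=12)
keep = scores.topk(8).indices          # structurally remove the 4 weakest heads
print(sorted(keep.tolist()))
```
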
A fast post-training pruning framework for transformers
Pruning is an effective way to reduce the huge inference cost of Transformer models.
However, prior work on pruning Transformers requires retraining the models. This can add …
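
Skipping retraining means unit importance must come cheaply from a small calibration set. The toy below estimates a diagonal-Fisher score for gated heads from squared mask gradients, the kind of signal such retraining-free frameworks use before a mask-search step; the tiny model and proxy loss are invented for illustration.

```python
import torch

# Toy "model": 12 attention heads gated by a differentiable mask.
heads, dim = 12, 64
W = torch.randn(heads, dim)

def model(x, mask):
    return (mask[:, None] * W * x).sum()

mask = torch.ones(heads, requires_grad=True)
fisher = torch.zeros(heads)
for _ in range(8):                        # small calibration set
    x = torch.randn(heads, dim)
    mask.grad = None
    model(x, mask).pow(2).backward()      # proxy loss; no labels needed
    fisher += mask.grad.pow(2)            # squared gradient ~ Fisher diagonal

keep = fisher.topk(8).indices             # keep high-Fisher heads, no retraining
print(sorted(keep.tolist()))
```
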
Width & depth pruning for vision transformers
Transformer models have demonstrated their promising potential and achieved excellent
performance on a series of computer vision tasks. However, the huge computational cost of …
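
Depth pruning removes whole blocks while width pruning narrows the survivors. The toy gate below shows only the depth half, with a hand-picked keep pattern standing in for learned gates and a linear layer standing in for a real transformer block.

```python
import torch
import torch.nn as nn

class GatedBlock(nn.Module):
    """Wraps a block so it can be depth-pruned (skipped entirely)."""
    def __init__(self, block: nn.Module, keep: bool = True):
        super().__init__()
        self.block, self.keep = block, keep

    def forward(self, x):
        return self.block(x) if self.keep else x  # pruned block = identity

# Hand-picked pattern: drop every other "block" (a placeholder linear layer).
blocks = nn.Sequential(
    *[GatedBlock(nn.Linear(64, 64), keep=(i % 2 == 0)) for i in range(6)]
)
print(blocks(torch.randn(2, 64)).shape)   # only half the depth executes
```
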
Learned token pruning for transformers
Efficient deployment of transformer models in practice is challenging due to their inference
cost including memory footprint, latency, and power consumption, which scales quadratically …
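
A rough sketch of the mechanism, with one caveat up front: the paper's contribution is learning the pruning threshold per layer during fine-tuning, whereas the constant below is fixed by hand. Key tokens are ranked by the average attention they receive and dropped when below the threshold.

```python
import torch

def prune_tokens(hidden, attn, threshold):
    """Drop tokens whose average received attention is below threshold.
    The threshold is a fixed constant here; the paper learns it per layer."""
    score = attn.mean(dim=(0, 1, 2))      # per-key-token attention received
    keep = score >= threshold
    return hidden[:, keep, :], keep

hidden = torch.randn(1, 10, 64)                          # (batch, tokens, dim)
attn = torch.softmax(torch.randn(1, 4, 10, 10), dim=-1)  # (batch, heads, q, k)
pruned, keep = prune_tokens(hidden, attn, threshold=0.08)
print(pruned.shape, int(keep.sum()), "tokens kept")
```
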
XTC: Extreme compression for pre-trained transformers made simple and efficient
Extreme compression, particularly ultra-low bit precision (binary/ternary) quantization, has
been proposed to fit large NLP models on resource-constrained devices. However, to …
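
To make "ternary" concrete, the sketch below snaps weights to {-alpha, 0, +alpha} using a threshold heuristic borrowed from earlier ternary-weight work; XTC's contribution is the simplified training pipeline around such quantizers, not the quantizer itself.

```python
import torch

def ternarize(w: torch.Tensor) -> torch.Tensor:
    """Snap weights to {-alpha, 0, +alpha} with a threshold heuristic."""
    delta = 0.7 * w.abs().mean()          # common threshold from prior ternary work
    mask = w.abs() > delta
    alpha = w[mask].abs().mean()          # scale fit to the surviving weights
    return alpha * torch.sign(w) * mask

w = torch.randn(256, 256)
w_t = ternarize(w)
print(torch.unique(w_t).tolist())         # three distinct values
```
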
Gradient-free structured pruning with unlabeled data
Large Language Models (LLMs) have achieved great success in solving difficult
tasks across many domains, but such success comes with a high computation cost, and …
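
"Gradient-free" means importance is computed from forward passes alone. A minimal stand-in, with an invented layer and keep ratio: rank FFN neurons by mean absolute activation over unlabeled batches. The paper's actual criterion is more principled than this magnitude proxy.

```python
import torch

@torch.no_grad()                          # forward only: no labels, no gradients
def activation_scores(layer, unlabeled_batches):
    """Rank neurons by mean absolute activation on unlabeled data."""
    total = torch.zeros(layer.out_features)
    for x in unlabeled_batches:
        total += torch.relu(layer(x)).abs().mean(dim=0)
    return total

layer = torch.nn.Linear(64, 256)
scores = activation_scores(layer, [torch.randn(32, 64) for _ in range(4)])
keep = scores.topk(128).indices           # keep the most active half of the FFN
print(keep.shape)
```
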
SwiftPruner: Reinforced evolutionary pruning for efficient ad relevance
Ad relevance modeling plays a critical role in online advertising systems including Microsoft
Bing. To leverage powerful transformers like BERT in this low-latency setting, many existing …
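
The search can be pictured as an evolutionary loop over per-layer sparsity assignments. Everything below is a plain evolutionary sketch with a made-up fitness function; SwiftPruner additionally steers mutation with a reinforcement-learned controller and scores candidates against a real latency constraint.

```python
import random

def evolve(fitness, n_layers=12, pop=16, gens=20):
    """Plain evolutionary search over per-layer sparsity assignments."""
    choices = [0.0, 0.25, 0.5, 0.75]
    population = [[random.choice(choices) for _ in range(n_layers)]
                  for _ in range(pop)]
    for _ in range(gens):
        population.sort(key=fitness, reverse=True)
        parents = population[: pop // 2]          # keep the fitter half
        children = []
        for p in parents:
            child = p[:]
            child[random.randrange(n_layers)] = random.choice(choices)  # mutate
            children.append(child)
        population = parents + children
    return max(population, key=fitness)

# Made-up fitness: reward overall sparsity but protect the first layer.
best = evolve(lambda s: sum(s) - 2 * s[0])
print(best)
```
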
Efficient label-free pruning and retraining for Text-VQA Transformers
Recent advancements in Scene Text Visual Question Answering (Text-VQA)
employ autoregressive Transformers, showing improved performance with larger models …
PruMUX: Augmenting Data Multiplexing with Model Compression
As language models increase in size by the day, methods for efficient inference are critical to
leveraging their capabilities for various applications. Prior work has investigated techniques …
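
Data multiplexing superimposes several instances into one input so a single forward pass serves all of them, which PruMUX then combines with compressing the shared model. The toy below mixes vectors via fixed random orthogonal transforms and attempts a (lossy) demux; the shapes and recovery step are purely illustrative.

```python
import torch

def multiplex(xs, transforms):
    """Superimpose N instances into one vector via fixed transforms."""
    return sum(t @ x for t, x in zip(transforms, xs)) / len(xs)

dim, n = 64, 4
transforms = [torch.linalg.qr(torch.randn(dim, dim))[0] for _ in range(n)]
xs = [torch.randn(dim) for _ in range(n)]
mixed = multiplex(xs, transforms)          # one vector now carries 4 instances
# Rough demux of instance 0: undo its transform (lossy once instances mix).
x0_est = transforms[0].T @ (mixed * n)
print(torch.nn.functional.cosine_similarity(x0_est, xs[0], dim=0))
```
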