ZeroQuant: Efficient and affordable post-training quantization for large-scale transformers

Z Yao, R Yazdani Aminabadi… - Advances in …, 2022 - proceedings.neurips.cc
How to efficiently serve ever-larger trained natural language models in practice has become
exceptionally challenging even for powerful cloud servers due to their prohibitive …

Structured pruning learns compact and accurate models

M Xia, Z Zhong, D Chen - arXiv preprint arXiv:2204.00408, 2022 - arxiv.org
The growing size of neural language models has led to increased attention in model
compression. The two predominant approaches are pruning, which gradually removes …

A fast post-training pruning framework for transformers

W Kwon, S Kim, MW Mahoney… - Advances in …, 2022 - proceedings.neurips.cc
Pruning is an effective way to reduce the huge inference cost of Transformer models.
However, prior work on pruning Transformers requires retraining the models. This can add …

Width & depth pruning for vision transformers

F Yu, K Huang, M Wang, Y Cheng, W Chu… - Proceedings of the AAAI …, 2022 - ojs.aaai.org
Transformer models have demonstrated their promising potential and achieved excellent
performance on a series of computer vision tasks. However, the huge computational cost of …

Learned token pruning for transformers

S Kim, S Shen, D Thorsley, A Gholami… - Proceedings of the 28th …, 2022 - dl.acm.org
Efficient deployment of transformer models in practice is challenging due to their inference
cost including memory footprint, latency, and power consumption, which scales quadratically …

XTC: Extreme compression for pre-trained transformers made simple and efficient

X Wu, Z Yao, M Zhang, C Li… - Advances in Neural …, 2022 - proceedings.neurips.cc
Extreme compression, particularly ultra-low bit precision (binary/ternary) quantization, has
been proposed to fit large NLP models on resource-constrained devices. However, to …

Gradient-free structured pruning with unlabeled data

A Nova, H Dai, D Schuurmans - International Conference on …, 2023 - proceedings.mlr.press
Large Language Models (LLMs) have achieved great success in solving difficult
tasks across many domains, but such success comes with a high computation cost, and …

SwiftPruner: Reinforced evolutionary pruning for efficient ad relevance

LL Zhang, Y Homma, Y Wang, M Wu, M Yang… - Proceedings of the 31st …, 2022 - dl.acm.org
Ad relevance modeling plays a critical role in online advertising systems including Microsoft
Bing. To leverage powerful transformers like BERT in this low-latency setting, many existing …

Efficient label-free pruning and retraining for Text-VQA Transformers

SC Poh, CS Chan, CK Lim - Pattern Recognition Letters, 2024 - Elsevier
Recent advancements in Scene Text Visual Question Answering (Text-VQA)
employ autoregressive Transformers, showing improved performance with larger models …

PruMUX: Augmenting Data Multiplexing with Model Compression

Y Su, V Murahari, K Narasimhan, K Li - arXiv preprint arXiv:2305.14706, 2023 - arxiv.org
As language models increase in size by the day, methods for efficient inference are critical to
leveraging their capabilities for various applications. Prior work has investigated techniques …