A-ViT: Adaptive tokens for efficient vision transformer
We introduce A-ViT, a method that adaptively adjusts the inference cost of vision transformers
(ViT) for images of different complexity. A-ViT achieves this by automatically reducing the …
A fast post-training pruning framework for transformers
Pruning is an effective way to reduce the huge inference cost of Transformer models.
However, prior work on pruning Transformers requires retraining the models. This can add …
Efficient methods for natural language processing: A survey
Recent work in natural language processing (NLP) has yielded appealing results from
scaling model parameters and training data; however, using only scale to improve …
Full stack optimization of transformer inference: a survey
Recent advances in state-of-the-art DNN architecture design have been moving toward
Transformer models. These models achieve superior accuracy across a wide range of …
A survey on green deep learning
In recent years, larger and deeper models have been springing up, continuously pushing state-
of-the-art (SOTA) results across various fields like natural language processing (NLP) and …
Compressing large-scale transformer-based models: A case study on BERT
Pre-trained Transformer-based models have achieved state-of-the-art performance for
various Natural Language Processing (NLP) tasks. However, these models often have …
An algorithm–hardware co-optimized framework for accelerating N:M sparse transformers
The Transformer has been an indispensable staple in deep learning. However, for real-life
applications, it is very challenging to deploy efficient Transformers due to the immense …
DFX: A low-latency multi-FPGA appliance for accelerating transformer-based text generation
Transformer is a deep learning language model widely used for natural language
processing (NLP) services in datacenters. Among transformer models, Generative …
A Review on Edge Large Language Models: Design, Execution, and Applications
Large language models (LLMs) have revolutionized natural language processing with their
exceptional capabilities. However, deploying LLMs on resource-constrained edge devices …
Adaptable butterfly accelerator for attention-based NNs via hardware and algorithm co-design
Attention-based neural networks have become pervasive in many AI tasks. Despite their
excellent algorithmic performance, the use of the attention mechanism and feedforward …