A-ViT: Adaptive tokens for efficient vision transformer
We introduce A-ViT, a method that adaptively adjusts the inference cost of vision transformers
(ViT) for images of different complexity. A-ViT achieves this by automatically reducing the …
A fast post-training pruning framework for transformers
Pruning is an effective way to reduce the huge inference cost of Transformer models.
However, prior work on pruning Transformers requires retraining the models. This can add …
Efficient methods for natural language processing: A survey
Recent work in natural language processing (NLP) has yielded appealing results from
scaling model parameters and training data; however, using only scale to improve …
Full stack optimization of transformer inference: a survey
Recent advances in state-of-the-art DNN architecture design have been moving toward
Transformer models. These models achieve superior accuracy across a wide range of …
A survey on green deep learning
In recent years, larger and deeper models have been springing up, continuously pushing state-
of-the-art (SOTA) results across various fields like natural language processing (NLP) and …
Compressing large-scale transformer-based models: A case study on BERT
Pre-trained Transformer-based models have achieved state-of-the-art performance for
various Natural Language Processing (NLP) tasks. However, these models often have …
An algorithm–hardware co-optimized framework for accelerating N:M sparse transformers
The Transformer has been an indispensable staple in deep learning. However, for real-life
applications, it is very challenging to deploy efficient Transformers due to the immense …
DFX: A low-latency multi-FPGA appliance for accelerating transformer-based text generation
Transformer is a deep learning language model widely used for natural language
processing (NLP) services in datacenters. Among transformer models, Generative …
A Review on Edge Large Language Models: Design, Execution, and Applications
Large language models (LLMs) have revolutionized natural language processing with their
exceptional capabilities. However, deploying LLMs on resource-constrained edge devices …
Adaptable butterfly accelerator for attention-based NNs via hardware and algorithm co-design
Attention-based neural networks have become pervasive in many AI tasks. Despite their
excellent algorithmic performance, the use of the attention mechanism and feedforward …