A survey of techniques for optimizing transformer inference

KT Chitty-Venkata, S Mittal, M Emani… - Journal of Systems …, 2023 - Elsevier
Recent years have seen a phenomenal rise in the performance and applications of
transformer neural networks. The family of transformer networks, including Bidirectional …

Full stack optimization of transformer inference: a survey

S Kim, C Hooper, T Wattanawong, M Kang… - arXiv preprint arXiv …, 2023 - arxiv.org
Recent advances in state-of-the-art DNN architecture design have been moving toward
Transformer models. These models achieve superior accuracy across a wide range of …

FACT: FFN-attention co-optimized transformer architecture with eager correlation prediction

Y Qin, Y Wang, D Deng, Z Zhao, X Yang, L Liu… - Proceedings of the 50th …, 2023 - dl.acm.org
The Transformer model is becoming prevalent in various AI applications owing to its outstanding
performance. However, the high cost of computation and memory footprint make its …

Bi-directional masks for efficient N:M sparse training

Y Zhang, Y Luo, M Lin, Y Zhong, J Xie… - … on machine learning, 2023 - proceedings.mlr.press
We focus on addressing the dense backward propagation issue for training efficiency of N:M
fine-grained sparsity that preserves at most N out of M consecutive weights and achieves …
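As context for this entry, below is a minimal sketch of the N:M constraint itself (here the common 2:4 pattern supported by NVIDIA sparse tensor cores), assuming PyTorch. It only illustrates what an N:M mask preserves; it is not the bidirectional-mask training algorithm proposed in the paper.

```python
# Illustrative N:M fine-grained sparsity: within every group of M consecutive
# weights, keep the N largest-magnitude entries and zero the rest.
import torch

def nm_prune(weight: torch.Tensor, n: int = 2, m: int = 4) -> torch.Tensor:
    """Return `weight` with at most n nonzeros per m consecutive entries
    (assumes weight.numel() is divisible by m)."""
    orig_shape = weight.shape
    groups = weight.reshape(-1, m)                       # one row per group of m weights
    idx = groups.abs().topk(n, dim=1).indices            # positions of the n largest magnitudes
    mask = torch.zeros_like(groups).scatter_(1, idx, 1.) # 1 where a weight survives
    return (groups * mask).reshape(orig_shape)

# Example: prune a small linear layer's weight to the 2:4 pattern.
w = torch.randn(8, 16)
w_sparse = nm_prune(w, n=2, m=4)
assert (w_sparse.reshape(-1, 4) != 0).sum(dim=1).max() <= 2
```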

Sparsity in transformers: A systematic literature review

M Farina, U Ahmad, A Taha, H Younes, Y Mesbah… - Neurocomputing, 2024 - Elsevier
Transformers have become the state-of-the-art architectures for various tasks in Natural
Language Processing (NLP) and Computer Vision (CV); however, their space and …

SparseMAE: Sparse training meets masked autoencoders

A Zhou, Y Li, Z Qin, J Liu, J Pan… - Proceedings of the …, 2023 - openaccess.thecvf.com
Masked Autoencoders (MAE) and their variants have proven to be effective for pretraining
large-scale Vision Transformers (ViTs). However, small-scale models do not benefit from the …

STEP: Learning N:M structured sparsity masks from scratch with precondition

Y Lu, S Agrawal, S Subramanian… - International …, 2023 - proceedings.mlr.press
Recent innovations in hardware (e.g., Nvidia A100) have motivated learning N:M structured
sparsity masks from scratch for fast model inference. However, state-of-the-art learning …

QServe: W4A8KV4 quantization and system co-design for efficient LLM serving

Y Lin, H Tang, S Yang, Z Zhang, G Xiao, C Gan… - arXiv preprint arXiv …, 2024 - arxiv.org
Quantization can accelerate large language model (LLM) inference. Going beyond INT8
quantization, the research community is actively exploring even lower precision, such as …
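To make the precision levels in the title concrete, here is a minimal sketch of symmetric integer quantization with one scale per channel, assuming PyTorch. The function names are illustrative only and do not reflect QServe's actual W4A8KV4 kernels or system co-design; they simply show what 4-bit weights ("W4") and 8-bit activations ("A8") mean numerically.

```python
# Illustrative symmetric quantization: x ≈ q * scale, with q a small signed integer.
import torch

def quantize_symmetric(x: torch.Tensor, bits: int, dim: int = -1):
    """Quantize `x` to signed `bits`-bit integers with one scale per slice along `dim`."""
    qmax = 2 ** (bits - 1) - 1                                    # e.g. 7 for INT4, 127 for INT8
    scale = x.abs().amax(dim=dim, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax)
    return q.to(torch.int8), scale                                # int8 storage simulates low-bit values

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

w = torch.randn(4096, 4096)
w_q, w_scale = quantize_symmetric(w, bits=4, dim=1)   # 4-bit weights, per-output-row scale
a = torch.randn(1, 4096)
a_q, a_scale = quantize_symmetric(a, bits=8, dim=-1)  # 8-bit activations, per-token scale
print((w - dequantize(w_q, w_scale)).abs().max())     # worst-case weight quantization error
```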

Deep Learning in Environmental Toxicology: Current Progress and Open Challenges

H Tan, J Jin, C Fang, Y Zhang, B Chang… - ACS ES&T …, 2023 - ACS Publications
Ubiquitous chemicals in the environment may pose a threat to human health and the
ecosystem, so comprehensive toxicity information must be obtained. Due to the inability of …

BEBERT: Efficient and robust binary ensemble BERT

J Tian, C Fang, H Wang, Z Wang - ICASSP 2023-2023 IEEE …, 2023 - ieeexplore.ieee.org
Pre-trained BERT models have achieved impressive accuracy on natural language
processing (NLP) tasks. However, their excessive number of parameters hinders them from …