A survey of techniques for optimizing transformer inference

KT Chitty-Venkata, S Mittal, M Emani… - Journal of Systems …, 2023 - Elsevier
Recent years have seen a phenomenal rise in the performance and applications of
transformer neural networks. The family of transformer networks, including Bidirectional …

On efficient training of large-scale deep learning models: A literature review

L Shen, Y Sun, Z Yu, L Ding, X Tian, D Tao - arXiv preprint arXiv …, 2023 - arxiv.org
The field of deep learning has witnessed significant progress, particularly in computer vision
(CV), natural language processing (NLP), and speech. The use of large-scale models …

FACT: FFN-attention co-optimized transformer architecture with eager correlation prediction

Y Qin, Y Wang, D Deng, Z Zhao, X Yang, L Liu… - Proceedings of the 50th …, 2023 - dl.acm.org
The Transformer model is becoming prevalent in various AI applications thanks to its outstanding
performance. However, the high cost of computation and memory footprint make its …

A survey on efficient inference for large language models

Z Zhou, X Ning, K Hong, T Fu, J Xu, S Li, Y Lou… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) have attracted extensive attention due to their remarkable
performance across various tasks. However, the substantial computational and memory …

ShiftAddViT: Mixture of multiplication primitives towards efficient vision transformer

H You, H Shi, Y Guo, Y Lin - Advances in Neural …, 2024 - proceedings.neurips.cc
Vision Transformers (ViTs) have shown impressive performance and have become
a unified backbone for multiple vision tasks. However, both the attention mechanism and …

FlightLLM: Efficient large language model inference with a complete mapping flow on FPGAs

S Zeng, J Liu, G Dai, X Yang, T Fu, H Wang… - Proceedings of the …, 2024 - dl.acm.org
Transformer-based Large Language Models (LLMs) have made a significant impact on
various domains. However, LLMs' efficiency suffers from both heavy computation and …

Sparse-DySta: Sparsity-Aware Dynamic and Static Scheduling for Sparse Multi-DNN Workloads

H Fan, SI Venieris, A Kouris, N Lane - … of the 56th Annual IEEE/ACM …, 2023 - dl.acm.org
Running multiple deep neural networks (DNNs) in parallel has become an emerging
workload in both edge devices, such as mobile phones, where multiple tasks serve a single …

ETTE: Efficient tensor-train-based computing engine for deep neural networks

Y Gong, M Yin, L Huang, J Xiao, Y Sui, C Deng… - Proceedings of the 50th …, 2023 - dl.acm.org
Tensor-train (TT) decomposition enables ultra-high compression ratios, making deep
neural network (DNN) accelerators based on this method very attractive. TIE, the state-of-the …

Taskfusion: An efficient transfer learning architecture with dual delta sparsity for multi-task natural language processing

Z Fan, Q Zhang, P Abillama, S Shoouri, C Lee… - Proceedings of the 50th …, 2023 - dl.acm.org
The combination of pre-trained models and task-specific fine-tuning schemes, such as
BERT, has achieved great success in various natural language processing (NLP) tasks …

MELTing point: Mobile Evaluation of Language Transformers

S Laskaridis, K Kateveas, L Minto… - arXiv preprint arXiv …, 2024 - arxiv.org
Transformers have revolutionized the machine learning landscape, gradually making their
way into everyday tasks and equipping our computers with "sparks of intelligence" …