LLM-Pruner: On the structural pruning of large language models

X Ma, G Fang, X Wang - Advances in neural information …, 2023 - proceedings.neurips.cc
Large language models (LLMs) have shown remarkable capabilities in language
understanding and generation. However, such impressive capability typically comes with a …

Towards efficient generative large language model serving: A survey from algorithms to systems

X Miao, G Oliaro, Z Zhang, X Cheng, H Jin… - arXiv preprint arXiv …, 2023 - arxiv.org
In the rapidly evolving landscape of artificial intelligence (AI), generative large language
models (LLMs) stand at the forefront, revolutionizing how we interact with our data. However …

Learned token pruning for transformers

S Kim, S Shen, D Thorsley, A Gholami… - Proceedings of the 28th …, 2022 - dl.acm.org
Efficient deployment of transformer models in practice is challenging due to their inference
cost, including memory footprint, latency, and power consumption, which scales quadratically …

Beyond efficiency: A systematic survey of resource-efficient large language models

G Bai, Z Chai, C Ling, S Wang, J Lu, N Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
The burgeoning field of Large Language Models (LLMs), exemplified by sophisticated
models like OpenAI's ChatGPT, represents a significant advancement in artificial …

A survey on model compression and acceleration for pretrained language models

C Xu, J McAuley - Proceedings of the AAAI Conference on Artificial …, 2023 - ojs.aaai.org
Despite achieving state-of-the-art performance on many NLP tasks, the high energy cost and
long inference delay prevent Transformer-based pretrained language models (PLMs) from …

Diet code is healthy: Simplifying programs for pre-trained models of code

Z Zhang, H Zhang, B Shen, X Gu - Proceedings of the 30th ACM Joint …, 2022 - dl.acm.org
Pre-trained code representation models such as CodeBERT have demonstrated superior
performance in a variety of software engineering tasks, yet they are often heavy in …

Dynamic neural network structure: A review for its theories and applications

J Guo, CLP Chen, Z Liu, X Yang - IEEE Transactions on Neural …, 2024 - ieeexplore.ieee.org
The dynamic neural network (DNN), in contrast to its static counterpart, offers numerous
advantages, such as improved accuracy, efficiency, and interpretability. These benefits stem …

Length-adaptive transformer: Train once with length drop, use anytime with search

G Kim, K Cho - arXiv preprint arXiv:2010.07003, 2020 - arxiv.org
Despite transformers' impressive accuracy, their computational cost is often prohibitive for
use with limited computational resources. Most previous approaches to improve inference …

Resource-efficient Algorithms and Systems of Foundation Models: A Survey

M Xu, D Cai, W Yin, S Wang, X Jin, X Liu - ACM Computing Surveys, 2024 - dl.acm.org
Large foundation models, including large language models, vision transformers, diffusion,
and LLM-based multimodal models, are revolutionizing the entire machine learning …

Transkimmer: Transformer learns to layer-wise skim

Y Guan, Z Li, J Leng, Z Lin, M Guo - arXiv preprint arXiv:2205.07324, 2022 - arxiv.org
Transformer architecture has become the de facto model for many machine learning tasks
in natural language processing and computer vision. As such, improving its computational …