Large Language Model Inference Acceleration: A Comprehensive Hardware Perspective

J Li, J Xu, S Huang, Y Chen, W Li, J Liu, Y Lian… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) have demonstrated remarkable capabilities across various
fields, from natural language understanding to text generation. Compared to non-generative …

InstInfer: In-Storage Attention Offloading for Cost-Effective Long-Context LLM Inference

X Pan, E Li, Q Li, S Liang, Y Shan, K Zhou… - arXiv preprint arXiv …, 2024 - arxiv.org
The widespread adoption of Large Language Models (LLMs) marks a significant milestone in
generative AI. Nevertheless, the increasing context length and batch size in offline LLM …
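
The memory pressure the snippet alludes to is easy to quantify: the KV cache grows linearly in both context length and batch size. A back-of-the-envelope sketch (the model dimensions are illustrative, not taken from the paper):

```python
# KV-cache footprint: 2 (K and V) x layers x kv_heads x head_dim
# x seq_len x batch x bytes per element. Dimensions are illustrative
# (roughly a 7B-class dense model), not InstInfer's configuration.
def kv_cache_bytes(layers=32, kv_heads=32, head_dim=128,
                   seq_len=32_768, batch=16, dtype_bytes=2):
    return 2 * layers * kv_heads * head_dim * seq_len * batch * dtype_bytes

print(f"{kv_cache_bytes() / 2**30:.0f} GiB")  # 256 GiB: far beyond one GPU's HBM
```

At 32K context and batch 16, the cache alone dwarfs GPU memory, which is the case for offloading attention and its KV data to storage.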

Duplex: A Device for Large Language Models with Mixture of Experts, Grouped Query Attention, and Continuous Batching

S Yun, K Kyung, J Cho, J Choi, J Kim… - 2024 57th IEEE/ACM …, 2024 - ieeexplore.ieee.org
Large language models (LLMs) have emerged owing to their capability to generate high-
quality content across diverse contexts. To reduce their explosively increasing demands for …
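
Of the three techniques in the title, grouped-query attention (GQA) is the simplest to show compactly: several query heads share one K/V head, shrinking the KV cache by the sharing factor. A minimal NumPy sketch (shapes and names are mine, not Duplex's; causal masking omitted for brevity):

```python
import numpy as np

# Grouped-query attention: n_q query heads share n_kv (< n_q) K/V heads,
# cutting the KV cache by a factor of n_q / n_kv. Shapes are illustrative.
def gqa(q, k, v, n_q=8, n_kv=2):
    # q: (n_q, seq, d); k, v: (n_kv, seq, d)
    group = n_q // n_kv
    out = np.empty_like(q)
    for h in range(n_q):
        kh, vh = k[h // group], v[h // group]        # shared K/V head
        scores = q[h] @ kh.T / np.sqrt(q.shape[-1])
        w = np.exp(scores - scores.max(-1, keepdims=True))
        w /= w.sum(-1, keepdims=True)                # softmax over keys
        out[h] = w @ vh
    return out
```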

NeuPIMs: NPU-PIM Heterogeneous Acceleration for Batched LLM Inferencing

G Heo, S Lee, J Cho, H Choi, S Lee, H Ham… - Proceedings of the 29th …, 2024 - dl.acm.org
Modern transformer-based Large Language Models (LLMs) are constructed with a series of
decoder blocks. Each block comprises three key components: (1) QKV generation, (2) multi …
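
That decomposition maps directly onto the hardware split the paper exploits: QKV generation and the feed-forward network are GEMM-heavy (a good fit for an NPU), while attention over the KV cache is bandwidth-bound (a good fit for PIM). A schematic single-head sketch of one standard decoder block (weight names are placeholders; norms and dropout omitted):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

# One decoder block, single attention head for brevity. Wq..W2 are
# illustrative placeholder weights, not the paper's.
def decoder_block(x, Wq, Wk, Wv, Wo, W1, W2):
    q, k, v = x @ Wq, x @ Wk, x @ Wv                  # (1) QKV generation: GEMMs
    a = softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v   # (2) attention: bandwidth-bound
    x = x + a @ Wo                                    # output projection + residual
    return x + np.maximum(x @ W1, 0) @ W2             # (3) feed-forward network
```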

An LPDDR-based CXL-PNM Platform for TCO-efficient Inference of Transformer-based Large Language Models

SS Park, KS Kim, J So, J Jung, J Lee… - … Symposium on High …, 2024 - ieeexplore.ieee.org
Transformer-based large language models (LLMs) such as Generative Pre-trained
Transformer (GPT) have become popular due to their remarkable performance across …

AttAcc! Unleashing the Power of PIM for Batched Transformer-based Generative Model Inference

J Park, J Choi, K Kyung, MJ Kim, Y Kwon… - Proceedings of the 29th …, 2024 - dl.acm.org
The Transformer-based generative model (TbGM), comprising summarization (Sum) and
generation (Gen) stages, has demonstrated unprecedented generative performance across …
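
The Sum/Gen split corresponds to what is elsewhere called prefill and decode: summarization ingests the whole prompt in one parallel, compute-bound pass, while generation emits one token at a time against a growing KV cache and is memory-bound, which makes it the natural stage to offload to PIM. A minimal loop under those assumptions (`model` is a hypothetical interface, not the paper's):

```python
# Two-stage generative inference. `model` is a hypothetical interface:
# prefill() consumes the full prompt at once; decode() takes one token.
def generate(model, prompt_ids, max_new_tokens):
    # Summarization (prefill): one parallel pass over the prompt.
    kv_cache, token = model.prefill(prompt_ids)
    out = [token]
    # Generation (decode): sequential; every step rereads the entire
    # KV cache, so throughput is bound by memory bandwidth.
    for _ in range(max_new_tokens - 1):
        kv_cache, token = model.decode(token, kv_cache)
        out.append(token)
    return out
```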

From Information Overload to Lucidity: A Survey on Leveraging GPTs for Systematic Summarization of Medical and Biomedical Artifacts

B Palanisamy, A Chakrabarti, A Singh, V Hassija… - IEEE …, 2024 - ieeexplore.ieee.org
In medical research, the rapid proliferation of condition-specific studies has led to an
information overload, making it challenging for researchers and practitioners to stay abreast …

PIM-MMU: A Memory Management Unit for Accelerating Data Transfers in Commercial PIM Systems

D Lee, B Hyun, T Kim, M Rhu - 2024 57th IEEE/ACM …, 2024 - ieeexplore.ieee.org
Processing-in-memory (PIM) has emerged as a promising solution for accelerating memory-
intensive workloads, as it provides high memory bandwidth to the processing units. This …

SpecPIM: Accelerating Speculative Inference on PIM-Enabled System via Architecture-Dataflow Co-Exploration

C Li, Z Zhou, S Zheng, J Zhang, Y Liang… - Proceedings of the 29th …, 2024 - dl.acm.org
Inference for generative large language models (LLMs) suffers from inefficiency because of the
token dependency introduced by autoregressive decoding. Recently, speculative inference has …
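
Speculative inference sidesteps that token dependency by letting a small draft model propose several tokens that the large target model then checks in a single parallel pass: matching tokens are accepted, and the first mismatch is replaced by the target's own choice. A greedy-acceptance sketch (`draft` and `target` are hypothetical next-token callables; SpecPIM's architecture-dataflow co-design is far more involved):

```python
# Greedy speculative decoding: the draft proposes k tokens; the target
# verifies them. Both arguments are callables mapping a token prefix to
# its argmax next token.
def speculative_step(target, draft, prefix, k=4):
    proposal = list(prefix)
    for _ in range(k):                        # cheap sequential drafting
        proposal.append(draft(proposal))
    accepted = list(prefix)
    # A real system scores all k positions in one batched target pass;
    # the per-position calls here just keep the sketch simple.
    for i in range(len(prefix), len(proposal)):
        t = target(accepted)                  # target's token at this position
        accepted.append(t)
        if t != proposal[i]:                  # first mismatch: keep target's token, stop
            break
    return accepted
```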

Chameleon: Adaptive Caching and Scheduling for Many-Adapter LLM Inference Environments

N Iliakopoulou, J Stojkovic, C Alverti, T Xu… - arXiv preprint arXiv …, 2024 - arxiv.org
The widespread adoption of LLMs has driven an exponential rise in their deployment,
imposing substantial demands on inference clusters. These clusters must handle numerous …
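
Serving many LoRA adapters makes adapter placement a caching problem: hot adapters should stay resident in GPU memory while cold ones are evicted. A toy LRU adapter cache under that framing (Chameleon's actual policy is adaptive and cost-aware, which plain LRU does not capture; `load_fn` is a hypothetical loader):

```python
from collections import OrderedDict

# Toy LRU cache for LoRA adapter weights. `load_fn` fetches an adapter
# from host memory or storage on a miss; capacity counts adapter slots.
class AdapterCache:
    def __init__(self, capacity, load_fn):
        self.capacity, self.load_fn = capacity, load_fn
        self.cache = OrderedDict()

    def get(self, adapter_id):
        if adapter_id in self.cache:
            self.cache.move_to_end(adapter_id)    # mark as recently used
            return self.cache[adapter_id]
        if len(self.cache) >= self.capacity:
            self.cache.popitem(last=False)        # evict least recently used
        self.cache[adapter_id] = self.load_fn(adapter_id)
        return self.cache[adapter_id]
```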