Large language model inference acceleration: A comprehensive hardware perspective
Large Language Models (LLMs) have demonstrated remarkable capabilities across various
fields, from natural language understanding to text generation. Compared to non-generative …
InstInfer: In-storage attention offloading for cost-effective long-context LLM inference
The widespread adoption of Large Language Models (LLMs) marks a significant milestone in
generative AI. Nevertheless, the increasing context length and batch size in offline LLM …
Duplex: A Device for Large Language Models with Mixture of Experts, Grouped Query Attention, and Continuous Batching
Large language models (LLMs) have emerged due to their capability to generate high-
quality content across diverse contexts. To reduce their explosively increasing demands for …
NeuPIMs: NPU-PIM heterogeneous acceleration for batched LLM inferencing
Modern transformer-based Large Language Models (LLMs) are constructed with a series of
decoder blocks. Each block comprises three key components: (1) QKV generation, (2) multi …
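The decoder-block anatomy this snippet names is the standard one; a minimal NumPy sketch of a single block may help make the three components concrete. All names, shapes, and sizes below are illustrative assumptions (and layer norms and residual connections are omitted for brevity), not details from the paper:

```python
# Minimal sketch of one transformer decoder block, assuming the standard
# anatomy the NeuPIMs snippet names: (1) QKV generation, (2) multi-head
# attention, (3) a position-wise feed-forward network. Layer norm and
# residuals are omitted; all shapes here are illustrative.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def decoder_block(x, Wq, Wk, Wv, Wo, W1, W2, n_heads):
    T, d = x.shape
    hd = d // n_heads                       # per-head dimension
    # (1) QKV generation: three GEMMs over the token embeddings.
    q, k, v = x @ Wq, x @ Wk, x @ Wv        # each (T, d)
    # Split into heads: (n_heads, T, hd).
    q, k, v = (m.reshape(T, n_heads, hd).transpose(1, 0, 2) for m in (q, k, v))
    # (2) Multi-head attention with a causal mask (decoder-only model).
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(hd)      # (heads, T, T)
    scores += np.triu(np.full((T, T), -1e9), k=1)        # causal mask
    attn = softmax(scores) @ v                           # (heads, T, hd)
    attn = attn.transpose(1, 0, 2).reshape(T, d) @ Wo    # merge heads
    # (3) Feed-forward network applied position-wise (ReLU variant).
    return np.maximum(attn @ W1, 0) @ W2

rng = np.random.default_rng(0)
d, T, h = 64, 8, 4
ws = [rng.standard_normal((d, d)) * 0.1 for _ in range(4)]
W1 = rng.standard_normal((d, 4 * d)) * 0.1
W2 = rng.standard_normal((4 * d, d)) * 0.1
out = decoder_block(rng.standard_normal((T, d)), *ws, W1, W2, h)
print(out.shape)  # (8, 64)
```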
An LPDDR-based CXL-PNM Platform for TCO-efficient Inference of Transformer-based Large Language Models
Transformer-based large language models (LLMs) such as Generative Pre-trained
Transformer (GPT) have become popular due to their remarkable performance across …
AttAcc! Unleashing the Power of PIM for Batched Transformer-based Generative Model Inference
The Transformer-based generative model (TbGM), comprising summarization (Sum) and
generation (Gen) stages, has demonstrated unprecedented generative performance across …
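The Sum/Gen split here is the usual prefill/decode distinction: summarization processes the whole prompt in one parallel pass and fills a KV cache, while generation emits one token at a time, each step re-reading that growing cache. A toy sketch of the two stages (the `attend` helper and all sizes are assumptions for illustration, not the paper's formulation):

```python
# Toy illustration of the two TbGM stages, reading Sum/Gen as
# prefill/decode: Sum consumes the whole prompt in one pass and
# materializes the KV cache; Gen is strictly sequential and re-reads
# the entire (growing) cache at every step, which is memory-bound.
import numpy as np

rng = np.random.default_rng(0)
d = 16

def attend(q, K, V):
    # Single-head scaled dot-product attention over the cached K/V.
    w = np.exp(q @ K.T / np.sqrt(d))
    return (w / w.sum()) @ V

# Summarization (Sum / prefill): all prompt tokens at once.
prompt = rng.standard_normal((10, d))        # 10 illustrative prompt tokens
K_cache, V_cache = prompt.copy(), prompt.copy()

# Generation (Gen / decode): one token per step, cache grows each step.
tok = prompt[-1]
for step in range(5):
    tok = attend(tok, K_cache, V_cache)      # reads the whole cache
    K_cache = np.vstack([K_cache, tok])
    V_cache = np.vstack([V_cache, tok])
print(K_cache.shape)  # (15, 16): 10 prompt + 5 generated entries
```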
From Information Overload to Lucidity: A Survey on Leveraging GPTs for Systematic Summarization of Medical and Biomedical Artifacts
B Palanisamy, A Chakrabarti, A Singh, V Hassija… - IEEE …, 2024 - ieeexplore.ieee.org
In medical research, the rapid proliferation of condition-specific studies has led to an
information overload, making it challenging for researchers and practitioners to stay abreast …
PIM-MMU: A Memory Management Unit for Accelerating Data Transfers in Commercial PIM Systems
Processing-in-memory (PIM) has emerged as a promising solution for accelerating memory-
intensive workloads, as it provides high memory bandwidth to the processing units. This …
SpecPIM: Accelerating Speculative Inference on PIM-Enabled System via Architecture-Dataflow Co-Exploration
Generative large language models' (LLMs) inference suffers from inefficiency because of the
token dependency introduced by autoregressive decoding. Recently, speculative inference has …
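Speculative inference attacks that token dependency by letting a cheap draft model propose several tokens ahead, which the expensive target model then checks in a single parallel pass, accepting the longest agreeing prefix. A greedy toy version, in which both "models" are deterministic stand-in functions (an assumption for illustration, not SpecPIM's actual scheme):

```python
# Toy greedy speculative decoding: a cheap draft model guesses k tokens
# ahead, the target model scores all prefix positions in one "parallel"
# pass, and the longest prefix on which they agree is accepted. Both
# models below are deterministic stand-ins, not real networks.

def draft_next(ctx):
    return (ctx[-1] + 1) % 7          # cheap, sometimes-wrong guesser

def target_next(ctx):
    return (ctx[-1] + 1) % 5          # the expensive "ground truth"

def speculative_step(ctx, k=4):
    # Draft k tokens autoregressively with the cheap model.
    guesses = []
    for _ in range(k):
        guesses.append(draft_next(ctx + guesses))
    # One target pass: score every prefix position at once.
    checks = [target_next(ctx + guesses[:i]) for i in range(k)]
    # Accept matching guesses; on the first mismatch, take the target's
    # token instead, so every step makes at least one token of progress.
    accepted = []
    for g, t in zip(guesses, checks):
        if g != t:
            accepted.append(t)
            break
        accepted.append(g)
    return ctx + accepted

ctx = [0]
for _ in range(4):
    ctx = speculative_step(ctx)
print(ctx)  # several tokens may be accepted per target invocation
```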
Chameleon: Adaptive Caching and Scheduling for Many-Adapter LLM Inference Environments
The widespread adoption of LLMs has driven an exponential rise in their deployment,
imposing substantial demands on inference clusters. These clusters must handle numerous …