SqueezeLLM: Dense-and-sparse quantization

S Kim, C Hooper, A Gholami, Z Dong, X Li… - arXiv preprint arXiv …, 2023 - arxiv.org
Generative Large Language Models (LLMs) have demonstrated remarkable results for a
wide range of tasks. However, deploying these models for inference has been a significant …
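
For illustration, a minimal NumPy sketch of the dense-and-sparse decomposition named in the title: a small fraction of large-magnitude outlier weights is kept in a full-precision sparse matrix, and the remaining narrow-range dense part is quantized to low bit width. This is a hedged sketch of the general idea using uniform quantization; SqueezeLLM itself uses non-uniform, sensitivity-weighted quantization, and every name below is hypothetical.

```python
# Sketch only: dense-and-sparse weight split (uniform quantization,
# not SqueezeLLM's actual non-uniform scheme).
import numpy as np

def dense_and_sparse_split(W, outlier_frac=0.005, n_bits=4):
    """Split W into a low-bit dense part plus a full-precision sparse outlier part."""
    k = max(1, int(W.size * outlier_frac))
    thresh = np.partition(np.abs(W).ravel(), -k)[-k]   # k-th largest magnitude
    outlier_mask = np.abs(W) >= thresh
    sparse = np.where(outlier_mask, W, 0.0)            # outliers kept in full precision
    dense = np.where(outlier_mask, 0.0, W)             # outlier-free, narrow range
    scale = np.abs(dense).max() / (2 ** (n_bits - 1) - 1)
    q = np.round(dense / scale).astype(np.int8)        # uniform low-bit codes
    return q, scale, sparse

W = np.random.randn(128, 128).astype(np.float32)
q, scale, sparse = dense_and_sparse_split(W)
W_hat = q.astype(np.float32) * scale + sparse
print("max reconstruction error:", np.abs(W - W_hat).max())
```

Removing the outliers first is what makes the low-bit dense range tight; the sparse part stays cheap because it holds only a fraction of a percent of the entries.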

Two-Dimensional Materials for Brain-Inspired Computing Hardware

S Hadke, MA Kang, VK Sangwan… - Chemical Reviews, 2025 - ACS Publications
Recent breakthroughs in brain-inspired computing promise to address a wide range of
problems from security to healthcare. However, the current strategy of implementing artificial …

BiLLM: Pushing the limit of post-training quantization for LLMs

W Huang, Y Liu, H Qin, Y Li, S Zhang, X Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
Pretrained large language models (LLMs) exhibit exceptional general language processing
capabilities but come with significant demands on memory and computational resources. As …
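
As a point of reference for what pushing quantization toward 1 bit means, here is the classical scaled-sign binarization: with B = sign(W), the scale α = mean|W| minimizes ‖W − αB‖_F. This is not BiLLM's algorithm, which additionally separates salient weights and uses residual approximation; it is only a baseline sketch, and the helper name is hypothetical.

```python
# Sketch only: classical scaled-sign binarization (baseline, not BiLLM).
import numpy as np

def binarize(W):
    B = np.sign(W)
    B[B == 0] = 1.0           # keep codes strictly in {-1, +1}
    alpha = np.abs(W).mean()  # optimal Frobenius-norm scale for sign codes
    return B.astype(np.int8), alpha

W = np.random.randn(64, 64).astype(np.float32)
B, alpha = binarize(W)
W_hat = alpha * B.astype(np.float32)
print("per-weight MSE:", np.mean((W - W_hat) ** 2))
```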

Survey of CPU and memory simulators in computer architecture: A comprehensive analysis including compiler integration and emerging technology applications

I Hwang, J Lee, H Kang, G Lee, H Kim - Simulation Modelling Practice and …, 2024 - Elsevier
In computer architecture studies, simulators are crucial for design verification, reducing
research and development time and ensuring the high accuracy of verification results …

Fast matrix multiplications for lookup table-quantized LLMs

H Guo, W Brandon, R Cholakov… - arXiv preprint arXiv …, 2024 - arxiv.org
The deployment of large language models (LLMs) is often constrained by memory
bandwidth, where the primary bottleneck is the cost of transferring model parameters from …
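
A minimal sketch of what "lookup table-quantized" means here: 4-bit codes index a 16-entry table of centroids, and the dequantized weights feed an ordinary GEMM. The paper's contribution is a fused GPU kernel that avoids materializing the dequantized matrix; the NumPy version below only illustrates the data layout and is entirely hypothetical.

```python
# Sketch only: LUT dequantization followed by a plain matmul.
import numpy as np

rng = np.random.default_rng(0)
lut = np.sort(rng.standard_normal(16)).astype(np.float32)   # 16 centroids for 4-bit codes
codes = rng.integers(0, 16, size=(256, 256), dtype=np.uint8)

x = rng.standard_normal((1, 256)).astype(np.float32)
W_hat = lut[codes]    # table lookup recovers approximate weights
y = x @ W_hat.T       # standard GEMM on the dequantized matrix
print(y.shape)        # (1, 256)
```

Storing 4-bit codes instead of 16-bit weights is what relieves the memory-bandwidth bottleneck the snippet describes.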

Fire-Flyer AI-HPC: A Cost-Effective Software-Hardware Co-Design for Deep Learning

W An, X Bi, G Chen, S Chen, C Deng… - … Conference for High …, 2024 - ieeexplore.ieee.org
The rapid progress in Deep Learning (DL) and Large Language Models (LLMs) has
exponentially increased the demand for computational power and bandwidth. This, combined …

MNEMOSENE: Tile Architecture and Simulator for Memristor-based Computation-in-memory

M Zahedi, MA Lebdeh, C Bengel, D Wouters… - ACM Journal on …, 2022 - dl.acm.org
In recent years, we have been witnessing a trend toward in-memory computing for future
generations of computers, which differs from the traditional von Neumann architecture in which …

Recent and upcoming developments in randomized numerical linear algebra for machine learning

M Dereziński, MW Mahoney - Proceedings of the 30th ACM SIGKDD …, 2024 - dl.acm.org
Large matrices arise in many machine learning and data analysis applications, including as
representations of datasets, graphs, model weights, and first and second-order derivatives …
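
One classical RandNLA primitive from the area this survey covers is sketch-and-solve least squares: compress a tall problem with a random sketching matrix and solve the smaller system. A dense Gaussian sketch is used below purely for clarity; practical methods use structured sketches (SRHT, sparse embeddings) so the sketching step itself is cheap.

```python
# Sketch-and-solve least squares with a Gaussian sketch (illustrative choice).
import numpy as np

rng = np.random.default_rng(0)
n, d, s = 10000, 50, 500          # tall problem, sketch size s << n
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.01 * rng.standard_normal(n)

S = rng.standard_normal((s, n)) / np.sqrt(s)   # random sketching matrix
x_sketch, *_ = np.linalg.lstsq(S @ A, S @ b, rcond=None)  # solve the compressed problem
x_exact, *_ = np.linalg.lstsq(A, b, rcond=None)
print("relative error:", np.linalg.norm(x_sketch - x_exact) / np.linalg.norm(x_exact))
```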

BBS: Bi-directional bit-level sparsity for deep learning acceleration

Y Chen, J Meng, J Seo… - 2024 57th IEEE/ACM …, 2024 - ieeexplore.ieee.org
Bit-level sparsity methods skip ineffectual zero-bit operations and are typically applicable
within bit-serial deep learning accelerators. This type of sparsity at the bit level is especially …
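
To make "skipping ineffectual zero-bit operations" concrete, here is a tiny software model of bit-serial multiply-accumulate: each set bit of the weight contributes one shift-and-add, and zero bits cost nothing. This illustrates plain bit-level sparsity only, not BBS's bi-directional encoding.

```python
# Toy model of bit-serial multiplication that skips zero bits.
def bit_serial_mul(x: int, w: int) -> int:
    """Multiply x by a non-negative integer weight, one add per set bit."""
    acc = 0
    bit = 0
    while w:
        if w & 1:                # effectual bit: one shift-and-add
            acc += x << bit
        w >>= 1                  # ineffectual zero bits cost nothing
        bit += 1
    return acc

assert bit_serial_mul(7, 0b10010110) == 7 * 0b10010110
print(bit_serial_mul(7, 150))    # 1050
```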

ShiftAddLLM: Accelerating Pretrained LLMs via Post-Training Multiplication-Less Reparameterization

H You, Y Guo, Y Fu, W Zhou, H Shi, X Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) have shown impressive performance on language tasks but
face challenges when deployed on resource-constrained devices due to their extensive …
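
A minimal sketch of the multiplication-less idea: round each weight to a signed power of two, so multiplying by it reduces to a bit shift (or an exponent add for floats). This is not ShiftAddLLM's post-training reparameterization algorithm, only the underlying primitive; the helper below is hypothetical.

```python
# Sketch only: round weights to signed powers of two so multiplies become shifts.
import numpy as np

def to_power_of_two(W):
    sign = np.sign(W)
    exp = np.round(np.log2(np.abs(W) + 1e-12)).astype(np.int32)
    return sign, exp             # w ~ sign * 2**exp; x * w is then a shift by exp

W = np.random.randn(4, 4).astype(np.float32)
sign, exp = to_power_of_two(W)
W_hat = sign * np.exp2(exp.astype(np.float32))
print("max abs error:", np.abs(W - W_hat).max())
```

The rounding error of a single power-of-two code is large, which is why such methods add correction terms or multiple shift-add stages in practice.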