SqueezeLLM: Dense-and-sparse quantization
Generative Large Language Models (LLMs) have demonstrated remarkable results for a
wide range of tasks. However, deploying these models for inference has been a significant …
Two-Dimensional Materials for Brain-Inspired Computing Hardware
Recent breakthroughs in brain-inspired computing promise to address a wide range of
problems from security to healthcare. However, the current strategy of implementing artificial …
BiLLM: Pushing the limit of post-training quantization for LLMs
Pretrained large language models (LLMs) exhibit exceptional general language processing
capabilities but come with significant demands on memory and computational resources. As …
Survey of CPU and memory simulators in computer architecture: A comprehensive analysis including compiler integration and emerging technology applications
In computer architecture studies, simulators are crucial for design verification, reducing
research and development time and ensuring the high accuracy of verification results …
Fast matrix multiplications for lookup table-quantized LLMs
H Guo, W Brandon, R Cholakov… - arXiv preprint arXiv …, 2024 - arxiv.org
The deployment of large language models (LLMs) is often constrained by memory
bandwidth, where the primary bottleneck is the cost of transferring model parameters from …
Fire-Flyer AI-HPC: A Cost-Effective Software-Hardware Co-Design for Deep Learning
W An, X Bi, G Chen, S Chen, C Deng… - … Conference for High …, 2024 - ieeexplore.ieee.org
The rapid progress in Deep Learning (DL) and Large Language Models (LLMs) has
exponentially increased demands of computational power and bandwidth. This, combined …
MNEMOSENE: Tile Architecture and Simulator for Memristor-based Computation-in-memory
M Zahedi, MA Lebdeh, C Bengel, D Wouters… - ACM Journal on …, 2022 - dl.acm.org
In recent years, we are witnessing a trend toward in-memory computing for future
generations of computers that differs from traditional von-Neumann architecture in which …
Recent and upcoming developments in randomized numerical linear algebra for machine learning
M Dereziński, MW Mahoney - Proceedings of the 30th ACM SIGKDD …, 2024 - dl.acm.org
Large matrices arise in many machine learning and data analysis applications, including as
representations of datasets, graphs, model weights, and first and second-order derivatives …
BBS: Bi-directional bit-level sparsity for deep learning acceleration
Bit-level sparsity methods skip ineffectual zero-bit operations and are typically applicable
within bit-serial deep learning accelerators. This type of sparsity at the bit-level is especially …
ShiftAddLLM: Accelerating Pretrained LLMs via Post-Training Multiplication-Less Reparameterization
Large language models (LLMs) have shown impressive performance on language tasks but
face challenges when deployed on resource-constrained devices due to their extensive …