A Survey of Design and Optimization for Systolic Array-based DNN Accelerators
In recent years, the systolic array has proven to be a successful architecture for
DNN hardware accelerators. However, the design of systolic arrays has also encountered many …
OliVe: Accelerating large language models via hardware-friendly outlier-victim pair quantization
Transformer-based large language models (LLMs) have achieved great success with the
growing model size. LLMs' size grows by 240× every two years, which outpaces the …
TorchSparse++: Efficient training and inference framework for sparse convolution on GPUs
Sparse convolution plays a pivotal role in emerging workloads, including point cloud
processing in AR/VR, autonomous driving, and graph understanding in recommendation …
Sparseloop: An analytical approach to sparse tensor accelerator modeling
In recent years, many accelerators have been proposed to efficiently process sparse tensor
algebra applications (e.g., sparse neural networks). However, these proposals are single …
SQuant: On-the-fly data-free quantization via diagonal Hessian approximation
Quantization of deep neural networks (DNN) has been proven effective for compressing and
accelerating DNN models. Data-free quantization (DFQ) is a promising approach without the …
TileSpGEMM: A tiled algorithm for parallel sparse general matrix-matrix multiplication on GPUs
Sparse general matrix-matrix multiplication (SpGEMM) is one of the most fundamental
building blocks in sparse linear solvers, graph processing frameworks and machine learning …
An overview of sparsity exploitation in CNNs for on-device intelligence with software-hardware cross-layer optimizations
This paper presents a detailed overview of sparsity exploitation in deep neural network
(DNN) accelerators. Despite the algorithmic advancements which drove DNNs to become …
HighLight: Efficient and flexible DNN acceleration with hierarchical structured sparsity
Due to complex interactions among various deep neural network (DNN) optimization
techniques, modern DNNs can have weights and activations that are dense or sparse with …
ANT: Exploiting adaptive numerical data type for low-bit deep neural network quantization
Quantization is a technique to reduce the computation and memory cost of DNN models,
which are getting increasingly large. Existing quantization solutions use fixed-point integer …
VELTAIR: Towards high-performance multi-tenant deep learning services via adaptive compilation and scheduling
Deep learning (DL) models have achieved great success in many application domains. As
such, many industrial companies such as Google and Facebook have acknowledged the …