Tiramisu: A polyhedral compiler for expressing fast and portable code
R Baghdadi, J Ray, MB Romdhane… - 2019 IEEE/ACM …, 2019 - ieeexplore.ieee.org
This paper introduces Tiramisu, a polyhedral framework designed to generate high
performance code for multiple platforms including multicores, GPUs, and distributed …
Vectorization for digital signal processors via equality saturation
Applications targeting digital signal processors (DSPs) benefit from fast implementations of
small linear algebra kernels. While existing auto-vectorizing compilers are effective at …
Empowering 1000 tokens/second on-device LLM prefilling with mllm-NPU
On-device large language models (LLMs) are catalyzing novel mobile applications such as
UI task automation and personalized email auto-reply, without giving away users' private …
Automatic generation of multi-objective polyhedral compiler transformations
To this day, polyhedral optimizing compilers use either extremely rigid (but accurate) cost
models, one-size-fits-all general-purpose heuristics, or auto-tuning strategies to traverse and …
Automatic Generation of Vectorizing Compilers for Customizable Digital Signal Processors
S Thomas, J Bornholt - Proceedings of the 29th ACM International …, 2024 - dl.acm.org
Embedded applications extract the best power-performance trade-off from digital signal
processors (DSPs) by making extensive use of vectorized execution. Rather than …
Vyasa: A high-performance vectorizing compiler for tensor convolutions on the Xilinx AI Engine
Xilinx's AI Engine is a recent industry example of energy-efficient vector processing that
includes novel support for 2D SIMD datapaths and shuffle interconnection network. The …
SlidingConv: Domain-Specific Description of Sliding Discrete Cosine Transform Convolution for Halide
Y Kanetaka, H Takagi, Y Maeda, N Fukushima - IEEE Access, 2023 - ieeexplore.ieee.org
Filtering is a fundamental tool in image processing, and its acceleration affects many
applications. Therefore, various algorithmic and hardware accelerations have been …
Programming tensor cores from an image processing DSL
Tensor Cores (TCUs) are specialized units first introduced by NVIDIA in the Volta
microarchitecture in order to accelerate matrix multiplications for deep learning and linear …
GCD2: A Globally Optimizing Compiler for Mapping DNNs to Mobile DSPs
More specialized chips are exploiting available high transistor density to expose parallelism
at a large scale with more intricate instruction sets. This paper reports on a compilation …
Restoring the Broken Covenant Between Compilers and Deep Learning Accelerators
Deep learning accelerators address the computational demands of Deep Neural Networks
(DNNs), departing from the traditional Von Neumann execution model. They leverage …