Tiramisu: A polyhedral compiler for expressing fast and portable code

R Baghdadi, J Ray, MB Romdhane… - 2019 IEEE/ACM …, 2019 - ieeexplore.ieee.org
This paper introduces Tiramisu, a polyhedral framework designed to generate high
performance code for multiple platforms including multicores, GPUs, and distributed …

Vectorization for digital signal processors via equality saturation

A VanHattum, R Nigam, VT Lee, J Bornholt… - Proceedings of the 26th …, 2021 - dl.acm.org
Applications targeting digital signal processors (DSPs) benefit from fast implementations of
small linear algebra kernels. While existing auto-vectorizing compilers are effective at …

Empowering 1000 tokens/second on-device LLM prefilling with mllm-NPU

D Xu, H Zhang, L Yang, R Liu, G Huang, M Xu… - arXiv preprint arXiv …, 2024 - arxiv.org
On-device large language models (LLMs) are catalyzing novel mobile applications such as
UI task automation and personalized email auto-reply, without giving away users' private …

Automatic generation of multi-objective polyhedral compiler transformations

L Chelini, T Gysi, T Grosser, M Kong… - Proceedings of the ACM …, 2020 - dl.acm.org
To this day, polyhedral optimizing compilers use either extremely rigid (but accurate) cost
models, one-size-fits-all general-purpose heuristics, or auto-tuning strategies to traverse and …

Automatic Generation of Vectorizing Compilers for Customizable Digital Signal Processors

S Thomas, J Bornholt - Proceedings of the 29th ACM International …, 2024 - dl.acm.org
Embedded applications extract the best power-performance trade-off from digital signal
processors (DSPs) by making extensive use of vectorized execution. Rather than …

Vyasa: A high-performance vectorizing compiler for tensor convolutions on the Xilinx AI Engine

P Chatarasi, S Neuendorffer, S Bayliss… - 2020 IEEE High …, 2020 - ieeexplore.ieee.org
Xilinx's AI Engine is a recent industry example of energy-efficient vector processing that
includes novel support for 2D SIMD datapaths and shuffle interconnection network. The …

SlidingConv: Domain-Specific Description of Sliding Discrete Cosine Transform Convolution for Halide

Y Kanetaka, H Takagi, Y Maeda, N Fukushima - IEEE Access, 2023 - ieeexplore.ieee.org
Filtering is a fundamental tool in image processing, and its acceleration affects many
applications. Therefore, various algorithmic and hardware accelerations have been …

Programming tensor cores from an image processing DSL

S Sioutas, S Stuijk, T Basten, L Somers… - Proceedings of the 23th …, 2020 - dl.acm.org
Tensor Cores (TCUs) are specialized units first introduced by NVIDIA in the Volta
microarchitecture in order to accelerate matrix multiplications for deep learning and linear …

GCD2: A Globally Optimizing Compiler for Mapping DNNs to Mobile DSPs

W Niu, J Guan, X Shen, Y Wang… - 2022 55th IEEE/ACM …, 2022 - ieeexplore.ieee.org
More specialized chips are exploiting available high transistor density to expose parallelism
at a large scale with more intricate instruction sets. This paper reports on a compilation …

Restoring the Broken Covenant Between Compilers and Deep Learning Accelerators

S Kinzer, S Ghodrati, R Mahapatra, BH Ahn… - arXiv preprint arXiv …, 2023 - arxiv.org
Deep learning accelerators address the computational demands of Deep Neural Networks
(DNNs), departing from the traditional Von Neumann execution model. They leverage …