Extending Halide to improve software development for imaging DSPs

R Baghdadi, J Ray, MB Romdhane… - 2019 IEEE/ACM …, 2019 - ieeexplore.ieee.org

This paper introduces Tiramisu, a polyhedral framework designed to generate high
performance code for multiple platforms including multicores, GPUs, and distributed …

Cited by 368 · Related articles · All 14 versions

Vectorization for digital signal processors via equality saturation

A VanHattum, R Nigam, VT Lee, J Bornholt… - Proceedings of the 26th …, 2021 - dl.acm.org

Applications targeting digital signal processors (DSPs) benefit from fast implementations of
small linear algebra kernels. While existing auto-vectorizing compilers are effective at …

Cited by 54 · Related articles · All 6 versions

Empowering 1000 tokens/second on-device LLM prefilling with mllm-NPU

D Xu, H Zhang, L Yang, R Liu, G Huang, M Xu… - arXiv preprint arXiv …, 2024 - arxiv.org

On-device large language models (LLMs) are catalyzing novel mobile applications such as
UI task automation and personalized email auto-reply, without giving away users' private …

Cited by 4 · Related articles · All 2 versions

Automatic generation of multi-objective polyhedral compiler transformations

L Chelini, T Gysi, T Grosser, M Kong… - Proceedings of the ACM …, 2020 - dl.acm.org

To this day, polyhedral optimizing compilers use either extremely rigid (but accurate) cost
models, one-size-fits-all general-purpose heuristics, or auto-tuning strategies to traverse and …

Cited by 17 · Related articles · All 20 versions

Automatic Generation of Vectorizing Compilers for Customizable Digital Signal Processors

S Thomas, J Bornholt - Proceedings of the 29th ACM International …, 2024 - dl.acm.org

Embedded applications extract the best power-performance trade-off from digital signal
processors (DSPs) by making extensive use of vectorized execution. Rather than …

Cited by 3 · Related articles · All 2 versions

Vyasa: A high-performance vectorizing compiler for tensor convolutions on the Xilinx AI Engine

P Chatarasi, S Neuendorffer, S Bayliss… - 2020 IEEE High …, 2020 - ieeexplore.ieee.org

Xilinx's AI Engine is a recent industry example of energy-efficient vector processing that
includes novel support for 2D SIMD datapaths and shuffle interconnection network. The …

Cited by 17 · Related articles · All 5 versions

SlidingConv: Domain-Specific Description of Sliding Discrete Cosine Transform Convolution for Halide

Y Kanetaka, H Takagi, Y Maeda, N Fukushima - IEEE Access, 2023 - ieeexplore.ieee.org

Filtering is a fundamental tool in image processing, and its acceleration affects many
applications. Therefore, various algorithmic and hardware accelerations have been …

Cited by 1 · Related articles · All 3 versions

Programming tensor cores from an image processing DSL

S Sioutas, S Stuijk, T Basten, L Somers… - Proceedings of the 23th …, 2020 - dl.acm.org

Tensor Cores (TCUs) are specialized units first introduced by NVIDIA in the Volta
microarchitecture in order to accelerate matrix multiplications for deep learning and linear …

Cited by 13 · Related articles · All 3 versions

GCD²: A Globally Optimizing Compiler for Mapping DNNs to Mobile DSPs

W Niu, J Guan, X Shen, Y Wang… - 2022 55th IEEE/ACM …, 2022 - ieeexplore.ieee.org

More specialized chips are exploiting available high transistor density to expose parallelism
at a large scale with more intricate instruction sets. This paper reports on a compilation …

Cited by 4 · Related articles · All 6 versions

Restoring the Broken Covenant Between Compilers and Deep Learning Accelerators

S Kinzer, S Ghodrati, R Mahapatra, BH Ahn… - arXiv preprint arXiv …, 2023 - arxiv.org

Deep learning accelerators address the computational demands of Deep Neural Networks
(DNNs), departing from the traditional Von Neumann execution model. They leverage …