Using non-canonical array layouts in dense matrix operations

C Ferry, T Yuki, S Derrien… - IEEE Transactions on …, 2022 - ieeexplore.ieee.org

Offloading compute-intensive kernels to hardware accelerators relies on the large degree of
parallelism offered by these platforms. However, the effective bandwidth of the memory …

Cited by 9 · Related articles · All 5 versions

Automatic program parallelization with a block data distribution

LR Gervich, EN Kravchenko, BY Steinberg… - Numerical Analysis and …, 2015 - Springer

This paper discusses several automated methods of acceleration of program operation. The
acceleration is achieved by parallelization and optimization of memory access. Optimization …

Cited by 13 · Related articles · All 4 versions

Accelerator-driven Data Arrangement to Minimize Transformers Run-time on Multi-core Architectures

A Amirshahi, G Ansaloni, D Atienza - arXiv preprint arXiv:2312.13000, 2023 - arxiv.org

The increasing complexity of transformer models in artificial intelligence expands their
computational costs, memory usage, and energy consumption. Hardware acceleration …

Cited by 1 · Related articles · All 7 versions

How OPS (optimizing parallelizing system) may be useful for clang

LR Gervich, SA Guda, DV Dubrov… - Proceedings of the 13th …, 2017 - dl.acm.org

In this work, the perspective of using Optimizing Parallelizing System (http://ops.rsu.ru/en/)
together with Clang compiler is considered. The converters from Clang intermediate …

Cited by 7 · Related articles · All 2 versions

[PDF] Automating the derivation of memory allocations for acceleration of polyhedral programs

C Ferry, S Rajopadhye, S Derrien, S Pasricha… - 2024 - api.mountainscholar.org

As processors' compute power keeps increasing, so do their demands in memory accesses:
some computations will require a higher bandwidth and exhibit regular memory access …

New data structures for matrices and specialized inner kernels: Low overhead for high performance

JR Herrero - International Conference on Parallel Processing and …, 2007 - Springer

Dense linear algebra codes are often expressed and coded in terms of BLAS calls. This
approach, however, achieves suboptimal performance due to the overheads associated to …

Cited by 9 · Related articles · All 11 versions

Vers des noyaux de calcul intensif pérennes [Towards sustainable compute-intensive kernels]

W Kirschenmann - 2012 - theses.hal.science

This thesis addresses the difficulties of developing multi-target codes, that is,
codes whose performance is portable across different hardware targets. We have …

Cited by 6 · Related articles · All 13 versions

[PDF] Exposing inner kernels and block storage for fast parallel dense linear algebra codes

JR Herrero - 2008 - researchgate.net

Efficient execution on processors with multiple cores requires the exploitation of parallelism
within the processor. For many dense linear algebra codes this, in turn, requires the efficient …

Cited by 2 · Related articles · All 4 versions

Hypermatrix oriented supernode amalgamation

JR Herrero, JJ Navarro - The Journal of Supercomputing, 2008 - Springer

In this paper, we introduce a supernode amalgamation algorithm which takes into account
the characteristics of a hypermatrix data structure. The resulting frontal tree is then used to …

Cited by 1 · Related articles · All 8 versions

Переразмещение матриц к блочному виду с минимизацией использования дополнительной памяти [Relayout of matrices to block form with minimal use of additional memory]

МВ Юрушкин, СГ Семионов - Известия высших учебных …, 2017 - cyberleninka.ru

A method is presented for converting matrix layout between row-major and block
representations. Row-major layout is used in the C programming language …

Cited by 1 · Related articles · All 2 versions