Increasing FPGA accelerators memory bandwidth with a burst-friendly memory layout

C Ferry, T Yuki, S Derrien… - IEEE Transactions on …, 2022 - ieeexplore.ieee.org
Offloading compute-intensive kernels to hardware accelerators relies on the large degree of
parallelism offered by these platforms. However, the effective bandwidth of the memory …

Automatic program parallelization with a block data distribution

LR Gervich, EN Kravchenko, BY Steinberg… - Numerical Analysis and …, 2015 - Springer
This paper discusses several automated methods of acceleration of program operation. The
acceleration is achieved by parallelization and optimization of memory access. Optimization …

Accelerator-driven Data Arrangement to Minimize Transformers Run-time on Multi-core Architectures

A Amirshahi, G Ansaloni, D Atienza - arXiv preprint arXiv:2312.13000, 2023 - arxiv.org
The increasing complexity of transformer models in artificial intelligence expands their
computational costs, memory usage, and energy consumption. Hardware acceleration …

How OPS (Optimizing Parallelizing System) may be useful for Clang

LR Gervich, SA Guda, DV Dubrov… - Proceedings of the 13th …, 2017 - dl.acm.org
In this work, the perspective of using Optimizing Parallelizing System (http://ops.rsu.ru/en/)
together with Clang compiler is considered. The converters from Clang intermediate …

Automating the derivation of memory allocations for acceleration of polyhedral programs

C Ferry, S Rajopadhye, S Derrien, S Pasricha… - 2024 - api.mountainscholar.org
As processors' compute power keeps increasing, so do their demands in memory accesses:
some computations will require a higher bandwidth and exhibit regular memory access …

New data structures for matrices and specialized inner kernels: Low overhead for high performance

JR Herrero - International Conference on Parallel Processing and …, 2007 - Springer
Dense linear algebra codes are often expressed and coded in terms of BLAS calls. This
approach, however, achieves suboptimal performance due to the overheads associated to …

Vers des noyaux de calcul intensif pérennes [Towards long-lasting high-performance computing kernels]

W Kirschenmann - 2012 - theses.hal.science
This thesis addresses the difficulties of developing multi-target codes, that is,
codes whose performance is portable across different hardware targets. We have …

Exposing inner kernels and block storage for fast parallel dense linear algebra codes

JR Herrero - 2008 - researchgate.net
Efficient execution on processors with multiple cores requires the exploitation of parallelism
within the processor. For many dense linear algebra codes this, in turn, requires the efficient …

Hypermatrix oriented supernode amalgamation

JR Herrero, JJ Navarro - The Journal of Supercomputing, 2008 - Springer
In this paper, we introduce a supernode amalgamation algorithm which takes into account
the characteristics of a hypermatrix data structure. The resulting frontal tree is then used to …

Переразмещение матриц к блочному виду с минимизацией использования дополнительной памяти [Rearranging matrices into block form with minimal use of additional memory]

МВ Юрушкин, СГ Семионов - Известия высших учебных …, 2017 - cyberleninka.ru
A method is presented for converting matrix layouts between row-major and block
representations. The row-major layout is used in the C programming language …