Increasing FPGA accelerators memory bandwidth with a burst-friendly memory layout
Offloading compute-intensive kernels to hardware accelerators relies on the large degree of
parallelism offered by these platforms. However, the effective bandwidth of the memory …
Automatic program parallelization with a block data distribution
LR Gervich, EN Kravchenko, BY Steinberg… - Numerical Analysis and …, 2015 - Springer
This paper discusses several automated methods of acceleration of program operation. The
acceleration is achieved by parallelization and optimization of memory access. Optimization …
Accelerator-driven Data Arrangement to Minimize Transformers Run-time on Multi-core Architectures
The increasing complexity of transformer models in artificial intelligence expands their
computational costs, memory usage, and energy consumption. Hardware acceleration …
How OPS (optimizing parallelizing system) may be useful for clang
LR Gervich, SA Guda, DV Dubrov… - Proceedings of the 13th …, 2017 - dl.acm.org
In this work, the perspective of using Optimizing Parallelizing System (http://ops.rsu.ru/en/)
together with Clang compiler is considered. The converters from Clang intermediate …
Automating the derivation of memory allocations for acceleration of polyhedral programs
As processors compute power keeps increasing, so do their demands in memory accesses:
some computations will require a higher bandwidth and exhibit regular memory access …
New data structures for matrices and specialized inner kernels: Low overhead for high performance
JR Herrero - International Conference on Parallel Processing and …, 2007 - Springer
Dense linear algebra codes are often expressed and coded in terms of BLAS calls. This
approach, however, achieves suboptimal performance due to the overheads associated to …
Vers des noyaux de calcul intensif pérennes [Towards sustainable compute-intensive kernels]
W Kirschenmann - 2012 - theses.hal.science
This thesis addresses the difficulties of developing multi-target codes, that is, codes
whose performance is portable across different hardware targets. We have …
Exposing inner kernels and block storage for fast parallel dense linear algebra codes
JR Herrero - 2008 - researchgate.net
Efficient execution on processors with multiple cores requires the exploitation of parallelism
within the processor. For many dense linear algebra codes this, in turn, requires the efficient …
Hypermatrix oriented supernode amalgamation
JR Herrero, JJ Navarro - The Journal of Supercomputing, 2008 - Springer
In this paper, we introduce a supernode amalgamation algorithm which takes into account
the characteristics of a hypermatrix data structure. The resulting frontal tree is then used to …
Переразмещение матриц к блочному виду с минимизацией использования дополнительной памяти [Relayout of matrices to block form with minimal use of additional memory]
МВ Юрушкин, СГ Семионов - Известия высших учебных …, 2017 - cyberleninka.ru
A method is presented for converting matrix storage between row-major and block
layouts. Row-major storage is used in the C programming language …