Measuring energy and power with PAPI

VM Weaver, M Johnson… - 2012 41st …, 2012 - ieeexplore.ieee.org
Energy and power consumption are becoming critical metrics in the design and usage of
high performance systems. We have extended the Performance API (PAPI) analysis library …

Design, optimization, and benchmarking of dense linear algebra algorithms on AMD GPUs

C Brown, A Abdelfattah, S Tomov… - 2020 IEEE High …, 2020 - ieeexplore.ieee.org
Dense linear algebra (DLA) has historically been in the vanguard of software that must be
adapted first to hardware changes. This is because DLA is both critical to the accuracy and …

LU factorization of small matrices: Accelerating batched DGETRF on the GPU

T Dong, A Haidar, P Luszczek, JA Harris… - 2014 IEEE Intl Conf …, 2014 - ieeexplore.ieee.org
Gaussian Elimination is commonly used to solve dense linear systems in scientific models.
In a large number of applications, a need arises to solve many small size problems, instead …

Taking advantage of hybrid systems for sparse direct solvers via task-based runtimes

X Lacoste, M Faverge, G Bosilca… - … Parallel & Distributed …, 2014 - ieeexplore.ieee.org
The ongoing hardware evolution exhibits an escalation in the number, as well as in the
heterogeneity, of computing resources. The pressure to maintain reasonable levels of …

Parallel programming models for dense linear algebra on heterogeneous systems

J Dongarra, M Abalenkovs, A Abdelfattah… - Supercomputing …, 2015 - superfri.susu.ru
We present a review of the current best practices in parallel programming models for dense
linear algebra (DLA) on heterogeneous architectures. We consider multicore CPUs, stand …

A framework for batched and GPU-resident factorization algorithms applied to block Householder transformations

A Haidar, TT Dong, S Tomov, P Luszczek… - … Conference, ISC High …, 2015 - Springer
As modern hardware keeps evolving, an increasingly effective approach to developing
energy efficient and high-performance solvers is to design them to work on many small size …

Tridiagonalization of a dense symmetric matrix on multiple GPUs and its application to symmetric eigenvalue problems

I Yamazaki, T Dong, R Solcà, S Tomov… - Concurrency and …, 2014 - Wiley Online Library
For software to fully exploit the computing power of emerging heterogeneous computers, not
only must the required computational kernels be optimized for the specific hardware …

A fast batched Cholesky factorization on a GPU

T Dong, A Haidar, S Tomov… - 2014 43rd International …, 2014 - ieeexplore.ieee.org
Currently, state of the art libraries, like MAGMA, focus on very large linear algebra problems,
while solving many small independent problems, which is usually referred to as batched …

Scheduling and memory optimizations for sparse direct solver on multi-core/multi-GPU cluster systems

X Lacoste - 2015 - theses.hal.science
The ongoing hardware evolution exhibits an escalation in the number, as well as in the
heterogeneity, of computing resources. The pressure to maintain reasonable levels of …

Reducing the amount of out‐of‐core data access for GPU‐accelerated randomized SVD

Y Lu, I Yamazaki, F Ino, Y Matsushita… - Concurrency and …, 2020 - Wiley Online Library
We propose two acceleration methods, namely, Fused and Gram, for reducing out‐of‐core
data access when performing randomized singular value decomposition (RSVD) on …