Measuring energy and power with PAPI
VM Weaver, M Johnson… - 2012 41st …, 2012 - ieeexplore.ieee.org
Energy and power consumption are becoming critical metrics in the design and usage of
high performance systems. We have extended the Performance API (PAPI) analysis library …
Design, optimization, and benchmarking of dense linear algebra algorithms on AMD GPUs
C Brown, A Abdelfattah, S Tomov… - 2020 IEEE High …, 2020 - ieeexplore.ieee.org
Dense linear algebra (DLA) has historically been in the vanguard of software that must be
adapted first to hardware changes. This is because DLA is both critical to the accuracy and …
LU factorization of small matrices: Accelerating batched DGETRF on the GPU
Gaussian Elimination is commonly used to solve dense linear systems in scientific models.
In a large number of applications, a need arises to solve many small size problems, instead …
Taking advantage of hybrid systems for sparse direct solvers via task-based runtimes
The ongoing hardware evolution exhibits an escalation in the number, as well as in the
heterogeneity, of computing resources. The pressure to maintain reasonable levels of …
Parallel programming models for dense linear algebra on heterogeneous systems
We present a review of the current best practices in parallel programming models for dense
linear algebra (DLA) on heterogeneous architectures. We consider multicore CPUs, stand …
A framework for batched and GPU-resident factorization algorithms applied to block householder transformations
As modern hardware keeps evolving, an increasingly effective approach to developing
energy efficient and high-performance solvers is to design them to work on many small size …
Tridiagonalization of a dense symmetric matrix on multiple GPUs and its application to symmetric eigenvalue problems
For software to fully exploit the computing power of emerging heterogeneous computers, not
only must the required computational kernels be optimized for the specific hardware …
A fast batched Cholesky factorization on a GPU
Currently, state of the art libraries, like MAGMA, focus on very large linear algebra problems,
while solving many small independent problems, which is usually referred to as batched …
Scheduling and memory optimizations for sparse direct solver on multi-core/multi-GPU cluster systems
X Lacoste - 2015 - theses.hal.science
The ongoing hardware evolution exhibits an escalation in the number, as well as in the
heterogeneity, of computing resources. The pressure to maintain reasonable levels of …
Reducing the amount of out‐of‐core data access for GPU‐accelerated randomized SVD
We propose two acceleration methods, namely, Fused and Gram, for reducing out‐of‐core
data access when performing randomized singular value decomposition (RSVD) on …