Performance, design, and autotuning of batched GEMM for GPUs

A Abdelfattah, A Haidar, S Tomov… - … Conference, ISC High …, 2016 - Springer
The general matrix-matrix multiplication (GEMM) is the most important numerical kernel in
dense linear algebra, and is the key component for obtaining high performance in most …

The design and performance of batched BLAS on modern high-performance computing systems

J Dongarra, S Hammarling, NJ Higham… - Procedia Computer …, 2017 - Elsevier
A current trend in high-performance computing is to decompose a large linear algebra
problem into batches containing thousands of smaller problems that can be solved …

MiniApps derived from production HPC applications using multiple programming models

OEB Messer, E D'Azevedo, J Hill… - … Journal of High …, 2018 - journals.sagepub.com
We have developed a set of reduced, proxy applications (“MiniApps”) based on large-scale
application codes supported at the Oak Ridge Leadership Computing Facility (OLCF). The …

RETRACTED: Batched matrix computations on hardware accelerators based on GPUs

A Haidar, T Dong, P Luszczek… - … Journal of High …, 2015 - journals.sagepub.com
Scientific applications require solvers that work on many small-size problems that are
independent of each other. At the same time, the high-end hardware evolves rapidly and …

A set of batched basic linear algebra subprograms and LAPACK routines

A Abdelfattah, T Costa, J Dongarra, M Gates… - ACM Transactions on …, 2021 - dl.acm.org
This article describes a standard API for a set of Batched Basic Linear Algebra Subprograms
(Batched BLAS or BBLAS). The focus is on many independent BLAS operations on small …

LU factorization of small matrices: Accelerating batched DGETRF on the GPU

T Dong, A Haidar, P Luszczek, JA Harris… - 2014 IEEE Intl Conf …, 2014 - ieeexplore.ieee.org
Gaussian elimination is commonly used to solve dense linear systems in scientific models.
In a large number of applications, a need arises to solve many small-size problems, instead …

Parallel programming models for dense linear algebra on heterogeneous systems

J Dongarra, M Abalenkovs, A Abdelfattah… - Supercomputing …, 2015 - superfri.susu.ru
We present a review of the current best practices in parallel programming models for dense
linear algebra (DLA) on heterogeneous architectures. We consider multicore CPUs, stand …

Addressing irregular patterns of matrix computations on GPUs and their impact on applications powered by sparse direct solvers

A Abdelfattah, P Ghysels, W Boukaram… - … Conference for High …, 2022 - ieeexplore.ieee.org
Many scientific applications rely on sparse direct solvers for their numerical robustness.
However, performance optimization for these solvers remains a challenging task, especially …

A framework for batched and GPU-resident factorization algorithms applied to block householder transformations

A Haidar, TT Dong, S Tomov, P Luszczek… - … Conference, ISC High …, 2015 - Springer
As modern hardware keeps evolving, an increasingly effective approach to developing
energy-efficient and high-performance solvers is to design them to work on many small-size …

Batch QR factorization on GPUs: Design, optimization, and tuning

A Abdelfattah, S Tomov, J Dongarra - International Conference on …, 2022 - Springer
QR factorization of dense matrices is a ubiquitous tool in high performance computing
(HPC). From solving linear systems and least squares problems to eigenvalue problems …