Performance, design, and autotuning of batched GEMM for GPUs
The general matrix-matrix multiplication (GEMM) is the most important numerical kernel in
dense linear algebra, and is the key component for obtaining high performance in most …
The design and performance of batched BLAS on modern high-performance computing systems
A current trend in high-performance computing is to decompose a large linear algebra
problem into batches containing thousands of smaller problems that can be solved …
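The batching idea these entries describe, decomposing one large problem into many small, independent ones, can be sketched in plain Python. This is a loop-based illustration with hypothetical function names; a real batched BLAS implementation would dispatch all problems to the accelerator concurrently (e.g. one thread block per problem) rather than iterating:

```python
def gemm(A, B):
    """Plain matrix-matrix product C = A * B for lists of lists."""
    rows, inner, cols = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(inner))
             for j in range(cols)] for i in range(rows)]

def batched_gemm(As, Bs):
    """A batch of independent small GEMMs.

    Each problem is independent of the others, so they can all be
    solved in parallel; here we simply loop for clarity.
    """
    return [gemm(A, B) for A, B in zip(As, Bs)]

# A batch of two 2x2 problems.
As = [[[1, 2], [3, 4]], [[1, 0], [0, 1]]]
Bs = [[[5, 6], [7, 8]], [[2, 3], [4, 5]]]
Cs = batched_gemm(As, Bs)
```

The independence of the problems is the key property: it removes all synchronization between batch entries, which is what makes the GPU mapping effective.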
MiniApps derived from production HPC applications using multiple programming models
OEB Messer, E D'Azevedo, J Hill… - … Journal of High …, 2018 - journals.sagepub.com
We have developed a set of reduced, proxy applications (“MiniApps”) based on large-scale
application codes supported at the Oak Ridge Leadership Computing Facility (OLCF). The …
RETRACTED: Batched matrix computations on hardware accelerators based on GPUs
Scientific applications require solvers that work on many small-size problems that are
independent of each other. At the same time, high-end hardware evolves rapidly and …
A set of batched basic linear algebra subprograms and LAPACK routines
This article describes a standard API for a set of Batched Basic Linear Algebra Subprograms
(Batched BLAS or BBLAS). The focus is on many independent BLAS operations on small …
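Batched BLAS interfaces of this kind typically take arrays of matrices plus BLAS-style sizing parameters, with one call covering the whole batch. The following pure-Python sketch mimics that calling style for a fixed-size double-precision batch; the name `dgemm_batch` and its exact parameter order are illustrative, not the standard's:

```python
def dgemm_batch(m, n, k, alpha, A_batch, lda, B_batch, ldb,
                beta, C_batch, ldc, batch_count):
    """Fixed-size batched GEMM: C_i = alpha * A_i * B_i + beta * C_i.

    Each matrix is a flat column-major buffer, and lda/ldb/ldc are
    leading dimensions in the usual BLAS sense. No transpose options
    are handled, to keep the sketch short.
    """
    for b in range(batch_count):
        A, B, C = A_batch[b], B_batch[b], C_batch[b]
        for j in range(n):
            for i in range(m):
                acc = sum(A[i + p * lda] * B[p + j * ldb]
                          for p in range(k))
                C[i + j * ldc] = alpha * acc + beta * C[i + j * ldc]

# One 2x2 problem in a batch of size 1, column-major storage.
A_batch = [[1.0, 3.0, 2.0, 4.0]]   # [[1, 2], [3, 4]]
B_batch = [[5.0, 7.0, 6.0, 8.0]]   # [[5, 6], [7, 8]]
C_batch = [[0.0] * 4]
dgemm_batch(2, 2, 2, 1.0, A_batch, 2, B_batch, 2, 0.0, C_batch, 2, 1)
```

Passing sizes and leading dimensions once for the whole batch (rather than per problem) is what distinguishes the fixed-size batched interface from simply calling GEMM in a loop.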
LU factorization of small matrices: Accelerating batched DGETRF on the GPU
Gaussian Elimination is commonly used to solve dense linear systems in scientific models.
In a large number of applications, a need arises to solve many small size problems, instead …
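The Gaussian elimination that DGETRF performs can be shown in an unblocked form on one small matrix. This sketch omits the partial pivoting that the real DGETRF uses, so it is illustrative only; in the batched setting, one such factorization would run per problem, all in parallel:

```python
def getrf_nopiv(A):
    """Unblocked LU factorization without pivoting.

    Overwrites the n x n matrix A (list of lists) in place so that
    the strictly lower part holds L (unit diagonal implied) and the
    upper part holds U. Real DGETRF adds partial pivoting for
    numerical stability.
    """
    n = len(A)
    for k in range(n):
        for i in range(k + 1, n):
            A[i][k] /= A[k][k]               # multiplier l_ik
            for j in range(k + 1, n):
                A[i][j] -= A[i][k] * A[k][j]  # trailing update
    return A

A = [[4.0, 3.0], [6.0, 3.0]]
getrf_nopiv(A)   # A now holds L\U: L = [[1, 0], [1.5, 1]], U = [[4, 3], [0, -1.5]]
```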
Parallel programming models for dense linear algebra on heterogeneous systems
We present a review of the current best practices in parallel programming models for dense
linear algebra (DLA) on heterogeneous architectures. We consider multicore CPUs, stand …
Addressing irregular patterns of matrix computations on GPUs and their impact on applications powered by sparse direct solvers
Many scientific applications rely on sparse direct solvers for their numerical robustness.
However, performance optimization for these solvers remains a challenging task, especially …
A framework for batched and GPU-resident factorization algorithms applied to block householder transformations
As modern hardware keeps evolving, an increasingly effective approach to developing
energy efficient and high-performance solvers is to design them to work on many small size …
Batch QR factorization on GPUs: Design, optimization, and tuning
QR factorization of dense matrices is a ubiquitous tool in high performance computing
(HPC). From solving linear systems and least squares problems to eigenvalue problems …
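For a sense of what a QR factorization computes, here is a modified Gram-Schmidt sketch in plain Python. This is a simpler algorithm than the blocked Householder approach the batch QR work tunes for GPUs, and is included only to show the A = QR decomposition itself:

```python
def mgs_qr(A):
    """QR via modified Gram-Schmidt: A = Q * R with Q orthonormal.

    A is m x n (list of rows, m >= n, full column rank assumed).
    Returns (Q, R) with Q m x n row-major and R n x n upper triangular.
    """
    m, n = len(A), len(A[0])
    V = [[A[i][j] for i in range(m)] for j in range(n)]  # columns of A
    R = [[0.0] * n for _ in range(n)]
    Qcols = []
    for j in range(n):
        R[j][j] = sum(x * x for x in V[j]) ** 0.5
        q = [x / R[j][j] for x in V[j]]
        Qcols.append(q)
        for k in range(j + 1, n):
            R[j][k] = sum(q[i] * V[k][i] for i in range(m))
            V[k] = [V[k][i] - R[j][k] * q[i] for i in range(m)]
    Q = [[Qcols[j][i] for j in range(n)] for i in range(m)]
    return Q, R

Q, R = mgs_qr([[3.0, 1.0], [4.0, 1.0]])
```

Householder-based QR is preferred in practice for its unconditional stability and its blocked formulation, which casts most work as GEMM, exactly the property the batched GPU designs above exploit.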