Performance engineering for real and complex tall & skinny matrix multiplication kernels on GPUs

D Ernst, G Hager, J Thies… - The International Journal …, 2021 - journals.sagepub.com
General matrix-matrix multiplications with double-precision real and complex entries
(DGEMM and ZGEMM) in vendor-supplied BLAS libraries are best optimized for square …

Using non-canonical array layouts in dense matrix operations

JR Herrero, JJ Navarro - International Workshop on Applied Parallel …, 2006 - Springer
We present two implementations of dense matrix multiplication based on two different
non-canonical array layouts: one based on a hypermatrix data structure (HM) where data …

[BOOK][B] A framework for efficient execution of matrix computations

JR Herrero Zaragoza - 2006 - upcommons.upc.edu
Matrix computations lie at the heart of most scientific computational tasks. The solution of
linear systems of equations is a very frequent operation in many fields in science …

New data structures for matrices and specialized inner kernels: Low overhead for high performance

JR Herrero - International Conference on Parallel Processing and …, 2007 - Springer
Dense linear algebra codes are often expressed and coded in terms of BLAS calls. This
approach, however, achieves suboptimal performance due to the overheads associated to …

Level-3 Cholesky factorization routines improve performance of many Cholesky algorithms

FG Gustavson, J Waśniewski, JJ Dongarra… - ACM Transactions on …, 2013 - dl.acm.org
Four routines called DPOTF3i, i= a, b, c, d, are presented. DPOTF3i are a novel type of
level-3 BLAS for use by BPF (Blocked Packed Format) Cholesky factorization and LAPACK …

[BOOK][B] Level-3 Cholesky Factorization Routines as Part of Many Cholesky Algorithms

FG Gustavson, J Wasniewski, JJ Dongarra, JR Herrero… - 2011 - academia.edu
Some Linear Algebra Libraries use Level-2 routines during the factorization part of any
Level-3 block factorization algorithm. We discuss four Level-3 routines called DPOTF3i, i= a …

[PDF][PDF] Exposing inner kernels and block storage for fast parallel dense linear algebra codes

JR Herrero - 2008 - researchgate.net
Efficient execution on processors with multiple cores requires the exploitation of parallelism
within the processor. For many dense linear algebra codes this, in turn, requires the efficient …

[PDF][PDF] Using nonlinear array layouts in dense matrix operations

JR Herrero, JJ Navarro - Workshop on State-of-the-Art in Scientific …, 2006 - academia.edu
Introduction: A bottom-up approach …

A square block format for symmetric band matrices

FG Gustavson, JR Herrero, E Morancho - International Conference on …, 2013 - Springer
This contribution describes a Square Block, SB, format for storing a banded symmetric
matrix. This is possible by rearranging “in place” LAPACK Band Layout to become a SB …

New level-3 BLAS kernels for Cholesky factorization

FG Gustavson, J Waśniewski, JR Herrero - Parallel Processing and …, 2012 - Springer
Some Linear Algebra Libraries use Level-2 routines during the factorization part of
any Level-3 block factorization algorithm. We discuss four Level-3 routines called DPOTF3 …