One-sided dense matrix factorizations on a multicore with multiple GPU accelerators

Measuring energy and power with PAPI

VM Weaver, M Johnson… - 2012 41st …, 2012 - ieeexplore.ieee.org

Energy and power consumption are becoming critical metrics in the design and usage of
high performance systems. We have extended the Performance API (PAPI) analysis library …

Cited by 269 Related articles All 12 versions

[PDF] nsf.gov

Design, optimization, and benchmarking of dense linear algebra algorithms on AMD GPUs

C Brown, A Abdelfattah, S Tomov… - 2020 IEEE High …, 2020 - ieeexplore.ieee.org

Dense linear algebra (DLA) has historically been in the vanguard of software that must be
adapted first to hardware changes. This is because DLA is both critical to the accuracy and …

Cited by 27 Related articles All 6 versions

[PDF] utk.edu

LU factorization of small matrices: Accelerating batched DGETRF on the GPU

T Dong, A Haidar, P Luszczek, JA Harris… - 2014 IEEE Intl Conf …, 2014 - ieeexplore.ieee.org

Gaussian Elimination is commonly used to solve dense linear systems in scientific models.
In a large number of applications, a need arises to solve many small size problems, instead …

Cited by 64 Related articles All 7 versions

[PDF] arxiv.org

Taking advantage of hybrid systems for sparse direct solvers via task-based runtimes

X Lacoste, M Faverge, G Bosilca… - … Parallel & Distributed …, 2014 - ieeexplore.ieee.org

The ongoing hardware evolution exhibits an escalation in the number, as well as in the
heterogeneity, of computing resources. The pressure to maintain reasonable levels of …

Cited by 71 Related articles All 22 versions

[PDF] susu.ru

Parallel programming models for dense linear algebra on heterogeneous systems

J Dongarra, M Abalenkovs, A Abdelfattah… - Supercomputing …, 2015 - superfri.susu.ru

We present a review of the current best practices in parallel programming models for dense
linear algebra (DLA) on heterogeneous architectures. We consider multicore CPUs, stand …

Cited by 62 Related articles All 16 versions

[PDF] researchgate.net

A framework for batched and GPU-resident factorization algorithms applied to block householder transformations

A Haidar, TT Dong, S Tomov, P Luszczek… - … Conference, ISC High …, 2015 - Springer

As modern hardware keeps evolving, an increasingly effective approach to developing
energy efficient and high-performance solvers is to design them to work on many small size …

Cited by 58 Related articles All 5 versions

[PDF] psu.edu

Tridiagonalization of a dense symmetric matrix on multiple GPUs and its application to symmetric eigenvalue problems

I Yamazaki, T Dong, R Solcà, S Tomov… - Concurrency and …, 2014 - Wiley Online Library

For software to fully exploit the computing power of emerging heterogeneous computers, not
only must the required computational kernels be optimized for the specific hardware …

Cited by 44 Related articles All 13 versions

[PDF] netlib.org

A fast batched Cholesky factorization on a GPU

T Dong, A Haidar, S Tomov… - 2014 43rd International …, 2014 - ieeexplore.ieee.org

Currently, state of the art libraries, like MAGMA, focus on very large linear algebra problems,
while solving many small independent problems, which is usually referred to as batched …

Cited by 40 Related articles All 8 versions

[PDF] hal.science

Scheduling and memory optimizations for sparse direct solver on multi-core/multi-GPU cluster systems

X Lacoste - 2015 - theses.hal.science

The ongoing hardware evolution exhibits an escalation in the number, as well as in the
heterogeneity, of computing resources. The pressure to maintain reasonable levels of …

Cited by 34 Related articles All 6 versions

[PDF] utk.edu

Reducing the amount of out‐of‐core data access for GPU‐accelerated randomized SVD

Y Lu, I Yamazaki, F Ino, Y Matsushita… - Concurrency and …, 2020 - Wiley Online Library

We propose two acceleration methods, namely, Fused and Gram, for reducing out‐of‐core
data access when performing randomized singular value decomposition (RSVD) on …

Cited by 14 Related articles All 4 versions