NVIDIA Tensor Core programmability, performance & precision
The NVIDIA Volta GPU microarchitecture introduces a specialized unit, called Tensor Core
that performs one matrix-multiply-and-accumulate on 4x4 matrices per clock cycle. The …
Performance evaluation of cuDNN convolution algorithms on NVIDIA Volta GPUs
M Jorda, P Valero-Lara, AJ Pena - IEEE Access, 2019 - ieeexplore.ieee.org
Convolutional neural networks (CNNs) have recently attracted considerable attention due to
their outstanding accuracy in applications, such as image recognition and natural language …
Contextual directed acyclic graphs
R Thompson, EV Bonilla… - … Conference on Artificial …, 2024 - proceedings.mlr.press
Estimating the structure of directed acyclic graphs (DAGs) from observational data remains a
significant challenge in machine learning. Most research in this area concentrates on …
A set of batched basic linear algebra subprograms and LAPACK routines
This article describes a standard API for a set of Batched Basic Linear Algebra Subprograms
(Batched BLAS or BBLAS). The focus is on many independent BLAS operations on small …
LaRIS: targeting portability and productivity for LAPACK codes on extreme heterogeneous systems by using IRIS
In keeping with the trend of heterogeneity in high-performance computing, hardware
manufacturers and vendors are developing new architectures and associated software …
Finch: Sparse and Structured Array Programming with Control Flow
From FORTRAN to NumPy, arrays have revolutionized how we express computation.
However, arrays in these, and almost all prominent systems, can only handle dense …
cuThomasBatch and cuThomasVBatch, CUDA routines to compute batch of tridiagonal systems on NVIDIA GPUs
P Valero‐Lara, I Martínez‐Pérez… - Concurrency and …, 2018 - Wiley Online Library
The solving of tridiagonal systems is one of the most computationally expensive parts in
many applications, so that multiple studies have explored the use of NVIDIA GPUs to …
Harnessing deep learning via a single building block
Deep learning (DL) is one of the most prominent branches of machine learning. Due to the
immense computational cost of DL workloads, industry and academia have developed DL …
Reproducible BLAS routines with tunable accuracy using Ozaki scheme for many-core architectures
D Mukunoki, T Ogita, K Ozaki - … 2019, Bialystok, Poland, September 8–11 …, 2020 - Springer
Generally, floating-point computations comprise rounding errors; the result may be
inaccurate and not identical (non-reproducible). Particularly, heterogeneous computing has …
Exploring the acceleration of Nekbone on reconfigurable architectures
N Brown - 2020 IEEE/ACM International Workshop on …, 2020 - ieeexplore.ieee.org
Hardware technological advances are struggling to match scientific ambition, and a key
question is how we can use the transistors that we already have more effectively. This is …