The design and performance of batched BLAS on modern high-performance computing systems

NVIDIA Tensor Core programmability, performance & precision

S Markidis, SW Der Chien, E Laure… - 2018 IEEE …, 2018 - ieeexplore.ieee.org

The NVIDIA Volta GPU microarchitecture introduces a specialized unit, called Tensor Core
that performs one matrix-multiply-and-accumulate on 4x4 matrices per clock cycle. The …

Cited by 516 · Related articles · All 8 versions

[PDF] ieee.org

Performance evaluation of cuDNN convolution algorithms on NVIDIA Volta GPUs

M Jorda, P Valero-Lara, AJ Pena - IEEE Access, 2019 - ieeexplore.ieee.org

Convolutional neural networks (CNNs) have recently attracted considerable attention due to
their outstanding accuracy in applications, such as image recognition and natural language …

Cited by 81 · Related articles · All 4 versions

[PDF] mlr.press

Contextual directed acyclic graphs

R Thompson, EV Bonilla… - … Conference on Artificial …, 2024 - proceedings.mlr.press

Estimating the structure of directed acyclic graphs (DAGs) from observational data remains a
significant challenge in machine learning. Most research in this area concentrates on …

Cited by 3 · Related articles · All 3 versions

[PDF] acm.org

A set of batched basic linear algebra subprograms and LAPACK routines

A Abdelfattah, T Costa, J Dongarra, M Gates… - ACM Transactions on …, 2021 - dl.acm.org

This article describes a standard API for a set of Batched Basic Linear Algebra Subprograms
(Batched BLAS or BBLAS). The focus is on many independent BLAS operations on small …

Cited by 27 · Related articles · All 5 versions

[PDF] osti.gov

LaRIS: targeting portability and productivity for LAPACK codes on extreme heterogeneous systems by using IRIS

MAH Monil, NR Miniskar, FY Liu… - 2022 IEEE/ACM …, 2022 - ieeexplore.ieee.org

In keeping with the trend of heterogeneity in high-performance computing, hardware
manufacturers and vendors are developing new architectures and associated software …

Cited by 12 · Related articles · All 6 versions

[PDF] arxiv.org

Finch: Sparse and Structured Array Programming with Control Flow

W Ahrens, TF Collin, R Patel, K Deeds, C Hong… - arXiv preprint arXiv …, 2024 - arxiv.org

From FORTRAN to NumPy, arrays have revolutionized how we express computation.
However, arrays in these, and almost all prominent systems, can only handle dense …

Cited by 5 · Related articles · All 4 versions

[PDF] upc.edu

cuThomasBatch and cuThomasVBatch, CUDA routines to compute batch of tridiagonal systems on NVIDIA GPUs

P Valero‐Lara, I Martínez‐Pérez… - Concurrency and …, 2018 - Wiley Online Library

The solving of tridiagonal systems is one of the most computationally expensive parts in
many applications, so that multiple studies have explored the use of NVIDIA GPUs to …

Cited by 33 · Related articles · All 4 versions

Harnessing deep learning via a single building block

E Georganas, K Banerjee, D Kalamkar… - 2020 IEEE …, 2020 - ieeexplore.ieee.org

Deep learning (DL) is one of the most prominent branches of machine learning. Due to the
immense computational cost of DL workloads, industry and academia have developed DL …

Cited by 25 · Related articles · All 2 versions

Reproducible BLAS routines with tunable accuracy using Ozaki scheme for many-core architectures

D Mukunoki, T Ogita, K Ozaki - … 2019, Bialystok, Poland, September 8–11 …, 2020 - Springer

Generally, floating-point computations comprise rounding errors; the result may be
inaccurate and not identical (non-reproducible). Particularly, heterogeneous computing has …

Cited by 26 · Related articles · All 6 versions

[PDF] arxiv.org

Exploring the acceleration of Nekbone on reconfigurable architectures

N Brown - 2020 IEEE/ACM International Workshop on …, 2020 - ieeexplore.ieee.org

Hardware technological advances are struggling to match scientific ambition, and a key
question is how we can use the transistors that we already have more effectively. This is …

Cited by 21 · Related articles · All 8 versions