Efficient exascale discretizations: High-order finite element methods

T Kolev, P Fischer, M Min, J Dongarra… - … Journal of High …, 2021 - journals.sagepub.com
Efficient exploitation of exascale architectures requires rethinking the numerical
algorithms used in many large-scale applications. These architectures favor algorithms that …
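
The core matrix-free technique behind high-order finite elements on such architectures is sum factorization on tensor-product elements. A minimal NumPy sketch (my own illustration, not code from the paper; sizes and the basis matrix are made up) of why it wins:

```python
# Sum factorization: on a hex element, the 3D operator B (x) B (x) B is never
# formed; the 1D matrix B is applied along each axis instead.
import numpy as np

p, q = 4, 6                      # dofs and quadrature points per dimension
B = np.random.rand(q, p)         # hypothetical 1D basis-evaluation matrix
u = np.random.rand(p, p, p)      # element-local degrees of freedom

# Naive approach: build the q^3 x p^3 Kronecker product (O(p^6) work/storage).
B3 = np.einsum('ia,jb,kc->ijkabc', B, B, B).reshape(q**3, p**3)
v_naive = (B3 @ u.ravel()).reshape(q, q, q)

# Sum factorization: three 1D contractions, O(p^4) work, no large matrix.
v = np.einsum('ia,abc->ibc', B, u)
v = np.einsum('jb,ibc->ijc', B, v)
v = np.einsum('kc,ijc->ijk', B, v)

assert np.allclose(v, v_naive)
```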

DeepCPU: Serving RNN-based Deep Learning Models 10x Faster

M Zhang, S Rajbhandari, W Wang, Y He - 2018 USENIX Annual …, 2018 - usenix.org
Recurrent neural networks (RNNs) are an important class of deep learning (DL) models.
Existing DL frameworks deliver unsatisfactory performance for online serving: many RNN …
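
The serving bottleneck is easy to see from the structure of one recurrent step. A hedged NumPy sketch (not DeepCPU code; dimensions are illustrative) of a single LSTM time step at online batch size 1:

```python
# One LSTM step: runtime is dominated by the skinny matrix-vector product,
# which generic GEMM kernels tuned for large batches handle poorly.
import numpy as np

H, X = 256, 128                  # hidden size, input size
W = np.random.rand(4 * H, X + H) # fused weights for the i, f, g, o gates
b = np.random.rand(4 * H)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c):
    z = W @ np.concatenate([x, h]) + b   # the GEMV that dominates serving cost
    i, f, g, o = np.split(z, 4)
    c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h = sigmoid(o) * np.tanh(c)
    return h, c

h, c = np.zeros(H), np.zeros(H)
for x in np.random.rand(10, X):          # a 10-step input sequence, batch of 1
    h, c = lstm_step(x, h, c)
```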

CLBlast: A tuned OpenCL BLAS library

C Nugteren - Proceedings of the International Workshop on OpenCL, 2018 - dl.acm.org
This work introduces CLBlast, an open-source BLAS library providing optimized OpenCL
routines to accelerate dense linear algebra for a wide variety of devices. It is targeted at …
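
CLBlast itself exposes a C/OpenCL API; the NumPy reference below only pins down the semantics of the GEMM routine such libraries tune per device (the function name here is mine, not CLBlast's):

```python
# Reference semantics of BLAS GEMM: C <- alpha * op(A) @ op(B) + beta * C.
import numpy as np

def gemm_reference(alpha, A, B, beta, C, trans_a=False, trans_b=False):
    opA = A.T if trans_a else A
    opB = B.T if trans_b else B
    return alpha * (opA @ opB) + beta * C

m, n, k = 64, 64, 64
A, B, C = np.random.rand(m, k), np.random.rand(k, n), np.random.rand(m, n)
C = gemm_reference(2.0, A, B, 0.5, C)
```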

A data-centric approach to extreme-scale ab initio dissipative quantum transport simulations

AN Ziogas, T Ben-Nun, GI Fernández… - Proceedings of the …, 2019 - dl.acm.org
The computational efficiency of a state-of-the-art ab initio quantum transport (QT) solver,
capable of revealing the coupled electrothermal properties of atomically-resolved nano …
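
The "data-centric" view treats data movement, not flops, as the quantity to optimize. A hedged toy illustration of the idea (nothing here resembles the paper's QT solver; a dataflow framework would derive such transformations automatically):

```python
# Fusing two passes over a large array into one chunked pass performs the same
# arithmetic while roughly halving the memory traffic.
import numpy as np

x = np.random.rand(1 << 22)

# Two passes: writes and re-reads a full-size temporary.
tmp = np.sin(x)
y = tmp * tmp

# Fused, chunked pass: each cache-sized block is read once and written once.
y_fused = np.empty_like(x)
CHUNK = 1 << 16
for i in range(0, x.size, CHUNK):
    s = np.sin(x[i:i + CHUNK])
    y_fused[i:i + CHUNK] = s * s

assert np.allclose(y, y_fused)
```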

A benchmark set of highly-efficient CUDA and OpenCL kernels and its dynamic autotuning with Kernel Tuning Toolkit

F Petrovič, D Střelák, J Hozzová, J Ol'ha… - Future Generation …, 2020 - Elsevier
In recent years, the heterogeneity of both commodity and supercomputer hardware has
increased sharply. Accelerators, such as GPUs or Intel Xeon Phi co-processors, are often …
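
KTT is a C++ framework; the sketch below is only a language-agnostic rendering of the autotuning loop it automates: enumerate a tuning space, benchmark each configuration, keep the fastest. The tiled matmul stands in for a real CUDA/OpenCL kernel whose tile size would be a tuning parameter:

```python
import itertools, time
import numpy as np

A, B = np.random.rand(256, 256), np.random.rand(256, 256)

def tiled_matmul(A, B, tile):
    n = A.shape[0]
    C = np.zeros((n, n))
    for i in range(0, n, tile):
        for j in range(0, n, tile):
            for k in range(0, n, tile):
                C[i:i+tile, j:j+tile] += A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
    return C

space = {'tile': [16, 32, 64, 128]}          # the tuning space
best = None
for cfg in itertools.product(*space.values()):
    params = dict(zip(space.keys(), cfg))
    t0 = time.perf_counter()
    tiled_matmul(A, B, **params)
    dt = time.perf_counter() - t0
    if best is None or dt < best[1]:
        best = (params, dt)
print('best configuration:', best)
```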

Fast batched matrix multiplication for small sizes using half-precision arithmetic on GPUs

A Abdelfattah, S Tomov… - 2019 IEEE international …, 2019 - ieeexplore.ieee.org
Matrix multiplication (GEMM) is the most important operation in dense linear algebra.
Because it is a compute-bound operation that is rich in data reuse, many applications from …
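
A hedged NumPy sketch (not the paper's GPU kernels) of the numerical setup: batches of small GEMMs with FP16 inputs and FP32 accumulation, the rounding model used by GPU tensor cores. Stacked matmul gives the batched-GEMM semantics:

```python
import numpy as np

batch, m = 10_000, 8
A = np.random.rand(batch, m, m).astype(np.float16)   # inputs rounded to FP16
B = np.random.rand(batch, m, m).astype(np.float16)

# Upcasting before the product mimics FP16-input / FP32-accumulate hardware;
# one matmul over the whole stack is the batched-GEMM semantics.
C = np.matmul(A.astype(np.float32), B.astype(np.float32))

# FP64 reference on the same rounded inputs: at these small sizes the FP32
# accumulation adds almost no error beyond the initial FP16 rounding.
C_ref = np.matmul(A.astype(np.float64), B.astype(np.float64))
print('max accumulation error:', np.abs(C - C_ref).max())
```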

Design, optimization, and benchmarking of dense linear algebra algorithms on AMD GPUs

C Brown, A Abdelfattah, S Tomov… - 2020 IEEE High …, 2020 - ieeexplore.ieee.org
Dense linear algebra (DLA) has historically been in the vanguard of software that must be
adapted first to hardware changes. This is because DLA is both critical to the accuracy and …

A set of batched basic linear algebra subprograms and LAPACK routines

A Abdelfattah, T Costa, J Dongarra, M Gates… - ACM Transactions on …, 2021 - dl.acm.org
This article describes a standard API for a set of Batched Basic Linear Algebra Subprograms
(Batched BLAS or BBLAS). The focus is on many independent BLAS operations on small …
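
The article defines a C/Fortran API; the Python below is only a reference for what a Batched BLAS GEMM call computes: many independent small GEMMs, C_i <- alpha_i * A_i @ B_i + beta_i * C_i, exposed to the library as a single call so it can schedule them together on the hardware:

```python
import numpy as np

def gemm_batch_reference(alphas, As, Bs, betas, Cs):
    return [a * (A @ B) + b * C
            for a, A, B, b, C in zip(alphas, As, Bs, betas, Cs)]

batch, m = 1000, 16
As = [np.random.rand(m, m) for _ in range(batch)]
Bs = [np.random.rand(m, m) for _ in range(batch)]
Cs = [np.zeros((m, m)) for _ in range(batch)]
Cs = gemm_batch_reference([1.0] * batch, As, Bs, [0.0] * batch, Cs)
```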

Transmuter: Bridging the efficiency gap using memory and dataflow reconfiguration

S Pal, S Feng, D Park, S Kim, A Amarnath… - Proceedings of the …, 2020 - dl.acm.org
With the end of Dennard scaling and Moore's law, it is becoming increasingly difficult to build
hardware for emerging applications that meet power and performance targets, while …

Improving scalability of parallel CNN training by adjusting mini-batch size at run-time

S Lee, Q Kang, S Madireddy… - … Conference on Big …, 2019 - ieeexplore.ieee.org
Training a Convolutional Neural Network (CNN) is a computationally intensive task, requiring
efficient parallelization to shorten the execution time. Considering the ever-increasing size of …
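
A hedged sketch of the run-time idea (the plateau criterion and the linear learning-rate scaling here are illustrative stand-ins, not the paper's exact rule): grow the global mini-batch, and rescale the learning rate, once per-epoch progress flattens out:

```python
def adjust_batch_size(batch_size, lr, loss_history, max_batch=8192,
                      plateau_tol=0.01):
    """Double the mini-batch when the last epoch improved loss by < plateau_tol."""
    if len(loss_history) >= 2 and batch_size < max_batch:
        prev_loss, curr_loss = loss_history[-2], loss_history[-1]
        if (prev_loss - curr_loss) / prev_loss < plateau_tol:
            return batch_size * 2, lr * 2   # linear LR scaling with batch size
    return batch_size, lr

# Example: a nearly stalled epoch triggers a doubling.
bs, lr = adjust_batch_size(256, 0.1, [0.90, 0.895])
print(bs, lr)   # 512 0.2
```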