Efficient exascale discretizations: High-order finite element methods

T Kolev, P Fischer, M Min, J Dongarra… - … Journal of High …, 2021 - journals.sagepub.com
Efficient exploitation of exascale architectures requires rethinking the numerical
algorithms used in many large-scale applications. These architectures favor algorithms that …
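
The core matrix-free technique behind high-order finite elements on such architectures is sum factorization on tensor-product elements. A minimal NumPy sketch (my own illustration, not code from the paper; sizes and the basis matrix are made up) of why it wins:

```python
# Sum factorization: on a hex element, the 3D operator B (x) B (x) B is never
# formed; the 1D matrix B is applied along each axis instead.
import numpy as np

p, q = 4, 6                      # dofs and quadrature points per dimension
B = np.random.rand(q, p)         # hypothetical 1D basis-evaluation matrix
u = np.random.rand(p, p, p)      # element-local degrees of freedom

# Naive approach: build the q^3 x p^3 Kronecker product (O(p^6) work/storage).
B3 = np.einsum('ia,jb,kc->ijkabc', B, B, B).reshape(q**3, p**3)
v_naive = (B3 @ u.ravel()).reshape(q, q, q)

# Sum factorization: three 1D contractions, O(p^4) work, no large matrix.
v = np.einsum('ia,abc->ibc', B, u)
v = np.einsum('jb,ibc->ijc', B, v)
v = np.einsum('kc,ijc->ijk', B, v)

assert np.allclose(v, v_naive)
```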

DeepCPU: Serving RNN-based Deep Learning Models 10x Faster

M Zhang, S Rajbhandari, W Wang, Y He - 2018 USENIX Annual …, 2018 - usenix.org
Recurrent neural networks (RNNs) are an important class of deep learning (DL) models.
Existing DL frameworks deliver unsatisfactory performance for online serving: many RNN …
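
The serving bottleneck is easy to see from the structure of one recurrent step. A hedged NumPy sketch (not DeepCPU code; dimensions are illustrative) of a single LSTM time step at online batch size 1:

```python
# One LSTM step: runtime is dominated by the skinny matrix-vector product,
# which generic GEMM kernels tuned for large batches handle poorly.
import numpy as np

H, X = 256, 128                  # hidden size, input size
W = np.random.rand(4 * H, X + H) # fused weights for the i, f, g, o gates
b = np.random.rand(4 * H)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c):
    z = W @ np.concatenate([x, h]) + b   # the GEMV that dominates serving cost
    i, f, g, o = np.split(z, 4)
    c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h = sigmoid(o) * np.tanh(c)
    return h, c

h, c = np.zeros(H), np.zeros(H)
for x in np.random.rand(10, X):          # a 10-step input sequence, batch of 1
    h, c = lstm_step(x, h, c)
```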

CLBlast: A tuned OpenCL BLAS library

C Nugteren - Proceedings of the International Workshop on OpenCL, 2018 - dl.acm.org
This work introduces CLBlast, an open-source BLAS library providing optimized OpenCL
routines to accelerate dense linear algebra for a wide variety of devices. It is targeted at …
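
CLBlast itself exposes a C/OpenCL API; the NumPy reference below only pins down the semantics of the GEMM routine such libraries tune per device (the function name here is mine, not CLBlast's):

```python
# Reference semantics of BLAS GEMM: C <- alpha * op(A) @ op(B) + beta * C.
import numpy as np

def gemm_reference(alpha, A, B, beta, C, trans_a=False, trans_b=False):
    opA = A.T if trans_a else A
    opB = B.T if trans_b else B
    return alpha * (opA @ opB) + beta * C

m, n, k = 64, 64, 64
A, B, C = np.random.rand(m, k), np.random.rand(k, n), np.random.rand(m, n)
C = gemm_reference(2.0, A, B, 0.5, C)
```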

A data-centric approach to extreme-scale ab initio dissipative quantum transport simulations

AN Ziogas, T Ben-Nun, GI Fernández… - Proceedings of the …, 2019 - dl.acm.org
The computational efficiency of a state-of-the-art ab initio quantum transport (QT) solver,
capable of revealing the coupled electrothermal properties of atomically-resolved nano …
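
The "data-centric" view treats data movement, not flops, as the quantity to optimize. A hedged toy illustration of the idea (nothing here resembles the paper's QT solver; a dataflow framework would derive such transformations automatically):

```python
# Fusing two passes over a large array into one chunked pass performs the same
# arithmetic while roughly halving the memory traffic.
import numpy as np

x = np.random.rand(1 << 22)

# Two passes: writes and re-reads a full-size temporary.
tmp = np.sin(x)
y = tmp * tmp

# Fused, chunked pass: each cache-sized block is read once and written once.
y_fused = np.empty_like(x)
CHUNK = 1 << 16
for i in range(0, x.size, CHUNK):
    s = np.sin(x[i:i + CHUNK])
    y_fused[i:i + CHUNK] = s * s

assert np.allclose(y, y_fused)
```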

A benchmark set of highly-efficient CUDA and OpenCL kernels and its dynamic autotuning with Kernel Tuning Toolkit

F Petrovič, D Střelák, J Hozzová, J Ol'ha… - Future Generation …, 2020 - Elsevier
In recent years, the heterogeneity of both commodity and supercomputer hardware has
increased sharply. Accelerators, such as GPUs or Intel Xeon Phi co-processors, are often …
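
KTT is a C++ framework; the sketch below is only a language-agnostic rendering of the autotuning loop it automates: enumerate a tuning space, benchmark each configuration, keep the fastest. The tiled matmul stands in for a real CUDA/OpenCL kernel whose tile size would be a tuning parameter:

```python
import itertools, time
import numpy as np

A, B = np.random.rand(256, 256), np.random.rand(256, 256)

def tiled_matmul(A, B, tile):
    n = A.shape[0]
    C = np.zeros((n, n))
    for i in range(0, n, tile):
        for j in range(0, n, tile):
            for k in range(0, n, tile):
                C[i:i+tile, j:j+tile] += A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
    return C

space = {'tile': [16, 32, 64, 128]}          # the tuning space
best = None
for cfg in itertools.product(*space.values()):
    params = dict(zip(space.keys(), cfg))
    t0 = time.perf_counter()
    tiled_matmul(A, B, **params)
    dt = time.perf_counter() - t0
    if best is None or dt < best[1]:
        best = (params, dt)
print('best configuration:', best)
```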

Fast batched matrix multiplication for small sizes using half-precision arithmetic on GPUs

A Abdelfattah, S Tomov… - 2019 IEEE international …, 2019 - ieeexplore.ieee.org
Matrix multiplication (GEMM) is the most important operation in dense linear algebra.
Because it is a compute-bound operation that is rich in data reuse, many applications from …
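
A hedged NumPy sketch (not the paper's GPU kernels) of the numerical setup: batches of small GEMMs with FP16 inputs and FP32 accumulation, the rounding model used by GPU tensor cores. Stacked matmul gives the batched-GEMM semantics:

```python
import numpy as np

batch, m = 10_000, 8
A = np.random.rand(batch, m, m).astype(np.float16)   # inputs rounded to FP16
B = np.random.rand(batch, m, m).astype(np.float16)

# Upcasting before the product mimics FP16-input / FP32-accumulate hardware;
# one matmul over the whole stack is the batched-GEMM semantics.
C = np.matmul(A.astype(np.float32), B.astype(np.float32))

# FP64 reference on the same rounded inputs: at these small sizes the FP32
# accumulation adds almost no error beyond the initial FP16 rounding.
C_ref = np.matmul(A.astype(np.float64), B.astype(np.float64))
print('max accumulation error:', np.abs(C - C_ref).max())
```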

Design, optimization, and benchmarking of dense linear algebra algorithms on AMD GPUs

C Brown, A Abdelfattah, S Tomov… - 2020 IEEE High …, 2020 - ieeexplore.ieee.org
Dense linear algebra (DLA) has historically been in the vanguard of software that must be
adapted first to hardware changes. This is because DLA is both critical to the accuracy and …

A set of batched basic linear algebra subprograms and LAPACK routines

A Abdelfattah, T Costa, J Dongarra, M Gates… - ACM Transactions on …, 2021 - dl.acm.org
This article describes a standard API for a set of Batched Basic Linear Algebra Subprograms
(Batched BLAS or BBLAS). The focus is on many independent BLAS operations on small …
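
The article defines a C/Fortran API; the Python below is only a reference for what a Batched BLAS GEMM call computes: many independent small GEMMs, C_i <- alpha_i * A_i @ B_i + beta_i * C_i, exposed to the library as a single call so it can schedule them together on the hardware:

```python
import numpy as np

def gemm_batch_reference(alphas, As, Bs, betas, Cs):
    return [a * (A @ B) + b * C
            for a, A, B, b, C in zip(alphas, As, Bs, betas, Cs)]

batch, m = 1000, 16
As = [np.random.rand(m, m) for _ in range(batch)]
Bs = [np.random.rand(m, m) for _ in range(batch)]
Cs = [np.zeros((m, m)) for _ in range(batch)]
Cs = gemm_batch_reference([1.0] * batch, As, Bs, [0.0] * batch, Cs)
```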

Transmuter: Bridging the efficiency gap using memory and dataflow reconfiguration

S Pal, S Feng, D Park, S Kim, A Amarnath… - Proceedings of the …, 2020 - dl.acm.org
With the end of Dennard scaling and Moore's law, it is becoming increasingly difficult to build
hardware for emerging applications that meet power and performance targets, while …

Improving scalability of parallel CNN training by adjusting mini-batch size at run-time

S Lee, Q Kang, S Madireddy… - … Conference on Big …, 2019 - ieeexplore.ieee.org
Training a Convolutional Neural Network (CNN) is a computationally intensive task, requiring
efficient parallelization to shorten the execution time. Considering the ever-increasing size of …
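
A hedged sketch of the run-time idea (the plateau criterion and the linear learning-rate scaling here are illustrative stand-ins, not the paper's exact rule): grow the global mini-batch, and rescale the learning rate, once per-epoch progress flattens out:

```python
def adjust_batch_size(batch_size, lr, loss_history, max_batch=8192,
                      plateau_tol=0.01):
    """Double the mini-batch when the last epoch improved loss by < plateau_tol."""
    if len(loss_history) >= 2 and batch_size < max_batch:
        prev_loss, curr_loss = loss_history[-2], loss_history[-1]
        if (prev_loss - curr_loss) / prev_loss < plateau_tol:
            return batch_size * 2, lr * 2   # linear LR scaling with batch size
    return batch_size, lr

# Example: a nearly stalled epoch triggers a doubling.
bs, lr = adjust_batch_size(256, 0.1, [0.90, 0.895])
print(bs, lr)   # 512 0.2
```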