A survey of techniques for optimizing deep learning on GPUs

S Mittal, S Vaishay - Journal of Systems Architecture, 2019 - Elsevier
The rise of deep learning (DL) has been fuelled by improvements in accelerators. Due to
its unique features, the GPU remains the most widely used accelerator for DL …

Automated design of deep neural networks: A survey and unified taxonomy

EG Talbi - ACM Computing Surveys (CSUR), 2021 - dl.acm.org
In recent years, research on applying optimization approaches to the automatic design of
deep neural networks has become increasingly popular. Although various approaches have …

Pipe-SGD: A decentralized pipelined SGD framework for distributed deep net training

Y Li, M Yu, S Li, S Avestimehr… - Advances in Neural …, 2018 - proceedings.neurips.cc
Distributed training of deep nets is an important technique for addressing some of the
present-day computing challenges such as memory consumption and computational demands. Classical …

Clairvoyant prefetching for distributed machine learning I/O

N Dryden, R Böhringer, T Ben-Nun… - Proceedings of the …, 2021 - dl.acm.org
I/O is emerging as a major bottleneck for machine learning training, especially in distributed
environments. Indeed, at large scale, I/O takes as much as 85% of training time. Addressing …

Toward performance-portable PETSc for GPU-based exascale systems

RT Mills, MF Adams, S Balay, J Brown, A Dener… - Parallel Computing, 2021 - Elsevier
The Portable, Extensible Toolkit for Scientific Computation (PETSc) library delivers
scalable solvers for nonlinear time-dependent differential and algebraic equations and for …

Communication optimization strategies for distributed deep neural network training: A survey

S Ouyang, D Dong, Y Xu, L Xiao - Journal of Parallel and Distributed …, 2021 - Elsevier
Recent trends in high-performance computing and deep learning have led to the
proliferation of studies on large-scale deep neural network training. However, the frequent …

Towards Domain-Specific Network Transport for Distributed DNN Training

H Wang, H Tian, J Chen, X Wan, J Xia, G Zeng… - … USENIX Symposium on …, 2024 - usenix.org
The nature of machine learning (ML) applications exposes rich characteristics to underlying
network transport, yet little work has been done so far to systematically exploit these …

Channel and filter parallelism for large-scale CNN training

N Dryden, N Maruyama, T Moon, T Benson… - Proceedings of the …, 2019 - dl.acm.org
Accelerating large-scale CNN training is needed to keep training times reasonable as
datasets grow larger and models become more complex. Existing frameworks primarily …

The PetscSF scalable communication layer

J Zhang, J Brown, S Balay… - … on Parallel and …, 2021 - ieeexplore.ieee.org
PetscSF, the communication component of the Portable, Extensible Toolkit for Scientific
Computation (PETSc), is designed to provide PETSc's communication infrastructure suitable …

Enabling rapid COVID-19 small molecule drug design through scalable deep learning of generative models

SA Jacobs, T Moon, K McLoughlin… - … Journal of High …, 2021 - journals.sagepub.com
We improved the quality and reduced the time to produce machine learned models for use
in small molecule antiviral design. Our globally asynchronous multi-level parallel training …