Scientific benchmarking of parallel computing systems: twelve ways to tell the masses when reporting performance results

T Hoefler, R Belli - Proceedings of the international conference for high …, 2015 - dl.acm.org
Measuring and reporting performance of parallel computers constitutes the basis for
scientific advancement of high-performance computing (HPC). Most scientific reports show …

There goes the neighborhood: performance degradation due to nearby jobs

A Bhatele, K Mohror, SH Langer… - Proceedings of the …, 2013 - dl.acm.org
Predictable performance is important for understanding and alleviating application
performance issues; quantifying the effects of source code, compiler, or system software …

Using automated performance modeling to find scalability bugs in complex codes

A Calotoiu, T Hoefler, M Poke, F Wolf - Proceedings of the International …, 2013 - dl.acm.org
Many parallel applications suffer from latent performance limitations that may prevent them
from scaling to larger machine sizes. Often, such scalability bugs manifest themselves only …

Clairvoyant prefetching for distributed machine learning I/O

N Dryden, R Böhringer, T Ben-Nun… - Proceedings of the …, 2021 - dl.acm.org
I/O is emerging as a major bottleneck for machine learning training, especially in distributed
environments. Indeed, at large scale, I/O takes as much as 85% of training time. Addressing …

GossipGraD: Scalable deep learning using gossip communication based asynchronous gradient descent

J Daily, A Vishnu, C Siegel, T Warfel… - arXiv preprint arXiv …, 2018 - arxiv.org
In this paper, we present GossipGraD, a gossip communication protocol based Stochastic
Gradient Descent (SGD) algorithm for scaling Deep Learning (DL) algorithms on large-scale …

The SIMNET virtual world architecture

J Calvin, A Dickens, B Gaines… - Proceedings of IEEE …, 1993 - ieeexplore.ieee.org
Many tools and techniques have been developed to address specific aspects of interacting
in a virtual world. Few have been designed with an architecture that allows large numbers of …

Flare: Flexible in-network allreduce

D De Sensi, S Di Girolamo, S Ashkboos, S Li… - Proceedings of the …, 2021 - dl.acm.org
The allreduce operation is one of the most commonly used communication routines in
distributed applications. To improve its bandwidth and to reduce network traffic, this …

Hiding global communication latency in the GMRES algorithm on massively parallel machines

P Ghysels, TJ Ashby, K Meerbergen… - SIAM journal on scientific …, 2013 - SIAM
In the generalized minimal residual method (GMRES), the global all-to-all communication
required in each iteration for orthogonalization and normalization of the Krylov base vectors …

sPIN: High-performance streaming Processing in the Network

T Hoefler, S Di Girolamo, K Taranov, RE Grant… - Proceedings of the …, 2017 - dl.acm.org
Optimizing communication performance is imperative for large-scale computing because
communication overheads limit the strong scalability of parallel applications. Today's …

Run-to-run variability on Xeon Phi based Cray XC systems

S Chunduri, K Harms, S Parker, V Morozov… - Proceedings of the …, 2017 - dl.acm.org
The increasing complexity of HPC systems has introduced new sources of variability, which
can contribute to significant differences in run-to-run performance of applications. With …