Scientific benchmarking of parallel computing systems: twelve ways to tell the masses when reporting performance results
Measuring and reporting performance of parallel computers constitutes the basis for
scientific advancement of high-performance computing (HPC). Most scientific reports show …
There goes the neighborhood: performance degradation due to nearby jobs
Predictable performance is important for understanding and alleviating application
performance issues; quantifying the effects of source code, compiler, or system software …
Using automated performance modeling to find scalability bugs in complex codes
Many parallel applications suffer from latent performance limitations that may prevent them
from scaling to larger machine sizes. Often, such scalability bugs manifest themselves only …
Clairvoyant prefetching for distributed machine learning I/O
I/O is emerging as a major bottleneck for machine learning training, especially in distributed
environments. Indeed, at large scale, I/O takes as much as 85% of training time. Addressing …
GossipGraD: Scalable deep learning using gossip communication based asynchronous gradient descent
In this paper, we present GossipGraD, a gossip communication protocol based Stochastic
Gradient Descent (SGD) algorithm for scaling Deep Learning (DL) algorithms on large-scale …
The SIMNET virtual world architecture
J Calvin, A Dickens, B Gaines… - Proceedings of IEEE …, 1993 - ieeexplore.ieee.org
Many tools and techniques have been developed to address specific aspects of interacting
in a virtual world. Few have been designed with an architecture that allows large numbers of …
Flare: Flexible in-network allreduce
The allreduce operation is one of the most commonly used communication routines in
distributed applications. To improve its bandwidth and to reduce network traffic, this …
Hiding global communication latency in the GMRES algorithm on massively parallel machines
P Ghysels, TJ Ashby, K Meerbergen… - SIAM journal on scientific …, 2013 - SIAM
In the generalized minimal residual method (GMRES), the global all-to-all communication
required in each iteration for orthogonalization and normalization of the Krylov base vectors …
sPIN: High-performance streaming Processing in the Network
Optimizing communication performance is imperative for large-scale computing because
communication overheads limit the strong scalability of parallel applications. Today's …
Run-to-run variability on Xeon Phi based Cray XC systems
The increasing complexity of HPC systems has introduced new sources of variability, which
can contribute to significant differences in run-to-run performance of applications. With …