Characterizing the influence of system noise on large-scale applications by simulation

Scientific benchmarking of parallel computing systems: twelve ways to tell the masses when reporting performance results

T Hoefler, R Belli - Proceedings of the international conference for high …, 2015 - dl.acm.org

Measuring and reporting performance of parallel computers constitutes the basis for
scientific advancement of high-performance computing (HPC). Most scientific reports show …

Cited by 326 · Related articles · All 37 versions

There goes the neighborhood: performance degradation due to nearby jobs

A Bhatele, K Mohror, SH Langer… - Proceedings of the …, 2013 - dl.acm.org

Predictable performance is important for understanding and alleviating application
performance issues; quantifying the effects of source code, compiler, or system software …

Cited by 258 · Related articles · All 18 versions

Using automated performance modeling to find scalability bugs in complex codes

A Calotoiu, T Hoefler, M Poke, F Wolf - Proceedings of the International …, 2013 - dl.acm.org

Many parallel applications suffer from latent performance limitations that may prevent them
from scaling to larger machine sizes. Often, such scalability bugs manifest themselves only …

Cited by 194 · Related articles · All 32 versions

Clairvoyant prefetching for distributed machine learning I/O

N Dryden, R Böhringer, T Ben-Nun… - Proceedings of the …, 2021 - dl.acm.org

I/O is emerging as a major bottleneck for machine learning training, especially in distributed
environments. Indeed, at large scale, I/O takes as much as 85% of training time. Addressing …

Cited by 65 · Related articles · All 22 versions

GossipGraD: Scalable deep learning using gossip communication based asynchronous gradient descent

J Daily, A Vishnu, C Siegel, T Warfel… - arXiv preprint arXiv …, 2018 - arxiv.org

In this paper, we present GossipGraD, a gossip communication protocol based Stochastic
Gradient Descent (SGD) algorithm for scaling Deep Learning (DL) algorithms on large-scale …

Cited by 113 · Related articles · All 4 versions

The SIMNET virtual world architecture

J Calvin, A Dickens, B Gaines… - Proceedings of IEEE …, 1993 - ieeexplore.ieee.org

Many tools and techniques have been developed to address specific aspects of interacting
in a virtual world. Few have been designed with an architecture that allows large numbers of …

Cited by 300 · Related articles · All 6 versions

Flare: Flexible in-network allreduce

D De Sensi, S Di Girolamo, S Ashkboos, S Li… - Proceedings of the …, 2021 - dl.acm.org

The allreduce operation is one of the most commonly used communication routines in
distributed applications. To improve its bandwidth and to reduce network traffic, this …

Cited by 48 · Related articles · All 26 versions

Hiding global communication latency in the GMRES algorithm on massively parallel machines

P Ghysels, TJ Ashby, K Meerbergen… - SIAM journal on scientific …, 2013 - SIAM

In the generalized minimal residual method (GMRES), the global all-to-all communication
required in each iteration for orthogonalization and normalization of the Krylov base vectors …

Cited by 169 · Related articles · All 11 versions

sPIN: High-performance streaming Processing in the Network

T Hoefler, S Di Girolamo, K Taranov, RE Grant… - Proceedings of the …, 2017 - dl.acm.org

Optimizing communication performance is imperative for large-scale computing because
communication overheads limit the strong scalability of parallel applications. Today's …

Cited by 98 · Related articles · All 27 versions

Run-to-run variability on Xeon Phi based Cray XC systems

S Chunduri, K Harms, S Parker, V Morozov… - Proceedings of the …, 2017 - dl.acm.org

The increasing complexity of HPC systems has introduced new sources of variability, which
can contribute to significant differences in run-to-run performance of applications. With …

Cited by 98 · Related articles · All 5 versions