Resiliency in numerical algorithm design for extreme scale simulations

E Agullo, M Altenbernd, H Anzt… - … Journal of High …, 2022 - journals.sagepub.com
This work is based on the seminar titled 'Resiliency in Numerical Algorithm Design for
Extreme Scale Simulations' held March 1–6, 2020, at Schloss Dagstuhl, that was attended …

Understanding performance interference in next-generation HPC systems

OH Mondragon, PG Bridges, S Levy… - SC'16: Proceedings …, 2016 - ieeexplore.ieee.org
Next-generation systems face a wide range of new potential sources of application
interference, including resilience actions, system software adaptation, and in situ analytics …

ADAPT: An event-based adaptive collective communication framework

X Luo, W Wu, G Bosilca, T Patinyasakdikul… - Proceedings of the 27th …, 2018 - dl.acm.org
The increase in scale and heterogeneity of high-performance computing (HPC) systems
predisposes the performance of Message Passing Interface (MPI) collective communications …

Phoenix: Memory speed HPC I/O with NVM

P Fernando, S Kannan, A Gavrilovska… - 2016 IEEE 23rd …, 2016 - ieeexplore.ieee.org
In order to bridge the gap between the applications' I/O needs on future exascale platforms,
and the capabilities of conventional memory and storage technologies, HPC system designs …

Understanding the effects of communication and coordination on checkpointing at scale

KB Ferreira, P Widener, S Levy… - SC'14: Proceedings …, 2014 - ieeexplore.ieee.org
Fault-tolerance poses a major challenge for future large-scale systems. Active research into
coordinated, uncoordinated, and hybrid checkpointing systems has explored how the …

Characterizing MPI matching via trace-based simulation

KB Ferreira, S Levy, K Pedretti, RE Grant - Proceedings of the 24th …, 2017 - dl.acm.org
With the increased scale expected on future leadership-class systems, detailed information
about the resource usage and performance of MPI message matching provides important …

LLAMP: Assessing Network Latency Tolerance of HPC Applications with Linear Programming

S Shen, L Huang, M Chrapek, T Schneider… - arXiv preprint arXiv …, 2024 - arxiv.org
The shift towards high-bandwidth networks driven by AI workloads in data centers and HPC
clusters has unintentionally aggravated network latency, adversely affecting the …

Exploring the effect of noise on the performance benefit of nonblocking allreduce

P Widener, KB Ferreira, S Levy, T Hoefler - Proceedings of the 21st …, 2014 - dl.acm.org
Relaxed synchronization offers the potential of maintaining application scalability by
allowing many processes to make independent progress when some processes suffer …

Energy-efficient localised rollback via data flow analysis and frequency scaling

K Dichev, K Cameron, DS Nikolopoulos - Proceedings of the 25th …, 2018 - dl.acm.org
Exascale systems will suffer failures hourly. HPC programmers rely mostly on application-
level checkpoint and a global rollback to recover. In recent years, techniques reducing the …

Noise-tolerant explicit stencil computations for nonuniform process execution rates

A Hammouda, AR Siegel, SF Siegel - ACM Transactions on Parallel …, 2015 - dl.acm.org
Next-generation HPC computing platforms are likely to be characterized by significant,
unpredictable nonuniformities in execution time among compute nodes and cores. The …