Resiliency in numerical algorithm design for extreme scale simulations
This work is based on the seminar titled 'Resiliency in Numerical Algorithm Design for
Extreme Scale Simulations' held March 1–6, 2020, at Schloss Dagstuhl, that was attended …
Understanding performance interference in next-generation HPC systems
Next-generation systems face a wide range of new potential sources of application
interference, including resilience actions, system software adaptation, and in situ analytics …
Adapt: An event-based adaptive collective communication framework
The increase in scale and heterogeneity of high-performance computing (HPC) systems
predisposes the performance of Message Passing Interface (MPI) collective communications …
Phoenix: Memory speed HPC I/O with NVM
In order to bridge the gap between the applications' I/O needs on future exascale platforms,
and the capabilities of conventional memory and storage technologies, HPC system designs …
Understanding the effects of communication and coordination on checkpointing at scale
Fault-tolerance poses a major challenge for future large-scale systems. Active research into
coordinated, uncoordinated, and hybrid checkpointing systems has explored how the …
Characterizing MPI matching via trace-based simulation
With the increased scale expected on future leadership-class systems, detailed information
about the resource usage and performance of MPI message matching provides important …
LLAMP: Assessing Network Latency Tolerance of HPC Applications with Linear Programming
The shift towards high-bandwidth networks driven by AI workloads in data centers and HPC
clusters has unintentionally aggravated network latency, adversely affecting the …
Exploring the effect of noise on the performance benefit of nonblocking allreduce
Relaxed synchronization offers the potential of maintaining application scalability by
allowing many processes to make independent progress when some processes suffer …
Energy-efficient localised rollback via data flow analysis and frequency scaling
K Dichev, K Cameron, DS Nikolopoulos - Proceedings of the 25th …, 2018 - dl.acm.org
Exascale systems will suffer failures hourly. HPC programmers rely mostly on application-
level checkpoint and a global rollback to recover. In recent years, techniques reducing the …
Noise-tolerant explicit stencil computations for nonuniform process execution rates
Next-generation HPC computing platforms are likely to be characterized by significant,
unpredictable nonuniformities in execution time among compute nodes and cores. The …