Toward exascale resilience

F Cappello, A Geist, B Gropp, L Kale… - … Journal of High …, 2009 - journals.sagepub.com
Over the past few years resilience has become a major issue for high-performance
computing (HPC) systems, in particular in the perspective of large petascale systems and …

Extending stability beyond CPU millennium: a micron-scale atomistic simulation of Kelvin-Helmholtz instability

JN Glosli, DF Richards, KJ Caspersen… - Proceedings of the …, 2007 - dl.acm.org
We report the computational advances that have enabled the first micron-scale simulation of
a Kelvin-Helmholtz (KH) instability using molecular dynamics (MD). The advances are in …

Scaling to a million cores and beyond: Using light-weight simulation to understand the challenges ahead on the road to exascale

C Engelmann - Future Generation Computer Systems, 2014 - Elsevier
As supercomputers scale to 1000 PFlop/s over the next decade, investigating the
performance of parallel applications at scale on future architectures and the performance …

xSim: The extreme-scale simulator

S Böhm, C Engelmann - 2011 International Conference on …, 2011 - ieeexplore.ieee.org
Investigating parallel application performance at scale is an important part of
high-performance computing (HPC) application development. The Extreme-scale Simulator …

Desynchronization in distributed Ant Colony Optimization in HPC environment

M Starzec, G Starzec, A Byrski, W Turek… - Future Generation …, 2020 - Elsevier
Metaheuristics have significant computing requirements; in particular, Ant Colony
Optimization (ACO) processes a population of individuals (agents/ants) roaming in a graph …

Recovery patterns for iterative methods in a parallel unstable environment

J Langou, Z Chen, G Bosilca, J Dongarra - SIAM Journal on Scientific …, 2008 - SIAM
Several recovery techniques for parallel iterative methods are presented. First, the
implementation of checkpoints in parallel iterative methods is described and analyzed. Then …

Dynamic resource provisioning for sustainable cloud computing systems in the presence of correlated failures

Y Sharma, J Taheri, W Si, D Sun… - IEEE Transactions on …, 2020 - ieeexplore.ieee.org
Dependence of computing resources on each other in cloud computing systems (CCS)
makes them prone to fail in a correlated manner, which significantly impacts their service …

[BOOK][B] Coordinated checkpoint/restart process fault tolerance for MPI applications on HPC systems

J Hursey - 2010 - search.proquest.com
Scientists use advanced computing techniques to assist in answering the complex questions
at the forefront of discovery. The High Performance Computing (HPC) scientific applications …

A log-scaling fault tolerant agreement algorithm for a fault tolerant MPI

J Hursey, T Naughton, G Vallee, RL Graham - Recent Advances in the …, 2011 - Springer
The lack of fault tolerance is becoming a limiting factor for application scalability in HPC
systems. The MPI does not provide standardized fault tolerance interfaces and semantics …

Task-level resilience: checkpointing vs. supervision

J Posner, L Reitz, C Fohry - International Journal of Networking and …, 2022 - jstage.jst.go.jp
With the advent of exascale computing, issues such as application irregularity and
permanent hardware failure are growing in importance. Irregularity is often addressed by …