Toward exascale resilience
Over the past few years resilience has become a major issue for high-performance
computing (HPC) systems, in particular in the perspective of large petascale systems and …
Extending stability beyond CPU millennium: a micron-scale atomistic simulation of Kelvin-Helmholtz instability
JN Glosli, DF Richards, KJ Caspersen… - Proceedings of the …, 2007 - dl.acm.org
We report the computational advances that have enabled the first micron-scale simulation of
a Kelvin-Helmholtz (KH) instability using molecular dynamics (MD). The advances are in …
Scaling to a million cores and beyond: Using light-weight simulation to understand the challenges ahead on the road to exascale
C Engelmann - Future Generation Computer Systems, 2014 - Elsevier
As supercomputers scale to 1000 PFlop/s over the next decade, investigating the
performance of parallel applications at scale on future architectures and the performance …
xSim: The extreme-scale simulator
S Böhm, C Engelmann - 2011 International Conference on …, 2011 - ieeexplore.ieee.org
Investigating parallel application performance at scale is an important part of
high-performance computing (HPC) application development. The Extreme-scale Simulator …
Desynchronization in distributed Ant Colony Optimization in HPC environment
Metaheuristics have significant computing requirements, in particular Ant Colony
Optimization (ACO) processes a population of individuals (agents/ants) roaming in a graph …
Recovery patterns for iterative methods in a parallel unstable environment
Several recovery techniques for parallel iterative methods are presented. First, the
implementation of checkpoints in parallel iterative methods is described and analyzed. Then …
Dynamic resource provisioning for sustainable cloud computing systems in the presence of correlated failures
Dependence of computing resources on each other in cloud computing systems (CCS)
makes them prone to fail in correlated manner which significantly impacts their service …
[BOOK][B] Coordinated checkpoint/restart process fault tolerance for MPI applications on HPC systems
J Hursey - 2010 - search.proquest.com
Scientists use advanced computing techniques to assist in answering the complex questions
at the forefront of discovery. The High Performance Computing (HPC) scientific applications …
A log-scaling fault tolerant agreement algorithm for a fault tolerant MPI
The lack of fault tolerance is becoming a limiting factor for application scalability in HPC
systems. The MPI does not provide standardized fault tolerance interfaces and semantics …
Task-level resilience: checkpointing vs. supervision
J Posner, L Reitz, C Fohry - International Journal of Networking and …, 2022 - jstage.jst.go.jp
With the advent of exascale computing, issues such as application irregularity and
permanent hardware failure are growing in importance. Irregularity is often addressed by …