[HTML] Toward exascale resilience: 2014 update
F Cappello, A Geist, W Gropp, S Kale, B Kramer… - … and Innovations: an …, 2014 - dl.acm.org
Resilience is a major roadblock for HPC executions on future exascale systems. These
systems will typically gather millions of CPU cores running up to a billion threads …
Use cases of lossy compression for floating-point data in scientific data sets
Architectural and technological trends of systems used for scientific computing call for a
significant reduction of scientific data sets that are composed mainly of floating-point data …
A survey of aiops methods for failure management
Modern society is increasingly moving toward complex and distributed computing systems.
The increase in scale and complexity of these systems challenges O&M teams that perform …
Addressing failures in exascale computing
We present here a report produced by a workshop on 'Addressing failures in exascale
computing' held in Park City, Utah, 4–11 August 2012. The charter of this workshop was to …
The design, deployment, and evaluation of the CORAL pre-exascale systems
SS Vazhkudai, BR De Supinski… - … Conference for High …, 2018 - ieeexplore.ieee.org
CORAL, the Collaboration of Oak Ridge, Argonne and Livermore, is fielding two similar IBM
systems, Summit and Sierra, with NVIDIA GPUs that will replace the existing Titan and …
Exascale computing technology challenges
Abstract High Performance Computing architectures are expected to change dramatically in
the next decade as power and cooling constraints limit increases in microprocessor clock …
FTI: High performance fault tolerance interface for hybrid systems
Large scientific applications deployed on current petascale systems expend a significant
amount of their execution time dumping checkpoint files to remote storage. New fault tolerant …
[BOOK][B] Fault tolerance techniques for high-performance computing
This chapter provides an introduction to resilience methods. The emphasis is on
checkpointing, the de-facto standard technique for resilience in High Performance …
ThyNVM: Enabling software-transparent crash consistency in persistent memory systems
Emerging byte-addressable nonvolatile memories (NVMs) promise persistent memory,
which allows processors to directly access persistent data in main memory. Yet, persistent …
Evaluating the viability of process replication reliability for exascale systems
As high-end computing machines continue to grow in size, issues such as fault tolerance
and reliability limit application scalability. Current techniques to ensure progress across …