Toward exascale resilience: 2014 update

F Cappello, G Al, W Gropp, S Kale, B Kramer… - … and Innovations: an …, 2014 - dl.acm.org
Resilience is a major roadblock for HPC executions on future exascale systems. These
systems will typically gather millions of CPU cores running up to a billion threads …

Use cases of lossy compression for floating-point data in scientific data sets

F Cappello, S Di, S Li, X Liang… - … Journal of High …, 2019 - journals.sagepub.com
Architectural and technological trends of systems used for scientific computing call for a
significant reduction of scientific data sets that are composed mainly of floating-point data …

A survey of aiops methods for failure management

P Notaro, J Cardoso, M Gerndt - ACM Transactions on Intelligent …, 2021 - dl.acm.org
Modern society is increasingly moving toward complex and distributed computing systems.
The increase in scale and complexity of these systems challenges O&M teams that perform …

Addressing failures in exascale computing

M Snir, RW Wisniewski, JA Abraham… - … Journal of High …, 2014 - journals.sagepub.com
We present here a report produced by a workshop on 'Addressing failures in exascale
computing' held in Park City, Utah, 4–11 August 2012. The charter of this workshop was to …

The design, deployment, and evaluation of the CORAL pre-exascale systems

SS Vazhkudai, BR De Supinski… - … Conference for High …, 2018 - ieeexplore.ieee.org
CORAL, the Collaboration of Oak Ridge, Argonne and Livermore, is fielding two similar IBM
systems, Summit and Sierra, with NVIDIA GPUs that will replace the existing Titan and …

Exascale computing technology challenges

J Shalf, S Dosanjh, J Morrison - … conference, Berkeley, CA, USA, June 22 …, 2011 - Springer
High Performance Computing architectures are expected to change dramatically in
the next decade as power and cooling constraints limit increases in microprocessor clock …

FTI: High performance fault tolerance interface for hybrid systems

L Bautista-Gomez, S Tsuboi, D Komatitsch… - Proceedings of 2011 …, 2011 - dl.acm.org
Large scientific applications deployed on current petascale systems expend a significant
amount of their execution time dumping checkpoint files to remote storage. New fault tolerant …

[Book] Fault tolerance techniques for high-performance computing

J Dongarra, T Herault, Y Robert - 2015 - Springer
This chapter provides an introduction to resilience methods. The emphasis is on
checkpointing, the de-facto standard technique for resilience in High Performance …

ThyNVM: Enabling software-transparent crash consistency in persistent memory systems

J Ren, J Zhao, S Khan, J Choi, Y Wu… - Proceedings of the 48th …, 2015 - dl.acm.org
Emerging byte-addressable nonvolatile memories (NVMs) promise persistent memory,
which allows processors to directly access persistent data in main memory. Yet, persistent …

Evaluating the viability of process replication reliability for exascale systems

K Ferreira, J Stearley, JH Laros III, R Oldfield… - Proceedings of 2011 …, 2011 - dl.acm.org
As high-end computing machines continue to grow in size, issues such as fault tolerance
and reliability limit application scalability. Current techniques to ensure progress across …