CRAC: Checkpoint-restart architecture for CUDA with streams and UVM

T Jain, G Cooperman - SC20: International Conference for High …, 2020 - ieeexplore.ieee.org
The share of the top 500 supercomputers with NVIDIA GPUs is now over 25% and continues
to grow. While fault tolerance is a critical issue for supercomputing, there does not currently …

Reinit: Evaluating the performance of global-restart recovery methods for MPI fault tolerance

G Georgakoudis, L Guo, I Laguna - International Conference on High …, 2020 - Springer
Scaling supercomputers comes with an increase in failure rates due to the increasing
number of hardware components. In standard practice, applications are made resilient …

MANA for MPI: MPI-agnostic network-agnostic transparent checkpointing

R Garg, G Price, G Cooperman - … of the 28th international symposium on …, 2019 - dl.acm.org
Transparently checkpointing MPI for fault tolerance and load balancing is a long-standing
problem in HPC. The problem has been complicated by the need to provide checkpoint …

Shrink or substitute: handling process failures in HPC systems using in-situ recovery

RA Ashraf, S Hukerikar… - 2018 26th Euromicro …, 2018 - ieeexplore.ieee.org
Efficient utilization of today's high-performance computing (HPC) systems with complex
software and hardware components requires that the HPC applications are designed to …

Software approaches for resilience of high performance computing systems: a survey

J Jia, Y Liu, G Zhang, Y Gao, D Qian - Frontiers of Computer Science, 2023 - Springer
With the scaling up of high-performance computing systems in recent years, their reliability
has been descending continuously. Therefore, system resilience has been regarded as one …

LightPC: hardware and software co-design for energy-efficient full system persistence

S Lee, M Kwon, G Park, M Jung - Proceedings of the 49th Annual …, 2022 - dl.acm.org
We propose LightPC, a lightweight persistence-centric platform to make the system robust
against power loss. LightPC consists of hardware and software subsystems, each being …

Experimental findings on the sources of detected unrecoverable errors in GPUs

FF dos Santos, S Malde, C Cazzaniga… - … on Nuclear Science, 2022 - ieeexplore.ieee.org
We investigate the sources of detected unrecoverable errors (DUEs) in graphics processing
units (GPUs) exposed to a neutron beam. Illegal memory accesses and interface errors are …

Towards efficient cache allocation for high-frequency checkpointing

A Maurya, B Nicolae, MM Rafique… - 2022 IEEE 29th …, 2022 - ieeexplore.ieee.org
While many HPC applications are known to have long runtimes, this is not always because
of single large runs: in many cases, this is due to ensembles composed of many short runs …

Checkpointing strategies to tolerate non-memoryless failures on HPC platforms

A Benoit, L Perotin, Y Robert, F Vivien - ACM Transactions on Parallel …, 2024 - dl.acm.org
This article studies checkpointing strategies for parallel applications subject to failures. The
optimal strategy to minimize total execution time, or makespan, is well known when failure …

Job migration in HPC clusters by means of checkpoint/restart

M Rodríguez-Pascual, J Cao, JA Moríñigo… - The Journal of …, 2019 - Springer
Until now, jobs running on HPC clusters were tied to the node where their execution started.
We have removed that limitation by integrating a user-level checkpoint/restart library into a …