System-level scalable checkpoint-restart for petascale computing

T Jain, G Cooperman - SC20: International Conference for High …, 2020 - ieeexplore.ieee.org

The share of the top 500 supercomputers with NVIDIA GPUs is now over 25% and continues
to grow. While fault tolerance is a critical issue for supercomputing, there does not currently …

Cited by 24 · Related articles · All 7 versions

Reinit: Evaluating the performance of global-restart recovery methods for MPI fault tolerance

G Georgakoudis, L Guo, I Laguna - International Conference on High …, 2020 - Springer

Scaling supercomputers comes with an increase in failure rates due to the increasing
number of hardware components. In standard practice, applications are made resilient …

Cited by 28 · Related articles · All 6 versions

MANA for MPI: MPI-agnostic network-agnostic transparent checkpointing

R Garg, G Price, G Cooperman - … of the 28th international symposium on …, 2019 - dl.acm.org

Transparently checkpointing MPI for fault tolerance and load balancing is a long-standing
problem in HPC. The problem has been complicated by the need to provide checkpoint …

Cited by 28 · Related articles · All 7 versions

Shrink or substitute: handling process failures in HPC systems using in-situ recovery

RA Ashraf, S Hukerikar… - 2018 26th Euromicro …, 2018 - ieeexplore.ieee.org

Efficient utilization of today's high-performance computing (HPC) systems with complex
software and hardware components requires that the HPC applications are designed to …

Cited by 30 · Related articles · All 16 versions

Software approaches for resilience of high performance computing systems: a survey

J Jia, Y Liu, G Zhang, Y Gao, D Qian - Frontiers of Computer Science, 2023 - Springer

With the scaling up of high-performance computing systems in recent years, their reliability
has been descending continuously. Therefore, system resilience has been regarded as one …

Cited by 4 · Related articles · All 4 versions

LightPC: hardware and software co-design for energy-efficient full system persistence

S Lee, M Kwon, G Park, M Jung - Proceedings of the 49th Annual …, 2022 - dl.acm.org

We propose LightPC, a lightweight persistence-centric platform to make the system robust
against power loss. LightPC consists of hardware and software subsystems, each being …

Cited by 5 · Related articles · All 4 versions

Experimental findings on the sources of detected unrecoverable errors in GPUs

FF dos Santos, S Malde, C Cazzaniga… - … on Nuclear Science, 2022 - ieeexplore.ieee.org

We investigate the sources of detected unrecoverable errors (DUEs) in graphics processing
units (GPUs) exposed to a neutron beam. Illegal memory accesses and interface errors are …

Cited by 8 · Related articles · All 7 versions

Towards Efficient Cache Allocation for High-Frequency Checkpointing

A Maurya, B Nicolae, MM Rafique… - 2022 IEEE 29th …, 2022 - ieeexplore.ieee.org

While many HPC applications are known to have long runtimes, this is not always because
of single large runs: in many cases, this is due to ensembles composed of many short runs …

Cited by 4 · Related articles · All 7 versions

Checkpointing strategies to tolerate non-memoryless failures on HPC platforms

A Benoit, L Perotin, Y Robert, F Vivien - ACM Transactions on Parallel …, 2024 - dl.acm.org

This article studies checkpointing strategies for parallel applications subject to failures. The
optimal strategy to minimize total execution time, or makespan, is well known when failure …

Cited by 2 · Related articles · All 5 versions

Job migration in HPC clusters by means of checkpoint/restart

M Rodríguez-Pascual, J Cao, JA Moríñigo… - The Journal of …, 2019 - Springer

Until now, jobs running on HPC clusters were tied to the node where their execution started.
We have removed that limitation by integrating a user-level checkpoint/restart library into a …

Cited by 13 · Related articles · All 6 versions