A survey on checkpointing strategies: Should we always checkpoint à la Young/Daly?

L Bautista-Gomez, A Benoit, S Di, T Herault… - Future Generation …, 2024 - Elsevier
The Young/Daly formula provides an approximation of the optimal checkpointing
period for a parallel application executing on a supercomputing platform. It was originally …

Singularity: Planet-scale, preemptive and elastic scheduling of AI workloads

D Shukla, M Sivathanu, S Viswanatha… - arXiv preprint arXiv …, 2022 - arxiv.org
Lowering costs by driving high utilization across deep learning workloads is a crucial lever
for cloud providers. We present Singularity, Microsoft's globally distributed scheduling …

CRAC: Checkpoint-restart architecture for CUDA with streams and UVM

T Jain, G Cooperman - SC20: International Conference for High …, 2020 - ieeexplore.ieee.org
The share of the top 500 supercomputers with NVIDIA GPUs is now over 25% and continues
to grow. While fault tolerance is a critical issue for supercomputing, there does not currently …

MigrOS: Transparent Live-Migration Support for Containerised RDMA Applications

M Planeta, J Bierbaum, LSD Antony, T Hoefler… - 2021 USENIX Annual …, 2021 - usenix.org
RDMA networks offload packet processing onto specialised circuitry of the network interface
controllers (NICs) and bypass the OS to improve network latency and bandwidth. As a …

Software approaches for resilience of high performance computing systems: a survey

J Jia, Y Liu, G Zhang, Y Gao, D Qian - Frontiers of Computer Science, 2023 - Springer
With the scaling up of high-performance computing systems in recent years, their reliability
has been descending continuously. Therefore, system resilience has been regarded as one …

Legio: fault resiliency for embarrassingly parallel MPI applications

R Rocco, D Gadioli, G Palermo - The Journal of Supercomputing, 2022 - Springer
Due to the increasing size of HPC machines, dealing with faults is becoming mandatory due
to their high frequency. Natively, MPI cannot handle faults and it stops the execution …

LightPC: hardware and software co-design for energy-efficient full system persistence

S Lee, M Kwon, G Park, M Jung - Proceedings of the 49th Annual …, 2022 - dl.acm.org
We propose LightPC, a lightweight persistence-centric platform to make the system robust
against power loss. LightPC consists of hardware and software subsystems, each being …

Practicable live container migrations in high performance computing clouds: Diskless, iterative, and connection-persistent

J Guitart - Journal of Systems Architecture, 2024 - Elsevier
Checkpoint/Restore techniques have been widely used by the High Performance
Computing (HPC) community in the context of failure recovery. Given the current trend in …

Calculation of the high-energy neutron flux for anticipating errors and recovery techniques in exascale supercomputer centres

H Asorey, R Mayo-Garcia - The Journal of Supercomputing, 2023 - Springer
The age of exascale computing has arrived, and the risks associated with neutron and other
atmospheric radiation are becoming more critical as the computing power increases; hence …

Examining failures and repairs on supercomputers with multi-GPU compute nodes

A Taherin, T Patel, G Georgakoudis… - 2021 51st Annual …, 2021 - ieeexplore.ieee.org
Understanding the reliability characteristics of supercomputers has been a key focus of the
HPC and dependability communities. However, there is no current study that analyzes both …