A survey on checkpointing strategies: Should we always checkpoint à la Young/Daly?
The Young/Daly formula provides an approximation of the optimal checkpointing
period for a parallel application executing on a supercomputing platform. It was originally …
Singularity: Planet-scale, preemptive and elastic scheduling of AI workloads
D Shukla, M Sivathanu, S Viswanatha… - arXiv preprint arXiv …, 2022 - arxiv.org
Lowering costs by driving high utilization across deep learning workloads is a crucial lever
for cloud providers. We present Singularity, Microsoft's globally distributed scheduling …
CRAC: Checkpoint-restart architecture for CUDA with streams and UVM
T Jain, G Cooperman - SC20: International Conference for High …, 2020 - ieeexplore.ieee.org
The share of the top 500 supercomputers with NVIDIA GPUs is now over 25% and continues
to grow. While fault tolerance is a critical issue for supercomputing, there does not currently …
MigrOS: Transparent Live-Migration Support for Containerised RDMA Applications
RDMA networks offload packet processing onto specialised circuitry of the network interface
controllers (NICs) and bypass the OS to improve network latency and bandwidth. As a …
Software approaches for resilience of high performance computing systems: a survey
With the scaling up of high-performance computing systems in recent years, their reliability
has been descending continuously. Therefore, system resilience has been regarded as one …
Legio: fault resiliency for embarrassingly parallel MPI applications
Due to the increasing size of HPC machines, dealing with faults is becoming mandatory due
to their high frequency. Natively, MPI cannot handle faults and it stops the execution …
LightPC: hardware and software co-design for energy-efficient full system persistence
We propose LightPC, a lightweight persistence-centric platform to make the system robust
against power loss. LightPC consists of hardware and software subsystems, each being …
Practicable live container migrations in high performance computing clouds: Diskless, iterative, and connection-persistent
J Guitart - Journal of Systems Architecture, 2024 - Elsevier
Checkpoint/Restore techniques had been thoroughly used by the High Performance
Computing (HPC) community in the context of failure recovery. Given the current trend in …
Calculation of the high-energy neutron flux for anticipating errors and recovery techniques in exascale supercomputer centres
H Asorey, R Mayo-Garcia - The Journal of Supercomputing, 2023 - Springer
The age of exascale computing has arrived, and the risks associated with neutron and other
atmospheric radiation are becoming more critical as the computing power increases; hence …
Examining failures and repairs on supercomputers with multi-GPU compute nodes
Understanding the reliability characteristics of supercomputers has been a key focus of the
HPC and dependability communities. However, there is no current study that analyzes both …