MANA for MPI: MPI-agnostic network-agnostic transparent checkpointing

L Bautista-Gomez, A Benoit, S Di, T Herault… - Future Generation …, 2024 - Elsevier

Abstract The Young/Daly formula provides an approximation of the optimal checkpointing
period for a parallel application executing on a supercomputing platform. It was originally …

Cited by 1 Related articles All 8 versions

[PDF] arxiv.org

Singularity: Planet-scale, preemptive and elastic scheduling of AI workloads

D Shukla, M Sivathanu, S Viswanatha… - arXiv preprint arXiv …, 2022 - arxiv.org

Lowering costs by driving high utilization across deep learning workloads is a crucial lever
for cloud providers. We present Singularity, Microsoft's globally distributed scheduling …

Cited by 29 Related articles All 2 versions

[PDF] arxiv.org

CRAC: Checkpoint-restart architecture for CUDA with streams and UVM

T Jain, G Cooperman - SC20: International Conference for High …, 2020 - ieeexplore.ieee.org

The share of the top 500 supercomputers with NVIDIA GPUs is now over 25% and continues
to grow. While fault tolerance is a critical issue for supercomputing, there does not currently …

Cited by 27 Related articles All 7 versions

[PDF] usenix.org

MigrOS: Transparent Live-Migration Support for Containerised RDMA Applications

M Planeta, J Bierbaum, LSD Antony, T Hoefler… - 2021 USENIX Annual …, 2021 - usenix.org

RDMA networks offload packet processing onto specialised circuitry of the network interface
controllers (NICs) and bypass the OS to improve network latency and bandwidth. As a …

Cited by 20 Related articles All 23 versions

[PDF] researchgate.net

Software approaches for resilience of high performance computing systems: a survey

J Jia, Y Liu, G Zhang, Y Gao, D Qian - Frontiers of Computer Science, 2023 - Springer

With the scaling up of high-performance computing systems in recent years, their reliability
has been descending continuously. Therefore, system resilience has been regarded as one …

Cited by 8 Related articles All 4 versions

[PDF] springer.com

Legio: fault resiliency for embarrassingly parallel MPI applications

R Rocco, D Gadioli, G Palermo - The Journal of Supercomputing, 2022 - Springer

Due to the increasing size of HPC machines, dealing with faults is becoming mandatory due
to their high frequency. Natively, MPI cannot handle faults and it stops the execution …

Cited by 14 Related articles All 12 versions

LightPC: hardware and software co-design for energy-efficient full system persistence

S Lee, M Kwon, G Park, M Jung - Proceedings of the 49th Annual …, 2022 - dl.acm.org

We propose LightPC, a lightweight persistence-centric platform to make the system robust
against power loss. LightPC consists of hardware and software subsystems, each being …

Cited by 6 Related articles All 4 versions

[HTML] sciencedirect.com

[HTML] Practicable live container migrations in high performance computing clouds: Diskless, iterative, and connection-persistent

J Guitart - Journal of Systems Architecture, 2024 - Elsevier

Checkpoint/Restore techniques had been thoroughly used by the High Performance
Computing (HPC) community in the context of failure recovery. Given the current trend in …

Cited by 1 Related articles All 2 versions

[PDF] arxiv.org

Calculation of the high-energy neutron flux for anticipating errors and recovery techniques in exascale supercomputer centres

H Asorey, R Mayo-Garcia - The Journal of Supercomputing, 2023 - Springer

The age of exascale computing has arrived, and the risks associated with neutron and other
atmospheric radiation are becoming more critical as the computing power increases; hence …

Cited by 3 Related articles All 6 versions

[PDF] nsf.gov

Examining failures and repairs on supercomputers with multi-GPU compute nodes

A Taherin, T Patel, G Georgakoudis… - 2021 51st Annual …, 2021 - ieeexplore.ieee.org

Understanding the reliability characteristics of supercomputers has been a key focus of the
HPC and dependability communities. However, there is no current study that analyzes both …

Cited by 7 Related articles All 3 versions