[BOOK][B] Coordinated checkpoint/restart process fault tolerance for MPI applications on HPC systems

J Hursey - 2010 - search.proquest.com
Scientists use advanced computing techniques to assist in answering the complex questions
at the forefront of discovery. The High Performance Computing (HPC) scientific applications …

Fault tolerant file models for parallel file systems: introducing distribution patterns for every file

A Calderón, F García-Carballeira, LM Sánchez… - The Journal of …, 2009 - Springer
Parallelism in file systems is obtained by using several independent server nodes
supporting one or more secondary storage devices. This approach increases the …

Handling persistent states in process checkpoint/restart mechanisms for HPC systems

P Riteau, A Lebre, C Morin - … on Cluster Computing and the Grid, 2009 - ieeexplore.ieee.org
Computer clusters are today the reference architecture for high-performance computing. The
large number of nodes in these systems induces a high failure rate. This makes fault …

Fault tolerant file models for MPI-IO parallel file systems

A Calderón, F García-Carballeira, F Isailǎ… - Recent Advances in …, 2007 - Springer
Parallelism in file systems is obtained by using several independent server nodes
supporting one or more secondary storage devices. This approach increases the …

Handling Persistent States in Process Checkpoint/Restart Mechanisms for HPC Systems

P Riteau, A Lebre, C Morin - 2008 - inria.hal.science
Computer clusters are today the reference architecture for high-performance computing. The
large number of nodes in these systems induces a high failure rate. This makes fault …

[PDF][PDF] Handling Persistent States in Process Checkpoint/Restart Mechanisms for HPC Systems

P Riteau, A Lebre, C Morin - 2008 - academia.edu
Computer clusters are today the reference architecture for high-performance computing. The
large number of nodes in these systems induces a high failure rate. This makes fault …

[CITATION][C] FA-MPI: fault-aware MPI specification and concept of operations

A Skjellum, PV Bangalore, YS Dandass - University of Alabama at Birmingham, Tech …, 2012

[CITATION][C] Fault Tolerant Support for Parallel File System on Clusters