[BOOK][B] Coordinated checkpoint/restart process fault tolerance for MPI applications on HPC systems
J Hursey - 2010 - search.proquest.com
Scientists use advanced computing techniques to assist in answering the complex questions
at the forefront of discovery. The High Performance Computing (HPC) scientific applications …
Fault tolerant file models for parallel file systems: introducing distribution patterns for every file
Parallelism in file systems is obtained by using several independent server nodes
supporting one or more secondary storage devices. This approach increases the …
Handling persistent states in process checkpoint/restart mechanisms for HPC systems
Computer clusters are today the reference architecture for high-performance computing. The
large number of nodes in these systems induces a high failure rate. This makes fault …
Fault tolerant file models for MPI-IO parallel file systems
A Calderón, F García-Carballeira, F Isailǎ… - Recent Advances in …, 2007 - Springer
Parallelism in file systems is obtained by using several independent server nodes
supporting one or more secondary storage devices. This approach increases the …
Handling Persistent States in Process Checkpoint/Restart Mechanisms for HPC Systems
Computer clusters are today the reference architecture for high-performance computing. The
large number of nodes in these systems induces a high failure rate. This makes fault …
[PDF][PDF] Handling Persistent States in Process Checkpoint/Restart Mechanisms for HPC Systems
PRALC Morin - 2008 - academia.edu
Computer clusters are today the reference architecture for high-performance computing. The
large number of nodes in these systems induces a high failure rate. This makes fault …
[CITATION][C] FA-MPI: fault-aware MPI specification and concept of operations
A Skjellum, PV Bangalore, YS Dandass - University of Alabama at Birmingham, Tech …, 2012