Automated application-level checkpointing of MPI programs
G Bronevetsky, D Marques, K Pingali… - Proceedings of the ninth …, 2003 - dl.acm.org
The running times of many computational science applications, such as protein-folding
using ab initio methods, are much longer than the mean-time-to-failure of high-performance …
Application-level checkpointing for shared memory programs
G Bronevetsky, D Marques, K Pingali, P Szwed… - ACM SIGPLAN …, 2004 - dl.acm.org
Trends in high-performance computing are making it necessary for long-running
applications to tolerate hardware faults. The most commonly used approach is checkpoint …
Recent advances in checkpoint/recovery systems
G Bronevetsky, R Fernandes… - … Parallel & Distributed …, 2006 - ieeexplore.ieee.org
Checkpoint and recovery (CPR) systems have many uses in high-performance computing.
Because of this, many developers have implemented it, by hand, into their applications. One …
Design pattern mining enhanced by machine learning
R Ferenc, A Beszedes, L Fulop… - 21st IEEE international …, 2005 - ieeexplore.ieee.org
Design patterns present good solutions to frequently occurring problems in object-oriented
software design. Thus their correct application in a system's design may significantly …
Implementation and evaluation of a scalable application-level checkpoint-recovery scheme for MPI programs
M Schulz, G Bronevetsky, R Fernandes… - SC'04: Proceedings …, 2004 - ieeexplore.ieee.org
The running times of many computational science applications are much longer than the
mean-time-to-failure of current high-performance computing platforms. To run to completion …
A job pause service under LAM/MPI+BLCR for transparent fault tolerance
Checkpoint/restart (C/R) has become a requirement for long-running jobs in large-scale
clusters due to a mean-time-to-failure (MTTF) in the order of hours. After a failure, C/R …
Reliability challenges in large systems
Clusters built from commodity PCs dominate high-performance computing today, with
systems containing thousands of processors now being deployed. As node counts for multi …
Scalable diskless checkpointing for large parallel systems
C Lu - 2005 - ideals.illinois.edu
Parallel scientific applications deal with machine unreliability by periodic checkpointing, in
which all processes coordinate to dump memory to stable storage simultaneously. However …
C3: A System for Automating Application-Level Checkpointing of MPI Programs
G Bronevetsky, D Marques, K Pingali… - Languages and Compilers …, 2004 - Springer
Fault-tolerance is becoming necessary on high-performance platforms. Checkpointing
techniques make programs fault-tolerant by saving their state periodically and restoring this …
[BOOK][B] Coordinated checkpoint/restart process fault tolerance for MPI applications on HPC systems
J Hursey - 2010 - search.proquest.com
Scientists use advanced computing techniques to assist in answering the complex questions
at the forefront of discovery. The High Performance Computing (HPC) scientific applications …