Collective operations in application-level fault-tolerant MPI

G Bronevetsky, D Marques, K Pingali… - Proceedings of the ninth …, 2003 - dl.acm.org

The running times of many computational science applications, such as protein-folding
using ab initio methods, are much longer than the mean-time-to-failure of high-performance …

Cited by 298 Related articles All 23 versions

[PDF] psu.edu

Application-level checkpointing for shared memory programs

G Bronevetsky, D Marques, K Pingali, P Szwed… - ACM SIGPLAN …, 2004 - dl.acm.org

Trends in high-performance computing are making it necessary for long-running
applications to tolerate hardware faults. The most commonly used approach is checkpoint …

Cited by 178 Related articles All 17 versions

[PDF] psu.edu

Recent advances in checkpoint/recovery systems

G Bronevetsky, R Fernandes… - … Parallel & Distributed …, 2006 - ieeexplore.ieee.org

Checkpoint and recovery (CPR) systems have many uses in high-performance computing.
Because of this, many developers have implemented it, by hand, into their applications. One …

Cited by 55 Related articles All 13 versions

[PDF] u-szeged.hu

Design pattern mining enhanced by machine learning

R Ferenc, A Beszedes, L Fulop… - 21st IEEE international …, 2005 - ieeexplore.ieee.org

Design patterns present good solutions to frequently occurring problems in object-oriented
software design. Thus their correct application in a system's design may significantly …

Cited by 140 Related articles All 9 versions

[PDF] researchgate.net

Implementation and evaluation of a scalable application-level checkpoint-recovery scheme for MPI programs

M Schulz, G Bronevetsky, R Fernandes… - SC'04: Proceedings …, 2004 - ieeexplore.ieee.org

The running times of many computational science applications are much longer than the
mean-time-to-failure of current high-performance computing platforms. To run to completion …

Cited by 113 Related articles All 16 versions

[PDF] christian-engelmann.de

A job pause service under LAM/MPI+BLCR for transparent fault tolerance

C Wang, F Mueller, C Engelmann… - 2007 IEEE International …, 2007 - ieeexplore.ieee.org

Checkpoint/restart (C/R) has become a requirement for long-running jobs in large-scale
clusters due to a mean-time-to-failure (MTTF) in the order of hours. After a failure, C/R …

Cited by 115 Related articles All 21 versions

[PDF] academia.edu

Reliability challenges in large systems

DA Reed, CL Mendes - Future Generation Computer Systems, 2006 - Elsevier

Clusters built from commodity PCs dominate high-performance computing today, with
systems containing thousands of processors now being deployed. As node counts for multi …

Cited by 90 Related articles All 6 versions

[PDF] illinois.edu

Scalable diskless checkpointing for large parallel systems

C Lu - 2005 - ideals.illinois.edu

Parallel scientific applications deal with machine unreliability by periodic checkpointing, in
which all processes coordinate to dump memory to stable storage simultaneously. However …

Cited by 77 Related articles All 7 versions

[PDF] academia.edu

C³: A System for Automating Application-Level Checkpointing of MPI Programs

G Bronevetsky, D Marques, K Pingali… - Languages and Compilers …, 2004 - Springer

Fault-tolerance is becoming necessary on high-performance platforms. Checkpointing
techniques make programs fault-tolerant by saving their state periodically and restoring this …

Cited by 68 Related articles All 15 versions

[PDF] proquest.com

[BOOK][B] Coordinated checkpoint/restart process fault tolerance for MPI applications on HPC systems

J Hursey - 2010 - search.proquest.com

Scientists use advanced computing techniques to assist in answering the complex questions
at the forefront of discovery. The High Performance Computing (HPC) scientific applications …

Cited by 45 Related articles All 6 versions