Improving scalability of application-level checkpoint-recovery by reducing checkpoint sizes
The execution times of large-scale parallel applications on today's multi/many-core
systems are usually longer than the mean time between failures. Therefore, parallel …
Two-level incremental checkpoint recovery scheme for reducing system total overheads
H Li, L Pang, Z Wang - PloS one, 2014 - journals.plos.org
Long-running applications are often subject to failures. Once failures occur, it will lead to
unacceptable system overheads. The checkpoint technology is used to reduce the losses in …
Failure avoidance in MPI applications using an application-level approach
Execution times of large-scale computational science and engineering parallel applications
are usually longer than the mean-time-between-failures. For this reason, hardware failures …
Reducing the overhead of an MPI application-level migration approach
Process migration provides many benefits for parallel environments including dynamic load
balance, data access locality, or fault tolerance. This work proposes a solution that reduces …
Reducing application-level checkpoint file sizes: Towards scalable fault tolerance solutions
G Rodriguez, MJ Martín… - 2012 IEEE 10th …, 2012 - ieeexplore.ieee.org
Systems intended for the execution of long-running parallel applications require fault
tolerant capabilities, since the probability of failure increases with the execution time and the …
I/O optimization in the checkpointing of OpenMP parallel applications
Despite the increasing popularity of shared-memory systems, there is a lack of tools for
providing fault tolerance support to shared-memory applications. Checkpointing is one of …
Extending an application-level checkpointing tool to provide fault tolerance support to OpenMP applications
Despite the increasing popularity of shared-memory systems, there is a lack of tools for
providing fault tolerance support to shared-memory applications. CPPC (ComPiler for …
Compiler-assisted checkpointing of parallel codes: The Cetus and LLVM experience
With the evolution of high-performance computing, parallel applications have developed an
increasing necessity for fault tolerance, most commonly provided by checkpoint and restart …
Improving an MPI application-level migration approach through checkpoint file splitting
M Rodríguez, I Cores, P González… - 2014 IEEE 26th …, 2014 - ieeexplore.ieee.org
Traditionally used for load balancing, process migration has been gaining popularity in the
fault tolerance context. Recently, checkpoint-based migration has been proposed to …
A new parallel recomputing code design methodology for fast failure recovery
Y Du, Y Tang, X Xie - Computers & Electrical Engineering, 2013 - Elsevier
As the size of large-scale computer systems increases, their mean-time-between-failures are
becoming significantly shorter than the execution time of many current scientific applications …