Fault tolerance in petascale/exascale systems: Current knowledge, challenges and research opportunities

F Cappello - The International Journal of High Performance …, 2009 - journals.sagepub.com
The emergence of petascale systems and the promise of future exascale systems have
reinvigorated the community interest in how to manage failures in such systems and ensure …

Software fault tolerance in real-time systems: Identifying the future research questions

F Reghenzani, Z Guo, W Fornaciari - ACM Computing Surveys, 2023 - dl.acm.org
Tolerating hardware faults in modern architectures is becoming a prominent problem due to
the miniaturization of the hardware components, their increasing complexity, and the …

Informed haar-like features improve pedestrian detection

S Zhang, C Bauckhage… - Proceedings of the IEEE …, 2014 - cv-foundation.org
We propose a simple yet effective detector for pedestrian detection. The basic idea is to
incorporate common sense and everyday knowledge into the design of simple and …

Toward exascale resilience

F Cappello, A Geist, B Gropp, L Kale… - … Journal of High …, 2009 - journals.sagepub.com
Over the past few years resilience has became a major issue for high-performance
computing (HPC) systems, in particular in the perspective of large petascale systems and …

Post-failure recovery of MPI communication capability: Design and rationale

W Bland, A Bouteiller, T Herault… - … Journal of High …, 2013 - journals.sagepub.com
As supercomputers are entering an era of massive parallelism where the frequency of faults
is increasing, the MPI Standard remains distressingly vague on the consequence of failures …

Live migration of virtual machine based on full system trace and replay

H Liu, H Jin, X Liao, L Hu, C Yu - Proceedings of the 18th ACM …, 2009 - dl.acm.org
Live migration of virtual machines (VM) across distinct physical hosts provides a significant
new benefit for administrators of data centers and clusters. Previous migration schemes …

Desh: deep learning for system health prediction of lead times to failure in hpc

A Das, F Mueller, C Siegel, A Vishnu - Proceedings of the 27th …, 2018 - dl.acm.org
Today's large-scale supercomputers encounter faults on a daily basis. Exascale systems are
likely to experience even higher fault rates due to increased component count and density …

From patches to honey-patches: Lightweight attacker misdirection, deception, and disinformation

F Araujo, KW Hamlen, S Biedermann… - Proceedings of the …, 2014 - dl.acm.org
Traditional software security patches often have the unfortunate side-effect of quickly alerting
attackers that their attempts to exploit patched vulnerabilities have failed. Attackers greatly …

Fault prediction under the microscope: A closer look into HPC systems

A Gainaru, F Cappello, M Snir… - SC'12: Proceedings of …, 2012 - ieeexplore.ieee.org
A large percentage of computing capacity in today's large high-performance computing
systems is wasted because of failures. Consequently current research is focusing on …

Proactive fault tolerance using preemptive migration

C Engelmann, GR Vallee, T Naughton… - 2009 17th Euromicro …, 2009 - ieeexplore.ieee.org
Proactive fault tolerance (FT) in high-performance computing is a concept that prevents
compute node failures from impacting running parallel applications by preemptively …