Fault tolerance in petascale/exascale systems: Current knowledge, challenges and research opportunities
F Cappello - The International Journal of High Performance …, 2009 - journals.sagepub.com
The emergence of petascale systems and the promise of future exascale systems have
reinvigorated the community interest in how to manage failures in such systems and ensure …
Software fault tolerance in real-time systems: Identifying the future research questions
Tolerating hardware faults in modern architectures is becoming a prominent problem due to
the miniaturization of the hardware components, their increasing complexity, and the …
Informed haar-like features improve pedestrian detection
S Zhang, C Bauckhage… - Proceedings of the IEEE …, 2014 - cv-foundation.org
We propose a simple yet effective detector for pedestrian detection. The basic idea is to
incorporate common sense and everyday knowledge into the design of simple and …
Toward exascale resilience
Over the past few years resilience has become a major issue for high-performance
computing (HPC) systems, in particular in the perspective of large petascale systems and …
Post-failure recovery of MPI communication capability: Design and rationale
As supercomputers are entering an era of massive parallelism where the frequency of faults
is increasing, the MPI Standard remains distressingly vague on the consequence of failures …
Live migration of virtual machine based on full system trace and replay
Live migration of virtual machines (VM) across distinct physical hosts provides a significant
new benefit for administrators of data centers and clusters. Previous migration schemes …
Desh: deep learning for system health prediction of lead times to failure in hpc
Today's large-scale supercomputers encounter faults on a daily basis. Exascale systems are
likely to experience even higher fault rates due to increased component count and density …
From patches to honey-patches: Lightweight attacker misdirection, deception, and disinformation
Traditional software security patches often have the unfortunate side-effect of quickly alerting
attackers that their attempts to exploit patched vulnerabilities have failed. Attackers greatly …
Fault prediction under the microscope: A closer look into HPC systems
A large percentage of computing capacity in today's large high-performance computing
systems is wasted because of failures. Consequently current research is focusing on …
Proactive fault tolerance using preemptive migration
Proactive fault tolerance (FT) in high-performance computing is a concept that prevents
compute node failures from impacting running parallel applications by preemptively …