Proactive process-level live migration in HPC environments

Fault tolerance in petascale/exascale systems: Current knowledge, challenges and research opportunities

F Cappello - The International Journal of High Performance …, 2009 - journals.sagepub.com

The emergence of petascale systems and the promise of future exascale systems have
reinvigorated the community interest in how to manage failures in such systems and ensure …

Cited by 308 - Related articles - All 4 versions

Software fault tolerance in real-time systems: Identifying the future research questions

F Reghenzani, Z Guo, W Fornaciari - ACM Computing Surveys, 2023 - dl.acm.org

Tolerating hardware faults in modern architectures is becoming a prominent problem due to
the miniaturization of the hardware components, their increasing complexity, and the …

Cited by 24 - Related articles - All 5 versions

Informed Haar-like features improve pedestrian detection

S Zhang, C Bauckhage… - Proceedings of the IEEE …, 2014 - cv-foundation.org

We propose a simple yet effective detector for pedestrian detection. The basic idea is to
incorporate common sense and everyday knowledge into the design of simple and …

Cited by 414 - Related articles - All 10 versions

Toward exascale resilience

F Cappello, A Geist, B Gropp, L Kale… - … Journal of High …, 2009 - journals.sagepub.com

Over the past few years resilience has become a major issue for high-performance
computing (HPC) systems, in particular in the perspective of large petascale systems and …

Cited by 484 - Related articles - All 14 versions

Post-failure recovery of MPI communication capability: Design and rationale

W Bland, A Bouteiller, T Herault… - … Journal of High …, 2013 - journals.sagepub.com

As supercomputers are entering an era of massive parallelism where the frequency of faults
is increasing, the MPI Standard remains distressingly vague on the consequence of failures …

Cited by 276 - Related articles - All 8 versions

Live migration of virtual machine based on full system trace and replay

H Liu, H Jin, X Liao, L Hu, C Yu - Proceedings of the 18th ACM …, 2009 - dl.acm.org

Live migration of virtual machines (VM) across distinct physical hosts provides a significant
new benefit for administrators of data centers and clusters. Previous migration schemes …

Cited by 379 - Related articles - All 9 versions

Desh: deep learning for system health prediction of lead times to failure in HPC

A Das, F Mueller, C Siegel, A Vishnu - Proceedings of the 27th …, 2018 - dl.acm.org

Today's large-scale supercomputers encounter faults on a daily basis. Exascale systems are
likely to experience even higher fault rates due to increased component count and density …

Cited by 117 - Related articles - All 4 versions

From patches to honey-patches: Lightweight attacker misdirection, deception, and disinformation

F Araujo, KW Hamlen, S Biedermann… - Proceedings of the …, 2014 - dl.acm.org

Traditional software security patches often have the unfortunate side-effect of quickly alerting
attackers that their attempts to exploit patched vulnerabilities have failed. Attackers greatly …

Cited by 154 - Related articles - All 11 versions

Fault prediction under the microscope: A closer look into HPC systems

A Gainaru, F Cappello, M Snir… - SC'12: Proceedings of …, 2012 - ieeexplore.ieee.org

A large percentage of computing capacity in today's large high-performance computing
systems is wasted because of failures. Consequently current research is focusing on …

Cited by 179 - Related articles - All 14 versions

Proactive fault tolerance using preemptive migration

C Engelmann, GR Vallee, T Naughton… - 2009 17th Euromicro …, 2009 - ieeexplore.ieee.org

Proactive fault tolerance (FT) in high-performance computing is a concept that prevents
compute node failures from impacting running parallel applications by preemptively …

Cited by 155 - Related articles - All 17 versions