HPAS: An HPC performance anomaly suite for reproducing performance variations
Modern high performance computing (HPC) systems, including supercomputers, routinely
suffer from substantial performance variations. The same application with the same input …
A methodology for comparing the reliability of GPU-based and CPU-based HPCs
Today, GPUs are widely used as coprocessors/accelerators in High-Performance
Heterogeneous Computing due to their many advantages. However, many researches …
Resiliency of HPC interconnects: A case study of interconnect failures and recovery in Blue Waters
Availability of the interconnection network in high-performance computing (HPC) systems is
fundamental to sustaining the continuous execution of applications at scale. When failures …
Holistic measurement-driven system assessment
In high-performance computing systems, application performance and throughput are
dependent on a complex interplay of hardware and software subsystems and variable …
Study of interconnect errors, network congestion, and applications characteristics for throttle prediction on a large scale HPC system
Today's High Performance Computing (HPC) systems contain thousands of nodes
which work together to provide performance in the order of petaflops. The performance of …
Sequence mining and property verification for fault-localization in Simulink models
S Aloui Dkhil, MT Bennani, M Tekaya… - Theory and Applications …, 2020 - Springer
This paper introduces a novel approach for diagnosing automotive systems and identifying
faults at design-time, based on Sequence Mining and Property Verification for Fault …
Assessing dependability of emergent large-scale autonomous systems in the wild
S Jha - 2021 - ideals.illinois.edu
Emergent computer systems in transportation, healthcare, and enterprise systems are
increasingly adopting data-driven techniques using machine learning and artificial …
Resiliency of high-performance computing systems: A fault-injection-based characterization of the high-speed network in the Blue Waters testbed
SS Tang - 2018 - ideals.illinois.edu
Supercomputers have played an essential role in the progress of science and engineering
research. As the high-performance computing (HPC) community moves towards the next …
Fault injections on mission-critical computer systems
LR Devnani - 2018 - ideals.illinois.edu
This thesis presents two unique sets of fault injections on mission-critical computer systems
with the goal of (1) understanding the impact of faults, errors and failures, and (2) evaluating …
Supporting Failure Analysis with Discoverable Annotated Log Datasets.
S Leak, A Greiner, JM Brandt, AC Gentile - 2018 - osti.gov
Detection, characterization, and mitigation of faults on supercomputers is complicated by the
large variety of interacting subsystems. Failures often manifest as vague observations like" …