HPAS: An HPC performance anomaly suite for reproducing performance variations

E Ates, Y Zhang, B Aksar, J Brandt, VJ Leung… - Proceedings of the 48th …, 2019 - dl.acm.org
Modern high performance computing (HPC) systems, including supercomputers, routinely
suffer from substantial performance variations. The same application with the same input …

A methodology for comparing the reliability of GPU-based and CPU-based HPCs

N Cini, G Yalcin - ACM Computing Surveys (CSUR), 2020 - dl.acm.org
Today, GPUs are widely used as coprocessors/accelerators in High-Performance
Heterogeneous Computing due to their many advantages. However, many researches …

Resiliency of HPC interconnects: A case study of interconnect failures and recovery in Blue Waters

S Jha, V Formicola, C Di Martino… - … on Dependable and …, 2017 - ieeexplore.ieee.org
Availability of the interconnection network in high-performance computing (HPC) systems is
fundamental to sustaining the continuous execution of applications at scale. When failures …

Holistic measurement-driven system assessment

S Jha, J Brandt, A Gentile, Z Kalbarczyk… - 2017 IEEE …, 2017 - ieeexplore.ieee.org
In high-performance computing systems, application performance and throughput are
dependent on a complex interplay of hardware and software subsystems and variable …

Study of interconnect errors, network congestion, and applications characteristics for throttle prediction on a large scale HPC system

M Kumar, S Gupta, T Patel, M Wilder, W Shi… - Journal of Parallel and …, 2021 - Elsevier
Today's High Performance Computing (HPC) systems contain thousands of nodes
which work together to provide performance in the order of petaflops. The performance of …

Sequence mining and property verification for fault-localization in Simulink models

S Aloui Dkhil, MT Bennani, M Tekaya… - Theory and Applications …, 2020 - Springer
This paper introduces a novel approach for diagnosing automotive systems and identifying
faults at design-time, based on Sequence Mining and Property Verification for Fault …

Assessing dependability of emergent large-scale autonomous systems in the wild

S Jha - 2021 - ideals.illinois.edu
Emergent computer systems in transportation, healthcare, and enterprise systems are
increasingly adopting data-driven techniques using machine learning and artificial …

Resiliency of high-performance computing systems: A fault-injection-based characterization of the high-speed network in the Blue Waters testbed

SS Tang - 2018 - ideals.illinois.edu
Supercomputers have played an essential role in the progress of science and engineering
research. As the high-performance computing (HPC) community moves towards the next …

Fault injections on mission-critical computer systems

LR Devnani - 2018 - ideals.illinois.edu
This thesis presents two unique sets of fault injections on mission-critical computer systems
with the goal of (1) understanding the impact of faults, errors and failures, and (2) evaluating …

Supporting Failure Analysis with Discoverable Annotated Log Datasets

S Leak, A Greiner, JM Brandt, AC Gentile - 2018 - osti.gov
Detection, characterization, and mitigation of faults on supercomputers is complicated by the
large variety of interacting subsystems. Failures often manifest as vague observations like "…