M100 ExaData: a data collection campaign on the CINECA's Marconi100 Tier-0 supercomputer
Supercomputers are the most powerful computing machines available to society. They play
a central role in economic, industrial, and societal development. While they are used by …
Prodigy: Towards unsupervised anomaly detection in production HPC systems
Performance variations caused by anomalies in modern High Performance Computing
(HPC) systems lead to decreased efficiency, impaired application performance, and …
A federated learning approach for anomaly detection in high performance computing
E Farooq, A Borghesi - 2023 IEEE 35th International …, 2023 - ieeexplore.ieee.org
High Performance Computing (HPC) systems are complex machines that need to be
operated at their maximum potential to recoup their investment cost and to mitigate their …
Graph neural networks for anomaly anticipation in HPC systems
M Molan, J Ahmed Khan, A Borghesi… - Companion of the 2023 …, 2023 - dl.acm.org
In this paper, we explore the use of Graph Neural Networks (GNNs) for anomaly anticipation
in high performance computing (HPC) systems. We propose a GNN-based approach that …
GRAAFE: GRaph anomaly anticipation framework for exascale HPC systems
The main limitation of applying predictive tools to large-scale supercomputers is the
complexity of deploying Artificial Intelligence (AI) services in production and modeling …
Non-pattern-based anomaly detection in time-series
Anomaly detection across critical infrastructures is not only a key step towards detecting
threats but also gives early warnings of the likelihood of potential cyber-attacks, faults, or …
DeepHYDRA: A Hybrid Deep Learning and DBSCAN-Based Approach to Time-Series Anomaly Detection in Dynamically-Configured Systems
FK Stehle, W Vandelli, F Zahn, G Avolio… - Proceedings of the 38th …, 2024 - dl.acm.org
Anomaly detection in distributed systems such as High-Performance Computing (HPC)
clusters is vital for early fault detection, performance optimisation, security monitoring …
Towards Practical Machine Learning Frameworks for Performance Diagnostics in Supercomputers
Supercomputers are highly sophisticated computing systems designed to handle complex
and computationally intensive tasks. Despite their tremendous efficiency, performance …
Exploring the Utility of Graph Methods in HPC Thermal Modeling
This work critically examines several approaches to temperature prediction for High-
Performance Computing (HPC) systems, focusing on component-level and holistic models …
The Graph-Massivizer Approach Toward a European Sustainable Data Center Digital Twin
M Molan, JA Khan, A Bartolini, R Turra… - 2023 IEEE 47th …, 2023 - ieeexplore.ieee.org
Modeling and understanding an expensive next-generation data center operating at a
sustainable exascale performance remains a challenge yet to solve. The paper presents the …