M100 ExaData: a data collection campaign on the CINECA's Marconi100 Tier-0 supercomputer

A Borghesi, C Di Santi, M Molan, MS Ardebili, A Mauri… - Scientific Data, 2023 - nature.com
Supercomputers are the most powerful computing machines available to society. They play
a central role in economic, industrial, and societal development. While they are used by …

Prodigy: Towards unsupervised anomaly detection in production HPC systems

B Aksar, E Sencan, B Schwaller, O Aaziz… - Proceedings of the …, 2023 - dl.acm.org
Performance variations caused by anomalies in modern High Performance Computing
(HPC) systems lead to decreased efficiency, impaired application performance, and …

A federated learning approach for anomaly detection in high performance computing

E Farooq, A Borghesi - 2023 IEEE 35th International …, 2023 - ieeexplore.ieee.org
High Performance Computing (HPC) systems are complex machines that need to be
operated at their maximum potential to recoup their investment cost and to mitigate their …

Graph neural networks for anomaly anticipation in HPC systems

M Molan, J Ahmed Khan, A Borghesi… - Companion of the 2023 …, 2023 - dl.acm.org
In this paper, we explore the use of Graph Neural Networks (GNNs) for anomaly anticipation
in high performance computing (HPC) systems. We propose a GNN-based approach that …

GRAAFE: GRaph anomaly anticipation framework for exascale HPC systems

M Molan, MS Ardebili, JA Khan, F Beneventi… - Future Generation …, 2024 - Elsevier
The main limitation of applying predictive tools to large-scale supercomputers is the
complexity of deploying Artificial Intelligence (AI) services in production and modeling …

Non-pattern-based anomaly detection in time-series

V Tkach, A Kudin, VR Kebande, O Baranovskyi, I Kudin - Electronics, 2023 - mdpi.com
Anomaly detection across critical infrastructures is not only a key step towards detecting
threats but also gives early warnings of the likelihood of potential cyber-attacks, faults, or …

DeepHYDRA: A Hybrid Deep Learning and DBSCAN-Based Approach to Time-Series Anomaly Detection in Dynamically-Configured Systems

FK Stehle, W Vandelli, F Zahn, G Avolio… - Proceedings of the 38th …, 2024 - dl.acm.org
Anomaly detection in distributed systems such as High-Performance Computing (HPC)
clusters is vital for early fault detection, performance optimisation, security monitoring …

Towards Practical Machine Learning Frameworks for Performance Diagnostics in Supercomputers

B Aksar, E Sencan, B Schwaller, VJ Leung… - Proceedings of the First …, 2023 - dl.acm.org
Supercomputers are highly sophisticated computing systems designed to handle complex
and computationally intensive tasks. Despite their tremendous efficiency, performance …

Exploring the Utility of Graph Methods in HPC Thermal Modeling

B Guindani, M Molan, A Bartolini, L Benini - Companion of the 15th ACM …, 2024 - dl.acm.org
This work critically examines several approaches to temperature prediction for
High-Performance Computing (HPC) systems, focusing on component-level and holistic models …

The Graph-Massivizer Approach Toward a European Sustainable Data Center Digital Twin

M Molan, JA Khan, A Bartolini, R Turra… - 2023 IEEE 47th …, 2023 - ieeexplore.ieee.org
Modeling and understanding an expensive next-generation data center operating at a
sustainable exascale performance remains a challenge yet to solve. The paper presents the …