Harnessing federated learning for anomaly detection in supercomputer nodes

E Farooq, M Milano, A Borghesi - Future Generation Computer Systems, 2024 - Elsevier
High-performance computing (HPC) systems are a crucial component of modern society,
with a significant impact in areas ranging from economics to scientific research, thanks to …

Learning anomalies from graph: predicting compute node failures on HPC clusters

JM Rozanec, R Krumpak, M Molan… - Northern Lights Deep … - openreview.net
Today, high-performance computing (HPC) systems play a crucial role in advancing artificial
intelligence. Nevertheless, the estimated global data center electricity consumption in 2022 …

Predicting Compute Node Unavailability in HPC: A Graph-Based Machine Learning Approach

R Krumpak, JM Rozanec, M Molan, M Angelinelli… - conferences.computer.org
As high-performance computing (HPC) systems advance towards Exascale computing, their
size and complexity increase, introducing new maintenance challenges. Modern HPC …