IT infrastructure anomaly detection and failure handling: A systematic literature review focusing on datasets, log preprocessing, machine & deep learning approaches …

DA Bhanage, AV Pawar, K Kotecha - IEEE Access, 2021 - ieeexplore.ieee.org
Nowadays, reliability assurance is crucial in components of IT infrastructures. Unavailability
of any element or connection results in downtime and triggers monetary and performance …

Landscape of automated log analysis: A systematic literature review and mapping study

Ł Korzeniowski, K Goczyła - IEEE Access, 2022 - ieeexplore.ieee.org
Logging is a common practice in software engineering to provide insights into working
systems. The main uses of log files have always been failure identification and root cause …

Time machine: generative real-time model for failure (and lead time) prediction in HPC systems

KA Alharthi, A Jhumka, S Di, L Gui… - 2023 53rd Annual …, 2023 - ieeexplore.ieee.org
High Performance Computing (HPC) systems generate a large amount of unstructured/
alphanumeric log messages that capture the health state of their components. Due to their …

Communication and performance evaluation of 3-ary n-cubes onto network-on-chips

W Fan, J Fan, Y Zhang, Z Han… - Science China …, 2022 - search.proquest.com
Network-on-chip (NoC) has the advantages of highly integrated, ultralow-power, low cost
and small volume, and it has become one of the mainstreams of VLSI system design [1, 2] …

Clairvoyant: a log-based transformer-decoder for failure prediction in large-scale systems

KA Alharthi, A Jhumka, S Di, F Cappello - Proceedings of the 36th ACM …, 2022 - dl.acm.org
System failures are expected to be frequent in the exascale era such as current Petascale
systems. The health of such systems is usually determined from challenging analysis of …

Sentiment analysis based error detection for large-scale systems

KA Alharthi, A Jhumka, S Di… - 2021 51st Annual …, 2021 - ieeexplore.ieee.org
Today's large-scale systems such as High Performance Computing (HPC) Systems are
designed/utilized towards exascale computing, inevitably decreasing its reliability due to the …

Bibliometric survey of IT infrastructure management to avoid failure conditions

DA Bhanage, AV Pawar - Information Discovery and Delivery, 2021 - emerald.com
Purpose The purpose of this paper is to present the bibliometric study of articles IT
Infrastructure Management to Avoid Failure Conditions. As in today's era of IT Industries, IT …

A survey of log-correlation tools for failure diagnosis and prediction in cluster systems

E Chuah, A Jhumka, M Malek, N Suri - IEEE Access, 2022 - ieeexplore.ieee.org
System logs are the first source of information available to system designers to analyze and
troubleshoot their cluster systems. For example, High-Performance Computing (HPC) …

Failure diagnosis for cluster systems using partial correlations

E Chuah, A Jhumka, S Alt… - 2021 IEEE Intl Conf on …, 2021 - ieeexplore.ieee.org
Failures have expensive implications in HPC (High-Performance Computing) systems.
Consequently, effective diagnosis of system failures is desired to help improve system …

The terminator: an AI-based framework to handle dependability threats in large-scale distributed systems

KA Alharthi - 2023 - wrap.warwick.ac.uk
With the advent of resource-hungry applications such as scientific simulations and artificial
intelligence (AI), the need for high-performance computing (HPC) infrastructure is becoming …