IT infrastructure anomaly detection and failure handling: A systematic literature review focusing on datasets, log preprocessing, machine & deep learning approaches …
DA Bhanage, AV Pawar, K Kotecha - IEEE Access, 2021 - ieeexplore.ieee.org
Nowadays, reliability assurance is crucial in components of IT infrastructures. Unavailability
of any element or connection results in downtime and triggers monetary and performance …
Landscape of automated log analysis: A systematic literature review and mapping study
Ł Korzeniowski, K Goczyła - IEEE Access, 2022 - ieeexplore.ieee.org
Logging is a common practice in software engineering to provide insights into working
systems. The main uses of log files have always been failure identification and root cause …
Time machine: generative real-time model for failure (and lead time) prediction in HPC systems
High Performance Computing (HPC) systems generate a large amount of unstructured/
alphanumeric log messages that capture the health state of their components. Due to their …
Communication and performance evaluation of 3-ary n-cubes onto network-on-chips
Network-on-chip (NoC) has the advantages of highly integrated, ultralow-power, low cost
and small volume, and it has become one of the mainstreams of VLSI system design [1, 2] …
Clairvoyant: a log-based transformer-decoder for failure prediction in large-scale systems
System failures are expected to be frequent in the exascale era such as current Petascale
systems. The health of such systems is usually determined from challenging analysis of …
Sentiment analysis based error detection for large-scale systems
Today's large-scale systems such as High Performance Computing (HPC) Systems are
designed/utilized towards exascale computing, inevitably decreasing its reliability due to the …
Bibliometric survey of IT infrastructure management to avoid failure conditions
DA Bhanage, AV Pawar - Information Discovery and Delivery, 2021 - emerald.com
Purpose The purpose of this paper is to present the bibliometric study of articles IT
Infrastructure Management to Avoid Failure Conditions. As in today's era of IT Industries, IT …
A survey of log-correlation tools for failure diagnosis and prediction in cluster systems
System logs are the first source of information available to system designers to analyze and
troubleshoot their cluster systems. For example, High-Performance Computing (HPC) …
Failure diagnosis for cluster systems using partial correlations
Failures have expensive implications in HPC (High-Performance Computing) systems.
Consequently, effective diagnosis of system failures is desired to help improve system …
The terminator: an AI-based framework to handle dependability threats in large-scale distributed systems
KA Alharthi - 2023 - wrap.warwick.ac.uk
With the advent of resource-hungry applications such as scientific simulations and artificial
intelligence (AI), the need for high-performance computing (HPC) infrastructure is becoming …