Towards comprehensive dependability-driven resource use and message log-analysis for HPC...

IT infrastructure anomaly detection and failure handling: A systematic literature review focusing on datasets, log preprocessing, machine & deep learning approaches …

DA Bhanage, AV Pawar, K Kotecha - IEEE Access, 2021 - ieeexplore.ieee.org

Nowadays, reliability assurance is crucial in components of IT infrastructures. Unavailability
of any element or connection results in downtime and triggers monetary and performance …

Cited by 30; 2 versions

Landscape of automated log analysis: A systematic literature review and mapping study

Ł Korzeniowski, K Goczyła - IEEE Access, 2022 - ieeexplore.ieee.org

Logging is a common practice in software engineering to provide insights into working
systems. The main uses of log files have always been failure identification and root cause …

Cited by 24; 5 versions

Time machine: generative real-time model for failure (and lead time) prediction in HPC systems

KA Alharthi, A Jhumka, S Di, L Gui… - 2023 53rd Annual …, 2023 - ieeexplore.ieee.org

High Performance Computing (HPC) systems generate a large amount of unstructured/
alphanumeric log messages that capture the health state of their components. Due to their …

Cited by 5; 7 versions

Communication and performance evaluation of 3-ary n-cubes onto network-on-chips

W Fan, J Fan, Y Zhang, Z Han… - Science China …, 2022 - search.proquest.com

Network-on-chip (NoC) has the advantages of highly integrated, ultralow-power, low cost
and small volume, and it has become one of the mainstreams of VLSI system design [1, 2] …

Cited by 19; 2 versions

Clairvoyant: a log-based transformer-decoder for failure prediction in large-scale systems

KA Alharthi, A Jhumka, S Di, F Cappello - Proceedings of the 36th ACM …, 2022 - dl.acm.org

System failures are expected to be frequent in the exascale era such as current Petascale
systems. The health of such systems is usually determined from challenging analysis of …

Cited by 6; 3 versions

Sentiment analysis based error detection for large-scale systems

KA Alharthi, A Jhumka, S Di… - 2021 51st Annual …, 2021 - ieeexplore.ieee.org

Today's large-scale systems such as High Performance Computing (HPC) Systems are
designed/utilized towards exascale computing, inevitably decreasing its reliability due to the …

Cited by 11; 7 versions

Bibliometric survey of IT infrastructure management to avoid failure conditions

DA Bhanage, AV Pawar - Information Discovery and Delivery, 2021 - emerald.com

Purpose: The purpose of this paper is to present the bibliometric study of articles IT
Infrastructure Management to Avoid Failure Conditions. As in today's era of IT Industries, IT …

Cited by 13

A survey of log-correlation tools for failure diagnosis and prediction in cluster systems

E Chuah, A Jhumka, M Malek, N Suri - IEEE Access, 2022 - ieeexplore.ieee.org

System logs are the first source of information available to system designers to analyze and
troubleshoot their cluster systems. For example, High-Performance Computing (HPC) …

Cited by 2; 6 versions

Failure diagnosis for cluster systems using partial correlations

E Chuah, A Jhumka, S Alt… - 2021 IEEE Intl Conf on …, 2021 - ieeexplore.ieee.org

Failures have expensive implications in HPC (High-Performance Computing) systems.
Consequently, effective diagnosis of system failures is desired to help improve system …

Cited by 2; 8 versions

The terminator: an AI-based framework to handle dependability threats in large-scale distributed systems

KA Alharthi - 2023 - wrap.warwick.ac.uk

With the advent of resource-hungry applications such as scientific simulations and artificial
intelligence (AI), the need for high-performance computing (HPC) infrastructure is becoming …

Cited by 1; 2 versions