A survey on automated log analysis for reliability engineering

S He, P He, Z Chen, T Yang, Y Su, MR Lyu - ACM computing surveys …, 2021 - dl.acm.org
Logs are semi-structured text generated by logging statements in software source code. In
recent decades, software logs have become imperative in the reliability assurance …

IT infrastructure anomaly detection and failure handling: A systematic literature review focusing on datasets, log preprocessing, machine & deep learning approaches …

DA Bhanage, AV Pawar, K Kotecha - IEEE Access, 2021 - ieeexplore.ieee.org
Nowadays, reliability assurance is crucial in components of IT infrastructures. Unavailability
of any element or connection results in downtime and triggers monetary and performance …

SemParser: A semantic parser for log analytics

Y Huo, Y Su, C Lee, MR Lyu - 2023 IEEE/ACM 45th …, 2023 - ieeexplore.ieee.org
Logs, being run-time information automatically generated by software, record system events
and activities with their timestamps. Before obtaining more insights into the run-time status of …

LogKG: Log Failure Diagnosis through Knowledge Graph

Y Sui, Y Zhang, J Sun, T Xu, S Zhang… - IEEE Transactions …, 2023 - ieeexplore.ieee.org
Logs are one of the most valuable data to describe the running state of services. Failure
diagnosis through logs is crucial for service reliability and security. The current automatic log …

Quality evaluation of modern code reviews through intelligent biometric program comprehension

H Hijazi, J Duraes, R Couceiro… - IEEE Transactions …, 2022 - ieeexplore.ieee.org
Code review is an essential practice in software engineering to spot code defects in the
early stages of software development. Modern code reviews (e.g., acceptance or rejection of …

Fail through the cracks: Cross-system interaction failures in modern cloud systems

L Tang, C Bhandari, Y Zhang, A Karanika, S Ji… - Proceedings of the …, 2023 - dl.acm.org
Modern cloud systems are orchestrations of independent and interacting (sub-)systems,
each specializing in important services (e.g., data processing, storage, resource …

Fault injection analytics: A novel approach to discover failure modes in cloud-computing systems

D Cotroneo, L De Simone, P Liguori… - IEEE Transactions on …, 2020 - ieeexplore.ieee.org
Cloud computing systems fail in complex and unexpected ways due to unexpected
combinations of events and interactions between hardware and software components. Fault …

Incident-aware duplicate ticket aggregation for cloud systems

J Liu, S He, Z Chen, L Li, Y Kang… - 2023 IEEE/ACM 45th …, 2023 - ieeexplore.ieee.org
In cloud systems, incidents are potential threats to customer satisfaction and business
revenue. When customers are affected by incidents, they often request customer support …

An intelligent framework for timely, accurate, and comprehensive cloud incident detection

Y Li, X Zhang, S He, Z Chen, Y Kang, J Liu… - ACM SIGOPS …, 2022 - dl.acm.org
Cloud incidents (service interruptions or performance degradation) dramatically degrade the
reliability of large-scale cloud systems, causing customer dissatisfaction and revenue loss …

Understanding and predicting incident mitigation time

W Wang, J Chen, L Yang, H Zhang, Z Wang - Information and Software …, 2023 - Elsevier
Context: Incident management plays a significant role in online service systems. Incidents
should be mitigated as soon as possible in order to achieve high service stability. However …