AI for IT operations (AIOps) on cloud platforms: Reviews, opportunities and challenges

Q Cheng, D Sahoo, A Saha, W Yang, C Liu… - arXiv preprint arXiv …, 2023 - arxiv.org
Artificial Intelligence for IT operations (AIOps) aims to combine the power of AI with the big
data generated by IT Operations processes, particularly in cloud infrastructures, to provide …

Xpert: Empowering incident management with query recommendations via large language models

Y Jiang, C Zhang, S He, Z Yang, M Ma, S Qin… - Proceedings of the …, 2024 - dl.acm.org
Large-scale cloud systems play a pivotal role in modern IT infrastructure. However, incidents
occurring within these systems can lead to service disruptions and adversely affect user …

How to fight production incidents? An empirical study on a large-scale cloud service

S Ghosh, M Shetty, C Bansal, S Nath - … of the 13th Symposium on Cloud …, 2022 - dl.acm.org
Production incidents in today's large-scale cloud services can be extremely expensive in
terms of customer impacts and engineering resources required to mitigate them. Despite …

SemParser: A semantic parser for log analytics

Y Huo, Y Su, C Lee, MR Lyu - 2023 IEEE/ACM 45th …, 2023 - ieeexplore.ieee.org
Logs, being run-time information automatically generated by software, record system events
and activities with their timestamps. Before obtaining more insights into the run-time status of …

Mining root cause knowledge from cloud service incident investigations for AIOps

A Saha, SCH Hoi - Proceedings of the 44th International Conference on …, 2022 - dl.acm.org
Root Cause Analysis (RCA) of any service-disrupting incident is one of the most critical as
well as complex tasks in IT processes, especially for cloud industry leaders like Salesforce …

A survey on intelligent management of alerts and incidents in IT services

Q Yu, N Zhao, M Li, Z Li, H Wang, W Zhang… - Journal of Network and …, 2024 - Elsevier
Modern service systems are constantly improving with the development of various IT
technologies, leading to a boost in system scales and complex dependencies among …

Practical Anomaly Detection over Multivariate Monitoring Metrics for Online Services

J Liu, T Yang, Z Chen, Y Su, C Feng… - 2023 IEEE 34th …, 2023 - ieeexplore.ieee.org
As modern software systems continue to grow in terms of complexity and volume, anomaly
detection on multivariate monitoring metrics, which profile systems' health status, becomes …

AutoTSG: Learning and synthesis for incident troubleshooting

M Shetty, C Bansal, SP Upadhyayula… - Proceedings of the 30th …, 2022 - dl.acm.org
Incident management is a key aspect of operating large-scale cloud services. To aid with
faster and efficient resolution of incidents, engineering teams document frequent …

LogVM: Variable semantics miner for log messages

Y Huo, Y Su, M Lyu - 2022 IEEE International Symposium on …, 2022 - ieeexplore.ieee.org
Modern automated log analytics rely on log events without paying attention to variables.
However, variables, such as the return code (e.g., “404”) in logs, are noteworthy for their …

MTL-TRANSFER: Leveraging Multi-task Learning and Transferred Knowledge for Improving Fault Localization and Program Repair

X Wang, H Yu, X Meng, H Cao, H Zhang… - ACM Transactions on …, 2024 - dl.acm.org
Fault localization (FL) and automated program repair (APR) are two main tasks of automatic
software debugging. Compared with traditional methods, deep learning-based approaches …