Anomaly detection and failure root cause analysis in (micro) service-based cloud applications: A survey

J Soldani, A Brogi - ACM Computing Surveys (CSUR), 2022 - dl.acm.org
The proliferation of services and service interactions within microservices and cloud-native
applications makes it harder to detect failures and to identify their possible root causes …

AI for IT operations (AIOps) on cloud platforms: Reviews, opportunities and challenges

Q Cheng, D Sahoo, A Saha, W Yang, C Liu… - arXiv preprint arXiv …, 2023 - arxiv.org
Artificial Intelligence for IT operations (AIOps) aims to combine the power of AI with the big
data generated by IT Operations processes, particularly in cloud infrastructures, to provide …

Robust anomaly detection for multivariate time series through stochastic recurrent neural network

Y Su, Y Zhao, C Niu, R Liu, W Sun, D Pei - Proceedings of the 25th ACM …, 2019 - dl.acm.org
Industry devices (i.e., entities) such as server machines, spacecrafts, engines, etc., are
typically monitored with multivariate time series, whose anomaly detection is critical for an …

Recommending root-cause and mitigation steps for cloud incidents using large language models

T Ahmed, S Ghosh, C Bansal… - 2023 IEEE/ACM 45th …, 2023 - ieeexplore.ieee.org
Incident management for cloud services is a complex process involving several steps and
has a huge impact on both service health and developer productivity. On-call engineers …

Profit prediction using ARIMA, SARIMA and LSTM models in time series forecasting: A comparison

UM Sirisha, MC Belavagi, G Attigeri - IEEE Access, 2022 - ieeexplore.ieee.org
Time series forecasting using historical data is significantly important nowadays. Many fields
such as finance, industries, healthcare, and meteorology use it. Profit analysis using …

Xpert: Empowering incident management with query recommendations via large language models

Y Jiang, C Zhang, S He, Z Yang, M Ma, S Qin… - Proceedings of the …, 2024 - dl.acm.org
Large-scale cloud systems play a pivotal role in modern IT infrastructure. However, incidents
occurring within these systems can lead to service disruptions and adversely affect user …

A causality mining and knowledge graph based method of root cause diagnosis for performance anomaly in cloud applications

J Qiu, Q Du, K Yin, SL Zhang, C Qian - Applied Sciences, 2020 - mdpi.com
With the development of cloud computing technology, the microservice architecture (MSA)
has become a prevailing application architecture in cloud-native applications. Many user …

Automatic root cause analysis via large language models for cloud incidents

Y Chen, H Xie, M Ma, Y Kang, X Gao, L Shi… - Proceedings of the …, 2024 - dl.acm.org
Ensuring the reliability and availability of cloud services necessitates efficient root cause
analysis (RCA) for cloud incidents. Traditional RCA methods, which rely on manual …

Root-cause metric location for microservice systems via log anomaly detection

L Wang, N Zhao, J Chen, P Li… - … conference on web …, 2020 - ieeexplore.ieee.org
Microservice systems are typically fragile and failures are inevitable in them due to their
complexity and large scale. However, it is challenging to localize the root-cause metric due …

How to fight production incidents? An empirical study on a large-scale cloud service

S Ghosh, M Shetty, C Bansal, S Nath - … of the 13th Symposium on Cloud …, 2022 - dl.acm.org
Production incidents in today's large-scale cloud services can be extremely expensive in
terms of customer impacts and engineering resources required to mitigate them. Despite …