Anomaly detection and failure root cause analysis in (micro)service-based cloud applications: A survey

J Soldani, A Brogi - ACM Computing Surveys (CSUR), 2022 - dl.acm.org
The proliferation of services and service interactions within microservices and cloud-native
applications makes it harder to detect failures and to identify their possible root causes …

Sage: practical and scalable ML-driven performance debugging in microservices

Y Gan, M Liang, S Dev, D Lo, C Delimitrou - Proceedings of the 26th …, 2021 - dl.acm.org
Cloud applications are increasingly shifting from large monolithic services to complex
graphs of loosely-coupled microservices. Despite the advantages of modularity and …

Root cause analysis of failures in microservices through causal discovery

A Ikram, S Chakraborty, S Mitra… - Advances in …, 2022 - proceedings.neurips.cc
Most cloud applications use a large number of smaller sub-components (called
microservices) that interact with each other in the form of a complex graph to provide the …

Microscope: Pinpoint performance issues with causal graphs in micro-service environments

JJ Lin, P Chen, Z Zheng - … , ICSOC 2018, Hangzhou, China, November 12 …, 2018 - Springer
Driven by the emerging business models (e.g., digital sales) and IT technologies (e.g., DevOps
and Cloud computing), the architecture of software is shifting from monolithic to microservice …

Localizing failure root causes in a microservice through causality inference

Y Meng, S Zhang, Y Sun, R Zhang, Z Hu… - 2020 IEEE/ACM 28th …, 2020 - ieeexplore.ieee.org
An increasing number of Internet applications are applying microservice architecture due to
its flexibility and clear logic. The stability of microservice is thus vitally important for these …

A spatiotemporal deep learning approach for unsupervised anomaly detection in cloud systems

Z He, P Chen, X Li, Y Wang, G Yu… - … on Neural Networks …, 2020 - ieeexplore.ieee.org
Anomaly detection is a critical task for maintaining the performance of a cloud system. Using
data-driven methods to address this issue is the mainstream in recent years. However, due …

MicroRank: End-to-end latency issue localization with extended spectrum analysis in microservice environments

G Yu, P Chen, H Chen, Z Guan, Z Huang… - Proceedings of the Web …, 2021 - dl.acm.org
With the advantages of flexible scalability and fast delivery, microservice has become a
popular software architecture in the modern IT industry. However, the explosion in the …

Groot: An event-graph-based approach for root cause analysis in industrial settings

H Wang, Z Wu, H Jiang, Y Huang… - 2021 36th IEEE/ACM …, 2021 - ieeexplore.ieee.org
For large-scale distributed systems, it is crucial to efficiently diagnose the root causes of
incidents to maintain high system availability. The recent development of microservice …

MicroHECL: High-efficient root cause localization in large-scale microservice systems

D Liu, C He, X Peng, F Lin, C Zhang… - 2021 IEEE/ACM …, 2021 - ieeexplore.ieee.org
Availability issues of industrial microservice systems (e.g., drop of successfully placed orders
and processed transactions) directly affect the running of the business. These issues are …

Failure diagnosis in microservice systems: A comprehensive survey and analysis

S Zhang, S Xia, W Fan, B Shi, X Xiong, Z Zhong… - arXiv preprint arXiv …, 2024 - arxiv.org
Modern microservice systems have gained widespread adoption due to their high
scalability, flexibility, and extensibility. However, the characteristics of independent …