Anomaly detection and failure root cause analysis in (micro) service-based cloud applications: A survey

J Soldani, A Brogi - ACM Computing Surveys (CSUR), 2022 - dl.acm.org
The proliferation of services and service interactions within microservices and cloud-native
applications makes it harder to detect failures and to identify their possible root causes …

Performance anomaly detection and bottleneck identification

O Ibidunmoye, F Hernández-Rodriguez… - ACM Computing Surveys …, 2015 - dl.acm.org
In order to meet stringent performance requirements, system administrators must effectively
detect undesirable performance behaviours, identify potential root causes, and take …

A survey of AIOps methods for failure management

P Notaro, J Cardoso, M Gerndt - ACM Transactions on Intelligent …, 2021 - dl.acm.org
Modern society is increasingly moving toward complex and distributed computing systems.
The increase in scale and complexity of these systems challenges O&M teams that perform …

Detecting large-scale system problems by mining console logs

W Xu, L Huang, A Fox, D Patterson… - Proceedings of the ACM …, 2009 - dl.acm.org
Surprisingly, console logs rarely help operators detect problems in large-scale datacenter
services, for they often consist of the voluminous intermixing of messages from many …

Structured comparative analysis of systems logs to diagnose performance problems

K Nagaraj, C Killian, J Neville - 9th USENIX Symposium on Networked …, 2012 - usenix.org
Diagnosis and correction of performance issues in modern, large-scale distributed systems
can be a daunting task, since a single developer is unlikely to be familiar with the entire …

A survey on load testing of large-scale software systems

ZM Jiang, AE Hassan - IEEE Transactions on Software …, 2015 - ieeexplore.ieee.org
Many large-scale software systems must service thousands or millions of concurrent
requests. These systems must be load tested to ensure that they can function correctly under …

Characterizing logging practices in open-source software

D Yuan, S Park, Y Zhou - 2012 34th international conference …, 2012 - ieeexplore.ieee.org
Software logging is a conventional programming practice. While its efficacy is often
important for users and developers to understand what has happened in the production …

Sherlog: error diagnosis by connecting clues from run-time logs

D Yuan, H Mai, W Xiong, L Tan, Y Zhou… - Proceedings of the …, 2010 - dl.acm.org
Computer systems often fail due to many factors such as software bugs or administrator
errors. Diagnosing such production run failures is an important but challenging task since it …

Improving software diagnosability via log enhancement

D Yuan, J Zheng, S Park, Y Zhou… - ACM Transactions on …, 2012 - dl.acm.org
Diagnosing software failures in the field is notoriously difficult, in part due to the fundamental
complexity of troubleshooting any complex software system, but further exacerbated by the …

Fingerprinting the datacenter: automated classification of performance crises

P Bodik, M Goldszmidt, A Fox, DB Woodard… - Proceedings of the 5th …, 2010 - dl.acm.org
Contemporary datacenters comprise hundreds or thousands of machines running
applications requiring high availability and responsiveness. Although a performance crisis is …