Anomaly detection and failure root cause analysis in (micro) service-based cloud applications: A survey
The proliferation of services and service interactions within microservices and cloud-native
applications, makes it harder to detect failures and to identify their possible root causes …
Performance anomaly detection and bottleneck identification
O Ibidunmoye, F Hernández-Rodriguez… - ACM Computing Surveys …, 2015 - dl.acm.org
In order to meet stringent performance requirements, system administrators must effectively
detect undesirable performance behaviours, identify potential root causes, and take …
A survey of aiops methods for failure management
Modern society is increasingly moving toward complex and distributed computing systems.
The increase in scale and complexity of these systems challenges O&M teams that perform …
Detecting large-scale system problems by mining console logs
Surprisingly, console logs rarely help operators detect problems in large-scale datacenter
services, for they often consist of the voluminous intermixing of messages from many …
Structured comparative analysis of systems logs to diagnose performance problems
Diagnosis and correction of performance issues in modern, large-scale distributed systems
can be a daunting task, since a single developer is unlikely to be familiar with the entire …
A survey on load testing of large-scale software systems
Many large-scale software systems must service thousands or millions of concurrent
requests. These systems must be load tested to ensure that they can function correctly under …
Characterizing logging practices in open-source software
D Yuan, S Park, Y Zhou - 2012 34th international conference …, 2012 - ieeexplore.ieee.org
Software logging is a conventional programming practice. While its efficacy is often
important for users and developers to understand what have happened in the production …
Sherlog: error diagnosis by connecting clues from run-time logs
Computer systems often fail due to many factors such as software bugs or administrator
errors. Diagnosing such production run failures is an important but challenging task since it …
Improving software diagnosability via log enhancement
D Yuan, J Zheng, S Park, Y Zhou… - ACM Transactions on …, 2012 - dl.acm.org
Diagnosing software failures in the field is notoriously difficult, in part due to the fundamental
complexity of troubleshooting any complex software system, but further exacerbated by the …
Fingerprinting the datacenter: automated classification of performance crises
Contemporary datacenters comprise hundreds or thousands of machines running
applications requiring high availability and responsiveness. Although a performance crisis is …