AI for IT operations (AIOps) on cloud platforms: Reviews, opportunities and challenges

Q Cheng, D Sahoo, A Saha, W Yang, C Liu… - arXiv preprint arXiv …, 2023 - arxiv.org
Artificial Intelligence for IT operations (AIOps) aims to combine the power of AI with the big
data generated by IT Operations processes, particularly in cloud infrastructures, to provide …

LogAnomaly: Unsupervised detection of sequential and quantitative anomalies in unstructured logs

W Meng, Y Liu, Y Zhu, S Zhang, D Pei, Y Liu, Y Chen… - IJCAI, 2019 - nkcs.iops.ai
Recording runtime status via logs is common for almost every computer system, and detecting
anomalies in logs is crucial for timely identifying malfunctions of systems. However …

Unsupervised anomaly detection for intricate KPIs via adversarial training of VAE

W Chen, H Xu, Z Li, D Pei, J Chen… - … -IEEE conference on …, 2019 - ieeexplore.ieee.org
To ensure the reliability of the Internet-based application services, KPIs (Key Performance
Indicators) are closely monitored in real time and the anomalies presented in the KPIs must …

Efficient KPI anomaly detection through transfer learning for large-scale web services

S Zhang, Z Zhong, D Li, Q Fan, Y Sun… - IEEE Journal on …, 2022 - ieeexplore.ieee.org
Timely anomaly detection of key performance indicators (KPIs), e.g., service response time,
error rate, is of utmost importance to Web services. Over the years, many unsupervised deep …

Robust and unsupervised KPI anomaly detection based on conditional variational autoencoder

Z Li, W Chen, D Pei - 2018 IEEE 37th International Performance …, 2018 - ieeexplore.ieee.org
To ensure undisrupted web-based services, operators need to closely monitor various KPIs
(Key Performance Indicators, such as CPU usage, network throughput, page views, number …

TSAGen: Synthetic time series generation for KPI anomaly detection

C Wang, K Wu, T Zhou, G Yu… - IEEE Transactions on …, 2021 - ieeexplore.ieee.org
A key performance indicator (KPI) consists of critical time series data that reflect the runtime
states of network systems (e.g., response time and available bandwidth). Despite the …

PUTraceAD: Trace anomaly detection with partial labels based on GNN and PU learning

K Zhang, C Zhang, X Peng… - 2022 IEEE 33rd …, 2022 - ieeexplore.ieee.org
Distributed tracing has been an important part of microservice infrastructure and learning-
based trace analysis has been used to detect anomalies in microservice systems. Existing …

Generic and robust localization of multi-dimensional root causes

Z Li, C Luo, Y Zhao, Y Sun, K Sui… - 2019 IEEE 30th …, 2019 - ieeexplore.ieee.org
Operators of online software services periodically collect various measures with many
attributes. When a measure becomes abnormal, indicating service problems such as …

FluxRank: A widely-deployable framework to automatically localizing root cause machines for software service failure mitigation

P Liu, Y Chen, X Nie, J Zhu, S Zhang… - 2019 IEEE 30th …, 2019 - ieeexplore.ieee.org
The failures of software service directly affect user experiences and service revenue. Thus
operators monitor both service-level KPIs (e.g., response time) and machine-level KPIs (e.g. …

Intelligent detection for key performance indicators in industrial-based cyber-physical systems

S He, Z Li, J Wang, NN Xiong - IEEE Transactions on Industrial …, 2020 - ieeexplore.ieee.org
Intelligent anomaly detection for key performance indicators (KPIs) is important for keeping
services reliable in industrial-based cyber-physical systems (CPS). However, it is common in …