Pre-trained KPI Anomaly Detection Model Through Disentangled Transformer

Z Yu, C Pei, X Wang, M Ma, C Bansal… - Proceedings of the 30th …, 2024 - dl.acm.org
In large-scale online service systems, numerous Key Performance Indicators (KPIs), such as
service response time and error rate, are gathered in a time-series format. KPI Anomaly …

End-to-End AutoML for Unsupervised Log Anomaly Detection

S Zhang, Y Ji, J Luan, X Nie, Z Chen, M Ma… - Proceedings of the 39th …, 2024 - dl.acm.org
As modern software systems evolve towards greater complexity, ensuring their reliable
operation has become a critical challenge. Log data analysis is vital in maintaining system …

Can We Trust Auto-Mitigation? Improving Cloud Failure Prediction with Uncertain Positive Learning

H Li, M Ma, Y Liu, P Zhao, S Li, Z Li… - 2024 IEEE 35th …, 2024 - ieeexplore.ieee.org
In the rapidly expanding domain of cloud computing, a variety of software services have
been deployed in the cloud. To ensure the reliability of cloud services, prior studies focus on …

Giving Every Modality a Voice in Microservice Failure Diagnosis via Multimodal Adaptive Optimization

L Tao, S Zhang, Z Jia, J Sun, M Ma, Z Li, Y Sun… - Proceedings of the 39th …, 2024 - dl.acm.org
Microservice systems are inherently complex and prone to failures, which can significantly
impact user experience. Existing diagnostic approaches based on single-modal data such …

Large Language Models Can Provide Accurate and Interpretable Incident Triage

Z Wang, J Li, M Ma, Z Li, Y Kang… - 2024 IEEE 35th …, 2024 - ieeexplore.ieee.org
Large-scale cloud services frequently experience incidents that can have a significant
impact on their stability. Incident triage is a critical process that assigns incidents to …

A Survey on Large Language Models for Communication, Network, and Service Management: Application Insights, Challenges, and Future Directions

GO Boateng, H Sami, A Alagha, H Elmekki… - arXiv preprint arXiv …, 2024 - arxiv.org
The rapid evolution of communication networks in recent decades has intensified the need
for advanced Network and Service Management (NSM) strategies to address the growing …

Early Bird: Ensuring Reliability of Cloud Systems Through Early Failure Prediction

Y Liu, M Ma, P Zhao, T Li, B Qiao, S Li… - 2024 IEEE 35th …, 2024 - ieeexplore.ieee.org
As cloud service continues to dominate various sectors, the reliability of cloud infrastructures
becomes crucial. Traditional methods of failure prediction often fall short in providing …

Engineering Trustworthy Software: A Mission for LLMs

M Vieira - arXiv preprint arXiv:2411.17981, 2024 - arxiv.org
LLMs are transforming software engineering by accelerating development, reducing
complexity, and cutting costs. When fully integrated into the software lifecycle they will drive …