A survey of machine learning for computer architecture and systems

N Wu, Y Xie - ACM Computing Surveys (CSUR), 2022 - dl.acm.org
It has been a long time that computer architecture and systems are optimized for efficient
execution of machine learning (ML) models. Now, it is time to reconsider the relationship …

AI for IT operations (AIOps) on cloud platforms: Reviews, opportunities and challenges

Q Cheng, D Sahoo, A Saha, W Yang, C Liu… - arXiv preprint arXiv …, 2023 - arxiv.org
Artificial Intelligence for IT operations (AIOps) aims to combine the power of AI with the big
data generated by IT Operations processes, particularly in cloud infrastructures, to provide …

AIOps: real-world challenges and research innovations

Y Dang, Q Lin, P Huang - 2019 IEEE/ACM 41st International …, 2019 - ieeexplore.ieee.org
AIOps is about empowering software and service engineers (e.g., developers, program
managers, support engineers, site reliability engineers) to efficiently and effectively build …

Making disk failure predictions SMARTer!

S Lu, B Luo, T Patel, Y Yao, D Tiwari… - 18th USENIX Conference …, 2020 - usenix.org
Disk drives are one of the most commonly replaced hardware components and continue to
pose challenges for accurate failure prediction. In this work, we present analysis and …

A fault tolerant elastic resource management framework toward high availability of cloud services

D Saxena, I Gupta, AK Singh… - IEEE Transactions on …, 2022 - ieeexplore.ieee.org
Cloud computing has become inevitable for every digital service which has exponentially
increased its usage. However, a tremendous surge in cloud resource demand stave off …

Deep Learning for HDD health assessment: An application based on LSTM

A De Santo, A Galli, M Gravina… - IEEE Transactions on …, 2020 - ieeexplore.ieee.org
Hard disk drive failures are one of the most common causes of service downtime in data
centers. Predictive maintenance techniques have been adopted to extend the Remaining …

Towards intelligent incident management: why we need it and how we make it

Z Chen, Y Kang, L Li, X Zhang, H Zhang, H Xu… - Proceedings of the 28th …, 2020 - dl.acm.org
The management of cloud service incidents (unplanned interruptions or outages of a
service/product) greatly affects customer satisfaction and business revenue. After years of …

Predicting node failure in cloud service systems

Q Lin, K Hsieh, Y Dang, H Zhang, K Sui, Y Xu… - Proceedings of the …, 2018 - dl.acm.org
In recent years, many traditional software systems have migrated to cloud computing
platforms and are provided as online services. The service quality matters because system …

Jump-Starting multivariate time series anomaly detection for online service systems

M Ma, S Zhang, J Chen, J Xu, H Li, Y Lin… - 2021 USENIX Annual …, 2021 - usenix.org
With the booming of online service systems, anomaly detection on multivariate time series,
such as a combination of CPU utilization, average response time, and requests per second …

Identifying bad software changes via multimodal anomaly detection for online service systems

N Zhao, J Chen, Z Yu, H Wang, J Li, B Qiu… - Proceedings of the 29th …, 2021 - dl.acm.org
In large-scale online service systems, software changes are inevitable and frequent. Due to
importing new code or configurations, changes are likely to incur incidents and destroy user …