A survey of machine learning for computer architecture and systems
It has been a long time that computer architecture and systems are optimized for efficient
execution of machine learning (ML) models. Now, it is time to reconsider the relationship …
AI for IT operations (AIOps) on cloud platforms: Reviews, opportunities and challenges
Artificial Intelligence for IT operations (AIOps) aims to combine the power of AI with the big
data generated by IT Operations processes, particularly in cloud infrastructures, to provide …
AIOps: real-world challenges and research innovations
AIOps is about empowering software and service engineers (eg, developers, program
managers, support engineers, site reliability engineers) to efficiently and effectively build …
Making disk failure predictions SMARTer!
Disk drives are one of the most commonly replaced hardware components and continue to
pose challenges for accurate failure prediction. In this work, we present analysis and …
A fault tolerant elastic resource management framework toward high availability of cloud services
Cloud computing has become inevitable for every digital service which has exponentially
increased its usage. However, a tremendous surge in cloud resource demand stave off …
Deep Learning for HDD health assessment: An application based on LSTM
Hard disk drive failures are one of the most common causes of service downtime in data
centers. Predictive maintenance techniques have been adopted to extend the Remaining …
Towards intelligent incident management: why we need it and how we make it
The management of cloud service incidents (unplanned interruptions or outages of a
service/product) greatly affects customer satisfaction and business revenue. After years of …
Predicting node failure in cloud service systems
In recent years, many traditional software systems have migrated to cloud computing
platforms and are provided as online services. The service quality matters because system …
Jump-Starting multivariate time series anomaly detection for online service systems
With the booming of online service systems, anomaly detection on multivariate time series,
such as a combination of CPU utilization, average response time, and requests per second …
Identifying bad software changes via multimodal anomaly detection for online service systems
In large-scale online service systems, software changes are inevitable and frequent. Due to
importing new code or configurations, changes are likely to incur incidents and destroy user …