Fault tolerance in cloud computing environment: A systematic survey

M Hasan, MS Goraya - Computers in Industry, 2018 - Elsevier
Fault tolerance is among the most imperative issues in cloud to deliver reliable services. It is
difficult to implement due to dynamic service infrastructure, complex configurations and …

Task failure prediction in cloud data centers using deep learning

J Gao, H Wang, H Shen - IEEE transactions on services …, 2020 - ieeexplore.ieee.org
A large-scale cloud data center needs to provide high service reliability and availability with
low failure occurrence probability. However, current large-scale cloud data centers still face …

Cloud-native computing: A survey from the perspective of services

S Deng, H Zhao, B Huang, C Zhang… - Proceedings of the …, 2024 - ieeexplore.ieee.org
The development of cloud computing delivery models inspires the emergence of cloud-native
computing. Cloud-native computing, as the most influential development principle for …

Why does the cloud stop computing? Lessons from hundreds of service outages

HS Gunawi, M Hao, RO Suminto, A Laksono… - Proceedings of the …, 2016 - dl.acm.org
We conducted a cloud outage study (COS) of 32 popular Internet services. We analyzed
1247 headline news and public post-mortem reports that detail 597 unplanned outages that …

Assise: Performance and availability via client-local NVM in a distributed file system

TE Anderson, M Canini, J Kim, D Kostić… - … USENIX Symposium on …, 2020 - usenix.org
The adoption of low latency persistent memory modules (PMMs) upends the long-established
model of remote storage for distributed file systems. Instead, by colocating …

What can we learn from four years of data center hardware failures?

G Wang, L Zhang, W Xu - 2017 47th Annual IEEE/IFIP …, 2017 - ieeexplore.ieee.org
Hardware failures have a big impact on the dependability of large-scale data centers. We
present studies on over 290,000 hardware failure reports collected over the past four years …

LineFS: Efficient SmartNIC offload of a distributed file system with pipeline parallelism

J Kim, I Jang, W Reda, J Im, M Canini, D Kostić… - Proceedings of the …, 2021 - dl.acm.org
In multi-tenant systems, the CPU overhead of distributed file systems (DFSes) is increasingly
a burden to application performance. CPU and memory interference cause degraded and …

Multi-agent based autonomic network management architecture

ST Arzo, R Bassoli, F Granelli… - IEEE Transactions on …, 2021 - ieeexplore.ieee.org
The advent of network softwarization is enabling multiple innovative solutions through
software-defined networking (SDN) and network function virtualization (NFV). Specifically …

An analysis of Network-Partitioning failures in cloud systems

A Alquraan, H Takruri, M Alfatafta… - 13th USENIX Symposium …, 2018 - usenix.org
We present a comprehensive study of 136 system failures attributed to network-partitioning
faults from 25 widely used distributed systems. We found that the majority of the failures led …

Robust anomaly detection on unreliable data

Z Zhao, S Cerf, R Birke, B Robu… - 2019 49th Annual …, 2019 - ieeexplore.ieee.org
Classification algorithms have been widely adopted to detect anomalies for various systems,
e.g., IoT and cloud, under the common assumption that the data source is clean, i.e., features …