Fault tolerance in cloud computing environment: A systematic survey
M Hasan, MS Goraya - Computers in Industry, 2018 - Elsevier
Fault tolerance is among the most imperative issues in cloud to deliver reliable services. It is
difficult to implement due to dynamic service infrastructure, complex configurations and …
Task failure prediction in cloud data centers using deep learning
A large-scale cloud data center needs to provide high service reliability and availability with
low failure occurrence probability. However, current large-scale cloud data centers still face …
Cloud-native computing: A survey from the perspective of services
The development of cloud computing delivery models inspires the emergence of cloud-
native computing. Cloud-native computing, as the most influential development principle for …
Why does the cloud stop computing? Lessons from hundreds of service outages
We conducted a cloud outage study (COS) of 32 popular Internet services. We analyzed
1247 headline news and public post-mortem reports that detail 597 unplanned outages that …
Assise: Performance and availability via client-local NVM in a distributed file system
The adoption of low latency persistent memory modules (PMMs) upends the long-
established model of remote storage for distributed file systems. Instead, by colocating …
What can we learn from four years of data center hardware failures?
Hardware failures have a big impact on the dependability of large-scale data centers. We
present studies on over 290,000 hardware failure reports collected over the past four years …
LineFS: Efficient SmartNIC offload of a distributed file system with pipeline parallelism
In multi-tenant systems, the CPU overhead of distributed file systems (DFSes) is increasingly
a burden to application performance. CPU and memory interference cause degraded and …
Multi-agent based autonomic network management architecture
The advent of network softwarization is enabling multiple innovative solutions through
software-defined networking (SDN) and network function virtualization (NFV). Specifically …
An analysis of Network-Partitioning failures in cloud systems
A Alquraan, H Takruri, M Alfatafta… - 13th USENIX Symposium …, 2018 - usenix.org
We present a comprehensive study of 136 system failures attributed to network-partitioning
faults from 25 widely used distributed systems. We found that the majority of the failures led …
Robust anomaly detection on unreliable data
Classification algorithms have been widely adopted to detect anomalies for various systems,
e.g., IoT and cloud, under the common assumption that the data source is clean, i.e., features …