Failure analysis of virtual and physical machines: patterns, causes and characteristics

M Hasan, MS Goraya - Computers in Industry, 2018 - Elsevier

Fault tolerance is among the most imperative issues in cloud to deliver reliable services. It is
difficult to implement due to dynamic service infrastructure, complex configurations and …

Cited by 116

Task failure prediction in cloud data centers using deep learning

J Gao, H Wang, H Shen - IEEE transactions on services …, 2020 - ieeexplore.ieee.org

A large-scale cloud data center needs to provide high service reliability and availability with
low failure occurrence probability. However, current large-scale cloud data centers still face …

Cited by 371

Cloud-native computing: A survey from the perspective of services

S Deng, H Zhao, B Huang, C Zhang… - Proceedings of the …, 2024 - ieeexplore.ieee.org

The development of cloud computing delivery models inspires the emergence of cloud-native
computing. Cloud-native computing, as the most influential development principle for …

Cited by 8

Why does the cloud stop computing? lessons from hundreds of service outages

HS Gunawi, M Hao, RO Suminto, A Laksono… - Proceedings of the …, 2016 - dl.acm.org

We conducted a cloud outage study (COS) of 32 popular Internet services. We analyzed
1247 headline news and public post-mortem reports that detail 597 unplanned outages that …

Cited by 283

Assise: Performance and availability via client-local NVM in a distributed file system

TE Anderson, M Canini, J Kim, D Kostić… - … USENIX Symposium on …, 2020 - usenix.org

The adoption of low latency persistent memory modules (PMMs) upends the long-established
model of remote storage for distributed file systems. Instead, by colocating …

Cited by 73

What can we learn from four years of data center hardware failures?

G Wang, L Zhang, W Xu - 2017 47th Annual IEEE/IFIP …, 2017 - ieeexplore.ieee.org

Hardware failures have a big impact on the dependability of large-scale data centers. We
present studies on over 290,000 hardware failure reports collected over the past four years …

Cited by 145

LineFS: Efficient SmartNIC offload of a distributed file system with pipeline parallelism

J Kim, I Jang, W Reda, J Im, M Canini, D Kostić… - Proceedings of the …, 2021 - dl.acm.org

In multi-tenant systems, the CPU overhead of distributed file systems (DFSes) is increasingly
a burden to application performance. CPU and memory interference cause degraded and …

Cited by 45

Multi-agent based autonomic network management architecture

ST Arzo, R Bassoli, F Granelli… - IEEE Transactions on …, 2021 - ieeexplore.ieee.org

The advent of network softwarization is enabling multiple innovative solutions through
software-defined networking (SDN) and network function virtualization (NFV). Specifically …

Cited by 56

An analysis of Network-Partitioning failures in cloud systems

A Alquraan, H Takruri, M Alfatafta… - 13th USENIX Symposium …, 2018 - usenix.org

We present a comprehensive study of 136 system failures attributed to network-partitioning
faults from 25 widely used distributed systems. We found that the majority of the failures led …

Cited by 94

Robust anomaly detection on unreliable data

Z Zhao, S Cerf, R Birke, B Robu… - 2019 49th Annual …, 2019 - ieeexplore.ieee.org

Classification algorithms have been widely adopted to detect anomalies for various systems,
e.g., IoT and cloud, under the common assumption that the data source is clean, i.e., features …

Cited by 54