Scalable and fault tolerant failure detection and consensus

A Kovalenko, H Kuchuk - Advances in Self-healing Systems Monitoring …, 2022 - Springer

The chapter proposes a set of data management methods in Self-healing Systems. The
proposed methods are focused on taking into account the features of Self-healing Systems …

Cited by 37 Related articles All 4 versions

Reinit: Evaluating the performance of global-restart recovery methods for MPI fault tolerance

G Georgakoudis, L Guo, I Laguna - International Conference on High …, 2020 - Springer

Scaling supercomputers comes with an increase in failure rates due to the increasing
number of hardware components. In standard practice, applications are made resilient …

Cited by 33 Related articles All 6 versions

Failure detection and propagation in HPC systems

G Bosilca, A Bouteiller, A Guermouche… - SC'16: Proceedings …, 2016 - ieeexplore.ieee.org

Building an infrastructure for Exascale applications requires, in addition to many other key
components, a stable and efficient failure detector. This paper describes the design and …

Cited by 43 Related articles All 17 versions

Decentralized network building change in large manufacturing companies towards Industry 4.0

P Poonpakdee, J Koiwanit, C Yuangyai - Procedia computer science, 2017 - Elsevier

In complex industrial ecosystems together with an increasing global competition, success
depends on a complete value chain transformation. The use of Industry 4.0 standards is …

Cited by 33 Related articles All 2 versions

A survey about self-healing systems (desktop and web application)

AA Hudaib, HN Fakhouri, FE Al Adwan… - Communications and …, 2017 - scirp.org

The complexity of computer architectures, software, web applications, and its large spread
worldwide using the internet and the rapid increase in the number of users in companion …

Cited by 26 Related articles All 6 versions

A failure detector for HPC platforms

G Bosilca, A Bouteiller, A Guermouche… - … Journal of High …, 2018 - journals.sagepub.com

Building an infrastructure for exascale applications requires, in addition to many other key
components, a stable and efficient failure detector. This article describes the design and …

Cited by 22 Related articles All 13 versions

Epidemic failure detection and consensus for extreme parallelism

A Katti, G Di Fatta, T Naughton… - … International Journal of …, 2018 - journals.sagepub.com

Future extreme-scale high-performance computing systems will be required to work under
frequent component failures. The MPI Forum's User Level Failure Mitigation proposal has …

Cited by 17 Related articles All 9 versions

A short survey of dimensionality reduction techniques

VL Chetana, SS Kolisetty, K Amogh - Recent advances in …, 2020 - taylorfrancis.com

Advancement in data collection has increased the availability of high-dimensional data.
High dimensional data results in data overload which makes the storage and processing …

Cited by 11 Related articles All 5 versions

Match: An MPI fault tolerance benchmark suite

L Guo, G Georgakoudis, K Parasyris… - 2020 IEEE …, 2020 - ieeexplore.ieee.org

MPI has been ubiquitously deployed in flagship HPC systems aiming to accelerate
distributed scientific applications running on tens of hundreds of processes and compute …

Cited by 11 Related articles All 6 versions

A Fault-Model-Relevant Classification of Consensus Mechanisms for MPI and HPC

G Nansamba, A Altarawneh, A Skjellum - International journal of parallel …, 2023 - Springer

Large-scale HPC systems experience failures arising from faults in hardware, software,
and/or networking. Failure rates continue to grow as systems scale up and out. Crash fault …

Cited by 1 Related articles All 4 versions