Methods to Manage Data in Self-healing Systems

A Kovalenko, H Kuchuk - Advances in Self-healing Systems Monitoring …, 2022 - Springer
The chapter proposes a set of data management methods in Self-healing Systems. The
proposed methods are focused on taking into account the features of Self-healing Systems …

Reinit: Evaluating the performance of global-restart recovery methods for MPI fault tolerance

G Georgakoudis, L Guo, I Laguna - International Conference on High …, 2020 - Springer
Scaling supercomputers comes with an increase in failure rates due to the increasing
number of hardware components. In standard practice, applications are made resilient …

Failure detection and propagation in HPC systems

G Bosilca, A Bouteiller, A Guermouche… - SC'16: Proceedings …, 2016 - ieeexplore.ieee.org
Building an infrastructure for Exascale applications requires, in addition to many other key
components, a stable and efficient failure detector. This paper describes the design and …

Decentralized network building change in large manufacturing companies towards Industry 4.0

P Poonpakdee, J Koiwanit, C Yuangyai - Procedia computer science, 2017 - Elsevier
In complex industrial ecosystems together with an increasing global competition, success
depends on a complete value chain transformation. The use of Industry 4.0 standards is …

A survey about self-healing systems (desktop and web application)

AA Hudaib, HN Fakhouri, FE Al Adwan… - Communications and …, 2017 - scirp.org
The complexity of computer architectures, software, web applications, and its large spread
worldwide using the internet and the rapid increase in the number of users in companion …

A failure detector for HPC platforms

G Bosilca, A Bouteiller, A Guermouche… - … Journal of High …, 2018 - journals.sagepub.com
Building an infrastructure for exascale applications requires, in addition to many other key
components, a stable and efficient failure detector. This article describes the design and …

Epidemic failure detection and consensus for extreme parallelism

A Katti, G Di Fatta, T Naughton… - … International Journal of …, 2018 - journals.sagepub.com
Future extreme-scale high-performance computing systems will be required to work under
frequent component failures. The MPI Forum's User Level Failure Mitigation proposal has …

A short survey of dimensionality reduction techniques

VL Chetana, SS Kolisetty, K Amogh - Recent advances in …, 2020 - taylorfrancis.com
Advancement in data collection has increased the availability of high-dimensional data.
High dimensional data results in data overload which makes the storage and processing …

Match: An MPI fault tolerance benchmark suite

L Guo, G Georgakoudis, K Parasyris… - 2020 IEEE …, 2020 - ieeexplore.ieee.org
MPI has been ubiquitously deployed in flagship HPC systems aiming to accelerate
distributed scientific applications running on tens of hundreds of processes and compute …

A Fault-Model-Relevant Classification of Consensus Mechanisms for MPI and HPC

G Nansamba, A Altarawneh, A Skjellum - International journal of parallel …, 2023 - Springer
Large-scale HPC systems experience failures arising from faults in hardware, software,
and/or networking. Failure rates continue to grow as systems scale up and out. Crash fault …