A survey on resilience in the IoT: Taxonomy, classification, and discussion of resilience mechanisms
Internet-of-Things (IoT) ecosystems tend to grow both in scale and complexity, as they
consist of a variety of heterogeneous devices that span over multiple architectural IoT layers …
Predictive reliability and fault management in exascale systems: State of the art and perspectives
Performance and power constraints come together with Complementary Metal Oxide
Semiconductor technology scaling in future Exascale systems. Technology scaling makes …
A machine learning approach to online fault classification in HPC systems
As High-Performance Computing (HPC) systems strive towards the exascale goal,
failure rates both at the hardware and software levels will increase significantly. Thus …
Modeling patterns for reliability assessment of safety instrumented systems
H Meng, L Kloul, A Rauzy - Reliability Engineering & System Safety, 2018 - Elsevier
Safety Instrumented Systems (SIS) act as crucial safety barriers for preventing
hazardous accidents in the industrial systems. It is therefore of primary importance to study …
Remind: A framework for the resilient design of automotive systems
In the past years, great effort has been spent on enhancing the security and safety of
vehicular systems. Current advances in information and communication technology have …
Towards scalable resource management for supercomputers
Y Dai, Y Dong, K Lu, R Wang, W Zhang… - … Conference for High …, 2022 - ieeexplore.ieee.org
Today's supercomputers offer massive computation resources to execute a large number of
user jobs. Effectively managing such large-scale hardware parallelism and workloads is …
A comparison of application-level fault tolerance schemes for task pools
J Posner, L Reitz, C Fohry - Future Generation Computer Systems, 2020 - Elsevier
Fault tolerance is an important requirement for successful program execution on exascale
systems. The common approach, checkpointing, regularly saves a program's state, such that …
Task-level resilience: checkpointing vs. supervision
J Posner, L Reitz, C Fohry - International Journal of Networking and …, 2022 - jstage.jst.go.jp
With the advent of exascale computing, issues such as application irregularity and
permanent hardware failure are growing in importance. Irregularity is often addressed by …
The INTERSECT open federated architecture for the laboratory of the future
A federated instrument-to-edge-to-center architecture is needed to autonomously collect,
transfer, store, process, curate, and archive scientific data and reduce human-in-the-loop …
Resilience and fault tolerance in high-performance computing for numerical weather and climate prediction
T Benacchio, L Bonaventura… - … Journal of High …, 2021 - journals.sagepub.com
Progress in numerical weather and climate prediction accuracy greatly depends on the
growth of the available computing power. As the number of cores in top computing facilities …