A survey on resilience in the IoT: Taxonomy, classification, and discussion of resilience mechanisms

C Berger, P Eichhammer, HP Reiser… - ACM Computing …, 2021 - dl.acm.org
Internet-of-Things (IoT) ecosystems tend to grow both in scale and complexity, as they
consist of a variety of heterogeneous devices that span over multiple architectural IoT layers …

Predictive reliability and fault management in exascale systems: State of the art and perspectives

R Canal, C Hernandez, R Tornero, A Cilardo… - ACM Computing …, 2020 - dl.acm.org
Performance and power constraints come together with Complementary Metal Oxide
Semiconductor technology scaling in future Exascale systems. Technology scaling makes …

A machine learning approach to online fault classification in HPC systems

A Netti, Z Kiziltan, O Babaoglu, A Sîrbu… - Future Generation …, 2020 - Elsevier
As High-Performance Computing (HPC) systems strive towards the exascale goal,
failure rates both at the hardware and software levels will increase significantly. Thus …

Modeling patterns for reliability assessment of safety instrumented systems

H Meng, L Kloul, A Rauzy - Reliability Engineering & System Safety, 2018 - Elsevier
Safety Instrumented Systems (SIS) act as crucial safety barriers for preventing
hazardous accidents in the industrial systems. It is therefore of primary importance to study …

REMIND: A framework for the resilient design of automotive systems

T Rosenstatter, K Strandberg, R Jolak… - 2020 IEEE Secure …, 2020 - ieeexplore.ieee.org
In the past years, great effort has been spent on enhancing the security and safety of
vehicular systems. Current advances in information and communication technology have …

Towards scalable resource management for supercomputers

Y Dai, Y Dong, K Lu, R Wang, W Zhang… - … Conference for High …, 2022 - ieeexplore.ieee.org
Today's supercomputers offer massive computation resources to execute a large number of
user jobs. Effectively managing such large-scale hardware parallelism and workloads is …

A comparison of application-level fault tolerance schemes for task pools

J Posner, L Reitz, C Fohry - Future Generation Computer Systems, 2020 - Elsevier
Fault tolerance is an important requirement for successful program execution on exascale
systems. The common approach, checkpointing, regularly saves a program's state, such that …

Task-level resilience: checkpointing vs. supervision

J Posner, L Reitz, C Fohry - International Journal of Networking and …, 2022 - jstage.jst.go.jp
With the advent of exascale computing, issues such as application irregularity and
permanent hardware failure are growing in importance. Irregularity is often addressed by …

The INTERSECT open federated architecture for the laboratory of the future

C Engelmann, O Kuchar, S Boehm, MJ Brim… - Smoky Mountains …, 2022 - Springer
A federated instrument-to-edge-to-center architecture is needed to autonomously collect,
transfer, store, process, curate, and archive scientific data and reduce human-in-the-loop …

Resilience and fault tolerance in high-performance computing for numerical weather and climate prediction

T Benacchio, L Bonaventura… - … Journal of High …, 2021 - journals.sagepub.com
Progress in numerical weather and climate prediction accuracy greatly depends on the
growth of the available computing power. As the number of cores in top computing facilities …