Predictive reliability and fault management in exascale systems: State of the art and perspectives

R Canal, C Hernandez, R Tornero, A Cilardo… - ACM Computing …, 2020 - dl.acm.org
Performance and power constraints come together with Complementary Metal Oxide
Semiconductor technology scaling in future Exascale systems. Technology scaling makes …

CRUM: Checkpoint-restart support for CUDA's unified memory

R Garg, A Mohan, M Sullivan… - 2018 IEEE International …, 2018 - ieeexplore.ieee.org
Unified Virtual Memory (UVM) was recently introduced with CUDA version 8 and the Pascal
GPU. The older CUDA programming style is akin to older large-memory UNIX applications …

GPM: leveraging persistent memory from a GPU

S Pandey, AK Kamath, A Basu - Proceedings of the 27th ACM …, 2022 - dl.acm.org
The GPU is a key computing platform for many application domains. While the new non-
volatile memory technology has brought the promise of byte-addressable persistence (aka …

Asymmetric resilience: Exploiting task-level idempotency for transient error recovery in accelerator-based systems

J Leng, A Buyuktosunoglu, R Bertran… - … Symposium on High …, 2020 - ieeexplore.ieee.org
Accelerators make the task of building systems that are resilient against transient errors like
voltage noise and soft errors hard. Architects integrate accelerators into the system as black …

Distributed configuration, authorization and management in the cloud-based internet of things

M Henze, B Wolters, R Matzutt… - 2017 IEEE Trustcom …, 2017 - ieeexplore.ieee.org
Network-based deployments within the Internet of Things increasingly rely on the cloud-
controlled federation of individual networks to configure, authorize, and manage devices …

Checkpoint restart support for heterogeneous HPC applications

K Parasyris, K Keller… - 2020 20th IEEE/ACM …, 2020 - ieeexplore.ieee.org
As we approach the era of exa-scale computing, fault tolerance is of growing importance.
The increasing number of cores as well as the increased complexity of modern …

Software approaches for resilience of high performance computing systems: a survey

J Jia, Y Liu, G Zhang, Y Gao, D Qian - Frontiers of Computer Science, 2023 - Springer
With the scaling up of high-performance computing systems in recent years, their reliability
has been descending continuously. Therefore, system resilience has been regarded as one …

Scalable Incremental Checkpointing using GPU-Accelerated De-Duplication

N Tan, J Luettgau, J Marquez, K Teranishi… - Proceedings of the …, 2023 - dl.acm.org
Writing large amounts of data concurrently to stable storage is a typical I/O pattern of many
HPC workflows. This pattern introduces high I/O overheads and results in increased storage …

Coordinated CTA combination and bandwidth partitioning for GPU concurrent kernel execution

Z Lin, H Dai, M Mantor, H Zhou - ACM Transactions on Architecture and …, 2019 - dl.acm.org
Contemporary GPUs support multiple kernels to run concurrently on the same streaming
multiprocessors (SMs). Recent studies have demonstrated that such concurrent kernel …

Sudden power-outage resilient in-processor checkpointing for energy-harvesting nonvolatile processors

N Onizawa, A Mochizuki, A Tamakoshi… - IEEE Transactions on …, 2016 - ieeexplore.ieee.org
This paper introduces a sudden power-outage resilient in-processor checkpointing for
energy-harvesting nonvolatile processors. In energy harvesting applications, a power supply …