Heterocheckpoint: Efficient checkpointing for accelerator-based systems

R Canal, C Hernandez, R Tornero, A Cilardo… - ACM Computing …, 2020 - dl.acm.org

Performance and power constraints come together with Complementary Metal Oxide
Semiconductor technology scaling in future Exascale systems. Technology scaling makes …

Cited by 38 Related articles All 12 versions

CRUM: Checkpoint-restart support for CUDA's unified memory

R Garg, A Mohan, M Sullivan… - 2018 IEEE International …, 2018 - ieeexplore.ieee.org

Unified Virtual Memory (UVM) was recently introduced with CUDA version 8 and the Pascal
GPU. The older CUDA programming style is akin to older large-memory UNIX applications …

Cited by 44 Related articles All 6 versions

GPM: leveraging persistent memory from a GPU

S Pandey, AK Kamath, A Basu - Proceedings of the 27th ACM …, 2022 - dl.acm.org

The GPU is a key computing platform for many application domains. While the new non-
volatile memory technology has brought the promise of byte-addressable persistence (aka …

Cited by 20 Related articles All 4 versions

Asymmetric resilience: Exploiting task-level idempotency for transient error recovery in accelerator-based systems

J Leng, A Buyuktosunoglu, R Bertran… - … Symposium on High …, 2020 - ieeexplore.ieee.org

Accelerators make the task of building systems that are resilient against transient errors like
voltage noise and soft errors hard. Architects integrate accelerators into the system as black …

Cited by 26 Related articles All 6 versions

Distributed configuration, authorization and management in the cloud-based internet of things

M Henze, B Wolters, R Matzutt… - 2017 IEEE Trustcom …, 2017 - ieeexplore.ieee.org

Network-based deployments within the Internet of Things increasingly rely on the cloud-
controlled federation of individual networks to configure, authorize, and manage devices …

Cited by 37 Related articles All 7 versions

Checkpoint restart support for heterogeneous HPC applications

K Parasyris, K Keller… - 2020 20th IEEE/ACM …, 2020 - ieeexplore.ieee.org

As we approach the era of exa-scale computing, fault tolerance is of growing importance.
The increasing number of cores as well as the increased complexity of modern …

Cited by 26 Related articles All 3 versions

Software approaches for resilience of high performance computing systems: a survey

J Jia, Y Liu, G Zhang, Y Gao, D Qian - Frontiers of Computer Science, 2023 - Springer

With the scaling up of high-performance computing systems in recent years, their reliability
has been descending continuously. Therefore, system resilience has been regarded as one …

Cited by 8 Related articles All 4 versions

Scalable Incremental Checkpointing using GPU-Accelerated De-Duplication

N Tan, J Luettgau, J Marquez, K Teranishi… - Proceedings of the …, 2023 - dl.acm.org

Writing large amounts of data concurrently to stable storage is a typical I/O pattern of many
HPC workflows. This pattern introduces high I/O overheads and results in increased storage …

Cited by 2 Related articles All 8 versions

Coordinated CTA combination and bandwidth partitioning for GPU concurrent kernel execution

Z Lin, H Dai, M Mantor, H Zhou - ACM Transactions on Architecture and …, 2019 - dl.acm.org

Contemporary GPUs support multiple kernels to run concurrently on the same streaming
multiprocessors (SMs). Recent studies have demonstrated that such concurrent kernel …

Cited by 14 Related articles All 6 versions

Sudden power-outage resilient in-processor checkpointing for energy-harvesting nonvolatile processors

N Onizawa, A Mochizuki, A Tamakoshi… - IEEE Transactions on …, 2016 - ieeexplore.ieee.org

This paper introduces a sudden power-outage resilient in-processor checkpointing for
energy-harvesting nonvolatile processors. In energy harvesting applications, a power supply …

Cited by 19 Related articles All 4 versions