Predictive reliability and fault management in exascale systems: State of the art and perspectives
Performance and power constraints come together with Complementary Metal Oxide
Semiconductor technology scaling in future Exascale systems. Technology scaling makes …
CRUM: Checkpoint-restart support for CUDA's unified memory
Unified Virtual Memory (UVM) was recently introduced with CUDA version 8 and the Pascal
GPU. The older CUDA programming style is akin to older large-memory UNIX applications …
GPM: leveraging persistent memory from a GPU
The GPU is a key computing platform for many application domains. While the new non-
volatile memory technology has brought the promise of byte-addressable persistence (aka …
Asymmetric resilience: Exploiting task-level idempotency for transient error recovery in accelerator-based systems
Accelerators make the task of building systems that are resilient against transient errors like
voltage noise and soft errors hard. Architects integrate accelerators into the system as black …
Distributed configuration, authorization and management in the cloud-based internet of things
Network-based deployments within the Internet of Things increasingly rely on the cloud-
controlled federation of individual networks to configure, authorize, and manage devices …
Checkpoint restart support for heterogeneous HPC applications
K Parasyris, K Keller… - 2020 20th IEEE/ACM …, 2020 - ieeexplore.ieee.org
As we approach the era of exa-scale computing, fault tolerance is of growing importance.
The increasing number of cores as well as the increased complexity of modern …
Software approaches for resilience of high performance computing systems: a survey
With the scaling up of high-performance computing systems in recent years, their reliability
has been descending continuously. Therefore, system resilience has been regarded as one …
Scalable Incremental Checkpointing using GPU-Accelerated De-Duplication
Writing large amounts of data concurrently to stable storage is a typical I/O pattern of many
HPC workflows. This pattern introduces high I/O overheads and results in increased storage …
Coordinated CTA combination and bandwidth partitioning for GPU concurrent kernel execution
Contemporary GPUs support multiple kernels to run concurrently on the same streaming
multiprocessors (SMs). Recent studies have demonstrated that such concurrent kernel …
Sudden power-outage resilient in-processor checkpointing for energy-harvesting nonvolatile processors
N Onizawa, A Mochizuki, A Tamakoshi… - IEEE Transactions on …, 2016 - ieeexplore.ieee.org
This paper introduces a sudden power-outage resilient in-processor checkpointing for
energy-harvesting nonvolatile processors. In energy harvesting applications, a power supply …