Calculation of the high-energy neutron flux for anticipating errors and recovery techniques in exascale supercomputer centres
H Asorey, R Mayo-Garcia - The Journal of Supercomputing, 2023 - Springer
The age of exascale computing has arrived, and the risks associated with neutron and other
atmospheric radiation are becoming more critical as the computing power increases; hence …
Enabling Practical Transparent Checkpointing for MPI: A Topological Sort Approach
Y Xu, G Cooperman - 2024 IEEE International Conference on …, 2024 - ieeexplore.ieee.org
MPI is the de facto standard for parallel computing on a cluster of computers. Checkpointing
is an important component in any strategy for software resilience and for long-running jobs …
Implementation-Oblivious Transparent Checkpoint-Restart for MPI
This work presents experience with traditional use cases of checkpointing on a novel
platform. A single codebase (MANA) transparently checkpoints production workloads for …
Debugging MPI Implementations via Reduction-to-Primitives
G Cooperman, D Li, Z Zhao - 2022 IEEE/ACM Third …, 2022 - ieeexplore.ieee.org
Testing correctness of either a new MPI implementation or a transparent checkpointing
package for MPI is inherently difficult. A bug is often observed when running a correctly …