Predictive reliability and fault management in exascale systems: State of the art and perspectives

R Canal, C Hernandez, R Tornero, A Cilardo… - ACM Computing …, 2020 - dl.acm.org
Performance and power constraints come together with Complementary Metal Oxide
Semiconductor technology scaling in future Exascale systems. Technology scaling makes …

Nu: Achieving Microsecond-Scale resource fungibility with logical processes

Z Ruan, SJ Park, MK Aguilera, A Belay… - … USENIX Symposium on …, 2023 - usenix.org
Datacenters waste significant compute and memory resources today because they lack
resource fungibility: the ability to reassign resources quickly and without disruption. We …

Terabase-scale metagenome coassembly with MetaHipMer

S Hofmeyr, R Egan, E Georganas, AC Copeland… - Scientific reports, 2020 - nature.com
Metagenome sequence datasets can contain terabytes of reads, too many to be
coassembled together on a single shared-memory computer; consequently, they have only …

Spatially distributed infection increases viral load in a computational model of SARS-CoV-2 lung infection

ME Moses, S Hofmeyr, JL Cannon… - PLoS computational …, 2021 - journals.plos.org
A key question in SARS-CoV-2 infection is why viral loads and patient outcomes vary
dramatically across individuals. Because spatial-temporal dynamics of viral spread and …

Embracing Irregular Parallelism in HPC with YGM

T Steil, T Reza, B Priest, R Pearce - Proceedings of the International …, 2023 - dl.acm.org
YGM is a general-purpose asynchronous distributed computing library for C++/MPI,
designed to handle the irregular data access patterns and small messages of graph …

A Fine-grained Asynchronous Bulk Synchronous parallelism model for PGAS applications

SR Paul, A Hayashi, K Chen, Y Elmougy… - Journal of Computational …, 2023 - Elsevier
The Partitioned Global Address Space (PGAS) model is well suited for executing
irregular applications on cluster-based systems, due to its efficient support for short, one …

Static local concurrency errors detection in MPI-RMA programs

E Saillard, M Sergent, CTA Kaci… - 2022 IEEE/ACM Sixth …, 2022 - ieeexplore.ieee.org
Communications are a critical part of HPC simulations, and one of the main focuses of
application developers when scaling on supercomputers. While classical message passing …

ECP software technology capability assessment report

MA Heroux, LC McInnes, R Thakur, JS Vetter, XS Li… - 2020 - osti.gov
The Exascale Computing Project (ECP) Software Technology (ST) Focus Area is
responsible for developing critical software capabilities that will enable successful execution …

Towards efficient remote OpenMP offloading

W Lu, B Shan, E Raut, J Meng, M Araya-Polo… - … Workshop on OpenMP, 2022 - Springer
On modern heterogeneous HPC systems, the most popular way to realize distributed
computation is the hybrid programming model of MPI+X (X being OpenMP/CUDA/etc.), as it …

Devastator: A Scalable Parallel Discrete Event Simulation Framework for Modern C++

J Bachan, J Ye, X Jiang, T Nguyen… - Proceedings of the 38th …, 2024 - dl.acm.org
Parallel discrete event simulation is a fundamental simulation technology that is essential to
the parallelization of event-based models including hardware and transportation systems …