Toward exascale resilience: 2014 update

F Cappello, A Geist, W Gropp, S Kale, B Kramer… - … and Innovations: an …, 2014 - dl.acm.org
Resilience is a major roadblock for HPC executions on future exascale systems. These
systems will typically gather millions of CPU cores running up to a billion threads …

The landscape of exascale research: A data-driven literature analysis

S Heldens, P Hijma, BV Werkhoven… - ACM Computing …, 2020 - dl.acm.org
The next generation of supercomputers will break the exascale barrier. Soon we will have
systems capable of at least one quintillion (billion billion) floating-point operations per …

DARE: High-performance state machine replication on RDMA networks

M Poke, T Hoefler - Proceedings of the 24th International Symposium on …, 2015 - dl.acm.org
The increasing amount of data that needs to be collected and analyzed requires large-scale
datacenter architectures that are naturally more susceptible to faults of single components …

The SIMNET virtual world architecture

J Calvin, A Dickens, B Gaines… - Proceedings of IEEE …, 1993 - ieeexplore.ieee.org
Many tools and techniques have been developed to address specific aspects of interacting
in a virtual world. Few have been designed with an architecture that allows large numbers of …

Exploration of lossy compression for application-level checkpoint/restart

N Sasaki, K Sato, T Endo… - 2015 IEEE international …, 2015 - ieeexplore.ieee.org
The scale of high performance computing (HPC) systems is exponentially growing,
potentially causing prohibitive shrinkage of mean time between failures (MTBF) while the …

Lazy checkpointing: Exploiting temporal locality in failures to mitigate checkpointing overheads on extreme-scale systems

D Tiwari, S Gupta, SS Vazhkudai - 2014 44th Annual IEEE/IFIP …, 2014 - ieeexplore.ieee.org
Continuing increase in the computational power of supercomputers has enabled large-scale
scientific applications in the areas of astrophysics, fusion, climate and combustion to run …

DCD—disk caching disk: a new approach for boosting I/O performance

Y Hu, Q Yang - ACM SIGARCH Computer Architecture News, 1996 - dl.acm.org
This paper presents a novel disk storage architecture called DCD, Disk Caching Disk, for the
purpose of optimizing I/O performance. The main idea of the DCD is to use a small log disk …

Characterizing deep-learning I/O workloads in TensorFlow

SWD Chien, S Markidis, CP Sishtla… - 2018 IEEE/ACM 3rd …, 2018 - ieeexplore.ieee.org
The performance of Deep-Learning (DL) computing frameworks relies on the performance of
data ingestion and checkpointing. In fact, during the training, a considerably high number of …

A user-level InfiniBand-based file system and checkpoint strategy for burst buffers

K Sato, K Mohror, A Moody, T Gamblin… - 2014 14th IEEE/ACM …, 2014 - ieeexplore.ieee.org
Checkpoint/Restart is an indispensable fault tolerance technique commonly used by high-
performance computing applications that run continuously for hours or days at a time …

A 1 PB/s file system to checkpoint three million MPI tasks

R Rajachandrasekar, A Moody, K Mohror… - Proceedings of the 22nd …, 2013 - dl.acm.org
With the massive scale of high-performance computing systems, long-running scientific
parallel applications periodically save the state of their execution to files called checkpoints …