Toward exascale resilience: 2014 update
F Cappello, A Geist, W Gropp, S Kale, B Kramer… - Supercomputing Frontiers and Innovations, 2014 - dl.acm.org
Resilience is a major roadblock for HPC executions on future exascale systems. These
systems will typically gather millions of CPU cores running up to a billion threads …
The landscape of exascale research: A data-driven literature analysis
S Heldens, P Hijma, BV Werkhoven… - ACM Computing …, 2020 - dl.acm.org
The next generation of supercomputers will break the exascale barrier. Soon we will have
systems capable of at least one quintillion (billion billion) floating-point operations per …
DARE: High-performance state machine replication on RDMA networks
The increasing amount of data that needs to be collected and analyzed requires large-scale
datacenter architectures that are naturally more susceptible to faults of single components …
The SIMNET virtual world architecture
J Calvin, A Dickens, B Gaines… - Proceedings of IEEE …, 1993 - ieeexplore.ieee.org
Many tools and techniques have been developed to address specific aspects of interacting
in a virtual world. Few have been designed with an architecture that allows large numbers of …
Exploration of lossy compression for application-level checkpoint/restart
The scale of high performance computing (HPC) systems is exponentially growing,
potentially causing prohibitive shrinkage of mean time between failures (MTBF) while the …
Lazy checkpointing: Exploiting temporal locality in failures to mitigate checkpointing overheads on extreme-scale systems
D Tiwari, S Gupta, SS Vazhkudai - 2014 44th Annual IEEE/IFIP …, 2014 - ieeexplore.ieee.org
Continuing increase in the computational power of supercomputers has enabled large-scale
scientific applications in the areas of astrophysics, fusion, climate and combustion to run …
DCD—disk caching disk: a new approach for boosting I/O performance
Y Hu, Q Yang - ACM SIGARCH Computer Architecture News, 1996 - dl.acm.org
This paper presents a novel disk storage architecture called DCD, Disk Caching Disk, for the
purpose of optimizing I/O performance. The main idea of the DCD is to use a small log disk …
Characterizing deep-learning I/O workloads in TensorFlow
The performance of Deep-Learning (DL) computing frameworks relies on the performance of
data ingestion and checkpointing. In fact, during training, a considerably high number of …
A user-level InfiniBand-based file system and checkpoint strategy for burst buffers
Checkpoint/Restart is an indispensable fault tolerance technique commonly used by high-
performance computing applications that run continuously for hours or days at a time …
A 1 PB/s file system to checkpoint three million MPI tasks
With the massive scale of high-performance computing systems, long-running scientific
parallel applications periodically save the state of their execution to files called checkpoints …