Design and modeling of a non-blocking checkpointing system

F Cappello, G Al, W Gropp, S Kale, B Kramer… - … and Innovations: an …, 2014 - dl.acm.org

Resilience is a major roadblock for HPC executions on future exascale systems. These
systems will typically gather millions of CPU cores running up to a billion threads …

Cited by 423 · Related articles · All 14 versions

The landscape of exascale research: A data-driven literature analysis

S Heldens, P Hijma, BV Werkhoven… - ACM Computing …, 2020 - dl.acm.org

The next generation of supercomputers will break the exascale barrier. Soon we will have
systems capable of at least one quintillion (billion billion) floating-point operations per …

Cited by 62 · Related articles · All 4 versions

DARE: High-performance state machine replication on RDMA networks

M Poke, T Hoefler - Proceedings of the 24th International Symposium on …, 2015 - dl.acm.org

The increasing amount of data that needs to be collected and analyzed requires large-scale
datacenter architectures that are naturally more susceptible to faults of single components …

Cited by 186 · Related articles · All 20 versions

The SIMNET virtual world architecture

J Calvin, A Dickens, B Gaines… - Proceedings of IEEE …, 1993 - ieeexplore.ieee.org

Many tools and techniques have been developed to address specific aspects of interacting
in a virtual world. Few have been designed with an architecture that allows large numbers of …

Cited by 298 · Related articles · All 6 versions

Exploration of lossy compression for application-level checkpoint/restart

N Sasaki, K Sato, T Endo… - 2015 IEEE international …, 2015 - ieeexplore.ieee.org

The scale of high performance computing (HPC) systems is exponentially growing,
potentially causing prohibitive shrinkage of mean time between failures (MTBF) while the …

Cited by 115 · Related articles · All 8 versions

Lazy checkpointing: Exploiting temporal locality in failures to mitigate checkpointing overheads on extreme-scale systems

D Tiwari, S Gupta, SS Vazhkudai - 2014 44th Annual IEEE/IFIP …, 2014 - ieeexplore.ieee.org

Continuing increase in the computational power of supercomputers has enabled large-scale
scientific applications in the areas of astrophysics, fusion, climate and combustion to run …

Cited by 118 · Related articles · All 9 versions

DCD—disk caching disk: a new approach for boosting I/O performance

Y Hu, Q Yang - ACM SIGARCH Computer Architecture News, 1996 - dl.acm.org

This paper presents a novel disk storage architecture called DCD, Disk Caching Disk, for the
purpose of optimizing I/O performance. The main idea of the DCD is to use a small log disk …

Cited by 257 · Related articles · All 10 versions

Characterizing deep-learning I/O workloads in TensorFlow

SWD Chien, S Markidis, CP Sishtla… - 2018 IEEE/ACM 3rd …, 2018 - ieeexplore.ieee.org

The performance of Deep-Learning (DL) computing frameworks rely on the performance of
data ingestion and checkpointing. In fact, during the training, a considerable high number of …

Cited by 64 · Related articles · All 10 versions

A user-level InfiniBand-based file system and checkpoint strategy for burst buffers

K Sato, K Mohror, A Moody, T Gamblin… - 2014 14th IEEE/ACM …, 2014 - ieeexplore.ieee.org

Checkpoint/Restart is an indispensable fault tolerance technique commonly used by high-
performance computing applications that run continuously for hours or days at a time …

Cited by 79 · Related articles · All 8 versions

A 1 PB/s file system to checkpoint three million MPI tasks

R Rajachandrasekar, A Moody, K Mohror… - Proceedings of the 22nd …, 2013 - dl.acm.org

With the massive scale of high-performance computing systems, long-running scientific
parallel applications periodically save the state of their execution to files called checkpoints …

Cited by 81 · Related articles · All 12 versions