Super-scalable algorithms for computing on 100,000 processors

F Cappello, A Geist, B Gropp, L Kale… - … Journal of High …, 2009 - journals.sagepub.com

Over the past few years resilience has become a major issue for high-performance
computing (HPC) systems, in particular in the perspective of large petascale systems and …

Cited by 484 · Related articles · All 14 versions

Extending stability beyond CPU millennium: a micron-scale atomistic simulation of Kelvin-Helmholtz instability

JN Glosli, DF Richards, KJ Caspersen… - Proceedings of the …, 2007 - dl.acm.org

We report the computational advances that have enabled the first micron-scale simulation of
a Kelvin-Helmholtz (KH) instability using molecular dynamics (MD). The advances are in …

Cited by 140 · Related articles · All 4 versions

Scaling to a million cores and beyond: Using light-weight simulation to understand the challenges ahead on the road to exascale

C Engelmann - Future Generation Computer Systems, 2014 - Elsevier

As supercomputers scale to 1000 PFlop/s over the next decade, investigating the
performance of parallel applications at scale on future architectures and the performance …

Cited by 70 · Related articles · All 9 versions

xSim: The extreme-scale simulator

S Böhm, C Engelmann - 2011 International Conference on …, 2011 - ieeexplore.ieee.org

Investigating parallel application performance at scale is an important part of high-performance
computing (HPC) application development. The Extreme-scale Simulator …

Cited by 70 · Related articles · All 11 versions

Desynchronization in distributed Ant Colony Optimization in HPC environment

M Starzec, G Starzec, A Byrski, W Turek… - Future Generation …, 2020 - Elsevier

Metaheuristics have significant computing requirements, in particular Ant Colony
Optimization (ACO) processes a population of individuals (agents/ants) roaming in a graph …

Cited by 23 · Related articles

Recovery patterns for iterative methods in a parallel unstable environment

J Langou, Z Chen, G Bosilca, J Dongarra - SIAM Journal on Scientific …, 2008 - SIAM

Several recovery techniques for parallel iterative methods are presented. First, the
implementation of checkpoints in parallel iterative methods is described and analyzed. Then …

Cited by 75 · Related articles · All 25 versions

Dynamic resource provisioning for sustainable cloud computing systems in the presence of correlated failures

Y Sharma, J Taheri, W Si, D Sun… - IEEE Transactions on …, 2020 - ieeexplore.ieee.org

Dependence of computing resources on each other in cloud computing systems (CCS)
makes them prone to fail in correlated manner which significantly impacts their service …

Cited by 15 · Related articles · All 5 versions

[BOOK][B] Coordinated checkpoint/restart process fault tolerance for MPI applications on HPC systems

J Hursey - 2010 - search.proquest.com

Scientists use advanced computing techniques to assist in answering the complex questions
at the forefront of discovery. The High Performance Computing (HPC) scientific applications …

Cited by 45 · Related articles · All 6 versions

A log-scaling fault tolerant agreement algorithm for a fault tolerant MPI

J Hursey, T Naughton, G Vallee, RL Graham - Recent Advances in the …, 2011 - Springer

The lack of fault tolerance is becoming a limiting factor for application scalability in HPC
systems. The MPI does not provide standardized fault tolerance interfaces and semantics …

Cited by 43 · Related articles · All 14 versions

Task-level resilience: checkpointing vs. supervision

J Posner, L Reitz, C Fohry - International Journal of Networking and …, 2022 - jstage.jst.go.jp

With the advent of exascale computing, issues such as application irregularity and
permanent hardware failure are growing in importance. Irregularity is often addressed by …

Cited by 6 · Related articles · All 7 versions