Reliability Analysis in HPC clusters

L Zhao, Y Ren, Y Xiang… - 2010 IEEE 12th …, 2010 - ieeexplore.ieee.org

In the existing studies on fault-tolerant scheduling, the active replication schema makes use
of ε + 1 replicas for each task to tolerate ε failures. However, in this paper, we show that it …

Cited by 118 · Related articles · All 11 versions

Possibility for decision

C Carlsson, R Fullér - Studies in Fuzziness and Soft Computing, 2011 - Springer

This monograph summarizes the authors' works in the first decade of the 21st century on
possibility distributions and decisions. The book is organized as follows. It begins, in …

Cited by 92 · Related articles · All 8 versions

[PDF] psu.edu

Fault tolerance and recovery of scientific workflows on computational grids

G Kandaswamy, A Mandal… - 2008 Eighth IEEE …, 2008 - ieeexplore.ieee.org

In this paper, we describe the design and implementation of two mechanisms for fault-
tolerance and recovery for complex scientific workflows on computational grids. We present …

Cited by 123 · Related articles · All 10 versions

[PDF] osti.gov

Resource monitoring and management with OVIS to enable HPC in cloud computing environments

J Brandt, A Gentile, J Mayo, P Pebay… - … on Parallel & …, 2009 - ieeexplore.ieee.org

Using the cloud computing paradigm, a host of companies promise to make huge compute
resources available to users on a pay-as-you-go basis. These resources can be configured …

Cited by 100 · Related articles · All 10 versions

[PDF] academia.edu

An analysis of clustered failures on large supercomputing systems

TJ Hacker, F Romero, CD Carothers - Journal of Parallel and Distributed …, 2009 - Elsevier

Large supercomputers are built today using thousands of commodity components, and
suffer from poor reliability due to frequent component failures. The characteristics of failure …

Cited by 94 · Related articles · All 7 versions

[PDF] researchgate.net

Combined fault tolerance and scheduling techniques for workflow applications on computational grids

Y Zhang, A Mandal, C Koelbel… - 2009 9th IEEE/ACM …, 2009 - ieeexplore.ieee.org

Complex scientific workflows are now increasingly executed on computational grids. In
addition to the challenges of managing and scheduling these workflows, reliability …

Cited by 77 · Related articles · All 6 versions

[PDF] hal.science

Numerical recovery strategies for parallel resilient Krylov linear solvers

E Agullo, L Giraud, A Guermouche… - … Linear Algebra with …, 2016 - Wiley Online Library

As the computational power of high‐performance computing systems continues to increase
by using a huge number of cores or specialized processing units, high‐performance …

Cited by 37 · Related articles · All 6 versions

[PDF] ulakbim.gov.tr

A methodology for comparing the reliability of GPU-based and CPU-based HPCs

N Cini, G Yalcin - ACM Computing Surveys (CSUR), 2020 - dl.acm.org

Today, GPUs are widely used as coprocessors/accelerators in High-Performance
Heterogeneous Computing due to their many advantages. However, many researches …

Cited by 7 · Related articles · All 4 versions

Quantifying effectiveness of failure prediction and response in HPC systems: Methodology and example

J Brandt, F Chen, V De Sapio, A Gentile… - … Workshops (DSN-W …, 2010 - ieeexplore.ieee.org

Effective failure prediction and mitigation strategies in high-performance computing systems
could provide huge gains in resilience of tightly coupled large-scale scientific codes. These …

Cited by 21 · Related articles · All 4 versions

[PDF] northeastern.edu

Shiraz: Exploiting system reliability and application resilience characteristics to improve large scale system throughput

R Garg, T Patel, G Cooperman… - 2018 48th Annual IEEE …, 2018 - ieeexplore.ieee.org

Large-scale applications rely on resilience mechanisms such as checkpoint-restart to make
forward progress in the presence of failures. Unfortunately, this incurs huge I/O overhead …

Cited by 13 · Related articles · All 3 versions