Fault-tolerant scheduling with dynamic number of replicas in heterogeneous systems

L Zhao, Y Ren, Y Xiang… - 2010 IEEE 12th …, 2010 - ieeexplore.ieee.org
In the existing studies on fault-tolerant scheduling, the active replication schema makes use
of ε + 1 replicas for each task to tolerate ε failures. However, in this paper, we show that it …

Possibility for decision

C Carlsson, R Fullér - Studies in Fuzziness and Soft Computing, 2011 - Springer
This monograph summarizes the authors' works in the first decade of the 21st century on
possibility distributions and decisions. The book is organized as follows. It begins, in …

Fault tolerance and recovery of scientific workflows on computational grids

G Kandaswamy, A Mandal… - 2008 Eighth IEEE …, 2008 - ieeexplore.ieee.org
In this paper, we describe the design and implementation of two mechanisms for fault-
tolerance and recovery for complex scientific workflows on computational grids. We present …

Resource monitoring and management with OVIS to enable HPC in cloud computing environments

J Brandt, A Gentile, J Mayo, P Pebay… - … on Parallel & …, 2009 - ieeexplore.ieee.org
Using the cloud computing paradigm, a host of companies promise to make huge compute
resources available to users on a pay-as-you-go basis. These resources can be configured …

An analysis of clustered failures on large supercomputing systems

TJ Hacker, F Romero, CD Carothers - Journal of Parallel and Distributed …, 2009 - Elsevier
Large supercomputers are built today using thousands of commodity components, and
suffer from poor reliability due to frequent component failures. The characteristics of failure …

Combined fault tolerance and scheduling techniques for workflow applications on computational grids

Y Zhang, A Mandal, C Koelbel… - 2009 9th IEEE/ACM …, 2009 - ieeexplore.ieee.org
Complex scientific workflows are now increasingly executed on computational grids. In
addition to the challenges of managing and scheduling these workflows, reliability …

Numerical recovery strategies for parallel resilient Krylov linear solvers

E Agullo, L Giraud, A Guermouche… - … Linear Algebra with …, 2016 - Wiley Online Library
As the computational power of high‐performance computing systems continues to increase
by using a huge number of cores or specialized processing units, high‐performance …

A methodology for comparing the reliability of GPU-based and CPU-based HPCs

N Cini, G Yalcin - ACM Computing Surveys (CSUR), 2020 - dl.acm.org
Today, GPUs are widely used as coprocessors/accelerators in High-Performance
Heterogeneous Computing due to their many advantages. However, many researches …

Quantifying effectiveness of failure prediction and response in HPC systems: Methodology and example

J Brandt, F Chen, V De Sapio, A Gentile… - … Workshops (DSN-W …, 2010 - ieeexplore.ieee.org
Effective failure prediction and mitigation strategies in high-performance computing systems
could provide huge gains in resilience of tightly coupled large-scale scientific codes. These …

Shiraz: Exploiting system reliability and application resilience characteristics to improve large scale system throughput

R Garg, T Patel, G Cooperman… - 2018 48th Annual IEEE …, 2018 - ieeexplore.ieee.org
Large-scale applications rely on resilience mechanisms such as checkpoint-restart to make
forward progress in the presence of failures. Unfortunately, this incurs huge I/O overhead …