Fault-tolerant scheduling with dynamic number of replicas in heterogeneous systems
In the existing studies on fault-tolerant scheduling, the active replication schema makes use
of e + 1 replicas for each task to tolerate e failures. However, in this paper, we show that it …
Possibility for decision
C Carlsson, R Fullér - Studies in Fuzziness and Soft Computing, 2011 - Springer
This monograph summarizes the authors' works in the first decade of the 21st century on
possibility distributions and decisions. The book is organized as follows. It begins, in …
Fault tolerance and recovery of scientific workflows on computational grids
G Kandaswamy, A Mandal… - 2008 Eighth IEEE …, 2008 - ieeexplore.ieee.org
In this paper, we describe the design and implementation of two mechanisms for fault-
tolerance and recovery for complex scientific workflows on computational grids. We present …
Resource monitoring and management with OVIS to enable HPC in cloud computing environments
Using the cloud computing paradigm, a host of companies promise to make huge compute
resources available to users on a pay-as-you-go basis. These resources can be configured …
An analysis of clustered failures on large supercomputing systems
TJ Hacker, F Romero, CD Carothers - Journal of Parallel and Distributed …, 2009 - Elsevier
Large supercomputers are built today using thousands of commodity components, and
suffer from poor reliability due to frequent component failures. The characteristics of failure …
Combined fault tolerance and scheduling techniques for workflow applications on computational grids
Y Zhang, A Mandal, C Koelbel… - 2009 9th IEEE/ACM …, 2009 - ieeexplore.ieee.org
Complex scientific workflows are now increasingly executed on computational grids. In
addition to the challenges of managing and scheduling these workflows, reliability …
Numerical recovery strategies for parallel resilient Krylov linear solvers
As the computational power of high‐performance computing systems continues to increase
by using a huge number of cores or specialized processing units, high‐performance …
A methodology for comparing the reliability of GPU-based and CPU-based HPCs
Today, GPUs are widely used as coprocessors/accelerators in High-Performance
Heterogeneous Computing due to their many advantages. However, many studies …
Quantifying effectiveness of failure prediction and response in HPC systems: Methodology and example
Effective failure prediction and mitigation strategies in high-performance computing systems
could provide huge gains in resilience of tightly coupled large-scale scientific codes. These …
Shiraz: Exploiting system reliability and application resilience characteristics to improve large scale system throughput
Large-scale applications rely on resilience mechanisms such as checkpoint-restart to make
forward progress in the presence of failures. Unfortunately, this incurs huge I/O overhead …