NVBit: A dynamic binary instrumentation framework for NVIDIA GPUs
Binary instrumentation frameworks are widely used to implement profilers, performance
evaluation, error checking, and bug detection tools. While dynamic binary instrumentation …
GPU-Qin: A methodology for evaluating the error resilience of GPGPU applications
While graphics processing units (GPUs) have gained wide adoption as accelerators for
general-purpose applications (GPGPU), the end-to-end reliability implications of their use …
Characterizing application memory error vulnerability to optimize datacenter cost via heterogeneous-reliability memory
Memory devices represent a key component of datacenter total cost of ownership (TCO),
and techniques used to reduce errors that occur on these devices increase this cost. Existing …
Understanding the propagation of transient errors in HPC applications
Resiliency of exascale systems has quickly become an important concern for the scientific
community. Despite its importance, still much remains to be determined regarding how faults …
×pipes Lite: a synthesis oriented design library for networks on chips
The limited scalability of current bus topologies for systems on chips (SoCs) dictates the
adoption of networks on chips (NoCs) as a scalable interconnection scheme. Current SoCs …
Lightweight silent data corruption detection based on runtime data analysis for HPC applications
Next-generation supercomputers are expected to have more components and, at the same
time, consume several times less energy per operation. Consequently, the number of soft …
Experimental and analytical study of Xeon Phi reliability
We present an in-depth analysis of transient faults effects on HPC applications in Intel Xeon
Phi processors based on radiation experiments and high-level fault injection. Besides …
F-SEFI: A fine-grained soft error fault injection tool for profiling application vulnerability
Q Guan, N Debardeleben… - 2014 IEEE 28th …, 2014 - ieeexplore.ieee.org
As the high performance computing (HPC) community continues to push towards exascale
computing, resilience remains a serious challenge. With the expected decrease of both …
GPGPUs: How to combine high computational power with high reliability
GPGPUs are used increasingly in several domains, from gaming to different kinds of
computationally intensive applications. In many applications GPGPU reliability is becoming …
FlipIt: An LLVM based fault injector for HPC
High performance computing (HPC) is increasingly subjected to faulty computations. The
frequency of silent data corruptions (SDCs) in particular is expected to increase in emerging …