NVBit: A dynamic binary instrumentation framework for NVIDIA GPUs

O Villa, M Stephenson, D Nellans… - Proceedings of the 52nd …, 2019 - dl.acm.org
Binary instrumentation frameworks are widely used to implement profilers, performance
evaluation, error checking, and bug detection tools. While dynamic binary instrumentation …

GPU-Qin: A methodology for evaluating the error resilience of GPGPU applications

B Fang, K Pattabiraman, M Ripeanu… - … Analysis of Systems …, 2014 - ieeexplore.ieee.org
While graphics processing units (GPUs) have gained wide adoption as accelerators for
general-purpose applications (GPGPU), the end-to-end reliability implications of their use …

Characterizing application memory error vulnerability to optimize datacenter cost via heterogeneous-reliability memory

Y Luo, S Govindan, B Sharma… - 2014 44th Annual …, 2014 - ieeexplore.ieee.org
Memory devices represent a key component of datacenter total cost of ownership (TCO),
and techniques used to reduce errors that occur on these devices increase this cost. Existing …

Understanding the propagation of transient errors in HPC applications

RA Ashraf, R Gioiosa, G Kestor, RF DeMara… - Proceedings of the …, 2015 - dl.acm.org
Resiliency of exascale systems has quickly become an important concern for the scientific
community. Despite its importance, still much remains to be determined regarding how faults …

×pipes Lite: a synthesis oriented design library for networks on chips

S Stergiou, F Angiolini, S Carta, L Raffo… - … Automation and Test …, 2005 - ieeexplore.ieee.org
The limited scalability of current bus topologies for systems on chips (SoCs) dictates the
adoption of networks on chips (NoCs) as a scalable interconnection scheme. Current SoCs …

Lightweight silent data corruption detection based on runtime data analysis for HPC applications

E Berrocal, L Bautista-Gomez, S Di, Z Lan… - Proceedings of the 24th …, 2015 - dl.acm.org
Next-generation supercomputers are expected to have more components and, at the same
time, consume several times less energy per operation. Consequently, the number of soft …

Experimental and analytical study of Xeon Phi reliability

D Oliveira, L Pilla, N DeBardeleben… - Proceedings of the …, 2017 - dl.acm.org
We present an in-depth analysis of transient faults effects on HPC applications in Intel Xeon
Phi processors based on radiation experiments and high-level fault injection. Besides …

F-SEFI: A fine-grained soft error fault injection tool for profiling application vulnerability

Q Guan, N DeBardeleben… - 2014 IEEE 28th …, 2014 - ieeexplore.ieee.org
As the high performance computing (HPC) community continues to push towards exascale
computing, resilience remains a serious challenge. With the expected decrease of both …

GPGPUs: How to combine high computational power with high reliability

LB Gomez, F Cappello, L Carro… - … , Automation & Test …, 2014 - ieeexplore.ieee.org
GPGPUs are used increasingly in several domains, from gaming to different kinds of
computationally intensive applications. In many applications GPGPU reliability is becoming …

FlipIt: An LLVM based fault injector for HPC

J Calhoun, L Olson, M Snir - … Euro-Par 2014 International Workshops, Porto …, 2014 - Springer
High performance computing (HPC) is increasingly subjected to faulty computations. The
frequency of silent data corruptions (SDCs) in particular is expected to increase in emerging …