Analyzing and increasing the reliability of convolutional neural networks on GPUs

FF dos Santos, PF Pimenta, C Lunardi… - IEEE Transactions …, 2018 - ieeexplore.ieee.org
Graphics processing units (GPUs) are playing a critical role in convolutional neural networks
(CNNs) for image detection. As GPU-enabled CNNs move into safety-critical environments …

[BOOK][B] Architecture design for soft errors

S Mukherjee - 2011 - books.google.com
Architecture Design for Soft Errors provides a comprehensive description of the architectural
techniques to tackle the soft error problem. It covers the new methodologies for quantitative …

Artificial neural networks for space and safety-critical applications: Reliability issues and potential solutions

P Rech - IEEE Transactions on Nuclear Science, 2024 - ieeexplore.ieee.org
Machine learning is among the greatest advancements in computer science and
engineering and is today used to classify or detect objects, a key feature in autonomous …

Low-cost program-level detectors for reducing silent data corruptions

SKS Hari, SV Adve, H Naeimi - IEEE/IFIP international …, 2012 - ieeexplore.ieee.org
With technology scaling, transient faults are becoming an increasing threat to hardware
reliability. Commodity systems must be made resilient to these in-field faults through very …

Application-level correctness and its impact on fault tolerance

X Li, D Yeung - 2007 IEEE 13th International symposium on …, 2007 - ieeexplore.ieee.org
Traditionally, fault tolerance researchers have required architectural state to be numerically
perfect for program execution to be correct. However, in many programs, even if execution is …

Survey on Redundancy Based-Fault tolerance methods for Processors and Hardware accelerators-Trends in Quantum Computing, Heterogeneous Systems and …

S Venkatesha, R Parthasarathi - ACM Computing Surveys, 2024 - dl.acm.org
Rapid progress in the CMOS technology for the past 25 years has increased the
vulnerability of processors towards faults. Subsequently, focus of computer architects shifted …

Examining ACE analysis reliability estimates using fault-injection

NJ Wang, A Mahesri, SJ Patel - Proceedings of the 34th annual …, 2007 - dl.acm.org
ACE analysis is a technique to provide an early reliability estimate for microprocessors. ACE
analysis couples data from abstract performance models with low level design details to …

Software-controlled fault tolerance

GA Reis, J Chang, N Vachharajani, R Rangan… - ACM Transactions on …, 2005 - dl.acm.org
Traditional fault-tolerance techniques typically utilize resources ineffectively because they
cannot adapt to the changing reliability and performance demands of a system. This paper …

nZDC: A compiler technique for near zero silent data corruption

M Didehban, A Shrivastava - Proceedings of the 53rd Annual Design …, 2016 - dl.acm.org
Exponentially growing rate of soft errors makes reliability a major concern in modern
processor design. Since software-oriented approaches offer flexible protection even in off …

Compiler-managed software-based redundant multi-threading for transient fault detection

C Wang, H Kim, Y Wu, V Ying - International Symposium on …, 2007 - ieeexplore.ieee.org
As transistors become increasingly smaller and faster with tighter noise margins, modern
processors are becoming increasingly more susceptible to transient hardware faults …