Assessing dependability with software fault injection: A survey

R Natella, D Cotroneo, HS Madeira - ACM Computing Surveys (CSUR), 2016 - dl.acm.org
With the rise of software complexity, software-related accidents represent a significant threat
for computer-based systems. Software Fault Injection is a method to anticipate worst-case …

Software fault tolerance in real-time systems: Identifying the future research questions

F Reghenzani, Z Guo, W Fornaciari - ACM Computing Surveys, 2023 - dl.acm.org
Tolerating hardware faults in modern architectures is becoming a prominent problem due to
the miniaturization of the hardware components, their increasing complexity, and the …

Memory errors in modern systems: The good, the bad, and the ugly

V Sridharan, N DeBardeleben, S Blanchard… - ACM SIGARCH …, 2015 - dl.acm.org
Several recent publications have shown that hardware faults in the memory subsystem are
commonplace. These faults are predicted to become more frequent in future systems that …

Understanding latency variation in modern DRAM chips: Experimental characterization, analysis, and optimization

KK Chang, A Kashyap, H Hassan, S Ghose… - Proceedings of the …, 2016 - dl.acm.org
Long DRAM latency is a critical performance bottleneck in current systems. DRAM access
latency is defined by three fundamental operations that take place within the DRAM cell …

Revisiting memory errors in large-scale production data centers: Analysis and modeling of new trends from the field

J Meza, Q Wu, S Kumar, O Mutlu - 2015 45th Annual IEEE/IFIP …, 2015 - ieeexplore.ieee.org
Computing systems use dynamic random-access memory (DRAM) as main memory. As
prior works have shown, failures in DRAM devices are an important source of errors in …

A study of DRAM failures in the field

V Sridharan, D Liberty - SC'12: Proceedings of the International …, 2012 - ieeexplore.ieee.org
Most modern computer systems use dynamic random access memory (DRAM) as a main
memory store. Recent publications have confirmed that DRAM errors are a common source …

Cosmic rays don't strike twice: Understanding the nature of DRAM errors and the implications for system design

AA Hwang, IA Stefanovici, B Schroeder - ACM SIGPLAN Notices, 2012 - dl.acm.org
Main memory is one of the leading hardware causes for machine crashes in today's
datacenters. Designing, evaluating and modeling systems that are resilient against memory …

The efficacy of error mitigation techniques for DRAM retention failures: A comparative experimental study

S Khan, D Lee, Y Kim, AR Alameldeen… - ACM SIGMETRICS …, 2014 - dl.acm.org
As DRAM cells continue to shrink, they become more susceptible to retention failures.
DRAM cells that permanently exhibit short retention times are fairly easy to identify and …

Feng shui of supercomputer memory: Positional effects in DRAM and SRAM faults

V Sridharan, J Stearley, N DeBardeleben… - Proceedings of the …, 2013 - dl.acm.org
Several recent publications confirm that faults are common in high-performance computing
systems. Therefore, further attention to the faults experienced by such computing systems is …

Characterizing application memory error vulnerability to optimize datacenter cost via heterogeneous-reliability memory

Y Luo, S Govindan, B Sharma… - 2014 44th Annual …, 2014 - ieeexplore.ieee.org
Memory devices represent a key component of datacenter total cost of ownership (TCO),
and techniques used to reduce errors that occur on these devices increase this cost. Existing …