Assessing dependability with software fault injection: A survey
With the rise of software complexity, software-related accidents represent a significant threat
for computer-based systems. Software Fault Injection is a method to anticipate worst-case …
for computer-based systems. Software Fault Injection is a method to anticipate worst-case …
Software fault tolerance in real-time systems: Identifying the future research questions
Tolerating hardware faults in modern architectures is becoming a prominent problem due to
the miniaturization of the hardware components, their increasing complexity, and the …
the miniaturization of the hardware components, their increasing complexity, and the …
Memory errors in modern systems: The good, the bad, and the ugly
Several recent publications have shown that hardware faults in the memory subsystem are
commonplace. These faults are predicted to become more frequent in future systems that …
commonplace. These faults are predicted to become more frequent in future systems that …
Understanding latency variation in modern DRAM chips: Experimental characterization, analysis, and optimization
Long DRAM latency is a critical performance bottleneck in current systems. DRAM access
latency is defined by three fundamental operations that take place within the DRAM cell …
latency is defined by three fundamental operations that take place within the DRAM cell …
Revisiting memory errors in large-scale production data centers: Analysis and modeling of new trends from the field
Computing systems use dynamic random-access memory (DRAM) as main memory. As
prior works have shown, failures in DRAM devices are an important source of errors in …
prior works have shown, failures in DRAM devices are an important source of errors in …
A study of DRAM failures in the field
V Sridharan, D Liberty - SC'12: Proceedings of the International …, 2012 - ieeexplore.ieee.org
Most modern computer systems use dynamic random access memory (DRAM) as a main
memory store. Recent publications have confirmed that DRAM errors are a common source …
memory store. Recent publications have confirmed that DRAM errors are a common source …
Cosmic rays don't strike twice: Understanding the nature of DRAM errors and the implications for system design
Main memory is one of the leading hardware causes for machine crashes in today's
datacenters. Designing, evaluating and modeling systems that are resilient against memory …
datacenters. Designing, evaluating and modeling systems that are resilient against memory …
The efficacy of error mitigation techniques for DRAM retention failures: A comparative experimental study
As DRAM cells continue to shrink, they become more susceptible to retention failures.
DRAM cells that permanently exhibit short retention times are fairly easy to identify and …
DRAM cells that permanently exhibit short retention times are fairly easy to identify and …
Feng shui of supercomputer memory: Positional effects in DRAM and SRAM faults
Several recent publications confirm that faults are common in high-performance computing
systems. Therefore, further attention to the faults experienced by such computing systems is …
systems. Therefore, further attention to the faults experienced by such computing systems is …
Characterizing application memory error vulnerability to optimize datacenter cost via heterogeneous-reliability memory
Memory devices represent a key component of datacenter total cost of ownership (TCO),
and techniques used to reduce errors that occur on these devices increase this cost. Existing …
and techniques used to reduce errors that occur on these devices increase this cost. Existing …