Toward exascale resilience: 2014 update
F Cappello, A Geist, W Gropp, S Kale, B Kramer… - … and Innovations: an …, 2014 - dl.acm.org
Resilience is a major roadblock for HPC executions on future exascale systems. These
systems will typically gather millions of CPU cores running up to a billion threads …
Supervision-by-registration: An unsupervised approach to improve the precision of facial landmark detectors
In this paper, we present supervision-by-registration, an unsupervised approach to improve
the precision of facial landmark detectors on both images and video. Our key observation is …
Variability mitigation in nanometer CMOS integrated systems: A survey of techniques from circuits to software
Variation in performance and power across manufactured parts and their operating
conditions is an accepted reality in modern microelectronic manufacturing processes with …
Using benchmarks for radiation testing of microprocessors and FPGAs
Performance benchmarks have been used over the years to compare different systems.
These benchmarks can be useful for researchers trying to determine how changes to the …
Evaluating the impact of SDC on the GMRES iterative solver
Increasing parallelism and transistor density, along with increasingly tighter energy and
peak power constraints, may force exposure of occasionally incorrect computation or …
×pipes Lite: a synthesis oriented design library for networks on chips
The limited scalability of current bus topologies for systems on chips (SoCs) dictates the
adoption of networks on chips (NoCs) as a scalable interconnection scheme. Current SoCs …
Adaptive impact-driven detection of silent data corruption for HPC applications
S Di, F Cappello - IEEE Transactions on Parallel and …, 2016 - ieeexplore.ieee.org
For exascale HPC applications, silent data corruption (SDC) is one of the most dangerous
problems because there is no indication that there are errors during the execution. We …
Resilience design patterns: A structured approach to resilience at extreme scale
S Hukerikar, C Engelmann - arXiv preprint arXiv:1708.07422, 2017 - arxiv.org
Reliability is a serious concern for future extreme-scale high-performance computing (HPC)
systems. While the HPC community has developed various resilience solutions, the solution …
The use of imprecise processing to improve accuracy in weather & climate prediction
PD Düben, H McNamara, TN Palmer - Journal of Computational Physics, 2014 - Elsevier
The use of stochastic processing hardware and low precision arithmetic in atmospheric
models is investigated. Stochastic processors allow hardware-induced faults in calculations …
RAFTing MapReduce: Fast recovery on the RAFT
MapReduce is a computing paradigm that has gained a lot of popularity as it allows non-
expert users to easily run complex analytical tasks at very large-scale. At such scale, task …