Generation of an error set that emulates software faults based on field data
J Christmansson, R Chillarege - Proceedings of Annual …, 1996 - ieeexplore.ieee.org
A significant issue in fault injection experiments is that the injected faults are representative
of software faults observed in the field. Another important issue is the time used, as we want …
xCCL: A Survey of Industry-Led Collective Communication Libraries for Deep Learning
Abstract Machine learning techniques have become ubiquitous both in industry and
academic applications. Increasing model sizes and training data volumes necessitate fast …
Efficient task placement and routing of nearest neighbor exchanges in dragonfly networks
B Prisacari, G Rodriguez, P Heidelberger… - Proceedings of the 23rd …, 2014 - dl.acm.org
Dragonflies are recent network designs that are one of the most promising topologies for the
Exascale effort due to their scalability and cost. While being able to achieve very high …
Topological characterization of hamming and dragonfly networks and its implications on routing
Current High-Performance Computing (HPC) and data center networks rely on large-radix
routers. Hamming graphs (Cartesian products of complete graphs) and dragonflies (two …
Contention-based nonminimal adaptive routing in high-radix networks
Adaptive routing is an efficient congestion avoidance mechanism for modern Data center and
HPC networks. Congestion detection traditionally relies on the occupancy of the router …
Hierarchical and reconfigurable optical/electrical interconnection network for high-performance computing
Compared with electrical packet switches, optical switching technology could enable a more
desirable high-performance computing (HPC) system with lower power consumption, lower …
AMLR: an adaptive multi-level routing algorithm for dragonfly network
L Zhu, H Gu, X Yu, W Sun - IEEE Communications Letters, 2021 - ieeexplore.ieee.org
High-radix hierarchical structures, such as the dragonfly, fat-tree, and torus, are cost-
effective topologies for high-performance computer (HPC) networks. In these networks …
SDCC: Software-defined collective communication for distributed training
Communication is crucial to the performance of distributed training. Today's solutions tightly
couple the control and data planes and lack flexibility, generality, and performance. In this …
Level-spread: A new job allocation policy for dragonfly networks
The dragonfly network topology has attracted attention in recent years owing to its high radix
and constant diameter. However, the influence of job allocation on communication time in …
A scheduling policy to save 10% of communication time in parallel fast Fourier transform
SA Aseeri, A Gopal Chatterjee… - Concurrency and …, 2023 - Wiley Online Library
The fast Fourier transform (FFT) has applications in almost every frequency related study, for
example, in image and signal processing, and radio astronomy. It is also used as a Poisson …