Generation of an error set that emulates software faults based on field data

J Christmansson, R Chillarege - Proceedings of Annual …, 1996 - ieeexplore.ieee.org
A significant issue in fault injection experiments is that the injected faults are representative
of software faults observed in the field. Another important issue is the time used, as we want …

xCCL: A Survey of Industry-Led Collective Communication Libraries for Deep Learning

A Weingram, Y Li, H Qi, D Ng, L Dai, X Lu - Journal of Computer Science …, 2023 - Springer
Machine learning techniques have become ubiquitous both in industry and
academic applications. Increasing model sizes and training data volumes necessitate fast …

Efficient task placement and routing of nearest neighbor exchanges in dragonfly networks

B Prisacari, G Rodriguez, P Heidelberger… - Proceedings of the 23rd …, 2014 - dl.acm.org
Dragonflies are recent network designs that are one of the most promising topologies for the
Exascale effort due to their scalability and cost. While being able to achieve very high …

Topological characterization of hamming and dragonfly networks and its implications on routing

C Camarero, E Vallejo, R Beivide - ACM Transactions on Architecture …, 2014 - dl.acm.org
Current High-Performance Computing (HPC) and data center networks rely on large-radix
routers. Hamming graphs (Cartesian products of complete graphs) and dragonflies (two …

Contention-based nonminimal adaptive routing in high-radix networks

P Fuentes, E Vallejo, M García… - 2015 IEEE …, 2015 - ieeexplore.ieee.org
Adaptive routing is an efficient congestion avoidance mechanism for modern Data center and
HPC networks. Congestion detection traditionally relies on the occupancy of the router …

Hierarchical and reconfigurable optical/electrical interconnection network for high-performance computing

Z Zhao, B Guo, Y Shang, S Huang - Journal of Optical …, 2020 - opg.optica.org
Compared with electrical packet switches, optical switching technology could enable a more
desirable high-performance computing (HPC) system with lower power consumption, lower …

AMLR: an adaptive multi-level routing algorithm for dragonfly network

L Zhu, H Gu, X Yu, W Sun - IEEE Communications Letters, 2021 - ieeexplore.ieee.org
High-radix hierarchical structures, such as the dragonfly, fat-tree, and torus, are cost-
effective topologies for high-performance computer (HPC) networks. In these networks …

SDCC: Software-defined collective communication for distributed training

X Jin, Z Zhang, Y Jia, Y Ma, X Liu - Science China Information Sciences, 2024 - Springer
Communication is crucial to the performance of distributed training. Today's solutions tightly
couple the control and data planes and lack flexibility, generality, and performance. In this …

Level-spread: A new job allocation policy for dragonfly networks

Y Zhang, O Tuncer, F Kaplan, K Olcoz… - 2018 IEEE …, 2018 - ieeexplore.ieee.org
The dragonfly network topology has attracted attention in recent years owing to its high radix
and constant diameter. However, the influence of job allocation on communication time in …

A scheduling policy to save 10% of communication time in parallel fast Fourier transform

SA Aseeri, A Gopal Chatterjee… - Concurrency and …, 2023 - Wiley Online Library
The fast Fourier transform (FFT) has applications in almost every frequency related study, for
example, in image and signal processing, and radio astronomy. It is also used as a Poisson …