TPU v4: An optically reconfigurable supercomputer for machine learning with hardware support for embeddings

N Jouppi, G Kurian, S Li, P Ma, R Nagarajan… - Proceedings of the 50th …, 2023 - dl.acm.org
In response to innovations in machine learning (ML) models, production workloads changed
radically and rapidly. TPU v4 is the fifth Google domain specific architecture (DSA) and its …

High-bandwidth density silicon photonic resonators for energy-efficient optical interconnects

A Novick, A James, LY Dai, Z Wu, A Rizzo… - Applied Physics …, 2023 - pubs.aip.org
The growth of artificial intelligence applications demands ever larger and more complex
deep learning models, dominating today's—and tomorrow's—data center and high …

Lightning: A reconfigurable photonic-electronic SmartNIC for fast and energy-efficient inference

Z Zhong, M Yang, J Lang, C Williams… - Proceedings of the …, 2023 - dl.acm.org
The massive growth of machine learning-based applications and the end of Moore's law
have created a pressing need to redesign computing platforms. We propose Lightning, the …

Petabit-scale silicon photonic interconnects with integrated Kerr frequency combs

A Rizzo, S Daudlin, A Novick, A James… - IEEE Journal of …, 2022 - ieeexplore.ieee.org
Silicon photonics holds significant promise in revolutionizing optical interconnects in data
centers and high performance computers to enable scaling into the Pb/s package escape …

Topologies in distributed machine learning: Comprehensive survey, recommendations and future directions

L Liu, P Zhou, G Sun, X Chen, T Wu, H Yu, M Guizani - Neurocomputing, 2024 - Elsevier
With the widespread use of distributed machine learning (DML), many IT companies have
established networks dedicated to DML. Different communication architectures of DML have …

Cerberus: The power of choices in datacenter topology design - a throughput perspective

C Griner, J Zerwas, A Blenk, M Ghobadi… - Proceedings of the …, 2021 - dl.acm.org
The bandwidth and latency requirements of modern datacenter applications have led
researchers to propose various topology designs using static, dynamic demand-oblivious …

CASSINI: Network-Aware Job Scheduling in Machine Learning Clusters

S Rajasekaran, M Ghobadi, A Akella - 21st USENIX Symposium on …, 2024 - usenix.org
We present CASSINI, a network-aware job scheduler for machine learning (ML) clusters.
CASSINI introduces a novel geometric abstraction to consider the communication pattern of …

Lightwave fabrics: at-scale optical circuit switching for datacenter and machine learning systems

H Liu, R Urata, K Yasumura, X Zhou… - Proceedings of the …, 2023 - dl.acm.org
We describe our experience developing what we believe to be the world's first large-scale
production deployments of lightwave fabrics used for both datacenter networking and …

GRID: Gradient routing with in-network aggregation for distributed training

J Fang, G Zhao, H Xu, C Wu… - IEEE/ACM Transactions on …, 2023 - ieeexplore.ieee.org
As the scale of distributed training increases, it brings huge communication overhead in
clusters. Some works try to reduce the communication cost through gradient compression or …

Congestion control in machine learning clusters

S Rajasekaran, M Ghobadi, G Kumar… - Proceedings of the 21st …, 2022 - dl.acm.org
This paper argues that fair-sharing, the holy grail of congestion control algorithms for
decades, is not necessarily a desirable property in Machine Learning (ML) training clusters …