Scaling distributed machine learning with In-Network aggregation
Training machine learning models in parallel is an increasingly important workload. We
accelerate distributed parallel training by designing a communication primitive that uses a …
The Sunway TaihuLight supercomputer: system and applications
The Sunway TaihuLight supercomputer is the world's first system with a peak
performance greater than 100 PFlops. In this paper, we provide a detailed introduction to the …
Status and future perspectives for lattice gauge theory calculations to the exascale and beyond
B Joó, C Jung, NH Christ, W Detmold… - The European Physical …, 2019 - Springer
In this and a set of companion white papers, the USQCD Collaboration lays out a program of
science and computing for lattice gauge theory. These white papers describe how …
[BOOK][B] Distributed and cloud computing: from parallel processing to the internet of things
Distributed and Cloud Computing: From Parallel Processing to the Internet of Things offers
complete coverage of modern distributed computing technology including clusters, the grid …
Hot sax: Efficiently finding the most unusual time series subsequence
In this work, we introduce the new problem of finding time series discords. Time series
discords are subsequences of a longer time series that are maximally different to all the rest …
What supercomputers say: A study of five system logs
A Oliner, J Stearley - 37th annual IEEE/IFIP international …, 2007 - ieeexplore.ieee.org
If we hope to automatically detect and diagnose failures in large-scale computer systems,
we must study real deployed systems and the data they generate. Progress has been …
AutoLog: Anomaly detection by deep autoencoding of system logs
The use of system logs for detecting and troubleshooting anomalies of production systems
has been known since the early days of computers. In spite of the advances in the area, the …
The Tofu interconnect D
Y Ajima, T Kawashima, T Okamoto… - 2018 IEEE …, 2018 - ieeexplore.ieee.org
In this paper, we introduce a new and highly scalable interconnect called Tofu interconnect
D that will be used in the post-K machine. This machine will officially be operational around …
Blue Gene/L torus interconnection network
NR Adiga, MA Blumrich, D Chen… - IBM Journal of …, 2005 - ieeexplore.ieee.org
The main interconnect of the massively parallel Blue Gene®/L is a three-dimensional torus
network with dynamic virtual cut-through routing. This paper describes both the architecture …
Power aware scheduling of bag-of-tasks applications with deadline constraints on DVS-enabled clusters
Power-aware scheduling problem has been a recent issue in cluster systems not only for
operational cost due to electricity cost, but also for system reliability. As recent commodity …