Scaling distributed machine learning with In-Network aggregation

A Sapio, M Canini, CY Ho, J Nelson, P Kalnis… - … USENIX Symposium on …, 2021 - usenix.org
Training machine learning models in parallel is an increasingly important workload. We
accelerate distributed parallel training by designing a communication primitive that uses a …

The Sunway TaihuLight supercomputer: system and applications

H Fu, J Liao, J Yang, L Wang, Z Song, X Huang… - Science China …, 2016 - Springer
Abstract The Sunway TaihuLight supercomputer is the world's first system with a peak
performance greater than 100 PFlops. In this paper, we provide a detailed introduction to the …

Status and future perspectives for lattice gauge theory calculations to the exascale and beyond

B Joó, C Jung, NH Christ, W Detmold… - The European Physical …, 2019 - Springer
In this and a set of companion white papers, the USQCD Collaboration lays out a program of
science and computing for lattice gauge theory. These white papers describe how …

[BOOK][B] Distributed and cloud computing: from parallel processing to the internet of things

K Hwang, J Dongarra, GC Fox - 2013 - books.google.com
Distributed and Cloud Computing: From Parallel Processing to the Internet of Things offers
complete coverage of modern distributed computing technology including clusters, the grid …

HOT SAX: Efficiently finding the most unusual time series subsequence

E Keogh, J Lin, A Fu - … Conference on Data Mining (ICDM'05), 2005 - ieeexplore.ieee.org
In this work, we introduce the new problem of finding time series discords. Time series
discords are subsequences of a longer time series that are maximally different to all the rest …

What supercomputers say: A study of five system logs

A Oliner, J Stearley - 37th annual IEEE/IFIP international …, 2007 - ieeexplore.ieee.org
If we hope to automatically detect and diagnose failures in large-scale computer systems,
we must study real deployed systems and the data they generate. Progress has been …

AutoLog: Anomaly detection by deep autoencoding of system logs

M Catillo, A Pecchia, U Villano - Expert Systems with Applications, 2022 - Elsevier
The use of system logs for detecting and troubleshooting anomalies of production systems
has been known since the early days of computers. In spite of the advances in the area, the …

The Tofu interconnect D

Y Ajima, T Kawashima, T Okamoto… - 2018 IEEE …, 2018 - ieeexplore.ieee.org
In this paper, we introduce a new and highly scalable interconnect called Tofu interconnect
D that will be used in the post-K machine. This machine will officially be operational around …

Blue Gene/L torus interconnection network

NR Adiga, MA Blumrich, D Chen… - IBM Journal of …, 2005 - ieeexplore.ieee.org
The main interconnect of the massively parallel Blue Gene®/L is a three-dimensional torus
network with dynamic virtual cut-through routing. This paper describes both the architecture …

Power aware scheduling of bag-of-tasks applications with deadline constraints on DVS-enabled clusters

KH Kim, R Buyya, J Kim - … on Cluster Computing and the Grid …, 2007 - ieeexplore.ieee.org
Power-aware scheduling problem has been a recent issue in cluster systems not only for
operational cost due to electricity cost, but also for system reliability. As recent commodity …