Communication lower bounds and optimal algorithms for numerical linear algebra

G Ballard, E Carson, J Demmel, M Hoemmen… - Acta Numerica, 2014 - cambridge.org
The traditional metric for the efficiency of a numerical algorithm has been the number of
arithmetic operations it performs. Technological trends have long been reducing the time to …

Mesh-TensorFlow: Deep learning for supercomputers

N Shazeer, Y Cheng, N Parmar… - Advances in neural …, 2018 - proceedings.neurips.cc
Batch-splitting (data-parallelism) is the dominant distributed Deep Neural Network
(DNN) training strategy, due to its universal applicability and its amenability to Single …

Scalable computing

WF McColl - Computer Science Today: Recent Trends and …, 2005 - Springer
Scalable computing will, over the next few years, become the normal form of computing. In
this paper we present a unified framework, based on the BSP model, which aims to serve as …

A bridging model for parallel computation

LG Valiant - Communications of the ACM, 1990 - dl.acm.org
The success of the von Neumann model of sequential computation is attributable to the fact
that it is an efficient bridge between software and hardware: high-level languages can be …

[BOOK][B] Parallel algorithms

J JáJá - 1992 - users.cs.utah.edu
The purpose of this chapter is to introduce several parallel models and to specify a suitable
framework for presenting and analyzing parallel algorithms. A commonly accepted model for …

LogP: Towards a realistic model of parallel computation

D Culler, R Karp, D Patterson, A Sahay… - Proceedings of the …, 1993 - dl.acm.org
A vast body of theoretical research has focused either on overly simplistic models of parallel
computation, notably the PRAM, or overly specific models that have few representatives in …

[BOOK][B] Parallel programming

T Rauber, G Rünger - 2013 - Springer
Innovations in hardware architecture, such as hyper-threading or multicore processors,
make parallel computing resources available for computer systems in different areas …

LogGP: Incorporating long messages into the LogP model—one step closer towards a realistic model for parallel computation

A Alexandrov, MF Ionescu, KE Schauser… - Proceedings of the …, 1995 - dl.acm.org
We present a new model of parallel computation—the LogGP model—and use it to analyze
a number of algorithms, most notably, the single node scatter (one-to-all personalized …

Communication-optimal parallel 2.5D matrix multiplication and LU factorization algorithms

E Solomonik, J Demmel - European Conference on Parallel Processing, 2011 - Springer
Extra memory allows parallel matrix multiplication to be done with asymptotically less
communication than Cannon's algorithm and be faster in practice. “3D” algorithms arrange …

A massively parallel tensor contraction framework for coupled-cluster computations

E Solomonik, D Matthews, JR Hammond… - Journal of Parallel and …, 2014 - Elsevier
Precise calculation of molecular electronic wavefunctions by methods such as coupled-
cluster requires the computation of tensor contractions, the cost of which has polynomial …