Communication lower bounds and optimal algorithms for numerical linear algebra
The traditional metric for the efficiency of a numerical algorithm has been the number of
arithmetic operations it performs. Technological trends have long been reducing the time to …
Mesh-tensorflow: Deep learning for supercomputers
Batch-splitting (data-parallelism) is the dominant distributed Deep Neural Network
(DNN) training strategy, due to its universal applicability and its amenability to Single …
Scalable computing
WF McColl - Computer Science Today: Recent Trends and …, 2005 - Springer
Scalable computing will, over the next few years, become the normal form of computing. In
this paper we present a unified framework, based on the BSP model, which aims to serve as …
A bridging model for parallel computation
LG Valiant - Communications of the ACM, 1990 - dl.acm.org
The success of the von Neumann model of sequential computation is attributable to the fact
that it is an efficient bridge between software and hardware: high-level languages can be …
[BOOK][B] Parallel algorithms
J JáJá - 1992 - users.cs.utah.edu
The purpose of this chapter is to introduce several parallel models and to specify a suitable
framework for presenting and analyzing parallel algorithms. A commonly accepted model for …
LogP: Towards a realistic model of parallel computation
D Culler, R Karp, D Patterson, A Sahay… - Proceedings of the …, 1993 - dl.acm.org
A vast body of theoretical research has focused either on overly simplistic models of parallel
computation, notably the PRAM, or overly specific models that have few representatives in …
[BOOK][B] Parallel programming
T Rauber, G Rünger - 2013 - Springer
Innovations in hardware architecture, such as hyper-threading or multicore processors,
make parallel computing resources available for computer systems in different areas …
LogGP: Incorporating long messages into the LogP model—one step closer towards a realistic model for parallel computation
A Alexandrov, MF Ionescu, KE Schauser… - Proceedings of the …, 1995 - dl.acm.org
We present a new model of parallel computation—the LogGP model—and use it to analyze
a number of algorithms, most notably, the single node scatter (one-to-all personalized …
Communication-optimal parallel 2.5D matrix multiplication and LU factorization algorithms
E Solomonik, J Demmel - European Conference on Parallel Processing, 2011 - Springer
Extra memory allows parallel matrix multiplication to be done with asymptotically less
communication than Cannon's algorithm and be faster in practice. “3D” algorithms arrange …
A massively parallel tensor contraction framework for coupled-cluster computations
Precise calculation of molecular electronic wavefunctions by methods such as coupled-
cluster requires the computation of tensor contractions, the cost of which has polynomial …