There's plenty of room at the Top: What will drive computer performance after Moore's law?
Background: Improvements in computing power can claim a large share of the credit for
many of the things that we take for granted in our modern lives: cellphones that are more …
External memory algorithms and data structures: Dealing with massive data
JS Vitter - ACM Computing Surveys (CSUR), 2001 - dl.acm.org
Data sets in large applications are often too massive to fit completely inside the computer's
internal memory. The resulting input/output communication (or I/O) between fast internal …
FlexGen: High-throughput generative inference of large language models with a single GPU
The high computational and memory requirements of large language model (LLM) inference
make it feasible only with multiple high-end accelerators. Motivated by the emerging …
Data movement is all you need: A case study on optimizing transformers
Transformers are one of the most important machine learning workloads today. Training one
is a very compute-intensive task, often taking days or weeks, and significant attention has …
[BOOK][B] Applied numerical linear algebra
JW Demmel - 1997 - SIAM
This textbook covers both direct and iterative methods for the solution of linear systems, least
squares problems, eigenproblems, and the singular value decomposition. Earlier versions …
[BOOK][B] Why systolic architecture?
HT Kung - 1982 - eecs.harvard.edu
Roughly, the cycle for developing a special-purpose system can be divided into three
phases: task definition, design, and implementation. During task definition, some system …
The input/output complexity of sorting and related problems
A Aggarwal, JS Vitter - Communications of the ACM, 1988 - dl.acm.org
We provide tight upper and lower bounds, up to a constant factor, for the number of inputs
and outputs (I/Os) between internal memory and secondary storage required for five sorting …
Cache-oblivious algorithms
M Frigo, CE Leiserson, H Prokop… - … on Foundations of …, 1999 - ieeexplore.ieee.org
This paper presents asymptotically optimal algorithms for rectangular matrix transpose, FFT,
and sorting on computers with multiple levels of caching. Unlike previous optimal algorithms …
[BOOK][B] Space-filling curves: an introduction with applications in scientific computing
M Bader - 2012 - books.google.com
The present book provides an introduction to using space-filling curves (SFC) as tools in
scientific computing. Special focus is laid on the representation of SFC and on resulting …
[PDF][PDF] The cache performance and optimizations of blocked algorithms
MD Lam, EE Rothberg, ME Wolf - ACM SIGOPS Operating Systems …, 1991 - dl.acm.org
Blocking is a well-known optimization technique for improving the effectiveness of memory
hierarchies. Instead of operating on entire rows or columns of an array, blocked algorithms …