Supporting very large models using automatic dataflow graph partitioning
This paper presents Tofu, a system that partitions very large DNN models across multiple
GPU devices to reduce per-GPU memory footprint. Tofu is designed to partition a dataflow …
Generating configurable hardware from parallel patterns
In recent years the computing landscape has seen an increasing shift towards specialized
accelerators. Field programmable gate arrays (FPGAs) are particularly promising for the …
Legate NumPy: Accelerated and distributed array computing
M Bauer, M Garland - Proceedings of the International Conference for …, 2019 - dl.acm.org
NumPy is a popular Python library used for performing array-based numerical computations.
The canonical implementation of NumPy used by most programmers runs on a single CPU …
Exploring the hidden dimension in graph processing
Task partitioning of a graph-parallel system is traditionally considered equivalent to the
graph partition problem. Such equivalence exists because the properties associated with …
Have abstraction and eat performance, too: Optimized heterogeneous computing with parallel patterns
High performance in modern computing platforms requires programs to be parallel,
distributed, and run on heterogeneous hardware. However, programming such architectures …
Automatic optimization of matrix implementations for distributed machine learning and linear algebra
Machine learning (ML) computations are often expressed using vectors, matrices, or higher-
dimensional tensors. Such data structures can have many different implementations …
Improving execution concurrency of large-scale matrix multiplication on distributed data-parallel platforms
R Gu, Y Tang, C Tian, H Zhou, G Li… - … on Parallel and …, 2017 - ieeexplore.ieee.org
Matrix multiplication is a dominant but very time-consuming operation in many big data
analytic applications. Thus its performance optimization is an important and fundamental …
Unifying data, model and hybrid parallelism in deep learning via tensor tiling
Deep learning systems have become vital tools across many fields, but the increasing model
sizes mean that training must be accelerated to maintain such systems' utility. Current …
HeAT – a distributed and GPU-accelerated tensor framework for data analytics
To cope with the rapid growth in available data, the efficiency of data analysis and machine
learning libraries has recently received increased attention. Although great advancements …
Chopper: Optimizing data partitioning for in-memory data analytics frameworks
The performance of in-memory based data analytic frameworks such as Spark is
significantly affected by how data is partitioned. This is because the partitioning effectively …