Supporting very large models using automatic dataflow graph partitioning

M Wang, C Huang, J Li - … of the Fourteenth EuroSys Conference 2019, 2019 - dl.acm.org
This paper presents Tofu, a system that partitions very large DNN models across multiple
GPU devices to reduce per-GPU memory footprint. Tofu is designed to partition a dataflow …

Generating configurable hardware from parallel patterns

R Prabhakar, D Koeplinger, KJ Brown, HJ Lee… - ACM SIGPLAN …, 2016 - dl.acm.org
In recent years the computing landscape has seen an increasing shift towards specialized
accelerators. Field programmable gate arrays (FPGAs) are particularly promising for the …

Legate NumPy: Accelerated and distributed array computing

M Bauer, M Garland - Proceedings of the International Conference for …, 2019 - dl.acm.org
NumPy is a popular Python library used for performing array-based numerical computations.
The canonical implementation of NumPy used by most programmers runs on a single CPU …

Exploring the hidden dimension in graph processing

M Zhang, Y Wu, K Chen, X Qian, X Li… - 12th USENIX Symposium …, 2016 - usenix.org
Task partitioning of a graph-parallel system is traditionally considered equivalent to the
graph partition problem. Such equivalence exists because the properties associated with …

Have abstraction and eat performance, too: Optimized heterogeneous computing with parallel patterns

KJ Brown, HJ Lee, T Rompf, AK Sujeeth… - Proceedings of the …, 2016 - dl.acm.org
High performance in modern computing platforms requires programs to be parallel,
distributed, and run on heterogeneous hardware. However, programming such architectures …

Automatic optimization of matrix implementations for distributed machine learning and linear algebra

S Luo, D Jankov, B Yuan, C Jermaine - Proceedings of the 2021 …, 2021 - dl.acm.org
Machine learning (ML) computations are often expressed using vectors, matrices, or higher-
dimensional tensors. Such data structures can have many different implementations …

Improving execution concurrency of large-scale matrix multiplication on distributed data-parallel platforms

R Gu, Y Tang, C Tian, H Zhou, G Li… - … on Parallel and …, 2017 - ieeexplore.ieee.org
Matrix multiplication is a dominant but very time-consuming operation in many big data
analytic applications. Thus its performance optimization is an important and fundamental …

Unifying data, model and hybrid parallelism in deep learning via tensor tiling

M Wang, C Huang, J Li - arXiv preprint arXiv:1805.04170, 2018 - arxiv.org
Deep learning systems have become vital tools across many fields, but the increasing model
sizes mean that training must be accelerated to maintain such systems' utility. Current …

HeAT – a distributed and GPU-accelerated tensor framework for data analytics

M Götz, C Debus, D Coquelin, K Krajsek… - … Conference on Big …, 2020 - ieeexplore.ieee.org
To cope with the rapid growth in available data, the efficiency of data analysis and machine
learning libraries has recently received increased attention. Although great advancements …

Chopper: Optimizing data partitioning for in-memory data analytics frameworks

AK Paul, W Zhuang, L Xu, M Li… - 2016 IEEE …, 2016 - ieeexplore.ieee.org
The performance of in-memory based data analytic frameworks such as Spark is
significantly affected by how data is partitioned. This is because the partitioning effectively …