Supporting very large models using automatic dataflow graph partitioning
This paper presents Tofu, a system that partitions very large DNN models across multiple
GPU devices to reduce per-GPU memory footprint. Tofu is designed to partition a dataflow …
Generating configurable hardware from parallel patterns
In recent years the computing landscape has seen an increasing shift towards specialized
accelerators. Field programmable gate arrays (FPGAs) are particularly promising for the …
Legate NumPy: Accelerated and distributed array computing
M Bauer, M Garland - Proceedings of the International Conference for …, 2019 - dl.acm.org
NumPy is a popular Python library used for performing array-based numerical computations.
The canonical implementation of NumPy used by most programmers runs on a single CPU …
Exploring the hidden dimension in graph processing
Task partitioning of a graph-parallel system is traditionally considered equivalent to the
graph partition problem. Such equivalence exists because the properties associated with …
Have abstraction and eat performance, too: Optimized heterogeneous computing with parallel patterns
High performance in modern computing platforms requires programs to be parallel,
distributed, and run on heterogeneous hardware. However, programming such architectures …
Automatic optimization of matrix implementations for distributed machine learning and linear algebra
Machine learning (ML) computations are often expressed using vectors, matrices, or higher-
dimensional tensors. Such data structures can have many different implementations …
Improving execution concurrency of large-scale matrix multiplication on distributed data-parallel platforms
R Gu, Y Tang, C Tian, H Zhou, G Li… - … on Parallel and …, 2017 - ieeexplore.ieee.org
Matrix multiplication is a dominant but very time-consuming operation in many big data
analytic applications. Thus its performance optimization is an important and fundamental …
Unifying data, model and hybrid parallelism in deep learning via tensor tiling
Deep learning systems have become vital tools across many fields, but the increasing model
sizes mean that training must be accelerated to maintain such systems' utility. Current …
HeAT – a distributed and GPU-accelerated tensor framework for data analytics
To cope with the rapid growth in available data, the efficiency of data analysis and machine
learning libraries has recently received increased attention. Although great advancements …
Chopper: Optimizing data partitioning for in-memory data analytics frameworks
The performance of in-memory based data analytic frameworks such as Spark is
significantly affected by how data is partitioned. This is because the partitioning effectively …