Analyzing and leveraging decoupled L1 caches in GPUs
MA Ibrahim, O Kayiran, Y Eckert… - … Symposium on High …, 2021 - ieeexplore.ieee.org
Graphics Processing Units (GPUs) use caches to provide on-chip bandwidth as a way to
address the memory wall. However, they are not always efficiently utilized for optimal GPU …
Cross-core Data Sharing for Energy-efficient GPUs
Graphics Processing Units (GPUs) are the accelerator of choice in a variety of application
domains, because they can accelerate massively parallel workloads and can be easily …
Equity2vec: End-to-end deep learning framework for cross-sectional asset pricing
Pricing assets has attracted significant attention from the financial technology community.
We observe that the existing solutions overlook the cross-sectional effects and do not fully …
Valkyrie: Leveraging inter-TLB locality to enhance GPU performance
Programming on a GPU has been made considerably easier with the introduction of Virtual
Memory features, which support common pointer-based semantics between the CPU and …
Analyzing and leveraging shared L1 caches in GPUs
Graphics Processing Units (GPUs) concurrently execute thousands of threads, which makes
them effective for achieving high throughput for a wide range of applications. However, the …
MemHC: an optimized GPU memory management framework for accelerating many-body correlation
The many-body correlation function is a fundamental computation kernel in modern physics
computing applications, e.g., Hadron Contractions in Lattice quantum chromodynamics …
Locality-aware optimizations for improving remote memory latency in multi-GPU systems
With generational gains from transistor scaling, GPUs have been able to accelerate
traditional computation-intensive workloads. But with the obsolescence of Moore's Law …
Efficient nearest-neighbor data sharing in GPUs
N Nematollahi, M Sadrosadati, H Falahati… - ACM Transactions on …, 2020 - dl.acm.org
Stencil codes (aka nearest-neighbor computations) are widely used in image processing,
machine learning, and scientific applications. Stencil codes incur nearest-neighbor data …
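For context on the access pattern this entry refers to: in a stencil code each thread reads its element's immediate neighbours, so threads at the edges of one thread block re-load data that adjacent blocks (often resident on other SMs) also fetch. The following is a minimal illustrative CUDA sketch of a 1D 3-point stencil, not the paper's mechanism; the array names, sizes, and launch configuration are assumptions.

    __global__ void stencil1d(const float *in, float *out, int n) {
        // Each thread averages element i with its left and right neighbours.
        // Neighbour loads at block boundaries touch data that adjacent thread
        // blocks also read -- the nearest-neighbor sharing discussed above.
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i > 0 && i < n - 1) {
            out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0f;
        }
    }

    // Example launch (placeholder sizes):
    // stencil1d<<<(n + 255) / 256, 256>>>(d_in, d_out, n);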
Colab: Collaborative and efficient processing of replicated cache requests in GPU
In this work, we aim to capture replicated cache requests between Stream Multiprocessors
(SMs) within an SM cluster to alleviate the Network-on-Chip (NoC) congestion problem of …
Rosella: A self-driving distributed scheduler for heterogeneous clusters
Large-scale interactive web services and advanced AI applications make sophisticated
decisions in real-time, based on executing a massive amount of computation tasks on …