Analyzing and leveraging decoupled L1 caches in GPUs

MA Ibrahim, O Kayiran, Y Eckert… - … Symposium on High …, 2021 - ieeexplore.ieee.org
Graphics Processing Units (GPUs) use caches to provide on-chip bandwidth as a way to
address the memory wall. However, they are not always efficiently utilized for optimal GPU …

Cross-core Data Sharing for Energy-efficient GPUs

H Falahati, M Sadrosadati, Q Xu… - ACM Transactions on …, 2024 - dl.acm.org
Graphics Processing Units (GPUs) are the accelerator of choice in a variety of application
domains, because they can accelerate massively parallel workloads and can be easily …

Equity2vec: End-to-end deep learning framework for cross-sectional asset pricing

Q Wu, CG Brinton, Z Zhang, A Pizzoferrato… - Proceedings of the …, 2021 - dl.acm.org
Pricing assets has attracted significant attention from the financial technology community.
We observe that the existing solutions overlook the cross-sectional effects and do not fully …

Valkyrie: Leveraging inter-TLB locality to enhance GPU performance

T Baruah, Y Sun, SA Mojumder, JL Abellán… - Proceedings of the …, 2020 - dl.acm.org
Programming on a GPU has been made considerably easier with the introduction of Virtual
Memory features, which support common pointer-based semantics between the CPU and …

Analyzing and leveraging shared L1 caches in GPUs

MA Ibrahim, O Kayiran, Y Eckert, GH Loh… - Proceedings of the ACM …, 2020 - dl.acm.org
Graphics Processing Units (GPUs) concurrently execute thousands of threads, which makes
them effective for achieving high throughput for a wide range of applications. However, the …

MemHC: an optimized GPU memory management framework for accelerating many-body correlation

Q Wang, Z Peng, B Ren, J Chen… - ACM Transactions on …, 2022 - dl.acm.org
The many-body correlation function is a fundamental computation kernel in modern physics
computing applications, e.g., Hadron Contractions in Lattice quantum chromodynamics …

Locality-aware optimizations for improving remote memory latency in multi-GPU systems

L Belayneh, H Ye, KY Chen, D Blaauw… - Proceedings of the …, 2022 - dl.acm.org
With generational gains from transistor scaling, GPUs have been able to accelerate
traditional computation-intensive workloads. But with the obsolescence of Moore's Law …

Efficient nearest-neighbor data sharing in GPUs

N Nematollahi, M Sadrosadati, H Falahati… - ACM Transactions on …, 2020 - dl.acm.org
Stencil codes (aka nearest-neighbor computations) are widely used in image processing,
machine learning, and scientific applications. Stencil codes incur nearest-neighbor data …

Colab: Collaborative and efficient processing of replicated cache requests in GPU

BW Cheng, EM Huang, CH Chao, WF Sun… - Proceedings of the 28th …, 2023 - dl.acm.org
In this work, we aim to capture replicated cache requests between Stream Multiprocessors
(SMs) within an SM cluster to alleviate the Network-on-Chip (NoC) congestion problem of …

Rosella: A self-driving distributed scheduler for heterogeneous clusters

Q Wu, Z Liu - 2021 17th International Conference on Mobility …, 2021 - ieeexplore.ieee.org
Large-scale interactive web services and advanced AI applications make sophisticated
decisions in real-time, based on executing a massive amount of computation tasks on …