Analyzing and leveraging decoupled L1 caches in GPUs
MA Ibrahim, O Kayiran, Y Eckert… - … Symposium on High …, 2021 - ieeexplore.ieee.org
Graphics Processing Units (GPUs) use caches to provide on-chip bandwidth as a way to
address the memory wall. However, they are not always efficiently utilized for optimal GPU …
Cross-core Data Sharing for Energy-efficient GPUs
Graphics Processing Units (GPUs) are the accelerator of choice in a variety of application
domains, because they can accelerate massively parallel workloads and can be easily …
Equity2vec: End-to-end deep learning framework for cross-sectional asset pricing
Pricing assets has attracted significant attention from the financial technology community.
We observe that the existing solutions overlook the cross-sectional effects and do not fully …
Valkyrie: Leveraging inter-TLB locality to enhance GPU performance
Programming on a GPU has been made considerably easier with the introduction of Virtual
Memory features, which support common pointer-based semantics between the CPU and …
Analyzing and leveraging shared L1 caches in GPUs
Graphics Processing Units (GPUs) concurrently execute thousands of threads, which makes
them effective for achieving high throughput for a wide range of applications. However, the …
MemHC: an optimized GPU memory management framework for accelerating many-body correlation
The many-body correlation function is a fundamental computation kernel in modern physics
computing applications, e.g., Hadron Contractions in Lattice quantum chromodynamics …
Locality-aware optimizations for improving remote memory latency in multi-GPU systems
With generational gains from transistor scaling, GPUs have been able to accelerate
traditional computation-intensive workloads. But with the obsolescence of Moore's Law …
Efficient nearest-neighbor data sharing in GPUs
N Nematollahi, M Sadrosadati, H Falahati… - ACM Transactions on …, 2020 - dl.acm.org
Stencil codes (aka nearest-neighbor computations) are widely used in image processing,
machine learning, and scientific applications. Stencil codes incur nearest-neighbor data …
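For context on the access pattern this entry refers to: in a stencil code each thread reads its element's immediate neighbours, so threads at the edges of one thread block re-load data that adjacent blocks (often resident on other SMs) also fetch. The following is a minimal illustrative CUDA sketch of a 1D 3-point stencil, not the paper's mechanism; the array names, sizes, and launch configuration are assumptions.

    __global__ void stencil1d(const float *in, float *out, int n) {
        // Each thread averages element i with its left and right neighbours.
        // Neighbour loads at block boundaries touch data that adjacent thread
        // blocks also read -- the nearest-neighbor sharing discussed above.
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i > 0 && i < n - 1) {
            out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0f;
        }
    }

    // Example launch (placeholder sizes):
    // stencil1d<<<(n + 255) / 256, 256>>>(d_in, d_out, n);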
Colab: Collaborative and efficient processing of replicated cache requests in GPU
In this work, we aim to capture replicated cache requests between Stream Multiprocessors
(SMs) within an SM cluster to alleviate the Network-on-Chip (NoC) congestion problem of …
Rosella: A self-driving distributed scheduler for heterogeneous clusters
Large-scale interactive web services and advanced AI applications make sophisticated
decisions in real-time, based on executing a massive amount of computation tasks on …