Snake: A variable-length chain-based prefetching for GPUs
Graphics Processing Units (GPUs) utilize memory hierarchy and Thread-Level Parallelism
(TLP) to tolerate off-chip memory latency, which is a significant bottleneck for memory-bound …
OSM: Off-chip shared memory for GPUs
S Darabi, E Yousefzadeh-Asl-Miandoab… - … on Parallel and …, 2022 - ieeexplore.ieee.org
Graphics Processing Units (GPUs) employ a shared memory, a software-managed cache for
programmers, in each streaming multiprocessor to accelerate data sharing among the …
CV32RT: Enabling Fast Interrupt and Context Switching for RISC-V Microcontrollers
R Balas, A Ottaviano, L Benini - IEEE Transactions on Very …, 2024 - ieeexplore.ieee.org
Processors using the open RISC-V instruction set architecture (ISA) are finding increasing
adoption in the embedded world. Many embedded use cases have real-time constraints and …
Adaptable register file organization for vector processors
Contemporary Vector Processors (VPs) are designed either for short vector lengths, e.g.,
Fujitsu A64FX with 512-bit ARM SVE vector support, or long vectors, e.g., NEC Aurora …
Cross-core data sharing for energy-efficient GPUs
Graphics Processing Units (GPUs) are the accelerator of choice in a variety of application
domains, because they can accelerate massively parallel workloads and can be easily …
Investigating Register Cache Behavior: Implications for CUDA and Tensor Core Workloads on GPUs
V Geraeinejad, Q Qian… - IEEE Journal on Emerging …, 2024 - ieeexplore.ieee.org
GPUs are extensively employed as the primary devices for running a broad spectrum of
applications, covering general-purpose applications as well as Artificial Intelligence (AI) …
PresCount: Effective Register Allocation for Bank Conflict Reduction
Modern processors with large multi-banked register files often rely on hardware solutions to
resolve bank conflicts efficiently. However, these hardware-based methods, while flexible …