Optimization techniques for GPU programming

P Hijma, S Heldens, A Sclocco… - ACM Computing …, 2023 - dl.acm.org
In the past decade, Graphics Processing Units have played an important role in the field of
high-performance computing and they still advance new fields such as IoT, autonomous …

Benchmarking a new paradigm: Experimental analysis and characterization of a real processing-in-memory system

J Gómez-Luna, I El Hajj, I Fernandez… - IEEE …, 2022 - ieeexplore.ieee.org
Many modern workloads, such as neural networks, databases, and graph processing, are
fundamentally memory-bound. For such workloads, the data movement between main …

CSR5: An efficient storage format for cross-platform sparse matrix-vector multiplication

W Liu, B Vinter - Proceedings of the 29th ACM on International …, 2015 - dl.acm.org
Sparse matrix-vector multiplication (SpMV) is a fundamental building block for numerous
applications. In this paper, we propose CSR5 (Compressed Sparse Row 5), a new storage …
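As the name suggests, CSR5 is related to the standard Compressed Sparse Row (CSR) format. For reference, the following is a minimal Python sketch of SpMV over plain CSR, not CSR5 itself. The array names `row_ptr`, `col_idx` and `values` are illustrative. A naive GPU port assigns one thread to each iteration of the outer row loop, which balances load poorly when row lengths vary widely. Formats such as CSR5 aim to address that kind of irregularity.

```python
def csr_spmv(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a sparse matrix A stored in CSR form."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for row in range(n_rows):
        # The nonzeros of this row live in values[row_ptr[row]:row_ptr[row + 1]].
        acc = 0.0
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]
        y[row] = acc
    return y


# A = [[4, 0, 1],
#      [0, 2, 0],
#      [3, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [4.0, 1.0, 2.0, 3.0, 5.0]
print(csr_spmv(row_ptr, col_idx, values, [1.0, 1.0, 1.0]))  # [5.0, 2.0, 8.0]
```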

Scalable GPU graph traversal

D Merrill, M Garland, A Grimshaw - ACM Sigplan Notices, 2012 - dl.acm.org
Breadth-first search (BFS) is a core primitive for graph traversal and a basis for many higher-
level graph analysis algorithms. It is also representative of a class of parallel computations …
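Parallel BFS is commonly organized level by level. The vertices discovered at one depth form the frontier, and the whole frontier is expanded together to produce the next one. The following sequential Python sketch shows only that level-synchronous structure. It is a generic illustration, not Merrill et al.'s GPU implementation, and the adjacency-list input format is an assumption.

```python
def bfs_levels(adj, source):
    """Level-synchronous BFS; returns each vertex's depth (-1 if unreachable)."""
    depth = [-1] * len(adj)
    depth[source] = 0
    frontier = [source]
    level = 0
    while frontier:
        next_frontier = []
        # On a GPU, the vertices of the current frontier would be expanded in parallel.
        for u in frontier:
            for v in adj[u]:
                if depth[v] == -1:
                    depth[v] = level + 1
                    next_frontier.append(v)
        frontier = next_frontier
        level += 1
    return depth


adj = [[1, 2], [3], [3], [4], [], []]  # vertex 5 is unreachable from 0
print(bfs_levels(adj, 0))  # [0, 1, 1, 2, 3, -1]
```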

Benchmarking a new paradigm: An experimental analysis of a real processing-in-memory architecture

J Gómez-Luna, IE Hajj, I Fernandez… - arXiv preprint arXiv …, 2021 - arxiv.org
Many modern workloads, such as neural networks, databases, and graph processing, are
fundamentally memory-bound. For such workloads, the data movement between main …

GPU-accelerated compression and visualization of large-scale vessel trajectories in maritime IoT industries

Y Huang, Y Li, Z Zhang, RW Liu - IEEE Internet of Things …, 2020 - ieeexplore.ieee.org
The automatic identification system (AIS), an automatic vessel-tracking system, has been
widely adopted to perform intelligent traffic management and collision avoidance services in …

Fast tridiagonal solvers on the GPU

Y Zhang, J Cohen, JD Owens - ACM Sigplan Notices, 2010 - dl.acm.org
We study the performance of three parallel algorithms and their hybrid variants for solving
tridiagonal linear systems on a GPU: cyclic reduction (CR), parallel cyclic reduction (PCR) …
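The following Python sketch shows one of the algorithms named above, parallel cyclic reduction (PCR), in its plain form. It omits the hybrid variants the paper studies. Each row is written as a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i]. At every step, each row eliminates its neighbours at distance `stride`, so after about log2(n) steps all rows are decoupled. The sketch assumes the system needs no pivoting, for example a diagonally dominant matrix. The function name is hypothetical.

```python
def pcr_solve(a, b, c, d):
    """Solve a tridiagonal system by parallel cyclic reduction (no pivoting).

    Row i: a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i], with a[0] == c[-1] == 0.
    """
    n = len(b)
    a, b, c, d = list(a), list(b), list(c), list(d)
    stride = 1
    while stride < n:
        na, nb, nc, nd = [0.0] * n, [0.0] * n, [0.0] * n, [0.0] * n
        # Each row reads only the previous arrays, so on a GPU every row
        # could be updated by its own thread.
        for i in range(n):
            lo, hi = i - stride, i + stride
            nb[i], nd[i] = b[i], d[i]
            if lo >= 0:  # eliminate x[lo] using row lo
                k1 = a[i] / b[lo]
                na[i] = -a[lo] * k1
                nb[i] -= c[lo] * k1
                nd[i] -= d[lo] * k1
            if hi < n:  # eliminate x[hi] using row hi
                k2 = c[i] / b[hi]
                nc[i] = -c[hi] * k2
                nb[i] -= a[hi] * k2
                nd[i] -= d[hi] * k2
        a, b, c, d = na, nb, nc, nd
        stride *= 2
    # Every row now reads b[i] * x[i] = d[i].
    return [d[i] / b[i] for i in range(n)]


# Example system built so that the exact solution is x = [1, 2, 3, 4].
x = pcr_solve([0, -1, -1, -1], [2, 2, 2, 2], [-1, -1, -1, 0], [0, 0, 0, 5])
print([round(v, 9) for v in x])  # [1.0, 2.0, 3.0, 4.0]
```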

High performance and scalable radix sorting: A case study of implementing dynamic parallelism for GPU computing

D Merrill, A Grimshaw - Parallel Processing Letters, 2011 - World Scientific
The need to rank and order data is pervasive, and many algorithms are fundamentally
dependent upon sorting and partitioning operations. Prior to this work, GPU stream …
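GPU radix sorts are usually structured around three phases per digit: a digit histogram, a prefix sum over the histogram to find each bucket's output offset, and a stable scatter. The following sequential Python sketch of least-significant-digit (LSD) radix sort shows those phases. It is not the implementation from Merrill and Grimshaw's paper. The 4-bit digit width and the restriction to non-negative integers below 2**32 are assumptions.

```python
def radix_sort(keys, bits=32, radix_bits=4):
    """LSD radix sort of non-negative integers smaller than 2**bits."""
    radix = 1 << radix_bits
    mask = radix - 1
    for shift in range(0, bits, radix_bits):
        # 1. Histogram: count the keys that fall into each digit bucket.
        counts = [0] * radix
        for k in keys:
            counts[(k >> shift) & mask] += 1
        # 2. Exclusive prefix sum: each bucket's starting offset in the output.
        offsets = [0] * radix
        running = 0
        for digit in range(radix):
            offsets[digit] = running
            running += counts[digit]
        # 3. Stable scatter: keys keep their relative order within a bucket.
        out = [0] * len(keys)
        for k in keys:
            digit = (k >> shift) & mask
            out[offsets[digit]] = k
            offsets[digit] += 1
        keys = out
    return keys


print(radix_sort([170, 45, 75, 90, 802, 24, 2, 66]))  # [2, 24, 45, 66, 75, 90, 170, 802]
```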

yaSpMV: Yet another SpMV framework on GPUs

S Yan, C Li, Y Zhang, H Zhou - ACM Sigplan Notices, 2014 - dl.acm.org
SpMV is a key linear algebra algorithm and has been widely used in many important
application domains. As a result, numerous attempts have been made to optimize SpMV on …

[BOOK][B] Efficient parallel scan algorithms for many-core GPUs

S Sengupta, MJ Harris, M Garland, JD Owens - 2011 - api.taylorfrancis.com
We have witnessed a phenomenal increase in computational resources for graphics
processing units (GPUs) over the last few years. The highest performing graphics processors …
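Parallel scan (prefix sum) is a basic building block for GPU algorithms such as the radix sorts and BFS frontier expansion listed above. The following Python sketch shows the well-known work-efficient exclusive scan, which uses an up-sweep (reduce) followed by a down-sweep. It is a generic illustration, not necessarily the variant described in this book. The iterations of each inner loop are independent, so on a GPU they would run as parallel threads. Padding to a power of two with the identity 0 is an assumption made to keep the tree indexing simple.

```python
def blelloch_exclusive_scan(values):
    """Work-efficient exclusive prefix sum (up-sweep, then down-sweep)."""
    n = len(values)
    size = 1
    while size < n:
        size *= 2
    tree = list(values) + [0] * (size - n)  # pad to a power of two with 0
    # Up-sweep: build partial sums in place, as in a reduction tree.
    step = 1
    while step < size:
        for i in range(2 * step - 1, size, 2 * step):
            tree[i] += tree[i - step]
        step *= 2
    # Down-sweep: clear the root, then push prefixes back down the tree.
    tree[size - 1] = 0
    step = size // 2
    while step >= 1:
        for i in range(2 * step - 1, size, 2 * step):
            left = tree[i - step]
            tree[i - step] = tree[i]
            tree[i] += left
        step //= 2
    return tree[:n]


print(blelloch_exclusive_scan([3, 1, 7, 0, 4, 1, 6, 3]))  # [0, 3, 4, 11, 11, 15, 16, 22]
```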