Optimization techniques for GPU programming

P Hijma, S Heldens, A Sclocco… - ACM Computing …, 2023 - dl.acm.org
In the past decade, Graphics Processing Units have played an important role in the field of
high-performance computing and they still advance new fields such as IoT, autonomous …

Accelerating binarized neural networks: Comparison of FPGA, CPU, GPU, and ASIC

E Nurvitadhi, D Sheffield, J Sim… - … Conference on Field …, 2016 - ieeexplore.ieee.org
Deep neural networks (DNNs) are widely used in data analytics, since they deliver state-of-
the-art accuracies. Binarized neural networks (BNNs) are recently proposed optimized …

AESPA: Asynchronous Execution Scheme to Exploit Bank-Level Parallelism of Processing-in-Memory

H Kal, C Yoo, WW Ro - Proceedings of the 56th Annual IEEE/ACM …, 2023 - dl.acm.org
This paper presents an asynchronous execution scheme to leverage the bank-level
parallelism of near-bank processing-in-memory (PIM). We observe that performing memory …

CPU versus GPU: which can perform matrix computation faster—performance comparison for basic linear algebra subprograms

F Li, Y Ye, Z Tian, X Zhang - Neural Computing and Applications, 2019 - Springer
Matrix computing is the core component of machine learning and artificial intelligence. Fast
matrix computations can facilitate many large-scale computational projects greatly. Basic …

[HTML] Performance and energy consumption of accurate and mixed-precision linear algebra kernels on GPUs

D Mukunoki, T Ogita - Journal of Computational and Applied Mathematics, 2020 - Elsevier
This paper presents the implementation, performance, and energy consumption of accurate
and mixed-precision linear algebra kernels, including inner-product (DOT), dense matrix …

A detection technique for degraded face images

S Hayashi, O Hasegawa - 2006 IEEE Computer Society …, 2006 - ieeexplore.ieee.org
This paper describes a face detection technique that enables detection of extremely small
faces such as 6×6 pixel. This is the first approach to detect very low-resolution faces in the …

Analysis of the Leaky Integrate-and-Fire neuron model for GPU implementation

IE Venetis, A Provata - Journal of Parallel and Distributed Computing, 2022 - Elsevier
Understanding how neurons perform, when they are organized in interacting networks, is a
key to understanding how the brain performs complex functions. Different models that …

Automatic thread-block size adjustment for memory-bound BLAS kernels on GPUs

D Mukunoki, T Imamura… - 2016 IEEE 10th …, 2016 - ieeexplore.ieee.org
The performance of a CUDA kernel often depends on the number of threads per thread-
block (thread-block size), and the optimal configuration differs according to the graphics …

A domain specific language for facilitating automatic parallelization and placement of SDR patterns into heterogeneous computing architectures

LJ Mohapi - 2017 - open.uct.ac.za
This thesis presents a domain-specific language (DSL) for software defined radio (SDR)
which is referred to as OptiSDR. The main objective of OptiSDR is to facilitate the …

[CITATION][C] Many-core parallel optimization techniques for level 1 and 2 BLAS functions on SW26010-Pro

胡怡, 陈道琨, 杨超, 刘芳芳, 马文静, 尹万旺, 袁欣辉… - Journal of Software (软件学报), 2022