Optimization techniques for GPU programming
In the past decade, Graphics Processing Units have played an important role in the field of
high-performance computing and they still advance new fields such as IoT, autonomous …
Accelerating binarized neural networks: Comparison of FPGA, CPU, GPU, and ASIC
E Nurvitadhi, D Sheffield, J Sim… - … Conference on Field …, 2016 - ieeexplore.ieee.org
Deep neural networks (DNNs) are widely used in data analytics, since they deliver state-of-
the-art accuracies. Binarized neural networks (BNNs) are recently proposed optimized …
AESPA: Asynchronous Execution Scheme to Exploit Bank-Level Parallelism of Processing-in-Memory
This paper presents an asynchronous execution scheme to leverage the bank-level
parallelism of near-bank processing-in-memory (PIM). We observe that performing memory …
CPU versus GPU: which can perform matrix computation faster—performance comparison for basic linear algebra subprograms
Matrix computing is the core component of machine learning and artificial intelligence. Fast
matrix computations can facilitate many large-scale computational projects greatly. Basic …
Performance and energy consumption of accurate and mixed-precision linear algebra kernels on GPUs
D Mukunoki, T Ogita - Journal of Computational and Applied Mathematics, 2020 - Elsevier
This paper presents the implementation, performance, and energy consumption of accurate
and mixed-precision linear algebra kernels, including inner-product (DOT), dense matrix …
A detection technique for degraded face images
S Hayashi, O Hasegawa - 2006 IEEE Computer Society …, 2006 - ieeexplore.ieee.org
This paper describes a face detection technique that enables detection of extremely small
faces such as 6×6 pixel. This is the first approach to detect very low-resolution faces in the …
Analysis of the Leaky Integrate-and-Fire neuron model for GPU implementation
IE Venetis, A Provata - Journal of Parallel and Distributed Computing, 2022 - Elsevier
Understanding how neurons perform, when they are organized in interacting networks, is a
key to understanding how the brain performs complex functions. Different models that …
Automatic thread-block size adjustment for memory-bound BLAS kernels on GPUs
D Mukunoki, T Imamura… - 2016 IEEE 10th …, 2016 - ieeexplore.ieee.org
The performance of a CUDA kernel often depends on the number of threads per thread-
block (thread-block size), and the optimal configuration differs according to the graphics …
A domain specific language for facilitating automatic parallelization and placement of SDR patterns into heterogeneous computing architectures
LJ Mohapi - 2017 - open.uct.ac.za
This thesis presents a domain-specific language (DSL) for software defined radio (SDR)
which is referred to as OptiSDR. The main objective of OptiSDR is to facilitate the …