Fast implementation of general matrix-vector multiplication (GEMV) on Kepler GPUs

P Hijma, S Heldens, A Sclocco… - ACM Computing …, 2023 - dl.acm.org

In the past decade, Graphics Processing Units have played an important role in the field of
high-performance computing and they still advance new fields such as IoT, autonomous …

Cited by 54 · Related articles · All 3 versions

[PDF] jaewoong.org

Accelerating binarized neural networks: Comparison of FPGA, CPU, GPU, and ASIC

E Nurvitadhi, D Sheffield, J Sim… - … Conference on Field …, 2016 - ieeexplore.ieee.org

Deep neural networks (DNNs) are widely used in data analytics, since they deliver state-of-
the-art accuracies. Binarized neural networks (BNNs) are recently proposed optimized …

Cited by 423 · Related articles · All 5 versions

AESPA: Asynchronous Execution Scheme to Exploit Bank-Level Parallelism of Processing-in-Memory

H Kal, C Yoo, WW Ro - Proceedings of the 56th Annual IEEE/ACM …, 2023 - dl.acm.org

This paper presents an asynchronous execution scheme to leverage the bank-level
parallelism of near-bank processing-in-memory (PIM). We observe that performing memory …

Cited by 2 · Related articles · All 4 versions

CPU versus GPU: which can perform matrix computation faster—performance comparison for basic linear algebra subprograms

F Li, Y Ye, Z Tian, X Zhang - Neural Computing and Applications, 2019 - Springer

Matrix computing is the core component of machine learning and artificial intelligence. Fast
matrix computations can facilitate many large-scale computational projects greatly. Basic …

Cited by 32 · Related articles · All 5 versions

[HTML] sciencedirect.com

[HTML] Performance and energy consumption of accurate and mixed-precision linear algebra kernels on GPUs

D Mukunoki, T Ogita - Journal of Computational and Applied Mathematics, 2020 - Elsevier

This paper presents the implementation, performance, and energy consumption of accurate
and mixed-precision linear algebra kernels, including inner-product (DOT), dense matrix …

Cited by 16 · Related articles · All 5 versions

A detection technique for degraded face images

S Hayashi, O Hasegawa - 2006 IEEE Computer Society …, 2006 - ieeexplore.ieee.org

This paper describes a face detection technique that enables detection of extremely small
faces such as 6×6 pixels. This is the first approach to detect very low-resolution faces in the …

Cited by 69 · Related articles · All 4 versions

Analysis of the Leaky Integrate-and-Fire neuron model for GPU implementation

IE Venetis, A Provata - Journal of Parallel and Distributed Computing, 2022 - Elsevier

Understanding how neurons perform, when they are organized in interacting networks, is a
key to understanding how the brain performs complex functions. Different models that …

Cited by 2 · Related articles · All 2 versions

Automatic thread-block size adjustment for memory-bound BLAS kernels on GPUs

D Mukunoki, T Imamura… - 2016 IEEE 10th …, 2016 - ieeexplore.ieee.org

The performance of a CUDA kernel often depends on the number of threads per thread-
block (thread-block size), and the optimal configuration differs according to the graphics …

Cited by 8 · Related articles · All 2 versions

[PDF] uct.ac.za

A domain specific language for facilitating automatic parallelization and placement of SDR patterns into heterogeneous computing architectures

LJ Mohapi - 2017 - open.uct.ac.za

This thesis presents a domain-specific language (DSL) for software defined radio (SDR)
which is referred to as OptiSDR. The main objective of OptiSDR is to facilitate the …

Cited by 1 · Related articles · All 2 versions

[CITATION][C] Many-core parallel optimization techniques for Level 1 and 2 BLAS functions on SW26010-Pro

胡怡，陈道琨，杨超，刘芳芳，马文静，尹万旺，袁欣辉… - Journal of Software (软件学报), 2022