An effective implementation of Strassen's algorithm using AVX intrinsics for a multicore architecture
NZ Oo, P Chaikan
Songklanakarin Journal of Science & Technology, 2020 • thaiscience.info
Abstract
This paper proposes an effective implementation of Strassen’s algorithm with AVX intrinsics to augment matrix-matrix multiplication in a multicore system. AVX-2 and FMA3 intrinsic functions are utilized, along with OpenMP, to implement the multiplication kernel of Strassen’s algorithm. Loop tiling and unrolling techniques are also utilized to increase the cache utilization. A systematic method is proposed for determining the best stop condition for the recursion to achieve maximum performance on specific matrix sizes. In addition, an analysis method makes fine-tuning possible when our algorithm is adapted to another machine with a different hardware configuration. Performance comparisons between our algorithm and the latest versions of two well-known open-source libraries have been carried out. Our algorithm is, on average, 1.52 and 1.87 times faster than the Eigen and the OpenBLAS libraries, respectively, and can be scaled efficiently when the matrix becomes larger.
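To make the role of the recursion stop condition concrete, the following is a minimal sketch of Strassen's recursion falling back to a loop-tiled base kernel. It is not the authors' implementation: the paper's kernel uses AVX2/FMA3 intrinsics and OpenMP, whereas this sketch is plain, single-threaded Python. The names `strassen`, `tiled_matmul`, `cutoff`, and `tile` are hypothetical, the default values are arbitrary placeholders rather than tuned results, and the code assumes square matrices (odd sizes simply fall back to the base kernel).

```python
from typing import List, Tuple

Matrix = List[List[float]]


def add(a: Matrix, b: Matrix) -> Matrix:
    """Element-wise sum of two equally sized matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def sub(a: Matrix, b: Matrix) -> Matrix:
    """Element-wise difference of two equally sized matrices."""
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def tiled_matmul(a: Matrix, b: Matrix, tile: int = 32) -> Matrix:
    """Base kernel: classic O(n^3) product, visited tile by tile.

    Assumption: stands in for the paper's AVX2/FMA3 kernel; `tile` is
    a hypothetical block size, not a value taken from the paper.
    """
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, tile):
        for kk in range(0, n, tile):
            for jj in range(0, n, tile):
                for i in range(ii, min(ii + tile, n)):
                    row_a, row_c = a[i], c[i]
                    for k in range(kk, min(kk + tile, n)):
                        aik, row_b = row_a[k], b[k]
                        for j in range(jj, min(jj + tile, n)):
                            row_c[j] += aik * row_b[j]
    return c


def split(m: Matrix) -> Tuple[Matrix, Matrix, Matrix, Matrix]:
    """Split an even-sized square matrix into four quadrants."""
    h = len(m) // 2
    return ([r[:h] for r in m[:h]], [r[h:] for r in m[:h]],
            [r[:h] for r in m[h:]], [r[h:] for r in m[h:]])


def strassen(a: Matrix, b: Matrix, cutoff: int = 64) -> Matrix:
    """Strassen product; recursion stops once n <= cutoff or n is odd."""
    n = len(a)
    if n <= cutoff or n % 2:
        return tiled_matmul(a, b)
    a11, a12, a21, a22 = split(a)
    b11, b12, b21, b22 = split(b)
    # Seven recursive products instead of the eight of the naive split.
    m1 = strassen(add(a11, a22), add(b11, b22), cutoff)
    m2 = strassen(add(a21, a22), b11, cutoff)
    m3 = strassen(a11, sub(b12, b22), cutoff)
    m4 = strassen(a22, sub(b21, b11), cutoff)
    m5 = strassen(add(a11, a12), b22, cutoff)
    m6 = strassen(sub(a21, a11), add(b11, b12), cutoff)
    m7 = strassen(sub(a12, a22), add(b21, b22), cutoff)
    c11 = add(sub(add(m1, m4), m5), m7)
    c12 = add(m3, m5)
    c21 = add(m2, m4)
    c22 = add(add(sub(m1, m2), m3), m6)
    # Stitch the four quadrants back into one matrix.
    top = [r1 + r2 for r1, r2 in zip(c11, c12)]
    bottom = [r1 + r2 for r1, r2 in zip(c21, c22)]
    return top + bottom
```

A larger `cutoff` hands more of the work to the base kernel, while a smaller one adds recursion levels and the extra matrix additions that come with them. The best value depends on the machine and the matrix size, which is why the abstract treats the stop condition as something to be determined systematically.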