Summary: ``Expressing scientific computations in terms of BLAS, and in particular the general dense matrix-matrix multiplication (GEMM), is of fundamental importance for obtaining high performance portability across architectures. However, GEMMs for {\it small matrices} (of dimension smaller than 32) are not sufficiently optimized in existing libraries. We consider the computation of many small GEMMs, grouped together in a single {\it batched} routine, and its performance portability across a wide range of computer architectures, including Intel, ARM, and IBM CPUs, the Intel Xeon Phi, and GPUs. These computations often occur in applications like big data analytics, machine learning, high-order finite element methods (FEM), and others. For these cases, we present algorithms and optimization techniques that are specialized for the matrix sizes and architectures of interest. We derive a performance model and show that the new developments can be tuned to obtain performance that is within 90\% of the optimal for any of the architectures of interest. For example, on a V100 GPU for square matrices of size 32, we achieve an execution rate of about 1600 gigaFLOP/s in double-precision arithmetic, which is 95\% of the theoretically derived peak for this computation. We also show that these designs outperform the currently available state-of-the-art implementations, including vendor-tuned math libraries such as Intel MKL and NVIDIA CUBLAS, as well as open-source libraries like OpenBLAS and Eigen.''
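
To make the {\it batched} grouping concrete, the sketch below shows how many independent small GEMMs $C_i = \alpha A_i B_i + \beta C_i$ can be issued through a single call to the standard cuBLAS batched interface, \texttt{cublasDgemmBatched}. This is only an illustrative sketch of the kind of vendor interface the paper compares against, not the paper's own implementation; the batch size, the contiguous allocation scheme, and all variable names here are assumptions made for the example.

\begin{verbatim}
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main() {
    // Illustrative parameters (assumptions, not the paper's benchmark setup):
    const int n = 32;        // each matrix is 32 x 32, double precision
    const int batch = 1000;  // number of independent small GEMMs
    const double alpha = 1.0, beta = 0.0;

    // One contiguous device buffer per operand; matrix i starts at offset i*n*n.
    double *dA, *dB, *dC;
    cudaMalloc((void**)&dA, sizeof(double) * n * n * batch);
    cudaMalloc((void**)&dB, sizeof(double) * n * n * batch);
    cudaMalloc((void**)&dC, sizeof(double) * n * n * batch);
    // (Real code would also initialize dA, dB, dC and check return codes.)

    // The batched interface takes device-resident arrays of per-matrix pointers.
    std::vector<const double*> hA(batch), hB(batch);
    std::vector<double*> hC(batch);
    for (int i = 0; i < batch; ++i) {
        hA[i] = dA + (size_t)i * n * n;
        hB[i] = dB + (size_t)i * n * n;
        hC[i] = dC + (size_t)i * n * n;
    }
    const double **dAarr;
    const double **dBarr;
    double **dCarr;
    cudaMalloc((void**)&dAarr, batch * sizeof(double*));
    cudaMalloc((void**)&dBarr, batch * sizeof(double*));
    cudaMalloc((void**)&dCarr, batch * sizeof(double*));
    cudaMemcpy(dAarr, hA.data(), batch * sizeof(double*), cudaMemcpyHostToDevice);
    cudaMemcpy(dBarr, hB.data(), batch * sizeof(double*), cudaMemcpyHostToDevice);
    cudaMemcpy(dCarr, hC.data(), batch * sizeof(double*), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);

    // One call performs all `batch` multiplications C_i = alpha*A_i*B_i + beta*C_i.
    cublasDgemmBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                       n, n, n, &alpha,
                       dAarr, n, dBarr, n, &beta,
                       dCarr, n, batch);
    cudaDeviceSynchronize();
    printf("issued %d GEMMs of size %dx%d in one batched call\n", batch, n, n);

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    cudaFree(dAarr); cudaFree(dBarr); cudaFree(dCarr);
    return 0;
}
\end{verbatim}

Grouping the multiplications this way amortizes launch and scheduling overheads that would dominate if each $32 \times 32$ GEMM were issued individually; this is precisely the regime targeted by the size- and architecture-specialized kernels the abstract describes.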