General matrix multiplication (GEMM) is a key operator in a wide range of fields such as machine learning, scientific computing, and signal processing. In many of these applications, however, the matrices are too small for a single GEMM to make full use of GPU resources. To this end, previous work has batched small GEMMs by designing a CUDA kernel that processes multiple GEMMs simultaneously: each GEMM is divided into multiple tiles, and each tile is processed by one thread block. However, this approach processes only one tile per thread block and allocates the number of thread blocks based on the size of the largest matrix in the batch, leaving some thread blocks and threads idle. In this paper, we propose a strategy for batching small GEMMs that jointly considers several factors, including the number of tiles, the number of thread blocks, and the thread-block size. We process multiple tiles within a single thread block; since different tiles may require different block sizes and block counts, we adaptively select the optimal scheme according to hardware resource occupancy. The proposed strategy improves the performance of batched GEMM by increasing GPU occupancy. Experimental results show that our batching strategy achieves about a 1.7X speedup on average over MAGMA and about a 1.2X performance improvement on average over the state-of-the-art L-gemm.
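The core idea, assigning several tiles (possibly from different problems in the batch) to each thread block rather than one tile per block, can be sketched in CUDA as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: `GemmProblem`, `TileTask`, `batchedTileGemm`, `tasksPerBlock`, and the fixed 16x16 tile are all hypothetical names and choices; a real implementation would build the tile task list on the host and pick the block size and tiles-per-block from an occupancy model.

```cuda
#include <cuda_runtime.h>

// Per-problem descriptor for one small GEMM in the batch (hypothetical layout).
struct GemmProblem {
    const float *A;   // M x K, row-major
    const float *B;   // K x N, row-major
    float       *C;   // M x N, row-major
    int M, N, K;
};

// One unit of work: which problem a tile belongs to and where the tile's
// top-left corner sits in that problem's C matrix (hypothetical descriptor).
struct TileTask {
    int problem;      // index into the problem array
    int row0, col0;   // tile origin in C
};

constexpr int TILE = 16;  // tile edge; a tunable parameter in practice

// Each thread block loops over several TileTasks instead of handling a
// single tile, so no block sits idle when matrix sizes vary across the batch.
__global__ void batchedTileGemm(const GemmProblem *problems,
                                const TileTask *tasks, int numTasks,
                                int tasksPerBlock)
{
    int first = blockIdx.x * tasksPerBlock;
    int last  = min(first + tasksPerBlock, numTasks);  // uniform per block
    for (int t = first; t < last; ++t) {
        const TileTask    task = tasks[t];
        const GemmProblem p    = problems[task.problem];

        int row = task.row0 + threadIdx.y;  // this thread's element of C
        int col = task.col0 + threadIdx.x;

        __shared__ float As[TILE][TILE];
        __shared__ float Bs[TILE][TILE];

        float acc = 0.0f;
        for (int k0 = 0; k0 < p.K; k0 += TILE) {
            // Stage one TILE x TILE slice of A and B into shared memory,
            // zero-padding out-of-range elements so ragged sizes stay correct.
            int ka = k0 + threadIdx.x, kb = k0 + threadIdx.y;
            As[threadIdx.y][threadIdx.x] =
                (row < p.M && ka < p.K) ? p.A[row * p.K + ka] : 0.0f;
            Bs[threadIdx.y][threadIdx.x] =
                (kb < p.K && col < p.N) ? p.B[kb * p.N + col] : 0.0f;
            __syncthreads();

            for (int k = 0; k < TILE; ++k)
                acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();
        }
        if (row < p.M && col < p.N)
            p.C[row * p.N + col] = acc;
    }
}
```

Launched with `dim3(TILE, TILE)` threads per block and `(numTasks + tasksPerBlock - 1) / tasksPerBlock` blocks, every block receives roughly the same amount of work regardless of the largest matrix in the batch, which is the occupancy benefit the abstract describes; choosing `tasksPerBlock` adaptively corresponds to the occupancy-driven scheme selection proposed in the paper.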