General matrix multiplication (GEMM) is a key operator in a wide range of fields such as machine learning, scientific computing, and signal processing. In many of these applications, however, the matrices are too small for a single GEMM to make full use of GPU resources. To this end, previous work has batched small GEMMs by designing a CUDA kernel that processes multiple GEMMs simultaneously: each GEMM is divided into multiple tiles, and each tile is processed by one thread block. However, this approach processes only one tile per thread block and allocates the number of thread blocks based on the size of the largest matrix in the batch, leaving some thread blocks and threads idle. In this paper, we propose a strategy for batching small GEMMs that jointly considers several factors, including the number of tiles, the number of thread blocks, and the thread-block size. We process multiple tiles within a single thread block; since different tiles may require different block sizes and block counts, we adaptively select the optimal scheme according to hardware resource occupancy. The proposed strategy improves the performance of batched GEMM by increasing GPU occupancy. Experimental results show that our batching strategy achieves about a 1.7X speedup on average over MAGMA and about a 1.2X performance improvement on average over the state-of-the-art L-gemm.
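The core idea, assigning several tiles (possibly from different problems in the batch) to each thread block rather than one tile per block, can be sketched in CUDA as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: `GemmProblem`, `TileTask`, `batchedTileGemm`, `tasksPerBlock`, and the fixed 16x16 tile are all hypothetical names and choices; a real implementation would build the tile task list on the host and pick the block size and tiles-per-block from an occupancy model.

```cuda
#include <cuda_runtime.h>

// Per-problem descriptor for one small GEMM in the batch (hypothetical layout).
struct GemmProblem {
    const float *A;   // M x K, row-major
    const float *B;   // K x N, row-major
    float       *C;   // M x N, row-major
    int M, N, K;
};

// One unit of work: which problem a tile belongs to and where the tile's
// top-left corner sits in that problem's C matrix (hypothetical descriptor).
struct TileTask {
    int problem;      // index into the problem array
    int row0, col0;   // tile origin in C
};

constexpr int TILE = 16;  // tile edge; a tunable parameter in practice

// Each thread block loops over several TileTasks instead of handling a
// single tile, so no block sits idle when matrix sizes vary across the batch.
__global__ void batchedTileGemm(const GemmProblem *problems,
                                const TileTask *tasks, int numTasks,
                                int tasksPerBlock)
{
    int first = blockIdx.x * tasksPerBlock;
    int last  = min(first + tasksPerBlock, numTasks);  // uniform per block
    for (int t = first; t < last; ++t) {
        const TileTask    task = tasks[t];
        const GemmProblem p    = problems[task.problem];

        int row = task.row0 + threadIdx.y;  // this thread's element of C
        int col = task.col0 + threadIdx.x;

        __shared__ float As[TILE][TILE];
        __shared__ float Bs[TILE][TILE];

        float acc = 0.0f;
        for (int k0 = 0; k0 < p.K; k0 += TILE) {
            // Stage one TILE x TILE slice of A and B into shared memory,
            // zero-padding out-of-range elements so ragged sizes stay correct.
            int ka = k0 + threadIdx.x, kb = k0 + threadIdx.y;
            As[threadIdx.y][threadIdx.x] =
                (row < p.M && ka < p.K) ? p.A[row * p.K + ka] : 0.0f;
            Bs[threadIdx.y][threadIdx.x] =
                (kb < p.K && col < p.N) ? p.B[kb * p.N + col] : 0.0f;
            __syncthreads();

            for (int k = 0; k < TILE; ++k)
                acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();
        }
        if (row < p.M && col < p.N)
            p.C[row * p.N + col] = acc;
    }
}
```

Launched with `dim3(TILE, TILE)` threads per block and `(numTasks + tasksPerBlock - 1) / tasksPerBlock` blocks, every block receives roughly the same amount of work regardless of the largest matrix in the batch, which is the occupancy benefit the abstract describes; choosing `tasksPerBlock` adaptively corresponds to the occupancy-driven scheme selection proposed in the paper.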