Still-image-based human action recognition is a highly sought-after yet challenging task in computer vision, and the challenge stems mainly from the limited information available in a single image. Efficient extraction of visual appearance features and other valuable cues from images is therefore crucial for action recognition. To this end, on the one hand, we use a convolutional neural network (CNN) classifier built on the EfficientNetV2-S network as the main pathway for extracting appearance features from images and performing classification. To make the CNN classifier focus on important spatial features, we propose a residual spatial attention module (RSAM) and incorporate it into the classifier. We also leverage transfer learning to improve the training speed and recognition accuracy of the CNN classifier. On the other hand, in the auxiliary pathway, we use the OpenPose algorithm to extract the coordinates of human keypoints and apply a custom-designed network to extract information from, and classify, the obtained keypoints. Finally, we merge the outputs of the two classifiers with a one-dimensional convolution, which automatically learns a weight for each pathway's result and combines the two according to their importance. Experimental results on three challenging datasets, namely Stanford 40 Actions, People Playing Musical Instruments (PPMI), and MPII Human Pose, demonstrate the superiority of the proposed method.
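The learned fusion of the two pathways' scores can be sketched as follows. This is a minimal illustration, not the paper's implementation: the class name `ScoreFusion`, the kernel size of 1, and the use of PyTorch are all assumptions, since the abstract only states that a one-dimensional convolution learns weights for the two classification results and merges them.

```python
import torch
import torch.nn as nn

class ScoreFusion(nn.Module):
    """Hypothetical sketch: merge two per-class score vectors with a
    learnable 1-D convolution. Stacking the two pathway outputs as
    channels and applying a kernel-size-1 Conv1d lets the network learn
    a weight for each pathway, realizing an importance-based merge."""

    def __init__(self):
        super().__init__()
        # 2 input channels (CNN pathway, keypoint pathway) -> 1 fused channel
        self.fuse = nn.Conv1d(in_channels=2, out_channels=1, kernel_size=1)

    def forward(self, cnn_scores, pose_scores):
        # cnn_scores, pose_scores: (batch, num_classes)
        x = torch.stack([cnn_scores, pose_scores], dim=1)  # (batch, 2, num_classes)
        return self.fuse(x).squeeze(1)                     # (batch, num_classes)
```

A kernel size of 1 makes the convolution equivalent to a learned weighted sum of the two score vectors (plus a bias), applied identically across all classes; larger kernels would additionally mix scores of neighboring classes, which is usually undesirable for fusion.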