The proliferation of videos captured by sensor-based cameras has driven the development of human action recognition (HAR). As a fundamental video task in human–computer interaction devices, HAR aims to identify human actions in video clips, where lightweight networks are crucial. In this field, convolutional neural networks (CNNs) and transformers have shown great potential for feature representation in Euclidean space, but they ignore the more flexible structure of non-Euclidean manifolds. To address this issue, we interpret a video as a set of graph nodes and propose a Video Tube Graph network (VT-Grapher) for the action recognition task. As the first lightweight graph neural network (GNN) for RGB-based action recognition, VT-Grapher contains three main components: 1) three spatial–temporal learning strategies that effectively mine the relationships between video visual features and semantics, among which the tube-in-embedding spatial–temporal (TE-ST) strategy achieves the best balance between performance and computational cost; 2) a video tube generation block with a temporal center loss, which generates multi-granularity video tubes based on temporal similarity and pushes apart video tubes with low semantic similarity; and 3) an adversarial self-distillation method that enhances the multi-granularity information aggregation capability of VT-Grapher. The proposed VT-Grapher works in a plug-and-play manner and can be integrated with vision GNNs such as ViG and MobileViG. Extensive experiments on the Mini-Kinetics (Top-1 76.1%), Kinetics-400 (Top-1 73.7%), UCF101 (Acc 94.5%), and multimodal Northwestern-UCLA (N-UCLA) (Top-1 99.7%) datasets demonstrate the effectiveness of VT-Grapher.