Online learning has gained wide attention and adoption due to its flexibility and convenience. However, because teachers and students are separated in time and space, teachers cannot easily perceive students' level of engagement, which reduces teaching effectiveness. Automatic detection of student engagement is an effective way to address this problem: it helps teachers obtain timely feedback from students and adjust the teaching schedule accordingly. In this paper, the transformer is applied to engagement recognition for the first time, and a novel network based on an improved video vision transformer (ViViT) is proposed to detect student engagement. A new transformer encoder, named Transformer Encoder with Low Complexity (TELC), is proposed. It adopts unit force operated attention (UFO-attention) to eliminate the nonlinearity of the original self-attention in standard ViViT, and Patch Merger to fuse the input patches, which allows the network to significantly reduce computational complexity while improving performance. The proposed method is evaluated on the Dataset for Affective States in E-learning Environments (DAiSEE) and achieves an accuracy of 63.91% on the four-level classification task, outperforming state-of-the-art methods. The experimental results demonstrate the effectiveness of the proposed method and its suitability for practical online-learning applications.
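To make the two ingredients of TELC concrete, the following is a minimal single-head NumPy sketch of the general ideas behind UFO-attention and Patch Merger, not the authors' implementation: UFO-attention drops the softmax so that K^T V can be computed first (linear rather than quadratic complexity in the number of tokens) and replaces it with a cross-normalization (here a simple L2 normalization with a scale `gamma`, an assumption); Patch Merger fuses N input tokens into a smaller fixed number M of output tokens via a learned scoring matrix `W`. All tensor shapes and the placement of the normalizations are illustrative assumptions.

```python
import numpy as np

def xnorm(x, gamma, axis):
    # Cross-normalization (sketch): L2-normalize along `axis`,
    # scaled by a learnable factor gamma (assumption).
    return gamma * x / np.linalg.norm(x, axis=axis, keepdims=True)

def ufo_attention(Q, K, V, gamma=1.0):
    # UFO-attention (single-head sketch): without softmax, matrix
    # multiplication is associative, so K^T V is computed first.
    # Cost is O(N * d^2) instead of O(N^2 * d) for N tokens of dim d.
    kv = K.T @ V                     # (d, d) token-independent summary
    kv = xnorm(kv, gamma, axis=0)    # normalize in place of softmax (assumption)
    out = Q @ kv                     # (N, d)
    return xnorm(out, gamma, axis=1)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def patch_merger(x, W):
    # Patch Merger (sketch): x is (N, d) tokens, W is a learned (M, d)
    # matrix. Each of the M output tokens is a convex combination of
    # the N input tokens, so the sequence length shrinks from N to M.
    scores = softmax(W @ x.T, axis=-1)  # (M, N), softmax over input tokens
    return scores @ x                   # (M, d) merged tokens

# Tiny shape check with random data standing in for patch embeddings.
rng = np.random.default_rng(0)
Q = rng.standard_normal((16, 8))
K = rng.standard_normal((16, 8))
V = rng.standard_normal((16, 8))
attended = ufo_attention(Q, K, V)       # (16, 8): same shape as input
merged = patch_merger(attended, rng.standard_normal((4, 8)))  # (4, 8)
```

Because `K.T @ V` has shape (d, d) regardless of sequence length, and Patch Merger then fixes the token count at M, stacking such encoders keeps the cost of later layers independent of the original number of video patches.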