Evaluation and analysis of rehabilitation training are crucial for patients, yet the subtle changes that occur during training cannot be captured by the naked eye. Previous studies have evaluated the patient's training process through action recognition, which can neither capture these subtle changes nor produce a quantitative assessment. To improve the accuracy and practicality of rehabilitation training assessment, and to better provide patients with training assistance and feedback in a home environment, we propose a Transformer-based network, the Recurrent Spatio-temporal Transformer (RSTformer). Our model extracts patient motion features from ordinary RGB videos for scoring, and it handles variable-length inputs, allowing users to perform multiple training sessions at different speeds. We evaluate our model with MAD, RMS, and MAPE on the KIMORE and UI-PRMD datasets. The experimental results show that the proposed model achieves a 21.8% improvement in rehabilitation training evaluation over previous methods, demonstrating superior performance for rehabilitation training assessment.
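The metrics named above (MAD, RMS, and MAPE) are standard error measures between predicted and ground-truth scores. A minimal sketch of how such metrics are typically computed follows; the function names and the sample score arrays are illustrative assumptions, not values from the paper:

```python
import math

def mad(pred, true):
    # Mean Absolute Deviation between predicted and ground-truth scores
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(pred)

def rms(pred, true):
    # Root Mean Square error
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(pred))

def mape(pred, true):
    # Mean Absolute Percentage Error (ground-truth scores assumed non-zero)
    return 100.0 * sum(abs((p - t) / t) for p, t in zip(pred, true)) / len(pred)

# Hypothetical model scores vs. hypothetical clinician-assigned scores
predicted = [42.0, 38.5, 45.0]
ground_truth = [40.0, 40.0, 44.0]
print(mad(predicted, ground_truth))   # 1.5
print(rms(predicted, ground_truth))
print(mape(predicted, ground_truth))
```

Lower values on all three metrics indicate predictions closer to the ground-truth assessment scores.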