In this study, we propose a micro-expression recognition framework that combines the Vision Transformer (ViT) network, which is based on the multi-head self-attention mechanism, with a pre-trained model. In this framework, each image in the micro-and-macro expression warehouse (MMEW) dataset is divided into 224 patches, and these patches are then flattened into one-dimensional vectors. Because micro-expressions are difficult to collect, micro-expression datasets are usually small, which makes it difficult to achieve satisfactory results with traditional deep network models. To address this problem, our framework introduces a model (ViT_base_patch16_224) pre-trained on ImageNet. Given that the pre-trained model has a large number of parameters and the MMEW dataset is comparatively small, a freezing strategy is adopted to adapt the model to the MMEW data. To the best of our knowledge, this is the first work to combine the ViT model with a pre-trained model for the micro-expression recognition task. Our preliminary experimental results show that the proposed framework achieves an accuracy of 87% on MMEW after 10 epochs of training, which is the state-of-the-art accuracy on this dataset. Moreover, the accuracy further improves to 93% with 30 epochs of training, reflecting the potential of our framework for micro-expression recognition.
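
The patch-splitting and freezing steps described above can be illustrated with a minimal, library-free sketch. It is not the implementation used in this work. It assumes a single-channel 224 × 224 input and a 16 × 16 patch size, as suggested by the model name ViT_base_patch16_224, in which case the split forms a 14 × 14 grid; the actual patch count depends on the resolution and patch size used. The helper names `patchify` and `freeze_backbone`, the parameter names, and the choice to leave only the classification head trainable are all hypothetical, since the text does not specify which layers are frozen.

```python
def patchify(image, patch_size):
    """Split a 2-D image (list of rows) into flattened square patches, row-major."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be multiples of patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten one patch_size x patch_size block into a 1-D vector.
            patch = [image[r][c]
                     for r in range(top, top + patch_size)
                     for c in range(left, left + patch_size)]
            patches.append(patch)
    return patches


def freeze_backbone(trainable, head_prefix="head"):
    """Return a new name -> trainable map in which only the head stays trainable.

    Assumption: freezing means disabling updates for every parameter outside
    the classification head; the source does not state the exact strategy.
    """
    return {name: name.startswith(head_prefix) for name in trainable}


# Dummy single-channel 224 x 224 image; real inputs would have three channels.
image = [[(r * 224 + c) % 256 for c in range(224)] for r in range(224)]
patches = patchify(image, patch_size=16)
print(len(patches), len(patches[0]))

# Hypothetical parameter names, loosely modeled on a ViT layout.
params = {
    "patch_embed.proj.weight": True,
    "blocks.0.attn.qkv.weight": True,
    "head.weight": True,
    "head.bias": True,
}
frozen = freeze_backbone(params)
print(sorted(name for name, flag in frozen.items() if flag))
```

In a deep learning framework, the same idea would typically be expressed by disabling gradient updates on the backbone parameters, so that only a small number of weights are fitted to the limited MMEW data.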