The Transformer is an encoder-decoder model built on the self-attention mechanism. Owing to its novel architecture and remarkable performance, it was first applied in natural language processing. As transformer models have continued to develop in recent years, many researchers have studied how to transfer them to computer vision and have produced many innovative results, most notably in image classification, object detection, and image segmentation. Vision transformers remain an active research topic, and transformer-based vision techniques are widely used in fields such as medicine and education. This article starts from the basics of the transformer model, summarizes its recent achievements and applications, and elaborates on the structure and principles of the Vision Transformer model. It also reviews recent advances in vision transformers with respect to model architecture, parameter count, and computational efficiency. Finally, it summarizes the problems and difficulties of existing vision transformer models and points out directions for future work.
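To make the self-attention mechanism named above concrete, the following is a minimal sketch of single-head scaled dot-product self-attention in NumPy. The function name, the random projection matrices, and the toy dimensions are illustrative assumptions, not the implementation of any particular model described in this article.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention.

    X: token embeddings of shape (n, d_model).
    Wq, Wk, Wv: learned projection matrices (here random, for illustration).
    Returns an (n, d_k) matrix where each row is an attention-weighted
    mixture of the value vectors of all tokens.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv               # project tokens to queries/keys/values
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                # pairwise similarities, scaled by sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)   # stabilize softmax numerically
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # row-wise softmax: rows sum to 1
    return weights @ V                             # weighted sum of value vectors

# Toy usage: 4 tokens with 8-dimensional embeddings.
rng = np.random.default_rng(0)
n, d = 4, 8
X = rng.standard_normal((n, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (4, 8)
```

In a Vision Transformer, the rows of `X` would be flattened image-patch embeddings rather than word embeddings; the attention computation itself is unchanged.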