Abnormal condition monitoring and fault diagnosis of power transmission and substation equipment are crucial to the safety and stability of the power system, whose normal operation underpins residential, commercial and industrial electricity use. Safety inspection of electrical equipment is therefore a necessary task, and automated inspection based on optical imaging and deep learning is a promising research direction. Traditional manual inspection is labor-intensive, while the previous generation of intelligent detection methods, based on color and texture features, achieve lower accuracy, handle only a single data modality, and depend heavily on data quality. We propose a Vision Transformer (ViT)-based approach to multimodal anomaly detection in power equipment images that takes visible, infrared and polarized-light image pairs as input and improves anomaly detection accuracy within a unified multimodal signal processing framework. This is one of the first applications of Transformer-based anomaly detection algorithms to image anomaly detection for power equipment. This paper presents the improved network model, examines the effect of multimodal input on recognition accuracy, and outlines promising directions for future work in this area.