This paper proposes a novel hybrid model that combines CNN feature extraction with a Transformer to address the information loss that affects both the Vision Transformer and conventional convolutional networks in image processing. The study first segments potato leaf images with the GrabCut algorithm, then replaces the first layer of the original Vision Transformer model (a $16 \times 16$ convolution with a stride of 16) with a ConvNeXt backbone for feature extraction, and finally applies a transposed convolution to resize the extracted feature map to the input size expected by the Transformer Encoder. The CBAM attention mechanism is also introduced to sharpen the model's focus on key image regions. The improved model achieves a recognition accuracy of 97.29%, which is 4.22 percentage points higher than before the improvement, and it shows good robustness and adaptability.
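The stem replacement described above can be sketched as follows. This is a minimal, hypothetical PyTorch illustration, not the authors' implementation: the `ConvStem` class, its channel widths, and the layer counts are assumptions; it only shows the idea of a convolutional feature extractor followed by a transposed convolution that restores the $14 \times 14$ token grid a stride-16 patch embedding would produce for a $224 \times 224$ input.

```python
import torch
import torch.nn as nn

class ConvStem(nn.Module):
    """Hypothetical sketch: a conv feature extractor replacing the ViT
    patch-embedding layer, with a transposed conv to recover the token grid."""

    def __init__(self, embed_dim=768):
        super().__init__()
        # Simplified ConvNeXt-style downsampling stages: 224 -> 56 -> 28 -> 14 -> 7.
        # The real ConvNeXt uses depthwise blocks; plain convs suffice for the sketch.
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=4, stride=4),            # 224 -> 56
            nn.GELU(),
            nn.Conv2d(96, 192, kernel_size=2, stride=2),          # 56 -> 28
            nn.GELU(),
            nn.Conv2d(192, 384, kernel_size=2, stride=2),         # 28 -> 14
            nn.GELU(),
            nn.Conv2d(384, embed_dim, kernel_size=2, stride=2),   # 14 -> 7
        )
        # Transposed conv upsamples 7x7 -> 14x14, matching the 196-token
        # sequence a 16x16, stride-16 patch embedding would yield at 224x224.
        self.upsample = nn.ConvTranspose2d(embed_dim, embed_dim,
                                           kernel_size=2, stride=2)

    def forward(self, x):
        x = self.upsample(self.features(x))   # (B, embed_dim, 14, 14)
        return x.flatten(2).transpose(1, 2)   # (B, 196, embed_dim) tokens

tokens = ConvStem()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 768])
```

The resulting `(B, 196, 768)` sequence can then be fed to a standard Transformer Encoder in place of the usual patch embeddings.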