Environmental Sound Classification (ESC) is a challenging task in audio processing because of the wide variety of ambient sounds involved. In this paper, we propose an ESC method based on the CAR-Transformer neural network model. The method comprises three stages: sound sample pre-processing, deep learning-based feature extraction, and classification. We convert the one-dimensional audio signal into two-dimensional Mel Frequency Cepstral Coefficients (MFCC) and use them as the feature map of the audio. The CAR-Transformer model performs feature extraction; after dimensionality reduction of the extracted feature map, a fully connected layer serves as the classifier and produces the final results. The method achieves a classification accuracy of 96.91% on the UrbanSound8K dataset with only 0.16 M model parameters. We compare these results with other state-of-the-art research.
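To make the pre-processing stage concrete, the following is a minimal sketch of how a one-dimensional signal can be turned into a two-dimensional MFCC feature map (frames by coefficients). It is a generic textbook MFCC computation in plain Python, not the authors' implementation: the frame length, hop size, number of mel filters, number of coefficients, and all function names are assumptions chosen for illustration, and the CAR-Transformer model itself is not reproduced here.

```python
import cmath
import math


def hz_to_mel(hz):
    """Convert frequency in Hz to the mel scale (common 2595*log10 form)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def power_spectrum(frame):
    """Naive DFT power spectrum of one windowed frame (bins 0..N/2)."""
    n = len(frame)
    spec = []
    for k in range(n // 2 + 1):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spec.append(abs(acc) ** 2 / n)
    return spec


def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters with centers spaced evenly on the mel scale."""
    top = hz_to_mel(sample_rate / 2.0)
    hz_points = [mel_to_hz(top * i / (n_filters + 1)) for i in range(n_filters + 2)]
    bin_freqs = [k * sample_rate / n_fft for k in range(n_fft // 2 + 1)]
    bank = []
    for m in range(1, n_filters + 1):
        lo, mid, hi = hz_points[m - 1], hz_points[m], hz_points[m + 1]
        # Rising slope up to the center, falling slope after it, clipped at 0.
        bank.append([max(0.0, min((f - lo) / (mid - lo), (hi - f) / (hi - mid)))
                     for f in bin_freqs])
    return bank


def dct_ii(values, n_coeffs):
    """Unnormalized DCT-II, keeping the first n_coeffs coefficients."""
    n = len(values)
    return [sum(v * math.cos(math.pi * c * (i + 0.5) / n) for i, v in enumerate(values))
            for c in range(n_coeffs)]


def mfcc(signal, sample_rate, frame_len=64, hop=32, n_filters=12, n_coeffs=8):
    """Return a 2-D feature map: one row of n_coeffs MFCCs per frame.

    All default parameter values are illustrative assumptions.
    """
    # Hann window to reduce spectral leakage at frame edges.
    window = [0.5 - 0.5 * math.cos(2.0 * math.pi * t / (frame_len - 1))
              for t in range(frame_len)]
    bank = mel_filterbank(n_filters, frame_len, sample_rate)
    features = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + t] * window[t] for t in range(frame_len)]
        spec = power_spectrum(frame)
        energies = [sum(w * p for w, p in zip(row, spec)) for row in bank]
        # Floor the energies so the logarithm is always defined.
        log_energies = [math.log(max(e, 1e-10)) for e in energies]
        features.append(dct_ii(log_energies, n_coeffs))
    return features


# Usage: a synthetic 440 Hz tone stands in for an audio clip.
sample_rate = 8000
tone = [math.sin(2.0 * math.pi * 440.0 * t / sample_rate) for t in range(512)]
feature_map = mfcc(tone, sample_rate)
print(len(feature_map), len(feature_map[0]))  # prints "15 8"
```

In practice, an optimized library routine would replace the naive DFT used here; the sketch only shows the shape of the data that the feature extraction stage receives.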