Pathology image classification plays an important role in cancer diagnosis and precision treatment. Convolutional neural networks (CNNs) have been widely employed in pathology image classification. Thanks to their convolution and pooling operations, they have a great advantage in extracting local features of small objects in images. However, they lack the ability to capture the global contextual information contained in long-range tissue structures in pathology images. The Transformer, with its innate global self-attention mechanism, has achieved remarkable performance on large-scale datasets, but its localization ability is limited by insufficient low-level detail. In this paper, we propose a Transformer-based method that combines the advantages of CNNs and Transformers for pathology image classification. The CNN focuses on extracting the local information of small objects, while the Transformer digs out the global contextual information implied in long-range tissue structures. The sparse interaction and weight sharing inherited from the CNN also allow the proposed method to be trained on small datasets. Experiments show that the proposed method achieves accuracies of 90.48% and 97.18% on the PCam and NCT-CRC datasets, respectively, outperforming existing state-of-the-art methods.
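The division of labor described above can be illustrated with a minimal NumPy sketch: a convolution stage extracts local texture features, and a single-head self-attention stage (with identity query/key/value projections, for brevity) then lets every patch token attend to every other, supplying global context. This is only a toy illustration of the hybrid idea, not the paper's actual architecture; the kernel, patch size, and image size below are arbitrary choices for demonstration.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def conv2d(image, kernel):
    """Valid 2-D cross-correlation: the CNN stage, capturing local features."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def self_attention(tokens):
    """Single-head self-attention with identity projections: every token
    attends to every other token, mixing in global context."""
    d = tokens.shape[-1]
    scores = tokens @ tokens.T / np.sqrt(d)   # pairwise similarity
    return softmax(scores, axis=-1) @ tokens  # attention-weighted mix

def hybrid_features(image, kernel, patch=4):
    """CNN stage for local features, then attention over patch tokens."""
    fmap = conv2d(image, kernel)
    h, w = fmap.shape
    h, w = h - h % patch, w - w % patch       # crop to a multiple of patch
    tokens = (fmap[:h, :w]
              .reshape(h // patch, patch, w // patch, patch)
              .transpose(0, 2, 1, 3)
              .reshape(-1, patch * patch))    # one token per patch
    return self_attention(tokens)

rng = np.random.default_rng(0)
img = rng.standard_normal((17, 17))           # toy "pathology image"
kern = rng.standard_normal((3, 3))            # toy learned filter
feats = hybrid_features(img, kern)
print(feats.shape)                            # (num_patches, patch*patch)
```

Each output row is a patch descriptor that blends the patch's own convolutional features with information from all other patches, which is precisely the local-plus-global combination the abstract argues for.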