Remote sensing image classification has become an attractive research topic and is widely used in various tasks. Unlike natural images, however, remote sensing images are usually high-resolution, consisting of complex backgrounds and a variety of randomly arranged targets. They also typically exhibit large intra-class differences and high inter-class similarity, which make it difficult for a network to focus on the main objects in the scene and lead to unsatisfactory classification performance. In this paper, we propose the Triple Attention Vision Transformer (TaViT) network to address this problem. The backbone of TaViT is composed of two branches: the Dual Attention Vision Transformer (DaViT) and the Branch Attention Module (BA Module). DaViT uses the self-attention of “spatial tokens” and “channel tokens” to capture global context features, while the BA Module aggregates the features of each channel along the horizontal and vertical directions, respectively, continuously providing rich local details to the network. Experimental results on three publicly available remote sensing datasets demonstrate that our TaViT achieves state-of-the-art performance.
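To make the directional aggregation idea concrete, the following is a minimal NumPy sketch of channel-wise pooling along the horizontal and vertical directions, in the spirit of the BA Module described above. The function name `branch_attention`, the use of mean pooling, and the sigmoid gating are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def branch_attention(x):
    """Toy directional reweighting for a feature map x of shape (C, H, W):
    pool each channel along the width (horizontal) and height (vertical)
    axes, gate the pooled profiles with a sigmoid, and use their outer
    product as a per-position attention map. Illustrative only."""
    h_profile = x.mean(axis=2)  # (C, H): aggregate along the horizontal direction
    v_profile = x.mean(axis=1)  # (C, W): aggregate along the vertical direction
    # Broadcast the two directional profiles back to a (C, H, W) attention map.
    attn = sigmoid(h_profile)[:, :, None] * sigmoid(v_profile)[:, None, :]
    return x * attn  # reweighted features, same shape as the input

# Example: a 2-channel 3x4 feature map
x = np.arange(24, dtype=float).reshape(2, 3, 4)
y = branch_attention(x)
print(y.shape)  # (2, 3, 4)
```

Because the attention map is the product of two sigmoid-gated profiles, it lies in (0, 1) at every position, so the module rescales rather than replaces the input features, preserving local detail while emphasizing rows and columns with strong responses.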