Scene text image super-resolution (STISR) aims to restore textual details in low-resolution (LR) text images, producing high-resolution (HR) images that align closely with human perception. However, current state-of-the-art methods lack a comprehensive approach to modeling text images, resulting in an unsatisfactory trade-off between improving image quality and enhancing downstream task performance. In this paper, we propose ATSR, a novel STISR framework built on anchor point guidance and an efficient Transformer. We analyze text images comprehensively from three perspectives: local features within characters, character-level features, and image-level global features. Specifically, we propose Anchor Point Guided Pixel Attention (APGPA) to capture the invariance characteristics of characters. To ensure accurate character restoration, we introduce a text recognition operator as a prior constraint on the character features of the generated text image. Additionally, recognizing the importance of global features for both image quality and downstream text recognition, we present the Efficient Text Super-Resolution Transformer (ETSRT), a module that efficiently reconstructs images by leveraging contextual visual and semantic information. Extensive experiments demonstrate that our approach outperforms existing baselines.