Image captioning, a prominent topic in current artificial intelligence research, enables a computer to interpret image content and generate a corresponding textual description. Although advanced methods extract and fuse rich features for image encoding and build reliable transformer-based networks for cross-modal prediction, image captioning still faces challenges such as redundant and time-consuming features and incomplete information in the generated sentences. To improve the representations learned by the deep networks in the captioning pipeline, we design a novel visual encoding structure that achieves local cross-modal alignment, and its features are further employed for global semantic alignment in our proposed captioning model. Our method has been evaluated on the standard image captioning benchmark and achieves outstanding performance.
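
To make the two alignment notions concrete, the following minimal sketch illustrates one common way such scores can be computed: local cross-modal alignment as word-to-region similarity and global semantic alignment as similarity between pooled image and sentence representations. This is an illustrative assumption only; the feature dimensions, pooling choices, and similarity function here are not taken from the proposed model.

```python
# Illustrative sketch only -- the encoder, decoder, and alignment objectives in the
# proposed model may differ; dimensions, pooling, and cosine similarity are assumptions.
import numpy as np

def cosine(a, b):
    # Pairwise cosine similarity between rows of a and rows of b.
    return (a @ b.T) / (np.linalg.norm(a, axis=-1, keepdims=True)
                        * np.linalg.norm(b, axis=-1, keepdims=True).T + 1e-8)

rng = np.random.default_rng(0)
regions = rng.standard_normal((36, 512))   # 36 visual region features (assumed dim 512)
words   = rng.standard_normal((12, 512))   # 12 word embeddings of a caption (assumed dim 512)

# Local cross-modal alignment: match each caption word to its most similar image region.
local_sim = cosine(words, regions)          # (12, 36) word-region similarity matrix
local_score = local_sim.max(axis=1).mean()  # average best-match similarity per word

# Global semantic alignment: pooled image feature vs. pooled sentence feature.
img_global  = regions.mean(axis=0, keepdims=True)
sent_global = words.mean(axis=0, keepdims=True)
global_score = cosine(sent_global, img_global).item()

print(f"local alignment score:  {local_score:.3f}")
print(f"global alignment score: {global_score:.3f}")
```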