The key to effective image captioning lies in extracting rich semantic information from the image. However, most existing approaches rely on pre-trained classification models or object detectors for this extraction, which may not fully capture the semantic relationships within the image and can limit captioning performance. To address this issue, we propose a Visual-Linguistic Co-Understanding Network (VLCU-Net) for image captioning, built on the Transformer architecture. Our approach integrates semantic word ranking with richer image semantic understanding in a single framework. Specifically, we first query sentences related to the semantic content of each image and extract semantic words from them with a text-image understanding extractor; in parallel, we infer additional words related to these semantic words. All of the resulting words are fed into a semantic word sorter, which arranges them in a linguistically natural order. Finally, we combine the ordered semantic word sequences with image features to generate captions. Extensive experiments on the COCO benchmark show that our approach outperforms state-of-the-art methods on both automatic metrics and human evaluation.
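To make the described pipeline concrete, the sketch below outlines one plausible way its stages (grounding candidate semantic words in the image, sorting them into a linguistic order, and decoding captions from the fused representation) could be wired together in PyTorch. All module names, dimensions, and the fusion strategy are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of a VLCU-Net-style pipeline. Every component here
# (extractor, sorter, fusion) is a stand-in assumption for illustration.
import torch
import torch.nn as nn

class VLCUNetSketch(nn.Module):
    def __init__(self, vocab_size=10000, d_model=512, n_heads=8, n_layers=3):
        super().__init__()
        self.word_embed = nn.Embedding(vocab_size, d_model)
        # Hypothetical "text-image understanding extractor": cross-attention
        # from candidate semantic words to image region features.
        self.extractor = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Hypothetical "semantic word sorter": a Transformer encoder that
        # contextualizes the grounded words into an ordered sequence.
        sorter_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.sorter = nn.TransformerEncoder(sorter_layer, n_layers)
        # Caption decoder attending over image features + sorted word sequence.
        dec_layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, n_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, image_feats, candidate_word_ids, caption_ids):
        # image_feats:        (B, R, d_model) image region features
        # candidate_word_ids: (B, W) extracted + inferred semantic word ids
        # caption_ids:        (B, T) shifted caption tokens (teacher forcing)
        words = self.word_embed(candidate_word_ids)
        # Ground the candidate words in the image (extraction step).
        grounded, _ = self.extractor(words, image_feats, image_feats)
        # Arrange the grounded words into an ordered semantic sequence.
        ordered = self.sorter(grounded)
        # Fuse ordered semantic words with image features as decoder memory.
        memory = torch.cat([image_feats, ordered], dim=1)
        tgt = self.word_embed(caption_ids)
        # Causal target mask omitted for brevity; training would require it.
        hidden = self.decoder(tgt, memory)
        return self.out(hidden)  # (B, T, vocab_size) caption logits

model = VLCUNetSketch()
logits = model(torch.randn(2, 36, 512),
               torch.randint(0, 10000, (2, 12)),
               torch.randint(0, 10000, (2, 20)))
print(logits.shape)  # torch.Size([2, 20, 10000])
```

The design choice being illustrated is the two-stage treatment of semantic words: they are first grounded against visual features and only then reordered, so the decoder receives a word sequence that is both image-aware and linguistically arranged.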