The input to a Visual Question Answering (VQA) system is an image and a textual question about the content of that image. The system must understand and process the image in light of the question and retrieve the answer from the image. Since its emergence, VQA has spanned two domains, Natural Language Processing (NLP) and Computer Vision (CV), and is a typical multimodal learning task. When answering questions, people routinely draw on a variety of high-level semantic information, for example about color, type, quantity, or purpose. Although this information is critical for answering VQA questions, it is not directly available from the input data. Therefore, in this paper we feed the high-level semantic information of question intention into the model as external knowledge, where it influences the multimodal information interaction and guides the selection of the most appropriate features. The visual question answering method based on question intention designed in this paper was evaluated on the open VQAv2 dataset, and its accuracy exceeds that of the baseline model.
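To make the idea of intention-conditioned feature selection concrete, the following is a minimal sketch, not the paper's actual architecture: a hypothetical module in which an external question-intention embedding gates the fused image-question features before answer classification. All dimensions, layer choices, and names (e.g. `IntentionGatedFusion`, the 3,129-way answer head commonly used for VQAv2) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class IntentionGatedFusion(nn.Module):
    """Hypothetical sketch: fuse image and question features, then use an
    intention embedding (external knowledge about the question's intent,
    e.g. color / type / quantity / purpose) to gate which fused features
    are passed to the answer classifier."""

    def __init__(self, img_dim=2048, q_dim=768, intent_dim=64,
                 hidden=512, num_answers=3129):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, hidden)
        self.q_proj = nn.Linear(q_dim, hidden)
        # Gate derived from the intention embedding: values in (0, 1)
        # that scale each fused feature, i.e. intention-driven selection.
        self.gate = nn.Sequential(nn.Linear(intent_dim, hidden), nn.Sigmoid())
        self.classifier = nn.Linear(hidden, num_answers)

    def forward(self, img_feat, q_feat, intent_emb):
        # Simple multiplicative fusion of the two modalities.
        fused = torch.tanh(self.img_proj(img_feat)) * torch.tanh(self.q_proj(q_feat))
        # Intention gate re-weights the fused representation.
        gated = fused * self.gate(intent_emb)
        return self.classifier(gated)

# Usage with random tensors standing in for real features.
model = IntentionGatedFusion()
img = torch.randn(4, 2048)       # e.g. pooled CNN/region image features
qst = torch.randn(4, 768)        # e.g. a BERT-style question embedding
intent = torch.randn(4, 64)      # external question-intention embedding
logits = model(img, qst, intent) # (4, 3129) answer scores
```

The design choice illustrated here is that the intention signal does not add new content to the fused representation; it only modulates which multimodal features survive, which is one plausible way to realize the "influence the interaction and select features" behavior described above.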