Outside-knowledge Visual Question Answering (VQA) is a challenging and promising task with broad applications. It requires models to accumulate external knowledge, acquire cross-modal scene understanding, and develop reasoning capabilities. A good VQA system can serve as the "eyes" of visually impaired people, enabling them to perceive the world around them, for example by reading road signs, identifying directions, and recognizing objects. However, most existing VQA systems are formulated as classifiers over a fixed answer vocabulary, which limits their ability to handle answers never encountered in the training set. Generative VQA systems, in contrast, are naturally better suited to real-world scenarios, yet research on such methods is still in its early stages. Furthermore, real-world visual question answering often requires extensive outside knowledge, whereas existing methods typically rely on explicit knowledge retrieved from fixed knowledge bases, which often fails to provide sufficient coverage. To address these two issues, this paper proposes a novel VQA system built on encoder-decoder generative models that fuses implicit multimodal knowledge with implicit textual knowledge. An absolute improvement of at least 5.69% over a range of baselines on the OK-VQA dataset verifies the effectiveness of the proposed method.
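To make the generative formulation concrete, the sketch below shows one plausible way such a system could serialize the two implicit knowledge sources alongside the question and let an encoder-decoder model generate a free-form answer. This is a minimal illustration, not the authors' released code: the T5 backbone, the prompt format, and the example caption and knowledge strings are all assumptions, standing in for whatever multimodal and textual knowledge modules the paper actually uses.

```python
# Minimal sketch of generative, knowledge-fused VQA (illustrative only).
# Assumptions: a caption stands in for implicit multimodal knowledge and a
# short statement from a pretrained language model stands in for implicit
# textual knowledge; the paper's actual fusion mechanism may differ.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

def generate_answer(question: str, caption: str, textual_knowledge: str) -> str:
    # Serialize both knowledge sources with the question so the encoder
    # attends over them jointly; the decoder then generates an open-ended
    # answer rather than selecting from a fixed label set.
    prompt = (
        f"question: {question} "
        f"image context: {caption} "
        f"knowledge: {textual_knowledge}"
    )
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    output_ids = model.generate(**inputs, max_new_tokens=10, num_beams=4)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# Hypothetical usage: both context strings would come from upstream models
# (e.g., a captioner and a large language model) in a full system.
print(generate_answer(
    question="What season do these trees indicate?",
    caption="bare trees line a snow-covered street",
    textual_knowledge="trees lose their leaves in winter",
))
```

Because the decoder produces tokens rather than a class index, such a system can emit answers that never appeared in the training label set, which is the property the abstract contrasts against classification-style VQA.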