In recent years, public opinion events such as “fake news” and “news reversal” have occurred frequently, and spreading rumors through images has become a new form of rumor circulation in the digital age. Most existing methods consider only the textual content, ignoring the information carried by the attached images; when fusing multiple modalities, the complementary information across them is not fully exploited, and the textual and visual information do not interact sufficiently. Therefore, we propose a multi-level image-text fusion method for rumor detection (MLFRD), which effectively captures local and global information about events, strengthens the connection between text and images, and improves rumor detection performance. MLFRD consists of three parts: a multimodal feature extractor that extracts textual and visual features from posts, a multi-level feature fusion network that efficiently fuses the extracted features, and a rumor detector that performs the final classification. Extensive experiments on two real-world datasets show that MLFRD fuses multimodal features more effectively than existing approaches and outperforms state-of-the-art methods.
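
For concreteness, the sketch below shows one way the three components could be wired together. It is a minimal illustration under assumed design choices, not the paper's implementation: the feature dimensions, the cross-attention-plus-concatenation fusion, and all names (MLFRDSketch, text_proj, etc.) are hypothetical stand-ins for the actual extractor, multi-level fusion network, and detector.

```python
# Minimal sketch of a three-part image-text rumor detection pipeline.
# All architectural choices here are illustrative assumptions; a real
# system would use pretrained text/image encoders (e.g., BERT, ResNet)
# upstream of these projections.
import torch
import torch.nn as nn

class MLFRDSketch(nn.Module):
    def __init__(self, text_dim=768, img_dim=2048, hidden=256, n_classes=2):
        super().__init__()
        # (1) Multimodal feature extractor: project pre-extracted textual
        # and visual features into a shared hidden space.
        self.text_proj = nn.Linear(text_dim, hidden)
        self.img_proj = nn.Linear(img_dim, hidden)
        # (2) Multi-level fusion: a local cross-attention step followed by
        # a global pooling-and-concatenation step (hypothetical choices).
        self.cross_attn = nn.MultiheadAttention(hidden, num_heads=4,
                                                batch_first=True)
        self.fuse = nn.Linear(hidden * 2, hidden)
        # (3) Rumor detector: classifier over the fused representation.
        self.classifier = nn.Linear(hidden, n_classes)

    def forward(self, text_feats, img_feats):
        # text_feats: (batch, seq_len, text_dim)
        # img_feats:  (batch, regions, img_dim)
        t = self.text_proj(text_feats)
        v = self.img_proj(img_feats)
        # Local-level fusion: text tokens attend to image regions.
        attended, _ = self.cross_attn(t, v, v)
        # Global-level fusion: pool each modality and combine.
        global_t = attended.mean(dim=1)
        global_v = v.mean(dim=1)
        fused = torch.relu(self.fuse(torch.cat([global_t, global_v], dim=-1)))
        return self.classifier(fused)  # logits: rumor vs. non-rumor
```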