The rapid advancement of communication technology has improved people's lives, but it has also given rise to a significant problem: the proliferation of phishing websites, which have become one of the primary threats to cybersecurity. Although extensive effort has gone into anti-phishing techniques, many existing approaches rely on a single modality of information, such as the website's URL, HTML source code, or visual features. Given the growing sophistication of phishing attacks, a single modality is no longer sufficient to identify phishing websites accurately. To address this challenge, we leverage the complementarity of multiple modalities to improve phishing website detection. In this paper, we introduce FusionNet, a multi-modal framework for identifying phishing webpages that uses three information modalities: URLs, HTML source code, and visual features. Specifically, we first design a representation learning architecture tailored to the properties of each modality. We then incorporate an attention mechanism to merge these representations, thereby exploiting the synergies among the modalities. The resulting feature representations are more discriminative, which contributes to significantly improved detection accuracy. To evaluate FusionNet, we collect a dataset of phishing webpages in which each sample contains all three modalities: URL, HTML source code, and visual features. Extensive experiments on this dataset demonstrate that FusionNet outperforms state-of-the-art methods in both robustness and accuracy.
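The attention-based merging of per-modality representations can be illustrated with a minimal sketch. The abstract does not specify the form of the attention mechanism, so the code below assumes scaled dot-product scoring of each modality embedding against a single query vector, followed by a softmax-weighted sum. The function names `softmax` and `attention_fuse`, the `query` vector (standing in for a learned parameter), and the toy embeddings are all hypothetical and are not taken from FusionNet itself.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fuse(embeddings, query):
    """Fuse equal-length modality embeddings into one vector.

    embeddings: dict mapping a modality name to a list of floats.
    query: list of floats of the same length (assumed to be learned
           in a real model; fixed here for illustration).
    Returns the fused vector and the per-modality attention weights.
    """
    names = list(embeddings)
    dim = len(query)
    # Scaled dot-product score of each modality against the query.
    scores = [
        sum(q * x for q, x in zip(query, embeddings[name])) / math.sqrt(dim)
        for name in names
    ]
    weights = softmax(scores)
    # Attention-weighted sum of the modality embeddings.
    fused = [
        sum(w * embeddings[name][i] for w, name in zip(weights, names))
        for i in range(dim)
    ]
    return fused, dict(zip(names, weights))


# Toy embeddings standing in for the outputs of the URL, HTML,
# and visual encoders (values are placeholders, not real features).
example_embeddings = {
    "url": [1.0, 0.0],
    "html": [0.0, 1.0],
    "visual": [1.0, 1.0],
}
example_query = [1.0, 1.0]
fused_vector, modality_weights = attention_fuse(example_embeddings, example_query)
```

In this toy setting the weights sum to one, and the modality whose embedding aligns best with the query receives the largest weight. A trained model would learn the query (or a richer attention module) jointly with the modality encoders.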