Knee osteoarthritis (OA) is a highly prevalent form of arthritis and a leading cause of physical disability, particularly as the population ages. There is thus growing interest in computer-aided grading algorithms to assist knee OA assessment. Existing grading methods generally require resource-intensive annotated datasets for supervised training. Moreover, they typically consider only unimodal data, whilst multimodal medical images are rarely utilised to capture richer knee OA patterns. Therefore, in this study, a novel Self-supervised Multimodal Fusion Network (S-MFN) is proposed for unsupervised knee OA grading with X-ray and magnetic resonance imaging (MRI) modalities. Specifically, S-MFN comprises two modality-specific streams that extract knee OA representations from the corresponding modalities. A modality-aware information exchange mechanism is devised to interactively formulate cross-modal patterns across multiple feature-map scales. Furthermore, a multimodal contrastive learning objective is introduced to train the network in a self-supervised manner through modality-specific and cross-modal modelling. Comprehensive experimental results on the widely used Osteoarthritis Initiative (OAI) dataset demonstrate the effectiveness of the proposed method.
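
The cross-modal contrastive objective mentioned above can be illustrated with an InfoNCE-style loss between paired X-ray and MRI embeddings, where matched pairs from the same knee are pulled together and mismatched pairs within a batch are pushed apart. The following is a minimal NumPy sketch under that assumption; the function name, temperature value, and symmetric two-direction formulation are illustrative choices, not the paper's actual implementation.

```python
import numpy as np

def cross_modal_infonce(xray_emb, mri_emb, temperature=0.1):
    """Illustrative InfoNCE-style cross-modal contrastive loss.

    xray_emb, mri_emb: arrays of shape (batch, dim), where row i of each
    array is assumed to come from the same knee (a positive pair).
    """
    # L2-normalise so dot products become cosine similarities
    x = xray_emb / np.linalg.norm(xray_emb, axis=1, keepdims=True)
    m = mri_emb / np.linalg.norm(mri_emb, axis=1, keepdims=True)
    # Pairwise similarity matrix, sharpened by the temperature
    logits = x @ m.T / temperature
    labels = np.arange(len(x))  # diagonal entries are the positives

    def ce(l):
        # Softmax cross-entropy with the matched pair as the target class
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_prob[labels, labels].mean()

    # Symmetrise over both retrieval directions (X-ray->MRI and MRI->X-ray)
    return 0.5 * (ce(logits) + ce(logits.T))
```

When the two modalities' embeddings of the same knee are well aligned and distinct knees are dissimilar, the loss approaches zero; random embeddings yield a loss near the log of the batch size.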