Traditional hearing aid solutions rely on audio and visual information, which limits their effectiveness in challenging scenarios such as noisy environments and physical obstructions; moreover, their comfort and privacy protection are often unsatisfactory. Consequently, there has been growing interest in recent years in developing contactless, privacy-preserving alternatives based on radio frequency (RF) signals. However, RF-based approaches require large amounts of training data to achieve reliable accuracy, and collecting such data is time-consuming and labour-intensive. To address these limitations and bring RF sensing to future hearing aids, we turn to the fusion of multiple modalities. In this paper, we propose an RF-visual speech recognition (SR) system that fuses visual and RF information through a multi-input convolutional neural network (CNN), achieving up to 87.55% recognition accuracy. We comprehensively compare and evaluate the system's performance with single and multiple modalities, and conclude that the proposed RF-visual SR system has great potential for advancing hearing aid technology.
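To make the multi-input fusion idea concrete, the sketch below shows one plausible shape for such a network in PyTorch: one convolutional branch for RF input (e.g. a single-channel Doppler spectrogram) and one for visual input (e.g. RGB lip-region frames), with the branch features concatenated before a joint classifier. All layer sizes, input shapes, and the class count are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn


class RFVisualFusionNet(nn.Module):
    """Hypothetical two-branch CNN for RF-visual fusion.

    One branch processes an RF spectrogram (1 channel), the other a
    visual lip-region frame (3 channels); their features are
    concatenated and classified jointly (feature-level fusion).
    """

    def __init__(self, num_classes: int = 10):
        super().__init__()
        # Branch for the RF modality (assumed 1-channel spectrogram).
        self.rf_branch = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4),
        )
        # Branch for the visual modality (assumed 3-channel image).
        self.vis_branch = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4),
        )
        # Joint classifier over the concatenated branch features.
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(2 * 32 * 4 * 4, 128), nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, rf: torch.Tensor, vis: torch.Tensor) -> torch.Tensor:
        # Concatenate branch feature maps along the channel dimension.
        fused = torch.cat([self.rf_branch(rf), self.vis_branch(vis)], dim=1)
        return self.classifier(fused)


model = RFVisualFusionNet(num_classes=10)
# Dummy batch of 2: RF spectrograms (1x64x64) and lip frames (3x64x64).
logits = model(torch.randn(2, 1, 64, 64), torch.randn(2, 3, 64, 64))
print(logits.shape)
```

Feature-level (intermediate) fusion of this kind lets the classifier weigh both modalities jointly, which is one common way to let a clean modality compensate when the other is degraded (e.g. visual occlusion or RF noise).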