With around 1.5 billion people worldwide living with hearing impairment, enabling communication between hearing people and people with hearing or speech impairment is essential to building a barrier-free society. Multi-modal learning offers a promising artificial-intelligence channel for this purpose. In this article, we present an end-to-end Chinese lip-reading recognition system based on multi-modal fusion that performs Chinese lip translation to facilitate communication with hearing-impaired individuals. Our system adopts the End-to-end Audio-visual feature fusion Lip-reading Recognition Architecture (EALRA): the front-end extracts features with a CNN backbone based on a tuned MobileNet0.25, and the encoder back-end models the fused features with a Conformer, a convolution-augmented self-attention encoder. We conducted the empirical study on the largest Chinese Mandarin Lip-Reading dataset (CMLR), using character error rate (CER) as the performance metric for Chinese lip recognition. Our experiments show that EALRA achieves a CER of 8.0, on average 23.74% lower than the CERs of other lip-recognition models, indicating that EALRA fuses image and audio features more effectively.
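To make the audio-visual fusion pipeline concrete, the following is a minimal sketch of an EALRA-style model, not the authors' implementation: the front-ends, feature dimensions, vocabulary size, and the assumption that video and audio frames are pre-aligned are all illustrative, and torchaudio's generic Conformer stands in for the paper's encoder back-end.

```python
import torch
import torch.nn as nn
from torchaudio.models import Conformer


class AVLipReader(nn.Module):
    """Sketch: visual + audio front-ends, feature fusion, Conformer encoder."""

    def __init__(self, vocab_size: int = 4000, dim: int = 256):
        super().__init__()
        # Visual front-end: a small 3D-conv stem over grayscale lip crops
        # (a stand-in for the tuned MobileNet0.25 backbone in the paper).
        self.visual_frontend = nn.Sequential(
            nn.Conv3d(1, 64, kernel_size=(5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3)),
            nn.BatchNorm3d(64),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),  # pool spatial dims, keep time
        )
        self.visual_proj = nn.Linear(64, dim)
        # Audio front-end: project 80-dim log-mel filterbank frames.
        self.audio_proj = nn.Linear(80, dim)
        # Fusion: concatenate per-frame features and project back to `dim`.
        self.fusion = nn.Linear(2 * dim, dim)
        # Encoder back-end: Conformer (self-attention + convolution blocks).
        self.encoder = Conformer(
            input_dim=dim, num_heads=4, ffn_dim=1024,
            num_layers=6, depthwise_conv_kernel_size=31,
        )
        # Per-frame character logits, e.g. for a CTC objective.
        self.classifier = nn.Linear(dim, vocab_size)

    def forward(self, video, audio, lengths):
        # video: (B, 1, T, H, W) lip crops; audio: (B, T, 80) filterbanks,
        # assumed aligned to the same frame rate T.
        v = self.visual_frontend(video)                # (B, 64, T, 1, 1)
        v = v.squeeze(-1).squeeze(-1).transpose(1, 2)  # (B, T, 64)
        v = self.visual_proj(v)                        # (B, T, dim)
        a = self.audio_proj(audio)                     # (B, T, dim)
        fused = self.fusion(torch.cat([v, a], dim=-1))
        out, out_lengths = self.encoder(fused, lengths)
        return self.classifier(out), out_lengths


model = AVLipReader()
video = torch.randn(2, 1, 75, 88, 88)  # 2 clips, 75 frames of 88x88 lip crops
audio = torch.randn(2, 75, 80)         # matching 80-dim filterbank frames
logits, out_lengths = model(video, audio, torch.tensor([75, 75]))
print(logits.shape)                    # torch.Size([2, 75, 4000])
```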
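For reference, CER is the standard character-level edit-distance metric (this definition is general, not specific to the paper):

$$\mathrm{CER} = \frac{S + D + I}{N}$$

where $S$, $D$, and $I$ are the numbers of character substitutions, deletions, and insertions relative to the reference transcript, and $N$ is the number of characters in the reference.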