Early detection and diagnosis of fetal distress are critical during intrapartum fetal monitoring. However, traditional cardiotocography (CTG) interpretation relies heavily on physicians’ experience and does not take the clinical features of pregnant women into account. To overcome these challenges, we propose a multimodal deep learning approach that uses a Convolutional Neural Network (CNN) and a Vision Transformer (ViT) to extract, in an end-to-end manner, detailed and global deep features of the CTG signal, respectively, and fuses them with the clinical features of pregnant women to predict fetal status. Experimental results demonstrate that the combination of CNN and ViT achieves strong performance, with an average F1 score of 0.74 and an area under the curve (AUC) of 0.84. Furthermore, incorporating the clinical features of pregnant women improves performance further, yielding an average F1 score of 0.78 and an AUC of 0.87. In summary, the proposed multimodal deep learning model demonstrates the feasibility and effectiveness of this approach for intrapartum fetal monitoring.
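
The following PyTorch sketch illustrates the general shape of such a fusion architecture, not the authors’ actual model: all layer sizes, the signal length, the patch length, and the clinical feature dimension are assumptions chosen for readability. A 1-D CNN branch captures local (detailed) signal features, a ViT-style branch captures global context over signal patches, and both are concatenated with the tabular clinical features before classification.

```python
import torch
import torch.nn as nn


class CTGFusionModel(nn.Module):
    """Illustrative CNN + ViT fusion of a CTG signal with clinical features.

    Hypothetical configuration; dimensions are placeholders, not the paper's settings.
    """

    def __init__(self, signal_len=4800, clinical_dim=8, num_classes=2,
                 patch=60, d_model=64, n_heads=4, n_layers=2):
        super().__init__()
        # CNN branch: local (detailed) features of the 1-D CTG signal
        self.cnn = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=7, padding=3), nn.ReLU(),
            nn.MaxPool1d(4),
            nn.Conv1d(16, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),                      # -> (B, 32, 1)
        )
        # ViT-style branch: split the signal into patches, prepend a CLS token,
        # and model global context with a Transformer encoder
        self.patch = patch
        n_patches = signal_len // patch
        self.patch_embed = nn.Linear(patch, d_model)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d_model))
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches + 1, d_model))
        enc_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(enc_layer, num_layers=n_layers)
        # Fusion head: concatenate CNN, ViT, and clinical features
        self.head = nn.Sequential(
            nn.Linear(32 + d_model + clinical_dim, 64), nn.ReLU(),
            nn.Linear(64, num_classes),
        )

    def forward(self, signal, clinical):
        # signal: (B, 1, signal_len), clinical: (B, clinical_dim)
        local_feat = self.cnn(signal).squeeze(-1)                       # (B, 32)
        patches = signal.squeeze(1).unfold(1, self.patch, self.patch)   # (B, n_patches, patch)
        tokens = self.patch_embed(patches)                              # (B, n_patches, d_model)
        cls = self.cls_token.expand(tokens.size(0), -1, -1)
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_embed
        global_feat = self.transformer(tokens)[:, 0]                    # CLS token, (B, d_model)
        fused = torch.cat([local_feat, global_feat, clinical], dim=-1)
        return self.head(fused)                                         # class logits


if __name__ == "__main__":
    model = CTGFusionModel()
    fhr = torch.randn(4, 1, 4800)     # e.g. a 20-minute FHR segment sampled at 4 Hz (assumed)
    clinic = torch.randn(4, 8)        # e.g. maternal age, parity, ... (assumed)
    print(model(fhr, clinic).shape)   # torch.Size([4, 2])
```

In this sketch the two signal branches and the clinical features are fused by simple concatenation before a small classification head; other fusion strategies (e.g., attention-based fusion) would fit the same overall structure.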