Medical images are information carriers that visually reflect and record the anatomical structure of the human body, and they play an important role in clinical diagnosis, teaching, and research. Modern medicine has become increasingly reliant on the intelligent processing of medical images. In recent years, there have been growing efforts to apply deep learning to medical image segmentation, making it imperative to explore simple and efficient deep learning algorithms for this task. In this paper, we study multi-modal medical image segmentation with a hybrid architecture combining Convolutional Neural Networks (CNNs) and the Vision Transformer (ViT). We propose SWT-UNet, a multi-modal medical image segmentation model built on a CNN-ViT hybrid framework. The self-attention mechanism and sliding-window design of the Vision Transformer capture global feature associations and overcome the receptive-field limitation imposed by the inductive bias of convolution. At the same time, widened self-attention vectors are used to reduce the number of modules and compress the model size, matching the small scale typical of medical datasets, on which larger models overfit easily. Experiments on two multi-modal medical image datasets show that the algorithm achieves efficient medical image segmentation at a lightweight scale.
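The windowed self-attention idea mentioned above can be illustrated with a minimal sketch: attention is computed only among tokens inside each window, rather than over the whole token sequence, which keeps the computation local and cheap. This is a simplified, hypothetical illustration assuming non-overlapping windows and identity Q/K/V projections; it is not the paper's actual SWT-UNet implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def windowed_self_attention(x, window):
    """Self-attention restricted to non-overlapping windows of tokens.

    x      : (num_tokens, dim) array; num_tokens must divide by `window`.
    window : number of tokens per attention window (assumed parameter).
    Returns an array of the same shape where each token attends only to
    tokens within its own window.
    """
    n, d = x.shape
    assert n % window == 0, "token count must be divisible by window size"
    out = np.empty_like(x)
    for start in range(0, n, window):
        w = x[start:start + window]            # tokens in this window
        scores = w @ w.T / np.sqrt(d)          # scaled dot-product scores
        out[start:start + window] = softmax(scores) @ w  # weighted sum
    return out
```

Because each output token is a convex combination of the tokens in its window, the output stays within the range of the input values; sliding (shifted) windows across successive layers, as in Swin-style designs, would then let information propagate between windows.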