Deep convolutional neural networks have been increasingly applied to image segmentation because of their ability to extract detailed features from images. Among them, the encoder-decoder architecture is one of the most successful network designs for segmentation: U-Net combines an encoder and a decoder to segment images at the pixel level. U-Net extracts visual information with multi-scale convolutional layers, but these layers cannot capture long-distance dependencies. Inspired by the Transformer, this work proposes a bidirectional Transformer U-Net (BTU-Net) model to capture both local and global information in an image. The BTU-Net consists of an encoder with five down-sampling layers and a decoder with five up-sampling layers; the first two layers use multi-scale convolution modules, while the final three layers use bidirectional Transformer hybrid convolution modules. By introducing convolution layers and bidirectional convolution layers, the quadratic complexity of the traditional self-attention mechanism is reduced to linear. Experiments show that the proposed model achieves IoU, F1-score, accuracy, recall, and precision of 61.9%, 67.2%, 83.9%, 63.3%, and 84.3%, respectively, which is comparable to other network models.
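As a minimal sketch of the complexity claim (not the paper's exact hybrid module): standard self-attention materialises an n×n score matrix, so its cost grows quadratically with the number of tokens n, while a kernelised linear-attention variant reorders the matrix products so that no n×n matrix is ever formed. The feature map (ReLU(x)+1) and the tensor shapes below are illustrative assumptions.

```python
import numpy as np

def softmax_attention(Q, K, V):
    # Standard self-attention: builds an (n x n) score matrix,
    # so time and memory scale quadratically with sequence length n.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def linear_attention(Q, K, V, eps=1e-6):
    # Kernelised linear attention: phi(Q) @ (phi(K).T @ V) reorders the
    # matmuls so only (d x d) intermediates appear -- cost linear in n.
    phi = lambda x: np.maximum(x, 0.0) + 1.0  # ReLU(x)+1, a positive feature map
    Qp, Kp = phi(Q), phi(K)
    KV = Kp.T @ V                    # (d x d), independent of n
    Z = Qp @ Kp.sum(axis=0)          # per-query normaliser, shape (n,)
    return (Qp @ KV) / (Z[:, None] + eps)

n, d = 64, 16                        # n tokens (flattened feature-map pixels)
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
print(softmax_attention(Q, K, V).shape)  # (64, 16)
print(linear_attention(Q, K, V).shape)   # (64, 16)
```

The two variants are not numerically identical; the sketch only shows how the reordering removes the quadratic intermediate while preserving the output shape.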