Estimating depth from images is a crucial computer vision task with wide-ranging applications in fields such as autonomous driving, drones, and virtual reality. Self-supervised monocular depth estimation uses image sequences as the supervisory signal instead of ground-truth depth labels and has shown promising application prospects. However, current self-supervised methods still fall short in modeling feature dependencies and in handling local information, which limits their prediction accuracy. In this work, we propose HCTNet, a novel network architecture based on the U-Net framework, aimed at further improving prediction accuracy. The network uses a hybrid CNN-Transformer as the depth encoder to capture and convey contextual information, and it achieves competitive results on the KITTI dataset.