RGB-D semantic segmentation has been widely studied and has achieved remarkable performance. However, traditional methods fall short in exploiting the complementary cues of different modalities. How to effectively fuse multi-modality features and multi-level features remains a challenging problem in RGB-D semantic segmentation. To address this issue, we propose a novel network named the cross-level-guided transformer (CLGFormer). Specifically, we devise a dynamic selection fusion (DSF) module to diminish the data discrepancy during multi-modality feature fusion. It adaptively selects multi-scale RGB features under the guidance of depth and employs channel attention to concentrate on significant channels. To bridge the semantic gap between low-level detailed features and high-level semantic features, we adopt a cross-level-guided transformer (CLGT) module based on a bi-directional cross-guidance strategy. The CLGT module explicitly models spatial long-range dependencies and channel inter-dependencies to enhance the efficiency of multi-level feature fusion. Finally, an edge loss is introduced to alleviate the problem of edge inconsistency. Extensive experiments demonstrate that our CLGFormer outperforms other state-of-the-art methods, obtaining 52.0% mIoU on NYUv2, 81.4% mIoU on Cityscapes, and 57.15% mIoU on the Semantic KITTI dataset.