Semantic segmentation is a fundamental tool for machine perception of the real world and a prerequisite for solving higher-level vision tasks. However, real scenes contain many complex objects and are further affected by factors such as occlusion, which degrades the performance of segmentation methods that rely on single-modal data. To improve segmentation accuracy and reduce the influence of object occlusion, truncation, and similar factors, a 3D scene semantic segmentation framework based on 2D image features, geometric structure, and global context information is proposed. It adopts a heterogeneous network that combines image and depth information effectively, addressing the problem that results produced from single-modal data are not fine enough. Experimental results show that the proposed method is effective for understanding complex scenes.
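The heterogeneous-network idea above, combining an image stream with a depth stream, can be sketched as a minimal two-branch fusion model. This is an illustrative assumption in PyTorch, not the authors' actual architecture: the branch layouts, channel widths, and the class name `TwoBranchFusion` are all hypothetical, and the fusion here is a simple channel-wise concatenation followed by a 1x1 classifier.

```python
import torch
import torch.nn as nn

class TwoBranchFusion(nn.Module):
    """Hypothetical sketch of a heterogeneous RGB-D network: one branch
    encodes the color image, the other encodes the depth map (geometric
    cue); their features are concatenated and mapped to per-pixel class
    scores. Layer choices are illustrative only."""

    def __init__(self, num_classes: int = 13):
        super().__init__()
        # RGB branch: 3 input channels (color image)
        self.rgb_branch = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
        )
        # Depth branch: 1 input channel (depth map)
        self.depth_branch = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
        )
        # Fuse concatenated features and predict a label per pixel
        self.classifier = nn.Conv2d(64, num_classes, kernel_size=1)

    def forward(self, rgb: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([self.rgb_branch(rgb), self.depth_branch(depth)], dim=1)
        return self.classifier(fused)

model = TwoBranchFusion(num_classes=13)
rgb = torch.randn(2, 3, 64, 64)     # batch of color images
depth = torch.randn(2, 1, 64, 64)   # matching depth maps
logits = model(rgb, depth)          # shape: (2, 13, 64, 64)
```

In a real system the two branches would be deeper encoders and the fusion could happen at several scales, but the key point the abstract makes survives in this sketch: depth enters as a separate input stream rather than being stacked as a fourth image channel, so each modality gets its own feature extractor before fusion.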