An improved visual SLAM system based on an object detection algorithm is proposed. In this work, object detection and monocular visual SLAM are combined to build a semi-dense semantic map of the scene. 3D cuboids and quadric surfaces are recovered from 2D object detections using a vanishing-point method and a quadric-surface reconstruction method, which initializes the pose of each 3D object. A backbone network performs multi-level feature extraction and multi-level fusion with text features to achieve deep fusion of multi-source information. Second, to ensure that both small and large targets are recognized, this work adopts a pyramid network structure: by cross-linking the down-sampled feature maps with the bottom-up feature maps, targets across scales can be accurately identified. A semantic loss function is constructed by introducing positional invariants between objects, and these constraints are incorporated into the bundle adjustment (BA) algorithm. On Pulane, a large-scale public database, the algorithm achieves 95.86% accuracy. This method not only realizes deep fusion of multi-modal features but also enriches the fused multi-modal feature information, and it shows good detection performance.
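The vanishing-point step can be illustrated with a minimal sketch: parallel 3D edges of a cuboid project to image lines that meet at a vanishing point, and in homogeneous coordinates that intersection reduces to a cross product. The function names below are illustrative, not from the paper, and the segments are toy data.

```python
import numpy as np

def line_through(p, q):
    # Homogeneous line (a, b, c) with a*x + b*y + c = 0
    # through two image points p and q.
    return np.cross([p[0], p[1], 1.0], [q[0], q[1], 1.0])

def vanishing_point(seg1, seg2):
    # Intersect the two image lines; for segments that are
    # projections of parallel 3D edges, this is the vanishing point.
    l1 = line_through(*seg1)
    l2 = line_through(*seg2)
    v = np.cross(l1, l2)
    return v[:2] / v[2]  # back to inhomogeneous pixel coordinates

# Two converging segments: y = x/2 and y = 2 meet at (4, 2).
vp = vanishing_point(((0, 0), (2, 1)), ((0, 2), (2, 2)))
```

In a full pipeline, three such vanishing points (one per cuboid axis) constrain the object's orientation, after which the 2D bounding box fixes its translation and scale up to depth.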