With the rapid development of computer vision in recent years, real-time object detection and tracking has become an active research topic. Object detection consists mainly of two parts, object segmentation and object recognition, which together amount to acquiring the object. Accurate detection of the target is the key to high-precision, high-efficiency tracking. Motivated by this, this article explored target detection and tracking for survey robots using YOLOv5 combined with multimodal Internet of Things (IoT) sensor data fusion, and experimentally compared the proposed algorithm with the direct frame difference method and the traditional ORB method. The results indicated that the accuracy and recall of the proposed robot image object detection algorithm were both above 90% and higher than those of the other two methods, indicating that the algorithm detects objects with high accuracy.
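To make the reported metrics concrete, the following minimal sketch shows one common way to compute detection precision and recall from predicted and ground-truth bounding boxes. It is an illustration only, not the evaluation code used in this paper: reading the reported accuracy as precision, the greedy one-to-one matching rule, the IoU threshold of 0.5, and the names `iou` and `precision_recall` are all assumptions introduced here.

```python
from typing import List, Tuple

# A box is (x1, y1, x2, y2) with x1 < x2 and y1 < y2 (assumed convention).
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_recall(
    preds: List[Box], truths: List[Box], iou_thresh: float = 0.5
) -> Tuple[float, float]:
    """Greedily match each prediction to its best unmatched ground-truth box.

    A prediction is a true positive if its best IoU reaches iou_thresh
    (0.5 here is an assumed, commonly used value).
    """
    matched = set()  # indices of ground-truth boxes already claimed
    tp = 0
    for p in preds:
        best_j, best_iou = -1, 0.0
        for j, t in enumerate(truths):
            if j in matched:
                continue
            v = iou(p, t)
            if v > best_iou:
                best_j, best_iou = j, v
        if best_j >= 0 and best_iou >= iou_thresh:
            matched.add(best_j)
            tp += 1
    fp = len(preds) - tp   # predictions with no matching ground truth
    fn = len(truths) - tp  # ground-truth objects that were missed
    precision = tp / (tp + fp) if preds else 0.0
    recall = tp / (tp + fn) if truths else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical boxes for illustration; not data from the experiments.
    preds = [(0, 0, 10, 10), (20, 20, 30, 30), (50, 50, 60, 60)]
    truths = [(1, 1, 10, 10), (20, 20, 30, 30)]
    print(precision_recall(preds, truths))
```

Under these assumptions, a detector is scored per image or per dataset by accumulating true positives, false positives, and false negatives, so a result above 90% on both metrics means that few detections are spurious and few objects are missed.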