In the context of the Internet of Things, where efficient and accurate perception is crucial. Monocular 3D detection has gained attention due to its cost-effectiveness. This paper introduces an efficient method for monocular 3D object detection, termed Multi-Scale Enhanced Depth Knowledge Distillation (MDKD). Our approach simplifies the teacher network, eliminating the need for extra modal data input while improving the student network’s performance. Additionally, we present a Multi-Scale Depth Enhancement (MDE) module and a novel lightweight Squeeze-Excitation Former (SEFormer). Our method addresses the growing demand for precise object detection within IoT environments. Extensive experiments on the KITTI dataset validate our method’s effectiveness.