The rich temporal information in videos can effectively mitigate the severe appearance degradation observed in drone imagery. Learning temporal features from videos, however, remains challenging: simple feature aggregation methods often let background clutter leak into the target representations. To address this issue, we propose a simple yet effective Self and Cross Motion Extracted (SACME) module. The key idea is to leverage transformer layers and a novel temporal fusion network to learn and fuse temporal features. SACME serves as an add-on module, enabling existing static object detectors to achieve high-performance video object detection without incurring significant additional computational cost. To validate the efficacy of our module, we select YOLOv7 as the baseline and carry out comprehensive experiments on the VisDrone2019-VID dataset. Notably, our SACME-YOLOv7 not only achieves a significant 5.1% improvement in mean average precision (mAP) on the challenging VisDrone dataset, but also runs at a remarkable 31 frames per second (fps).
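
The abstract only sketches the design, but the description of transformer layers plus a temporal fusion network on top of a static detector suggests a structure along the following lines. This is a minimal illustrative sketch, assuming flattened per-frame feature tokens from the detector backbone; the class name `TemporalFusionHead`, the dimensions, and the concatenate-then-MLP fusion are our assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class TemporalFusionHead(nn.Module):
    """Hypothetical add-on temporal head: self-attention within the current
    frame ("self motion") plus cross-attention from the current frame to
    reference-frame features ("cross motion"), fused by a small MLP."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fuse = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim)
        )
        self.norm = nn.LayerNorm(dim)

    def forward(self, cur: torch.Tensor, refs: torch.Tensor) -> torch.Tensor:
        # cur:  (B, N, C) flattened feature tokens of the current frame
        # refs: (B, T*N, C) tokens gathered from T reference frames
        s, _ = self.self_attn(cur, cur, cur)        # intra-frame context
        c, _ = self.cross_attn(cur, refs, refs)     # temporal context
        out = self.fuse(torch.cat([s, c], dim=-1))  # fuse both motion cues
        return self.norm(cur + out)                 # residual, same shape as cur


# Usage sketch: enhance current-frame features with three reference frames,
# then hand the result back to the (frozen) detector's prediction head.
head = TemporalFusionHead(dim=256)
cur = torch.randn(2, 400, 256)       # e.g. a 20x20 feature map, flattened
refs = torch.randn(2, 3 * 400, 256)  # tokens from three reference frames
enhanced = head(cur, refs)           # (2, 400, 256)
```

Because the head only adds attention and a light MLP on top of features the detector already computes, this kind of design is consistent with the abstract's claim of strong video performance at little extra computational cost.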