Unmanned aerial vehicle (UAV) aerial videos hold significant promise for surveillance, rescue operations, agriculture, and urban planning. Multi-object tracking is essential for processing such videos but faces challenges including target occlusion, scale variation, rapid motion, and complex environments. Recent approaches employ anchor-free object detectors to reduce identity ambiguity during appearance feature learning; however, classical convolutional neural network-based anchor-free detectors suffer accuracy degradation in crowded scenes. To address this, we propose a Transformer-based UAV multi-object tracking algorithm. The Transformer performs local and global interaction on high-resolution inputs, producing accurate detection information for multi-object trajectory tracking. A Gaussian mixture probability hypothesis density (GM-PHD) filter predicts target motion trajectories from the visual detections, and a matching pattern in the data association stage enhances the model's robustness. Comparative experiments on the VisDrone and UAVDT datasets demonstrate the algorithm's effectiveness, offering a novel solution with broad applicability to UAV multi-object tracking.
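To illustrate the motion-prediction component mentioned above, the prediction step of a generic GM-PHD filter can be sketched as follows. This is a minimal sketch under assumed settings (a 2D constant-velocity motion model, unit time step, and a survival probability of 0.99), not the paper's actual implementation; the function name `gm_phd_predict` and all parameter values are illustrative.

```python
import numpy as np

def gm_phd_predict(components, F, Q, p_survival=0.99):
    """Prediction step of a GM-PHD filter: propagate each Gaussian
    component (weight w, mean m, covariance P) through the linear
    motion model x' = F x + noise, scaling weights by the
    survival probability."""
    predicted = []
    for w, m, P in components:
        m_pred = F @ m                 # predicted mean
        P_pred = F @ P @ F.T + Q       # predicted covariance
        predicted.append((p_survival * w, m_pred, P_pred))
    return predicted

# Assumed constant-velocity model: state = [x, y, vx, vy], dt = 1.
dt = 1.0
F = np.array([[1, 0, dt, 0],
              [0, 1, 0, dt],
              [0, 0, 1, 0],
              [0, 0, 0, 1]], dtype=float)
Q = 0.01 * np.eye(4)                   # process noise (illustrative)

m0 = np.array([10.0, 20.0, 1.0, -2.0])  # one target at (10, 20)
P0 = np.eye(4)

(w, m, P), = gm_phd_predict([(1.0, m0, P0)], F, Q)
print(m[:2])   # predicted position after one step: [11. 18.]
```

In a full GM-PHD tracker, this prediction step would be followed by an update step that weights each component against the Transformer's detections, with pruning and merging of low-weight components before extracting trajectories.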