To address the challenge of varying target poses across viewpoints, which causes target loss and tracking errors during object tracking, a motion model optimization algorithm based on multi-branch deep fusion is proposed. First, the Transformer Tracking model is improved with an end-to-end sequence-level training method that combines a sampling tracker with an argmax tracker, enabling the model to better capture moving targets. Second, a pixel-level feature fusion box refinement module is introduced to exploit spatial information more effectively and improve prediction accuracy. Experiments show that the method achieves an area under the curve of 75.8% and a normalized precision of 85.7% on the vehicle subset of the LaSOT dataset. Furthermore, the method performs real-time tracking at 31 fps.
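To illustrate the sampling/argmax pairing mentioned above, the sketch below shows one common way such sequence-level training can be set up: candidate locations are sampled from the tracker's score map during training so that a sequence-level, REINFORCE-style objective can be optimized, while the argmax branch is used at inference. The function names, score-map shapes, and IoU-based reward are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def select_candidate(score_map, use_sampling):
    """Pick a candidate index from a flattened score map.

    Training (use_sampling=True): draw from the softmax distribution so the
    sequence of choices can be optimized with a policy-gradient loss.
    Inference (use_sampling=False): take the argmax, i.e. the usual tracker.
    """
    probs = F.softmax(score_map.flatten(), dim=0)
    if use_sampling:
        idx = torch.multinomial(probs, num_samples=1)
    else:
        idx = probs.argmax().unsqueeze(0)
    log_prob = torch.log(probs[idx] + 1e-12)
    return idx, log_prob

def sequence_level_loss(score_maps, rewards):
    """REINFORCE-style sequence-level objective (illustrative assumption).

    score_maps: list of per-frame score maps produced by the tracker.
    rewards:    per-frame rewards, e.g. IoU between the predicted and
                ground-truth boxes over the tracked sequence.
    """
    log_probs = []
    for score_map in score_maps:
        _, log_prob = select_candidate(score_map, use_sampling=True)
        log_probs.append(log_prob)
    log_probs = torch.cat(log_probs)
    rewards = torch.as_tensor(rewards, dtype=log_probs.dtype)
    baseline = rewards.mean()  # simple variance-reduction baseline
    return -((rewards - baseline) * log_probs).mean()
```

At test time the same `select_candidate` routine with `use_sampling=False` reduces to the conventional argmax tracker, so training and inference share one prediction path.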