Current research strive has tremendously changed the horizon of computer vision tasks in multiple agents tracking. Nevertheless, in the research of robotic assisted surgery, reliable surgical instrument tracking imposes challenge due to the high complexity in state modeling for the hierarchical structure of the instrument versus de-coupling the spatial-temporal correlations naturally embedded in the task. In this paper, we present a new tracking paradigm integrating the trajectory prediction to reduce the data association error that is propagated from the false detection. As a key component in the system, a proposed predictor disentangles the hierarchical modeling and agent kinematic learning by introducing inductive attention mechanism in spatial-temporal graph network. Experiments on real anatomical datasets show that our tracking-by-prediction scheme improves overall localization accuracy over the frames by up to 81%, in comparison to the generic pipelines of tracking, even with transductive graph representation learning, with a large margin of gain in terms of precise localization.