Video instance segmentation has emerged as a critical component in enabling connected vehicles to comprehend complex driving scenes, thereby facilitating navigation under various driving conditions. Recent advances focus on video-based solutions, which leverage temporal and spatial information to achieve superior performance compared to traditional image-based approaches. However, the high computational and memory demands of these video-based solutions make them difficult to deploy efficiently on edge devices such as intelligent vehicles. Furthermore, the large size of video data makes uploading it to cloud servers impractical. To address the latency challenge during on-device inference, we propose incorporating early exits into the model. While the early exit strategy has been successful in image classification and natural language processing tasks, our study is the first to explore its application in video instance segmentation. Specifically, we incorporate early exits into the transformer-based video instance segmentation model VisTR. Our experimental results on the YouTube-VIS dataset demonstrate that early exit can speed up inference by up to 4.83x with a minimal trade-off of only 3% in the averaged precision scores. Furthermore, our qualitative analysis confirms the satisfactory quality of the generated segmentation masks.