Owing to significant improvements in communication and network infrastructure, enormous volumes of visual data are exchanged every day. Watching full-length videos is time-consuming for the user, which calls for methods that extract the meaningful parts of a video and thereby reduce both viewing time and storage requirements. In this paper, an object-based video summarization method is proposed. Objects are detected by training a CNN-based detector, the You Only Look Once (YOLOv5) model, and the detections are then tracked with a Kalman filtering algorithm. A saliency score is computed for each object in a frame from features including its motion, its centrality in the frame, and other color-based features. The method is evaluated on the Open Video Project (OVP) benchmark dataset and a custom-made video dataset, where it achieves better precision, recall, and F-score than state-of-the-art methods while significantly reducing computational time.
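
To make the saliency-scoring step concrete, the sketch below illustrates one way per-object motion, centrality, and color cues could be combined into a frame-level score. The TrackedObject fields, the weights, and the frame-level aggregation are illustrative assumptions for this sketch, not the paper's exact formulation; in the full pipeline the detections would come from YOLOv5 and the velocities from the Kalman filter tracks.

```python
# Minimal, self-contained sketch of per-object saliency scoring.
# The weights and the combination rule are assumptions, not the paper's formulation.
from dataclasses import dataclass
import math

@dataclass
class TrackedObject:
    cx: float              # bounding-box centre x (pixels)
    cy: float              # bounding-box centre y (pixels)
    vx: float              # estimated x-velocity (pixels/frame, from the tracker)
    vy: float              # estimated y-velocity (pixels/frame)
    color_contrast: float  # colour-based feature, assumed normalised to [0, 1]

def saliency(obj: TrackedObject, frame_w: int, frame_h: int,
             w_motion: float = 0.4, w_center: float = 0.3,
             w_color: float = 0.3) -> float:
    """Combine motion, centrality and colour cues into one score in [0, 1]."""
    diag = math.hypot(frame_w, frame_h)
    # Motion cue: object speed normalised by the frame diagonal.
    motion = min(math.hypot(obj.vx, obj.vy) / diag, 1.0)
    # Centrality cue: 1 at the frame centre, falling to 0 toward the corners.
    dist = math.hypot(obj.cx - frame_w / 2, obj.cy - frame_h / 2)
    centrality = 1.0 - min(dist / (diag / 2), 1.0)
    return w_motion * motion + w_center * centrality + w_color * obj.color_contrast

def frame_score(objects: list[TrackedObject], frame_w: int, frame_h: int) -> float:
    """Score a frame as the sum of its objects' saliencies (an assumed aggregation)."""
    return sum(saliency(o, frame_w, frame_h) for o in objects)

if __name__ == "__main__":
    # Frames whose score exceeds a chosen threshold would be kept in the summary.
    objs = [TrackedObject(cx=320, cy=240, vx=6.0, vy=2.0, color_contrast=0.7),
            TrackedObject(cx=50,  cy=30,  vx=0.5, vy=0.0, color_contrast=0.2)]
    print(frame_score(objs, frame_w=640, frame_h=480))
```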