Knowledge Graphs (KGs) are gaining significance across industries as an effective means of organizing knowledge extracted from vast, heterogeneous resources. In this paper, we propose an autonomous multimodal system for building a KG on the MuSe-CAR dataset. The system extracts features of an entity from the heterogeneous streams of a video, i.e., text and images, and represents them as a fused Multi-Modal KG. We use the extracted features to explore the video content and to estimate relationships between nodes of different modalities, enabling users to pose a wider range of queries spanning text and image data streams. Our observations show that each modality either compensates for or corroborates knowledge in the other, allowing the user to perform more queries. We evaluate the proposed system with a set of quantitative queries involving different data streams; the results indicate how well the system extracts knowledge from the dataset and how useful it is for downstream applications such as querying.
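To make the idea of cross-modal querying concrete, the following is a minimal sketch (not the paper's implementation) of a fused multimodal KG in which text-derived and image-derived nodes attach to a shared entity node. All node names, relations, and the `cross_modal_neighbors` helper are hypothetical and purely illustrative; the sketch assumes the `networkx` library.

```python
import networkx as nx

# Hypothetical fused multimodal KG: text-stream and image-stream facts
# attach to a shared entity hub, so a single query can span both modalities.
kg = nx.MultiDiGraph()

# Entity node shared by both modalities (illustrative name)
kg.add_node("car:BMW_X5", modality="entity")

# Text-stream fact (e.g., extracted from the video transcript)
kg.add_node("feature:panoramic_roof", modality="text")
kg.add_edge("car:BMW_X5", "feature:panoramic_roof", relation="has_feature")

# Image-stream fact (e.g., an object detected in a video frame)
kg.add_node("frame_00123:sunroof", modality="image")
kg.add_edge("frame_00123:sunroof", "car:BMW_X5", relation="depicts")

def cross_modal_neighbors(graph, entity):
    """Collect facts about an entity, grouped by the modality they came from."""
    facts = {"text": [], "image": []}
    # Incoming edges: the source node's modality tells us which stream it is from
    for u, _v, data in graph.in_edges(entity, data=True):
        m = graph.nodes[u].get("modality")
        if m in facts:
            facts[m].append((u, data["relation"]))
    # Outgoing edges: group by the target node's modality
    for _u, v, data in graph.out_edges(entity, data=True):
        m = graph.nodes[v].get("modality")
        if m in facts:
            facts[m].append((v, data["relation"]))
    return facts

facts = cross_modal_neighbors(kg, "car:BMW_X5")
```

Here one modality corroborates the other: the transcript mentions a panoramic roof while a frame depicts a sunroof, and both facts are reachable from the same entity node in a single query.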