In today's digital world, video summarization has become an important task in multimedia analysis, driven largely by the exponential growth in on-demand consumption of multimedia content, encompassing audio, video, and images, across digital platforms. Automatic video summarization is the process of creating a brief synopsis that presents a video's most useful and relevant elements, so that viewers can quickly grasp its main idea without watching the entire recording. Currently, the video segments included in the final summary are selected in a variety of ways; the core task is to analyze the video's multimedia data for relevant cues that aid decision-making. The method proposed in this paper, TAVM (text, audio, and video mode), generates a video summary from three multimedia elements: text, audio, and frames. TAVM comprises three stages. First, Video Processing employs the BEiT vision transformer to recognize objects within the selected frames. Second, Audio Processing transcribes the audio content using a speech-to-text converter. Finally, the Summary Builder uses the GPT-3-based OpenAI API to generate a summary of the content. Experimental analysis on the benchmark SumMe dataset demonstrates the effectiveness of the proposed approach.
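The three-stage pipeline described above can be sketched as follows. This is a minimal illustrative skeleton, not the authors' implementation: the function names are hypothetical, and the bodies are stand-ins for the actual BEiT model, speech-to-text converter, and GPT-3 call.

```python
# Illustrative sketch of the TAVM pipeline structure (hypothetical names).
# In a real system, each stand-in below would be replaced by the actual
# component: BEiT object recognition, speech-to-text, and a GPT-3 request.

def recognize_objects(frame):
    # Stand-in for BEiT-based object recognition on a sampled frame.
    return ["person", "bicycle"]

def transcribe_audio(audio_track):
    # Stand-in for speech-to-text transcription of the audio content.
    return "a cyclist rides through the park"

def build_summary(objects, transcript):
    # Stand-in for the GPT-3-based Summary Builder: in practice the
    # visual and audio cues would be fused into a prompt sent to the API.
    return f"Scene with {', '.join(objects)}: {transcript}"

def tavm_summarize(frame, audio_track):
    # Video Processing -> Audio Processing -> Summary Builder.
    objects = recognize_objects(frame)
    transcript = transcribe_audio(audio_track)
    return build_summary(objects, transcript)

print(tavm_summarize(frame=None, audio_track=None))
```

The sketch only shows how the three stages compose; selecting which frames and segments to process is the selection problem the abstract refers to.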