With the rapid development of video and network technology, the number of social short videos has grown significantly, and traditional media processing methods can no longer meet practical application needs. To address the difficulty of clustering social short videos effectively, a media fusion method is proposed: a clustering algorithm for network short videos based on a text model and a video model. First, the text information of each video, including the video title, related query words, and total clicks, is extracted to construct the text model. Then, a video representation model is constructed according to the characteristics of short videos to capture video content information. Finally, the text information and the video representation model are fused to cluster the short videos.
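The three steps described above (text model, video representation, fusion followed by clustering) can be illustrated with a minimal sketch. Everything concrete in it is an assumption made for illustration and is not taken from the proposed method: the bag-of-words text vector, the two-dimensional placeholder video features, the fusion weight `alpha`, the weighted-concatenation fusion, and plain k-means as the clustering step. Click information is left out for brevity.

```python
import math
from collections import Counter

def text_vector(title, queries, vocab):
    """Bag-of-words counts over title and query words (assumed text model)."""
    counts = Counter((title + " " + " ".join(queries)).lower().split())
    return [float(counts[w]) for w in vocab]

def l2_normalize(vec):
    """Scale a vector to unit length so both views contribute comparably."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def fuse(text_vec, video_vec, alpha=0.5):
    """Weighted concatenation of the two normalized views (alpha is hypothetical)."""
    t = [alpha * x for x in l2_normalize(text_vec)]
    v = [(1 - alpha) * x for x in l2_normalize(video_vec)]
    return t + v

def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means; the first k points serve as initial centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by Euclidean distance.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Toy data: (title, related query words, placeholder video features).
videos = [
    ("funny cat jumps", ["cat", "funny"], [0.9, 0.1]),
    ("football goal highlights", ["football", "goal"], [0.1, 0.9]),
    ("cute cat video", ["cat"], [0.8, 0.2]),
    ("best football goal", ["goal"], [0.2, 0.8]),
]

# Vocabulary built from all titles and query words.
vocab = sorted({w for title, queries, _ in videos
                for w in (title + " " + " ".join(queries)).lower().split()})

fused = [fuse(text_vector(title, queries, vocab), feats)
         for title, queries, feats in videos]
labels = kmeans(fused, k=2)
print(labels)  # [0, 1, 0, 1]
```

In this toy setting the two cat videos and the two football videos fall into separate clusters. Normalizing each view before concatenation keeps either modality from dominating the distance only because of its scale; in practice the video features would come from a learned representation model rather than hand-written numbers.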