With the rise of the internet, multimedia applications have become popular and expanded rapidly, making the retrieval of user-specified video content an important problem. Because different users phrase queries for the same content differently, a single-turn search often fails to return a satisfactory result. We therefore extend typical single-turn multimodal video clip retrieval to a multi-turn system. To this end, video data are transformed into feature vectors. At each turn, the user enters a query sentence and the system returns six candidate video clips. The system recommends clips based on the results of a clustering algorithm, and retrieval benefits from the similarity feedback of the previous turn. The results show that the proposed method outperforms single-turn retrieval in terms of recall and similarity.
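As an illustration only, one retrieval turn of such a system might be sketched as follows. The abstract does not specify the encoder, the clustering algorithm, or how the feedback is combined, so everything here is an assumption: the function names `cosine_scores` and `retrieve_turn`, the use of cosine similarity, the additive feedback weight `alpha`, and the rule of picking the best clip from each cluster before filling the remaining slots are all hypothetical choices. Only the candidate count of six comes from the text.

```python
import numpy as np

def cosine_scores(query_vec, clip_vecs):
    """Cosine similarity between one query vector and each clip vector (rows)."""
    q = query_vec / np.linalg.norm(query_vec)
    c = clip_vecs / np.linalg.norm(clip_vecs, axis=1, keepdims=True)
    return c @ q

def retrieve_turn(query_vec, clip_vecs, cluster_labels,
                  prev_scores=None, k=6, alpha=0.5):
    """Return k candidate clip indices and the scores used for this turn.

    prev_scores: per-clip scores from the previous turn (assumed form of
    the similarity feedback); alpha is a hypothetical mixing weight.
    """
    scores = cosine_scores(query_vec, clip_vecs)
    if prev_scores is not None:
        # Assumed feedback rule: add a weighted copy of last turn's scores.
        scores = scores + alpha * prev_scores

    order = np.argsort(-scores, kind="stable")  # best score first
    picked, seen_clusters = [], set()

    # First pass: take the best-scoring clip from each cluster.
    for i in order:
        if len(picked) == k:
            break
        label = cluster_labels[i]
        if label not in seen_clusters:
            seen_clusters.add(label)
            picked.append(int(i))

    # Second pass: fill any remaining slots by score alone.
    for i in order:
        if len(picked) == k:
            break
        if int(i) not in picked:
            picked.append(int(i))

    return picked, scores
```

In a multi-turn session under these assumptions, the `scores` returned by one call would be passed as `prev_scores` to the next call together with the newly encoded query, so that each turn is informed by the previous one.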