Joint video moment retrieval and highlight detection aims to find the moments in a video that are relevant to a natural-language query, together with the video's highlight clips. It is an emerging task, although its constituent problems have been studied separately for some time. Current methods use transformers for cross-modal interaction, which, despite strong performance, incurs a large cost in parameters and computation. To address this problem, we present a cross-modal attention mechanism that captures related features from different modalities with few parameters. Building on it, we propose a lightweight multi-modal interaction model (MIM) that solves video moment retrieval and highlight detection jointly. While greatly reducing the number of parameters, our method achieves competitive performance and faster convergence than previous methods. Extensive experiments on four datasets demonstrate the effectiveness of our method.
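To make the idea of cross-modal attention concrete, below is a minimal NumPy sketch of a generic single-head cross-attention in which video-clip features attend to text-token features through small shared projections. All names, dimensions, and the low-rank projection size `d_k` are illustrative assumptions, not the paper's actual architecture; they only show how a small projection dimension keeps the parameter count low.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(video, text, W_q, W_k, W_v):
    """Generic cross-attention sketch (illustrative, not the paper's model).

    video: (T, d) clip features; text: (L, d) token features.
    Video clips (queries) attend to text tokens (keys/values).
    """
    q = video @ W_q                            # (T, d_k)
    k = text @ W_k                             # (L, d_k)
    v = text @ W_v                             # (L, d_k)
    scores = q @ k.T / np.sqrt(W_q.shape[1])   # (T, L) similarity
    attn = softmax(scores, axis=-1)            # each clip's weights over tokens
    return attn @ v                            # (T, d_k) text-conditioned clip features

rng = np.random.default_rng(0)
d, d_k = 32, 8  # small shared dim d_k keeps the projection parameters few
video = rng.standard_normal((10, d))
text = rng.standard_normal((5, d))
W_q, W_k, W_v = (rng.standard_normal((d, d_k)) * 0.1 for _ in range(3))
out = cross_modal_attention(video, text, W_q, W_k, W_v)
print(out.shape)  # (10, 8)
```

With projections of size `d x d_k` rather than `d x d`, the attention block uses `3 * d * d_k` parameters instead of `3 * d^2`, which is the kind of saving a lightweight interaction module targets.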