Existing video understanding models typically extract information uniformly across space, which tends to overlook important visual semantics. Moreover, because positive and negative samples cannot be accurately distinguished, visual semantics and predicted content cannot be precisely aligned, leading to inaccurate descriptions. In this paper, we design a method that highlights important features, together with a new semantic alignment loss function, to improve description accuracy. First, image frames are mapped into feature vectors; a selection network learns and selects important features based on inter-frame relationships, and these features are then extracted through a fully connected layer. Second, by training with negative samples, the decoder learns to effectively identify hard samples; on this basis, the new semantic alignment loss function adaptively assigns weights to the loss computed from negative samples, strengthening the semantic relevance between text and images. Experimental results on MSVD, a dataset widely used in this field, show that our method significantly improves the accuracy of video descriptions and outperforms existing models on every metric.
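The adaptive weighting idea can be sketched as a contrastive alignment loss in which harder negatives (captions more similar to the video) receive larger weights in the denominator. This is a minimal illustrative sketch, not the paper's exact formulation: the function name, the softmax-style weighting, and the hyperparameters `tau` and `gamma` are all assumptions introduced here for illustration.

```python
import numpy as np

def adaptive_alignment_loss(video_emb, text_embs, pos_idx, tau=0.07, gamma=1.0):
    """Hypothetical sketch of an adaptively weighted semantic alignment loss.

    video_emb: (d,) video feature vector.
    text_embs: (n, d) candidate caption embeddings; the one at pos_idx is the
    positive (ground-truth) caption, the rest are negatives. Hard negatives,
    i.e. negatives with high similarity to the video, are up-weighted so that
    they contribute more to the contrastive penalty.
    """
    # Cosine similarities between the video and each caption embedding.
    v = video_emb / np.linalg.norm(video_emb)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = t @ v                                   # shape (n,)

    pos = np.exp(sims[pos_idx] / tau)
    neg_sims = np.delete(sims, pos_idx)

    # Adaptive weights: a softmax over negative similarities (sharpened by
    # gamma), renormalized to mean 1, so hard negatives dominate the loss.
    w = np.exp(gamma * neg_sims)
    w = w / w.sum() * len(neg_sims)

    neg = np.sum(w * np.exp(neg_sims / tau))
    # InfoNCE-style objective: maximize the positive's share of the total.
    return -np.log(pos / (pos + neg))
```

As the positive caption's similarity to the video rises while the negatives stay fixed, the loss decreases, which is the behavior the alignment objective relies on.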