Traditional video-description methods usually extract temporal and spatial features independently, without modeling the correlation between them, so the resulting features are neither comprehensive nor well correlated. In addition, fixed network hyperparameters cause the model to over-rely on particular parameter weights during training, which makes it prone to overfitting and produces inaccurate descriptions. In this paper, we design a feature extraction network based on spatiotemporal attention, together with a new pruning strategy, to improve description accuracy. First, a temporal attention mechanism selects the key frames of the video; from each key frame we then extract both salient-region features and background features, and a spatial fusion function tightly couples the two before the final image features are extracted. Second, a variational dropout method lets the model adaptively adjust the dropout rate of each neuron toward an optimal value, which effectively alleviates overfitting and makes the generated descriptions more accurate. Experimental results on MSVD, a dataset widely used in this field, show that the proposed method significantly improves the accuracy of video description.
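To make the two-stage feature pipeline concrete, the following is a minimal PyTorch sketch, not the authors' implementation: the class names (TemporalAttention, SpatialFusion), the top-k key-frame selection, and the gated fusion are all illustrative assumptions about how "temporal attention plus spatial fusion" could be realized.

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Scores each frame feature and keeps the top-k as key frames.

    The top-k selection is an assumption; the paper only states that
    key frames are chosen via a temporal attention mechanism.
    """
    def __init__(self, feat_dim, num_key_frames=8):
        super().__init__()
        self.score = nn.Linear(feat_dim, 1)
        self.num_key_frames = num_key_frames

    def forward(self, frame_feats):                           # (B, T, D)
        weights = torch.softmax(self.score(frame_feats).squeeze(-1), dim=-1)  # (B, T)
        # Keep the k highest-weighted frames as key frames.
        topk = weights.topk(self.num_key_frames, dim=-1).indices              # (B, k)
        idx = topk.unsqueeze(-1).expand(-1, -1, frame_feats.size(-1))
        return frame_feats.gather(1, idx)                     # (B, k, D)

class SpatialFusion(nn.Module):
    """Fuses salient-region and background features of a key frame."""
    def __init__(self, feat_dim):
        super().__init__()
        self.gate = nn.Linear(2 * feat_dim, feat_dim)

    def forward(self, region_feat, background_feat):
        # Gated fusion: a learned gate weighs region cues against
        # background cues, coupling the two feature streams.
        g = torch.sigmoid(self.gate(torch.cat([region_feat, background_feat], dim=-1)))
        return g * region_feat + (1 - g) * background_feat
```

A gated sum is only one plausible choice of "spatial fusion function"; concatenation followed by a projection, or bilinear pooling, would fit the description equally well.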
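The adaptive dropout-rate mechanism described above is consistent with sparse variational dropout, where each weight learns its own dropout rate and high-rate weights can be pruned. Below is a minimal sketch under that assumption, following the parameterization of Molchanov et al. (2017); the class name, the alpha threshold, and the KL constants are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VariationalDropoutLinear(nn.Module):
    """Linear layer with a learned, per-weight dropout rate.

    alpha = sigma^2 / theta^2 is learned through log_sigma2, so each
    connection settles on its own dropout rate; connections whose
    learned rate exceeds a threshold are pruned at inference time.
    """
    def __init__(self, in_features, out_features, alpha_threshold=3.0):
        super().__init__()
        self.theta = nn.Parameter(torch.empty(out_features, in_features))
        self.log_sigma2 = nn.Parameter(torch.full((out_features, in_features), -10.0))
        self.bias = nn.Parameter(torch.zeros(out_features))
        self.alpha_threshold = alpha_threshold
        nn.init.kaiming_uniform_(self.theta)

    @property
    def log_alpha(self):
        return self.log_sigma2 - 2.0 * torch.log(self.theta.abs() + 1e-8)

    def forward(self, x):
        if self.training:
            # Local reparameterization: sample the pre-activation directly.
            mean = F.linear(x, self.theta, self.bias)
            var = F.linear(x ** 2, self.log_sigma2.exp())
            return mean + var.clamp(min=1e-8).sqrt() * torch.randn_like(mean)
        # At test time, drop weights whose learned dropout rate is too high.
        mask = (self.log_alpha < self.alpha_threshold).float()
        return F.linear(x, self.theta * mask, self.bias)

    def kl(self):
        # Approximate KL term (Molchanov et al., 2017); added to the loss,
        # it pushes uninformative weights toward high dropout rates.
        la = self.log_alpha
        k1, k2, k3 = 0.63576, 1.87320, 1.48695
        return -(k1 * torch.sigmoid(k2 + k3 * la) - 0.5 * F.softplus(-la) - k1).sum()
```

In training, the sum of kl() over all such layers would be added to the captioning loss, so that the dropout rates, and hence the effective network capacity, are tuned jointly with the description objective.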