In speech instruction recognition, deep learning has significantly improved recognition performance and has become a research hotspot. However, as the scale of data grows, a single model struggles to achieve the desired classification performance. To address this problem, a speech instruction recognition method based on Stacking ensemble learning is proposed, which combines deep learning with ensemble learning. First, the speech data are preprocessed and several different audio features are extracted. Multiple deep models are then built as primary classifiers, and different audio features are fed into different primary classifiers for training. Finally, a secondary classifier is constructed from a SoftMax regression model; it takes the outputs of the primary classifiers as its input and is trained with the Stacking ensemble algorithm to produce the final recognition result for the speech instruction. Speech instruction recognition experiments on large-scale datasets demonstrate the effectiveness of the method.
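
The two-level structure described above can be illustrated with a minimal sketch. It is not the authors' implementation and rests on several assumptions: the primary classifiers are small SoftMax regression models standing in for the deep models, the two feature "views" are synthetic data standing in for different audio features, and the secondary classifier is trained on out-of-fold predictions, which is a common Stacking practice that the abstract does not confirm. All names (`SoftmaxRegression`, `stacking_fit`, `stacking_predict`, `make_toy_views`) are hypothetical.

```python
import numpy as np


def softmax(z):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


class SoftmaxRegression:
    """Multinomial logistic regression trained by batch gradient descent."""

    def __init__(self, n_classes, lr=0.1, epochs=200):
        self.n_classes, self.lr, self.epochs = n_classes, lr, epochs

    def fit(self, X, y):
        n, d = X.shape
        self.W = np.zeros((d, self.n_classes))
        self.b = np.zeros(self.n_classes)
        Y = np.eye(self.n_classes)[y]              # one-hot targets
        for _ in range(self.epochs):
            G = (softmax(X @ self.W + self.b) - Y) / n   # cross-entropy gradient
            self.W -= self.lr * (X.T @ G)
            self.b -= self.lr * G.sum(axis=0)
        return self

    def predict_proba(self, X):
        return softmax(X @ self.W + self.b)


def stacking_fit(views, y, n_classes, n_folds=5, seed=0):
    """views: list of feature matrices, one per primary classifier."""
    n = len(y)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), n_folds)
    meta_X = np.zeros((n, n_classes * len(views)))
    for k, X in enumerate(views):
        cols = slice(k * n_classes, (k + 1) * n_classes)
        for held_out in folds:
            train = np.setdiff1d(np.arange(n), held_out)
            clf = SoftmaxRegression(n_classes).fit(X[train], y[train])
            # Assumption: out-of-fold probabilities become the meta-features.
            meta_X[held_out, cols] = clf.predict_proba(X[held_out])
    # Refit each primary classifier on all data for use at prediction time.
    primaries = [SoftmaxRegression(n_classes).fit(X, y) for X in views]
    secondary = SoftmaxRegression(n_classes).fit(meta_X, y)
    return primaries, secondary


def stacking_predict(primaries, secondary, views):
    meta_X = np.hstack([c.predict_proba(X) for c, X in zip(primaries, views)])
    return secondary.predict_proba(meta_X).argmax(axis=1)


def make_toy_views(n=150, n_classes=3, dims=(4, 6), seed=0):
    """Hypothetical stand-in for audio features: one Gaussian blob per class."""
    rng = np.random.default_rng(seed)
    y = rng.integers(0, n_classes, size=n)
    views = [4.0 * np.eye(n_classes, d)[y] + rng.normal(size=(n, d)) for d in dims]
    return y, views


if __name__ == "__main__":
    y, views = make_toy_views()
    primaries, secondary = stacking_fit(views, y, n_classes=3)
    pred = stacking_predict(primaries, secondary, views)
    print("training accuracy:", (pred == y).mean())
```

In this sketch each primary classifier sees only its own feature view, and the secondary SoftMax regression sees only the concatenated class probabilities, so swapping a primary model for a deep network would only require an object exposing the same `fit` and `predict_proba` methods.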