Compared with traditional multi-stage speech recognition systems, speech recognition models based on deep neural networks offer promising accuracy and speed and can convert speech to text end to end. However, such models are difficult to train in practice, which limits their achievable performance. The method proposed in this paper combines the advantages of both approaches by dividing the recognition task into two stages, each handled by a deep neural network: the speech sequence is first converted into a phoneme sequence, and the phoneme sequence is then converted into a character sequence. By training each stage with its own loss function and dataset, the recognition process can be controlled more finely to achieve higher recognition accuracy. On the widely used open-source AiShell-1 Mandarin speech dataset, the convolution-based acoustic model of the first stage achieves a phoneme error rate of 1.90%, and the language model of the second stage, based on a Bi-LSTM with self-attention, achieves a character accuracy of 99.4%. Finally, the complete model, with only 15M parameters, reaches a speech recognition character error rate (CER) as low as 4.15%, achieving state-of-the-art accuracy.
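
The two-stage pipeline described above can be sketched as the composition of two independently trained models. The sketch below is purely illustrative: the lookup tables stand in for the paper's convolutional acoustic model and Bi-LSTM/self-attention language model, and all frame and phoneme names are invented placeholders, not the authors' implementation.

```python
def acoustic_model(speech_frames):
    """Stage 1 (stand-in): map speech frames to a phoneme sequence.
    In the paper this is a convolutional network trained on AiShell-1
    with its own loss; here a toy lookup table takes its place."""
    frame_to_phoneme = {"f0": "n", "f1": "i3", "f2": "h", "f3": "ao3"}
    return [frame_to_phoneme[f] for f in speech_frames]


def language_model(phonemes):
    """Stage 2 (stand-in): map the phoneme sequence to characters.
    The paper uses a Bi-LSTM with self-attention; here a toy
    syllable-to-character dictionary takes its place."""
    syllable_to_char = {("n", "i3"): "你", ("h", "ao3"): "好"}
    chars = []
    for i in range(0, len(phonemes), 2):  # pair initial + final per syllable
        chars.append(syllable_to_char[tuple(phonemes[i:i + 2])])
    return "".join(chars)


def recognize(speech_frames):
    # End-to-end recognition = stage 1 followed by stage 2.
    return language_model(acoustic_model(speech_frames))


print(recognize(["f0", "f1", "f2", "f3"]))  # → 你好
```

The design point the sketch captures is that the two stages communicate only through the phoneme sequence, so each can be trained, evaluated, and replaced independently.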