Deep Speaker Embeddings with Convolutional Neural Network on Supervector for Text-Independent Speaker Recognition
- Resource Type
- Conference
- Authors
- Cai, Danwei; Cai, Zexin; Li, Ming
- Source
- 2018 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Nov. 2018, pp. 1478-1482
- Subject
- Bioengineering
- Communication, Networking and Broadcast Technologies
- Components, Circuits, Devices and Systems
- Signal Processing and Analysis
- Phonetics
- Mel frequency cepstral coefficient
- Decision trees
- Correlation
- Task analysis
- Indexes
- Speaker verification
- text-independent
- CNN
- supervector
- deep speaker embedding
- Language
- ISSN
- 2640-0103
Lexical content variability across utterances is the key challenge in text-independent speaker verification. In this paper, we investigate using a supervector, which can reduce the impact of lexical content mismatch among utterances, for supervised speaker embedding learning. A DNN acoustic model aligns each feature sequence to a set of senones and generates a supervector of centered and normalized first-order statistics. Statistics vectors from similar senones are placed together and reshaped into an image to maintain local continuity and correlation. The supervector image is then fed into a residual convolutional neural network. The deep speaker embeddings are the outputs of the last hidden layer of the network, and we employ a PLDA back-end for the subsequent modeling. Experimental results show that the proposed method outperforms the conventional GMM-UBM i-vector system and is complementary to the DNN-UBM i-vector system. The score-level fusion system achieves 1.26% EER and 0.260 DCF10 cost on the NIST SRE 10 extended core condition 5 task.
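The statistics computation described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes frame-level acoustic features, per-frame senone posteriors from a DNN acoustic model, per-senone means, and a precomputed permutation that groups similar senones; the function and argument names are hypothetical.

```python
import numpy as np

def supervector_image(feats, posteriors, senone_means, senone_order):
    """Centered, normalized first-order statistics, grouped by senone similarity.

    feats:        (T, D) frame-level acoustic features
    posteriors:   (T, C) per-frame senone posteriors from a DNN acoustic model
    senone_means: (C, D) per-senone feature means (assumed precomputed)
    senone_order: (C,)   permutation placing similar senones next to each other
    """
    # Zeroth-order statistics: soft frame counts per senone.
    n = posteriors.sum(axis=0)                                # (C,)
    # First-order statistics: posterior-weighted sums of frames.
    f = posteriors.T @ feats                                  # (C, D)
    # Center by the senone means and normalize by the soft counts.
    centered = f - n[:, None] * senone_means                  # (C, D)
    normalized = centered / np.maximum(n[:, None], 1e-8)      # avoid /0
    # Reorder rows so similar senones are adjacent; the (C, D) matrix
    # is then treated as a one-channel "image" for the residual CNN.
    return normalized[senone_order]
```

With `senone_means` set to zero the result reduces to the posterior-weighted average feature per senone, which makes the normalization step easy to sanity-check.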