A Deep Speaker Embedding Transfer Method for Speaker Verification

  • Kai Zhou
  • Qun Yang
  • Xiusong Sun
  • Shaohan Liu
Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 1074)


Recently, a kind of deep speaker embedding called the x-vector has been proposed. It is extracted from a deep neural network and is considered a strong contender for the next-generation representation for speaker recognition. However, training such DNNs requires a large amount of data, usually thousands of hours. If we want to apply x-vectors to a Mandarin task but have only a small amount of data, we can train the DNNs on another language and fine-tune them with this small dataset. First, we propose a purely data-driven method to transfer DNNs across languages and tasks. Second, we investigate how to choose between training DNNs from scratch and reusing a pre-trained model via the proposed transfer method. To answer this question, in this paper we present the results of adapting an x-vector based speaker verification system from English to Mandarin by fine-tuning the front-end DNNs, and we examine how the performance of the two training strategies improves as the data size increases. Experimental results show that adapting a pre-trained English model with a small amount of Mandarin data can easily reduce the equal error rate (EER). They also demonstrate that a system trained from scratch achieves better performance only when fed enough data. Finally, we test the performance of the two systems in noisy environments and find that the system trained from scratch outperforms the system fine-tuned from a pre-trained model.


Keywords: Transfer learning · Speaker verification · Speaker recognition · Deep learning · Data augmentation
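The transfer recipe the abstract describes (train an x-vector front-end on a large English corpus, then reuse it for Mandarin by replacing the speaker-classification head and fine-tuning with a small learning rate) can be sketched as follows. This is a hedged illustration in PyTorch, not the paper's exact configuration: the layer sizes, speaker counts, and learning rate are assumptions chosen to mirror a typical x-vector topology.

```python
import torch
import torch.nn as nn

class XVectorNet(nn.Module):
    """Simplified x-vector front-end: TDNN frame layers, statistics pooling,
    segment-level embedding layers, and a speaker-classification head."""
    def __init__(self, feat_dim=30, num_speakers=1000):
        super().__init__()
        self.frame_layers = nn.Sequential(
            nn.Conv1d(feat_dim, 512, kernel_size=5, dilation=1), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=2), nn.ReLU(),
            nn.Conv1d(512, 1500, kernel_size=1), nn.ReLU(),
        )
        self.segment6 = nn.Linear(3000, 512)  # the "x-vector" is read out here
        self.segment7 = nn.Linear(512, 512)
        self.output = nn.Linear(512, num_speakers)  # language-specific head

    def embed(self, x):  # x: (batch, feat_dim, frames)
        h = self.frame_layers(x)
        # Statistics pooling: mean and std over the time axis.
        stats = torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)
        return self.segment6(stats)

    def forward(self, x):
        emb = self.embed(x)
        return self.output(torch.relu(self.segment7(torch.relu(emb))))

# Stand-in for the English model (speaker counts here are illustrative).
pretrained = XVectorNet(num_speakers=7000)

# Mandarin model: same front-end, new classification head sized for the
# (assumed) number of Mandarin training speakers.
model = XVectorNet(num_speakers=340)

# Transfer every weight except the old head; strict=False tolerates the
# deliberately missing "output" layer.
state = {k: v for k, v in pretrained.state_dict().items()
         if not k.startswith("output")}
model.load_state_dict(state, strict=False)

# Fine-tune with a small learning rate so the transferred weights move
# only slightly away from the English initialization.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)
```

A variant often worth trying is freezing the frame-level layers (e.g. `p.requires_grad_(False)` on `model.frame_layers.parameters()`) and fine-tuning only the segment-level layers when the Mandarin set is very small.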



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, China
