
On Residual CNN in Text-Dependent Speaker Verification Task

  • Egor Malykh
  • Sergey Novoselov
  • Oleg Kudashev
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10458)

Abstract

Deep learning approaches are still not very common in the speaker verification field. We investigate the possibility of using a deep residual convolutional neural network with spectrograms as input features in the text-dependent speaker verification task. Although we were not able to surpass the baseline system in quality, we achieved fairly good results for such a new approach, obtaining an equal error rate (EER) of 5.23% on the evaluation part of RSR2015. Fusing the baseline and proposed systems outperformed the best individual system by 18% relative.
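To make the two figures quoted in the abstract concrete, the following is a minimal sketch of how an equal error rate and a relative improvement are commonly computed from verification scores. It is not the authors' evaluation code: the function names `equal_error_rate` and `relative_improvement` are hypothetical, the scores are toy values invented for illustration, and the EER is approximated by the threshold at which the false acceptance and false rejection rates are closest.

```python
def equal_error_rate(target_scores, impostor_scores):
    """Approximate the EER (as a fraction) from two lists of trial scores.

    Assumption: higher scores mean "more likely the same speaker".
    """
    thresholds = sorted(set(target_scores) | set(impostor_scores))
    best_gap, eer = None, None
    for t in thresholds:
        # False rejection rate: target trials scored below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance rate: impostor trials scored at or above it.
        far = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        gap = abs(far - frr)
        # Keep the operating point where the two error rates are closest.
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


def relative_improvement(reference_eer, new_eer):
    """Relative reduction of the error rate with respect to a reference."""
    return (reference_eer - new_eer) / reference_eer


# Toy scores, invented for illustration only (not data from the paper).
targets = [0.9, 0.8, 0.3, 0.7]
impostors = [0.1, 0.2, 0.75, 0.4]
print(equal_error_rate(targets, impostors))

# Hypothetical error rates: a drop from 5.0% to 4.0% is a 20% relative gain.
print(relative_improvement(5.0, 4.0))
```

A relative figure such as the 18% reported for fusion is measured against the error rate of the best individual system, not as a difference in absolute percentage points.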

Keywords

Speaker verification, Residual learning, CNN, FFT

Notes

Acknowledgements

This work was financially supported by the Ministry of Education and Science of the Russian Federation, contract 14.578.21.0126 (ID RFMEFI57815X0126).

Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Egor Malykh (1)
  • Sergey Novoselov (1, 2)
  • Oleg Kudashev (1, 2)

  1. ITMO University, St. Petersburg, Russia
  2. STC-innovations Ltd., St. Petersburg, Russia
