Neural Embedding Extractors for Text-Independent Speaker Verification

  • Conference paper
  • In: Speech and Computer (SPECOM 2022)
  • Part of the book series: Lecture Notes in Computer Science (LNAI, volume 13721)

Abstract

In neural network-based speaker verification, the extraction of speaker embeddings plays a vital role. In this work, in order to extract speaker-discriminant utterance-level representations, we propose two neural speaker embedding extractors: a multi-stream hybrid neural network (MSHNN) and an ensemble of neural speaker embedding networks. In the proposed MSHNN approach, an input acoustic feature frame is processed in multiple parallel hybrid neural network (HNN) pipelines, where each stream has a unique dilation rate so that diversified temporal resolutions are incorporated into embedding processing. The proposed ensemble speaker embedding extractor employs a hybrid neural network, a Time Delay Neural Network-Long Short-Term Memory (TDNN-LSTM) hybrid network, and a time delay neural network (TDNN) in parallel, both to include diversified temporal resolutions and to capture the complementarity that exists among the different architectures. A set of speaker verification experiments was carried out on the CNCeleb and VoxCeleb corpora to evaluate the performance of the proposed systems. The proposed multi-stream hybrid neural network performs better than conventional approaches trained on the same dataset, and the ensemble approach yields the best performance in terms of the equal error rate (EER) and minimum detection cost function (minDCF) evaluation metrics.
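To make the multi-stream idea concrete, the following is a minimal NumPy sketch of the scheme the abstract describes: parallel streams, each a TDNN-style layer with its own dilation rate, whose frame-level outputs are concatenated and reduced to an utterance-level embedding by mean-and-standard-deviation statistics pooling. This is an illustrative assumption-laden toy, not the authors' implementation; the layer sizes, single-layer depth, and specific dilation rates (1, 2, 3) are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def dilated_tdnn_layer(frames, weights, dilation):
    """One TDNN-style layer over frames (T, D) with context {-d, 0, +d}."""
    T, D = frames.shape
    out = []
    for t in range(T):
        # gather a 3-frame context at the stream's dilation, clamped at edges
        ctx = [frames[int(np.clip(t + off, 0, T - 1))]
               for off in (-dilation, 0, dilation)]
        out.append(np.maximum(np.concatenate(ctx) @ weights, 0.0))  # ReLU
    return np.stack(out)  # (T, H)

def stats_pool(h):
    """Utterance-level statistics pooling: mean and std over time."""
    return np.concatenate([h.mean(axis=0), h.std(axis=0)])

D, H = 24, 32                      # feature dim, per-stream hidden dim
frames = rng.standard_normal((100, D))   # 100 frames of acoustic features

streams = []
for dilation in (1, 2, 3):         # one parallel stream per dilation rate
    W = rng.standard_normal((3 * D, H)) * 0.1
    streams.append(dilated_tdnn_layer(frames, W, dilation))

fused = np.concatenate(streams, axis=1)  # (T, 3*H) frame-level fusion
embedding = stats_pool(fused)            # (2 * 3 * H,) utterance embedding
print(embedding.shape)  # (192,)
```

In a real extractor each stream would be a deep (convolutional) network and the pooled vector would pass through further dense layers before the embedding is taken, but the sketch shows how distinct dilation rates give each stream a different temporal receptive field over the same input frames.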



Acknowledgments

The authors wish to acknowledge funding from the Government of Canada’s New Frontiers in Research Fund (NFRF) through grant NFRFR-2021-00338, the Natural Sciences and Engineering Research Council of Canada (NSERC) through grant RGPIN-2019-05381, and the Ministry of Economy and Innovation (MEI) of the Government of Quebec for its continued support.

Author information

Corresponding author

Correspondence to Jahangir Alam.


Copyright information

© 2022 Springer Nature Switzerland AG

About this paper


Cite this paper

Alam, J., Kang, W., Fathan, A. (2022). Neural Embedding Extractors for Text-Independent Speaker Verification. In: Prasanna, S.R.M., Karpov, A., Samudravijaya, K., Agrawal, S.S. (eds) Speech and Computer. SPECOM 2022. Lecture Notes in Computer Science, vol 13721. Springer, Cham. https://doi.org/10.1007/978-3-031-20980-2_2

  • DOI: https://doi.org/10.1007/978-3-031-20980-2_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-20979-6

  • Online ISBN: 978-3-031-20980-2

  • eBook Packages: Computer Science (R0)
