Abstract
In neural network-based speaker verification, the extraction of speaker embeddings plays a vital role. In this work, to extract speaker-discriminant utterance-level representations, we propose two neural speaker embedding extractors: a multi-stream hybrid neural network (MSHNN) and an ensemble of neural speaker embedding networks. In the proposed MSHNN approach, each input acoustic feature frame is processed by multiple parallel hybrid neural network (HNN) pipelines, where each stream has a unique dilation rate so that embedding extraction incorporates diversified temporal resolutions. The proposed ensemble extractor employs a hybrid neural network, a Time Delay Neural Network - Long Short-Term Memory (TDNN-LSTM) hybrid network, and a time delay neural network (TDNN) in parallel, both to include diversified temporal resolutions and to capture the complementarity that exists among the different architectures. Speaker verification experiments were carried out on the CNCeleb and VoxCeleb corpora to evaluate the performance of the proposed systems. The proposed multi-stream hybrid neural network outperforms conventional approaches trained on the same dataset, and the ensemble approach yields the best performance in terms of the equal error rate (EER) and minimum detection cost function (minDCF) evaluation metrics.
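To make the multi-stream idea concrete, below is a minimal PyTorch-style sketch. It is an illustrative assumption rather than the paper's exact configuration: the number of streams, layer sizes, the use of a single dilated convolution per stream, and the concatenate-then-pool fusion are all placeholders for the MSHNN described in the abstract.

# Minimal sketch of the multi-stream idea: the same acoustic feature
# frames are processed by parallel 1-D convolutional (TDNN-style)
# streams, each with its own dilation rate, and the stream outputs are
# concatenated before statistics pooling. Layer sizes, the number of
# streams, and the fusion rule are illustrative assumptions, not the
# paper's exact configuration.
import torch
import torch.nn as nn

class MultiStreamBlock(nn.Module):
    def __init__(self, in_dim=80, stream_dim=256, dilations=(1, 2, 3)):
        super().__init__()
        # One TDNN-style stream per dilation rate; different dilations
        # give each stream a different temporal receptive field.
        self.streams = nn.ModuleList(
            nn.Sequential(
                nn.Conv1d(in_dim, stream_dim, kernel_size=3,
                          dilation=d, padding=d),
                nn.ReLU(),
                nn.BatchNorm1d(stream_dim),
            )
            for d in dilations
        )

    def forward(self, x):
        # x: (batch, feat_dim, time) acoustic features, e.g. filter banks.
        # Concatenate the stream outputs along the channel axis.
        return torch.cat([s(x) for s in self.streams], dim=1)

class EmbeddingExtractor(nn.Module):
    def __init__(self, in_dim=80, stream_dim=256, emb_dim=512,
                 dilations=(1, 2, 3)):
        super().__init__()
        self.backbone = MultiStreamBlock(in_dim, stream_dim, dilations)
        # Statistics pooling (mean + std over time) turns frame-level
        # features into one utterance-level vector, doubling the
        # channel dimension before the embedding layer.
        self.embedding = nn.Linear(2 * len(dilations) * stream_dim, emb_dim)

    def forward(self, x):
        h = self.backbone(x)
        stats = torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)
        return self.embedding(stats)

if __name__ == "__main__":
    feats = torch.randn(4, 80, 300)  # 4 utterances, 80-dim, 300 frames
    print(EmbeddingExtractor()(feats).shape)  # torch.Size([4, 512])

For the ensemble extractor, one would analogously train the HNN, TDNN-LSTM, and TDNN embedding networks and combine them, for example by fusing their verification scores; the exact fusion rule is a detail of the paper not reproduced in this sketch.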
Acknowledgments
The authors wish to acknowledge funding from the Government of Canada's New Frontiers in Research Fund (NFRF) through grant NFRFR-2021-00338, from the Natural Sciences and Engineering Research Council of Canada (NSERC) through grant RGPIN-2019-05381, and from the Ministry of Economy and Innovation (MEI) of the Government of Quebec for its continued support.
Copyright information
© 2022 Springer Nature Switzerland AG
About this paper
Cite this paper
Alam, J., Kang, W., Fathan, A. (2022). Neural Embedding Extractors for Text-Independent Speaker Verification. In: Prasanna, S.R.M., Karpov, A., Samudravijaya, K., Agrawal, S.S. (eds) Speech and Computer. SPECOM 2022. Lecture Notes in Computer Science, vol. 13721. Springer, Cham. https://doi.org/10.1007/978-3-031-20980-2_2
DOI: https://doi.org/10.1007/978-3-031-20980-2_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-20979-6
Online ISBN: 978-3-031-20980-2
eBook Packages: Computer Science, Computer Science (R0)