
Speaker recognition with global information modelling of raw waveforms

  • Research Paper
  • Journal of Membrane Computing

Abstract

In recent years, methods that extract speaker-embedding information directly from raw waveforms have received much attention, with strong results achieved by the RawNet3 network. However, RawNet3 relies solely on convolutional neural networks (CNNs) to extract speaker features from the raw waveform; the limited receptive field of the convolutions prevents the model from learning speaker features with long-term dependencies. This paper proposes a novel speaker recognition model with global information modelling of raw waveforms, called GIMR-Net, which is able to capture more speaker features with long-term dependencies. The model uses a transformer structure to extract global information and combines it with a CNN structure that extracts local information, thereby modelling the global information of the raw waveform. Experiments show that the proposed GIMR-Net is effective and outperforms RawNet3 on the Free ST Chinese Mandarin Corpus dataset. Specifically, the equal error rate of GIMR-Net is 1.22, a 12.9% relative improvement over RawNet3. Finally, experiments in babble and factory noise environments verify that the proposed model retains the noise robustness of RawNet3.
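As a sanity check on the numbers in the abstract: the 12.9% figure is a relative reduction in equal error rate (EER), the operating point at which the false-acceptance and false-rejection rates coincide. The sketch below (plain Python, with made-up toy scores; the implied RawNet3 EER of roughly 1.40 is back-solved from the reported relative gain, not stated in the paper) shows how EER is computed from verification scores and how the relative improvement is derived.

```python
def eer(genuine, impostor):
    """Equal error rate: the operating point where the false-acceptance
    rate (FAR) and false-rejection rate (FRR) are closest; returned in %."""
    best_gap, best_rate = None, None
    for t in sorted(genuine + impostor):          # candidate thresholds
        far = sum(s >= t for s in impostor) / len(impostor)  # impostors accepted
        frr = sum(s < t for s in genuine) / len(genuine)     # genuines rejected
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_rate = gap, (far + frr) / 2
    return 100 * best_rate

# Toy verification scores (hypothetical; higher = more likely the same speaker).
genuine = [0.9, 0.8, 0.75, 0.6, 0.3]
impostor = [0.7, 0.4, 0.35, 0.2, 0.1]
print(f"toy EER: {eer(genuine, impostor):.1f}%")  # 20.0% on these toy scores

# Relative improvement as reported in the abstract: 1.22 for GIMR-Net vs an
# implied ~1.40 for RawNet3 (back-solved from the 12.9% relative gain).
print(f"relative improvement: {(1.40 - 1.22) / 1.40 * 100:.1f}%")  # 12.9%
```

In practice the EER is estimated from many thousands of trial scores, and toolkits usually interpolate the ROC curve rather than scanning raw thresholds as this toy version does.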



Data availability

The data related to this work are available upon reasonable request.

References

  1. Kabir, M. M., Mridha, M. F., Shin, J., Jahan, I., & Ohi, A. Q. (2021). A survey of speaker recognition: Fundamental theories, recognition methods and opportunities. IEEE Access, 9, 79236–79263.

  2. Fechner, G. T. (1948). Elements of psychophysics. (Original work published 1860).

  3. Snyder, D., Garcia-Romero, D., Sell, G., Povey, D., & Khudanpur, S. (2018). X-vectors: Robust DNN embeddings for speaker recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 5329–5333.

  4. Desplanques, B., Thienpondt, J., & Demuynck, K. (2020). ECAPA-TDNN: Emphasized channel attention, propagation and aggregation in TDNN based speaker verification. arXiv:2005.07143.

  5. Chung, J. S., Nagrani, A., & Zisserman, A. (2018). VoxCeleb2: Deep speaker recognition. arXiv:1806.05622.

  6. Cai, W., Chen, J., Zhang, J., & Li, M. (2020). On-the-fly data loader and utterance-level aggregation for speaker and language recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 28, 1038–1051.

  7. Dehak, N., Kenny, P. J., Dehak, R., Dumouchel, P., & Ouellet, P. (2010). Front-end factor analysis for speaker verification. IEEE Transactions on Audio, Speech, and Language Processing, 19(4), 788–798.

  8. Baevski, A., Zhou, Y., Mohamed, A., & Auli, M. (2020). wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems, 33, 12449–12460.

  9. Sainath, T. N., Weiss, R. J., Wilson, K., Senior, A., & Vinyals, O. (2015). Learning the speech front-end with raw waveform CLDNNs. In Interspeech 2015.

  10. Ravanelli, M., & Bengio, Y. (2018). Speaker recognition from raw waveform with SincNet. In 2018 IEEE Spoken Language Technology Workshop (SLT), 1021–1028.

  11. Oglic, D., Cvetkovic, Z., Bell, P., & Renals, S. (2020). A deep 2D convolutional network for waveform-based speech recognition. In Interspeech 2020, 1654–1658.

  12. Pariente, M., Cornell, S., Deleforge, A., & Vincent, E. (2020). Filterbank design for end-to-end speech separation. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 6364–6368.

  13. Jung, J. W., Heo, H. S., Yang, I. H., Shim, H. J., & Yu, H. J. (2018). A complete end-to-end speaker verification system using deep neural networks: From raw signals to verification result. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 5349–5353.

  14. Muckenhirn, H., Magimai-Doss, M., & Marcel, S. (2018). Towards directly modeling raw speech signal for speaker verification using CNNs. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 4884–4888.

  15. Zhu, G., Jiang, F., & Duan, Z. (2020). Y-vector: Multiscale waveform encoder for speaker embedding. arXiv:2010.12951.

  16. Jung, J. W., Kim, Y. J., Heo, H. S., Lee, B. J., Kwon, Y., & Chung, J. S. (2022). Pushing the limits of raw waveform speaker recognition. arXiv:2203.08488.

  17. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N. & Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems, 30.

  18. Zhou, H., Zhang, S., Peng, J., Zhang, S., Li, J., Xiong, H., & Zhang, W. (2021). Informer: Beyond efficient transformer for long sequence time-series forecasting. In Proceedings of the AAAI conference on artificial intelligence, 35(12), 11106–11115.


  19. Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z. & Guo, B. (2021). Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, 10012–10022.

  20. Wang, R., Ao, J., Zhou, L., Liu, S., Wei, Z., Ko, T. & Zhang, Y. (2022). Multi-view self-attention based transformer for speaker recognition. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 6732–6736.

  21. Noé, P. G., Parcollet, T., & Morchid, M. (2020). CGCNN: Complex Gabor convolutional neural network on raw speech. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 7724–7728.

  22. Andén, J., & Mallat, S. (2014). Deep scattering spectrum. IEEE Transactions on Signal Processing, 62(16), 4114–4128.

  23. Balestriero, R., Cosentino, R., Glotin, H., & Baraniuk, R. (2018). Spline filters for end-to-end deep learning. In International Conference on Machine Learning (ICML), PMLR, 364–373.

  24. Jung, J. W., Heo, H. S., Kim, J. H., Shim, H. J., & Yu, H. J. (2019). RawNet: Advanced end-to-end deep neural network using raw waveforms for text-independent speaker verification. arXiv:1904.08104.

  25. Ba, J. L., Kiros, J. R., & Hinton, G. E. (2016). Layer normalization. arXiv:1607.06450.

  26. Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv:1412.6980.

  27. Hoffer, E., Ben-Nun, T., Hubara, I., Giladi, N., Hoefler, T. & Soudry, D. (2020). Augment your batch: Improving generalization through instance repetition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 8129–8138.


Acknowledgements

This work was supported by the National Natural Science Foundation of China (61972324), the Natural Science Foundation of Sichuan of China under Grant 2022NSFSC0462, Sichuan Science and Technology Program (2023NSFSC1985, 2023YFG0046, 2022YFG0181) and Research Fund of Chengdu University of Information Technology (KYTZ202149, KYTD202212).

Author information


Corresponding author

Correspondence to Gexiang Zhang.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Wu, Y., Dong, J., Fang, Z. et al. Speaker recognition with global information modelling of raw waveforms. J Membr Comput 6, 42–51 (2024). https://doi.org/10.1007/s41965-024-00135-2


