
Speaker recognition with global information modelling of raw waveforms

  • Research Paper
  • Journal of Membrane Computing

Abstract

In recent years, methods that extract speaker-embedding information directly from raw waveforms have received much attention, with strong results achieved by the RawNet3 network. However, RawNet3 relies solely on convolutional neural networks (CNNs) to extract speaker features from the raw waveform; the limited receptive field of the convolutions prevents the model from learning speaker features with long-term dependencies. This paper proposes a novel speaker recognition model with global information modelling of raw waveforms, called GIMR-Net, which is able to capture more speaker features with long-term dependencies. The model uses a transformer structure to extract global information and combines it with a CNN structure that extracts local information, thereby modelling the global information of the raw waveform. Experiments show that the proposed GIMR-Net is effective and outperforms RawNet3 on the Free ST Chinese Mandarin Corpus dataset. Specifically, the equal error rate of GIMR-Net is 1.22, a 12.9% relative improvement over RawNet3. Finally, experiments in babble and factory noise environments verify that the proposed model retains the noise robustness of RawNet3.
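As a sanity check on the numbers in the abstract: the 12.9% figure is a relative reduction in equal error rate (EER), the operating point at which the false-acceptance and false-rejection rates coincide. The sketch below (plain Python, with made-up toy scores; the implied RawNet3 EER of roughly 1.40 is back-solved from the reported relative gain, not stated in the paper) shows how EER is computed from verification scores and how the relative improvement is derived.

```python
def eer(genuine, impostor):
    """Equal error rate: the operating point where the false-acceptance
    rate (FAR) and false-rejection rate (FRR) are closest; returned in %."""
    best_gap, best_rate = None, None
    for t in sorted(genuine + impostor):          # candidate thresholds
        far = sum(s >= t for s in impostor) / len(impostor)  # impostors accepted
        frr = sum(s < t for s in genuine) / len(genuine)     # genuines rejected
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_rate = gap, (far + frr) / 2
    return 100 * best_rate

# Toy verification scores (hypothetical; higher = more likely the same speaker).
genuine = [0.9, 0.8, 0.75, 0.6, 0.3]
impostor = [0.7, 0.4, 0.35, 0.2, 0.1]
print(f"toy EER: {eer(genuine, impostor):.1f}%")  # 20.0% on these toy scores

# Relative improvement as reported in the abstract: 1.22 for GIMR-Net vs an
# implied ~1.40 for RawNet3 (back-solved from the 12.9% relative gain).
print(f"relative improvement: {(1.40 - 1.22) / 1.40 * 100:.1f}%")  # 12.9%
```

In practice the EER is estimated from many thousands of trial scores, and toolkits usually interpolate the ROC curve rather than scanning raw thresholds as this toy version does.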



Data availability

The data related to this work are available upon reasonable request.

References

  1. Kabir, M. M., Mridha, M. F., Shin, J., Jahan, I., & Ohi, A. Q. (2021). A survey of speaker recognition: Fundamental theories, recognition methods and opportunities. IEEE Access, 9, 79236–79263.

  2. Fechner, G. T. (1948). Elements of psychophysics. (Original work published 1860).

  3. Snyder, D., Garcia-Romero, D., Sell, G., Povey, D., & Khudanpur, S. (2018). X-vectors: Robust DNN embeddings for speaker recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 5329–5333.

  4. Desplanques, B., Thienpondt, J., & Demuynck, K. (2020). ECAPA-TDNN: Emphasized channel attention, propagation and aggregation in TDNN based speaker verification. arXiv:2005.07143.

  5. Chung, J. S., Nagrani, A., & Zisserman, A. (2018). VoxCeleb2: Deep speaker recognition. arXiv:1806.05622.

  6. Cai, W., Chen, J., Zhang, J., & Li, M. (2020). On-the-fly data loader and utterance-level aggregation for speaker and language recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 28, 1038–1051.

  7. Dehak, N., Kenny, P. J., Dehak, R., Dumouchel, P., & Ouellet, P. (2010). Front-end factor analysis for speaker verification. IEEE Transactions on Audio, Speech, and Language Processing, 19(4), 788–798.

  8. Baevski, A., Zhou, Y., Mohamed, A., & Auli, M. (2020). wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems, 33, 12449–12460.

  9. Sainath, T. N., Weiss, R. J., Wilson, K., Senior, A., & Vinyals, O. (2015). Learning the speech front-end with raw waveform CLDNNs. In Interspeech 2015.

  10. Ravanelli, M., & Bengio, Y. (2018). Speaker recognition from raw waveform with SincNet. In 2018 IEEE Spoken Language Technology Workshop (SLT), 1021–1028.

  11. Oglic, D., Cvetkovic, Z., Bell, P., & Renals, S. (2020). A deep 2D convolutional network for waveform-based speech recognition. In Interspeech 2020, 1654–1658.

  12. Pariente, M., Cornell, S., Deleforge, A., & Vincent, E. (2020). Filterbank design for end-to-end speech separation. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 6364–6368.

  13. Jung, J. W., Heo, H. S., Yang, I. H., Shim, H. J., & Yu, H. J. (2018). A complete end-to-end speaker verification system using deep neural networks: From raw signals to verification result. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 5349–5353.

  14. Muckenhirn, H., Magimai-Doss, M., & Marcel, S. (2018). Towards directly modeling raw speech signal for speaker verification using CNNs. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 4884–4888.

  15. Zhu, G., Jiang, F., & Duan, Z. (2020). Y-vector: Multiscale waveform encoder for speaker embedding. arXiv:2010.12951.

  16. Jung, J. W., Kim, Y. J., Heo, H. S., Lee, B. J., Kwon, Y., & Chung, J. S. (2022). Pushing the limits of raw waveform speaker recognition. arXiv:2203.08488.

  17. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N. & Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems, 30.

  18. Zhou, H., Zhang, S., Peng, J., Zhang, S., Li, J., Xiong, H., & Zhang, W. (2021). Informer: Beyond efficient transformer for long sequence time-series forecasting. In Proceedings of the AAAI conference on artificial intelligence, 35(12), 11106–11115.


  19. Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z. & Guo, B. (2021). Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, 10012–10022.

  20. Wang, R., Ao, J., Zhou, L., Liu, S., Wei, Z., Ko, T. & Zhang, Y. (2022). Multi-view self-attention based transformer for speaker recognition. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 6732–6736.

  21. Noé, P. G., Parcollet, T., & Morchid, M. (2020). CGCNN: Complex Gabor convolutional neural network on raw speech. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 7724–7728.

  22. Andén, J., & Mallat, S. (2014). Deep scattering spectrum. IEEE Transactions on Signal Processing, 62(16), 4114–4128.

  23. Balestriero, R., Cosentino, R., Glotin, H., & Baraniuk, R. (2018). Spline filters for end-to-end deep learning. In International Conference on Machine Learning (ICML), PMLR, 364–373.

  24. Jung, J. W., Heo, H. S., Kim, J. H., Shim, H. J., & Yu, H. J. (2019). RawNet: Advanced end-to-end deep neural network using raw waveforms for text-independent speaker verification. arXiv:1904.08104.

  25. Ba, J. L., Kiros, J. R., & Hinton, G. E. (2016). Layer normalization. arXiv:1607.06450.

  26. Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv:1412.6980.

  27. Hoffer, E., Ben-Nun, T., Hubara, I., Giladi, N., Hoefler, T. & Soudry, D. (2020). Augment your batch: Improving generalization through instance repetition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 8129–8138.


Acknowledgements

This work was supported by the National Natural Science Foundation of China (61972324), the Natural Science Foundation of Sichuan of China under Grant 2022NSFSC0462, Sichuan Science and Technology Program (2023NSFSC1985, 2023YFG0046, 2022YFG0181) and Research Fund of Chengdu University of Information Technology (KYTZ202149, KYTD202212).

Author information


Corresponding author

Correspondence to Gexiang Zhang.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Wu, Y., Dong, J., Fang, Z. et al. Speaker recognition with global information modelling of raw waveforms. J Membr Comput 6, 42–51 (2024). https://doi.org/10.1007/s41965-024-00135-2


