
Relative Significance of Speech Sounds in Speaker Verification Systems

Published in Circuits, Systems, and Signal Processing.

Abstract

Automatic speaker verification (ASV) is the task of authenticating the claimed identity of a speaker from his/her voice characteristics. State-of-the-art ASV systems capture the voice signature of a speaker in a fixed-dimensional embedding. Recent studies have reported that the performance of an ASV system improves when phonetic information obtained from a phoneme recognizer is appended to the frame-level speech representations. This work analyzes the relative significance of the various phonetic classes in extracting speaker-discriminative embeddings. We use a temporal attention mechanism to quantify the importance of different phonetic classes in speaker verification, and we observe that vowels, fricatives, and nasals receive relatively higher attention. This observation is in accordance with earlier subjective studies, which highlight the speaker-discriminative characteristics of vowels and nasals. In the process, we demonstrate the effectiveness of self-supervised phonetic information in extracting robust speaker embeddings. The proposed self-supervised phonetic attentive ASV system achieved a relative improvement of 29.2% over the baseline x-vector system and 19.3% over its supervised counterpart.
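The temporal attention analysis described above can be illustrated with a minimal sketch: frame-level features are scored by a small attention network, the softmax-normalized scores weight the frames into an utterance-level embedding, and averaging those weights within each phonetic class indicates how much attention the class receives. All shapes, parameter names, and the random labels below are illustrative assumptions, not the authors' exact architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

T, D = 200, 64                       # frames, feature dimension
H = rng.standard_normal((T, D))      # frame-level speech representations

# Attention parameters (learnable in practice; random here for illustration).
W = rng.standard_normal((D, D))
v = rng.standard_normal(D)

# One scalar attention score per frame: e_t = v^T tanh(W h_t)
scores = np.tanh(H @ W.T) @ v        # shape (T,)

# Softmax over time: nonnegative weights that sum to 1.
weights = np.exp(scores - scores.max())
weights /= weights.sum()

# Utterance-level embedding as the attention-weighted mean of frames.
embedding = weights @ H              # shape (D,)

# Given frame-level phoneme-class labels (hypothetical here), the mean
# attention weight per class reflects its relative significance.
labels = rng.integers(0, 5, size=T)
class_attention = {c: weights[labels == c].mean() for c in range(5)}
```

Under such an analysis, a class whose frames consistently receive above-average weights (vowels or nasals, per the paper's findings) contributes more to the speaker embedding than one whose frames are down-weighted.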


Notes

  1. The datasets generated and/or analyzed during the current study are available in the VoxCeleb-1 repository, https://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox1.html.

  2. The datasets generated and/or analyzed during the current study are available in the TIMIT repository, https://catalog.ldc.upenn.edu/LDC93s1.


Acknowledgements

This work was supported by the DST National Mission on Interdisciplinary Cyber-Physical Systems (NM-ICPS), Technology Innovation Hub on Autonomous Navigation and Data Acquisition Systems: TiHAN Foundations at the Indian Institute of Technology (IIT), Hyderabad.

Author information

Corresponding author

Correspondence to B. Shaik Mohammad Rafi.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Rafi, B.S.M., Sankala, S. & Murty, K.S.R. Relative Significance of Speech Sounds in Speaker Verification Systems. Circuits Syst Signal Process 42, 5412–5427 (2023). https://doi.org/10.1007/s00034-023-02360-z

