Abstract
Automatic speaker verification (ASV) is the task of authenticating the claimed identity of a speaker from their voice characteristics. State-of-the-art ASV systems capture the voice signature of a speaker in a fixed-dimensional embedding. Recent studies have reported that ASV performance improves when phonetic information obtained from a phoneme recognizer is appended to the frame-level speech representations. This work analyzes the relative significance of the various phonetic classes in extracting speaker-discriminative embeddings. We use a temporal attention mechanism to measure the importance of different phonetic classes in the speaker verification task, and observe that vowels, fricatives, and nasals receive relatively higher attention. This observation is consistent with earlier subjective studies, which highlight the speaker-discriminative characteristics of vowels and nasals. In the process, we demonstrate the effectiveness of self-supervised phonetic information in extracting robust speaker embeddings. The proposed self-supervised phonetic attentive ASV system achieved a relative improvement of 29.2% over the baseline x-vector system and 19.3% over its supervised counterpart.
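The temporal attention mechanism referred to in the abstract can be sketched as attention-weighted pooling of frame-level features into a single utterance-level embedding; the per-frame weights are what make it possible to ask which phonetic classes the model attends to. The following is a minimal NumPy illustration, not the authors' exact architecture: the single-layer scorer and the parameter names `w`, `b`, `v` are assumptions for the sketch.

```python
import numpy as np

def temporal_attention_pooling(frames, w, b, v):
    """Pool frame-level features (T, D) into one embedding (D,).

    A one-layer scorer assigns each frame a scalar score,
    softmax over time turns scores into attention weights,
    and the embedding is the weighted mean of the frames.
    Returns (embedding, per-frame attention weights).
    """
    # Score each frame: v^T tanh(W h_t + b)
    scores = np.tanh(frames @ w + b) @ v          # shape (T,)
    # Numerically stable softmax over the time axis
    alphas = np.exp(scores - scores.max())
    alphas /= alphas.sum()                        # weights sum to 1
    embedding = alphas @ frames                   # weighted mean, shape (D,)
    return embedding, alphas

# Toy example: 50 frames of 16-dim features, 8 hidden units in the scorer
rng = np.random.default_rng(0)
T, D, H = 50, 16, 8
frames = rng.standard_normal((T, D))
w = rng.standard_normal((D, H))
b = np.zeros(H)
v = rng.standard_normal(H)
emb, alphas = temporal_attention_pooling(frames, w, b, v)
```

Averaging the weights `alphas` over frames grouped by phonetic class (e.g., vowels vs. stops, given frame-level phone labels) yields the kind of per-class importance analysis the abstract describes.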
Notes
The datasets generated and/or analyzed during the current study are available in the VoxCeleb-1 repository, https://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox1.html, and the TIMIT repository, https://catalog.ldc.upenn.edu/LDC93s1.
Acknowledgements
This work was supported by the DST National Mission on Interdisciplinary Cyber-Physical Systems (NM-ICPS), Technology Innovation Hub on Autonomous Navigation and Data Acquisition Systems (TiHAN Foundation) at the Indian Institute of Technology (IIT), Hyderabad.
Cite this article
Rafi, B.S.M., Sankala, S. & Murty, K.S.R. Relative Significance of Speech Sounds in Speaker Verification Systems. Circuits Syst Signal Process 42, 5412–5427 (2023). https://doi.org/10.1007/s00034-023-02360-z