Abstract
Automatic speaker verification (ASV) is the task of authenticating the claimed identity of a speaker from their voice characteristics. State-of-the-art ASV systems capture the voice signature of a speaker in a fixed-dimensional embedding. Recent studies have reported that ASV performance improves when phonetic information obtained from a phoneme recognizer is appended to the frame-level speech representations. This work analyzes the relative significance of the various phonetic classes in extracting speaker-discriminative embeddings. We use a temporal attention mechanism to measure the importance of different phonetic classes in the speaker verification task, and observe that vowels, fricatives, and nasals receive relatively higher attention. This observation is consistent with earlier subjective studies, which highlight the speaker-discriminative characteristics of vowels and nasals. In the process, we demonstrate the effectiveness of self-supervised phonetic information in extracting robust speaker embeddings. The proposed self-supervised phonetic attentive ASV system achieved a relative improvement of 29.2% over the baseline x-vector system and 19.3% over its supervised counterpart.
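The temporal attention mechanism referred to in the abstract can be sketched as attention-weighted pooling of frame-level features into a single utterance-level embedding; the per-frame weights are what make it possible to ask which phonetic classes the model attends to. The following is a minimal NumPy illustration, not the authors' exact architecture: the single-layer scorer and the parameter names `w`, `b`, `v` are assumptions for the sketch.

```python
import numpy as np

def temporal_attention_pooling(frames, w, b, v):
    """Pool frame-level features (T, D) into one embedding (D,).

    A one-layer scorer assigns each frame a scalar score,
    softmax over time turns scores into attention weights,
    and the embedding is the weighted mean of the frames.
    Returns (embedding, per-frame attention weights).
    """
    # Score each frame: v^T tanh(W h_t + b)
    scores = np.tanh(frames @ w + b) @ v          # shape (T,)
    # Numerically stable softmax over the time axis
    alphas = np.exp(scores - scores.max())
    alphas /= alphas.sum()                        # weights sum to 1
    embedding = alphas @ frames                   # weighted mean, shape (D,)
    return embedding, alphas

# Toy example: 50 frames of 16-dim features, 8 hidden units in the scorer
rng = np.random.default_rng(0)
T, D, H = 50, 16, 8
frames = rng.standard_normal((T, D))
w = rng.standard_normal((D, H))
b = np.zeros(H)
v = rng.standard_normal(H)
emb, alphas = temporal_attention_pooling(frames, w, b, v)
```

Averaging the weights `alphas` over frames grouped by phonetic class (e.g., vowels vs. stops, given frame-level phone labels) yields the kind of per-class importance analysis the abstract describes.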
Notes
The datasets generated and/or analyzed during the current study are available in the VoxCeleb-1 repository, https://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox1.html, and the TIMIT repository, https://catalog.ldc.upenn.edu/LDC93s1.
Acknowledgements
This work was supported by the DST National Mission on Interdisciplinary Cyber-Physical Systems (NM-ICPS), Technology Innovation Hub on Autonomous Navigation and Data Acquisition Systems (TiHAN Foundation) at the Indian Institute of Technology (IIT), Hyderabad.
Cite this article
Rafi, B.S.M., Sankala, S. & Murty, K.S.R. Relative Significance of Speech Sounds in Speaker Verification Systems. Circuits Syst Signal Process 42, 5412–5427 (2023). https://doi.org/10.1007/s00034-023-02360-z