Short-Utterance-Based Children’s Speaker Verification in Low-Resource Conditions

Aziz, Shahid; Ankita; Shahnawazuddin, S.

doi:10.1007/s00034-023-02535-8

Published: 05 November 2023

Volume 43, pages 1715–1740, (2024)

Circuits, Systems, and Signal Processing

Abstract

Developing an automatic speaker verification (ASV) system for children is extremely challenging because sufficiently large, freely available speech corpora from child speakers do not exist. In contrast, hundreds of hours of speech data from adult speakers are freely available. Consequently, the majority of the speaker verification work reported in the literature deals predominantly with adults’ speech, and only a few works on children’s speech have been published. The challenges in developing a robust ASV system for child speakers are further exacerbated when short utterances are used, a setting that remains largely unexplored for children’s speech. In this paper, we therefore focus on children’s speaker verification using short utterances. To deal with data scarcity, we use several out-of-domain data augmentation techniques. Since the out-of-domain data used in this study come from adult speakers and are acoustically very different from children’s speech, we resort to prosody modification, formant modification, and voice conversion to render the data acoustically similar to children’s speech prior to augmentation. This not only increases the amount of training data but also helps capture the missing target attributes relevant to children’s speech. The proposed data augmentation yields a relative improvement of 33.57% in equal error rate over the baseline system trained solely on the child dataset. We also propose frame-level concatenation of Mel-frequency cepstral coefficients (MFCC) with frequency-domain linear prediction coefficients in order to model the spectral and temporal envelopes simultaneously. Frame-level concatenation is expected to further enhance the discrimination among speakers. When combined with data augmentation, this approach further improves the performance of the speaker verification system. The experimental results support our claims: we achieve an overall relative reduction of 38.04% in equal error rate.
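
As a rough illustration of the two quantities the abstract refers to, the following Python sketch shows what frame-level concatenation of two feature streams and a relative reduction in equal error rate could look like. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names `concat_frame_level` and `relative_reduction`, the truncation to the shorter stream, the toy feature values, and the example error rates are all hypothetical and are not taken from the paper.

```python
from typing import List, Sequence

Frame = Sequence[float]


def concat_frame_level(mfcc: Sequence[Frame], fdlp: Sequence[Frame]) -> List[List[float]]:
    """Join two per-frame feature streams into one longer vector per frame."""
    # Assumption: the two streams may differ by a frame or two at the end,
    # so only the frames present in both streams are kept.
    num_frames = min(len(mfcc), len(fdlp))
    return [list(mfcc[t]) + list(fdlp[t]) for t in range(num_frames)]


def relative_reduction(baseline_eer: float, new_eer: float) -> float:
    """Relative reduction in equal error rate, in percent of the baseline."""
    if baseline_eer <= 0:
        raise ValueError("baseline_eer must be positive")
    return 100.0 * (baseline_eer - new_eer) / baseline_eer


# Toy stand-ins for real features: 3 frames of 3 "MFCC" values
# and 2 frames of 2 "FDLP" values (hypothetical numbers).
mfcc = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9]]
fdlp = [[1.0, 2.0], [3.0, 4.0]]

combined = concat_frame_level(mfcc, fdlp)
print(len(combined), len(combined[0]))  # 2 5

# Hypothetical error rates, chosen only to show the arithmetic.
print(relative_reduction(10.0, 8.0))  # 20.0
```

Each combined frame carries both descriptors side by side, so a downstream model sees the spectral and temporal information for the same time instant in a single vector.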

Data Availability

The datasets generated during and/or analyzed during the current study are available from the corresponding author on reasonable request.

Funding

Funding information is not applicable / no funding was received.

Author information

Authors and Affiliations

Department of Electronics and Communication Engineering, National Institute of Technology Patna, Patna, India
Shahid Aziz, Ankita & S. Shahnawazuddin

Corresponding author

Correspondence to Shahid Aziz.

Ethics declarations

Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Ethical Approval

The work presented in the uploaded manuscript is an original one, and the manuscript is not currently under consideration for publication elsewhere.

Consent for Publication

It is hereby confirmed that the manuscript has been read and approved for submission by all the named authors. It is therefore requested to consider the submitted manuscript for publication in the esteemed journal.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Aziz, S., Ankita & Shahnawazuddin, S. Short-Utterance-Based Children’s Speaker Verification in Low-Resource Conditions. Circuits Syst Signal Process 43, 1715–1740 (2024). https://doi.org/10.1007/s00034-023-02535-8

Received: 22 December 2022
Revised: 07 October 2023
Accepted: 08 October 2023
Published: 05 November 2023
Issue Date: March 2024
DOI: https://doi.org/10.1007/s00034-023-02535-8
