
Towards modeling raw speech in gender identification of children using sincNet over ERB scale

Published in: International Journal of Speech Technology

Abstract

This article presents the results of age-dependent gender identification from raw speech using the recently developed non-native children’s English speech corpus. Convolutional neural networks (CNNs), which can learn low-level speech patterns, can be fed raw waveforms directly rather than relying on hand-crafted features. However, the filters learned by conventional CNNs tend to be noisy because all of their filter coefficients are learned freely. SincNet, in contrast, yields more meaningful filters by replacing the first convolutional layer with a sinc layer. The only learnable parameters in this layer are the low and high cutoff frequencies of a rectangular band-pass filter, which makes it possible to interpret key speaker characteristics such as pitch and formants. In this study, the SincNet model is notably improved by replacing the baseline Mel-scale initialization with an equivalent rectangular bandwidth (ERB) initialization, which has the additional benefit of assigning extra filters to the lower spectral region. The results also indicate that the SincNet model is particularly well suited to identifying the gender of children. In age-dependent gender identification of non-native children, the proposed approach outperforms the baseline models.
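To make the ERB-based initialization concrete, the following is a minimal PyTorch sketch, not the authors’ implementation, of a SincNet-style sinc-convolution layer whose per-filter low cutoff and bandwidth are the only learnable parameters and whose band edges are initialized uniformly on the ERB-rate scale instead of the Mel scale. The Glasberg–Moore ERB-rate formula, the Hamming window, and the hyperparameters (80 filters, 251-tap kernels, 16 kHz audio, 50 Hz minimum cutoff and bandwidth, 30 Hz lower edge) are illustrative assumptions.

```python
import math
import torch
import torch.nn as nn


def hz_to_erb(f_hz):
    # ERB-rate (Glasberg & Moore): number of ERBs below f_hz (Hz).
    return 21.4 * math.log10(1.0 + 0.00437 * f_hz)


def erb_to_hz(erb):
    # Inverse of hz_to_erb; works element-wise on tensors.
    return (10.0 ** (erb / 21.4) - 1.0) / 0.00437


class SincConvERB(nn.Module):
    """SincNet-style band-pass layer with cutoffs initialized on the ERB scale."""

    def __init__(self, out_channels=80, kernel_size=251, sample_rate=16000,
                 min_low_hz=50.0, min_band_hz=50.0):
        super().__init__()
        if kernel_size % 2 == 0:
            kernel_size += 1  # an odd length keeps the filters symmetric
        self.kernel_size = kernel_size
        self.sample_rate = sample_rate
        self.min_low_hz = min_low_hz
        self.min_band_hz = min_band_hz

        # Band edges spaced uniformly on the ERB-rate scale: this assigns more
        # filters to the low-frequency region than Mel or linear spacing would.
        low_hz = 30.0
        high_hz = sample_rate / 2.0 - (min_low_hz + min_band_hz)
        erb_points = torch.linspace(hz_to_erb(low_hz), hz_to_erb(high_hz),
                                    out_channels + 1)
        hz_points = erb_to_hz(erb_points)

        # The only learnable parameters: per-filter low cutoff and bandwidth.
        self.low_hz_ = nn.Parameter(hz_points[:-1].view(-1, 1))
        self.band_hz_ = nn.Parameter((hz_points[1:] - hz_points[:-1]).view(-1, 1))

        # Half of a Hamming window (the filter is symmetric) and negative time axis.
        n_lin = torch.linspace(0, (kernel_size / 2) - 1, steps=kernel_size // 2)
        self.register_buffer(
            "window_", 0.54 - 0.46 * torch.cos(2 * math.pi * n_lin / kernel_size))
        n = (kernel_size - 1) / 2.0
        self.register_buffer(
            "n_", 2 * math.pi * torch.arange(-n, 0.0).view(1, -1) / sample_rate)

    def forward(self, waveforms):
        # waveforms: (batch, 1, time) raw audio samples
        low = self.min_low_hz + torch.abs(self.low_hz_)
        high = torch.clamp(low + self.min_band_hz + torch.abs(self.band_hz_),
                           self.min_low_hz, self.sample_rate / 2.0)
        band = (high - low)[:, 0]

        # Ideal band-pass = difference of two low-pass sinc filters, windowed.
        f_low = torch.matmul(low, self.n_)
        f_high = torch.matmul(high, self.n_)
        band_pass_left = ((torch.sin(f_high) - torch.sin(f_low))
                          / (self.n_ / 2)) * self.window_
        band_pass_center = 2 * band.view(-1, 1)
        band_pass_right = torch.flip(band_pass_left, dims=[1])
        band_pass = torch.cat(
            [band_pass_left, band_pass_center, band_pass_right], dim=1)
        band_pass = band_pass / (2 * band.view(-1, 1))

        filters = band_pass.view(-1, 1, self.kernel_size)
        return nn.functional.conv1d(waveforms, filters,
                                    stride=1, padding=self.kernel_size // 2)


# Usage sketch: 80 band-pass responses for a one-second 16 kHz waveform.
# x = torch.randn(4, 1, 16000); y = SincConvERB()(x)  # -> shape (4, 80, 16000)
```

Because the ERB-rate scale grows roughly logarithmically with frequency, equally spaced ERB points place more band edges in the low-frequency region, where the fundamental frequency and lower formants carry much of the speaker-discriminative information.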


Data Availability

The open-access data that support the findings of this study are publicly available in the Kaggle repository, “https://www.kaggle.com/dsv/4416485”. More details about the data collection are given in Sect. 3.


Author information

Corresponding author

Correspondence to Mohan Bansal.

Ethics declarations

Conflicts of interest

The authors declare no conflict of interest.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Radha, K., Bansal, M. Towards modeling raw speech in gender identification of children using sincNet over ERB scale. Int J Speech Technol 26, 651–663 (2023). https://doi.org/10.1007/s10772-023-10039-8

