
Towards modeling raw speech in gender identification of children using sincNet over ERB scale

Published in: International Journal of Speech Technology

Abstract

This article presents the results of age-dependent gender identification from raw speech using the recently developed non-native children’s English speech corpus. Convolutional neural networks (CNNs), which can learn low-level speech patterns, can be fed raw waveforms directly rather than relying on hand-crafted features. However, the filters learned by conventional CNNs tend to be noisy because all of their filter coefficients are learned freely. SincNet, in contrast, yields more meaningful filters by replacing the first convolutional layer with a sinc layer. The only learnable parameters in this layer are the low and high cutoff frequencies of a rectangular band-pass filter, which makes it possible to interpret key speaker characteristics such as pitch and formants. In this study, the SincNet model is notably improved by replacing the baseline Mel-scale initialization with an equivalent rectangular bandwidth (ERB) initialization, which has the additional benefit of assigning extra filters to the lower spectral region. The results also indicate that the SincNet model is particularly well suited to identifying the gender of children. In age-dependent gender identification of non-native children, the proposed approach outperforms the baseline models.
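To make the ERB-based initialization concrete, the following is a minimal PyTorch sketch, not the authors’ implementation, of a SincNet-style sinc-convolution layer whose per-filter low cutoff and bandwidth are the only learnable parameters and whose band edges are initialized uniformly on the ERB-rate scale instead of the Mel scale. The Glasberg–Moore ERB-rate formula, the Hamming window, and the hyperparameters (80 filters, 251-tap kernels, 16 kHz audio, 50 Hz minimum cutoff and bandwidth, 30 Hz lower edge) are illustrative assumptions.

```python
import math
import torch
import torch.nn as nn


def hz_to_erb(f_hz):
    # ERB-rate (Glasberg & Moore): number of ERBs below f_hz (Hz).
    return 21.4 * math.log10(1.0 + 0.00437 * f_hz)


def erb_to_hz(erb):
    # Inverse of hz_to_erb; works element-wise on tensors.
    return (10.0 ** (erb / 21.4) - 1.0) / 0.00437


class SincConvERB(nn.Module):
    """SincNet-style band-pass layer with cutoffs initialized on the ERB scale."""

    def __init__(self, out_channels=80, kernel_size=251, sample_rate=16000,
                 min_low_hz=50.0, min_band_hz=50.0):
        super().__init__()
        if kernel_size % 2 == 0:
            kernel_size += 1  # an odd length keeps the filters symmetric
        self.kernel_size = kernel_size
        self.sample_rate = sample_rate
        self.min_low_hz = min_low_hz
        self.min_band_hz = min_band_hz

        # Band edges spaced uniformly on the ERB-rate scale: this assigns more
        # filters to the low-frequency region than Mel or linear spacing would.
        low_hz = 30.0
        high_hz = sample_rate / 2.0 - (min_low_hz + min_band_hz)
        erb_points = torch.linspace(hz_to_erb(low_hz), hz_to_erb(high_hz),
                                    out_channels + 1)
        hz_points = erb_to_hz(erb_points)

        # The only learnable parameters: per-filter low cutoff and bandwidth.
        self.low_hz_ = nn.Parameter(hz_points[:-1].view(-1, 1))
        self.band_hz_ = nn.Parameter((hz_points[1:] - hz_points[:-1]).view(-1, 1))

        # Half of a Hamming window (the filter is symmetric) and negative time axis.
        n_lin = torch.linspace(0, (kernel_size / 2) - 1, steps=kernel_size // 2)
        self.register_buffer(
            "window_", 0.54 - 0.46 * torch.cos(2 * math.pi * n_lin / kernel_size))
        n = (kernel_size - 1) / 2.0
        self.register_buffer(
            "n_", 2 * math.pi * torch.arange(-n, 0.0).view(1, -1) / sample_rate)

    def forward(self, waveforms):
        # waveforms: (batch, 1, time) raw audio samples
        low = self.min_low_hz + torch.abs(self.low_hz_)
        high = torch.clamp(low + self.min_band_hz + torch.abs(self.band_hz_),
                           self.min_low_hz, self.sample_rate / 2.0)
        band = (high - low)[:, 0]

        # Ideal band-pass = difference of two low-pass sinc filters, windowed.
        f_low = torch.matmul(low, self.n_)
        f_high = torch.matmul(high, self.n_)
        band_pass_left = ((torch.sin(f_high) - torch.sin(f_low))
                          / (self.n_ / 2)) * self.window_
        band_pass_center = 2 * band.view(-1, 1)
        band_pass_right = torch.flip(band_pass_left, dims=[1])
        band_pass = torch.cat(
            [band_pass_left, band_pass_center, band_pass_right], dim=1)
        band_pass = band_pass / (2 * band.view(-1, 1))

        filters = band_pass.view(-1, 1, self.kernel_size)
        return nn.functional.conv1d(waveforms, filters,
                                    stride=1, padding=self.kernel_size // 2)


# Usage sketch: 80 band-pass responses for a one-second 16 kHz waveform.
# x = torch.randn(4, 1, 16000); y = SincConvERB()(x)  # -> shape (4, 80, 16000)
```

Because the ERB-rate scale grows roughly logarithmically with frequency, equally spaced ERB points place more band edges in the low-frequency region, where the fundamental frequency and lower formants carry much of the speaker-discriminative information.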


Data Availability

The open-access data that support the findings of this study are publicly available in the Kaggle repository, “https://www.kaggle.com/dsv/4416485”. More details about the data collection are given in Sect. 3.


Author information

Corresponding author

Correspondence to Mohan Bansal.

Ethics declarations

Conflicts of interest

The authors declare no conflict of interest.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Radha, K., Bansal, M. Towards modeling raw speech in gender identification of children using sincNet over ERB scale. Int J Speech Technol 26, 651–663 (2023). https://doi.org/10.1007/s10772-023-10039-8

