
Multi-task learning DNN to improve gender identification from speech leveraging age information of the speaker

Published in: International Journal of Speech Technology

Abstract

We propose a method that provides the age of the speaker as additional information while training a machine learning model for gender identification. To this end, we design a multi-task learning Deep Neural Network (DNN) whose primary output layer has the speaker's gender as target. In addition, we use the speaker's age group as an auxiliary target for each utterance, where the age groups are defined taking the speaker's gender into account. We show experimentally that the multi-task learning DNN outperforms a Gaussian Mixture Model (GMM) and a single-task DNN trained only for gender recognition on more realistic datasets, which contain recordings of speakers from all age groups (children to seniors). We use the raw speech waveform as input to the DNN, which performs multi-task learning with the freedom to learn gender- and age-discriminative features during training. The raw-waveform front end uses convolutional-layer-based filter learning, and Long Short-Term Memory recurrent projection (LSTMP) layers model the temporal dynamics of speech from the learned feature representation.
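The architecture described in the abstract can be sketched as follows. All layer sizes, kernel widths, and the number of age groups here are illustrative assumptions, not the paper's exact configuration; PyTorch's `proj_size` option on `nn.LSTM` is used to approximate the LSTMP recurrent projection.

```python
import torch
import torch.nn as nn

class MultiTaskGenderAgeNet(nn.Module):
    """Sketch of a multi-task DNN for gender identification with an
    age-group auxiliary task: a convolutional front end learns filters
    from the raw waveform, projected LSTM layers model temporal
    dynamics, and two heads share the learned representation."""

    def __init__(self, n_age_groups=6):
        super().__init__()
        # Filter learning directly from raw samples
        # (~25 ms windows with a 10 ms shift at 16 kHz; sizes assumed).
        self.frontend = nn.Sequential(
            nn.Conv1d(1, 40, kernel_size=400, stride=160),
            nn.ReLU(),
        )
        # LSTM with a recurrent projection (proj_size), approximating LSTMP.
        self.lstm = nn.LSTM(input_size=40, hidden_size=512,
                            proj_size=256, num_layers=2, batch_first=True)
        self.gender_head = nn.Linear(256, 2)          # primary task
        self.age_head = nn.Linear(256, n_age_groups)  # auxiliary task

    def forward(self, wav):                  # wav: (batch, samples)
        x = self.frontend(wav.unsqueeze(1))  # (batch, 40, frames)
        x, _ = self.lstm(x.transpose(1, 2))  # (batch, frames, 256)
        h = x[:, -1]                         # last frame summarises the utterance
        return self.gender_head(h), self.age_head(h)

model = MultiTaskGenderAgeNet()
gender_logits, age_logits = model(torch.randn(4, 16000))  # four 1 s clips at 16 kHz
```

In multi-task training of this kind, the two heads would typically be optimised jointly with a weighted sum of cross-entropy losses, e.g. `loss = ce_gender + lam * ce_age`, where the weight `lam` on the auxiliary age task is a tunable hyperparameter (an assumption here; the paper's exact loss weighting is not stated in the abstract).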



Author information

Corresponding author

Correspondence to Mousmita Sarma.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Sarma, M., Sarma, K.K. & Goel, N.K. Multi-task learning DNN to improve gender identification from speech leveraging age information of the speaker. Int J Speech Technol 23, 223–240 (2020). https://doi.org/10.1007/s10772-020-09680-4


