Transfer Learning in Speaker’s Age and Gender Recognition

  • Conference paper
Speech and Computer (SPECOM 2020)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 12335)

Abstract

In this paper, we study the application of the transfer learning approach to the task of speaker age and gender recognition. Speech analysis systems that take images of log Mel-spectrograms or MFCCs as classification input have recently been gaining popularity. We therefore used models pretrained on the ImageNet task that showed good performance there, such as AlexNet, VGG-16, ResNet18, ResNet34, ResNet50, as well as the state-of-the-art EfficientNet-B4 from Google. Additionally, we trained 1D CNN and TDNN models for speaker age and gender recognition. We compared the performance of these models on age (4 classes), gender (3 classes), and joint age and gender (7 classes) recognition. Despite the high performance of the pretrained models on the ImageNet task, our TDNN models showed better UAR results on all tasks presented in this study: age (UAR = 51.719%), gender (UAR = 81.746%), and joint age and gender (UAR = 48.969%) recognition.
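
As a rough illustration of the pipeline the abstract describes, the sketch below converts an utterance into a log Mel-spectrogram "image" with librosa, adapts an ImageNet-pretrained ResNet18 from torchvision for the 7-class joint age and gender task, and computes UAR as macro-averaged recall. This is a minimal sketch under stated assumptions, not the paper's implementation: the sampling rate, number of Mel bands, optimizer settings, and the choice of ResNet18 as the backbone are all illustrative.

    # Minimal sketch of the transfer learning setup outlined in the abstract.
    # Assumptions (not taken from the paper): 16 kHz audio, 128 Mel bands,
    # ResNet18 backbone, Adam with lr=1e-4.

    import librosa
    import numpy as np
    import torch
    import torch.nn as nn
    from sklearn.metrics import recall_score
    from torchvision import models

    NUM_CLASSES = 7  # joint age and gender classes, as in the paper

    def log_mel_image(wav_path: str, sr: int = 16000, n_mels: int = 128) -> torch.Tensor:
        """Convert an utterance into a 3-channel log Mel-spectrogram 'image'."""
        y, _ = librosa.load(wav_path, sr=sr)
        mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
        log_mel = librosa.power_to_db(mel, ref=np.max)
        # Min-max normalize and replicate to 3 channels, since ImageNet
        # backbones expect 3-channel input.
        log_mel = (log_mel - log_mel.min()) / (log_mel.max() - log_mel.min() + 1e-8)
        return torch.from_numpy(log_mel).float().unsqueeze(0).repeat(3, 1, 1)

    def uar(y_true, y_pred) -> float:
        """UAR (unweighted average recall): recall averaged equally over classes."""
        return recall_score(y_true, y_pred, average="macro")

    # Load an ImageNet-pretrained backbone and replace its classification
    # head with a new linear layer for the target classes; the rest of the
    # network is then fine-tuned on the spectrogram images.
    model = models.resnet18(pretrained=True)
    model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)

    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

The same head-replacement step applies to the other backbones the abstract lists (the attribute holding the final layer differs by architecture, e.g. classifier for AlexNet and VGG-16).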

Acknowledgements

This research is supported by the Russian Science Foundation (project No. 18-11-00145).

Author information

Corresponding author

Correspondence to Maxim Markitantov.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Markitantov, M. (2020). Transfer Learning in Speaker’s Age and Gender Recognition. In: Karpov, A., Potapova, R. (eds) Speech and Computer. SPECOM 2020. Lecture Notes in Computer Science, vol 12335. Springer, Cham. https://doi.org/10.1007/978-3-030-60276-5_32

  • DOI: https://doi.org/10.1007/978-3-030-60276-5_32

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-60275-8

  • Online ISBN: 978-3-030-60276-5

  • eBook Packages: Computer Science, Computer Science (R0)
