Transfer Learning in Speaker’s Age and Gender Recognition

  • Conference paper
Speech and Computer (SPECOM 2020)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 12335)

Abstract

In this paper, we study the application of the transfer learning approach to the task of speaker age and gender recognition. Speech analysis systems that take images of log Mel-spectrograms or MFCCs as classification input have recently been gaining popularity. We therefore used models pretrained on the ImageNet task that showed good performance there, such as AlexNet, VGG-16, ResNet18, ResNet34, ResNet50, as well as the state-of-the-art EfficientNet-B4 from Google. Additionally, we trained 1D CNN and TDNN models for speaker age and gender recognition. We compared the performance of these models on age (4 classes), gender (3 classes), and joint age and gender (7 classes) recognition. Despite the high performance of the pretrained models on the ImageNet task, our TDNN models showed better UAR results on all tasks presented in this study: age (UAR = 51.719%), gender (UAR = 81.746%), and joint age and gender (UAR = 48.969%) recognition.
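
As a rough illustration of the pipeline the abstract describes, the sketch below converts an utterance into a log Mel-spectrogram "image" with librosa, adapts an ImageNet-pretrained ResNet18 from torchvision for the 7-class joint age and gender task, and computes UAR as macro-averaged recall. This is a minimal sketch under stated assumptions, not the paper's implementation: the sampling rate, number of Mel bands, optimizer settings, and the choice of ResNet18 as the backbone are all illustrative.

    # Minimal sketch of the transfer learning setup outlined in the abstract.
    # Assumptions (not taken from the paper): 16 kHz audio, 128 Mel bands,
    # ResNet18 backbone, Adam with lr=1e-4.

    import librosa
    import numpy as np
    import torch
    import torch.nn as nn
    from sklearn.metrics import recall_score
    from torchvision import models

    NUM_CLASSES = 7  # joint age and gender classes, as in the paper

    def log_mel_image(wav_path: str, sr: int = 16000, n_mels: int = 128) -> torch.Tensor:
        """Convert an utterance into a 3-channel log Mel-spectrogram 'image'."""
        y, _ = librosa.load(wav_path, sr=sr)
        mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
        log_mel = librosa.power_to_db(mel, ref=np.max)
        # Min-max normalize and replicate to 3 channels, since ImageNet
        # backbones expect 3-channel input.
        log_mel = (log_mel - log_mel.min()) / (log_mel.max() - log_mel.min() + 1e-8)
        return torch.from_numpy(log_mel).float().unsqueeze(0).repeat(3, 1, 1)

    def uar(y_true, y_pred) -> float:
        """UAR (unweighted average recall): recall averaged equally over classes."""
        return recall_score(y_true, y_pred, average="macro")

    # Load an ImageNet-pretrained backbone and replace its classification
    # head with a new linear layer for the target classes; the rest of the
    # network is then fine-tuned on the spectrogram images.
    model = models.resnet18(pretrained=True)
    model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)

    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

The same head-replacement step applies to the other backbones the abstract lists (the attribute holding the final layer differs by architecture, e.g. classifier for AlexNet and VGG-16).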

Acknowledgements

This research is supported by the Russian Science Foundation (project No. 18-11-00145).

Author information

Corresponding author

Correspondence to Maxim Markitantov.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Markitantov, M. (2020). Transfer Learning in Speaker’s Age and Gender Recognition. In: Karpov, A., Potapova, R. (eds) Speech and Computer. SPECOM 2020. Lecture Notes in Computer Science, vol 12335. Springer, Cham. https://doi.org/10.1007/978-3-030-60276-5_32

  • DOI: https://doi.org/10.1007/978-3-030-60276-5_32

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-60275-8

  • Online ISBN: 978-3-030-60276-5

  • eBook Packages: Computer Science, Computer Science (R0)
