Abstract
In this paper, we study the application of transfer learning to the task of speaker age and gender recognition. Speech analysis systems that take images of log Mel-spectrograms or MFCCs as input for classification have recently been gaining popularity. We therefore used models pretrained on the ImageNet task, on which they show good performance: AlexNet, VGG-16, ResNet18, ResNet34 and ResNet50, as well as the state-of-the-art EfficientNet-B4 from Google. In addition, we trained 1D CNN and TDNN models for speaker age and gender recognition. We compared the performance of these models on age (4 classes), gender (3 classes) and joint age and gender (7 classes) recognition. Despite the high performance of the pretrained models on the ImageNet task, our TDNN models achieved the best UAR in all tasks presented in this study: age (UAR = 51.719%), gender (UAR = 81.746%) and joint age and gender (UAR = 48.969%) recognition.
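All results above are reported as UAR (unweighted average recall), i.e., the per-class recalls averaged without weighting by class frequency, so that majority classes do not dominate the score. A minimal sketch of the metric (the label values are illustrative, not taken from the paper's data):

```python
from collections import defaultdict

def uar(y_true, y_pred):
    """Unweighted average recall: mean of per-class recalls."""
    correct = defaultdict(int)  # per-class hits
    total = defaultdict(int)    # per-class support
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)

# Hypothetical 3-class gender task (0 = child, 1 = female, 2 = male):
y_true = [0, 0, 1, 1, 1, 2]
y_pred = [0, 1, 1, 1, 2, 2]
print(uar(y_true, y_pred))  # (1/2 + 2/3 + 1/1) / 3 = 13/18 ≈ 0.722
```

Plain accuracy on the same labels would be 4/6 ≈ 0.667; UAR differs because each class contributes equally regardless of its support.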
Acknowledgements
This research is supported by the Russian Science Foundation (project No. 18-11-00145).
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Markitantov, M. (2020). Transfer Learning in Speaker's Age and Gender Recognition. In: Karpov, A., Potapova, R. (eds) Speech and Computer. SPECOM 2020. Lecture Notes in Computer Science, vol 12335. Springer, Cham. https://doi.org/10.1007/978-3-030-60276-5_32
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-60275-8
Online ISBN: 978-3-030-60276-5