Abstract
The speech signal is one of the most effective data sources in human–computer interaction and is widely used in applications such as speech/speaker recognition, emotion recognition, language recognition, and age and gender recognition. In this study, two convolutional neural networks, one 1D and one 2D, are designed to recognize the age and gender class of a speaker. Each model is built by stacking four feature learning blocks (FLBs) and one classification block, and the two models take different input feature vectors, both derived from mel-frequency cepstral coefficients (MFCCs). Each FLB consists of a convolution layer, a batch normalization layer, a ReLU layer, a max pooling layer, and a dropout layer, while the classification block consists of a flatten layer, two fully connected layers, and a softmax layer. In addition to hyperparameter optimization via manual search, the model architecture is optimized by evaluating different combinations of the basic components that make up the FLBs. In experiments on the Common Voice Turkish dataset, the highest validation accuracy is 66.26% for the 1D model and 94.40% for the 2D model. These results demonstrate the effectiveness of the proposed 2D model for age and gender recognition.
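To make the described architecture concrete, the sketch below shows how the 2D model could be assembled in Keras. It is a minimal illustration, not the author's implementation: the filter counts, kernel and pooling sizes, dropout rate, dense-layer width, MFCC input shape, and number of output classes are not specified in the abstract and are assumed here purely for demonstration.

```python
from tensorflow.keras import layers, models

def feature_learning_block(x, filters):
    """One FLB: convolution -> batch normalization -> ReLU -> max pooling -> dropout."""
    x = layers.Conv2D(filters, kernel_size=(3, 3), padding="same")(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation("relu")(x)
    x = layers.MaxPooling2D(pool_size=(2, 2))(x)
    x = layers.Dropout(0.3)(x)  # dropout rate is an assumption
    return x

def build_2d_model(input_shape, num_classes):
    """Four stacked FLBs followed by the classification block
    (flatten -> two fully connected layers, the second with softmax)."""
    inputs = layers.Input(shape=input_shape)
    x = inputs
    for filters in (32, 64, 128, 256):  # filter counts are illustrative
        x = feature_learning_block(x, filters)
    x = layers.Flatten()(x)
    x = layers.Dense(256, activation="relu")(x)  # first fully connected layer
    outputs = layers.Dense(num_classes, activation="softmax")(x)  # second FC layer + softmax
    return models.Model(inputs, outputs)

# Hypothetical input: a 40 x 126 MFCC "image" with one channel, and a joint
# age-gender label space of, say, 8 classes (both values are assumptions).
model = build_2d_model(input_shape=(40, 126, 1), num_classes=8)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```

The 1D variant would presumably follow the same pattern with Conv1D and MaxPooling1D layers applied to an MFCC feature vector rather than a time–frequency matrix.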
Data availability
The datasets are publicly available; the Common Voice Turkish dataset can be obtained from https://commonvoice.mozilla.org/tr/datasets.
Author information
Contributions
EY was involved in conceptualization, methodology, investigation, project administration, software, validation, visualization, writing—original draft, and writing—review & editing.
Ethics declarations
Conflict of interest
The author declares that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Yücesoy, E. Speaker age and gender recognition using 1D and 2D convolutional neural networks. Neural Comput & Applic 36, 3065–3075 (2024). https://doi.org/10.1007/s00521-023-09153-0