Abstract
The speech signal is one of the most effective data sources in human–computer interaction and is widely used in applications such as speech/speaker recognition, emotion recognition, language recognition, and age and gender recognition. In this study, two convolutional neural networks, one 1D and one 2D, are designed to recognize the age and gender class of a speaker. Each model is built by stacking four feature learning blocks (FLBs) and one classification block, and the two models take different input feature vectors, both derived from mel-frequency cepstral coefficients (MFCCs). Each FLB consists of a convolution layer, a batch normalization layer, a ReLU layer, a max pooling layer, and a dropout layer, while the classification block consists of a flatten layer, two fully connected layers, and a softmax layer. In addition to hyperparameter optimization via manual search, the model architecture is optimized by evaluating different combinations of the basic components that make up the FLBs. In experiments on the Common Voice Turkish dataset, the highest validation accuracy is 66.26% for the 1D model and 94.40% for the 2D model. These results demonstrate the effectiveness of the proposed 2D model for age and gender recognition.
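To make the described architecture concrete, the sketch below shows how the 2D model could be assembled in Keras. It is a minimal illustration, not the author's implementation: the filter counts, kernel and pooling sizes, dropout rate, dense-layer width, MFCC input shape, and number of output classes are not specified in the abstract and are assumed here purely for demonstration.

```python
from tensorflow.keras import layers, models

def feature_learning_block(x, filters):
    """One FLB: convolution -> batch normalization -> ReLU -> max pooling -> dropout."""
    x = layers.Conv2D(filters, kernel_size=(3, 3), padding="same")(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation("relu")(x)
    x = layers.MaxPooling2D(pool_size=(2, 2))(x)
    x = layers.Dropout(0.3)(x)  # dropout rate is an assumption
    return x

def build_2d_model(input_shape, num_classes):
    """Four stacked FLBs followed by the classification block
    (flatten -> two fully connected layers, the second with softmax)."""
    inputs = layers.Input(shape=input_shape)
    x = inputs
    for filters in (32, 64, 128, 256):  # filter counts are illustrative
        x = feature_learning_block(x, filters)
    x = layers.Flatten()(x)
    x = layers.Dense(256, activation="relu")(x)  # first fully connected layer
    outputs = layers.Dense(num_classes, activation="softmax")(x)  # second FC layer + softmax
    return models.Model(inputs, outputs)

# Hypothetical input: a 40 x 126 MFCC "image" with one channel, and a joint
# age-gender label space of, say, 8 classes (both values are assumptions).
model = build_2d_model(input_shape=(40, 126, 1), num_classes=8)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```

The 1D variant would presumably follow the same pattern with Conv1D and MaxPooling1D layers applied to an MFCC feature vector rather than a time–frequency matrix.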
Data availability
The datasets are publicly available; the Common Voice Turkish dataset can be obtained from https://commonvoice.mozilla.org/tr/datasets.
Author information
Contributions
EY was involved in conceptualization, methodology, investigation, project administration, software, validation, visualization, writing—original draft, and writing—review & editing.
Ethics declarations
Conflict of interest
The author declares that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Yücesoy, E. Speaker age and gender recognition using 1D and 2D convolutional neural networks. Neural Comput & Applic 36, 3065–3075 (2024). https://doi.org/10.1007/s00521-023-09153-0