Children’s Age and Gender Recognition from Raw Speech Waveform Using DNN

  • Mousmita Sarma
  • Kandarpa Kumar Sarma
  • Nagendra Kumar Goel
Conference paper
Part of the Lecture Notes in Networks and Systems book series (LNNS, volume 109)

Abstract

We propose end-to-end deep neural network (DNN) architectures that operate on the raw speech waveform to estimate the age and gender of children in the age range of 4–14 years. To this end, we design single-task and multi-task learning DNN configurations. In the multi-task learning DNN, we use age and gender as separate labels in two output layers and jointly optimize the total objective loss. We learn features from the raw waveform within the DNN in a data-driven manner, which gives the network the freedom to learn gender- and age-discriminative features during training. Interleaved time-delay neural network and long short-term memory (TDNN-LSTM) layers with a time-restricted self-attention mechanism are used to model the temporal dynamics of speech. The experimental results provide a comparative analysis of the single-task and multi-task learning processes for age and gender recognition from children’s speech.
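The multi-task set-up and the time-restricted self-attention described in the abstract can be illustrated with a small sketch: attention over a limited window of neighboring frames, an utterance-level embedding shared by two output layers (age and gender), and a total objective formed from the two cross-entropy losses. This is a minimal pure-Python sketch under our own assumptions, not the authors' implementation. The raw-waveform front end and the TDNN-LSTM layers are omitted; the attention is reduced to a single head with no learned query/key/value projections; frame outputs are mean-pooled into one embedding; age is assumed to be a classification over the integer ages 4 to 14 (the abstract does not say whether age is classified or regressed); and all function names, the `gender_weight` parameter, and the toy numbers are hypothetical.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]    # one row per output unit
Head = Tuple[Matrix, Vector]  # (weights, bias) of one output layer


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def softmax(scores: Vector) -> Vector:
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def time_restricted_attention(frames: List[Vector], left: int, right: int) -> List[Vector]:
    """Single-head scaled dot-product self-attention in which frame t attends
    only to frames t-left .. t+right. Queries, keys and values are the frames
    themselves (no learned projections): a simplification for illustration."""
    n, dim = len(frames), len(frames[0])
    outputs = []
    for t in range(n):
        window = frames[max(0, t - left):min(n, t + right + 1)]
        weights = softmax([dot(frames[t], k) / math.sqrt(dim) for k in window])
        outputs.append([sum(w * v[j] for w, v in zip(weights, window)) for j in range(dim)])
    return outputs


def mean_pool(frames: List[Vector]) -> Vector:
    # Assumption: average frame-level outputs into one utterance-level embedding.
    return [sum(col) / len(frames) for col in zip(*frames)]


def linear(x: Vector, head: Head) -> Vector:
    weights, bias = head
    return [dot(row, x) + b for row, b in zip(weights, bias)]


def cross_entropy(logits: Vector, label: int) -> float:
    # -log softmax(logits)[label], computed via log-sum-exp.
    m = max(logits)
    return m + math.log(sum(math.exp(z - m) for z in logits)) - logits[label]


def multitask_loss(embedding: Vector, age_head: Head, gender_head: Head,
                   age_label: int, gender_label: int,
                   gender_weight: float = 1.0) -> Tuple[float, float, float]:
    """Both output layers read the same shared embedding; the total objective is
    age loss + gender_weight * gender loss (hypothetical weighting)."""
    age_loss = cross_entropy(linear(embedding, age_head), age_label)
    gender_loss = cross_entropy(linear(embedding, gender_head), gender_label)
    return age_loss + gender_weight * gender_loss, age_loss, gender_loss


# Toy usage with made-up numbers (not data from the paper).
AGES = list(range(4, 15))  # hypothetical: one class per year of age, 4..14
frames = [[0.1, 0.3], [0.2, -0.1], [0.0, 0.4], [0.5, 0.2]]
embedding = mean_pool(time_restricted_attention(frames, left=1, right=1))
age_head = ([[0.0, 0.0] for _ in AGES], [0.0] * len(AGES))  # untrained: all-zero weights
gender_head = ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
total, age_loss, gender_loss = multitask_loss(
    embedding, age_head, gender_head, age_label=AGES.index(9), gender_label=1)
print(f"total={total:.4f} age={age_loss:.4f} gender={gender_loss:.4f}")
```

With `left=0, right=0` each frame attends only to itself, so the attention layer returns its input unchanged; widening the window lets each frame mix in neighboring context while keeping the cost per frame bounded, unlike attention over the whole utterance. Summing the two losses is one simple way to jointly optimize a total objective; whether the authors weight the age and gender terms is not stated in the abstract.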

Keywords

Gender recognition; Age estimation; Multi-task learning; DNN

Copyright information

© Springer Nature Singapore Pte Ltd. 2020

Authors and Affiliations

  • Mousmita Sarma (1; corresponding author)
  • Kandarpa Kumar Sarma (1)
  • Nagendra Kumar Goel (2)
  1. Department of Electronics and Communication Engineering, Gauhati University, Guwahati, India
  2. GoVivace Inc., McLean, USA
