Children’s Speaker Recognition Method Based on Multi-dimensional Features

  • Ning Jia
  • Chunjun Zheng
  • Wei Sun
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11888)


Speech signals collected in everyday settings are inherently mixed signals: alongside the linguistic content they carry speaker-related information such as gender, age, and emotional state. This paper analyzes what traditional single-dimensional speaker recognition methods have in common and where they differ, and examines common acoustic feature parameters (prosodic features, voice-quality features, and spectral features) with respect to the particularities of children's speech. To exploit the temporal structure of speech, a multi-channel model combining a Time-Delay Neural Network (TDNN), a Bidirectional Long Short-Term Memory (BiLSTM) network, and an attention mechanism is trained as a solution to the children's speaker recognition problem. Extensive experimental results show that, while maintaining the accuracy of age and gender recognition, the method achieves higher accuracy in children's voiceprint recognition.
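The multi-channel architecture described in the abstract can be sketched in a minimal, illustrative form: a TDNN layer is essentially a dilated 1-D convolution over frame-level features, and an attention mechanism pools the resulting variable-length frame sequence into a fixed-length speaker embedding. All layer sizes, weights, and function names below are assumptions for illustration (the BiLSTM channel is omitted for brevity); this is a sketch of the general technique, not the authors' exact implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def tdnn_layer(frames, weights, dilation):
    """One TDNN layer as a dilated 1-D convolution with ReLU.

    frames:  (T, d_in) frame-level features
    weights: (3, d_in, d_out) for the 3-tap context {-dilation, 0, +dilation}
    returns: (T - 2*dilation, d_out) frame-level activations
    """
    T, _ = frames.shape
    out = []
    for t in range(dilation, T - dilation):
        # stack the delayed, current, and advanced frames (the "time delay")
        ctx = np.stack([frames[t - dilation], frames[t], frames[t + dilation]])
        out.append(np.maximum(np.einsum("cij,ci->j", weights, ctx), 0.0))
    return np.array(out)

def attention_pool(frames, w):
    """Attentive pooling: per-frame scalar score -> softmax -> weighted mean."""
    scores = frames @ w
    a = np.exp(scores - scores.max())   # numerically stable softmax
    a /= a.sum()
    return a @ frames                   # fixed-length embedding

# Illustrative dimensions: 50 frames of 20-dim (e.g. MFCC-like) features.
T, d_in, d_h = 50, 20, 16
x = rng.standard_normal((T, d_in))
W = 0.1 * rng.standard_normal((3, d_in, d_h))
h = tdnn_layer(x, W, dilation=2)            # (46, 16) frame activations
emb = attention_pool(h, rng.standard_normal(d_h))
print(emb.shape)                            # fixed-length speaker embedding
```

In a full system, an embedding like `emb` from each channel (TDNN, BiLSTM) would be concatenated and fed to a classifier over speaker, age, and gender labels.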


Children’s speaker recognition · Bidirectional Long Short-Term Memory · Time-delay neural network · Attention mechanism



This paper is funded by the “Dalian Key Laboratory for the Application of Big Data and Data Science”.



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Dalian Neusoft University of Information, Dalian, China
  2. Dalian Maritime University, Dalian, China
