Children’s Speaker Recognition Method Based on Multi-dimensional Features
Speech signals collected in everyday settings are inherently mixed signals that carry information related to speaker characteristics, such as gender, age, and emotional state. This paper analyzes the commonalities and limitations of traditional single-dimensional speaker recognition, and examines common acoustic feature parameters, including prosodic features, voice quality features, and spectral features, from the perspective of children's speech. To exploit the temporal structure of speech, a multi-channel model combining a Time-Delay Neural Network (TDNN), a Bidirectional Long Short-Term Memory (BiLSTM) network, and an attention mechanism is trained to address the children's speaker recognition problem. Extensive experimental results show that, while maintaining the accuracy of age and gender recognition, the method achieves higher accuracy in children's voiceprint recognition.
Keywords: Children’s speaker recognition · Bidirectional Long Short-Term Memory · Time delay neural network · Attention mechanism
This paper is funded by the “Dalian Key Laboratory for the Application of Big Data and Data Science”.
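The abstract describes a model built from TDNN layers (which splice each frame with a fixed temporal context) followed by attention-based pooling over time. A minimal numpy sketch of these two standard building blocks is given below; the layer sizes, context offsets, and random weights are illustrative assumptions, not the paper's actual configuration, and the BiLSTM channel is omitted for brevity.

```python
import numpy as np

def tdnn_layer(frames, weights, context=(-2, -1, 0, 1, 2)):
    """One TDNN layer: splice each frame with its temporal context,
    then apply a shared linear transform and ReLU (illustrative form)."""
    T, D = frames.shape
    spliced = []
    for t in range(T):
        # clamp context indices at the utterance boundaries
        idx = [min(max(t + c, 0), T - 1) for c in context]
        spliced.append(np.concatenate([frames[i] for i in idx]))
    spliced = np.stack(spliced)              # shape (T, D * len(context))
    return np.maximum(spliced @ weights, 0)  # ReLU activation

def attention_pool(h, w):
    """Attention pooling: score each frame, softmax over time,
    return the weighted sum as a fixed-length utterance embedding."""
    scores = h @ w                           # one score per frame, shape (T,)
    a = np.exp(scores - scores.max())        # numerically stable softmax
    a /= a.sum()
    return a @ h                             # shape (hidden_dim,)

rng = np.random.default_rng(0)
x = rng.standard_normal((50, 24))            # 50 frames of 24-dim features
W1 = rng.standard_normal((24 * 5, 64)) * 0.1 # hypothetical layer weights
h = tdnn_layer(x, W1)                        # frame-level representations
emb = attention_pool(h, rng.standard_normal(64))
print(emb.shape)                             # (64,)
```

In a full multi-channel system, a parallel BiLSTM channel would produce its own frame-level representations, and the pooled embeddings from each channel would be combined before classification.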