Three-stage speaker verification architecture in emotional talking environments
Speaker verification performance is typically high in a neutral talking environment but degrades sharply in emotional talking environments. This degradation stems from the mismatch between training on neutral speech and testing on emotional speech. In this work, a three-stage speaker verification architecture is proposed to enhance verification performance in emotional environments. The architecture comprises three cascaded stages: a gender identification stage, followed by an emotion identification stage, followed by a speaker verification stage. The proposed framework has been evaluated on two distinct and independent emotional speech datasets: an in-house dataset and the “Emotional Prosody Speech and Transcripts” dataset. Our results show that speaker verification based on both gender and emotion information outperforms verification based on gender information only, on emotion information only, and on neither gender nor emotion information. The average speaker verification performance attained with the proposed framework is very close to that attained in subjective assessments by human listeners.
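To make the cascaded decision flow concrete, the following is a minimal sketch under stated assumptions: the scoring functions, the emotion label set, and the log-likelihood-ratio threshold are placeholders introduced for illustration, not the paper's HMM or suprasegmental-HMM implementation. Each stage narrows the model set used by the next stage, which is the essence of the proposed cascade.

```python
# Minimal sketch of the three-stage cascade (hypothetical scoring functions).
# The paper builds the stages from HMM / suprasegmental-HMM models; here each
# model is simply a callable mapping a feature sequence to a log-likelihood.

from dataclasses import dataclass
from typing import Callable, Dict, Tuple

GENDERS = ("male", "female")
# Illustrative emotion set; the actual label inventory depends on the dataset.
EMOTIONS = ("neutral", "angry", "sad", "happy", "disgust", "fear")

Score = Callable[[object], float]  # features -> log-likelihood (placeholder)

@dataclass
class CascadeModels:
    gender: Dict[str, Score]                    # keyed by gender
    emotion: Dict[Tuple[str, str], Score]       # keyed by (gender, emotion)
    speaker: Dict[Tuple[str, str, str], Score]  # keyed by (gender, emotion, speaker_id)
    background: Dict[Tuple[str, str], Score]    # impostor model, keyed by (gender, emotion)

def verify(features, claimed_id: str, m: CascadeModels, threshold: float = 0.0) -> bool:
    # Stage 1: identify the speaker's gender.
    gender = max(GENDERS, key=lambda g: m.gender[g](features))
    # Stage 2: identify the emotion, using only models of the identified gender.
    emotion = max(EMOTIONS, key=lambda e: m.emotion[(gender, e)](features))
    # Stage 3: verify the claimed identity against gender- and emotion-specific
    # speaker and background models via a log-likelihood-ratio test.
    llr = m.speaker[(gender, emotion, claimed_id)](features) - m.background[(gender, emotion)](features)
    return llr >= threshold
```

In this sketch, restricting stages 2 and 3 to the gender and emotion selected upstream is what allows the verification stage to use emotion-matched speaker models rather than neutral-only models.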
Keywords: Emotion recognition, Emotional talking environments, Gender recognition, Hidden Markov models, Speaker verification, Suprasegmental hidden Markov models
The authors of this work would like to thank the University of Sharjah for funding their work through the competitive research project entitled “Emotion Recognition in each of Stressful and Emotional Talking Environments Using Artificial Models”, No. 1602040348-P.
Ismail Shahin wrote the paper, developed some of the classifiers used, and performed some of the experiments. Ali Bou Nassif suggested some of the classifiers, performed some of the experiments, and wrote the research questions. All authors read and approved the final manuscript.
Compliance with ethical standards
Conflict of interest
The authors declare that they have no competing interests.
This study does not involve any animal participants.
- Chen, L., Lee, K. A., Chng, E.-S., Ma, B., Li, H., & Dai, L. R. (2016). Content-aware local variability vector for speaker verification with short utterance. In Proceedings of the 41st IEEE international conference on acoustics, speech and signal processing (ICASSP 2016), Shanghai, China, March 2016 (pp. 5485–5489).
- Emotional Prosody Speech and Transcripts dataset. (2016). Retrieved November 15, 2016, from http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2002S28.
- Harb, H., & Chen, L. (2003). Gender identification using a general audio classifier. In International conference on multimedia and expo (ICME 2003), July 2003 (pp. 733–736).
- Polzin, T. S., & Waibel, A. H. (1998). Detecting emotions in speech. In Proceedings of the second international conference on cooperative multimodal communication (CMC 1998).
- Reynolds, D. A. (1995). Automatic speaker recognition using Gaussian mixture speaker models. The Lincoln Laboratory Journal, 8(2), 173–192.
- Reynolds, D. A. (2002). An overview of automatic speaker recognition technology. In Proceedings of ICASSP 2002, 4, IV-4072–IV-4075.
- Shahin, I. (2009). Verifying speakers in emotional environments. In The 9th IEEE international symposium on signal processing and information technology, Ajman, United Arab Emirates, December 2009 (pp. 328–333).
- Vogt, T., & Andre, E. (2006). Improving automatic emotion recognition from speech via gender differentiation. In Proceedings of the language resources and evaluation conference (LREC 2006), Genoa, Italy, 2006.
- Wu, W., Zheng, T. F., Xu, M. X., & Bao, H. J. (2006). Study on speaker verification on emotional speech. In Proceedings of the international conference on spoken language processing (INTERSPEECH 2006), September 2006 (pp. 2102–2105).