Three-stage speaker verification architecture in emotional talking environments

Shahin, Ismail; Nassif, Ali Bou

doi:10.1007/s10772-018-9543-4

Published: 28 August 2018

Volume 21, pages 915–930, (2018)
International Journal of Speech Technology

Ismail Shahin¹ &
Ali Bou Nassif¹


Abstract

Speaker verification performance is usually high in a neutral talking environment but degrades sharply in emotional talking environments. This degradation stems from the mismatch between training in a neutral environment and testing in emotional environments. In this work, a three-stage speaker verification architecture is proposed to enhance speaker verification performance in emotional environments. The architecture comprises three cascaded stages: a gender identification stage, followed by an emotion identification stage, followed by a speaker verification stage. The proposed framework has been evaluated on two distinct and independent emotional speech datasets: an in-house dataset and the “Emotional Prosody Speech and Transcripts” dataset. Our results show that speaker verification based on both gender and emotion information is superior to speaker verification based on gender information only, on emotion information only, and on neither gender nor emotion information. The average speaker verification performance attained with the proposed framework is very similar to that attained in subjective assessment by human listeners.
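The cascade described in the abstract can be illustrated with a minimal sketch. The abstract does not state which classifiers or features the authors use, so everything below is an assumption made for illustration: a single diagonal Gaussian stands in for each trained model, and the names `DiagGaussian`, `best_label`, `verify_claim`, and `Decision`, as well as the log-likelihood-ratio decision rule and its threshold, are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple

Frames = List[Sequence[float]]  # one feature vector per speech frame


@dataclass(frozen=True)
class DiagGaussian:
    """Toy stand-in for a trained model (assumption, not the paper's classifier)."""
    mean: Tuple[float, ...]
    var: Tuple[float, ...]

    def avg_log_likelihood(self, frames: Frames) -> float:
        # Average per-frame log-likelihood under a diagonal Gaussian.
        total = 0.0
        for frame in frames:
            for x, m, v in zip(frame, self.mean, self.var):
                total += -0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
        return total / len(frames)


def best_label(frames: Frames, models: Dict[str, DiagGaussian]) -> str:
    # Pick the label whose model scores the utterance highest.
    return max(models, key=lambda label: models[label].avg_log_likelihood(frames))


@dataclass(frozen=True)
class Decision:
    gender: str
    emotion: str
    score: float
    accepted: bool


def verify_claim(
    frames: Frames,
    claimed_speaker: str,
    gender_models: Dict[str, DiagGaussian],
    emotion_models: Dict[str, Dict[str, DiagGaussian]],
    speaker_models: Dict[Tuple[str, str], DiagGaussian],
    background_models: Dict[Tuple[str, str], DiagGaussian],
    threshold: float = 0.0,
) -> Decision:
    # Stage 1: gender identification.
    gender = best_label(frames, gender_models)
    # Stage 2: emotion identification, using the models for that gender.
    emotion = best_label(frames, emotion_models[gender])
    # Stage 3: speaker verification, conditioned on the identified emotion.
    # The score is a log-likelihood ratio against a background model
    # (hypothetical decision rule chosen for this sketch).
    target = speaker_models[(claimed_speaker, emotion)]
    background = background_models[(gender, emotion)]
    score = target.avg_log_likelihood(frames) - background.avg_log_likelihood(frames)
    return Decision(gender, emotion, score, score >= threshold)
```

In this sketch each stage narrows the set of models used by the next one, which is how a cascade can reduce the neutral-versus-emotional mismatch: the final comparison uses models that match the identified gender and emotion rather than neutral-only models.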



Acknowledgements

The authors thank the University of Sharjah for funding this work through the competitive research project entitled “Emotion Recognition in each of Stressful and Emotional Talking Environments Using Artificial Models”, No. 1602040348-P.

Funding

Ismail Shahin and Ali Bou Nassif thank the University of Sharjah for funding their work through the competitive research project entitled “Emotion Recognition in each of Stressful and Emotional Talking Environments Using Artificial Models”, No. 1602040348-P.

Author information

Authors and Affiliations

Department of Electrical and Computer Engineering, University of Sharjah, P. O. Box 27272, Sharjah, United Arab Emirates
Ismail Shahin & Ali Bou Nassif


Contributions

Ismail Shahin wrote the paper, developed some of the classifiers used, and performed some of the experiments. Ali Bou Nassif suggested using some of the classifiers, performed some of the experiments, and wrote the research questions. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Ismail Shahin.

Ethics declarations

Conflict of interest

The authors declare that they have no competing interests.

Informed consent

This study does not involve any animal participants.


About this article

Cite this article

Shahin, I., Nassif, A.B. Three-stage speaker verification architecture in emotional talking environments. Int J Speech Technol 21, 915–930 (2018). https://doi.org/10.1007/s10772-018-9543-4


Received: 03 April 2018
Accepted: 26 July 2018
Published: 28 August 2018
Issue Date: 15 December 2018
DOI: https://doi.org/10.1007/s10772-018-9543-4
