Abstract
This work presents an effective approach to enhancing text-independent speaker identification performance in emotional talking environments, based on a novel classifier: the cascaded Gaussian Mixture Model-Deep Neural Network (GMM-DNN). We propose, implement, and evaluate this classifier on two distinct speech databases: an Emirati speech database (an Arabic dataset collected in the United Arab Emirates) and the English "Speech Under Simulated and Actual Stress" (SUSAS) dataset. The results show that the cascaded GMM-DNN classifier improves speaker identification performance across a range of emotions and outperforms classical classifiers such as the multilayer perceptron and the support vector machine on each dataset. Speaker identification performance attained with the cascaded GMM-DNN is comparable to that obtained from subjective assessment by human listeners.
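This excerpt does not detail the cascade's internals, so the following is only a minimal illustrative sketch of the general GMM-then-DNN idea: one GMM is trained per speaker, and the vector of per-speaker log-likelihood scores for an utterance is passed to a small neural network for the final decision. scikit-learn's MLPClassifier stands in for the paper's DNN, the synthetic features stand in for real acoustic features, and all names and parameters below are hypothetical.

```python
# Illustrative sketch (not the authors' implementation) of a cascaded
# GMM -> neural-network speaker identifier on synthetic data.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
n_speakers, n_feats = 4, 12

def speaker_frames(spk, n=200):
    # Synthetic "MFCC-like" feature frames for one speaker; each
    # speaker's frames are centered at a different mean.
    return rng.normal(loc=spk, scale=1.0, size=(n, n_feats))

# Stage 1: train one GMM per enrolled speaker on that speaker's frames.
gmms = {s: GaussianMixture(n_components=2, random_state=0).fit(speaker_frames(s))
        for s in range(n_speakers)}

def llk_vector(frames):
    # Average per-frame log-likelihood of an utterance under each
    # speaker's GMM; this score vector is the input to stage 2.
    return np.array([gmms[s].score(frames) for s in range(n_speakers)])

# Stage 2: train a small neural network on GMM score vectors.
X_rows, y = [], []
for s in range(n_speakers):
    for _ in range(30):                      # 30 short utterances per speaker
        X_rows.append(llk_vector(speaker_frames(s, 50)))
        y.append(s)
dnn = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000,
                    random_state=0).fit(np.array(X_rows), y)

# Identify an unseen utterance from speaker 2.
pred = dnn.predict(llk_vector(speaker_frames(2, 50)).reshape(1, -1))
```

The design point of such a cascade is that the GMM stage compresses a variable-length utterance into a fixed-length, speaker-discriminative score vector, which the DNN stage can then classify regardless of utterance duration.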
Acknowledgements
Ismail Shahin, Ali Bou Nassif and Shibani Hamsa would like to thank “University of Sharjah” for funding their work through the two competitive research projects entitled “Emotion Recognition in each of Stressful and Emotional Talking Environments Using Artificial Models,” No. 1602040348-P, and “Capturing, Studying, and Analyzing Arabic Emirati-Accented Speech Database in Stressful and Emotional Talking Environments for Different Applications,” No. 1602040349-P.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Human and animal rights
This study does not involve any experiments on animals.
Consent of minors' parents
This study includes a small number of speakers under 18 years of age. Consent from the minors' parents was obtained before conducting the experiments.
Cite this article
Shahin, I., Nassif, A.B. & Hamsa, S. Novel cascaded Gaussian mixture model-deep neural network classifier for speaker identification in emotional talking environments. Neural Comput & Applic 32, 2575–2587 (2020). https://doi.org/10.1007/s00521-018-3760-2