Abstract
This work presents an effective approach to enhancing text-independent speaker identification performance in emotional talking environments, based on a novel classifier: the cascaded Gaussian Mixture Model-Deep Neural Network (GMM-DNN). We propose, implement, and evaluate this classifier on two distinct speech databases: an Emirati speech database (an Arabic dataset collected in the United Arab Emirates) and the English "Speech Under Simulated and Actual Stress" (SUSAS) dataset. The results show that the cascaded GMM-DNN classifier improves speaker identification performance across a range of emotions and outperforms classical classifiers such as the multilayer perceptron and the support vector machine on each dataset. Speaker identification performance attained with the cascaded GMM-DNN is comparable to that obtained from subjective assessment by human listeners.
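This excerpt does not detail the cascade's internals, so the following is only a minimal illustrative sketch of the general GMM-then-DNN idea: one GMM is trained per speaker, and the vector of per-speaker log-likelihood scores for an utterance is passed to a small neural network for the final decision. scikit-learn's MLPClassifier stands in for the paper's DNN, the synthetic features stand in for real acoustic features, and all names and parameters below are hypothetical.

```python
# Illustrative sketch (not the authors' implementation) of a cascaded
# GMM -> neural-network speaker identifier on synthetic data.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
n_speakers, n_feats = 4, 12

def speaker_frames(spk, n=200):
    # Synthetic "MFCC-like" feature frames for one speaker; each
    # speaker's frames are centered at a different mean.
    return rng.normal(loc=spk, scale=1.0, size=(n, n_feats))

# Stage 1: train one GMM per enrolled speaker on that speaker's frames.
gmms = {s: GaussianMixture(n_components=2, random_state=0).fit(speaker_frames(s))
        for s in range(n_speakers)}

def llk_vector(frames):
    # Average per-frame log-likelihood of an utterance under each
    # speaker's GMM; this score vector is the input to stage 2.
    return np.array([gmms[s].score(frames) for s in range(n_speakers)])

# Stage 2: train a small neural network on GMM score vectors.
X_rows, y = [], []
for s in range(n_speakers):
    for _ in range(30):                      # 30 short utterances per speaker
        X_rows.append(llk_vector(speaker_frames(s, 50)))
        y.append(s)
dnn = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000,
                    random_state=0).fit(np.array(X_rows), y)

# Identify an unseen utterance from speaker 2.
pred = dnn.predict(llk_vector(speaker_frames(2, 50)).reshape(1, -1))
```

The design point of such a cascade is that the GMM stage compresses a variable-length utterance into a fixed-length, speaker-discriminative score vector, which the DNN stage can then classify regardless of utterance duration.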
Acknowledgements
Ismail Shahin, Ali Bou Nassif and Shibani Hamsa would like to thank “University of Sharjah” for funding their work through the two competitive research projects entitled “Emotion Recognition in each of Stressful and Emotional Talking Environments Using Artificial Models,” No. 1602040348-P, and “Capturing, Studying, and Analyzing Arabic Emirati-Accented Speech Database in Stressful and Emotional Talking Environments for Different Applications,” No. 1602040349-P.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Human and animal rights
This study does not involve any experiments on animals.
Consent of minors' parents
This study includes a small number of speakers under 18 years of age. Consent from the minors' parents was obtained before conducting the experiments.
Cite this article
Shahin, I., Nassif, A.B. & Hamsa, S. Novel cascaded Gaussian mixture model-deep neural network classifier for speaker identification in emotional talking environments. Neural Comput & Applic 32, 2575–2587 (2020). https://doi.org/10.1007/s00521-018-3760-2