Novel cascaded Gaussian mixture model-deep neural network classifier for speaker identification in emotional talking environments

  • Original Article
  • Published in Neural Computing and Applications

Abstract

This research presents an effective approach to enhancing text-independent speaker identification performance in emotional talking environments, based on a novel classifier: the cascaded Gaussian Mixture Model-Deep Neural Network (GMM-DNN). The work proposes, implements, and evaluates this cascaded classifier, and the results show that it improves speaker identification performance across various emotions on two distinct speech databases: an Emirati speech database (an Arabic dataset from the United Arab Emirates) and the English "Speech Under Simulated and Actual Stress" (SUSAS) dataset. The proposed classifier outperforms classical classifiers such as the multilayer perceptron (MLP) and the support vector machine (SVM) on each dataset, and the speaker identification performance attained with the cascaded GMM-DNN is comparable to that obtained from subjective assessment by human listeners.
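Since the abstract describes the cascade only at a high level, the sketch below illustrates one plausible reading of a cascaded GMM-DNN speaker identifier, not the authors' exact architecture: a first stage of per-speaker Gaussian mixture models scores the spectral frames (e.g., MFCCs, a common choice for this task) of an utterance, and the resulting vector of per-speaker log-likelihoods is fed to a small deep neural network for the final decision. The use of scikit-learn, the layer sizes, and the helper names are all illustrative assumptions.

```python
# A minimal sketch of a cascaded GMM-DNN speaker identifier,
# assuming precomputed MFCC features; the paper's exact cascade may differ.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.neural_network import MLPClassifier


def train_gmm_stage(train_mfccs, n_components=16):
    """Stage 1: fit one diagonal-covariance GMM per enrolled speaker.

    train_mfccs maps speaker id -> (n_frames, n_coeffs) MFCC array.
    """
    gmms = {}
    for speaker, frames in train_mfccs.items():
        gmms[speaker] = GaussianMixture(
            n_components=n_components, covariance_type="diag",
            random_state=0).fit(frames)
    return gmms


def gmm_score_vector(gmms, frames):
    """Score an utterance against every speaker GMM; the per-speaker
    average frame log-likelihoods become a fixed-length feature vector."""
    return np.array([gmms[s].score(frames) for s in sorted(gmms)])


def train_dnn_stage(gmms, utterances, labels):
    """Stage 2: train a small feed-forward DNN on the GMM score vectors.

    utterances is a list of MFCC arrays; labels holds the speaker ids.
    """
    X = np.stack([gmm_score_vector(gmms, u) for u in utterances])
    dnn = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=500,
                        random_state=0)
    return dnn.fit(X, labels)


def identify(gmms, dnn, frames):
    """Full cascade: GMM scoring followed by DNN classification."""
    return dnn.predict(gmm_score_vector(gmms, frames).reshape(1, -1))[0]
```

The design point illustrated here is that the GMM stage maps a variable-length utterance to a fixed-length score vector, which is what allows a standard feed-forward network to act as the second-stage classifier.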


Acknowledgements

Ismail Shahin, Ali Bou Nassif and Shibani Hamsa would like to thank the University of Sharjah for funding their work through the two competitive research projects entitled "Emotion Recognition in each of Stressful and Emotional Talking Environments Using Artificial Models," No. 1602040348-P, and "Capturing, Studying, and Analyzing Arabic Emirati-Accented Speech Database in Stressful and Emotional Talking Environments for Different Applications," No. 1602040349-P.

Author information

Corresponding author

Correspondence to Ismail Shahin.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Human and animal rights

This study does not involve any experiments on animals.

Consent of parents of minors

This study includes a small number of speakers who are under 18 years of age. Consent from the minors' parents was obtained before conducting the experiments.

About this article

Cite this article

Shahin, I., Nassif, A.B. & Hamsa, S. Novel cascaded Gaussian mixture model-deep neural network classifier for speaker identification in emotional talking environments. Neural Comput & Applic 32, 2575–2587 (2020). https://doi.org/10.1007/s00521-018-3760-2

