Deep Convolutional Neural Network and Gray Wolf Optimization Algorithm for Speech Emotion Recognition

Published in Circuits, Systems, and Signal Processing

Abstract

Speech emotion recognition (SER), an important enabler of emotional human–machine interaction, has been the focus of much research in recent years. Motivated by the power of Deep Convolutional Neural Networks (DCNNs) to learn features, and by their landmark success in image classification, the present study prepares a pre-trained DCNN model for SER and supplies it with compatible input by converting the speech signal into a 3D tensor. First, each speech sample is embedded in a three-dimensional reconstructed phase space; studies have shown that the patterns formed in this space carry meaningful emotional characteristics of the speaker. To provide input compatible with a DCNN, a new speech-signal representation called the Chaogram is introduced as the projection of these patterns, yielding three channels analogous to those of an RGB image. Next, image enhancement techniques are applied to highlight the details of the Chaogram images. The Visual Geometry Group (VGG) DCNN, pre-trained on the large ImageNet dataset, is then used to learn high-level Chaogram features and the corresponding emotion classes. Finally, transfer learning is performed and the model is fine-tuned on our datasets. To optimize the hyper-parameter configuration of the architecture-determined CNN, an innovative DCNN-GWO (gray wolf optimization) scheme is also presented. Results on two public emotion datasets, EMO-DB and eNTERFACE05, show the promising performance of the proposed model, which can greatly improve SER applications.
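
To make the pipeline above concrete, here is a minimal Python sketch of one plausible reading of the Chaogram construction: the signal is embedded in a 3D reconstructed phase space via time-delay embedding, and the trajectory is projected onto the three coordinate planes to form three image channels. The delay tau, the 224 × 224 output size, and the log-compression step are illustrative assumptions; the paper's actual parameters and enhancement techniques are not specified in the abstract.

```python
# Illustrative sketch only: tau=8 and the 224x224 output size are assumed,
# not taken from the paper.
import numpy as np

def reconstruct_phase_space(x, tau=8, dim=3):
    """Time-delay embedding: map a 1-D signal x to points
    [x(t), x(t+tau), x(t+2*tau)] in a 3-D reconstructed phase space."""
    n = len(x) - (dim - 1) * tau
    return np.stack([x[i * tau : i * tau + n] for i in range(dim)], axis=1)

def chaogram(x, tau=8, size=224):
    """Project the 3-D trajectory onto the xy, xz, and yz planes and
    accumulate point densities into three channels, analogous to RGB."""
    pts = reconstruct_phase_space(x, tau=tau, dim=3)
    # Normalize each axis to [0, size-1] pixel indices.
    lo, hi = pts.min(axis=0), pts.max(axis=0)
    idx = ((pts - lo) / (hi - lo + 1e-12) * (size - 1)).astype(int)
    img = np.zeros((size, size, 3), dtype=np.float32)
    for c, (a, b) in enumerate([(0, 1), (0, 2), (1, 2)]):
        np.add.at(img[:, :, c], (idx[:, a], idx[:, b]), 1.0)
    # Log-compress and rescale to [0, 1] to highlight trajectory detail
    # (a stand-in for the paper's image enhancement step).
    img = np.log1p(img)
    return img / (img.max() + 1e-12)

signal = np.random.randn(16000).astype(np.float32)  # 1 s of synthetic "speech"
tensor = chaogram(signal)                           # shape (224, 224, 3)
print(tensor.shape, tensor.dtype)
```

The resulting (224, 224, 3) tensor matches the three-channel input of an ImageNet pre-trained VGG, which is what makes transfer learning from image classification possible here. Likewise, the following compact sketch shows the grey wolf optimizer of Mirjalili et al. [42] as it might be applied to hyper-parameter tuning; the two tuned parameters, their bounds, the population size, the iteration count, and the placeholder `evaluate` objective (which would be replaced by the validation loss of the trained DCNN) are all assumptions for illustration.

```python
# Minimal grey wolf optimizer sketch: wolves move toward the three best
# solutions (alpha, beta, delta) while the exploration factor `a` decays.
import numpy as np

rng = np.random.default_rng(0)
lb = np.array([1e-5, 16.0])    # assumed bounds: learning rate, batch size
ub = np.array([1e-2, 128.0])

def evaluate(pos):
    # Placeholder objective; in the paper this would be the DCNN's
    # validation error for the hyper-parameters in `pos`.
    return float(np.sum((pos - (lb + ub) / 2) ** 2))

n_wolves, n_iter = 10, 30
wolves = lb + rng.random((n_wolves, 2)) * (ub - lb)
for t in range(n_iter):
    fitness = np.array([evaluate(w) for w in wolves])
    alpha, beta, delta = wolves[np.argsort(fitness)[:3]]
    a = 2 - 2 * t / n_iter                    # decreases linearly 2 -> 0
    for i in range(n_wolves):
        candidates = []
        for leader in (alpha, beta, delta):
            r1, r2 = rng.random(2), rng.random(2)
            A, C = 2 * a * r1 - a, 2 * r2
            candidates.append(leader - A * np.abs(C * leader - wolves[i]))
        wolves[i] = np.clip(np.mean(candidates, axis=0), lb, ub)
print("best hyper-parameters found:", alpha)
```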

Data Availability

The data that support the findings of this study are freely available at [10, 41].

References

  1. B.J. Abbaschian, D. Sierra-Sosa, A. Elmaghraby, Deep learning techniques for speech emotion recognition, from databases to models. Sensors 21(4), 1249 (2021). https://doi.org/10.3390/s21041249

  2. M. Ahsanullah, B.G. Kibria, M. Shakil, Normal distribution, in Normal and Student's t Distributions and Their Applications (Springer, 2014)

  3. P.J.M. Ali, R.H. Faraj, Data normalization and standardization: a technical report. Mach. Learn. Tech. Rep. 1(1), 1–6 (2014). https://doi.org/10.13140/RG.2.2.28948.04489

  4. H. Altun, G. Polat, Boosting selection of speech related features to improve performance of multi-class SVMs in emotion detection. Expert Syst. Appl. 36(4), 8197–8203 (2009). https://doi.org/10.1016/j.eswa.2008.10.005

  5. C.-N. Anagnostopoulos, T. Iliou, I. Giannoukos, Features and classifiers for emotion recognition from speech: a survey from 2000 to 2011. Artif. Intell. Rev. 43(2), 155–177 (2015). https://doi.org/10.1007/s10462-012-9368-5

  6. M. El Ayadi, M.S. Kamel, F. Karray, Survey on speech emotion recognition: Features, classification schemes, and databases. Pattern Recogn. 44(3), 572–587 (2011). https://doi.org/10.1016/j.patcog.2010.09.020

  7. A. Bakhshi, S. Chalup, A. Harimi, S.M. Mirhassani, Recognition of emotion from speech using evolutionary cepstral coefficients. Multim. Tools Appl. 79(47), 35739–35759 (2020). https://doi.org/10.1007/s11042-020-09591-1

  8. A. Bhavan, P. Chauhan, R.R. Shah, Bagged support vector machines for emotion recognition from speech. Knowl.-Based Syst. 184, 104886 (2019). https://doi.org/10.1016/j.knosys.2019.104886

  9. E. Bozkurt, E. Erzin, C.E. Erdem, A.T. Erdem, Formant position based weighted spectral features for emotion recognition. Speech Commun. 53(9–10), 1186–1197 (2011). https://doi.org/10.1016/j.specom.2011.04.003

  10. F. Burkhardt, A. Paeschke, M. Rolfes, W.F. Sendlmeier, B. Weiss, "A database of German emotional speech," in Ninth European Conference on Speech Communication and Technology (Interspeech), 2005. https://doi.org/10.21437/Interspeech.2005-446

  11. Y. Chavhan, M. Dhore, P. Yesaware, Speech emotion recognition using support vector machine. Int. J. Computer Appl. 1(20), 6–9 (2010)

  12. F. Chollet, Deep Learning with Python (Manning, New York, 2018)

  13. F. Dellaert, T. Polzin, A. Waibel, "Recognizing emotion in speech," in Proceedings of the Fourth International Conference on Spoken Language Processing (ICSLP '96), vol. 3 (IEEE, 1996), pp. 1970–1973. https://doi.org/10.1109/ICSLP.1996.608022

  14. S. Demircan, H. Kahramanli, Application of fuzzy C-means clustering algorithm to spectral features for emotion classification from speech. Neural Comput. Appl. 29(8), 59–66 (2018). https://doi.org/10.1007/s00521-016-2712-y

  15. J. Deng, X. Xu, Z. Zhang, S. Frühholz, B. Schuller, Semisupervised autoencoders for speech emotion recognition. IEEE/ACM Trans. Audio, Speech, Language Process. 26(1), 31–43 (2017). https://doi.org/10.1109/TASLP.2017.2759338

  16. F. Eyben, Real-time speech and music classification by large audio feature space extraction (Springer, 2015)

  17. F. Eyben et al., The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing. IEEE Trans. Affect. Comput. 7(2), 190–202 (2015). https://doi.org/10.1109/TAFFC.2015.2457417

  18. M. Fallahzadeh, F. Farokhi, A. Harimi, R. Sabbaghi-Nadooshan, Facial expression recognition based on image gradient and deep convolutional neural network. J. AI Data Mining 9(2), 259–268 (2021)

  19. H. Faris, I. Aljarah, M.A. Al-Betar, S. Mirjalili, Grey wolf optimizer: a review of recent variants and applications. Neural Comput. Appl. 30(2), 413–435 (2018). https://doi.org/10.1007/s00521-017-3272-5

  20. M. Giollo, D. Gunceler, Y. Liu, D. Willett, "Bootstrap an end-to-end ASR system by multilingual training, transfer learning, text-to-text mapping and synthetic audio," arXiv preprint arXiv:2011.12696 (2020). https://doi.org/10.48550/arXiv.2011.12696

  21. N. Hajarolasvadi, H. Demirel, 3D CNN-based speech emotion recognition using k-means clustering and spectrograms. Entropy 21(5), 479 (2019). https://doi.org/10.3390/e21050479

  22. K. Han, D. Yu, I. Tashev, "Speech emotion recognition using deep neural network and extreme learning machine," in Fifteenth Annual Conference of the International Speech Communication Association, 2014

  23. A. Harimi, A. AhmadyFard, A. Shahzadi, K. Yaghmaie, Anger or joy? Emotion recognition using nonlinear dynamics of speech. Appl. Artif. Intell. 29(7), 675–696 (2015). https://doi.org/10.1080/08839514.2015.1051891

  24. A. Harimi, H.S. Fakhr, A. Bakhshi, Recognition of emotion using reconstructed phase space of speech. Malays. J. Comput. Sci. 29(4), 262–271 (2016). https://doi.org/10.22452/mjcs.vol29no4.2

  25. K. He, X. Zhang, S. Ren, J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016), pp. 770–778. https://doi.org/10.48550/arXiv.1512.03385

  26. Z. Huang, M. Dong, Q. Mao, Y. Zhan, "Speech emotion recognition using CNN," in Proceedings of the 22nd ACM International Conference on Multimedia (ACM, 2014), pp. 801–804

  27. F. Hutter, L. Kotthoff, J. Vanschoren, Automated machine learning: methods, systems, challenges (Springer Nature, 2019)

  28. K.M. Indrebo, R.J. Povinelli, M.T. Johnson, Sub-banded reconstructed phase spaces for speech recognition. Speech Commun. 48(7), 760–774 (2006). https://doi.org/10.1016/j.specom.2004.12.002

  29. D. Issa, M.F. Demirci, A. Yazici, Speech emotion recognition with deep convolutional neural networks. Biomed. Signal Process. Control 59, 101894 (2020). https://doi.org/10.1016/j.bspc.2020.101894

  30. M.T. Johnson, A.C. Lindgren, R.J. Povinelli, X. Yuan, "Performance of nonlinear speech enhancement using phase space reconstruction," in 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '03), vol. 1 (IEEE, 2003). https://doi.org/10.1109/ICASSP.2003.1198932

  31. V. Joshi, R. Zhao, R.R. Mehta, K. Kumar, J. Li, "Transfer learning approaches for streaming end-to-end speech recognition system," arXiv preprint arXiv:2008.05086 (2020). https://doi.org/10.48550/arXiv.2008.05086

  32. M.B. Kennel, R. Brown, H.D. Abarbanel, Determining embedding dimension for phase-space reconstruction using a geometrical construction. Phys. Rev. A 45(6), 3403 (1992). https://doi.org/10.1103/PhysRevA.45.3403

  33. R.A. Khalil, E. Jones, M.I. Babar, T. Jan, M.H. Zafar, T. Alhussain, Speech emotion recognition using deep learning techniques: a review. IEEE Access 7, 117327–117345 (2019). https://doi.org/10.1109/ACCESS.2019.2936124

  34. E.H. Kim, K.H. Hyun, S.H. Kim, Y.K. Kwak, Improved emotion recognition with a novel speaker-independent feature. IEEE/ASME Trans. Mechatron. 14(3), 317–325 (2009). https://doi.org/10.1109/TMECH.2008.2008644

  35. Y. Kim, H. Lee, E.M. Provost, "Deep learning for robust feature generation in audiovisual emotion recognition," in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (IEEE, 2013), pp. 3687–3691. https://doi.org/10.1109/ICASSP.2013.6638346

  36. J. Krajewski, S. Schnieder, D. Sommer, A. Batliner, B. Schuller, Applying multiple classifiers and non-linear dynamics features for detecting sleepiness from speech. Neurocomputing 84, 65–75 (2012). https://doi.org/10.1016/j.neucom.2011.12.021

  37. A. Krizhevsky, I. Sutskever, G.E. Hinton, Imagenet classification with deep convolutional neural networks. Adv. Neural. Inf. Process. Syst. 25, 1097–1105 (2012). https://doi.org/10.1145/3065386

  38. E. Lieskovská, M. Jakubec, R. Jarina, M. Chmulík, A Review on Speech Emotion Recognition Using Deep Learning and Attention Mechanism. Electronics 10(10), 1163 (2021). https://doi.org/10.3390/electronics10101163

  39. I. Luengo, E. Navas, I. Hernáez, Feature analysis and evaluation for automatic emotion identification in speech. IEEE Trans. Multim. 12(6), 490–501 (2010). https://doi.org/10.1109/TMM.2010.2051872

  40. H.-G. Ma, C.-Z. Han, Selection of embedding dimension and delay time in phase space reconstruction. Front. Electr. Electron. Eng. China 1(1), 111–114 (2006)

  41. O. Martin, I. Kotsia, B. Macq, I. Pitas, "The eNTERFACE'05 audio-visual emotion database," in 22nd International Conference on Data Engineering Workshops (ICDEW'06) (IEEE, 2006). https://doi.org/10.1109/ICDEW.2006.145

  42. S. Mirjalili, S.M. Mirjalili, A. Lewis, Grey wolf optimizer. Adv. Eng. Softw. 69, 46–61 (2014). https://doi.org/10.1016/j.advengsoft.2013.12.007

  43. H. Moayedi, H. Nguyen, L. Kok Foong, Nonlinear evolutionary swarm intelligence of grasshopper optimization algorithm and gray wolf optimization for weight adjustment of neural network. Eng. Computers 37(2), 1265–1275 (2021). https://doi.org/10.1007/s00366-019-00882-2

  44. J. Nicholson, K. Takahashi, R. Nakatsu, Emotion recognition in speech using neural networks. Neural Comput. Appl. 9(4), 290–296 (2000). https://doi.org/10.1007/s005210070006

  45. Y. Niu, D. Zou, Y. Niu, Z. He, H. Tan, "A breakthrough in speech emotion recognition using deep retinal convolution neural networks," arXiv preprint arXiv:1707.09917 (2017). https://doi.org/10.48550/arXiv.1707.09917

  46. T.-L. Pao, C.S. Chien, Y.-T. Chen, J.-H. Yeh, Y.-M. Cheng, W.-Y. Liao, "Combination of multiple classifiers for improving emotion recognition in Mandarin speech," in Third International Conference on Intelligent Information Hiding and Multimedia Signal Processing (IIH-MSP 2007), vol. 1 (IEEE, 2007), pp. 35–38. https://doi.org/10.1109/IIHMSP.2007.4457487

  47. C.L. Phillips, J.M. Parr, E.A. Riskin, T. Prabhakar, Signals, systems, and transforms (Prentice Hall, 2003)

  48. P. Prajith, "Investigations on the applications of dynamical instabilities and deterministic chaos for speech signal processing," 2008.

  49. B. Schuller, G. Rigoll, M. Lang, "Hidden Markov model-based speech emotion recognition," in 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '03), vol. 2 (IEEE, 2003), pp. II-1. https://doi.org/10.1109/ICME.2003.1220939

  50. B. Schuller, G. Rigoll, M. Lang, "Speech emotion recognition combining acoustic features and linguistic information in a hybrid support vector machine-belief network architecture," in 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 1 (IEEE, 2004), pp. I-577. https://doi.org/10.1109/ICASSP.2004.1326051

  51. B. Schuller, B. Vlasenko, F. Eyben, G. Rigoll, A. Wendemuth, "Acoustic emotion recognition: a benchmark comparison of performances," in 2009 IEEE Workshop on Automatic Speech Recognition and Understanding (IEEE, 2009), pp. 552–557. https://doi.org/10.1109/ASRU.2009.5372886

  52. A. Shahzadi, A. Ahmadyfard, A. Harimi, K. Yaghmaie, Speech emotion recognition using nonlinear dynamics features. Turkish J. Electr. Eng. Computer Sci. 23, 871 (2015). https://doi.org/10.3906/elk-1302-90

  53. Y. Shekofteh, F. Almasganj, Feature extraction based on speech attractors in the reconstructed phase space for automatic speech recognition systems. ETRI J. 35(1), 100–108 (2013). https://doi.org/10.4218/etrij.13.0112.0074

  54. K. Simonyan, A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556 (2014). https://doi.org/10.48550/arXiv.1409.1556

  55. A. Stuhlsatz, C. Meyer, F. Eyben, T. Zielke, G. Meier, B. Schuller, "Deep neural networks for acoustic emotion recognition: raising the benchmarks," in 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (IEEE, 2011), pp. 5688–5691. https://doi.org/10.1109/ICASSP.2011.5947651

  56. Y. Sun, X.-Y. Zhang, J.-H. Ma, C.-X. Song, H.-F. Lv, Nonlinear dynamic feature extraction based on phase space reconstruction for the classification of speech and emotion. Math. Probl. Eng. 2020, 9452976 (2020). https://doi.org/10.1155/2020/9452976

  57. C. Szegedy et al., "Going deeper with convolutions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015), pp. 1–9. https://doi.org/10.48550/arXiv.1409.4842

  58. G. Trigeorgis et al., "Adieu features? End-to-end speech emotion recognition using a deep convolutional recurrent network," in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (IEEE, 2016), pp. 5200–5204. https://doi.org/10.1109/ICASSP.2016.7472669

  59. T. Tuncer, S. Dogan, U.R. Acharya, Automated accurate speech emotion recognition system using twine shuffle pattern and iterative neighborhood component analysis techniques. Knowl.-Based Syst. 211, 106547 (2021). https://doi.org/10.1016/j.knosys.2020.106547

  60. D. Ververidis, C. Kotropoulos, "Emotional speech classification using Gaussian mixture models and the sequential floating forward selection algorithm," in 2005 IEEE International Conference on Multimedia and Expo (IEEE, 2005), pp. 1500–1503. https://doi.org/10.1109/ICME.2005.1521717

  61. Y. Wang, H. Zhang, G. Zhang, cPSO-CNN: An efficient PSO-based algorithm for fine-tuning hyper-parameters of convolutional neural networks. Swarm Evol. Comput. 49, 114–123 (2019). https://doi.org/10.1016/j.swevo.2019.06.002

  62. S. Wu, T.H. Falk, W.-Y. Chan, Automatic speech emotion recognition using modulation spectral features. Speech Commun. 53(5), 768–785 (2011). https://doi.org/10.1016/j.specom.2010.08.013

  63. Y. Xie, R. Liang, Z. Liang, C. Huang, C. Zou, B. Schuller, Speech emotion classification using attention-based LSTM. IEEE/ACM Trans. Audio, Speech, Language Process. 27(11), 1675–1685 (2019). https://doi.org/10.1109/TASLP.2019.2925934

  64. X. Xu et al., A two-dimensional framework of multiple kernel subspace learning for recognizing emotion in speech. IEEE/ACM Trans. Audio, Speech, Language Process. 25(7), 1436–1449 (2017). https://doi.org/10.1109/TASLP.2017.2694704

  65. B. Yang, M. Lugger, Emotion recognition from speech signals using new harmony features. Signal Process. 90(5), 1415–1423 (2010). https://doi.org/10.1016/j.sigpro.2009.09.009

  66. S. Zhalehpour, O. Onder, Z. Akhtar, C.E. Erdem, BAUM-1: A spontaneous audio-visual face database of affective and mental states. IEEE Trans. Affect. Comput. 8(3), 300–313 (2016). https://doi.org/10.1109/TAFFC.2016.2553038

  67. S. Zhang, S. Zhang, T. Huang, W. Gao, Speech emotion recognition using deep convolutional neural network and discriminant temporal pyramid matching. IEEE Trans. Multim. 20(6), 1576–1590 (2017). https://doi.org/10.1109/TMM.2017.2766843

  68. Z. Zhang, E. Coutinho, J. Deng, B. Schuller, Cooperative learning and its application to emotion recognition from speech. IEEE/ACM Trans. Audio Speech Language Process. 23(1), 115–126 (2015). https://doi.org/10.1109/TASLP.2014.2375558

  69. J. Zhao, X. Mao, L. Chen, Learning deep features to recognise speech emotion using merged deep CNN. IET Signal Proc. 12(6), 713–721 (2018). https://doi.org/10.1049/iet-spr.2017.0320

  70. J. Zhao, X. Mao, L. Chen, Speech emotion recognition using deep 1D and 2D CNN LSTM networks. Biomed. Signal Process. Control 47, 312–323 (2019). https://doi.org/10.1016/j.bspc.2018.08.035

Author information

Corresponding author

Correspondence to Fardad Farokhi.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Falahzadeh, M.R., Farokhi, F., Harimi, A. et al. Deep Convolutional Neural Network and Gray Wolf Optimization Algorithm for Speech Emotion Recognition. Circuits Syst Signal Process 42, 449–492 (2023). https://doi.org/10.1007/s00034-022-02130-3
