Abstract
Speech emotion recognition (SER), an important component of emotional human–machine interaction, has attracted considerable research attention in recent years. Motivated by the ability of deep convolutional neural networks (DCNNs) to learn features, and by their landmark success in image classification, this study adapts a pre-trained DCNN to SER by converting a speech signal into a compatible 3D tensor. First, each speech sample is embedded in a three-dimensional reconstructed phase space; prior studies have shown that the patterns formed in this space carry meaningful emotional characteristics of the speaker. To obtain DCNN-compatible input, a new speech signal representation called the Chaogram is introduced: projections of these phase-space patterns yield three channels analogous to an RGB image. Image enhancement techniques are then applied to highlight the details of the Chaogram images. Next, a Visual Geometry Group (VGG) DCNN pre-trained on the large ImageNet dataset is used to learn high-level Chaogram features and the corresponding emotion classes; transfer learning is performed and the model is fine-tuned on our datasets. To optimize the hyper-parameters of the resulting CNN architecture, an innovative DCNN-GWO (grey wolf optimization) scheme is also presented. Results on two public emotion datasets, EMO-DB and eNTERFACE05, show the promising performance of the proposed model, which can greatly benefit SER applications.
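The first stage of the pipeline (delay embedding of the speech signal into a 3D phase space, then projection onto three planes to form RGB-like channels) can be sketched as below. The paper's exact delay, embedding parameters, and rasterisation are not given in the abstract, so `reconstruct_phase_space` and `chaogram_channels` here are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def reconstruct_phase_space(signal, delay, dim=3):
    """Delay-embed a 1-D signal into a dim-dimensional phase-space trajectory."""
    n = len(signal) - (dim - 1) * delay
    return np.stack([signal[i * delay : i * delay + n] for i in range(dim)], axis=1)

def chaogram_channels(traj, size=64):
    """Project the 3-D trajectory onto the xy, xz, and yz planes and
    rasterise each projection into a size x size occupancy image,
    giving three RGB-like channels (hypothetical sketch)."""
    channels = []
    for a, b in [(0, 1), (0, 2), (1, 2)]:
        img, _, _ = np.histogram2d(traj[:, a], traj[:, b], bins=size)
        if img.max() > 0:                      # normalise to [0, 1]
            img = img / img.max()
        channels.append(img)
    return np.stack(channels, axis=-1)         # shape (size, size, 3)

# toy two-tone signal standing in for a speech frame
t = np.linspace(0, 1, 4000)
x = np.sin(2 * np.pi * 220 * t) + 0.3 * np.sin(2 * np.pi * 440 * t)

traj = reconstruct_phase_space(x, delay=7, dim=3)
tensor = chaogram_channels(traj, size=64)
print(tensor.shape)  # (64, 64, 3)
```

The resulting three-channel tensor has the same layout as an RGB image, so it can be resized and fed directly to an ImageNet-pretrained network such as VGG for transfer learning. In practice the delay and embedding dimension would be chosen by standard criteria (e.g. mutual information and false nearest neighbours) rather than fixed constants.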
Cite this article
Falahzadeh, M.R., Farokhi, F., Harimi, A. et al. Deep Convolutional Neural Network and Gray Wolf Optimization Algorithm for Speech Emotion Recognition. Circuits Syst Signal Process 42, 449–492 (2023). https://doi.org/10.1007/s00034-022-02130-3