Abstract
Speech emotion recognition (SER), an important component of emotional human–machine interaction, has attracted considerable research attention in recent years. Motivated by the ability of deep convolutional neural networks (DCNNs) to learn features, and by their landmark success in image classification, this study adapts a pre-trained DCNN to SER by converting a speech signal into a compatible 3D tensor. First, each speech sample is embedded in a three-dimensional reconstructed phase space; prior studies have shown that the patterns formed in this space carry meaningful emotional characteristics of the speaker. To obtain DCNN-compatible input, a new speech signal representation called the Chaogram is introduced: projections of these phase-space patterns yield three channels analogous to an RGB image. Image enhancement techniques are then applied to highlight the details of the Chaogram images. Next, a Visual Geometry Group (VGG) DCNN pre-trained on the large ImageNet dataset is used to learn high-level Chaogram features and the corresponding emotion classes; transfer learning is performed and the model is fine-tuned on our datasets. To optimize the hyper-parameters of the resulting CNN architecture, an innovative DCNN-GWO (grey wolf optimization) scheme is also presented. Results on two public emotion datasets, EMO-DB and eNTERFACE05, show the promising performance of the proposed model, which can greatly benefit SER applications.
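The first stage of the pipeline (delay embedding of the speech signal into a 3D phase space, then projection onto three planes to form RGB-like channels) can be sketched as below. The paper's exact delay, embedding parameters, and rasterisation are not given in the abstract, so `reconstruct_phase_space` and `chaogram_channels` here are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def reconstruct_phase_space(signal, delay, dim=3):
    """Delay-embed a 1-D signal into a dim-dimensional phase-space trajectory."""
    n = len(signal) - (dim - 1) * delay
    return np.stack([signal[i * delay : i * delay + n] for i in range(dim)], axis=1)

def chaogram_channels(traj, size=64):
    """Project the 3-D trajectory onto the xy, xz, and yz planes and
    rasterise each projection into a size x size occupancy image,
    giving three RGB-like channels (hypothetical sketch)."""
    channels = []
    for a, b in [(0, 1), (0, 2), (1, 2)]:
        img, _, _ = np.histogram2d(traj[:, a], traj[:, b], bins=size)
        if img.max() > 0:                      # normalise to [0, 1]
            img = img / img.max()
        channels.append(img)
    return np.stack(channels, axis=-1)         # shape (size, size, 3)

# toy two-tone signal standing in for a speech frame
t = np.linspace(0, 1, 4000)
x = np.sin(2 * np.pi * 220 * t) + 0.3 * np.sin(2 * np.pi * 440 * t)

traj = reconstruct_phase_space(x, delay=7, dim=3)
tensor = chaogram_channels(traj, size=64)
print(tensor.shape)  # (64, 64, 3)
```

The resulting three-channel tensor has the same layout as an RGB image, so it can be resized and fed directly to an ImageNet-pretrained network such as VGG for transfer learning. In practice the delay and embedding dimension would be chosen by standard criteria (e.g. mutual information and false nearest neighbours) rather than fixed constants.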
Cite this article
Falahzadeh, M.R., Farokhi, F., Harimi, A. et al. Deep Convolutional Neural Network and Gray Wolf Optimization Algorithm for Speech Emotion Recognition. Circuits Syst Signal Process 42, 449–492 (2023). https://doi.org/10.1007/s00034-022-02130-3