Speech Expression Multimodal Emotion Recognition Based on Deep Belief Network


To address the insufficient information and poor recognition rate of single-modality emotion recognition, a multimodal emotion recognition method based on a deep belief network is proposed. First, speech and expression signals are preprocessed and features are extracted to obtain high-level features for each single-modality signal. Then, the high-level speech and expression features are fused by a bimodal deep belief network (BDBN), which yields multimodal fusion features for classification and removes redundant information between the modalities. Finally, the multimodal fusion features are classified by LIBSVM to produce the final emotion recognition result. The proposed model is evaluated experimentally on the Friends data set. The results show that the multimodal fusion features achieve the best recognition accuracy, 90.89%, and that the unweighted recognition accuracy of the proposed model is 86.17%, which is better than that of the compared methods and indicates that the approach has research value and practical applicability.
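The fusion step described in the abstract can be illustrated with a small sketch. The abstract does not give layer sizes, the training procedure, or the feature dimensions, so everything below is an assumption: each modality gets its own restricted Boltzmann machine (RBM) trained with one-step contrastive divergence, a common way to build deep belief networks, and a joint RBM is trained on the concatenated hidden activations. The names `RBM` and `fuse`, all dimensions, and the random input data are hypothetical and are not taken from the paper.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class RBM:
    """Bernoulli RBM trained with one-step contrastive divergence (CD-1).

    Hypothetical stand-in for one layer of a deep belief network.
    """

    def __init__(self, n_visible, n_hidden, rng):
        self.rng = rng
        self.W = 0.01 * rng.standard_normal((n_visible, n_hidden))
        self.b_v = np.zeros(n_visible)
        self.b_h = np.zeros(n_hidden)

    def hidden_probs(self, v):
        return sigmoid(v @ self.W + self.b_h)

    def visible_probs(self, h):
        return sigmoid(h @ self.W.T + self.b_v)

    def fit(self, data, epochs=20, lr=0.1):
        n = len(data)
        for _ in range(epochs):
            # Positive phase: hidden activations driven by the data.
            h0 = self.hidden_probs(data)
            h_sample = (self.rng.random(h0.shape) < h0).astype(float)
            # Negative phase: one reconstruction step.
            v1 = self.visible_probs(h_sample)
            h1 = self.hidden_probs(v1)
            # Full-batch CD-1 parameter updates.
            self.W += lr * (data.T @ h0 - v1.T @ h1) / n
            self.b_v += lr * (data - v1).mean(axis=0)
            self.b_h += lr * (h0 - h1).mean(axis=0)
        return self


def fuse(speech, face, n_hidden=16, n_joint=8, seed=0):
    """Train one RBM per modality, then a joint RBM on their hidden layers."""
    rng = np.random.default_rng(seed)
    speech_rbm = RBM(speech.shape[1], n_hidden, rng).fit(speech)
    face_rbm = RBM(face.shape[1], n_hidden, rng).fit(face)
    stacked = np.hstack(
        [speech_rbm.hidden_probs(speech), face_rbm.hidden_probs(face)]
    )
    joint_rbm = RBM(stacked.shape[1], n_joint, rng).fit(stacked)
    return joint_rbm.hidden_probs(stacked)


# Synthetic stand-in features scaled to [0, 1]; real inputs would be the
# high-level speech and expression features produced by the earlier stages.
data_rng = np.random.default_rng(42)
speech_feats = data_rng.random((32, 40))
face_feats = data_rng.random((32, 60))

fused = fuse(speech_feats, face_feats)
print(fused.shape)  # (32, 8)
```

In the pipeline the abstract describes, the fused features would then be passed to an SVM classifier such as LIBSVM. That step is omitted here to keep the sketch dependent on NumPy only.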





This work was supported in part by the Natural Science Foundation of Shandong Province of China under Grant ZR2016AM30, in part by the Social Science Planning Research Project of Shandong Province under Grant 18CLYJ50, in part by the Shandong Soft Science Research Program under Grant 2018RKB01144, and in part by the Project of Shandong Province Higher Educational Science and Technology Program under Grant J15LN15.

Author information



Corresponding author

Correspondence to Longxi Chen.


Cite this article

Liu, D., Chen, L., Wang, Z. et al. Speech Expression Multimodal Emotion Recognition Based on Deep Belief Network. J Grid Computing 19, 22 (2021). https://doi.org/10.1007/s10723-021-09564-0



  • Bimodal deep belief network
  • Speech signal
  • Expression signal
  • Multimodal emotion recognition