
Multimodal Emotion Analysis Based on Visual, Acoustic and Linguistic Features


Part of the Lecture Notes in Computer Science book series (LNCS,volume 13315)


This paper presents a computational reasoning framework that interprets the social signals of an interaction partner by focusing on that person's emotional state. Two distinct sources of social signals are used in this study: facial expressions and the voice. For the first modality, a Convolutional Neural Network (CNN) extracts and processes facial features from a live video stream. Voice emotion analysis comprises two sub-modalities, driven by CNN and Long Short-Term Memory (LSTM) networks, which analyze the acoustic and linguistic features of speech to identify possible emotional cues. Through multimodal information fusion, the system then combines these signals into a single hypothesis. The result of this reasoning is used to autonomously generate robot responses, shown as non-verbal facial animations projected onto the 'face' surface of the affective robot head PLEA. The robot's built-in functionalities provide a degree of situational embodiment, self-explainability and context-driven interaction.
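The abstract describes fusing per-modality emotion estimates into a single hypothesis but does not specify the fusion mechanism. As an illustrative sketch only, a common approach is decision-level fusion: each modality outputs a probability distribution over emotion classes, and the distributions are combined by a weighted average. The emotion labels, weights, and function names below are assumptions for illustration, not taken from the paper.

```python
import numpy as np

# Illustrative emotion label set; the paper's exact labels are not given here.
EMOTIONS = ["anger", "happiness", "sadness", "neutral"]

def fuse_modalities(face_probs, acoustic_probs, linguistic_probs,
                    weights=(0.4, 0.3, 0.3)):
    """Decision-level fusion sketch: combine per-modality emotion
    probability distributions into a single hypothesis by weighted
    averaging. The weights are illustrative, not from the paper."""
    stacked = np.vstack([face_probs, acoustic_probs, linguistic_probs])
    fused = np.average(stacked, axis=0, weights=weights)
    fused /= fused.sum()  # renormalize to a proper distribution
    return EMOTIONS[int(np.argmax(fused))], fused

# Example: the face CNN leans toward happiness; the two voice
# sub-modalities agree but with less certainty.
label, dist = fuse_modalities(
    [0.10, 0.70, 0.10, 0.10],   # facial CNN
    [0.20, 0.40, 0.20, 0.20],   # acoustic CNN/LSTM
    [0.15, 0.50, 0.15, 0.20],   # linguistic CNN/LSTM
)
print(label)  # happiness
```

Decision-level fusion keeps each modality's network independent, so a failed modality (e.g. no face in frame) can simply be dropped from the average; feature-level fusion, by contrast, would require joint training.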


  • Nonverbal behavior
  • Multimodal interaction
  • Artificial intelligence
  • Cognitive robotics
  • Social signal processing

  • DOI: 10.1007/978-3-031-05061-9_23
[Figures 1–9 omitted from this preview.]




This work has been supported in part by the Croatian Science Foundation under the project “Affective Multimodal Interaction based on Constructed Robot Cognition—AMICORC (UIP-2020-02-7184).”

Author information

Correspondence to Leon Koren.


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Koren, L., Stipancic, T., Ricko, A., Orsag, L. (2022). Multimodal Emotion Analysis Based on Visual, Acoustic and Linguistic Features. In: Meiselwitz, G. (ed.) Social Computing and Social Media: Design, User Experience and Impact. HCII 2022. Lecture Notes in Computer Science, vol 13315. Springer, Cham.


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-05060-2

  • Online ISBN: 978-3-031-05061-9

  • eBook Packages: Computer Science (R0)