
Multimodal Emotion Analysis Based on Visual, Acoustic and Linguistic Features

Conference paper, published in Social Computing and Social Media: Design, User Experience and Impact (HCII 2022)

Abstract

This paper presents a computational reasoning framework that interprets the social signals of a person in interaction by focusing on that person's emotional state. Two distinct sources of social signals are used in this study: facial and vocal emotion modalities. For the first modality, a Convolutional Neural Network (CNN) extracts and processes facial features from a live video stream. The voice emotion analysis comprises two sub-modalities driven by CNN and Long Short-Term Memory (LSTM) networks, which analyze the acoustic and linguistic features of the voice to determine possible emotional cues of the person in interaction. Relying on multimodal information fusion, the system then combines the data into a single hypothesis. The results of this reasoning are used to autonomously generate robot responses, shown as non-verbal facial animations projected onto the 'face' surface of the affective robot head PLEA. The built-in functionalities of the robot provide a degree of situational embodiment, self-explainability and context-driven interaction.
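As a rough illustration of the fusion step described above, the sketch below shows a generic weighted late-fusion scheme: each modality classifier (facial CNN, acoustic CNN-LSTM, linguistic network) emits a probability distribution over emotion classes, and the distributions are combined into a single hypothesis. The emotion labels, the weights, and the fuse_modalities helper are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

# Illustrative emotion labels; the paper's exact label set may differ.
EMOTIONS = ["anger", "happiness", "sadness", "neutral"]

def fuse_modalities(facial_p, acoustic_p, linguistic_p, weights=(0.4, 0.3, 0.3)):
    """Weighted late fusion of per-modality emotion probability vectors.

    Each argument is a probability distribution over EMOTIONS produced by
    that modality's classifier (facial CNN, acoustic CNN-LSTM, linguistic
    network). The modality weights are hypothetical, not taken from the paper.
    """
    stacked = np.vstack([facial_p, acoustic_p, linguistic_p])  # shape (3, n_classes)
    w = np.asarray(weights)[:, None]                           # shape (3, 1)
    fused = (w * stacked).sum(axis=0)                          # weighted sum per class
    fused /= fused.sum()                                       # renormalize to a distribution
    return EMOTIONS[int(np.argmax(fused))], fused

# Example: mock outputs from the three modality classifiers for one time window.
face  = np.array([0.10, 0.70, 0.10, 0.10])
audio = np.array([0.20, 0.50, 0.20, 0.10])
text  = np.array([0.05, 0.60, 0.25, 0.10])
label, dist = fuse_modalities(face, audio, text)
print(label, dist)  # the fused hypothesis, e.g. "happiness"
```

In the actual system, such a fused hypothesis would then drive the selection of a non-verbal facial animation projected on the PLEA head; the sketch stops at producing the fused label.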



Acknowledgements

This work has been supported in part by the Croatian Science Foundation under the project “Affective Multimodal Interaction based on Constructed Robot Cognition—AMICORC (UIP-2020-02-7184).”

Author information

Corresponding author

Correspondence to Leon Koren.



Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Koren, L., Stipancic, T., Ricko, A., Orsag, L. (2022). Multimodal Emotion Analysis Based on Visual, Acoustic and Linguistic Features. In: Meiselwitz, G. (ed.) Social Computing and Social Media: Design, User Experience and Impact. HCII 2022. Lecture Notes in Computer Science, vol 13315. Springer, Cham. https://doi.org/10.1007/978-3-031-05061-9_23

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-05061-9_23

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-05060-2

  • Online ISBN: 978-3-031-05061-9

  • eBook Packages: Computer Science, Computer Science (R0)
