
Multimodal Emotion Analysis Based on Visual, Acoustic and Linguistic Features

Conference paper, published in Social Computing and Social Media: Design, User Experience and Impact (HCII 2022)

Abstract

This paper presents a computational reasoning framework that interprets the social signals of a person in interaction by focusing on that person's emotional state. Two distinct sources of social signals are used in this study: facial and vocal emotion modalities. For the first modality, a Convolutional Neural Network (CNN) extracts and processes facial features from a live video stream. The voice emotion analysis comprises two sub-modalities driven by CNN and Long Short-Term Memory (LSTM) networks, which analyze the acoustic and linguistic features of the voice to determine possible emotional cues of the person in interaction. Relying on multimodal information fusion, the system then combines the data into a single hypothesis. The results of this reasoning are used to autonomously generate robot responses, shown as non-verbal facial animations projected onto the 'face' surface of the affective robot head PLEA. The built-in functionalities of the robot provide a degree of situational embodiment, self-explainability and context-driven interaction.
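As a rough illustration of the fusion step described above, the sketch below shows a generic weighted late-fusion scheme: each modality classifier (facial CNN, acoustic CNN-LSTM, linguistic network) emits a probability distribution over emotion classes, and the distributions are combined into a single hypothesis. The emotion labels, the weights, and the fuse_modalities helper are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

# Illustrative emotion labels; the paper's exact label set may differ.
EMOTIONS = ["anger", "happiness", "sadness", "neutral"]

def fuse_modalities(facial_p, acoustic_p, linguistic_p, weights=(0.4, 0.3, 0.3)):
    """Weighted late fusion of per-modality emotion probability vectors.

    Each argument is a probability distribution over EMOTIONS produced by
    that modality's classifier (facial CNN, acoustic CNN-LSTM, linguistic
    network). The modality weights are hypothetical, not taken from the paper.
    """
    stacked = np.vstack([facial_p, acoustic_p, linguistic_p])  # shape (3, n_classes)
    w = np.asarray(weights)[:, None]                           # shape (3, 1)
    fused = (w * stacked).sum(axis=0)                          # weighted sum per class
    fused /= fused.sum()                                       # renormalize to a distribution
    return EMOTIONS[int(np.argmax(fused))], fused

# Example: mock outputs from the three modality classifiers for one time window.
face  = np.array([0.10, 0.70, 0.10, 0.10])
audio = np.array([0.20, 0.50, 0.20, 0.10])
text  = np.array([0.05, 0.60, 0.25, 0.10])
label, dist = fuse_modalities(face, audio, text)
print(label, dist)  # the fused hypothesis, e.g. "happiness"
```

In the actual system, such a fused hypothesis would then drive the selection of a non-verbal facial animation projected on the PLEA head; the sketch stops at producing the fused label.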



Acknowledgements

This work has been supported in part by the Croatian Science Foundation under the project “Affective Multimodal Interaction based on Constructed Robot Cognition—AMICORC (UIP-2020-02-7184).”

Author information

Corresponding author

Correspondence to Leon Koren.



Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Koren, L., Stipancic, T., Ricko, A., Orsag, L. (2022). Multimodal Emotion Analysis Based on Visual, Acoustic and Linguistic Features. In: Meiselwitz, G. (ed.) Social Computing and Social Media: Design, User Experience and Impact. HCII 2022. Lecture Notes in Computer Science, vol 13315. Springer, Cham. https://doi.org/10.1007/978-3-031-05061-9_23

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-05061-9_23

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-05060-2

  • Online ISBN: 978-3-031-05061-9

  • eBook Packages: Computer Science, Computer Science (R0)
