Abstract
This paper presents a computational reasoning framework that interprets the social signals of a person in interaction by focusing on the person's emotional state. Two distinct sources of social signals are used in this study: facial and voice emotion modalities. In the first modality, a Convolutional Neural Network (CNN) extracts and processes facial features from a live video stream. The voice emotion analysis comprises two sub-modalities and is driven by CNN and Long Short-Term Memory (LSTM) networks, which analyze the acoustic and linguistic features of the voice to detect possible emotional cues of the person in interaction. Through multimodal information fusion, the system then combines the modality-specific outputs into a single hypothesis. The result of this reasoning is used to autonomously generate robot responses, shown as non-verbal facial animations projected on the ‘face’ surface of the affective robot head PLEA. The built-in functionalities of the robot provide a degree of situational embodiment, self-explainability and context-driven interaction.
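The fusion step described in the abstract can be illustrated with a minimal decision-level (late) fusion sketch. The code below is an assumption-laden illustration, not the system's actual implementation: the emotion label set, the fuse_modalities helper and the modality weights are hypothetical, and the PLEA framework may combine the modalities differently (e.g., at the feature level).

```python
# Minimal decision-level fusion sketch (illustrative only, not the paper's pipeline).
# It assumes each modality-specific network has already produced a probability
# distribution over a shared set of emotion labels.
import numpy as np

EMOTIONS = ["anger", "disgust", "fear", "happiness", "neutral", "sadness", "surprise"]

def fuse_modalities(facial_probs, acoustic_probs, linguistic_probs,
                    weights=(0.4, 0.3, 0.3)):
    """Combine per-modality emotion distributions into a single hypothesis.

    Each argument is a probability vector aligned with EMOTIONS; the weights
    are illustrative values, not parameters reported in the paper.
    """
    stacked = np.vstack([facial_probs, acoustic_probs, linguistic_probs])
    w = np.asarray(weights)[:, None]
    fused = (stacked * w).sum(axis=0)
    fused /= fused.sum()                      # renormalize to a distribution
    return EMOTIONS[int(np.argmax(fused))], fused

if __name__ == "__main__":
    # Hypothetical outputs of the facial CNN, the acoustic CNN and the linguistic LSTM.
    face  = np.array([0.05, 0.02, 0.03, 0.70, 0.10, 0.05, 0.05])
    voice = np.array([0.10, 0.05, 0.05, 0.55, 0.15, 0.05, 0.05])
    text  = np.array([0.05, 0.05, 0.05, 0.60, 0.15, 0.05, 0.05])
    label, dist = fuse_modalities(face, voice, text)
    print(label, np.round(dist, 3))
```

In this sketch each modality votes with a weighted probability distribution and the fused distribution's argmax becomes the single emotion hypothesis that drives the robot's response.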
Acknowledgements
This work has been supported in part by the Croatian Science Foundation under the project “Affective Multimodal Interaction based on Constructed Robot Cognition—AMICORC (UIP-2020-02-7184).”
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Koren, L., Stipancic, T., Ricko, A., Orsag, L. (2022). Multimodal Emotion Analysis Based on Visual, Acoustic and Linguistic Features. In: Meiselwitz, G. (ed.) Social Computing and Social Media: Design, User Experience and Impact. HCII 2022. Lecture Notes in Computer Science, vol. 13315. Springer, Cham. https://doi.org/10.1007/978-3-031-05061-9_23
DOI: https://doi.org/10.1007/978-3-031-05061-9_23
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-05060-2
Online ISBN: 978-3-031-05061-9