Real-time emotional health detection using fine-tuned transfer networks with multimodal fusion

  • S.I.: Neural Computing for IOT based Intelligent Healthcare Systems
  • Published in: Neural Computing and Applications

Abstract

Recognizing and regulating human emotions, including riding a wave of emotions, is a vital life skill, as it plays an important role in how a person thinks, behaves and acts. Accurate real-time emotion detection can revolutionize the human–computer interaction industry and has the potential to provide a proactive approach to mental health care. Several untapped sources of data, including social media data (psycholinguistic markers) and multimodal data (audio and video signals) combined with sensor-based psychophysiological and brain signals, help to comprehend affective states and emotional experiences. In this work, we propose a model that utilizes three modalities, i.e., visual (facial expressions and body gestures), audio (speech) and text (spoken content), to classify emotion into discrete categories based on Ekman’s model, with an additional category for the ‘neutral’ state. Transfer learning with multistage fine-tuning is used for each modality, instead of training on a single dataset, to make the model generalizable. The use of multiple modalities allows heterogeneous data from different sources to be integrated effectively. The outputs of the three modalities are combined at the decision level using a weighted fusion technique. The proposed EmoHD model compares favorably with state-of-the-art techniques on two benchmark datasets, MELD and IEMOCAP.
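
The decision-level weighted fusion described above can be illustrated with a minimal sketch. It assumes that each fine-tuned, modality-specific network already outputs a probability distribution over Ekman's six basic emotions plus 'neutral'; the function name, the example weights and the example probabilities below are illustrative assumptions, not the authors' implementation.

# Minimal sketch of decision-level weighted fusion (assumed interface, not the authors' code).
import numpy as np

# Ekman's six basic emotions plus the additional 'neutral' category.
LABELS = ["anger", "disgust", "fear", "joy", "sadness", "surprise", "neutral"]

def weighted_decision_fusion(modality_probs, weights):
    """Combine per-modality class probabilities with a normalized weighted sum
    and return the label with the highest fused score."""
    fused = np.zeros(len(LABELS))
    total = sum(weights[m] for m in modality_probs)
    for modality, probs in modality_probs.items():
        fused += (weights[modality] / total) * np.asarray(probs)
    return LABELS[int(np.argmax(fused))]

# Hypothetical outputs from the visual, audio and text branches.
example_probs = {
    "visual": [0.10, 0.05, 0.05, 0.55, 0.10, 0.05, 0.10],
    "audio":  [0.15, 0.05, 0.05, 0.40, 0.15, 0.05, 0.15],
    "text":   [0.05, 0.05, 0.05, 0.60, 0.10, 0.05, 0.10],
}
example_weights = {"visual": 0.4, "audio": 0.3, "text": 0.3}  # illustrative weights
print(weighted_decision_fusion(example_probs, example_weights))  # -> joy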

Availability of data and materials

Publicly accessible data have been used by the authors.

Code availability

The code is available from the authors on request.

References

  1. Picard RW, Vyzas E, Healey J (2001) Toward machine emotional intelligence: analysis of affective physiological state. IEEE Trans Pattern Anal Mach Intell 23(10):1175–1191

  2. Zhang S, Zhang S, Huang T, Gao W, Tian Q (2017) Learning affective features with a hybrid deep model for audio–visual emotion recognition. IEEE Trans Circuits Syst Video Technol 28(10):3030–3043

  3. Kumar A, Sharma K, Sharma A (2021) Hierarchical deep neural network for mental stress state detection using IoT based biomarkers. Pattern Recogn Lett 145:81–87

  4. Gunes H, Pantic M (2010) Automatic, dimensional and continuous emotion recognition. Int J Synthet Emot (IJSE) 1(1):68–99

  5. Szabóová M, Sarnovský M, Maslej Krešňáková V, Machová K (2020) Emotion analysis in human-robot interaction. Electronics 9(11):1761

  6. Rabiei M, Gasparetto A (2014) A system for feature classification of emotions based on speech analysis; applications to human-robot interaction. In: 2014 second RSI/ISM international conference on robotics and mechatronics (ICRoM), pp 795–800. IEEE.

  7. García-Magariño I, Chittaro L, Plaza I (2018) Bodily sensation maps: exploring a new direction for detecting emotions from user self-reported data. Int J Hum Comput Stud 113:32–47

  8. Zhang L, Walter S, Ma X, Werner P, Al-Hamadi A, Traue HC, Gruss S (2016) “BioVid Emo DB”: A multimodal database for emotion analyses validated by subjective ratings. In: 2016 IEEE symposium series on computational intelligence (SSCI) pp 1–6. IEEE.

  9. Bahreini K, Nadolski R, Westera W (2016) Towards multimodal emotion recognition in e-learning environments. Interact Learn Environ 24(3):590–605

  10. Ashwin TS, Jose J, Raghu G, Reddy GRM (2015) An e-learning system with multifacial emotion recognition using supervised machine learning. In: 2015 IEEE seventh international conference on technology for education (T4E), pp 23–26. IEEE.

  11. Ayvaz U, Gürüler H, Devrim MO (2017) Use of facial emotion recognition in e-learning systems. Information Technologies and Learning Tools 60(4):95–104

  12. Zeng H, Shu X, Wang Y, Wang Y, Zhang L, Pong TC, Qu H (2020) EmotionCues: emotion-oriented visual summarization of classroom videos. IEEE Trans Vis Comput Gr

  13. Tu G, Fu Y, Li B, Gao J, Jiang YG, Xue X (2019) A multi-task neural approach for emotion attribution, classification, and summarization. IEEE Trans Multimedia 22(1):148–159

  14. Hossain MS, Muhammad G (2017) Emotion-aware connected healthcare big data towards 5G. IEEE Internet Things J 5(4):2399–2406

  15. Weitz K, Hassan T, Schmid U, Garbas J (2018) Towards explaining deep learning networks to distinguish facial expressions of pain and emotions. In: Forum Bildverarbeitung, pp 197–208

  16. Saravia E, Liu HCT, Huang YH, Wu J, Chen YS (2018) Carer: contextualized affect representations for emotion recognition. In: Proceedings of the 2018 conference on empirical methods in natural language processing, pp 3687–3697

  17. Ekman P, Friesen W (1977) Facial action coding system: a technique for the measurement of facial movement. Consulting Psychologists Press Stanford University, Palo Alto

  18. Datcu D, Rothkrantz L (2008) Semantic audio-visual data fusion for automatic emotion recognition. Euromedia’2008

  19. De Silva LC, Miyasato T, Nakatsu R (1997) Facial emotion recognition using multi-modal information. In: Proceedings of the 1997 international conference on information, communications and signal processing (ICICS), vol 1, pp 397–401. IEEE

  20. Datcu D, Rothkrantz LJ (2011) Emotion recognition using bimodal data fusion. In: Proceedings of the 12th international conference on computer systems and technologies, pp 122–128. ACM

  21. Schuller B (2011) Recognizing affect from linguistic information in 3d continuous space. IEEE Trans Affect Comput 2(4):192–205

  22. Metallinou A, Lee S, Narayanan S (2008) Audio-visual emotion recognition using Gaussian mixture models for face and voice. In: Tenth IEEE international symposium on multimedia (ISM 2008), pp 250–257. IEEE

  23. Eyben F, Wollmer M, Graves A, Schuller B, Douglas-Cowie E, Cowie R (2010) On-line emotion recognition in a 3-d activation-valence-time continuum using acoustic and linguistic cues. J Multimodal User Interfaces 3(1–2):7–19

  24. Rosas V, Mihalcea R, Morency L-P (2013) Multimodal sentiment analysis of Spanish online videos. IEEE Intell Syst 28(3):38–45

  25. Rozgic V, Ananthakrishnan S, Saleem S, Kumar R, Prasad R (2012) Ensemble of SVM trees for multimodal emotion recognition. In: Proceedings of the 2012 Asia-Pacific signal and information processing association annual summit and conference (APSIPA ASC), pp 1–4. IEEE

  26. Soleymani M, Pantic M, Pun T (2011) Multimodal emotion recognition in response to videos. IEEE Trans Affect Comput 3(2):211–223

  27. Tzirakis P, Trigeorgis G, Nicolaou MA, Schuller BW, Zafeiriou S (2017) End-to-end multimodal emotion recognition using deep neural networks. IEEE J Sel Top Signal Process 11(8):1301–1309

  28. Ranganathan H, Chakraborty S, Panchanathan S (2016) Multimodal emotion recognition using deep learning architectures. In: 2016 IEEE winter conference on applications of computer vision (WACV), pp 1–9. IEEE

  29. Poria S, Chaturvedi I, Cambria E, Hussain A (2016) Convolutional MKL based multimodal emotion recognition and sentiment analysis. In: 2016 IEEE 16th international conference on data mining (ICDM), pp 439–448. IEEE

  30. Nguyen D, Nguyen K, Sridharan S, Ghasemi A, Dean D, Fookes C (2017) Deep spatio-temporal features for multimodal emotion recognition. In: 2017 IEEE winter conference on applications of computer vision (WACV), pp 1215–1223. IEEE

  31. Poria S, Hazarika D, Majumder N, Naik G, Cambria E, Mihalcea R (2018) MELD: a multimodal multi-party dataset for emotion recognition in conversations. arXiv preprint https://arxiv.org/abs/1810.02508

  32. Mittal T, Bhattacharya U, Chandra R, Bera A, Manocha D (2020) M3ER: multiplicative multimodal emotion recognition using facial, textual, and speech cues. In: AAAI, pp 1359–1367

  33. Delbrouck JB, Tits N, Dupont S (2020) Modulated fusion using transformer for linguistic-acoustic emotion recognition. arXiv preprint https://arxiv.org/abs/2010.02057

  34. Hagar AF, Abbas HM, Khalil MI (2019) Emotion recognition in videos for low-memory systems using deep-learning. In: 2019 14th international conference on computer engineering and systems (ICCES), pp 16–21. IEEE

  35. Iskhakova A, Wolf D, Meshcheryakov R (2020) Automated destructive behavior state detection on the 1D CNN-based voice analysis. In: International conference on speech and computer, pp 184–193. Springer, Cham

  36. Xie J, Xu X, Shu L (2018) WT feature based emotion recognition from multi-channel physiological signals with decision fusion. In: 2018 first asian conference on affective computing and intelligent interaction (ACII Asia), pp 1–6. IEEE

  37. Gideon J, Khorram S, Aldeneh Z, Dimitriadis D, Provost EM (2017) Progressive neural networks for transfer learning in emotion recognition. arXiv preprint https://arxiv.org/abs/1706.03256

  38. Ouyang X, Kawaai S, Goh EGH, Shen S, Ding W, Ming H, Huang DY (2017) Audio-visual emotion recognition using deep transfer learning and multiple temporal models. In: Proceedings of the 19th ACM international conference on multimodal interaction, pp 577–582

  39. Kumar A, Sharma K, Sharma A (2021) Genetically optimized fuzzy C-means data clustering of IoMT-based biomarkers for fast affective state recognition in intelligent edge analytics. Appl Soft Comput 107525

  40. Tavallali P, et al. (2021) An EM-based optimization of synthetic reduced nearest neighbor model towards multiple modalities representation with human interpretability. Multimed Tools Appl

  41. Dresvyanskiy D, Ryumina E, Kaya H, Markitantov M, Karpov A, Minker W (2020) An audio-video deep and transfer learning framework for multimodal emotion recognition in the wild. arXiv preprint https://arxiv.org/abs/2010.03692

  42. Siriwardhana S, Reis A, Weerasekera R, Nanayakkara S (2020) Jointly fine-tuning "BERT-like" self supervised models to improve multimodal speech emotion recognition. arXiv preprint https://arxiv.org/abs/2008.06682

  43. Ekman P (1999) Basic emotions. Handb Cognit Emot 98(45–60):16

  44. Abbas A, Abdelsamea MM, Gaber MM (2020) DeTraC: transfer learning of class decomposed medical images in convolutional neural networks. IEEE Access 8:74901–74913

  45. Huh M, Agrawal P, Efros AA (2016) What makes ImageNet good for transfer learning?. arXiv preprint https://arxiv.org/abs/1608.08614

  46. Busso C, Bulut M, Lee CC, Kazemzadeh A, Mower E, Kim S, Narayanan SS (2008) IEMOCAP: interactive emotional dyadic motion capture database. Lang Resour Eval 42(4):335–359

  47. Li W, Abtahi F, Zhu Z (2015) A deep feature based multi-kernel learning approach for video emotion recognition. In: Proceedings of the 2015 ACM on international conference on multimodal interaction, pp 483–490

  48. Wu Z, Shen C, Van Den Hengel A (2019) Wider or deeper: Revisiting the resnet model for visual recognition. Pattern Recogn 90:119–133

  49. Poria S, Cambria E, Bajpai R, Hussain A (2017) A review of affective computing: from unimodal analysis to multimodal fusion. Inf Fusion 37:98–125

  50. Kumar A, Sharma A, Arora A (2019) Anxious depression prediction in real-time social data. In: International conference on advances in engineering science management & technology (ICAESMT)-2019, Uttaranchal University, Dehradun, India

  51. Hossain MS, Muhammad G (2019) Emotion recognition using deep learning approach from audio–visual emotional big data. Information Fusion 49:69–78

  52. Li W, Tsangouri C, Abtahi F, Zhu Z (2018) A recursive framework for expression recognition: from web images to deep models to game dataset. Mach Vis Appl 29(3):489–502

  53. Acheampong FA, Nunoo-Mensah H, Chen W (2021) Transformer models for text-based emotion detection: a review of BERT-based approaches. Artif Intell Rev, 1–41

  54. Hazarika D, Poria S, Zimmermann R, Mihalcea R (2021) Conversational transfer learning for emotion recognition. Inf Fusion 65:1–12

Funding

No funding was received for this work.

Author information

Contributions

All authors contributed equally to the preparation of the manuscript.

Corresponding author

Correspondence to Akshi Kumar.

Ethics declarations

Conflicts of interest

The authors certify that there is no conflict of interest regarding the subject matter discussed in this manuscript.

Ethics approval

The work presented is original and not plagiarized, and no one was harmed in the course of this work.

Consent to participate

All authors have given their consent to submit the manuscript.

Consent for publication

All authors consent to the publication of this work.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Sharma, A., Sharma, K. & Kumar, A. Real-time emotional health detection using fine-tuned transfer networks with multimodal fusion. Neural Comput & Applic 35, 22935–22948 (2023). https://doi.org/10.1007/s00521-022-06913-2
