
CNN-Transformer based emotion classification from facial expressions and body gestures

Published in Multimedia Tools and Applications

Abstract

Classifying emotions correctly from different data sources such as text, images, videos, and speech has been an inspiring research area for researchers from various disciplines. Automatic emotion detection from videos and images is one of the most challenging of these tasks and has been analyzed using both supervised and unsupervised machine learning methods. Deep learning has also been employed, with models trained on facial and body features obtained from pose and landmark detectors and trackers. In this paper, facial and body features extracted by the OpenPose tool are used to detect 6, 7, and 9 basic emotions from videos and images with a novel deep neural network framework that combines a Gaussian mixture model with CNN, LSTM, and Transformer architectures, yielding CNN-LSTM and CNN-Transformer models with and without Gaussian centers. Experiments conducted on two benchmark datasets, FABO and CK+, showed that the proposed Transformer model with 9 and 12 Gaussian centers, together with the video generation approach, achieved close to 100% classification accuracy on the FABO dataset, outperforming other DNN frameworks for emotion detection. It reported over 90% accuracy for most feature combinations on both datasets, making it a competitive framework for video emotion classification.
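
To make the pipeline concrete, the sketch below shows one plausible reading of the framework: per-frame OpenPose feature vectors are augmented with Gaussian mixture centers and then classified by a CNN front end followed by a Transformer encoder. It is a minimal illustration in PyTorch and scikit-learn; the layer sizes, kernel sizes, pooling strategy, and the way centers are appended are assumptions, not the authors' exact architecture.

```python
# Minimal sketch of the framework described in the abstract, assuming
# PyTorch and scikit-learn. All hyperparameters are illustrative only.
import numpy as np
import torch
import torch.nn as nn
from sklearn.mixture import GaussianMixture

def add_gaussian_centers(features, n_centers=9):
    """Fit a GMM on per-frame feature vectors and append to each frame the
    center of its most likely component (one way to build the dataset Dg)."""
    gmm = GaussianMixture(n_components=n_centers, random_state=0).fit(features)
    centers = gmm.means_[gmm.predict(features)]          # (n_frames, dim)
    return np.concatenate([features, centers], axis=1)   # (n_frames, 2*dim)

class CNNTransformer(nn.Module):
    """1-D CNN over the frame axis, then a Transformer encoder."""
    def __init__(self, feat_dim, n_classes, d_model=128):
        super().__init__()
        self.cnn = nn.Sequential(  # extracts local temporal patterns
            nn.Conv1d(feat_dim, d_model, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(d_model, d_model, kernel_size=3, padding=1), nn.ReLU())
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4,
                                           batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, x):                        # x: (batch, frames, feat_dim)
        h = self.cnn(x.transpose(1, 2))          # (batch, d_model, frames)
        h = self.transformer(h.transpose(1, 2))  # (batch, frames, d_model)
        return self.head(h.mean(dim=1))          # pool over frames, classify
```

The same skeleton gives the CNN-LSTM variant by swapping the Transformer encoder for an `nn.LSTM`, and the "without Gaussian centers" variants by skipping `add_gaussian_centers`.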




Abbreviations

FABO:

The Bi-modal Face and Body Gesture Database

CK+:

The Extended Cohn-Kanade Dataset

k-fold:

k-fold cross-validation

cnnf:

CNN model trained with facial features only

cnnb:

CNN model trained with body features only

cnnOutf:

Predictions obtained from the cnnf model.

cnnOutb:

Predictions obtained from the cnnb model.

lstmf:

LSTM model trained with facial features only

lstmb:

LSTM model trained with body features only

lstmOutf:

Predictions obtained from the lstmf model.

lstmOutb:

Predictions obtained from the lstmb model.

mm:

The final DNN block used to join the face and body models (a hedged sketch follows this list)

transformerf:

Transformer based model trained with facial features

transformerb:

Transformer based model trained with body features

transformerOutf:

Predictions obtained from the transformerf model.

transformerOutb:

Predictions obtained from the transformerb model.

Dg:

Dataset with Gaussian mixture centers added.

TopX Frames:

The approach in which we select the 10 most informative frames (a hedged sketch follows this list)
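
The mm block in the list above is described only as the final DNN stage that joins the face and body streams. A hedged late-fusion reading is sketched below; concatenating the two streams' class predictions and passing them through a small MLP is an assumption, and the hidden size is arbitrary.

```python
import torch
import torch.nn as nn

class FusionBlock(nn.Module):
    """Joins per-stream predictions (e.g. transformerOutf and transformerOutb)
    into one emotion prediction; concatenation + MLP is assumed, not the
    paper's confirmed design."""
    def __init__(self, n_classes, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * n_classes, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_classes))

    def forward(self, out_face, out_body):  # each: (batch, n_classes)
        return self.mlp(torch.cat([out_face, out_body], dim=1))
```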
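
TopX frame selection can likewise be sketched. Since the list does not spell out the informativeness criterion, the score below (mean inter-frame keypoint displacement, so high-motion frames are kept) is an assumption.

```python
import numpy as np

def top_x_frames(keypoints, x=10):
    """Keep the x highest-scoring frames of a clip, in temporal order.

    keypoints: (n_frames, n_points, 2) array of OpenPose coordinates. The
    motion-based score is a stand-in; the paper's criterion may differ."""
    motion = np.linalg.norm(np.diff(keypoints, axis=0), axis=2)  # (n-1, pts)
    scores = np.concatenate([[0.0], motion.mean(axis=1)])        # (n_frames,)
    top = np.sort(np.argsort(scores)[-x:])  # top x, temporal order restored
    return keypoints[top]
```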


Acknowledgements

The numerical calculations reported in this paper were fully/partially performed at TUBITAK ULAKBIM, High Performance and Grid Computing Center (TRUBA resources).

Author information


Corresponding author

Correspondence to Reda Alhajj.

Ethics declarations

Conflicts of interest

No conflict of interest to declare.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Karatay, B., Beştepe, D., Sailunaz, K. et al. CNN-Transformer based emotion classification from facial expressions and body gestures. Multimed Tools Appl 83, 23129–23171 (2024). https://doi.org/10.1007/s11042-023-16342-5


