
A dynamic fusion of features from deep learning and the HOG-TOP algorithm for facial expression recognition

Multimedia Tools and Applications

Abstract

Facial expression recognition plays an essential role in surveillance video, anxiety treatment, expression analysis, gesture recognition, computer games, patient monitoring, operator fatigue detection, and robotics. It has therefore attracted growing attention over the years, yet it remains a difficult task because the expression of emotion can be influenced by many factors. Approaches based on Deep Convolutional Neural Networks (DCNN) and transfer learning have succeeded in recognizing emotion in video sequences, but they remain limited because it is difficult to model the spatio-temporal interactions between video frames or to identify the salient features that improve accuracy. In this article, we propose a facial expression recognition system that combines deep learning features with dynamic texture features. For the deep learning part, we use the VGG19 model to extract facial features, which feed LSTM (Long Short-Term Memory) cells that capture spatio-temporal information across frames. In parallel, the HOG-TOP (Histogram of Oriented Gradients from Three Orthogonal Planes) descriptor extracts dynamic textures from the video sequences to characterize changes in facial appearance. Finally, we combine the two representations with a Multimodal Compact Bilinear (MCB) pooling algorithm to produce a robust descriptor vector, and an SVM (Support Vector Machine) classifier predicts the emotion class. Experiments on the eNTERFACE'05 dataset show that the fusion method raises recognition accuracy by almost 1% (98.44%) over the baseline approach (97.75%). In summary, the proposed method achieves higher accuracy and more robust detection.
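The fusion step described above can be sketched in code. The following Python example is a minimal illustration under stated assumptions, not the authors' implementation: it approximates Multimodal Compact Bilinear pooling by count-sketching each feature vector and convolving the sketches in the Fourier domain (the standard MCB construction), then trains an SVM on the fused vectors. The feature dimensions, the sketch size d, the random stand-in features, and the linear kernel are all illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC

class MCBFusion:
    """Multimodal Compact Bilinear pooling via count sketch + FFT.

    Approximates the outer product of two feature vectors in a
    d-dimensional space. The random hash functions are drawn once
    and reused for every sample, as MCB requires.
    """

    def __init__(self, dim1, dim2, d=1024, seed=0):
        rng = np.random.default_rng(seed)
        self.d = d
        # Random hash h: index -> bucket, and random signs s in {-1, +1}
        self.h1 = rng.integers(0, d, size=dim1)
        self.s1 = rng.choice([-1.0, 1.0], size=dim1)
        self.h2 = rng.integers(0, d, size=dim2)
        self.s2 = rng.choice([-1.0, 1.0], size=dim2)

    def _sketch(self, x, h, s):
        y = np.zeros(self.d)
        np.add.at(y, h, s * x)  # y[h[i]] += s[i] * x[i]
        return y

    def fuse(self, x1, x2):
        y1 = self._sketch(x1, self.h1, self.s1)
        y2 = self._sketch(x2, self.h2, self.s2)
        # Circular convolution of the two sketches equals the sketch of
        # their outer product; computed efficiently in the Fourier domain.
        return np.fft.irfft(np.fft.rfft(y1) * np.fft.rfft(y2), n=self.d)

# Toy usage with random stand-ins for the real features:
# deep_feats would come from VGG19+LSTM, tex_feats from HOG-TOP.
rng = np.random.default_rng(1)
n, dim_deep, dim_hog = 200, 512, 300
deep_feats = rng.standard_normal((n, dim_deep))
tex_feats = rng.standard_normal((n, dim_hog))
labels = rng.integers(0, 6, size=n)  # six basic emotion classes

mcb = MCBFusion(dim_deep, dim_hog, d=1024)
fused = np.stack([mcb.fuse(a, b) for a, b in zip(deep_feats, tex_feats)])

clf = SVC(kernel="linear").fit(fused, labels)
print("train accuracy:", clf.score(fused, labels))
```

Drawing the hash functions once in the constructor is the essential design point: every sample pair must be projected with the same sketch, otherwise the fused vectors are not comparable and the SVM cannot learn from them.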


Data availability

The datasets analyzed during the current study are available from the corresponding author on reasonable request.


Author information


Corresponding author

Correspondence to Hajar Chouhayebi.

Ethics declarations

Conflict of interest

The authors have no competing interests to declare that are relevant to the content of this article.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Chouhayebi, H., Mahraz, M.A., Riffi, J. et al. A dynamic fusion of features from deep learning and the HOG-TOP algorithm for facial expression recognition. Multimed Tools Appl 83, 32993–33017 (2024). https://doi.org/10.1007/s11042-023-16779-8

