
Emotion-aware Multi-view Contrastive Learning for Facial Emotion Recognition

  • Conference paper
Computer Vision – ECCV 2022 (ECCV 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13673)

Abstract

When a person recognizes another’s emotion, they perceive the facial features associated with emotional expression. Likewise, for a machine to recognize facial emotions, the features related to emotional expression must be represented and described properly. However, prior methods based on label supervision neither explicitly captured features related to emotional expression nor aimed to learn emotional representations. This paper proposes a novel approach that generates features related to emotional expression through feature transformation and uses them for emotional representation learning. Specifically, the contrast between the generated features and the overall facial features is quantified through contrastive representation learning, and facial emotions are then recognized based on the angle and intensity that describe the emotional representation in polar coordinates, i.e., the Arousal-Valence space. Experimental results show that the proposed method improves PCC/CCC performance by more than 10% over the runner-up method on in-the-wild datasets and is also qualitatively better in terms of neural activation maps. Code is available at https://github.com/kdhht2334/AVCE_FER.
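The abstract describes emotions as points in the Arousal-Valence space, read off in polar form (angle and intensity), and reports results in PCC/CCC. As a minimal sketch, not the paper's implementation, the snippet below shows the standard polar decomposition of a (valence, arousal) point and the usual definitions of the Pearson (PCC) and concordance (CCC) correlation coefficients; the function names are illustrative only.

```python
import numpy as np

def av_to_polar(valence, arousal):
    """Map a (valence, arousal) point to polar form:
    intensity (radius from the origin) and angle (direction in the AV plane)."""
    intensity = np.hypot(valence, arousal)
    angle = np.arctan2(arousal, valence)  # radians in (-pi, pi]
    return intensity, angle

def pcc(x, y):
    """Pearson correlation coefficient: linear correlation only."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    cov = np.mean((x - x.mean()) * (y - y.mean()))
    return cov / (x.std() * y.std())

def ccc(x, y):
    """Concordance correlation coefficient: like PCC, but also
    penalizes differences in mean and variance between x and y."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    cov = np.mean((x - x.mean()) * (y - y.mean()))
    return 2 * cov / (x.var() + y.var() + (x.mean() - y.mean()) ** 2)
```

Unlike PCC, CCC drops below 1 for predictions that are perfectly correlated with the labels but biased or rescaled, which is why valence/arousal regression papers typically report both.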



Acknowledgements

This work was supported by IITP grants funded by the Korea government (MSIT) (No. 2021-0-02068, AI Innovation Hub and RS-2022-00155915, Artificial Intelligence Convergence Research Center(Inha University)), and was supported by the NRF grant funded by the Korea government (MSIT) (No. 2022R1A2C2010095 and No. 2022R1A4A1033549).

Author information

Corresponding author: Byung Cheol Song.

Electronic supplementary material

Supplementary material 1 (zip 4449 KB)


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Kim, D., Song, B.C. (2022). Emotion-aware Multi-view Contrastive Learning for Facial Emotion Recognition. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13673. Springer, Cham. https://doi.org/10.1007/978-3-031-19778-9_11


  • DOI: https://doi.org/10.1007/978-3-031-19778-9_11

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-19777-2

  • Online ISBN: 978-3-031-19778-9

