Deep convolutional BiLSTM fusion network for facial expression recognition

Abstract

Deep learning algorithms have shown significant performance improvements for facial expression recognition (FER). Most deep learning-based methods, however, focus primarily on spatial appearance features for classification and discard much useful temporal information. In this work, we present a novel framework that jointly learns spatial features and temporal dynamics for FER. Given the image sequence of an expression, spatial features are extracted from each frame by a deep network, while the temporal dynamics are modeled by a convolutional network that takes a pair of consecutive frames as input. The framework then accumulates clues from the fused features with a BiLSTM network. In addition, the framework is end-to-end learnable, so the temporal information can adapt to complement the spatial features. Experimental results on three benchmark databases, CK+, Oulu-CASIA and MMI, show that the proposed framework outperforms state-of-the-art methods.
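The fusion pipeline described above can be sketched in miniature with numpy. This is an illustrative toy, not the authors' implementation: random vectors stand in for the per-frame spatial CNN features, a simple frame difference stands in for the temporal network applied to consecutive frame pairs, and the LSTM weights, dimensions and the seven-class output head are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def lstm_step(x, h, c, W, U, b):
    # One LSTM step; the pre-activations are stacked in gate order i, f, o, g.
    z = W @ x + U @ h + b
    H = h.size
    i, f, o = (1.0 / (1.0 + np.exp(-z[k * H:(k + 1) * H])) for k in range(3))
    g = np.tanh(z[3 * H:])
    c = f * c + i * g          # update cell state
    h = o * np.tanh(c)         # emit hidden state
    return h, c

def run_lstm(seq, H, params):
    # Unroll a unidirectional LSTM over a (T, D) feature sequence.
    W, U, b = params
    h, c = np.zeros(H), np.zeros(H)
    outs = []
    for x in seq:
        h, c = lstm_step(x, h, c, W, U, b)
        outs.append(h)
    return np.stack(outs)

def make_params(D, H, rng):
    # Small random weights for the 4 stacked gates.
    return (rng.normal(scale=0.1, size=(4 * H, D)),
            rng.normal(scale=0.1, size=(4 * H, H)),
            np.zeros(4 * H))

T, D, H = 5, 8, 6
frames = rng.normal(size=(T, D))                  # stand-in: per-frame spatial CNN features
pairs = frames[1:] - frames[:-1]                  # stand-in: temporal net on consecutive frame pairs
pairs = np.vstack([np.zeros((1, D)), pairs])      # pad so both streams have T steps

fused = np.concatenate([frames, pairs], axis=1)   # feature-level fusion, shape (T, 2D)

# BiLSTM: run forward and backward passes, then concatenate the hidden states.
fwd = run_lstm(fused, H, make_params(fused.shape[1], H, rng))
bwd = run_lstm(fused[::-1], H, make_params(fused.shape[1], H, rng))[::-1]
bi = np.concatenate([fwd, bwd], axis=1)           # accumulated clues, shape (T, 2H)

# Hypothetical classification head over the 7 basic expressions (Ekman categories).
logits = bi[-1] @ rng.normal(scale=0.1, size=(2 * H, 7))
print(bi.shape, logits.shape)
```

The bidirectional pass mirrors the accumulation idea in the paper: each time step's output combines evidence from both earlier and later frames, so the final prediction is not dominated by the last (peak) frame alone.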



References

  1.

    Afshar, S., Salah, A.A.: Facial expression recognition in the wild using improved dense trajectories and fisher vector encoding. In: Computer Vision and Pattern Recognition Workshops, pp. 1517–1525 (2016)

  2.

    Agarwal, S., Santra, B., Mukherjee, D.P.: Anubhav: recognizing emotions through facial expression. Vis. Comput. 34, 1–15 (2016)


  3.

    Bargal, S.A., Barsoum, E., Ferrer, C.C., Zhang, C.: Emotion recognition in the wild from videos using images. In: ACM International Conference on Multimodal Interaction, pp. 433–436 (2016)

  4.

    Chi, J., Tu, C., Zhang, C.: Dynamic 3D facial expression modeling using Laplacian smooth and multi-scale mesh matching. Vis. Comput. 30(6–8), 649–659 (2014)


  5.

    Danelakis, A., Theoharis, T., Pratikakis, I.: A spatio-temporal wavelet-based descriptor for dynamic 3D facial expression retrieval and recognition. Vis. Comput. 32(6–8), 1–11 (2016)


  6.

    Ebrahimi Kahou, S., Michalski, V., Konda, K., Memisevic, R., Pal, C.: Recurrent neural networks for emotion recognition in video. In: Proceedings of the 2015 ACM on International Conference on Multimodal Interaction, pp. 467–474 (2015)

  7.

    Ekman, P., Friesen, W.V.: Constants across cultures in the face and emotion. J. Personal. Soc. Psychol. 17(2), 124 (1971)


  8.

    Fan, Y., Lu, X., Li, D., Liu, Y.: Video-based emotion recognition using CNN–RNN and C3D hybrid networks. In: ACM International Conference on Multimodal Interaction, pp. 445–450 (2016)

  9.

    Goodfellow, I.J., Erhan, D., Carrier, P.L., Courville, A., Mirza, M., Hamner, B., Cukierski, W., Tang, Y., Thaler, D., Lee, D.H.: Challenges in representation learning: a report on three machine learning contests. In: International Conference on Neural Information Processing, pp. 117–124 (2013)

  10.

    Graves, A., Schmidhuber, J.: Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Netw. 18(5–6), 602–610 (2005)


  11.

    Guo, Y., Zhao, G., Pietikainen, M.: Dynamic facial expression recognition using longitudinal facial expression atlases. In: European Conference on Computer Vision, pp. 631–644 (2012)

  12.

    He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Computer Vision and Pattern Recognition, pp. 770–778 (2016)

  13.

    Jaiswal, S., Valstar, M.: Deep learning the dynamic appearance and shape of facial action units. In: Applications of Computer Vision (WACV), pp. 1–8 (2016)

  14.

    Jung, H., Lee, S., Yim, J., Park, S., Kim, J.: Joint fine-tuning in deep neural networks for facial expression recognition. In: IEEE International Conference on Computer Vision, pp. 2983–2991 (2015)

  15.

    Kacem, A., Daoudi, M., Amor, B.B., Alvarezpaiva, J.C.: A novel space-time representation on the positive semidefinite cone for facial expression recognition. In: IEEE International Conference on Computer Vision, pp. 3199–3208 (2017)

  16.

    Khorrami, P., Paine, T.L., Brady, K., Dagli, C., Huang, T.S.: How deep neural networks can improve emotion recognition on video data, pp. 619–623 (2016)

  17.

    Klaser, A., Marszalek, M., Schmid, C.: A spatio-temporal descriptor based on 3D-gradients. In: Proceedings of the British Machine Vision Conference, pp. 1–10 (2008)

  18.

    LeCun, Y., Boser, B.E., Denker, J.S., Henderson, D., Howard, R.E., Hubbard, W.E., Jackel, L.D.: Handwritten digit recognition with a back-propagation network. In: Advances in Neural Information Processing Systems, pp. 396–404 (1990)

  19.

    Liu, H., Jie, Z., Jayashree, K., Qi, M., Jiang, J., Yan, S., Feng, J.: Video-based person re-identification with accumulative motion context. In: CoRR (2017)

  20.

    Liu, M., Li, S., Shan, S., Wang, R., Chen, X.: Deeply learning deformable facial action parts model for dynamic expression analysis. In: Asian Conference on Computer Vision, pp. 143–157 (2014)

  21.

    Liu, M., Shan, S., Wang, R., Chen, X.: Learning expressionlets on spatio-temporal manifold for dynamic facial expression recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1749–1756 (2014)

  22.

    Lucey, P., Cohn, J.F., Kanade, T., Saragih, J.: The extended Cohn–Kanade dataset (CK+): a complete dataset for action unit and emotion-specified expression. In: Computer Vision and Pattern Recognition Workshops, pp. 94–101 (2010)

  23.

    Metaxas, D.N., Huang, J., Liu, B., Yang, P., Liu, Q., Zhong, L.: Learning active facial patches for expression analysis. In: Computer Vision and Pattern Recognition, pp. 2562–2569 (2012)

  24.

    Mollahosseini, A., Chan, D., Mahoor, M.H.: Going deeper in facial expression recognition using deep neural networks. In: Applications of Computer Vision (WACV), pp. 1–10 (2016)

  25.

    Ofodile, I., Kulkarni, K., Corneanu, C.A., Escalera, S., Baro, X., Hyniewska, S., Allik, J., Anbarjafari, G.: Automatic recognition of deceptive facial expressions of emotion. In: CoRR (2017)

  26.

    Sanin, A., Sanderson, C., Harandi, M.T., Lovell, B.C.: Spatio-temporal covariance descriptors for action and gesture recognition. In: IEEE Workshop on Applications of Computer Vision, pp. 103–110 (2013)

  27.

    Saudagare, P.V., Chaudhari, D.: Facial expression recognition using neural network-an overview. Int. J. Soft Comput. Eng. (IJSCE) 2(1), 224–227 (2012)


  28.

    Shan, C., Gong, S., McOwan, P.W.: Facial expression recognition based on local binary patterns: a comprehensive study. In: Image and Vision Computing, pp. 803–816 (2009)

  29.

    Sikka, K., Sharma, G., Bartlett, M.: Lomo: latent ordinal model for facial analysis in videos. In: Computer Vision and Pattern Recognition, pp. 5580–5589 (2016)

  30.

    Sikka, K., Wu, T., Susskind, J., Bartlett, M.: Exploring bag of words architectures in the facial expression domain. In: Computer Vision—ECCV 2012. Workshops and Demonstrations, pp. 250–259 (2012)

  31.

    Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: CoRR (2014)

  32.

    Szegedy, C., Ioffe, S., Vanhoucke, V., Alemi, A.A.: Inception-v4, inception-resnet and the impact of residual connections on learning. In: AAAI, pp. 4278–4284 (2017)

  33.

    Szegedy, C., Liu, W., Jia, Y., Sermanet, P.: Going deeper with convolutions. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9 (2015)

  34.

    Taini, M., Zhao, G., Li, S.Z., Pietikainen, M.: Facial expression recognition from near-infrared video sequences. In: International Conference on Pattern Recognition, pp. 1–4 (2011)

  35.

    Valstar, M., Pantic, M.: Induced disgust, happiness and surprise: an addition to the MMI facial expression database. In: Proceedings of the 3rd International Workshop on EMOTION (satellite of LREC): Corpora for Research on Emotion and Affect, p. 65 (2010)

  36.

    Valstar, M.F., Almaev, T., Girard, J.M., Mckeown, G.: Fera 2015 second facial expression recognition and analysis challenge. In: IEEE International Conference and Workshops on Automatic Face and Gesture Recognition, pp. 1–8 (2015)

  37.

    Yang, P.: Learning active facial patches for expression analysis. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 2562–2569 (2012)

  38.

    Yi, D., Lei, Z., Liao, S., Li, S.Z.: Learning face representation from scratch. In: CoRR (2014)

  39.

    Yu, Z., Liu, Q., Liu, G.: Deeper cascaded peak-piloted network for weak expression recognition. Vis. Comput. 6–8, 1–9 (2017)


  40.

    Yu, Z., Zhang, C.: Image based static facial expression recognition with multiple deep network learning. In: ACM on International Conference on Multimodal Interaction, pp. 435–442 (2015)

  41.

    Zhang, K., Zhang, Z., Li, Z., Qiao, Y.: Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Process. Lett. 23, 1499–1503 (2016)


  42.

    Zhang, Z., Luo, P., Chen, C.L., Tang, X.: From facial expression recognition to interpersonal relation prediction. Int. J. Comput. Vis. 126(5), 550–569 (2018)


  43.

    Zhao, G., Huang, X., Taini, M., Li, S.Z., Pietikäinen, M.: Facial expression recognition from near-infrared videos. Image Vis. Comput. 29(9), 607–619 (2011)


  44.

    Zhao, X., Liang, X., Liu, L., Li, T., Han, Y., Vasconcelos, N., Yan, S.: Peak-piloted deep network for facial expression recognition. In: European Conference on Computer Vision, pp. 425–442 (2016)


Author information


Corresponding author

Correspondence to Dandan Liang.



About this article


Cite this article

Liang, D., Liang, H., Yu, Z. et al. Deep convolutional BiLSTM fusion network for facial expression recognition. Vis Comput 36, 499–508 (2020). https://doi.org/10.1007/s00371-019-01636-3


Keywords

  • Facial expression recognition
  • Deep network
  • BiLSTM
  • Spatial–temporal features