A video prediction approach for animating single face image

  • Yong Zhao
  • Meshia Cédric Oveneke
  • Dongmei Jiang
  • Hichem Sahli


Generating dynamic 2D image-based facial expressions is a challenging task for facial animation. Much research has focused on performance-driven facial animation from given videos or images of a target face, while animating a single face image driven by emotion labels remains a less explored problem. In this work, we treat the task of animating a single face image from emotion labels as a conditional video prediction problem, and propose a novel framework that combines factored conditional restricted Boltzmann machines (FCRBM) and a reconstruction contractive auto-encoder (RCAE). A modified RCAE with an associated efficient training strategy is used to extract low-dimensional features and reconstruct face images. The FCRBM serves as an animator that predicts a facial expression sequence in the feature space, given discrete emotion labels and a frontal neutral face image as input. Quantitative and qualitative evaluations on two facial expression databases, together with a comparison to state-of-the-art methods, show the effectiveness of the proposed framework for animating a frontal neutral face image from given emotion labels.
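The abstract does not detail the modified RCAE or its training strategy, so the following is only a minimal sketch of the generic reconstruction contractive objective described by Alain and Bengio (2014): the squared reconstruction error plus λ times the squared Frobenius norm of the Jacobian ∂r(x)/∂x of the reconstruction r(x) = g(f(x)). It assumes a single sigmoid hidden layer, a linear decoder, one input vector at a time, and a penalty weight `lam`; all function names and sizes are hypothetical, and this is not the authors' modified model.

```python
import numpy as np


def sigmoid(z):
    """Elementwise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))


def reconstruct(x, W, b, V, c):
    """Return r(x) = V @ sigmoid(W @ x + b) + c and the hidden code h.

    Assumed shapes (hypothetical toy model): x (d,), W (k, d), b (k,), V (d, k), c (d,).
    """
    h = sigmoid(W @ x + b)
    return V @ h + c, h


def reconstruction_jacobian(x, W, b, V, c):
    """Analytic Jacobian dr/dx = V @ diag(h * (1 - h)) @ W, shape (d, d)."""
    _, h = reconstruct(x, W, b, V, c)
    # Scaling row j of W by h_j * (1 - h_j) equals left-multiplying W by that diagonal matrix.
    return V @ ((h * (1.0 - h))[:, None] * W)


def rcae_loss(x, W, b, V, c, lam=0.1):
    """Squared reconstruction error plus lam * ||dr/dx||_F^2 for a single sample x."""
    r, _ = reconstruct(x, W, b, V, c)
    J = reconstruction_jacobian(x, W, b, V, c)
    return float(np.sum((r - x) ** 2) + lam * np.sum(J ** 2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, k = 6, 4  # toy sizes; real face features would be far larger
    W, b = rng.normal(size=(k, d)), rng.normal(size=k)
    V, c = rng.normal(size=(d, k)), rng.normal(size=d)
    x = rng.normal(size=d)
    print("RCAE loss:", rcae_loss(x, W, b, V, c, lam=0.1))
```

Penalizing the Jacobian of the reconstruction, rather than the encoder Jacobian used by the contractive auto-encoder of Rifai et al. (2011), pushes r(x) to be locally insensitive to small input perturbations. In practice the loss would be averaged over a batch of face images and minimized by gradient descent.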


Keywords: Facial expression animation; Image-based; FCRBM; Reconstruction contractive auto-encoder; Emotion



We thank Averbuch-Elor et al. for kindly providing the sequence for comparison. We thank Tao Yang for kindly processing the facial expression recognition experiments, and all the students for their participation in the subjective analysis. We would like to thank the reviewer for their detailed comments and suggestions on the manuscript; we believe these comments identified important areas that required improvement. This work is supported by the Chinese Scholarship Council (CSC) (grant 201506290085), the Shaanxi Provincial International Science and Technology Collaboration Project (grant 2017KW-ZD-14), the Natural Science Foundation of China (grant 61273265), the VUB Interdisciplinary Research Program through the EMO-App project, and the Agency for Innovation by Science and Technology in Flanders (IWT) – PhD grant nr. 131814.

Supplementary material

Two supplementary videos (AVI, 26.7 MB and 28.9 MB).


References

  1. Alain G, Bengio Y (2014) What regularized auto-encoders learn from the data-generating distribution. J Mach Learn Res 15(1):3563–3593
  2. Anderson R, Stenger B, Wan V, Cipolla R (2013) Expressive visual text-to-speech using active appearance models. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3382–3389
  3. Averbuch-Elor H, Cohen-Or D, Kopf J, Cohen MF (2017) Bringing portraits to life. ACM Trans Graph (TOG) 36(6):196
  4. Bengio Y, Courville A, Vincent P (2013) Representation learning: a review and new perspectives. IEEE Trans Pattern Anal Mach Intell 35(8):1798–1828
  5. Blanz V, Vetter T (1999) A morphable model for the synthesis of 3D faces. In: Proceedings of the 26th annual conference on computer graphics and interactive techniques. ACM Press/Addison-Wesley Publishing Co., pp 187–194
  6. Blanz V, Basso C, Poggio T, Vetter T (2003) Reanimating faces in images and video. In: Computer graphics forum, vol 22. Wiley Online Library, pp 641–650
  7. Cao Y, Tien WC, Faloutsos P, Pighin F (2005) Expressive speech-driven facial animation. ACM Trans Graph (TOG) 24(4):1283–1302
  8. Cao C, Wu H, Weng Y, Shao T, Zhou K (2016) Real-time facial animation with image-based dynamic avatars. ACM Trans Graph (TOG) 35(4):126
  9. Cootes TF, Edwards GJ, Taylor CJ (2001) Active appearance models. IEEE Trans Pattern Anal Mach Intell 23(6):681–685
  10. Deng Z, Noh J (2008) Computer facial animation: a survey. In: Data-driven 3D facial animation. Springer, pp 1–28
  11. Ding H, Zhou SK, Chellappa R (2017) FaceNet2ExpNet: regularizing a deep face recognition net for expression recognition. In: 2017 12th IEEE International conference on automatic face & gesture recognition (FG 2017). IEEE, pp 118–126
  12. Ersotelos N, Dong F (2008) Building highly realistic facial modeling and animation: a survey. Vis Comput 24(1):13–30
  13. Fan B, Wang L, Soong FK, Xie L (2015) Photo-real talking head with deep bidirectional LSTM. In: 2015 IEEE International conference on acoustics, speech and signal processing (ICASSP). IEEE, pp 4884–4888
  14. Garrido P, Zollhöfer M, Casas D, Valgaerts L, Varanasi K, Pérez P, Theobalt C (2016) Reconstruction of personalized 3D face rigs from monocular video. ACM Trans Graph (TOG) 35(3):28
  15. Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y (2014) Generative adversarial nets. In: Advances in neural information processing systems, pp 2672–2680
  16. Ichim AE, Bouaziz S, Pauly M (2015) Dynamic 3D avatar creation from hand-held video input. ACM Trans Graph (TOG) 34(4):45
  17. Jiang D, Zhao Y, Sahli H, Zhang Y (2014) Speech driven photo realistic facial animation based on an articulatory DBN model and AAM features. Multimed Tools Appl 73(1):397–415
  18. Kingma DP, Welling M (2013) Auto-encoding variational Bayes. arXiv:1312.6114
  19. Liu Z, Shan Y, Zhang Z (2001) Expressive expression mapping with ratio images. In: Proceedings of the 28th annual conference on computer graphics and interactive techniques. ACM, pp 271–276
  20. Lucey P, Cohn JF, Kanade T, Saragih J, Ambadar Z, Matthews I (2010) The extended Cohn-Kanade dataset (CK+): a complete dataset for action unit and emotion-specified expression. In: 2010 IEEE Computer society conference on computer vision and pattern recognition workshops (CVPRW). IEEE, pp 94–101
  21. Mirza M, Osindero S (2014) Conditional generative adversarial nets. arXiv:1411.1784
  22. Olszewski K, Li Z, Yang C, Zhou Y, Yu R, Huang Z, Xiang S, Saito S, Kohli P, Li H (2017) Realistic dynamic facial textures from a single image using GANs. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 5429–5438
  23. Oveneke MC, Aliosha-Perez M, Zhao Y, Jiang D, Sahli H (2016) Efficient convolutional auto-encoding via random convexification and frequency-domain minimization. arXiv:1611.09232
  24. Oveneke MC, Zhao Y, Jiang D, Sahli H (2017) Expressive face frontalization and its application to facial expression analysis. Tech. rep., Vrije Universiteit Brussel
  25. Rifai S, Vincent P, Muller X, Glorot X, Bengio Y (2011) Contractive auto-encoders: explicit invariance during feature extraction. In: Proceedings of the 28th international conference on machine learning (ICML-11), pp 833–840
  26. Shu Z, Yumer E, Hadap S, Sunkavalli K, Shechtman E, Samaras D (2017) Neural face editing with intrinsic image disentangling. arXiv:1704.04131
  27. Stoiber N, Seguier R, Breton G (2009) Automatic design of a control interface for a synthetic face. In: Proceedings of the 14th international conference on intelligent user interfaces. ACM, pp 207–216
  28. Susskind JM, Anderson AK, Hinton GE, Movellan JR (2008) Generating facial expressions with deep belief nets. INTECH Open Access Publisher
  29. Sutskever I, Hinton GE, Taylor GW (2009) The recurrent temporal restricted Boltzmann machine. In: Advances in neural information processing systems, pp 1601–1608
  30. Taylor GW, Hinton GE (2009) Factored conditional restricted Boltzmann machines for modeling motion style. In: Proceedings of the 26th annual international conference on machine learning. ACM, pp 1025–1032
  31. Taylor GW, Hinton GE, Roweis ST (2007) Modeling human motion using binary latent variables. Adv Neural Inf Process Syst 19:1345
  32. Thies J, Zollhöfer M, Stamminger M, Theobalt C, Nießner M (2016) Face2Face: real-time face capture and reenactment of RGB videos. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2387–2395
  33. Tulyakov S, Liu MY, Yang X, Kautz J (2018) MoCoGAN: decomposing motion and content for video generation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1526–1535
  34. Villegas R, Yang J, Zou Y, Sohn S, Lin X, Lee H (2017) Learning to generate long-term future via hierarchical prediction. arXiv:1704.05831
  35. Wang Z, Bovik AC (2009) Mean squared error: love it or leave it? A new look at signal fidelity measures. IEEE Signal Process Mag 26(1):98–117
  36. Wang L, Soong FK (2015) HMM trajectory-guided sample selection for photo-realistic talking head. Multimed Tools Appl 74(22):9849–9869
  37. Wang Z, Bovik AC, Sheikh HR, Simoncelli EP (2004) Image quality assessment: from error visibility to structural similarity. IEEE Trans Image Process 13(4):600–612
  38. Yan X, Yang J, Sohn K, Lee H (2016) Attribute2Image: conditional image generation from visual attributes. In: European conference on computer vision. Springer, pp 776–791
  39. Zhao G, Huang X, Taini M, Li SZ, Pietikäinen M (2011) Facial expression recognition from near-infrared videos. Image Vis Comput 29(9):607–619
  40. Zhao Y, Jiang D, Sahli H (2015) 3D emotional facial animation synthesis with factored conditional restricted Boltzmann machines. In: 2015 International conference on affective computing and intelligent interaction (ACII). IEEE, pp 797–803

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. NPU-VUB joint AVSP Research Lab, School of Computer Science, Northwestern Polytechnical University (NPU), Xi’an, People’s Republic of China
  2. VUB-NPU joint AVSP Research Lab, Department of Electronics & Informatics (ETRO), Vrije Universiteit Brussel (VUB), Brussels, Belgium
  3. Interuniversity Microelectronics Centre (IMEC), Heverlee, Belgium
