A video prediction approach for animating single face image


Generating dynamic 2D image-based facial expressions is a challenging task in facial animation. Much research has focused on performance-driven facial animation from given videos or images of a target face, while animating a single face image driven by emotion labels is a less explored problem. In this work, we treat the task of animating a single face image from emotion labels as a conditional video prediction problem, and propose a novel framework that combines factored conditional restricted Boltzmann machines (FCRBM) with a reconstruction contractive auto-encoder (RCAE). A modified RCAE with an associated efficient training strategy is used to extract low-dimensional features and reconstruct face images. The FCRBM is used as the animator to predict a facial expression sequence in the feature space, given discrete emotion labels and a frontal neutral face image as input. Both quantitative and qualitative evaluations on two facial expression databases, and comparison to the state of the art, show the effectiveness of the proposed framework for animating a frontal neutral face image from given emotion labels.
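The data flow described in the abstract (encode the neutral face, predict a sequence in feature space conditioned on an emotion label, decode each predicted feature) can be illustrated with a minimal sketch. This is not the authors' implementation: the encoder, decoder and predictor below are untrained stand-ins built from random linear maps with a tanh nonlinearity, and the predictor is a simple additive autoregressive model rather than an FCRBM, which uses factored multiplicative interactions and a longer frame history. All function and parameter names are hypothetical.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def encode(image, w_enc):
    """Stand-in encoder: flattened image -> low-dimensional feature vector."""
    return [math.tanh(a) for a in matvec(w_enc, image)]


def decode(feature, w_dec):
    """Stand-in decoder: feature vector -> flattened image."""
    return matvec(w_dec, feature)


def predict_next(previous, label, w_prev, w_label):
    """Stand-in animator: next feature from the previous feature and a
    one-hot emotion label (assumption: additive, not factored, conditioning)."""
    from_prev = matvec(w_prev, previous)
    from_label = matvec(w_label, label)
    return [math.tanh(x + y) for x, y in zip(from_prev, from_label)]


def animate(neutral_image, label, params, n_frames):
    """Encode once, roll the predictor forward in feature space,
    and decode every predicted feature into a frame."""
    w_enc, w_dec, w_prev, w_label = params
    feature = encode(neutral_image, w_enc)
    frames = []
    for _ in range(n_frames):
        feature = predict_next(feature, label, w_prev, w_label)
        frames.append(decode(feature, w_dec))
    return frames


def random_params(n_pixels, n_features, n_labels, seed=0):
    """Untrained random weights, used only so the sketch can run."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

    return (
        mat(n_features, n_pixels),    # encoder weights
        mat(n_pixels, n_features),    # decoder weights
        mat(n_features, n_features),  # previous-feature weights
        mat(n_features, n_labels),    # emotion-label weights
    )
```

Calling `animate(image, label, random_params(len(image), 4, 6), n_frames=5)` returns five decoded frames. Because prediction happens in the low-dimensional feature space, the image-sized decoder is the only component applied once per output frame.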






We thank Averbuch-Elor et al. for kindly providing the sequence for comparison. We thank Tao Yang for kindly running the facial expression recognition experiments, and all the students for their participation in the subjective analysis. We would also like to thank the reviewer for their detailed comments and suggestions on the manuscript, which identified important areas requiring improvement. This work is supported by the China Scholarship Council (CSC) (grant 201506290085), the Shaanxi Provincial International Science and Technology Collaboration Project (grant 2017KW-ZD-14), the Natural Science Foundation of China (grant 61273265), the VUB Interdisciplinary Research Program through the EMO-App project, and the Agency for Innovation by Science and Technology in Flanders (IWT), PhD grant nr. 131814.

Author information



Corresponding author

Correspondence to Yong Zhao.


Cite this article

Zhao, Y., Oveneke, M.C., Jiang, D. et al. A video prediction approach for animating single face image. Multimed Tools Appl 78, 16389–16410 (2019). https://doi.org/10.1007/s11042-018-6952-y



  • Facial expression animation
  • Image-based
  • Reconstruction contractive auto-encoder
  • Emotion