A Recurrent Encoder-Decoder Network for Sequential Face Alignment

  • Xi Peng
  • Rogerio S. Feris
  • Xiaoyu Wang
  • Dimitris N. Metaxas
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9905)


We propose a novel recurrent encoder-decoder network model for real-time video-based face alignment. The model predicts 2D facial point maps regularized by a regression loss, while uniquely exploiting recurrent learning along both the spatial and temporal dimensions. At the spatial level, we add a feedback loop connection between the combined output response map and the input, enabling iterative coarse-to-fine face alignment using a single network model. At the temporal level, we first decouple the features in the bottleneck of the network into temporal-variant factors, such as pose and expression, and temporal-invariant factors, such as identity. Temporal recurrent learning is then applied to the decoupled temporal-variant features, yielding better generalization and significantly more accurate results at test time. A comprehensive experimental analysis shows the importance of each component of the proposed model, as well as superior results over the state of the art on standard datasets.


Keywords: Recurrent learning · Encoder-decoder · Face alignment
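The spatial feedback loop described in the abstract can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: `encoder_decoder` is a random placeholder standing in for the trained network, and the 16x16 map size and three refinement passes are assumed purely for illustration.

```python
import numpy as np

def encoder_decoder(x):
    # Stand-in for the trained encoder-decoder network (hypothetical):
    # a fixed random linear map followed by a softmax over pixels plays
    # the role of the learned 2D facial point response map.
    rng = np.random.default_rng(0)
    w = rng.standard_normal((x.size, 16 * 16))
    logits = x.ravel() @ w
    resp = np.exp(logits - logits.max())       # numerically stable softmax
    return (resp / resp.sum()).reshape(16, 16)

def align_face(image, num_iterations=3):
    """Coarse-to-fine alignment via the spatial feedback loop: the
    response map from the previous pass is stacked onto the input
    channels and the same network is applied again."""
    response = np.zeros_like(image)            # empty initial response map
    for _ in range(num_iterations):
        x = np.stack([image, response])        # feed previous output back in
        response = encoder_decoder(x)
    return response

out = align_face(np.full((16, 16), 0.5))
print(out.shape)  # (16, 16)
```

The key design point is that a single network is reused across refinement passes, rather than training a separate model per stage as in cascaded regression approaches.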



Copyright information

© Springer International Publishing AG 2016

Authors and Affiliations

  • Xi Peng (1)
  • Rogerio S. Feris (2)
  • Xiaoyu Wang (3)
  • Dimitris N. Metaxas (1)

  1. Rutgers University, Piscataway, USA
  2. IBM T. J. Watson Research Center, Yorktown Heights, USA
  3. Snapchat Research, Venice, USA
