A CNN Cascade for Landmark Guided Semantic Part Segmentation

  • Aaron S. Jackson
  • Michel Valstar
  • Georgios Tzimiropoulos
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9915)


This paper proposes a CNN cascade for semantic part segmentation guided by pose-specific information encoded as a set of landmarks (or keypoints). There is a large amount of prior work on each of these tasks separately, yet, to the best of our knowledge, this is the first time in the literature that the interplay between pose estimation and semantic part segmentation has been investigated. To address this limitation of prior work, we propose a CNN cascade of tasks that first performs landmark localisation and then uses this information as input for guiding semantic part segmentation. We applied our architecture to the problem of facial part segmentation and report a large performance improvement over the standard unguided network on the most challenging face datasets. Testing code and models will be published online at
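The hand-off between the two stages of the cascade can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not say how the landmarks are encoded, so the Gaussian heatmap encoding, the `sigma` value, and the function names `landmarks_to_heatmaps` and `build_guided_input` are all assumptions made for illustration. The sketch only shows one plausible way that predicted landmarks could be turned into extra input channels for a segmentation network.

```python
import numpy as np


def landmarks_to_heatmaps(landmarks, height, width, sigma=3.0):
    """Render each (x, y) landmark as one 2D Gaussian heatmap channel.

    Assumption: the Gaussian encoding and sigma are illustrative choices,
    not details taken from the abstract.
    """
    ys, xs = np.mgrid[0:height, 0:width]
    maps = [
        np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2.0 * sigma ** 2))
        for x, y in landmarks
    ]
    return np.stack(maps, axis=0)  # shape: (num_landmarks, H, W)


def build_guided_input(image, landmarks, sigma=3.0):
    """Stack the image channels with the landmark heatmaps.

    image: array of shape (C, H, W); returns an array of shape (C + K, H, W),
    where K is the number of landmarks.
    """
    _, height, width = image.shape
    heatmaps = landmarks_to_heatmaps(landmarks, height, width, sigma)
    return np.concatenate([image, heatmaps], axis=0)


# Hypothetical example: a blank RGB image and three made-up landmarks,
# standing in for the output of the landmark localisation stage.
image = np.zeros((3, 64, 64), dtype=np.float32)
landmarks = [(20, 30), (44, 30), (32, 48)]
guided = build_guided_input(image, landmarks)
print(guided.shape)  # (6, 64, 64)
```

In this sketch the segmentation stage would receive `guided` instead of the raw image, so the landmark positions are available to it as spatial cues from the first layer onwards.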


Keywords: Pose estimation, Landmark localisation, Semantic part segmentation, Faces



Aaron Jackson was funded by a PhD scholarship from the University of Nottingham. The work of Michel Valstar is also funded by the European Union Horizon 2020 research and innovation programme under grant agreement number 645378. Georgios Tzimiropoulos was supported in part by the EPSRC project EP/M02153X/1 Facial Deformable Models of Animals.



Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  • Aaron S. Jackson (1)
  • Michel Valstar (1)
  • Georgios Tzimiropoulos (1)
  1. School of Computer Science, The University of Nottingham, Nottingham, UK
