Pixel-in-Pixel Net: Towards Efficient Facial Landmark Detection in the Wild

Abstract

Recently, heatmap regression models have become popular due to their superior performance in locating facial landmarks. However, three major problems remain: (1) they are computationally expensive; (2) they usually lack explicit constraints on global shapes; (3) domain gaps commonly degrade their performance. To address these problems, we propose Pixel-in-Pixel Net (PIPNet) for facial landmark detection. The proposed model is equipped with a novel detection head based on heatmap regression, which conducts score and offset predictions simultaneously on low-resolution feature maps. Repeated upsampling layers are thus no longer necessary, greatly reducing inference time without sacrificing model accuracy. In addition, a simple but effective neighbor regression module enforces local constraints by fusing predictions from neighboring landmarks, which enhances the robustness of the new detection head. To further improve the cross-domain generalization capability of PIPNet, we propose self-training with curriculum. This training strategy mines more reliable pseudo-labels from unlabeled cross-domain data by starting with an easier task and gradually increasing the difficulty to provide more precise labels. Extensive experiments demonstrate the superiority of PIPNet, which obtains new state-of-the-art results on three of six popular benchmarks under the supervised setting. Results on two cross-domain test sets are also consistently improved over the baselines. Notably, our lightweight version of PIPNet runs at 35.7 FPS on CPU and 200 FPS on GPU, while maintaining accuracy competitive with state-of-the-art methods. The code of PIPNet is available at https://github.com/jhb86253817/PIPNet.
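The decoding step the abstract describes (simultaneous score and offset prediction on low-resolution feature maps, with no upsampling) can be sketched as follows. This is an illustrative NumPy sketch, not the paper's exact implementation: the map shapes, the `stride`/`input_size` values, and the function name `decode_pip` are all assumptions for the example.

```python
import numpy as np

def decode_pip(scores, offset_x, offset_y, stride=32, input_size=256):
    """Decode landmark coordinates from PIP-style low-resolution maps.

    scores:   (L, H, W) score map, one channel per landmark
    offset_x: (L, H, W) within-cell x offsets, in fractions of a grid cell
    offset_y: (L, H, W) within-cell y offsets
    Returns (L, 2) normalized (x, y) coordinates in [0, 1].
    """
    L, H, W = scores.shape
    coords = np.zeros((L, 2))
    for l in range(L):
        # classification: pick the highest-scoring grid cell for landmark l
        idx = int(np.argmax(scores[l]))
        gy, gx = divmod(idx, W)
        # regression: refine with the sub-cell offset predicted at that cell,
        # then map the grid position back to normalized image coordinates
        x = (gx + offset_x[l, gy, gx]) * stride / input_size
        y = (gy + offset_y[l, gy, gx]) * stride / input_size
        coords[l] = (x, y)
    return coords
```

Because the argmax and the offset lookup both happen on the low-resolution map, no deconvolution or upsampling layers are needed at inference time, which is the source of the speedup claimed above.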



Author information

Corresponding author

Correspondence to Shengcai Liao.


Communicated by Chen Change Loy.


Cite this article

Jin, H., Liao, S. & Shao, L. Pixel-in-Pixel Net: Towards Efficient Facial Landmark Detection in the Wild. Int J Comput Vis 129, 3174–3194 (2021). https://doi.org/10.1007/s11263-021-01521-4


Keywords

  • Facial landmark detection
  • Pixel-in-pixel regression
  • Self-training with curriculum
  • Unsupervised domain adaptation