Investigating Depth Domain Adaptation for Efficient Human Pose Estimation

  • Angel Martínez-González
  • Michael Villamizar
  • Olivier Canévet
  • Jean-Marc Odobez
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11130)

Abstract

Convolutional Neural Networks (CNNs) are the leading models for human body landmark detection from RGB vision data. However, since such models entail a high computational load, an alternative is to rely on depth images, whose simpler nature allows the use of less complex CNNs and can hence lead to a faster detector. As learning CNNs from scratch requires large amounts of labeled data, which are not always available or are expensive to obtain, we propose to rely on simulations and synthetic examples to build a large training dataset with precise labels. Nevertheless, the final performance on real data suffers from the mismatch between the training and test data, also called the domain shift between the source and target distributions. Thus, the main contribution of this paper is to investigate the use of unsupervised domain adaptation techniques to close the performance gap introduced by these distribution differences. The challenge lies in the important noise differences between synthetic and real data (not only Gaussian noise, but also many missing values around body limbs), as well as the fact that we address a regression task rather than a classification one. In addition, we introduce a new public dataset of synthetically generated depth images that covers multi-person pose estimation. Our experiments show that domain adaptation provides some improvement, but that further network fine-tuning with real annotated data is worth including to supervise the adaptation process.
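The core mechanism behind the adversarial domain adaptation described above can be illustrated with a gradient reversal layer (GRL), as popularized by Ganin et al.: a domain discriminator is trained to distinguish synthetic (source) from real (target) features, while the reversed gradient pushes the feature extractor to make the two domains indistinguishable. The sketch below is a minimal, hypothetical illustration in NumPy (the toy 1-D feature, the weight `w`, and the scaling factor `lam` are illustrative assumptions, not the paper's actual architecture or values):

```python
import numpy as np

def grl_forward(x):
    """Gradient reversal layer, forward pass: the identity function."""
    return x

def grl_backward(grad, lam=1.0):
    """Backward pass: negate and scale the gradient flowing back to the
    feature extractor, so it learns to *confuse* the discriminator."""
    return -lam * grad

def domain_loss_grad(feature, w, label):
    """Gradient of a binary cross-entropy domain loss w.r.t. the feature,
    for a toy logistic discriminator p = sigmoid(w * feature)."""
    p = 1.0 / (1.0 + np.exp(-w * feature))
    return w * (p - label)

# Toy step: discriminator gradient for a "synthetic" sample (label 1).
g_discriminator = domain_loss_grad(feature=2.0, w=0.5, label=1.0)

# After the GRL, the feature extractor receives the negated gradient,
# i.e. it ascends the domain loss while the discriminator descends it.
g_features = grl_backward(g_discriminator, lam=0.3)
assert np.sign(g_features) == -np.sign(g_discriminator)
```

In a full model, the task head (here, landmark heatmap regression) is trained only on labeled synthetic data, while the GRL branch sees unlabeled real depth images, which is what makes the adaptation unsupervised.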

Keywords

Human pose estimation · Adversarial learning · Domain adaptation · Machine learning

Notes

Acknowledgments

This work was supported by the European Union under the EU Horizon 2020 Research and Innovation Action MuMMER (MultiModal Mall Entertainment Robot), project ID 688147, as well as the Mexican National Council for Science and Technology (CONACYT) under the PhD scholarships program.


Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Angel Martínez-González (1, 2)
  • Michael Villamizar (1)
  • Olivier Canévet (1)
  • Jean-Marc Odobez (1, 2)
  1. Idiap Research Institute, Martigny, Switzerland
  2. École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland