Weakly-Supervised 3D Hand Pose Estimation from Monocular RGB Images

  • Yujun Cai
  • Liuhao Ge
  • Jianfei Cai
  • Junsong Yuan
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11210)

Abstract

Compared with depth-based 3D hand pose estimation, inferring 3D hand pose from monocular RGB images is more challenging, owing to substantial depth ambiguity and the difficulty of obtaining fully annotated training data. Unlike existing learning-based monocular RGB approaches, which require accurate 3D annotations for training, we propose to leverage depth images, which are easily obtained from commodity RGB-D cameras, during training only; at test time the method takes RGB input alone to predict 3D joints. This alleviates the burden of costly 3D annotation for real-world datasets. Specifically, we propose a weakly-supervised method that adapts from a fully annotated synthetic dataset to a weakly labeled real-world dataset with the aid of a depth regularizer, which generates depth maps from the predicted 3D pose and serves as weak supervision for 3D pose regression. Extensive experiments on benchmark datasets validate the effectiveness of the proposed depth regularizer in both weakly-supervised and fully-supervised settings.
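
The weak-supervision idea in the abstract can be illustrated with a toy sketch. This is not the authors' model: the abstract does not specify how the depth regularizer is built, so the hand-written Gaussian-splat renderer `render_depth`, the L1 comparison, the weight `lam`, and all function names below are assumptions chosen only to show how a depth map generated from a predicted pose can supervise that pose when no 3D joint labels exist.

```python
import numpy as np

def render_depth(joints_xyz, size=32, sigma=1.5):
    """Toy stand-in for a depth regularizer (an assumption, not the paper's
    network): splat each joint's depth z around its pixel (x, y)."""
    joints = np.asarray(joints_xyz, dtype=np.float64)
    ys, xs = np.mgrid[0:size, 0:size]
    num = np.zeros((size, size))
    den = np.zeros((size, size))
    for x, y, z in joints:
        w = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2.0 * sigma ** 2))
        num += w * z
        den += w
    covered = den > 0.1          # pixels close enough to some joint
    depth = np.zeros((size, size))  # 0 means "no depth here"
    depth[covered] = num[covered] / den[covered]
    return depth

def depth_regularizer_loss(pred_joints, observed_depth):
    """Mean absolute difference between rendered and observed depth,
    taken over pixels where the (square) observed map has a measurement."""
    rendered = render_depth(pred_joints, size=observed_depth.shape[0])
    valid = observed_depth > 0
    if not valid.any():
        return 0.0
    return float(np.abs(rendered[valid] - observed_depth[valid]).mean())

def training_loss(pred_joints, observed_depth, gt_joints=None, lam=0.1):
    """Depth term for every sample; a 3D joint term only when 3D labels
    exist (e.g. synthetic data). lam is a hypothetical weighting."""
    loss = lam * depth_regularizer_loss(pred_joints, observed_depth)
    if gt_joints is not None:
        diff = np.asarray(pred_joints, float) - np.asarray(gt_joints, float)
        loss += float(np.mean(np.sum(diff ** 2, axis=1)))
    return loss

# Three hypothetical joints as (x pixel, y pixel, depth).
true_pose = [(8, 8, 0.5), (16, 20, 0.6), (24, 12, 0.7)]
observed = render_depth(true_pose)                  # stands in for a sensor depth map
shifted = [(x, y, z + 0.1) for x, y, z in true_pose]

weak_loss = training_loss(shifted, observed)        # no 3D labels available
full_loss = training_loss(shifted, observed, gt_joints=true_pose)
```

In this sketch a pose whose depths are off still incurs a nonzero loss on a weakly labeled sample, because its rendered depth map disagrees with the observed one; that disagreement is the only signal available when 3D joint annotations are missing.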

Keywords

3D hand pose estimation · Weakly-supervised methods · Depth regularizer

Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. Institute for Media Innovation, Interdisciplinary Graduate School, Nanyang Technological University, Singapore, Singapore
  2. School of Computer Science and Engineering, Nanyang Technological University, Singapore, Singapore
  3. Department of Computer Science and Engineering, State University of New York at Buffalo, Buffalo, USA
