Advertisement

JGR-P2O: Joint Graph Reasoning Based Pixel-to-Offset Prediction Network for 3D Hand Pose Estimation from a Single Depth Image

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12351)

Abstract

State-of-the-art single depth image-based 3D hand pose estimation methods are based on dense predictions, including voxel-to-voxel predictions, point-to-point regression, and pixel-wise estimations. Despite the good performance, those methods have a few issues in nature, such as the poor trade-off between accuracy and efficiency, and plain feature representation learning with local convolutions. In this paper, a novel pixel-wise prediction-based method is proposed to address the above issues. The key ideas are two-fold: (a) explicitly modeling the dependencies among joints and the relations between the pixels and the joints for better local feature representation learning; (b) unifying the dense pixel-wise offset predictions and direct joint regression for end-to-end training. Specifically, we first propose a graph convolutional network (GCN) based joint graph reasoning module to model the complex dependencies among joints and augment the representation capability of each pixel. Then we densely estimate all pixels’ offsets to joints in both image plane and depth space and calculate the joints’ positions by a weighted average over all pixels’ predictions, totally discarding the complex post-processing operations. The proposed model is implemented with an efficient 2D fully convolutional network (FCN) backbone and has only about 1.4M parameters. Extensive experiments on multiple 3D hand pose estimation benchmarks demonstrate that the proposed method achieves new state-of-the-art accuracy while running very efficiently with around a speed of 110 fps on a single NVIDIA 1080Ti GPU (This work was supported in part by the National Natural Science Foundation of China under Grants 61976095, in part by the Science and Technology Planning Project of Guangdong Province under Grant 2018B030323026. This work was also partially supported by the Academy of Finland.). The code is available at https://github.com/fanglinpu/JGR-P2O.

Keywords

3D hand pose estimation Depth image Graph neural network 

Supplementary material

504443_1_En_8_MOESM1_ESM.pdf (719 kb)
Supplementary material 1 (pdf 719 KB)

References

  1. 1.
    Cai, Y., et al.: Exploiting spatial-temporal relationships for 3D pose estimation via graph convolutional networks. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2272–2281 (2019)Google Scholar
  2. 2.
    Chen, X., Wang, G., Guo, H., Zhang, C.: Pose guided structured region ensemble network for cascaded hand pose estimation. Neurocomputing 395, 138–149 (2019)CrossRefGoogle Scholar
  3. 3.
    Chen, X., Wang, G., Zhang, C., Kim, T.K., Ji, X.: SHPR-Net: deep semantic hand pose regression from point clouds. IEEE Access 6, 43425–43439 (2018)CrossRefGoogle Scholar
  4. 4.
    Chu, X., Yang, W., Ouyang, W., Ma, C., Yuille, A.L., Wang, X.: Multi-context attention for human pose estimation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1831–1840 (2017)Google Scholar
  5. 5.
    Defferrard, M., Bresson, X., Vandergheynst, P.: Convolutional neural networks on graphs with fast localized spectral filtering. In: Advances in Neural Information Processing Systems, pp. 3844–3852 (2016)Google Scholar
  6. 6.
    Du, K., Lin, X., Sun, Y., Ma, X.: CrossInfoNet: multi-task information sharing based hand pose estimation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9896–9905 (2019)Google Scholar
  7. 7.
    Ge, L., Cai, Y., Weng, J., Yuan, J.: Hand PointNet: 3D hand pose estimation using point sets. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8417–8426 (2018)Google Scholar
  8. 8.
    Ge, L., Liang, H., Yuan, J., Thalmann, D.: Robust 3D hand pose estimation in single depth images: from single-view CNN to multi-view CNNs. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3593–3601 (2016)Google Scholar
  9. 9.
    Ge, L., Liang, H., Yuan, J., Thalmann, D.: 3D convolutional neural networks for efficient and robust hand pose estimation from single depth images. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1991–2000 (2017)Google Scholar
  10. 10.
    Ge, L., Ren, Z., Yuan, J.: Point-to-point regression PointNet for 3D hand pose estimation. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11217, pp. 489–505. Springer, Cham (2018).  https://doi.org/10.1007/978-3-030-01261-8_29CrossRefGoogle Scholar
  11. 11.
    Guleryuz, O.G., Kaeser-Chen, C.: Fast lifting for 3D hand pose estimation in AR/VR applications. In: 2018 25th IEEE International Conference on Image Processing (ICIP), pp. 106–110. IEEE (2018)Google Scholar
  12. 12.
    Guo, H., Wang, G., Chen, X., Zhang, C.: Towards good practices for deep 3D hand pose estimation. arXiv preprint arXiv:1707.07248 (2017)
  13. 13.
    Iqbal, U., Molchanov, P., Breuel, T., Gall, J., Kautz, J.: Hand pose estimation via latent 2.5D heatmap regression. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11215, pp. 125–143. Springer, Cham (2018).  https://doi.org/10.1007/978-3-030-01252-6_8CrossRefGoogle Scholar
  14. 14.
    Khamis, S., Taylor, J., Shotton, J., Keskin, C., Izadi, S., Fitzgibbon, A.: Learning an efficient model of hand shape variation from depth images. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2540–2548 (2015)Google Scholar
  15. 15.
    Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016)
  16. 16.
    Li, M., Chen, S., Chen, X., Zhang, Y., Wang, Y., Tian, Q.: Actional-structural graph convolutional networks for skeleton-based action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3595–3603 (2019)Google Scholar
  17. 17.
    Li, S., Lee, D.: Point-to-pose voting based hand pose estimation using residual permutation equivariant layer. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 11927–11936 (2019)Google Scholar
  18. 18.
    Li, Y., Gupta, A.: Beyond grids: learning graph representations for visual recognition. In: Advances in Neural Information Processing Systems, pp. 9225–9235 (2018)Google Scholar
  19. 19.
    Liang, X., Hu, Z., Zhang, H., Lin, L., Xing, E.P.: Symbolic graph reasoning meets convolutions. In: Advances in Neural Information Processing Systems, pp. 1853–1863 (2018)Google Scholar
  20. 20.
    Madadi, M., Escalera, S., Baró, X., Gonzalez, J.: End-to-end global to local CNN learning for hand pose recovery in depth data. arXiv preprint arXiv:1705.09606 (2017)
  21. 21.
    Monti, F., Boscaini, D., Masci, J., Rodola, E., Svoboda, J., Bronstein, M.M.: Geometric deep learning on graphs and manifolds using mixture model CNNs. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5115–5124 (2017)Google Scholar
  22. 22.
    Moon, G., Yong Chang, J., Mu Lee, K.: V2V-PoseNet: voxel-to-voxel prediction network for accurate 3D hand and human pose estimation from a single depth map. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5079–5088 (2018)Google Scholar
  23. 23.
    Mueller, F., Mehta, D., Sotnychenko, O., Sridhar, S., Casas, D., Theobalt, C.: Real-time hand tracking under occlusion from an egocentric RGB-D sensor. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1284–1293 (2017)Google Scholar
  24. 24.
    Newell, A., Yang, K., Deng, J.: Stacked hourglass networks for human pose estimation. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9912, pp. 483–499. Springer, Cham (2016).  https://doi.org/10.1007/978-3-319-46484-8_29CrossRefGoogle Scholar
  25. 25.
    Oberweger, M., Lepetit, V.: DeepPrior++: improving fast and accurate 3D hand pose estimation. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 585–594 (2017)Google Scholar
  26. 26.
    Oberweger, M., Wohlhart, P., Lepetit, V.: Hands deep in deep learning for hand pose estimation. arXiv preprint arXiv:1502.06807 (2015)
  27. 27.
    Oberweger, M., Wohlhart, P., Lepetit, V.: Training a feedback loop for hand pose estimation. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 3316–3324 (2015)Google Scholar
  28. 28.
    Oberweger, M., Wohlhart, P., Lepetit, V.: Generalized feedback loop for joint hand-object pose estimation. IEEE Trans. Pattern Anal. Mach. Intell. 42, 1898–1912 (2019)CrossRefGoogle Scholar
  29. 29.
    Pavlakos, G., Zhou, X., Derpanis, K.G., Daniilidis, K.: Coarse-to-fine volumetric prediction for single-image 3D human pose. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7025–7034 (2017)Google Scholar
  30. 30.
    Qi, C.R., Su, H., Mo, K., Guibas, L.J.: PointNet: deep learning on point sets for 3D classification and segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 652–660 (2017)Google Scholar
  31. 31.
    Qi, C.R., Yi, L., Su, H., Guibas, L.J.: PointNet++: deep hierarchical feature learning on point sets in a metric space. In: Advances in Neural Information Processing Systems, pp. 5099–5108 (2017)Google Scholar
  32. 32.
    Remelli, E., Tkach, A., Tagliasacchi, A., Pauly, M.: Low-dimensionality calibration through local anisotropic scaling for robust hand model personalization. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2535–2543 (2017)Google Scholar
  33. 33.
    Shi, L., Zhang, Y., Cheng, J., Lu, H.: Two-stream adaptive graph convolutional networks for skeleton-based action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 12026–12035 (2019)Google Scholar
  34. 34.
    Sun, X., Wei, Y., Liang, S., Tang, X., Sun, J.: Cascaded hand pose regression. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 824–832 (2015)Google Scholar
  35. 35.
    Supancic, J.S., Rogez, G., Yang, Y., Shotton, J., Ramanan, D.: Depth-based hand pose estimation: data, methods, and challenges. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1868–1876 (2015)Google Scholar
  36. 36.
    Supančič, J.S., Rogez, G., Yang, Y., Shotton, J., Ramanan, D.: Depth-based hand pose estimation: methods, data, and challenges. Int. J. Comput. Vis. 126(11), 1180–1198 (2018)CrossRefGoogle Scholar
  37. 37.
    Tang, D., Jin Chang, H., Tejani, A., Kim, T.K.: Latent regression forest: structured estimation of 3D articulated hand posture. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3786–3793 (2014)Google Scholar
  38. 38.
    Tang, D., et al.: Opening the black box: hierarchical sampling optimization for hand pose estimation. IEEE Trans. Pattern Anal. Mach. Intell. 41, 2161–2175 (2018)CrossRefGoogle Scholar
  39. 39.
    Tompson, J., Stein, M., Lecun, Y., Perlin, K.: Real-time continuous pose recovery of human hands using convolutional networks. ACM Trans. Graph. (ToG) 33(5), 169 (2014)CrossRefGoogle Scholar
  40. 40.
    Tzionas, D., Ballan, L., Srikantha, A., Aponte, P., Pollefeys, M., Gall, J.: Capturing hands in action using discriminative salient points and physics simulation. Int. J. Comput. Vis. 118(2), 172–193 (2016)MathSciNetCrossRefGoogle Scholar
  41. 41.
    Veličković, P., Cucurull, G., Casanova, A., Romero, A., Lio, P., Bengio, Y.: Graph attention networks. arXiv preprint arXiv:1710.10903 (2017)
  42. 42.
    Wan, C., Probst, T., Van Gool, L., Yao, A.: Crossing nets: combining GANs and VAEs with a shared latent space for hand pose estimation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 680–689 (2017)Google Scholar
  43. 43.
    Wan, C., Probst, T., Van Gool, L., Yao, A.: Dense 3D regression for hand pose estimation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5147–5156 (2018)Google Scholar
  44. 44.
    Wan, C., Yao, A., Van Gool, L.: Hand pose estimation from local surface normals. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9907, pp. 554–569. Springer, Cham (2016).  https://doi.org/10.1007/978-3-319-46487-9_34CrossRefGoogle Scholar
  45. 45.
    Xiong, F., et al.: A2J: anchor-to-joint regression network for 3D articulated pose estimation from a single depth image. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 793–802 (2019)Google Scholar
  46. 46.
    Yan, S., Xiong, Y., Lin, D.: Spatial temporal graph convolutional networks for skeleton-based action recognition. In: Thirty-Second AAAI Conference on Artificial Intelligence (2018)Google Scholar
  47. 47.
    Ye, Q., Yuan, S., Kim, T.-K.: Spatial attention deep net with partial PSO for hierarchical hybrid hand pose estimation. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9912, pp. 346–361. Springer, Cham (2016).  https://doi.org/10.1007/978-3-319-46484-8_21CrossRefGoogle Scholar
  48. 48.
    Yuan, S., et al.: Depth-based 3D hand pose estimation: from current achievements to future goals. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2636–2645 (2018)Google Scholar
  49. 49.
    Yuan, S., Ye, Q., Garcia-Hernando, G., Kim, T.K.: The 2017 hands in the million challenge on 3D hand pose estimation. arXiv preprint arXiv:1707.02237 (2017)
  50. 50.
    Zhou, X., Wan, Q., Zhang, W., Xue, X., Wei, Y.: Model-based deep hand pose estimation. arXiv preprint arXiv:1606.06854 (2016)

Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. 1.South China University of TechnologyGuangzhouChina
  2. 2.Center for Machine Vision and Signal AnalysisUniversity of OuluOuluFinland
  3. 3.Huawei Noah’s Ark LabHong KongChina

Personalised recommendations