InterHand2.6M: A Dataset and Baseline for 3D Interacting Hand Pose Estimation from a Single RGB Image

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12365)


Analysis of hand-hand interactions is a crucial step towards better understanding human behavior. However, most researches in 3D hand pose estimation have focused on the isolated single hand case. Therefore, we firstly propose (1) a large-scale dataset, InterHand2.6M, and (2) a baseline network, InterNet, for 3D interacting hand pose estimation from a single RGB image. The proposed InterHand2.6M consists of 2.6 M labeled single and interacting hand frames under various poses from multiple subjects. Our InterNet simultaneously performs 3D single and interacting hand pose estimation. In our experiments, we demonstrate big gains in 3D interacting hand pose estimation accuracy when leveraging the interacting hand data in InterHand2.6M. We also report the accuracy of InterNet on InterHand2.6M, which serves as a strong baseline for this new dataset. Finally, we show 3D interacting hand pose estimation results from general images. Our code and dataset are available (



We would like to thank Alexander Hypes, David Whitewolf, Eric Brockmeyer, Kevyn McPhail, Mark Pitts, Matt Stewart, Michal Perdoch, Scott Ardisson, Steven Krenn, and Timothy Godisart for building the capture system, Autumn Trimble, Danielle Belko, Junko Saragih, Laura Millerschoen, Lucas Evans, Rohan Bali, Taylor Koska, and Xiaomin Luo for the data capture and annotation efforts, and Chenglei Wu, Jason Saragih, Tomas Simon, and Yaser Sheikh for constructive feedback on data collection and the paper. This work was partially supported by the Next-Generation Information Computing Development Program (NRF-2017M3C4A7069369) and the Visual Turing Test project (IITP-2017-0-01780) funded by the Ministry of Science and ICT of Korea.

Supplementary material

Supplementary material 1 (mp4 45751 KB)

Supplementary material 2 (mp4 9007 KB)

Supplementary material 3 (mp4 8480 KB)

Supplementary material 4 (mp4 18943 KB)

504476_1_En_33_MOESM5_ESM.pdf (2.4 mb)
Supplementary material 5 (pdf 2424 KB)


  1. 1.
    Ballan, L., Taneja, A., Gall, J., Van Gool, L., Pollefeys, M.: Motion capture of hands in action using discriminative salient points. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012. LNCS, vol. 7577, pp. 640–653. Springer, Heidelberg (2012). Scholar
  2. 2.
    Cai, Y., Ge, L., Cai, J., Yuan, J.: Weakly-supervised 3D hand pose estimation from monocular RGB images. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11210, pp. 678–694. Springer, Cham (2018). Scholar
  3. 3.
    Chen, L., et al.: Generating realistic training images based on tonality-alignment generative adversarial networks for hand pose estimation. arXiv preprint arXiv:1811.09916 (2018)
  4. 4.
    Arrabales, R., Ledezma, A., Sanchis, A.: ConsScale: a pragmatic scale for measuring the level of consciousness in artificial agents. J. Conscious. Stud. 17(3–4), 131–164 (2010)Google Scholar
  5. 5.
    Ge, L., Liang, H., Yuan, J., Thalmann, D.: Robust 3D hand pose estimation in single depth images: from single-view CNN to multi-view CNNs. In: CVPR (2016)Google Scholar
  6. 6.
    Ge, L., Liang, H., Yuan, J., Thalmann, D.: 3D convolutional neural networks for efficient and robust hand pose estimation from single depth images. In: CVPR (2017)Google Scholar
  7. 7.
    Guo, H., Wang, G., Chen, X., Zhang, C., Qiao, F., Yand, H.: Region ensemble network: improving convolutional network for hand pose estimation. ICIP (2017)Google Scholar
  8. 8.
    Ha, H., Perdoch, M., Alismail, H., So Kweon, I., Sheikh, Y.: Deltille grids for geometric camera calibration. In: CVPR (2017)Google Scholar
  9. 9.
    He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)Google Scholar
  10. 10.
    Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing internal covariate shift. In: ICML (2015)Google Scholar
  11. 11.
    Iqbal, U., Molchanov, P., Breuel, T., Gall, J., Kautz, J.: Hand pose estimation via latent 2.5D heatmap regression. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11215, pp. 125–143. Springer, Cham (2018). Scholar
  12. 12.
    Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: ICLR (2014)Google Scholar
  13. 13.
    de La Gorce, M., Fleet, D.J., Paragios, N.: Model-based 3D hand pose estimation from monocular video. In: IEEE TPAMI (2011)Google Scholar
  14. 14.
    Li, W., et al.: Rethinking on multi-stage networks for human pose estimation. arXiv preprint arXiv:1901.00148 (2019)
  15. 15.
    Moon, G., Ju, Y.C., Lee, K.M.: V2V-PoseNet: voxel-to-voxel prediction network for accurate 3D hand and human pose estimation from a single depth map. In: CVPR (2018)Google Scholar
  16. 16.
    Moon, G., Ju, Y.C., Lee, K.M.: Camera distance-aware top-down approach for 3D multi-person pose estimation from a single RGB image. In: ICCV (2019)Google Scholar
  17. 17.
    Mueller, F., et al.: GANerated hands for real-time 3D hand tracking from monocular RGB. In: CVPR (2018)Google Scholar
  18. 18.
    Mueller, F., et al.: Real-time pose and shape reconstruction of two interacting hands with a single depth camera. In: ACM TOG (2019)Google Scholar
  19. 19.
    Mueller, F., Mehta, D., Sotnychenko, O., Sridhar, S., Casas, D., Theobalt, C.: Real-time hand tracking under occlusion from an egocentric RGB-D sensor. In: ICCV (2017)Google Scholar
  20. 20.
    Oikonomidis, I., Kyriazis, N., Argyros, A.A.: Tracking the articulated motion of two strongly interacting hands. In: CVPR (2012)Google Scholar
  21. 21.
    Paszke, A., et al.: Automatic differentiation in pytorch (2017)Google Scholar
  22. 22.
    Qian, C., Sun, X., Wei, Y., Tang, X., Sun, J.: Realtime and robust hand tracking from depth. In: CVPR (2014)Google Scholar
  23. 23.
    Romero, J., Tzionas, D., Black, M.J.: Embodied hands: modeling and capturing hands and bodies together. In: ACM TOG (2017)Google Scholar
  24. 24.
    Russakovsky, O., et al.: ImageNet large scale visual recognition challenge. IJCV (2015)Google Scholar
  25. 25.
    Sharp, T., et al.: Accurate, robust, and flexible real-time hand tracking. In: ACM Conference on Human Factors in Computing Systems (2015)Google Scholar
  26. 26.
    Simon, T., Joo, H., Matthews, I., Sheikh, Y.: Hand keypoint detection in single images using multiview bootstrapping. In: CVPR (2017)Google Scholar
  27. 27.
    Spurr, A., Song, J., Park, S., Hilliges, O.: Cross-modal deep variational hand pose estimation. In: CVPR (2018)Google Scholar
  28. 28.
    Sridhar, S., Mueller, F., Zollhöfer, M., Casas, D., Oulasvirta, A., Theobalt, C.: Real-time joint tracking of a hand manipulating an object from RGB-D input. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9906, pp. 294–310. Springer, Cham (2016). Scholar
  29. 29.
    Sun, X., Wei, Y., Liang, S., Tang, X., Sun, J.: Cascaded hand pose regression. In: CVPR (2015)Google Scholar
  30. 30.
    Sun, X., Xiao, B., Wei, F., Liang, S., Wei, Y.: Integral human pose regression. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11210, pp. 536–553. Springer, Cham (2018). Scholar
  31. 31.
    Tagliasacchi, A., Schröder, M., Tkach, A., Bouaziz, S., Botsch, M., Pauly, M.: Robust articulated-ICP for real-time hand tracking. In: Computer Graphics Forum (2015)Google Scholar
  32. 32.
    Tan, M., Le, Q.V.: EfficientNet: rethinking model scaling for convolutional neural networks. In: ICML (2019)Google Scholar
  33. 33.
    Tang, D., Jin Chang, H., Tejani, A., Kim, T.K.: Latent regression forest: structured estimation of 3D articulated hand posture. In: CVPR (2014)Google Scholar
  34. 34.
    Tang, D., Taylor, J., Kohli, P., Keskin, C., Kim, T.K., Shotton, J.: Opening the black box: hierarchical sampling optimization for estimating human hand pose. In: ICCV (2015)Google Scholar
  35. 35.
    Taylor, J., et al.: Efficient and precise interactive hand tracking through joint, continuous optimization of pose and correspondences. ACM Trans. Graph. (TOG) 35, 143 (2016)Google Scholar
  36. 36.
    Tompson, J., Stein, M., Lecun, Y., Perlin, K.: Real-time continuous pose recovery of human hands using convolutional networks. ACM TOG (2014)Google Scholar
  37. 37.
    Tzionas, D., Ballan, L., Srikantha, A., Aponte, P., Pollefeys, M., Gall, J.: Capturing hands in action using discriminative salient points and physics simulation. IJCV (2016)Google Scholar
  38. 38.
    Wan, C., Probst, T., Gool, L.V., Yao, A.: Self-supervised 3D hand pose estimation through training by fitting. In: CVPR (2019)Google Scholar
  39. 39.
    Wu, Y., Lin, J., Huang, T.S.: Analyzing and capturing articulated hand motion in image sequences. IEEE TPAMI (2005)Google Scholar
  40. 40.
    Xiong, F., et al.: A2J: anchor-to-joint regression network for 3D articulated pose estimation from a single depth image. In: ICCV (2019)Google Scholar
  41. 41.
    Yang, L., Yao, A.: Disentangling latent hands for image synthesis and pose estimation. In: CVPR (2019)Google Scholar
  42. 42.
    Yuan, S., et al.: Depth-based 3D hand pose estimation: from current achievements to future goals. In: CVPR (2018)Google Scholar
  43. 43.
    Yuan, S., Ye, Q., Garcia-Hernando, G., Kim, T.K.: The 2017 hands in the million challenge on 3D hand pose estimation. arXiv preprint arXiv:1707.02237 (2017)
  44. 44.
    Yuan, S., Ye, Q., Stenger, B., Jain, S., Kim, T.K.: BigHand2.2M benchmark: hand pose dataset and state of the art analysis. In: CVPR (2017)Google Scholar
  45. 45.
    Zhang, J., Jiao, J., Chen, M., Qu, L., Xu, X., Yang, Q.: 3D hand pose tracking and estimation using stereo matching. arXiv preprint arXiv:1610.07214 (2016)
  46. 46.
    Zimmermann, C., Brox, T.: Learning to estimate 3D hand pose from single RGB images. In: ICCV (2017)Google Scholar
  47. 47.
    Zimmermann, C., Ceylan, D., Yang, J., Russell, B., Argus, M., Brox, T.: FreiHand: a dataset for markerless capture of hand pose and shape from single RGB images. In: ICCV (2019)Google Scholar

Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. 1.ECE & ASRISeoul National UniversitySeoulKorea
  2. 2.Facebook Reality LabsPittsburghUSA

Personalised recommendations