
HandO: a hybrid 3D hand–object reconstruction model for unknown objects

  • Letter to the Editor
  • Published in: Multimedia Systems

Abstract

In various multimedia applications, reconstructing 3D meshes of hands and objects from single RGB images is of great significance. Mesh-based methods mainly resort to mesh displacements by estimating the relative positions between hands and objects, but the estimated distances may be inaccurate. Methods based on the signed distance function (SDF) learn relative positions by concurrently sampling hand meshes and object meshes; unfortunately, these methods have very limited capability of reconstructing smooth surfaces with rich details. For example, SDF-based methods are inclined to lose the topologies. To the best of our knowledge, only limited works can simultaneously reconstruct hands and objects with both smooth surfaces and accurate relative positions. To this end, we present a novel hybrid model, the Hand–Object model (HandO), which enables hand–object 3D reconstruction with smooth surfaces and accurate positions. Critically, our model is the first to use a hybrid 3D representation for this task by bringing meshes, SDFs, and parametric models together. A feature extractor extracts image features, and SDF sample points are projected onto these features to obtain the local features of each sampled point. Essentially, our model can be naturally extended to reconstruct a whole body holding an object via the new hybrid representation. Additionally, to overcome the lack of training data, we contribute a synthetic body-holding dataset to the community, facilitating research on reconstructing hands and objects. It contains 31,763 images covering over 50 object categories. Extensive experiments demonstrate that our model achieves better performance than competing methods on benchmark datasets.
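The abstract's core mechanism — projecting 3D SDF sample points onto an image feature map to obtain per-point local features — can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the bilinear sampler, the orthographic-style projection, the sphere SDF used as a stand-in supervision signal, and all names (`bilinear_sample`, `sphere_sdf`, `feat`, `uv`) are assumptions for exposition.

```python
import numpy as np

def bilinear_sample(feat, uv):
    """Bilinearly sample a feature map feat (H, W, C) at normalized
    2D points uv in [0, 1]^2 of shape (N, 2). Returns (N, C)."""
    H, W, C = feat.shape
    x = uv[:, 0] * (W - 1)
    y = uv[:, 1] * (H - 1)
    x0, y0 = np.floor(x).astype(int), np.floor(y).astype(int)
    x1, y1 = np.clip(x0 + 1, 0, W - 1), np.clip(y0 + 1, 0, H - 1)
    wx, wy = x - x0, y - y0
    # Weighted sum of the four neighboring feature vectors.
    return (feat[y0, x0] * ((1 - wx) * (1 - wy))[:, None]
            + feat[y0, x1] * (wx * (1 - wy))[:, None]
            + feat[y1, x0] * ((1 - wx) * wy)[:, None]
            + feat[y1, x1] * (wx * wy)[:, None])

def sphere_sdf(points, center, radius):
    """Analytic SDF of a sphere: negative inside, positive outside."""
    return np.linalg.norm(points - center, axis=-1) - radius

# Toy query: 3D sample points, a hypothetical projection to the image
# plane, and a random stand-in for a backbone feature map.
rng = np.random.default_rng(0)
pts = rng.uniform(-1, 1, size=(4, 3))       # 3D SDF sample points
uv = (pts[:, :2] + 1) / 2                   # assumed projection to [0, 1]^2
feat = rng.standard_normal((8, 8, 16))      # feature map (H, W, C)

local_feat = bilinear_sample(feat, uv)      # pixel-aligned local features
sdf_vals = sphere_sdf(pts, np.zeros(3), 0.5)
```

In a real pipeline the local features would be concatenated with the point coordinates and fed to an SDF decoder; here the sphere SDF merely illustrates the sign convention the representation relies on.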


Notes

  1. The dataset is shown at https://baboon527.github.io/HandO/.


Acknowledgement

This research was jointly funded by Shanghai Municipal Science and Technology Major Project (No. 2021SHZDZX0103), Shanghai Research and Innovation Functional Program (No. 17DZ2260900).

Author information

Correspondence to Yanwei Fu or Xiangyang Xue.

Additional information

Communicated by Yongdong Zhang.



About this article

Cite this article

Yu, H., Cheang, C., Fu, Y. et al. HandO: a hybrid 3D hand–object reconstruction model for unknown objects. Multimedia Systems 28, 1845–1859 (2022). https://doi.org/10.1007/s00530-021-00874-7
