DexMV: Imitation Learning for Dexterous Manipulation from Human Videos

  • Conference paper
Computer Vision – ECCV 2022 (ECCV 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13699)

Abstract

While significant progress has been made on understanding hand-object interactions in computer vision, it is still very challenging for robots to perform complex dexterous manipulation. In this paper, we propose a new platform and pipeline, DexMV (Dexterous Manipulation from Videos), for imitation learning. We design a platform with: (i) a simulation system for complex dexterous manipulation tasks with a multi-finger robot hand, and (ii) a computer vision system to record large-scale demonstrations of a human hand performing the same tasks. In our pipeline, we extract 3D hand and object poses from videos and propose a novel demonstration translation method to convert human motion into robot demonstrations. We then apply and benchmark multiple imitation learning algorithms with the demonstrations. We show that the demonstrations improve robot learning by a large margin and solve complex tasks that reinforcement learning alone cannot solve. Code and videos are available at https://yzqin.github.io/dexmv
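
To make the last stage of the pipeline concrete, the following is a minimal sketch of one simple way demonstrations can be used for imitation learning: behavior cloning, i.e. supervised regression from states to actions. It is an illustration only, not the authors' method. The abstract does not name the algorithms that are benchmarked, and the linear policy, the function names (`fit_bc_policy`, `act`), the array shapes, and the synthetic data standing in for translated robot demonstrations are all assumptions made for this example.

```python
import numpy as np


def fit_bc_policy(states, actions, ridge=1e-6):
    """Fit a linear policy a = [s, 1] @ W by ridge-regularised least squares.

    states:  (N, state_dim) array of demonstration states (hypothetical layout)
    actions: (N, action_dim) array of demonstration actions
    """
    # Append a constant feature so the policy can learn a bias term.
    X = np.hstack([states, np.ones((states.shape[0], 1))])
    # Solve the normal equations (X^T X + ridge * I) W = X^T A.
    gram = X.T @ X + ridge * np.eye(X.shape[1])
    return np.linalg.solve(gram, X.T @ actions)


def act(W, state):
    """Return the cloned policy's action for a single state vector."""
    return np.append(state, 1.0) @ W


# Synthetic stand-in for robot demonstrations: a hidden linear "expert"
# maps 4-dimensional states to 2-dimensional actions.
rng = np.random.default_rng(0)
true_W = rng.normal(size=(5, 2))  # 4 state dims + bias -> 2 action dims
demo_states = rng.normal(size=(200, 4))
demo_actions = np.hstack([demo_states, np.ones((200, 1))]) @ true_W

W = fit_bc_policy(demo_states, demo_actions)
bc_error = float(np.max(np.abs(W - true_W)))  # small when cloning succeeds
```

In a real setting the state would contain robot joint and object poses and the policy would typically be a neural network. Demonstrations can also be combined with reinforcement learning rather than used for pure supervised cloning, which is closer in spirit to the comparison against reinforcement learning alone described above.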

Y. Qin and Y.-H. Wu—Equal Contribution.

Acknowledgement

This work was supported, in part, by grants from DARPA LwLL, NSF CCF-2112665 (TILOS), NSF 1730158 CI-New: Cognitive Hardware and Software Ecosystem Community Infrastructure (CHASE-CI), NSF ACI-1541349 CC*DNI Pacific Research Platform, and gifts from Meta, Google, Qualcomm and Picsart.

Author information

Corresponding author

Correspondence to Xiaolong Wang.

Electronic supplementary material

Supplementary material 1 (PDF, 2057 KB)

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Qin, Y. et al. (2022). DexMV: Imitation Learning for Dexterous Manipulation from Human Videos. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13699. Springer, Cham. https://doi.org/10.1007/978-3-031-19842-7_33

  • DOI: https://doi.org/10.1007/978-3-031-19842-7_33

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-19841-0

  • Online ISBN: 978-3-031-19842-7

  • eBook Packages: Computer Science, Computer Science (R0)
