DexMV: Imitation Learning for Dexterous Manipulation from Human Videos

Qin, Yuzhe; Wu, Yueh-Hua; Liu, Shaowei; Jiang, Hanwen; Yang, Ruihan; Fu, Yang; Wang, Xiaolong

doi:10.1007/978-3-031-19842-7_33

Yuzhe Qin¹²,
Yueh-Hua Wu¹²,
Shaowei Liu¹²,
Hanwen Jiang¹²,
Ruihan Yang¹²,
Yang Fu¹² &
…
Xiaolong Wang¹²

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 13699))

Included in the following conference series:

European Conference on Computer Vision

3367 Accesses
15 Citations

Abstract

While significant progress has been made on understanding hand-object interactions in computer vision, it is still very challenging for robots to perform complex dexterous manipulation. In this paper, we propose a new platform and pipeline DexMV (Dexterous Manipulation from Videos) for imitation learning. We design a platform with: (i) a simulation system for complex dexterous manipulation tasks with a multi-finger robot hand and (ii) a computer vision system to record large-scale demonstrations of a human hand conducting the same tasks. In our novel pipeline, we extract 3D hand and object poses from videos, and propose a novel demonstration translation method to convert human motion to robot demonstrations. We then apply and benchmark multiple imitation learning algorithms with the demonstrations. We show that the demonstrations can indeed improve robot learning by a large margin and solve the complex tasks which reinforcement learning alone cannot solve. Code and videos are available at https://yzqin.github.io/dexmv

Y. Qin and Y.-H. Wu—Equal Contribution.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 89.00; Price excludes VAT (USA)

Softcover Book: USD 119.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Adaptive Learning Methods for Autonomous Mobile Manipulation in RoboCup@Home

Primitive-contrastive network: data-efficient self-supervised learning from robot demonstration videos

Article 20 July 2021

Survey of imitation learning for robotic manipulation

Article 23 September 2019

References

Abbeel, P., Ng, A.Y.: Apprenticeship learning via inverse reinforcement learning (2004)
Google Scholar
Aberman, K., Wu, R., Lischinski, D., Chen, B., Cohen-Or, D.: Learning character-agnostic motion for motion retargeting in 2d. arXiv preprint arXiv:1905.01680 (2019)
Andrews, S., Kry, P.G.: Goal directed multi-finger manipulation: control policies and analysis. Comput. Graph. 37(7), 830–839 (2013)
Article Google Scholar
Antotsiou, D., Garcia-Hernando, G., Kim, T.K.: Task-oriented hand motion retargeting for dexterous manipulation imitation. In: ECCV Workshops (2018)
Google Scholar
Aytar, Y., Pfaff, T., Budden, D., Paine, T., Wang, Z., de Freitas, N.: Playing hard exploration games by watching youtube. In: NeurIPS (2018)
Google Scholar
Baek, S., Kim, K.I., Kim, T.K.: Pushing the envelope for RGB-based dense 3d hand pose estimation via neural rendering. In: CVPR (2019)
Google Scholar
Bai, Y., Liu, C.K.: Dexterous manipulation using both palm and fingers (2014)
Google Scholar
Bain, M., Sammut, C.: A framework for behavioural cloning. In: Machine Intelligence (1995)
Google Scholar
Baird III, L.C.: Advantage updating. Technical Report (1993)
Google Scholar
Bicchi, A.: Hands for dexterous manipulation and robust grasping: a difficult road toward simplicity. IEEE Trans. Robot. Autom. 16(6), 652–662 (2000)
Article Google Scholar
Bojarski, M., et al.: End to end learning for self-driving cars. arXiv (2016)
Google Scholar
Boukhayma, A., Bem, R.D., Torr, P.H.: 3d hand shape and pose from images in the wild. In: CVPR (2019)
Google Scholar
Brahmbhatt, S., Ham, C., Kemp, C.C., Hays, J.: Contactdb: analyzing and predicting grasp contact via thermal imaging. In: CVPR (2019)
Google Scholar
Brahmbhatt, S., Handa, A., Hays, J., Fox, D.: Contactgrasp: functional multi-finger grasp synthesis from contact. arXiv (2019)
Google Scholar
Calli, B., Walsman, A., Singh, A., Srinivasa, S., Abbeel, P., Dollar, A.M.: Benchmarking in manipulation research: the YDB object and model set and benchmarking protocols. arXiv (2015)
Google Scholar
Chang, A.X., et al.: Shapenet: an information-rich 3d model repository. arXiv preprint arXiv:1512.03012 (2015)
Chang, M., Gupta, A., Gupta, S.: Semantic visual navigation by watching Youtube videos. In: NIPS (2020)
Google Scholar
Chao, Y.W., et al.: Dexycb: a benchmark for capturing hand grasping of objects. In: CVPR (2021)
Google Scholar
Craig, J.J.: Introduction to Robotics: Mechanics and Control, 3/E. Pearson Education India, Noida (2009)
Google Scholar
Dogar, M.R., Srinivasa, S.S.: Push-grasping with dexterous hands: mechanics and a method (2010)
Google Scholar
Duan, Y., Chen, X., Houthooft, R., Schulman, J., Abbeel, P.: Benchmarking deep reinforcement learning for continuous control (2016)
Google Scholar
Flash, T., Hogan, N.: The coordination of arm movements: an experimentally confirmed mathematical model. J. Neurosci. 5(7), 1688–1703 (1985)
Article Google Scholar
Fu, J., Luo, K., Levine, S.: Learning robust rewards with adversarial inverse reinforcement learning. arXiv (2017)
Google Scholar
Garcia-Hernando, G., Yuan, S., Baek, S., Kim, T.K.: First-person hand action benchmark with RGB-D videos and 3d hand pose annotations. In: CVPR (2018)
Google Scholar
Ge, L., et al.: 3d hand shape and pose estimation from a single RGB image. In: CVPR (2019)
Google Scholar
Hampali, S., Rad, M., Oberweger, M., Lepetit, V.: Honnotate: a method for 3d annotation of hand and object poses. In: CVPR (2020)
Google Scholar
Handa, A., et al.: Dexpilot: vision-based teleoperation of dexterous robotic hand-arm system. In: ICRA (2020)
Google Scholar
Hasson, Y., et al.: Learning joint reconstruction of hands and manipulated objects. In: CVPR (2019)
Google Scholar
He, Y., Sun, W., Huang, H., Liu, J., Fan, H., Sun, J.: Pvn3d: a deep point-wise 3d keypoints voting network for 6dof pose estimation. In: CVPR (2020)
Google Scholar
Hecker, C., Raabe, B., Enslow, R.W., DeWeese, J., Maynard, J., van Prooijen, K.: Real-time motion retargeting to highly varied user-created morphologies. ACM Trans. Graph. (TOG) 27(3), 1–11 (2008)
Article Google Scholar
Ho, J., Ermon, S.: Generative adversarial imitation learning. In: NeurIPS (2016)
Google Scholar
Hu, Y., Hugonot, J., Fua, P., Salzmann, M.: Segmentation-driven 6d object pose estimation. In: CVPR (2019)
Google Scholar
Iqbal, U., Molchanov, P., Breuel Juergen Gall, T., Kautz, J.: Hand pose estimation via latent 2.5 d heatmap regression. In: ECCV (2018)
Google Scholar
Jiang, H., Liu, S., Wang, J., Wang, X.: Hand-object contact consistency reasoning for human grasps generation. arXiv (2021)
Google Scholar
Johnson, S.G.: The nlopt nonlinear-optimization package (2014)
Google Scholar
Kang, B., Jie, Z., Feng, J.: Policy optimization with demonstrations. In: ICML (2018)
Google Scholar
Kato, H., Ushiku, Y., Harada, T.: Neural 3d mesh renderer. In: CVPR (2018)
Google Scholar
Kehl, W., Manhardt, F., Tombari, F., Ilic, S., Navab, N.: Ssd-6d: making RGB-based 3d detection and 6d pose estimation great again. In: ICCV (2017)
Google Scholar
Khaled, S.M., et al.: Combinatorial color space models for skin detection in sub-continental human images. In: IVIC (2009)
Google Scholar
Kulon, D., Guler, R.A., Kokkinos, I., Bronstein, M.M., Zafeiriou, S.: Weakly-supervised mesh-convolutional hand reconstruction in the wild. In: CVPR (2020)
Google Scholar
Kumar, V., Xu, Z., Todorov, E.: Fast, strong and compliant pneumatic actuation for dexterous tendon-driven hands. In: ICRA (2013)
Google Scholar
Kyriakopoulos, K.J., Saridis, G.N.: Minimum jerk path generation. In: Proceedings. 1988 IEEE International Conference on Robotics and Automation, pp. 364–369. IEEE (1988)
Google Scholar
Li, S., et al.: Vision-based teleoperation of shadow dexterous hand using end-to-end deep neural network. In: ICRA (2019)
Google Scholar
Liu, F., Ling, Z., Mu, T., Su, H.: State alignment-based imitation learning. In: ICLR (2020)
Google Scholar
Liu, S., Jiang, H., Xu, J., Liu, S., Wang, X.: Semi-supervised 3d hand-object poses estimation with interactions in time. In: CVPR (2021)
Google Scholar
Mandikal, P., Grauman, K.: Dexterous robotic grasping with object-centric visual affordances. arXiv (2020)
Google Scholar
Nakamura, Y., Hanafusa, H.: Inverse kinematic solutions with singularity robustness for robot manipulator control (1986)
Google Scholar
Ng, A.Y., Russell, S.J., et al.: Algorithms for inverse reinforcement learning (2000)
Google Scholar
Okamura, A.M., Smaby, N., Cutkosky, M.R.: An overview of dexterous manipulation. In: ICRA (2000)
Google Scholar
Akkaya, I., et al.: Solving rubik’s cube with a robot hand. OpenAI, arXiv (2019)
Google Scholar
Andrychowicz, M., et al.: Learning dexterous in-hand manipulation. OpenAI, arXiv (2018)
Google Scholar
Peng, S., Liu, Y., Huang, Q.X., Bao, H., Zhou, X.: Pvnet: pixel-wise voting network for 6dof pose estimation. In: CVPR (2019)
Google Scholar
Peters, J., Schaal, S.: Reinforcement learning of motor skills with policy gradients. Neural Netw. 21(4), 682–697 (2008)
Article Google Scholar
Pomerleau, D.A.: Alvinn: an autonomous land vehicle in a neural network. In: NeurIPS (1989)
Google Scholar
Puterman, M.L.: Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley, Hoboken (1994)
Google Scholar
Rad, M., Lepetit, V.: Bb8: a scalable, accurate, robust to partial occlusion method for predicting the 3d poses of challenging objects without using depth. In: ICCV (2017)
Google Scholar
Radosavovic, I., Wang, X., Pinto, L., Malik, J.: State-only imitation learning for dexterous manipulation. In: IROS (2021)
Google Scholar
Rajeswaran, A., et al.: Learning complex dexterous manipulation with deep reinforcement learning and demonstrations. arXiv (2017)
Google Scholar
Rajeswaran, A., et al.: Learning complex dexterous manipulation with deep reinforcement learning and demonstrations (2018)
Google Scholar
Romero, J., Tzionas, D., Black, M.J.: Embodied hands: modeling and capturing hands and bodies together. In: ToG (2017)
Google Scholar
Ross, S., Bagnell, D.: Efficient reductions for imitation learning. In: AISTATS (2010)
Google Scholar
Rus, D.: In-hand dexterous manipulation of piecewise-smooth 3-d objects. Int. J. Robot. Res. 18(4), 355–381 (1999)
Article Google Scholar
Russell, S.: Learning agents for uncertain environments (1998)
Google Scholar
Schmeckpeper, K., Rybkin, O., Daniilidis, K., Levine, S., Finn, C.: Reinforcement learning with videos: combining offline observations with interaction. arXiv (2020)
Google Scholar
Schmeckpeper, K., et al.: Learning predictive models from observation and interaction. arXiv (2019)
Google Scholar
Schulman, J., Levine, S., Abbeel, P., Jordan, M., Moritz, P.: Trust region policy optimization. In: ICML (2015)
Google Scholar
Shan, D., Geng, J., Shu, M., Fouhey, D.: Understanding human hands in contact at internet scale. In: CVPR (2020)
Google Scholar
Shao, L., Migimatsu, T., Zhang, Q., Yang, K., Bohg, J.: concept2robot: learning manipulation concepts from instructions and human demonstrations. In: RSS (2020)
Google Scholar
Song, S., Zeng, A., Lee, J., Funkhouser, T.: Grasping in the wild: learning 6dof closed-loop grasping from low-cost demonstrations. Robot. Autom. Lett. 5(3), 4978–4985 (2020)
Article Google Scholar
Spurr, A., Song, J., Park, S., Hilliges, O.: Cross-modal deep variational hand pose estimation. In: CVPR (2018)
Google Scholar
Taheri, O., Ghorbani, N., Black, M.J., Tzionas, D.: Grab: a dataset of whole-body human grasping of objects. In: ECCV (2020)
Google Scholar
Tekin, B., Sinha, S.N., Fua, P.: Real-time seamless single shot 6d object pose prediction. In: CVPR (2018)
Google Scholar
Todorov, E., Erez, T., Tassa, Y.: Mujoco: a physics engine for model-based control. In: IROS (2012)
Google Scholar
Todorov, E., Jordan, M.I.: Smoothness maximization along a predefined path accurately predicts the speed profiles of complex arm movements. J. Neurophysiol. 80(2), 696–714 (1998)
Article Google Scholar
Torabi, F., Warnell, G., Stone, P.: Behavioral cloning from observation. arXiv (2018)
Google Scholar
Torabi, F., Warnell, G., Stone, P.: Generative adversarial imitation from observation. arXiv (2018)
Google Scholar
Večerík, M., et al.: Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv (2017)
Google Scholar
Xiang, Y., Schmidt, T., Narayanan, V., Fox, D.: Posecnn: a convolutional neural network for 6d object pose estimation in cluttered scenes. arXiv (2018)
Google Scholar
Young, S., Gandhi, D., Tulsiani, S., Gupta, A., Abbeel, P., Pinto, L.: Visual imitation made easy. arXiv (2020)
Google Scholar
Zimmermann, C., Brox, T.: Learning to estimate 3d hand pose from single RGB images. In: CVPR (2017)
Google Scholar

Download references

Acknowledgement

This work was supported, in part, by grants from DARPA LwLL, NSF CCF-2112665 (TILOS), NSF 1730158 CI-New: Cognitive Hardware and Software Ecosystem Community Infrastructure (CHASE-CI), NSF ACI-1541349 CC*DNI Pacific Research Platform, and gifts from Meta, Google, Qualcomm and Picsart.

Author information

Authors and Affiliations

University of California San Diego, La Jolla, 92093, USA
Yuzhe Qin, Yueh-Hua Wu, Shaowei Liu, Hanwen Jiang, Ruihan Yang, Yang Fu & Xiaolong Wang

Authors

Yuzhe Qin
View author publications
You can also search for this author in PubMed Google Scholar
Yueh-Hua Wu
View author publications
You can also search for this author in PubMed Google Scholar
Shaowei Liu
View author publications
You can also search for this author in PubMed Google Scholar
Hanwen Jiang
View author publications
You can also search for this author in PubMed Google Scholar
Ruihan Yang
View author publications
You can also search for this author in PubMed Google Scholar
Yang Fu
View author publications
You can also search for this author in PubMed Google Scholar
Xiaolong Wang
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Xiaolong Wang .

Editor information

Editors and Affiliations

Tel Aviv University, Tel Aviv, Israel
Shai Avidan
University College London, London, UK
Gabriel Brostow
Google AI, Accra, Ghana
Moustapha Cissé
University of Catania, Catania, Italy
Giovanni Maria Farinella
Facebook (United States), Menlo Park, CA, USA
Tal Hassner

1 Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 2057 KB)

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Qin, Y. et al. (2022). DexMV: Imitation Learning for Dexterous Manipulation from Human Videos. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13699. Springer, Cham. https://doi.org/10.1007/978-3-031-19842-7_33

Download citation

DOI: https://doi.org/10.1007/978-3-031-19842-7_33
Published: 23 October 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-19841-0
Online ISBN: 978-3-031-19842-7
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics

DexMV: Imitation Learning for Dexterous Manipulation from Human Videos

Abstract

Access this chapter

Similar content being viewed by others

Adaptive Learning Methods for Autonomous Mobile Manipulation in RoboCup@Home

Primitive-contrastive network: data-efficient self-supervised learning from robot demonstration videos

Survey of imitation learning for robotic manipulation

References

Acknowledgement

Author information

Authors and Affiliations

Corresponding author

Editor information

Editors and Affiliations

1 Electronic supplementary material

Supplementary material 1 (pdf 2057 KB)

Rights and permissions

Copyright information

About this paper

Cite this paper

Download citation

Publish with us

Navigation

DexMV: Imitation Learning for Dexterous Manipulation from Human Videos

Abstract

Access this chapter

Similar content being viewed by others

Adaptive Learning Methods for Autonomous Mobile Manipulation in RoboCup@Home

Primitive-contrastive network: data-efficient self-supervised learning from robot demonstration videos

Survey of imitation learning for robotic manipulation

References

Acknowledgement

Author information

Authors and Affiliations

Corresponding author

Editor information

Editors and Affiliations

1 Electronic supplementary material

Supplementary material 1 (pdf 2057 KB)

Rights and permissions

Copyright information

About this paper

Cite this paper

Download citation

Share this paper

Publish with us

Search

Navigation