Abstract
While significant progress has been made on understanding hand-object interactions in computer vision, it is still very challenging for robots to perform complex dexterous manipulation. In this paper, we propose DexMV (Dexterous Manipulation from Videos), a new platform and pipeline for imitation learning. The platform comprises (i) a simulation system for complex dexterous manipulation tasks with a multi-finger robot hand and (ii) a computer vision system that records large-scale demonstrations of a human hand performing the same tasks. In our pipeline, we extract 3D hand and object poses from videos and propose a novel demonstration translation method to convert human motion into robot demonstrations. We then apply and benchmark multiple imitation learning algorithms with the demonstrations. We show that the demonstrations improve robot learning by a large margin and solve complex tasks that reinforcement learning alone cannot solve. Code and videos are available at https://yzqin.github.io/dexmv
Y. Qin and Y.-H. Wu contributed equally.
Acknowledgement
This work was supported, in part, by grants from DARPA LwLL, NSF CCF-2112665 (TILOS), NSF 1730158 CI-New: Cognitive Hardware and Software Ecosystem Community Infrastructure (CHASE-CI), NSF ACI-1541349 CC*DNI Pacific Research Platform, and gifts from Meta, Google, Qualcomm and Picsart.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Qin, Y. et al. (2022). DexMV: Imitation Learning for Dexterous Manipulation from Human Videos. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13699. Springer, Cham. https://doi.org/10.1007/978-3-031-19842-7_33
DOI: https://doi.org/10.1007/978-3-031-19842-7_33
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-19841-0
Online ISBN: 978-3-031-19842-7
eBook Packages: Computer Science (R0)