DeepIM: Deep Iterative Matching for 6D Pose Estimation

  • Yi Li
  • Gu Wang
  • Xiangyang Ji
  • Yu Xiang
  • Dieter Fox
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11210)

Abstract

Estimating the 6D pose of objects from images is an important problem in various applications such as robot manipulation and virtual reality. While direct regression of images to object poses has limited accuracy, matching rendered images of an object against the input image can produce accurate results. In this work, we propose a novel deep neural network for 6D pose matching named DeepIM. Given an initial pose estimation, our network is able to iteratively refine the pose by matching the rendered image against the observed image. The network is trained to predict a relative pose transformation using an untangled representation of 3D location and 3D orientation and an iterative training process. Experiments on two commonly used benchmarks for 6D pose estimation demonstrate that DeepIM achieves large improvements over state-of-the-art methods. We furthermore show that DeepIM is able to match previously unseen objects.
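To make the iterative matching idea above concrete, the following is a minimal sketch (in Python/NumPy, not taken from the paper) of how a predicted relative rotation and a decoupled translation update could be composed with the current pose estimate over several refinement iterations. The helper names `render_fn` and `net_fn`, the focal lengths `fx` and `fy`, and the exact parameterization of the translation offsets are illustrative assumptions, not the paper's definitive implementation.

```python
import numpy as np

def apply_untangled_delta(R_src, t_src, delta_R, v, fx, fy):
    """Compose a predicted relative transform with the current pose (R_src, t_src).

    Assumed untangled parameterization (for illustration only):
      - delta_R rotates the object about its own center, expressed in camera axes,
        so the rotation update does not depend on the translation.
      - v = (vx, vy, vz): vx, vy are focal-length-scaled image-plane offsets and
        vz is a log depth ratio, so the translation update does not depend on the
        object's absolute rotation.
    """
    R_tgt = delta_R @ R_src
    z_tgt = t_src[2] / np.exp(v[2])                     # depth update from log ratio
    x_tgt = (v[0] / fx + t_src[0] / t_src[2]) * z_tgt   # shift projected x, then unproject
    y_tgt = (v[1] / fy + t_src[1] / t_src[2]) * z_tgt   # shift projected y, then unproject
    return R_tgt, np.array([x_tgt, y_tgt, z_tgt])

def refine_pose(R, t, observed_img, render_fn, net_fn, fx, fy, num_iters=4):
    """Iterative matching loop: render the object at the current pose estimate,
    predict a relative transform against the observed image, apply it, and repeat.
    render_fn and net_fn are placeholders for a renderer and a trained matching network."""
    for _ in range(num_iters):
        rendered = render_fn(R, t)
        delta_R, v = net_fn(rendered, observed_img)
        R, t = apply_untangled_delta(R, t, delta_R, v, fx, fy)
    return R, t
```

Decoupling the rotation update (about the object center) from the translation update (expressed in image-plane and depth terms) keeps the two predictions independent of each other and of the object's absolute coordinates, which is what allows this kind of refinement to generalize across poses and, in principle, to unseen objects.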

Keywords

3D object recognition · 6D object pose estimation

Acknowledgments

This work was funded in part by a Siemens grant. We would also like to thank NVIDIA for generously providing the DGX station used for this research via the NVIDIA Robotics Lab and the UW NVIDIA AI Lab (NVAIL). This work was also supported by the National Key R&D Program of China (2017YFB1002202), NSFC Projects 61620106005 and 61325003, the Beijing Municipal Sci. & Tech. Commission (Z181100008918014), and the THU Initiative Scientific Research Program.

Supplementary material

Supplementary material 1 (pdf 2631 KB)

Supplementary material 2 (mp4 1429 KB)

Supplementary material 3 (mp4 7853 KB)

Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Yi Li (1), corresponding author
  • Gu Wang (1)
  • Xiangyang Ji (1)
  • Yu Xiang (2)
  • Dieter Fox (2)
  1. Tsinghua University and BNRist, Beijing, China
  2. University of Washington and NVIDIA Research, Seattle, USA
