
DeepIM: Deep Iterative Matching for 6D Pose Estimation

Published in: International Journal of Computer Vision

Abstract

Estimating the 6D poses of objects from images is an important problem in applications such as robot manipulation and virtual reality. While direct regression from images to object poses has limited accuracy, matching rendered images of an object against the input image can produce accurate results. In this work, we propose a novel deep neural network for 6D pose matching named DeepIM. Given an initial pose estimate, our network iteratively refines the pose by matching the rendered image against the observed image. The network is trained to predict a relative pose transformation using a disentangled representation of 3D location and 3D orientation together with an iterative training process. Experiments on two commonly used benchmarks for 6D pose estimation demonstrate that DeepIM achieves large improvements over state-of-the-art methods. We furthermore show that DeepIM is able to match previously unseen objects.
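The refinement loop described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the real DeepIM matcher is a deep network that predicts the relative transformation by comparing the rendered image with the observed image, whereas the mock predictor below simply moves a fraction of the way toward a ground-truth pose so the loop's structure can be shown end to end. All function names here are hypothetical.

```python
import numpy as np

def rotmat_z(theta):
    """Rotation matrix about the z-axis (radians)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def mock_matching_network(R_est, t_est, R_gt, t_gt, step=0.5):
    """Stand-in for the learned matcher. In DeepIM the relative pose comes
    from comparing a rendered image against the observed image; here we
    cheat and move a fraction of the residual toward the ground truth."""
    dR = R_gt @ R_est.T                      # residual rotation in camera frame
    dR_step = np.eye(3) + step * (dR - np.eye(3))   # small-angle fractional step
    U, _, Vt = np.linalg.svd(dR_step)        # project back onto SO(3)
    dR_step = U @ Vt
    dt = step * (t_gt - t_est)               # fractional translation step
    return dR_step, dt

def refine_pose(R0, t0, R_gt, t_gt, iters=4):
    """Iterative matching: each step predicts a relative SE(3) transform and
    composes it with the current estimate. The update is disentangled in the
    spirit of the paper: rotation is composed about the object center while
    translation is updated independently."""
    R, t = R0, t0
    for _ in range(iters):
        dR, dt = mock_matching_network(R, t, R_gt, t_gt)
        R = dR @ R
        t = t + dt
    return R, t
```

Starting from a pose 30 degrees and several centimeters off, a few iterations of this loop shrink both the rotation and translation error, which is the qualitative behavior the paper demonstrates with its learned matcher.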



Acknowledgements

We thank Lirui Wang at the University of Washington for his contribution to this project. This work was funded in part by a Siemens Grant. We would also like to thank NVIDIA for generously providing the DGX station used for this research via the NVIDIA Robotics Lab and the UW NVIDIA AI Lab (NVAIL). This work was also supported by National Key R&D Program of China 2017YFB1002202, NSFC Projects 61620106005 and 61325003, Beijing Municipal Sci. & Tech. Commission Z181100008918014, and the THU Initiative Scientific Research Program.

Author information

Corresponding author

Correspondence to Yi Li.

Additional information

Communicated by Cristian Sminchisescu.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Li, Y., Wang, G., Ji, X. et al. DeepIM: Deep Iterative Matching for 6D Pose Estimation. Int J Comput Vis 128, 657–678 (2020). https://doi.org/10.1007/s11263-019-01250-9
