Learning 6D Object Pose Estimation Using 3D Object Coordinates

  • Eric Brachmann
  • Alexander Krull
  • Frank Michel
  • Stefan Gumhold
  • Jamie Shotton
  • Carsten Rother
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8690)


This work addresses the problem of estimating the 6D pose of specific objects from a single RGB-D image. We present a flexible approach that can deal with generic objects, both textured and texture-less. The key new concept is a learned, intermediate representation in the form of a dense 3D object coordinate labelling paired with a dense class labelling. We show that on a common dataset of texture-less objects, where template-based techniques are suitable and state of the art, our approach is slightly superior in terms of accuracy. We also demonstrate that, compared to template-based techniques, our approach is more robust to varying lighting conditions. To this end, we contribute a new ground truth dataset with 10k images of 20 objects, each captured under three different lighting conditions. We demonstrate that our approach scales well with the number of objects and has the potential to run fast.
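To make the role of the object coordinate labelling concrete: if each object pixel carries a predicted 3D coordinate in the object's own frame, and the depth channel gives the same pixel a 3D point in the camera frame, then the 6D pose is the rigid transform that aligns the two point sets. The sketch below is not the authors' method (the abstract does not describe their pose optimisation); it is a generic least-squares alignment using the Kabsch algorithm on synthetic, noise-free correspondences. The function names `backproject` and `fit_rigid_pose`, the pinhole intrinsics, and all numeric values are assumptions made for illustration.

```python
import numpy as np


def backproject(u, v, depth_m, fx, fy, cx, cy):
    """Pinhole back-projection of pixel (u, v) with depth in metres
    to a 3D point in the camera frame (intrinsics are hypothetical)."""
    return np.array([(u - cx) * depth_m / fx, (v - cy) * depth_m / fy, depth_m])


def fit_rigid_pose(obj_pts, cam_pts):
    """Least-squares rotation R and translation t such that
    cam_pts ~= obj_pts @ R.T + t (Kabsch algorithm)."""
    obj_pts = np.asarray(obj_pts, dtype=float)
    cam_pts = np.asarray(cam_pts, dtype=float)
    obj_mean = obj_pts.mean(axis=0)
    cam_mean = cam_pts.mean(axis=0)
    # 3x3 cross-covariance between the centred point sets
    H = (obj_pts - obj_mean).T @ (cam_pts - cam_mean)
    U, _, Vt = np.linalg.svd(H)
    # Force det(R) = +1 so the result is a rotation, not a reflection
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = cam_mean - R @ obj_mean
    return R, t


# Synthetic stand-in for per-pixel object coordinates (object frame, metres)
rng = np.random.default_rng(0)
obj = rng.uniform(-0.1, 0.1, size=(50, 3))

# Hypothetical ground-truth pose: 30 degrees about the z-axis plus a translation
angle = np.deg2rad(30.0)
R_true = np.array([
    [np.cos(angle), -np.sin(angle), 0.0],
    [np.sin(angle),  np.cos(angle), 0.0],
    [0.0,            0.0,           1.0],
])
t_true = np.array([0.05, -0.02, 0.8])

# Camera-frame points that a depth sensor would observe for the same pixels
cam = obj @ R_true.T + t_true

R_est, t_est = fit_rigid_pose(obj, cam)
centre_point = backproject(320.0, 240.0, 1.0, fx=575.0, fy=575.0, cx=320.0, cy=240.0)
```

On real data the predicted coordinates are noisy and contain outliers, so an alignment step like this would typically sit inside a robust estimation scheme rather than being applied to all pixels at once.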


Training Image, Object Detection, Background Model, Vary Lighting Condition, Decision Forest
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.



Supplementary material

978-3-319-10605-2_35_MOESM1_ESM.pdf: Electronic Supplementary Material (PDF, 11,260 KB)



Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Eric Brachmann (1)
  • Alexander Krull (1)
  • Frank Michel (1)
  • Stefan Gumhold (1)
  • Jamie Shotton (2)
  • Carsten Rother (1)

  1. TU Dresden, Dresden, Germany
  2. Microsoft Research, Cambridge, UK
