Self6D: Self-supervised Monocular 6D Object Pose Estimation

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12346)


6D object pose estimation is a fundamental problem in computer vision. Convolutional Neural Networks (CNNs) have recently proven capable of predicting reliable 6D pose estimates even from monocular images. Nonetheless, CNNs are extremely data-driven, and acquiring adequate annotations is often very time-consuming and labor-intensive. To overcome this shortcoming, we propose monocular 6D pose estimation by means of self-supervised learning, removing the need for real annotations. After training our proposed network fully supervised with synthetic RGB data, we leverage recent advances in neural rendering to further self-supervise the model on unannotated real RGB-D data, seeking a visually and geometrically optimal alignment. Extensive evaluations demonstrate that our proposed self-supervision significantly enhances the model’s original performance, outperforming all other methods that rely on synthetic data or employ elaborate techniques from the domain adaptation realm.


Keywords: Self-supervised learning, 6D pose estimation


This work was supported by China Scholarship Council (CSC) Grant #201906210393. This work was also supported by the National Key R&D Program of China under Grant 2018AAA0102801.




Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. BNRist, Tsinghua University, Beijing, China
  2. Technical University of Munich, Munich, Germany
  3. Google, Menlo Park, USA
