Monocular Differentiable Rendering for Self-supervised 3D Object Detection

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12366)


3D object detection from monocular images is an ill-posed problem due to the projective entanglement of depth and scale. To overcome this ambiguity, we present a novel self-supervised method for textured 3D shape reconstruction and pose estimation of rigid objects with the help of strong shape priors and 2D instance masks. Our method predicts the 3D location and mesh of each object in an image using differentiable rendering and a self-supervised objective derived from a pretrained monocular depth estimation network. We use the KITTI 3D object detection dataset to evaluate the accuracy of the method. Experiments demonstrate that we can effectively use noisy monocular depth and differentiable rendering as an alternative to expensive 3D ground-truth labels or LiDAR information.
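The depth and scale entanglement mentioned above can be illustrated with a small sketch. It is not part of the proposed method; it only shows the standard pinhole-camera fact that an object twice as large and twice as far away projects to the same pixels. The intrinsics (`focal`, `cx`, `cy`), the box dimensions, and the helper names `project`, `box_corners`, and `scale_scene` are all hypothetical values chosen for illustration.

```python
def project(point, focal=720.0, cx=620.0, cy=190.0):
    """Pinhole projection of a camera-frame point (x, y, z) to pixel coordinates.

    The intrinsics are illustrative assumptions, not values from the paper.
    """
    x, y, z = point
    return (focal * x / z + cx, focal * y / z + cy)


def box_corners(center, size):
    """Eight corners of an axis-aligned box given its center and (w, h, l) size."""
    bx, by, bz = center
    w, h, l = size
    return [
        (bx + sx * w / 2, by + sy * h / 2, bz + sz * l / 2)
        for sx in (-1, 1)
        for sy in (-1, 1)
        for sz in (-1, 1)
    ]


def scale_scene(center, size, s):
    """Scale both the object's position and its size by the same factor s."""
    return tuple(s * c for c in center), tuple(s * d for d in size)


# A car-sized box 10 m in front of the camera (hypothetical numbers).
center, size = (2.0, 1.0, 10.0), (1.8, 1.5, 4.0)
near = box_corners(center, size)

# The same box made twice as large and placed twice as far away.
far_center, far_size = scale_scene(center, size, 2.0)
far = box_corners(far_center, far_size)

pixels_near = [project(p) for p in near]
pixels_far = [project(p) for p in far]

# Largest pixel difference between the two projections.
max_diff = max(
    abs(a - b)
    for pn, pf in zip(pixels_near, pixels_far)
    for a, b in zip(pn, pf)
)
print(max_diff)
```

Because the two projections coincide, a single image cannot distinguish the two configurations without extra information, which is the role that shape priors and monocular depth play in the abstract.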



This work was supported by Toyota Research Institute Advanced Development, Inc. The authors would like to thank Richard Calland, Karim Hamzaoui, Rares Ambrus, and Vitor Guizilini for their helpful comments and suggestions.

Supplementary material

Supplementary material 1 (pdf 2846 KB)

Supplementary material 2 (mp4 57518 KB)



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. Preferred Networks, Inc., Chiyoda City, Japan
  2. Toyota Research Institute - Advanced Development, Chuo City, Japan
  3. Toyota Research Institute, Los Altos, USA
