Monocular Differentiable Rendering for Self-supervised 3D Object Detection

Beker, Deniz; Kato, Hiroharu; Morariu, Mihai Adrian; Ando, Takahiro; Matsuoka, Toru; Kehl, Wadim; Gaidon, Adrien

doi:10.1007/978-3-030-58589-1_31

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 12366)

Included in the following conference series:

European Conference on Computer Vision

Abstract

3D object detection from monocular images is an ill-posed problem due to the projective entanglement of depth and scale. To overcome this ambiguity, we present a novel self-supervised method for textured 3D shape reconstruction and pose estimation of rigid objects with the help of strong shape priors and 2D instance masks. Our method predicts the 3D location and meshes of each object in an image using differentiable rendering and a self-supervised objective derived from a pretrained monocular depth estimation network. We use the KITTI 3D object detection dataset to evaluate the accuracy of the method. Experiments demonstrate that we can effectively use noisy monocular depth and differentiable rendering as an alternative to expensive 3D ground-truth labels or LiDAR information.
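The depth and scale ambiguity mentioned above can be illustrated with a minimal pinhole-camera sketch. It is not taken from the paper: the intrinsics, the point coordinates, and the helper names `project` and `scale_about_camera` are hypothetical and chosen only for illustration. Scaling an object and its distance from the camera by the same factor leaves its image projection unchanged.

```python
from typing import List, Tuple

Point3D = Tuple[float, float, float]
Point2D = Tuple[float, float]


def project(points: List[Point3D], focal: float, cx: float, cy: float) -> List[Point2D]:
    """Pinhole projection of camera-frame points (x, y, z), assuming z > 0."""
    return [(focal * x / z + cx, focal * y / z + cy) for x, y, z in points]


def scale_about_camera(points: List[Point3D], s: float) -> List[Point3D]:
    """Scale the object's size and its distance from the camera by the same factor s."""
    return [(s * x, s * y, s * z) for x, y, z in points]


# Hypothetical intrinsics and a few points of a car-sized object about 10 m ahead.
FOCAL, CX, CY = 720.0, 620.0, 190.0
car = [
    (-1.0, 0.5, 10.0), (1.0, 0.5, 10.0), (1.0, -1.0, 10.0), (-1.0, -1.0, 10.0),
    (-1.0, 0.5, 14.0), (1.0, 0.5, 14.0),
]

near_small = project(car, FOCAL, CX, CY)
far_large = project(scale_about_camera(car, 2.0), FOCAL, CX, CY)

# Largest per-coordinate pixel difference between the two projections.
max_diff = max(abs(a - b) for p, q in zip(near_small, far_large) for a, b in zip(p, q))
print(f"max pixel difference: {max_diff:.2e}")
```

Because the two configurations are indistinguishable in a single image, some extra signal is needed to fix the scale, which is the role the abstract assigns to shape priors and the monocular depth estimate.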

Acknowledgements

This work was supported by Toyota Research Institute Advanced Development, Inc. The authors would like to thank Richard Calland, Karim Hamzaoui, Rares Ambrus, and Vitor Guizilini for their helpful comments and suggestions.

Author information

Authors and Affiliations

Preferred Networks, Inc., Chiyoda City, Japan
Deniz Beker, Hiroharu Kato, Mihai Adrian Morariu, Takahiro Ando & Toru Matsuoka
Toyota Research Institute - Advanced Development, Chuo City, Japan
Wadim Kehl
Toyota Research Institute, Los Altos, USA
Adrien Gaidon

Corresponding author

Correspondence to Deniz Beker.

Editor information

Editors and Affiliations

University of Oxford, Oxford, UK
Andrea Vedaldi
Graz University of Technology, Graz, Austria
Horst Bischof
University of Freiburg, Freiburg im Breisgau, Germany
Thomas Brox
University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
Jan-Michael Frahm

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 2 (mp4 57518 KB)

Supplementary material 1 (pdf 2846 KB)

About this paper

Cite this paper

Beker, D. et al. (2020). Monocular Differentiable Rendering for Self-supervised 3D Object Detection. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, JM. (eds) Computer Vision – ECCV 2020. ECCV 2020. Lecture Notes in Computer Science, vol 12366. Springer, Cham. https://doi.org/10.1007/978-3-030-58589-1_31

DOI: https://doi.org/10.1007/978-3-030-58589-1_31
Published: 12 November 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-58588-4
Online ISBN: 978-3-030-58589-1
eBook Packages: Computer Science, Computer Science (R0)
