
3D Visual Object Detection from Monocular Images

  • Conference paper
Advances in Visual Computing (ISVC 2019)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 11844)


Abstract

3D visual object detection is a fundamental requirement for autonomous vehicles. However, accurate 3D object detection was until recently a capability unique to expensive LiDAR ranging devices, and approaches based on cheaper monocular imagery are typically incapable of localizing objects in 3D. In this paper, we propose a novel approach to predict accurate 3D bounding box locations from monocular images. We first train a generative adversarial network (GAN) to perform monocular depth estimation; the ground-truth depth used for training is obtained via depth completion on LiDAR scans. Next, we combine depth and appearance data into a bird's-eye-view representation with height, density, and grayscale intensity as the three feature channels. Finally, we train a convolutional neural network (CNN) on our feature map, leveraging bounding boxes annotated on the corresponding LiDAR scans. Experiments show that our method performs favorably against baselines.
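The bird's-eye-view encoding described above can be sketched roughly as follows. This is an illustrative reconstruction, not the paper's exact implementation: the grid ranges, cell resolution, and the log-normalized density channel (a common choice in Complex-YOLO-style BEV maps) are assumptions, and the per-point intensities here stand in for the grayscale appearance values back-projected with the estimated depth.

```python
import numpy as np

def bev_feature_map(points, intensities, x_range=(0.0, 40.0),
                    y_range=(-20.0, 20.0), z_range=(-2.0, 1.0),
                    resolution=0.1):
    """Project a point cloud (N, 3) with per-point grayscale intensity (N,)
    into a BEV grid with three channels: max height, point density,
    and mean intensity per cell. All parameters are illustrative."""
    h = int((x_range[1] - x_range[0]) / resolution)
    w = int((y_range[1] - y_range[0]) / resolution)
    fmap = np.zeros((h, w, 3), dtype=np.float32)

    # Keep only points inside the region of interest.
    mask = ((points[:, 0] >= x_range[0]) & (points[:, 0] < x_range[1]) &
            (points[:, 1] >= y_range[0]) & (points[:, 1] < y_range[1]) &
            (points[:, 2] >= z_range[0]) & (points[:, 2] < z_range[1]))
    pts, inten = points[mask], intensities[mask]

    # Discretize x/y into grid cell indices.
    xi = ((pts[:, 0] - x_range[0]) / resolution).astype(int)
    yi = ((pts[:, 1] - y_range[0]) / resolution).astype(int)

    counts = np.zeros((h, w), dtype=np.float32)
    inten_sum = np.zeros((h, w), dtype=np.float32)
    for i in range(len(pts)):
        # Height channel: max normalized height per cell.
        z_norm = (pts[i, 2] - z_range[0]) / (z_range[1] - z_range[0])
        fmap[xi[i], yi[i], 0] = max(fmap[xi[i], yi[i], 0], z_norm)
        counts[xi[i], yi[i]] += 1
        inten_sum[xi[i], yi[i]] += inten[i]

    # Density channel: log-normalized point count, clipped to [0, 1].
    fmap[:, :, 1] = np.minimum(1.0, np.log(counts + 1) / np.log(64))
    # Intensity channel: mean grayscale intensity per occupied cell.
    occupied = counts > 0
    fmap[occupied, 2] = inten_sum[occupied] / counts[occupied]
    return fmap
```

The resulting (H, W, 3) map can be fed to a standard 2D CNN detector, since the 3D localization problem has been reduced to 2D detection plus per-cell height information.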



Author information


Correspondence to Qiaosong Wang.


Copyright information

© 2019 Springer Nature Switzerland AG

About this paper


Cite this paper

Wang, Q., Rasmussen, C. (2019). 3D Visual Object Detection from Monocular Images. In: Bebis, G., et al. (eds.) Advances in Visual Computing. ISVC 2019. Lecture Notes in Computer Science, vol. 11844. Springer, Cham. https://doi.org/10.1007/978-3-030-33720-9_13


  • DOI: https://doi.org/10.1007/978-3-030-33720-9_13

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-33719-3

  • Online ISBN: 978-3-030-33720-9

  • eBook Packages: Computer Science (R0)
