
Abstract

Autonomous driving has developed rapidly in recent years. To perceive the road and the distance to objects on it for automatic path planning, 3D scene perception is required. Depth acquisition, a fundamental part of 3D vision, is therefore in urgent demand and is attracting increasing interest. With the rapid progress of deep learning based vision processing techniques, considerable effort has been devoted to deep learning based depth estimation. This chapter analyzes existing depth acquisition methods, focusing on vision-based depth estimation with deep learning. The methodologies of different vision-based approaches, including depth estimation from a single image, a stereo pair, and a video, are explained, and their similarities and differences are summarized and analyzed. The chapter also discusses future directions for depth acquisition/estimation, with enhanced exploitation of temporal information and multimodal information fusion.
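To make the stereo-pair route mentioned above concrete: a rectified stereo pair yields depth through disparity via depth = f · B / d, where f is the focal length in pixels, B is the camera baseline, and d is the disparity. The sketch below is a minimal illustration using OpenCV's classical semi-global matcher, not the chapter's own method; the file names and calibration values (KITTI-like) are placeholders that would come from calibration in practice.

```python
import cv2
import numpy as np

# Minimal classical-stereo sketch: estimate disparity with semi-global
# matching, then triangulate depth = focal_length * baseline / disparity.
# File names and calibration values are illustrative placeholders.
left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

matcher = cv2.StereoSGBM_create(
    minDisparity=0,
    numDisparities=128,  # search range in pixels; must be divisible by 16
    blockSize=5,         # matching window size (odd)
)

# StereoSGBM returns fixed-point disparity scaled by 16.
disparity = matcher.compute(left, right).astype(np.float32) / 16.0

focal_length = 721.5  # pixels (KITTI-like, from calibration)
baseline = 0.54       # meters (KITTI-like, from calibration)

valid = disparity > 0  # zero/negative disparity means no reliable match
depth = np.zeros_like(disparity)
depth[valid] = focal_length * baseline / disparity[valid]
```

Deep stereo networks of the kind surveyed in the chapter replace the hand-crafted matching cost with learned features and regress disparity end to end, but the same disparity-to-depth conversion applies.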



Acknowledgements

This chapter was supported in part by the National Natural Science Foundation of China under Grant 62271290 and Grant 61901083, and in part by the SDU QILU Young Scholars Program. The authors would also like to thank Hongwei Xu, Xianye Wu and Jianguo Wang for providing relevant materials for this chapter.

Author information


Corresponding author

Correspondence to Yanbo Gao.



Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this chapter


Cite this chapter

Li, S., Zhou, H., Gao, Y., Cai, X., Yuan, H., Zhang, W. (2023). 3D Scene Perception for Autonomous Driving. In: Zhu, Y., Cao, Y., Hua, W., Xu, L. (eds) Communication, Computation and Perception Technologies for Internet of Vehicles. Springer, Singapore. https://doi.org/10.1007/978-981-99-5439-1_7
