
Abstract

Autonomous driving has developed rapidly in recent years. To perceive the road and the distance to objects on it for automatic path planning, 3D scene perception is required. Depth acquisition, a fundamental part of 3D vision, is therefore in urgent demand and is attracting increasing interest. With the rapid progress of deep learning based vision processing techniques, considerable effort has been devoted to deep learning based depth estimation. This chapter analyzes existing depth acquisition methods, focusing on vision-based depth estimation with deep learning. The methodologies of different vision-based approaches, including depth estimation from a single image, a stereo pair, and a video, are explained, and their similarities and differences are summarized and analyzed. The chapter also discusses future directions for depth acquisition/estimation, with enhanced exploitation of temporal information and multimodal information fusion.
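To make the stereo-pair route mentioned above concrete: a rectified stereo pair yields depth through disparity via depth = f · B / d, where f is the focal length in pixels, B is the camera baseline, and d is the disparity. The sketch below is a minimal illustration using OpenCV's classical semi-global matcher, not the chapter's own method; the file names and calibration values (KITTI-like) are placeholders that would come from calibration in practice.

```python
import cv2
import numpy as np

# Minimal classical-stereo sketch: estimate disparity with semi-global
# matching, then triangulate depth = focal_length * baseline / disparity.
# File names and calibration values are illustrative placeholders.
left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

matcher = cv2.StereoSGBM_create(
    minDisparity=0,
    numDisparities=128,  # search range in pixels; must be divisible by 16
    blockSize=5,         # matching window size (odd)
)

# StereoSGBM returns fixed-point disparity scaled by 16.
disparity = matcher.compute(left, right).astype(np.float32) / 16.0

focal_length = 721.5  # pixels (KITTI-like, from calibration)
baseline = 0.54       # meters (KITTI-like, from calibration)

valid = disparity > 0  # zero/negative disparity means no reliable match
depth = np.zeros_like(disparity)
depth[valid] = focal_length * baseline / disparity[valid]
```

Deep stereo networks of the kind surveyed in the chapter replace the hand-crafted matching cost with learned features and regress disparity end to end, but the same disparity-to-depth conversion applies.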



Acknowledgements

This chapter was supported in part by the National Natural Science Foundation of China under Grant 62271290 and Grant 61901083, and in part by the SDU QILU Young Scholars Program. The authors would also like to thank Hongwei Xu, Xianye Wu and Jianguo Wang for providing relevant materials for this chapter.

Author information


Corresponding author

Correspondence to Yanbo Gao.



Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this chapter


Cite this chapter

Li, S., Zhou, H., Gao, Y., Cai, X., Yuan, H., Zhang, W. (2023). 3D Scene Perception for Autonomous Driving. In: Zhu, Y., Cao, Y., Hua, W., Xu, L. (eds) Communication, Computation and Perception Technologies for Internet of Vehicles. Springer, Singapore. https://doi.org/10.1007/978-981-99-5439-1_7
