Abstract
Autonomous driving has developed rapidly in recent years. To perceive the road and the distance to objects on it for automatic path planning, 3D scene perception is required. Depth acquisition, a fundamental part of 3D vision, is therefore in high demand and is attracting increasing interest. With the rapid progress of deep-learning-based vision processing techniques, considerable effort has been devoted to deep-learning-based depth estimation. This chapter analyzes existing depth acquisition methods, with a focus on deep-learning-based depth estimation from vision. The methodologies of different vision-based depth estimation approaches, including depth estimation from a single image, a stereo pair, and a video, are explained, and their similarities and differences are summarized and analyzed. The chapter also discusses future directions for depth acquisition/estimation, with enhanced exploitation of temporal information and multimodal information fusion.
Acknowledgements
This chapter was supported in part by the National Natural Science Foundation of China under Grant 62271290 and Grant 61901083, and in part by the SDU QILU Young Scholars Program. The authors would also like to acknowledge the contributions of Hongwei Xu, Xianye Wu and Jianguo Wang, who provided relevant materials for this chapter.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this chapter
Cite this chapter
Li, S., Zhou, H., Gao, Y., Cai, X., Yuan, H., Zhang, W. (2023). 3D Scene Perception for Autonomous Driving. In: Zhu, Y., Cao, Y., Hua, W., Xu, L. (eds) Communication, Computation and Perception Technologies for Internet of Vehicles. Springer, Singapore. https://doi.org/10.1007/978-981-99-5439-1_7
DOI: https://doi.org/10.1007/978-981-99-5439-1_7
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-5438-4
Online ISBN: 978-981-99-5439-1
eBook Packages: Intelligent Technologies and Robotics; Intelligent Technologies and Robotics (R0)