Abstract
Depth estimation is a fundamental computer vision problem: inferring the three-dimensional (3D) structure of a given scene. Because the problem is ill-posed, traditional methods generally require massive amounts of annotated data to fit the projection function from the scene to its 3D structure. Such pixel-level annotation is labor-intensive, especially for reflective surfaces such as mirrors or water. The widespread application of deep learning further intensifies the demand for large amounts of annotated data. A framework that reduces the amount of data required is therefore urgently needed. In this paper, we propose a novel semisupervised learning framework to infer the 3D structure of a given scene. First, semantic information is employed to make the depth inference more accurate. Second, we design both the depth estimation and the semantic segmentation as coarse-to-fine frameworks, so that the depth estimation can be gradually guided by the semantic segmentation. We compare our model with state-of-the-art methods. The experimental results show that our method outperforms many supervised learning-based methods, which demonstrates its effectiveness.
Additional information
This work was supported in part by the National High Technology Research and Development Program of China (Grant No. 2021YFF0900500), and the National Natural Science Foundation of China (Grant Nos. 61972115 and 61872116).
About this article
Cite this article
Zhang, Y., Fan, X. & Zhao, D. Semisupervised learning-based depth estimation with semantic inference guidance. Sci. China Technol. Sci. 65, 1098–1106 (2022). https://doi.org/10.1007/s11431-021-1948-3
DOI: https://doi.org/10.1007/s11431-021-1948-3