Semisupervised learning-based depth estimation with semantic inference guidance

  • Article
Science China Technological Sciences

Abstract

Depth estimation is a fundamental computer vision problem that infers three-dimensional (3D) structures from a given scene. Because the problem is ill-posed, traditional methods generally require large amounts of annotated data to fit the projection function from the given scene to the 3D structure. Such pixel-level annotation is labor-intensive, especially for reflective surfaces such as mirrors or water. The widespread application of deep learning further intensifies the demand for large amounts of annotated data. A framework that reduces the amount of annotated data required is therefore urgently needed. In this paper, we propose a novel semisupervised learning framework to infer the 3D structure from the given scene. First, semantic information is employed to make the depth inference more accurate. Second, we design both the depth estimation and the semantic segmentation as coarse-to-fine frameworks, so that the depth estimation can be gradually guided by the semantic segmentation. We compare our model with state-of-the-art methods. The experimental results show that our method outperforms many supervised learning-based methods, which demonstrates the effectiveness of the proposed method.


Author information

Corresponding author

Correspondence to XiaoPeng Fan.

Additional information

This work was supported in part by the National High Technology Research and Development Program of China (Grant No. 2021YFF0900500), and the National Natural Science Foundation of China (Grant Nos. 61972115 and 61872116).

About this article

Cite this article

Zhang, Y., Fan, X. & Zhao, D. Semisupervised learning-based depth estimation with semantic inference guidance. Sci. China Technol. Sci. 65, 1098–1106 (2022). https://doi.org/10.1007/s11431-021-1948-3

  • DOI: https://doi.org/10.1007/s11431-021-1948-3