
Single image depth estimation using improved U-Net and edge-guide loss

Published in: Multimedia Tools and Applications

Abstract

Monocular depth estimation is regarded as a critical link in context-aware scene comprehension: it takes an image captured from a single viewpoint as input and directly predicts the depth value of each pixel. However, predicting accurate object borders without replicating texture is difficult, which leads to missing tiny objects and blurry object edges in the predicted depth maps. In this paper, we propose a monocular depth estimation method built on an improved U-Net encoder-decoder network. We introduce a new training loss term, called edge-guide loss, which pushes the network to focus on object edges and thereby improves depth accuracy at edges and for tiny objects. In the network, we build the encoder on DenseNet-169 and the decoder from 2× bilinear up-sampling, skip-connections and hybrid dilated convolution; the skip-connections pass multi-scale feature maps from the encoder to the decoder. The full loss function combines the new edge-guide loss with three basic loss terms. We evaluate our algorithm on the NYU Depth V2 dataset. The experimental results show that the proposed network produces depth maps from a single RGB image with unambiguous borders and more detail on tiny objects, and that it outperforms state-of-the-art approaches in both visual quality and objective measurement.
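To make the described architecture concrete, the following is a minimal PyTorch sketch of one plausible realization: a DenseNet-169 encoder [20], a decoder built from 2× bilinear up-sampling blocks fed by skip-connections, and a hybrid-dilated-convolution (HDC) block [36] between encoder and decoder. The skip points, channel widths, and the placement of the HDC block are illustrative assumptions, not the paper's exact configuration.

```python
# A minimal sketch (assumed details) of the encoder-decoder described above.
# Skip points, channel widths, and HDC placement are illustrative choices.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import densenet169

class UpBlock(nn.Module):
    """2x bilinear up-sampling; the encoder skip feature map is
    concatenated before two 3x3 convolutions."""
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch + skip_ch, out_ch, 3, padding=1)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1)

    def forward(self, x, skip):
        # Resize to the skip's spatial size (a 2x step at every stage here).
        x = F.interpolate(x, size=skip.shape[2:], mode="bilinear",
                          align_corners=True)
        x = torch.cat([x, skip], dim=1)
        return F.relu(self.conv2(F.relu(self.conv1(x))))

class HDCBlock(nn.Module):
    """Hybrid dilated convolution: stacked 3x3 convolutions with dilation
    rates 1, 2, 3 to enlarge the receptive field without gridding [36]."""
    def __init__(self, ch):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv2d(ch, ch, 3, padding=d, dilation=d) for d in (1, 2, 3))

    def forward(self, x):
        for conv in self.convs:
            x = F.relu(conv(x))
        return x

class DepthNet(nn.Module):
    def __init__(self):
        super().__init__()
        # ImageNet-pretrained encoder [44]; final features: 1664 ch at 1/32.
        self.encoder = densenet169(weights="DEFAULT").features
        self.hdc = HDCBlock(1664)
        self.up1 = UpBlock(1664, 256, 832)   # skip: transition2 (1/16)
        self.up2 = UpBlock(832, 128, 416)    # skip: transition1 (1/8)
        self.up3 = UpBlock(416, 64, 208)     # skip: pool0 (1/4)
        self.up4 = UpBlock(208, 64, 104)     # skip: relu0 (1/2)
        self.out = nn.Conv2d(104, 1, 3, padding=1)

    def forward(self, x):
        skips, feats = [], x
        for name, layer in self.encoder.named_children():
            feats = layer(feats)
            if name in ("relu0", "pool0", "transition1", "transition2"):
                skips.append(feats)
        d = self.hdc(feats)
        for up, skip in zip((self.up1, self.up2, self.up3, self.up4),
                            reversed(skips)):
            d = up(d, skip)
        # Prediction at half the input resolution, as is common in this line
        # of work (e.g. [14]); upsample to full size if needed.
        return self.out(d)
```

The decoder mirrors the up-sample-then-concatenate pattern of U-Net-style depth networks such as [14], and the dilation schedule 1-2-3 follows the anti-gridding recipe of [36].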

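The three basic loss terms and the exact form of the edge-guide loss are not spelled out on this page. As a hedged reading based on related work, depth networks in this line (e.g. [14, 29]) commonly combine a point-wise L1 term, an image-gradient term, and an SSIM term [38], and an edge-guide term can up-weight the depth error on pixels that an edge detector marks as object boundaries (Sobel below for simplicity; the references suggest Canny [40]). The sketch implements that assumed combination; `lambda_edge` and the edge threshold are placeholders.

```python
# A hedged sketch of a combined depth loss with an edge-guided term.
# The three "basic" terms and the edge weighting are assumptions drawn
# from related work, not the paper's exact definition.
import torch
import torch.nn.functional as F
from pytorch_msssim import ssim  # third-party: pip install pytorch-msssim

def image_gradients(x):
    """Finite-difference gradients along height and width."""
    dy = x[:, :, 1:, :] - x[:, :, :-1, :]
    dx = x[:, :, :, 1:] - x[:, :, :, :-1]
    return dy, dx

def sobel_edges(depth, thresh=0.1):
    """Binary edge map from Sobel responses on the ground-truth depth."""
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]],
                      device=depth.device).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)
    gx = F.conv2d(depth, kx, padding=1)
    gy = F.conv2d(depth, ky, padding=1)
    return (torch.sqrt(gx ** 2 + gy ** 2) > thresh).float()

def depth_loss(pred, gt, lambda_edge=1.0):
    # Basic term 1: point-wise L1 depth error.
    l_depth = F.l1_loss(pred, gt)
    # Basic term 2: L1 error of image gradients (sharpens discontinuities).
    (pdy, pdx), (gdy, gdx) = image_gradients(pred), image_gradients(gt)
    l_grad = F.l1_loss(pdy, gdy) + F.l1_loss(pdx, gdx)
    # Basic term 3: structural similarity [38], rescaled to a loss in [0, 1].
    l_ssim = (1.0 - ssim(pred, gt, data_range=gt.max().item())) / 2.0
    # Edge-guide term: up-weight depth error on ground-truth edge pixels.
    edges = sobel_edges(gt)
    l_edge = (edges * (pred - gt).abs()).sum() / edges.sum().clamp(min=1.0)
    return l_depth + l_grad + l_ssim + lambda_edge * l_edge
```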


Data availability

Our experiments use the publicly available NYU Depth V2 dataset [41].

References

  1. Huang C-H, Tsung W-N, Yang W-J, Chen C-H (2019) Unsupervised monocular depth estimation for autonomous driving. In: Proceedings of the international display workshops (IDW), pp 128–131

  2. Lai C, Su K (2018) Development of an intelligent mobile robot localization system using Kinect RGB-D mapping and neural network. Comput Electr Eng 67:620–628

  3. Lee J, Joo S (2021) Three-dimensional depth estimation of virtual objects in augmented reality. J Vision 21(9):2485. https://doi.org/10.1167/jov.21.9.2485

  4. Smisek J, Jancosek M, Pajdla T (2011) 3D with kinect. In: IEEE international conference on computer vision workshops (ICCV Workshops), pp 1154–1160. https://doi.org/10.1109/ICCVW.2011.6130380

  5. Dubayah RO, Drake JB (2000) Lidar remote sensing for forestry. J Forest 98(6):44–46

  6. Godard C, Mac Aodha O, Firman M, Brostow GJ (2019) Digging into self-supervised monocular depth estimation. In: Proceedings of the IEEE international conference on computer vision (ICCV), pp 3827–3837. https://doi.org/10.1109/ICCV.2019.00393

  7. Chen KY, Chien CC, Tseng CT (2013) Improving the accuracy of depth estimation in binocular vision for robotic applications. Appl Mech Mater 284–287:1862–1866. https://doi.org/10.4028/www.scientific.net/AMM.284-287.1862

  8. Allison RS, Gillam BJ, Vecellio E (2009) Binocular depth discrimination and estimation beyond interaction space. J Vision 9(1):1–14. https://doi.org/10.1167/9.1.10

  9. Zhou T, Brown M, Snavely N, Lowe D (2017) Unsupervised learning of depth and ego-motion from video. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp 6612–6621. https://doi.org/10.1109/CVPR.2017.700

  10. Wang C, Buenaposada JM, Rui Z, Lucey S (2018) Learning depth from monocular videos using direct methods. In: Proceedings of the IEEE computer society conference on computer vision and pattern recognition (CVPR), pp 2022–2030. https://doi.org/10.1109/CVPR.2018.00216

  11. Ranjan A, Jampani V, Balles L, Kim K, Sun D, Wulff J, Black MJ (2019) Competitive collaboration: joint unsupervised learning of depth, camera motion, optical flow and motion segmentation. In: Proceedings of the IEEE computer society conference on computer vision and pattern recognition (CVPR), pp 12232–12241. https://doi.org/10.1109/CVPR.2019.01252

  12. Eigen D, Puhrsch C, Fergus R (2014) Depth map prediction from a single image using a multi-scale deep network. Adv Neural Inf Process Syst 3:2366–2374

  13. Eigen D, Fergus R (2015) Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In: Proceedings of the IEEE international conference on computer vision (ICCV), pp 2650–2658. https://doi.org/10.1109/ICCV.2015.304

  14. Alhashim I, Wonka P (2018) High quality monocular depth estimation via transfer learning. arXiv:1812.11941

  15. Laga H, Jospin L, Boussaid F, Bennamoun M (2022) A survey on deep learning techniques for stereo-based depth estimation. IEEE T Pattern Anal 44(4):1738–1764

  16. Bolles RC, Baker HH, Marimont DH (1987) Epipolar-plane image analysis: An approach to determining structure from motion. Int J Comput Vis 1(1):7–55

  17. Prados E, Faugeras O (2005) A generic and provably convergent shape-from-shading method for orthographic and pinhole cameras. Int J Comput Vis 65(1–2):97–125

  18. Nayar SK, Nakagawa Y (1994) Shape from focus. IEEE T Pattern Anal 16(8):824–831

  19. Favaro P, Soatto S (2005) A geometric approach to shape from defocus. IEEE T Pattern Anal 27(3):406–417

  20. Huang G, Liu Z, Laurens V, Weinberger KQ (2017) Densely connected convolutional networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp 2261–2269. https://doi.org/10.1109/CVPR.2017.243

  21. Hao Z, Li Y, You S, Lu F (2018) Detail preserving depth estimation from a single image using attention guided networks. In: Proceedings of the international conference on 3D vision (3DV), pp 304–313. https://doi.org/10.1109/3DV.2018.00043

  22. Lee J, Kim C (2019) Monocular depth estimation using relative depth maps. In: Proceedings of the IEEE computer society conference on computer vision and pattern recognition (CVPR), pp 9721–9730. https://doi.org/10.1109/CVPR.2019.00996

  23. Xue F, Cao J, Zhou Y, Sheng F, Wang Y, Ming A (2021) Boundary-induced and scene-aggregated network for monocular depth prediction. Pattern Recogn 115:107901. https://doi.org/10.1016/j.patcog.2021.107901

  24. Laina I, Rupprecht C, Belagiannis V, Tombari F, Navab N (2016) Deeper depth prediction with fully convolutional residual networks. In: Proceedings of the international conference on 3D vision (3DV), pp 239–248. https://doi.org/10.1109/3DV.2016.32

  25. Wang L, Zhang J, Wang O, Lin Z, Lu H (2020) SDC-depth: semantic divide-and-conquer network for monocular depth estimation. In: Proceedings of the IEEE computer society conference on computer vision and pattern recognition (CVPR), pp 538–547. https://doi.org/10.1109/CVPR42600.2020.00062

  26. Lyu X, Liu L, Wang M, Kong X, Liu L, Liu Y, Chen X, Yuan Y (2021) HR-depth: high resolution self-supervised monocular depth estimation. In: 35th AAAI conference on artificial intelligence (AAAI), pp 2294–2301

  27. Li B, Shen C, Dai Y, van den Hengel A, He M (2015) Depth and surface normal estimation from monocular images using regression on deep features and hierarchical CRFs. In: Proceedings of the IEEE computer society conference on computer vision and pattern recognition (CVPR), pp 1119–1127. https://doi.org/10.1109/CVPR.2015.7298715

  28. He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp 770–778. https://doi.org/10.1109/CVPR.2016.90

  29. Hu J, Ozay M, Zhang Y, Okatani T (2019) Revisiting single image depth estimation: toward higher resolution maps with accurate object boundaries. In: Proceedings of the IEEE winter conference on applications of computer vision (WACV), pp 1043–1051. https://doi.org/10.1109/WACV.2019.00116

  30. Chen W, Fu Z, Yang D, Deng J (2016) Single-image depth perception in the wild. In: Proceedings of the annual conference on neural information processing systems (NIPS), pp 730–738

  31. Xian K, Zhang J, Wang O, Mai L, Lin Z, Cao Z (2020) Structure-guided ranking loss for single image depth prediction. In: Proceedings of the IEEE computer society conference on computer vision and pattern recognition (CVPR), pp 608–617. https://doi.org/10.1109/CVPR42600.2020.00069

  32. Zeiler M, Taylor GW, Fergus R (2011) Adaptive deconvolutional networks for mid and high level feature learning. In: Proceedings of the IEEE international conference on computer vision (ICCV), pp 2018–2025. https://doi.org/10.1109/ICCV.2011.6126474

  33. Zeiler M, Fergus R (2014) Visualizing and understanding convolutional networks. Lect Notes Comput Sci 8689:818–833

  34. Shelhamer E, Long J, Darrell T (2017) Fully convolutional networks for semantic segmentation. IEEE T Pattern Anal 39(4):640–651

  35. Yu F, Koltun V (2016) Multi-scale context aggregation by dilated convolutions. In: Proceedings of the international conference on learning representations (ICLR)

  36. Wang P, Chen P, Yuan Y, Liu D, Huang Z, Hou X, Cottrell G (2018) Understanding convolution for semantic segmentation. In: Proceedings of the IEEE winter conference on applications of computer vision (WACV), pp 1451–1460. https://doi.org/10.1109/WACV.2018.00163

  37. Zhu S, Brazil G, Liu X (2020) The edge of depth: explicit constraints between segmentation and depth. In: Proceedings of the IEEE computer society conference on computer vision and pattern recognition (CVPR), pp 13113–13122. https://doi.org/10.1109/CVPR42600.2020.01313

  38. Wang Z, Bovik A, Sheikh H, Simoncelli E (2004) Image quality assessment: from error visibility to structural similarity. IEEE T Image Process 13(4):600–612

  39. Godard C, Mac Aodha O, Brostow GJ (2017) Unsupervised monocular depth estimation with left-right consistency. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp 6602–6611. https://doi.org/10.1109/CVPR.2017.699

  40. Canny J (1986) A computational approach to edge detection. IEEE T Pattern Anal 8(6):679–698

  41. Silberman N, Hoiem D, Kohli P, Fergus R (2012) Indoor segmentation and support inference from RGBD images. In: Proceedings of the european conference on computer vision (ECCV), pp 746–760. https://doi.org/10.1007/978-3-642-33715-4_54

  42. Levin A, Lischinski D, Weiss Y (2004) Colorization using optimization. ACM T Graphic 23:689–694

  43. Kingma D, Ba J (2014) Adam: a method for stochastic optimization. In: Proceedings of the international conference on learning representations (ICLR)

  44. Deng J, Dong W, Socher R, Li L-J, Li K, Fei-Fei L (2009) ImageNet: a large-scale hierarchical image database. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp 248–255. https://doi.org/10.1109/CVPR.2009.5206848

Download references

Acknowledgements

This work was supported by the Key R&D Program Project of Shaanxi Province, China (Grant Number 2020NY-144). The authors appreciate the funding organization for its financial support. The authors would also like to thank all the authors cited in this article and the anonymous reviewers for their helpful comments and suggestions.

Funding

The research leading to these results received funding from the Key R&D Program Project of Shaanxi Province, China (Grant Number 2020NY-144).

Author information


Corresponding author

Correspondence to Yan Long.

Ethics declarations

Conflict of interest

The authors have no relevant financial or non-financial interests to disclose.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

He, M., Gao, Y. & Long, Y. Single image depth estimation using improved U-Net and edge-guide loss. Multimed Tools Appl (2024). https://doi.org/10.1007/s11042-024-19235-3

