Abstract
Self-supervised monocular depth estimation has been widely investigated and applied in previous works. However, existing methods suffer from texture-copy artifacts, depth drift, and incomplete structure. It is difficult for standard CNNs to fully capture the relationship between an object and its surrounding environment. Moreover, it is hard to design a depth smoothness loss that balances depth smoothness against sharpness. To address these issues, we propose a coarse-to-fine method with a normalized convolutional block attention module (NCBAM). In the coarse estimation stage, we incorporate the NCBAM into the depth and pose networks to overcome the texture-copy and depth drift problems. In the refinement stage, a second network refines the coarse depth under the guidance of the color image to produce a structure-preserving depth result. Our method produces results competitive with state-of-the-art methods. Comprehensive experiments demonstrate the effectiveness of our two-stage method using the NCBAM.
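The abstract does not specify the internals of the NCBAM, but it builds on the convolutional block attention module (CBAM) of Woo et al., which applies a channel-attention gate followed by a spatial-attention gate. The sketch below is a minimal NumPy illustration of that CBAM pipeline with an added feature-normalization step standing in for the "N" in NCBAM; the normalization placement, the random MLP weights, and the simplified spatial gate (in place of CBAM's 7×7 convolution) are all assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def channel_attention(x, reduction=4, seed=0):
    """CBAM-style channel gate: squeeze (C,H,W) -> per-channel weights."""
    c = x.shape[0]
    avg = x.mean(axis=(1, 2))                 # (C,) average-pooled descriptor
    mx = x.max(axis=(1, 2))                   # (C,) max-pooled descriptor
    # Shared two-layer MLP; weights are random here purely for illustration.
    rng = np.random.default_rng(seed)
    w1 = rng.standard_normal((c // reduction, c)) * 0.1
    w2 = rng.standard_normal((c, c // reduction)) * 0.1
    mlp = lambda v: w2 @ np.maximum(w1 @ v, 0.0)   # ReLU hidden layer
    gate = 1.0 / (1.0 + np.exp(-(mlp(avg) + mlp(mx))))  # sigmoid, (C,)
    return x * gate[:, None, None]

def spatial_attention(x):
    """CBAM-style spatial gate: pool over channels, gate each location."""
    avg = x.mean(axis=0, keepdims=True)       # (1, H, W)
    mx = x.max(axis=0, keepdims=True)         # (1, H, W)
    # Stand-in for CBAM's 7x7 conv over the pooled maps.
    gate = 1.0 / (1.0 + np.exp(-(avg + mx)))
    return x * gate

def ncbam(x, eps=1e-6):
    """Hypothetical NCBAM: normalize features, then apply CBAM attention."""
    x = (x - x.mean()) / (x.std() + eps)
    return spatial_attention(channel_attention(x))
```

Applied to a feature map of shape (C, H, W), the module returns a re-weighted map of the same shape, which is how such a block can be dropped into the encoder of a depth or pose network without changing tensor dimensions.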
Acknowledgements
This work was partially supported by the Key Technological Innovation Projects of Hubei Province (2018AAA062), the National Natural Science Foundation of China (61972298), and the Wuhan University-Huawei GeoInformatics Innovation Lab.
Author information
Contributions
Yuanzhen Li conceived and designed the study, and collected the data. All authors analyzed the data and were involved in writing the manuscript.
Ethics declarations
The authors have no competing interests to declare that are relevant to the content of this article.
Additional information
Yuanzhen Li is working towards a Ph.D. degree in the School of Computer Science, Wuhan University. Her research interests include image editing and computer vision.
Fei Luo received his B.Sc. degree from the School of Computer Science of Hubei University of Technology in 2003. He received his M.Sc. and Ph.D. degrees from the School of Computer Science of Wuhan University in 2005 and 2008, respectively. He is now an assistant professor at the School of Computer Science, Wuhan University, Wuhan, China. In 2009, he worked as a research assistant at the School of Computer Engineering of Nanyang Technological University, Singapore. From December 2012 to December 2014, he worked as a postdoc at the Human Polymorphism Study Center, Paris, France. His research interests include data mining and computer vision.
Chunxia Xiao received his B.Sc. and M.Sc. degrees from the Mathematics Department of Hunan Normal University in 1999 and 2002, respectively, and his Ph.D. degree from the State Key Lab of CAD&CG of Zhejiang University in 2006. Currently, he is a professor at the School of Computer Science, Wuhan University. From October 2006 to April 2007, he worked as a postdoc in the Department of Computer Science and Engineering, Hong Kong University of Science and Technology, and from February 2012 to February 2013, he visited the University of California Davis for 1 year. His main interests include computer graphics, computer vision, virtual reality, and augmented reality.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.
The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Li, Y., Luo, F. & Xiao, C. Self-supervised coarse-to-fine monocular depth estimation using a lightweight attention module. Comp. Visual Media 8, 631–647 (2022). https://doi.org/10.1007/s41095-022-0279-3