Abstract
Self-supervised monocular depth estimation has been widely investigated and applied in previous works. However, existing methods suffer from texture-copy artifacts, depth drift, and incomplete structure. It is difficult for standard CNNs to fully capture the relationship between an object and its surrounding environment. Moreover, it is hard to design a depth smoothness loss that balances depth smoothness against sharpness. To address these issues, we propose a coarse-to-fine method with a normalized convolutional block attention module (NCBAM). In the coarse estimation stage, we incorporate the NCBAM into the depth and pose networks to overcome the texture-copy and depth drift problems. In the refinement stage, a second network refines the coarse depth under the guidance of the color image to produce a structure-preserving depth result. Our method produces results competitive with state-of-the-art methods. Comprehensive experiments demonstrate the effectiveness of our two-stage method using the NCBAM.
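The abstract does not specify the internals of the NCBAM, but it builds on the convolutional block attention module (CBAM) of Woo et al., which applies a channel-attention gate followed by a spatial-attention gate. The sketch below is a minimal NumPy illustration of that CBAM pipeline with an added feature-normalization step standing in for the "N" in NCBAM; the normalization placement, the random MLP weights, and the simplified spatial gate (in place of CBAM's 7×7 convolution) are all assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def channel_attention(x, reduction=4, seed=0):
    """CBAM-style channel gate: squeeze (C,H,W) -> per-channel weights."""
    c = x.shape[0]
    avg = x.mean(axis=(1, 2))                 # (C,) average-pooled descriptor
    mx = x.max(axis=(1, 2))                   # (C,) max-pooled descriptor
    # Shared two-layer MLP; weights are random here purely for illustration.
    rng = np.random.default_rng(seed)
    w1 = rng.standard_normal((c // reduction, c)) * 0.1
    w2 = rng.standard_normal((c, c // reduction)) * 0.1
    mlp = lambda v: w2 @ np.maximum(w1 @ v, 0.0)   # ReLU hidden layer
    gate = 1.0 / (1.0 + np.exp(-(mlp(avg) + mlp(mx))))  # sigmoid, (C,)
    return x * gate[:, None, None]

def spatial_attention(x):
    """CBAM-style spatial gate: pool over channels, gate each location."""
    avg = x.mean(axis=0, keepdims=True)       # (1, H, W)
    mx = x.max(axis=0, keepdims=True)         # (1, H, W)
    # Stand-in for CBAM's 7x7 conv over the pooled maps.
    gate = 1.0 / (1.0 + np.exp(-(avg + mx)))
    return x * gate

def ncbam(x, eps=1e-6):
    """Hypothetical NCBAM: normalize features, then apply CBAM attention."""
    x = (x - x.mean()) / (x.std() + eps)
    return spatial_attention(channel_attention(x))
```

Applied to a feature map of shape (C, H, W), the module returns a re-weighted map of the same shape, which is how such a block can be dropped into the encoder of a depth or pose network without changing tensor dimensions.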
Acknowledgements
This work was partially supported by the Key Technological Innovation Projects of Hubei Province (2018AAA062), the National Natural Science Foundation of China (61972298), and the Wuhan University-Huawei GeoInformatics Innovation Lab.
Author information
Contributions
Yuanzhen Li conceived and designed the study, and collected the data. All authors analyzed the data and were involved in writing the manuscript.
Ethics declarations
The authors have no competing interests to declare that are relevant to the content of this article.
Additional information
Yuanzhen Li is working towards a Ph.D. degree in the School of Computer Science, Wuhan University. Her research interests include image editing and computer vision.
Fei Luo received his B.Sc. degree from the School of Computer Science of Hubei University of Technology in 2003. He received his M.Sc. and Ph.D. degrees from the School of Computer Science of Wuhan University in 2005 and 2008, respectively. He is now an assistant professor at the School of Computer Science, Wuhan University, Wuhan, China. In 2009, he worked as a research assistant at the School of Computer Engineering of Nanyang Technological University, Singapore. From December 2012 to December 2014, he worked as a postdoc at the Human Polymorphism Study Center, Paris, France. His research interests include data mining and computer vision.
Chunxia Xiao received his B.Sc. and M.Sc. degrees from the Mathematics Department of Hunan Normal University in 1999 and 2002, respectively, and his Ph.D. degree from the State Key Lab of CAD&CG of Zhejiang University in 2006. Currently, he is a professor at the School of Computer Science, Wuhan University. From October 2006 to April 2007, he worked as a postdoc in the Department of Computer Science and Engineering, Hong Kong University of Science and Technology, and from February 2012 to February 2013, he visited the University of California Davis for 1 year. His main interests include computer graphics, computer vision, virtual reality, and augmented reality.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.
The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Li, Y., Luo, F. & Xiao, C. Self-supervised coarse-to-fine monocular depth estimation using a lightweight attention module. Comp. Visual Media 8, 631–647 (2022). https://doi.org/10.1007/s41095-022-0279-3