GeoRefine: Self-supervised Online Depth Refinement for Accurate Dense Mapping

Ji, Pan; Yan, Qingan; Ma, Yuxin; Xu, Yi

doi:10.1007/978-3-031-19769-7_21

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13661)

Included in the following conference series:

European Conference on Computer Vision

Abstract

We present GeoRefine, a robust and accurate depth refinement system for geometrically consistent dense mapping from monocular sequences. GeoRefine consists of three modules: a hybrid SLAM module using learning-based priors, an online depth refinement module leveraging self-supervision, and a global mapping module via TSDF fusion. The proposed system is online by design and achieves robustness and accuracy via: (i) a robustified hybrid SLAM that incorporates learning-based optical flow and/or depth; (ii) self-supervised losses that leverage SLAM outputs and enforce long-term geometric consistency; and (iii) careful system design that avoids degenerate cases in online depth refinement. We extensively evaluate GeoRefine on multiple public datasets and reach absolute relative depth errors as low as 5%.
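
For readers unfamiliar with the metric quoted above, the absolute relative depth error is commonly defined as the mean of |predicted − ground truth| / ground truth over pixels with valid ground truth. The sketch below illustrates that common definition only; the function name `abs_rel_error`, the validity rule (ground truth > 0), and the toy depth values are assumptions made for illustration, and the paper's actual evaluation protocol (depth caps, scale alignment, masking) is not specified on this page.

```python
from typing import Sequence


def abs_rel_error(pred: Sequence[float], gt: Sequence[float]) -> float:
    """Mean of |pred - gt| / gt over entries whose ground truth is positive.

    Hypothetical helper illustrating the common AbsRel definition; it is
    not taken from the GeoRefine code base.
    """
    if len(pred) != len(gt):
        raise ValueError("pred and gt must have the same length")
    # Skip entries without a valid (positive) ground-truth depth.
    ratios = [abs(p - g) / g for p, g in zip(pred, gt) if g > 0]
    if not ratios:
        raise ValueError("no valid ground-truth depths")
    return sum(ratios) / len(ratios)


# Toy depths in metres, invented for illustration only.
predicted = [1.05, 1.90, 3.15, 4.00]
ground_truth = [1.00, 2.00, 3.00, 0.0]  # 0.0 marks a missing measurement

error = abs_rel_error(predicted, ground_truth)
print(f"AbsRel = {error:.3f}")
```

In this toy case each valid pixel is off by 5% of its true depth, so the metric comes out at roughly 0.05; the fourth pixel is ignored because it has no ground truth.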

P. Ji and Q. Yan—Joint first authorship.

Author information

Authors and Affiliations

OPPO US Research Center, InnoPeak Technology, Inc., Palo Alto, USA
Pan Ji, Qingan Yan, Yuxin Ma & Yi Xu

Corresponding author

Correspondence to Pan Ji.

Editor information

Editors and Affiliations

Tel Aviv University, Tel Aviv, Israel
Shai Avidan
University College London, London, UK
Gabriel Brostow
Google AI, Accra, Ghana
Moustapha Cissé
University of Catania, Catania, Italy
Giovanni Maria Farinella
Facebook (United States), Menlo Park, CA, USA
Tal Hassner

About this paper

Cite this paper

Ji, P., Yan, Q., Ma, Y., Xu, Y. (2022). GeoRefine: Self-supervised Online Depth Refinement for Accurate Dense Mapping. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13661. Springer, Cham. https://doi.org/10.1007/978-3-031-19769-7_21

DOI: https://doi.org/10.1007/978-3-031-19769-7_21
Published: 23 October 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-19768-0
Online ISBN: 978-3-031-19769-7
eBook Packages: Computer Science, Computer Science (R0)
