
Unsupervised learning of monocular depth and ego-motion with space–temporal-centroid loss


Abstract

We propose DPCNN (Depth and Pose Convolutional Network), a novel framework for estimating monocular depth with absolute scale, together with camera motion, from videos. DPCNN is trained on our proposed stereo training examples, in which the spatial and temporal images are coupled more closely and thus provide more prior constraints. DPCNN has two distinguishing features. First, a space–temporal-centroid model is established to constrain the rotation matrix and the translation vector independently, so that the spatial and temporal images are jointly anchored to a common, real-world scale. Second, the triangulation principle is used to build a two-channel depth consistency loss, which penalizes disagreement between the depths estimated from the spatial images and from consecutive temporal images, respectively. Experiments on the KITTI dataset show that DPCNN achieves state-of-the-art results on both tasks and outperforms current monocular methods.




Acknowledgements

The authors would like to thank C. Godard and Tinghui Zhou for helpful discussions and sharing the code. The authors also thank the anonymous reviewers for their instructive comments.

Author information

Correspondence to Pengyuan Liu.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Ethical approval

This article does not contain any studies with animals performed by any of the authors.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article

Cite this article

Zhang, J., Su, Q., Liu, P. et al. Unsupervised learning of monocular depth and ego-motion with space–temporal-centroid loss. Int. J. Mach. Learn. & Cyber. 11, 615–627 (2020). https://doi.org/10.1007/s13042-019-01020-6

Keywords

  • Deep learning
  • Depth estimation
  • Visual odometry (VO)
  • 3D feature transformation loss
  • Centroid transformation loss