We propose DPCNN (Depth and Pose Convolutional Network), a novel framework for estimating monocular depth with absolute scale, together with camera motion, from videos. DPCNN is trained on our proposed stereo training examples, in which spatial (left–right) and temporal (consecutive) images are combined more closely, providing additional prior geometric constraints. DPCNN has two key features. First, a space–temporal-centroid model is established to constrain the rotation matrix and the translation vector independently, so that the spatial and temporal images are jointly tied to a common, real-world scale. Second, the triangulation principle is used to build a two-channel depth consistency loss, which penalizes disagreement between the depths estimated from the spatial images and from consecutive temporal images, respectively. Experiments on the KITTI dataset show that DPCNN achieves state-of-the-art results on both tasks and outperforms current monocular methods.
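The two geometric ideas behind the loss can be sketched in a few lines: stereo triangulation recovers metric depth from disparity (Z = f·B/d for a rectified pinhole pair), and a consistency term penalizes disagreement between two depth estimates of the same pixels. The following NumPy sketch is illustrative only, assuming rectified images and a shared metric scale; the function names and the scale-normalized form of the loss are our assumptions, not the paper's implementation.

```python
import numpy as np

def stereo_depth(disparity, focal_px, baseline_m, eps=1e-6):
    """Pinhole triangulation for a rectified stereo pair: Z = f * B / d.

    disparity : H x W array of disparities in pixels
    focal_px  : focal length in pixels
    baseline_m: stereo baseline in meters (gives depth an absolute scale)
    """
    return focal_px * baseline_m / (disparity + eps)

def depth_consistency_loss(depth_spatial, depth_temporal, eps=1e-6):
    """Hypothetical two-channel consistency term.

    `depth_spatial` would come from the stereo (spatial) pair via
    triangulation, `depth_temporal` from consecutive frames. The
    scale-normalized absolute difference bounds each pixel's penalty
    in [0, 1), so the term is robust to large depths.
    """
    diff = np.abs(depth_spatial - depth_temporal)
    norm = depth_spatial + depth_temporal + eps
    return float(np.mean(diff / norm))
```

In a training loop, the consistency term would be added to the usual photometric reconstruction loss; because the stereo branch carries metric scale, penalizing disagreement pushes the temporal branch toward the same real-world scale.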
The authors would like to thank C. Godard and Tinghui Zhou for helpful discussions and for sharing their code. The authors also thank the anonymous reviewers for their instructive comments.
Conflict of interest
The authors declare that they have no conflict of interest.
This article does not contain any studies with animals performed by any of the authors.
Cite this article
Zhang, J., Su, Q., Liu, P. et al. Unsupervised learning of monocular depth and ego-motion with space–temporal-centroid loss. Int. J. Mach. Learn. & Cyber. 11, 615–627 (2020). https://doi.org/10.1007/s13042-019-01020-6
- Deep learning
- Depth estimation
- Visual odometry (VO)
- 3D feature transformation loss
- Centroid transformation loss