Depth estimation from single monocular images using deep hybrid network

Grigorev, Aleksei; Jiang, Feng; Rho, Seungmin; Sori, Worku J.; Liu, Shaohui; Sai, Sergey

doi:10.1007/s11042-016-4200-x

Published: 20 December 2016

Volume 76, pages 18585–18604, (2017)

Multimedia Tools and Applications

Aleksei Grigorev¹˒², Feng Jiang¹, Seungmin Rho³, Worku J. Sori¹, Shaohui Liu¹ & Sergey Sai²

1249 Accesses
13 Citations

Abstract

Depth estimation is a significant task in robot vision. In this paper, we address depth estimation from a single monocular image, a challenging problem in automated vision systems because a single image carries no additional measurements. To this end, we design a deep hybrid neural network composed of convolutional and recurrent (ReNet) layers, where each ReNet layer is built from Long Short-Term Memory (LSTM) units, which are known for their ability to memorize long-range context. In the proposed network, the ReNet layers enrich the feature representation by directly capturing global context. The effective integration of ReNet and convolutional layers in a common CNN framework allows us to train the hybrid network end to end. Experimental evaluation on benchmark datasets demonstrates that the hybrid network achieves state-of-the-art results without any post-processing steps. Moreover, the composition of recurrent and convolutional layers provides more satisfying results.
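
To make the idea of capturing global context with recurrent sweeps concrete, the following is a minimal sketch of a ReNet-style layer in plain Python. It is not the authors' network: the abstract does not give the architecture, so the class `TinyLSTM` (an LSTM cell reduced to a scalar hidden state), the function names, the random weights, and the 4×5 toy feature map are all assumptions made for illustration. The layer sweeps each row in both directions, then sweeps each column of the result in both directions, so every output position can depend on every input position.

```python
import copy
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class TinyLSTM:
    """Hypothetical LSTM cell with a scalar hidden state (illustration only)."""

    def __init__(self, input_size, rng):
        # One weight row per gate (input, forget, output, candidate).
        # Each row holds the input weights, one recurrent weight, and a bias.
        self.w = [[rng.uniform(-0.5, 0.5) for _ in range(input_size + 2)]
                  for _ in range(4)]

    def run(self, seq):
        """Process a sequence of feature vectors; return one hidden value per step."""
        h, c, out = 0.0, 0.0, []
        for x in seq:
            z = list(x) + [h, 1.0]  # input, previous hidden state, bias term
            pre = [sum(wi * zi for wi, zi in zip(row, z)) for row in self.w]
            i, f, o = sigmoid(pre[0]), sigmoid(pre[1]), sigmoid(pre[2])
            g = math.tanh(pre[3])
            c = f * c + i * g       # cell state carries long-range memory
            h = o * math.tanh(c)
            out.append(h)
        return out


def bidirectional(seq, fwd, bwd):
    """Concatenate forward and backward hidden states at every position."""
    f = fwd.run(seq)
    b = bwd.run(seq[::-1])[::-1]
    return [[fi, bi] for fi, bi in zip(f, b)]


def renet_layer(fmap, cells):
    """fmap[r][c] is a feature vector; cells = (left-right, right-left, top-down, bottom-up)."""
    lr, rl, td, bu = cells
    # Horizontal sweep: every row is treated as one sequence.
    rows = [bidirectional(row, lr, rl) for row in fmap]
    # Vertical sweep over the horizontal result: every column is one sequence.
    cols = [bidirectional([rows[r][c] for r in range(len(rows))], td, bu)
            for c in range(len(rows[0]))]
    # Transpose back to [row][col] layout.
    return [[cols[c][r] for c in range(len(cols))] for r in range(len(rows))]


rng = random.Random(0)
channels = 3  # assumed number of input channels, e.g. from a preceding conv layer
fmap = [[[rng.random() for _ in range(channels)] for _ in range(5)]
        for _ in range(4)]
cells = (TinyLSTM(channels, rng), TinyLSTM(channels, rng),
         TinyLSTM(2, rng), TinyLSTM(2, rng))
out = renet_layer(fmap, cells)  # 4 x 5 grid of 2-element feature vectors

# Perturb the top-left input and recompute, to compare the bottom-right output.
perturbed = copy.deepcopy(fmap)
perturbed[0][0] = [v + 1.0 for v in perturbed[0][0]]
out_perturbed = renet_layer(perturbed, cells)
corner_changed = out_perturbed[3][4] != out[3][4]
```

In a hybrid network of the kind the abstract describes, a layer like this would sit between convolutional layers; a single convolution with a small kernel sees only a local neighbourhood, whereas the two sweeps give each output a receptive field covering the whole feature map.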

Acknowledgments

This work is partially funded by the MOE–Microsoft Key Laboratory of Natural Language Processing and Speech, Harbin Institute of Technology, the Major State Basic Research Development Program of China (973 Program 2015CB351804) and the National Natural Science Foundation of China under Grant Nos. 61572155, 61672188 and 61272386. We would also like to acknowledge NVIDIA Corporation, which kindly provided two sets of GPUs.

Author information

Authors and Affiliations

Department of Computer Science and Technology, Harbin Institute of Technology, Harbin, 150001, China
Aleksei Grigorev, Feng Jiang, Worku J. Sori & Shaohui Liu
Department of Computer Engineering, Pacific National University, Khabarovsk, 680035, Russia
Aleksei Grigorev & Sergey Sai
Department of Media Software, Sungkyul University, Anyang, South Korea
Seungmin Rho

Corresponding author

Correspondence to Feng Jiang.

About this article

Cite this article

Grigorev, A., Jiang, F., Rho, S. et al. Depth estimation from single monocular images using deep hybrid network. Multimed Tools Appl 76, 18585–18604 (2017). https://doi.org/10.1007/s11042-016-4200-x

Received: 29 August 2016
Revised: 15 November 2016
Accepted: 21 November 2016
Published: 20 December 2016
Issue Date: September 2017
DOI: https://doi.org/10.1007/s11042-016-4200-x
