Abstract
Depth estimation is an important task in robot vision. In this paper, we address depth estimation from a single monocular image, a challenging problem in automated vision systems because a single image carries no additional measurements. To this end, we design a deep hybrid neural network composed of convolutional and recurrent (ReNet) layers, where each ReNet layer is built from Long Short-Term Memory (LSTM) units, which are known for their ability to memorize long-range context. In the proposed network, the ReNet layers enrich the feature representation by directly capturing global context. Integrating ReNet and convolutional layers within a common CNN framework allows us to train the hybrid network end to end. Experimental evaluation on benchmark datasets demonstrates that the hybrid network achieves state-of-the-art results without any post-processing steps. Moreover, the combination of recurrent and convolutional layers provides more satisfying results.
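To make the idea of a ReNet layer concrete, the following is a minimal NumPy sketch of one such layer, not the authors' implementation. It assumes a patch size of one pixel, randomly initialized weights, and a forward pass only; the helper names (`init_lstm`, `lstm_sweep`, `bidirectional`, `init_renet`, `renet_layer`) and the sizes used are hypothetical. Bidirectional LSTMs sweep each row, then each column, so every output location can depend on the whole input feature map.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_lstm(rng, in_dim, hid_dim):
    # Hypothetical random initialization; the four gates share stacked matrices.
    return {
        "W": rng.normal(0.0, 0.1, (in_dim, 4 * hid_dim)),   # input weights
        "U": rng.normal(0.0, 0.1, (hid_dim, 4 * hid_dim)),  # recurrent weights
        "b": np.zeros(4 * hid_dim),
    }

def lstm_sweep(seq, p):
    # seq has shape (T, B, D): T steps, B independent sequences, D features.
    T, B, _ = seq.shape
    hid = p["U"].shape[0]
    h = np.zeros((B, hid))
    c = np.zeros((B, hid))
    out = np.empty((T, B, hid))
    for t in range(T):
        z = seq[t] @ p["W"] + h @ p["U"] + p["b"]
        i, f, o, g = np.split(z, 4, axis=1)       # input, forget, output, candidate
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        out[t] = h
    return out

def bidirectional(seq, p_fwd, p_bwd):
    # Sweep the sequence in both directions and concatenate the hidden states.
    fwd = lstm_sweep(seq, p_fwd)
    bwd = lstm_sweep(seq[::-1], p_bwd)[::-1]
    return np.concatenate([fwd, bwd], axis=-1)

def init_renet(rng, in_dim, hid_dim):
    return {
        "h_fwd": init_lstm(rng, in_dim, hid_dim),
        "h_bwd": init_lstm(rng, in_dim, hid_dim),
        "v_fwd": init_lstm(rng, 2 * hid_dim, hid_dim),
        "v_bwd": init_lstm(rng, 2 * hid_dim, hid_dim),
    }

def renet_layer(x, params):
    # x is a feature map of shape (H, W, C), e.g. the output of a conv layer.
    # Horizontal sweep: each row is a sequence of length W.
    rows = bidirectional(x.transpose(1, 0, 2), params["h_fwd"], params["h_bwd"])
    rows = rows.transpose(1, 0, 2)                # back to (H, W, 2 * hid)
    # Vertical sweep: each column is a sequence of length H.
    return bidirectional(rows, params["v_fwd"], params["v_bwd"])

rng = np.random.default_rng(0)
params = init_renet(rng, in_dim=3, hid_dim=4)
features = rng.normal(size=(5, 6, 3))             # toy 5 x 6 map with 3 channels
print(renet_layer(features, params).shape)
```

In a hybrid network of the kind the abstract describes, a layer like this would receive convolutional feature maps and pass its output on to further layers; how the layers are actually ordered and sized is not specified here.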
Acknowledgments
This work is partially funded by the MOE–Microsoft Key Laboratory of Natural Language Processing and Speech, Harbin Institute of Technology, the Major State Basic Research Development Program of China (973 Program 2015CB351804) and the National Natural Science Foundation of China under Grant Nos. 61572155, 61672188 and 61272386. We would also like to thank NVIDIA Corporation, which kindly provided two sets of GPUs.
Cite this article
Grigorev, A., Jiang, F., Rho, S. et al. Depth estimation from single monocular images using deep hybrid network. Multimed Tools Appl 76, 18585–18604 (2017). https://doi.org/10.1007/s11042-016-4200-x
DOI: https://doi.org/10.1007/s11042-016-4200-x