Multimedia Tools and Applications, Volume 76, Issue 18, pp 18585–18604

Depth estimation from single monocular images using deep hybrid network

  • Aleksei Grigorev
  • Feng Jiang
  • Seungmin Rho
  • Worku J. Sori
  • Shaohui Liu
  • Sergey Sai


Depth estimation is an important task in robot vision. In this paper, we address depth estimation from a single monocular image, a challenging problem in automated vision systems because a single image carries no additional measurements. To this end, we design a deep hybrid neural network composed of convolutional and recurrent (ReNet) layers, where each ReNet layer is built from Long Short-Term Memory (LSTM) units, which are known for their ability to memorize long-range context. In the proposed network, the ReNet layers enrich the feature representation by directly capturing global context. Integrating ReNet and convolutional layers within a common CNN framework allows us to train the hybrid network end to end. Experimental evaluation on benchmark datasets demonstrates that the hybrid network achieves state-of-the-art results without any post-processing steps. Moreover, the composition of recurrent and convolutional layers provides more satisfying results.
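The abstract gives no layer sizes or exact wiring, so the following is only a rough sketch of how a ReNet-style layer can capture global context: bidirectional LSTMs sweep each column of a feature map, then each row of the result, so every output position depends on the whole input. All names (`LSTMCell`, `bidirectional_sweep`, `renet_layer`), the random weight initialisation, and the hidden size are assumptions made for illustration and are not taken from the paper.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class LSTMCell:
    """Plain LSTM cell with random weights (illustration only, untrained)."""

    def __init__(self, n_in, n_hid, rng):
        self.n_hid = n_hid
        # One weight row per gate unit: input, forget, output, candidate.
        self.W = [[rng.uniform(-0.1, 0.1) for _ in range(n_in + n_hid)]
                  for _ in range(4 * n_hid)]
        self.b = [0.0] * (4 * n_hid)

    def step(self, x, h, c):
        xh = x + h  # concatenate input and previous hidden state
        z = [sum(w * v for w, v in zip(row, xh)) + b
             for row, b in zip(self.W, self.b)]
        n = self.n_hid
        i = [sigmoid(v) for v in z[:n]]
        f = [sigmoid(v) for v in z[n:2 * n]]
        o = [sigmoid(v) for v in z[2 * n:3 * n]]
        g = [math.tanh(v) for v in z[3 * n:]]
        c_new = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
        h_new = [oj * math.tanh(cj) for oj, cj in zip(o, c_new)]
        return h_new, c_new

    def run(self, seq):
        """Return the hidden state at every position of seq."""
        h, c = [0.0] * self.n_hid, [0.0] * self.n_hid
        out = []
        for x in seq:
            h, c = self.step(x, h, c)
            out.append(h)
        return out


def bidirectional_sweep(seq, fwd, bwd):
    """Concatenate forward and backward hidden states at each position."""
    f = fwd.run(seq)
    b = bwd.run(seq[::-1])[::-1]
    return [fj + bj for fj, bj in zip(f, b)]


def transpose(grid):
    return [list(col) for col in zip(*grid)]


def renet_layer(fmap, n_hid, seed=0):
    """fmap: H x W grid of feature vectors. Returns H x W grid of 2*n_hid vectors."""
    rng = random.Random(seed)
    n_in = len(fmap[0][0])
    v_f, v_b = LSTMCell(n_in, n_hid, rng), LSTMCell(n_in, n_hid, rng)
    h_f, h_b = LSTMCell(2 * n_hid, n_hid, rng), LSTMCell(2 * n_hid, n_hid, rng)
    # Vertical sweep: along each column, top-to-bottom and bottom-to-top.
    cols = [bidirectional_sweep(col, v_f, v_b) for col in transpose(fmap)]
    inter = transpose(cols)
    # Horizontal sweep over the intermediate map, in both directions.
    return [bidirectional_sweep(row, h_f, h_b) for row in inter]


# Toy 3 x 4 feature map with 2 channels, standing in for convolutional features.
fmap = [[[float(r + c), 1.0] for c in range(4)] for r in range(3)]
out = renet_layer(fmap, n_hid=3)
```

In a hybrid network of the kind described, such a layer would sit on top of convolutional feature maps and be trained jointly with them; here the weights are random, so the sketch shows only the data flow, not a working depth predictor.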


CNN, LSTM, Depth estimation, Monocular image, RNN



This work is partially funded by the MOE–Microsoft Key Laboratory of Natural Language Processing and Speech, Harbin Institute of Technology, the Major State Basic Research Development Program of China (973 Program 2015CB351804), and the National Natural Science Foundation of China under Grant Nos. 61572155, 61672188 and 61272386. We would also like to thank NVIDIA Corporation, which kindly provided two sets of GPUs.



Copyright information

© Springer Science+Business Media New York 2016

Authors and Affiliations

  • Aleksei Grigorev (1, 2)
  • Feng Jiang (1)
  • Seungmin Rho (3)
  • Worku J. Sori (1)
  • Shaohui Liu (1)
  • Sergey Sai (2)

  1. Department of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
  2. Department of Computer Engineering, Pacific National University, Khabarovsk, Russia
  3. Department of Media Software, Sungkyul University, Anyang, South Korea
