
Unsupervised CNN for Single View Depth Estimation: Geometry to the Rescue

  • Ravi Garg
  • Vijay Kumar B.G.
  • Gustavo Carneiro
  • Ian Reid
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9912)

Abstract

A significant weakness of most current deep Convolutional Neural Networks is the need to train them using vast amounts of manually labelled data. In this work we propose an unsupervised framework to learn a deep convolutional neural network for single view depth prediction, without requiring a pre-training stage or annotated ground-truth depths. We achieve this by training the network in a manner analogous to an autoencoder. At training time we consider a pair of images, source and target, with a small, known camera motion between the two, such as a stereo pair. We train the convolutional encoder for the task of predicting the depth map of the source image. To do so, we explicitly generate an inverse warp of the target image using the predicted depth and the known inter-view displacement, to reconstruct the source image; the photometric error in this reconstruction is the reconstruction loss for the encoder. Acquiring this training data is considerably simpler than for equivalent supervised systems, requiring no manual annotation and no calibration of a depth sensor to the camera. We show that our network, trained on less than half of the KITTI dataset, gives performance comparable to that of state-of-the-art supervised methods for single view depth estimation.
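To make the training signal concrete, the sketch below illustrates the kind of photometric reconstruction loss the abstract describes, assuming a rectified stereo pair so that the known inter-view displacement reduces to a horizontal disparity d = fB/z. The function name inverse_warp_loss, the tensor shapes, and the use of PyTorch's differentiable bilinear sampler (grid_sample) are illustrative assumptions for exposition, not the authors' implementation.

    # Illustrative sketch (assumed names and framework), not the paper's code:
    # reconstruct the source (left) image by inverse-warping the target (right)
    # image with the predicted depth, and use the photometric error as the loss.
    import torch
    import torch.nn.functional as F

    def inverse_warp_loss(source, target, depth, focal, baseline):
        """source, target: (B, C, H, W) rectified stereo images.
        depth: (B, 1, H, W) depth predicted for the source image.
        focal, baseline: focal length (pixels) and stereo baseline (metres)."""
        b, _, h, w = source.shape
        # Rectified pair: the inter-view displacement is a horizontal disparity
        # d = f * B / z (in pixels); clamp depth to avoid division by zero.
        disparity = focal * baseline / depth.clamp(min=1e-3)

        # Sampling grid: source pixel (x, y) looks up target pixel (x - d, y).
        ys, xs = torch.meshgrid(
            torch.arange(h, dtype=source.dtype, device=source.device),
            torch.arange(w, dtype=source.dtype, device=source.device),
            indexing="ij",
        )
        xs = xs.expand(b, h, w) - disparity.squeeze(1)
        ys = ys.expand(b, h, w)

        # grid_sample expects coordinates normalised to [-1, 1].
        grid = torch.stack(
            [2.0 * xs / (w - 1) - 1.0, 2.0 * ys / (h - 1) - 1.0], dim=-1
        )
        reconstruction = F.grid_sample(
            target, grid, mode="bilinear",
            padding_mode="border", align_corners=True,
        )
        return F.mse_loss(reconstruction, source)

    # Toy usage with KITTI-like camera values (focal ~720 px, baseline ~0.54 m).
    left = torch.rand(2, 3, 64, 128)
    right = torch.rand(2, 3, 64, 128)
    depth = torch.rand(2, 1, 64, 128) * 50 + 10  # stands in for a network output
    loss = inverse_warp_loss(left, right, depth, focal=720.0, baseline=0.54)

Because the warp is differentiable, the photometric error backpropagates through the sampling grid into the predicted depth, so the encoder can learn depth with no ground-truth labels; this is the autoencoder analogy the abstract draws.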

Keywords

Convolutional Neural Network · Depth Estimation · Stereo Pair · Convolutional Layer · Deep Network

Notes

Acknowledgments

This research was supported by the Australian Research Council through the Centre of Excellence in Robotic Vision, CE140100016, and through Laureate Fellowship FL130100102 to IDR.

Supplementary material

Supplementary material 1 (MP4, 47,351 KB)


Copyright information

© Springer International Publishing AG 2016

Authors and Affiliations

  • Ravi Garg (1)
  • Vijay Kumar B.G. (1)
  • Gustavo Carneiro (1)
  • Ian Reid (1)

  1. The University of Adelaide, Adelaide, Australia
