Open-World Stereo Video Matching with Deep RNN

  • Yiran Zhong
  • Hongdong Li
  • Yuchao Dai
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11206)

Abstract

Deep learning based stereo matching methods have shown great success and achieved top scores across different benchmarks. However, like most data-driven methods, existing deep stereo matching networks suffer from well-known drawbacks: they require large amounts of labeled training data, and their performance is fundamentally limited by their generalization ability. In this paper, we propose a novel Recurrent Neural Network (RNN) that takes a continuous (possibly previously unseen) stereo video as input and directly predicts a depth-map at each frame, without a pre-training process and without ground-truth depth-maps as supervision. Thanks to its recurrent nature (provided by two convolutional-LSTM blocks), our network is able to memorize and learn from its past experiences, modifying its inner parameters (network weights) to adapt to previously unseen or unfamiliar environments. This suggests a remarkable generalization ability of the network, making it applicable in an open-world setting. Our method works robustly under changes in scene content, image statistics, lighting, and seasonal conditions. Through extensive experiments, we demonstrate that the proposed method adapts seamlessly between different scenarios. Equally important, in terms of stereo matching accuracy, it outperforms state-of-the-art deep stereo approaches on standard benchmark datasets such as KITTI and Middlebury stereo.
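
To make the recurrent, self-adapting idea concrete, the following is a minimal PyTorch-style sketch, not the authors' actual architecture: a small stereo network whose memory lives in two ConvLSTM blocks and whose weights are updated online on each incoming stereo frame using a self-supervised photometric (left-right warping) loss, so no ground-truth depth is required. All module and function names here (ConvLSTMCell, StereoRNN, warp_right_to_left, adapt_on_video) are illustrative assumptions.

```python
# Illustrative sketch only: two ConvLSTM blocks carry memory across frames,
# and a photometric left-right warping loss drives per-frame weight updates.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConvLSTMCell(nn.Module):
    """Convolutional LSTM cell in the style of Shi et al. (2015)."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.hid_ch = hid_ch
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, (h, c)


class StereoRNN(nn.Module):
    """Toy recurrent stereo net: encode both views, fuse, remember via ConvLSTMs."""
    def __init__(self, feat=32):
        super().__init__()
        self.feat = feat
        self.encoder = nn.Sequential(nn.Conv2d(3, feat, 3, padding=1), nn.ReLU())
        self.lstm1 = ConvLSTMCell(2 * feat, feat)
        self.lstm2 = ConvLSTMCell(feat, feat)
        self.head = nn.Conv2d(feat, 1, 3, padding=1)  # per-pixel disparity

    def init_state(self, b, h, w, device):
        z = lambda: torch.zeros(b, self.feat, h, w, device=device)
        return [(z(), z()), (z(), z())]

    def forward(self, left, right, states):
        x = torch.cat([self.encoder(left), self.encoder(right)], dim=1)
        h1, states[0] = self.lstm1(x, states[0])
        h2, states[1] = self.lstm2(h1, states[1])
        return F.relu(self.head(h2)), states


def warp_right_to_left(right, disp):
    """Sample the right image at x - d to synthesize the left view."""
    b, _, h, w = right.shape
    xs = torch.linspace(-1, 1, w, device=right.device).view(1, 1, w).expand(b, h, w)
    ys = torch.linspace(-1, 1, h, device=right.device).view(1, h, 1).expand(b, h, w)
    xs = xs - 2.0 * disp.squeeze(1) / max(w - 1, 1)
    return F.grid_sample(right, torch.stack([xs, ys], dim=-1), align_corners=True)


def adapt_on_video(model, frames, lr=1e-4):
    """Per-frame online adaptation: predict, measure photometric error, update."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    states, disps = None, []
    for left, right in frames:  # each a (1, 3, H, W) tensor pair
        if states is None:
            states = model.init_state(left.size(0), left.size(2), left.size(3), left.device)
        disp, states = model(left, right, states)
        loss = (left - warp_right_to_left(right, disp)).abs().mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
        states = [(h.detach(), c.detach()) for h, c in states]  # truncate BPTT between frames
        disps.append(disp.detach())
    return disps
```

The key design point the sketch tries to convey is that adaptation happens at inference time: every new stereo pair both receives a disparity prediction and contributes a gradient step, so the network keeps adjusting to unfamiliar scenes without any labeled data.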

Keywords

Stereo video matching · Open world · Recurrent neural network · Convolutional LSTM

Notes

Acknowledgements

Y. Zhong’s PhD scholarship is funded by CSIRO Data61. H. Li’s work is funded in part by the Australian ARC Centre of Excellence for Robotic Vision (CE140100016). Y. Dai is supported in part by the National 1000 Young Talents Plan of China, the Natural Science Foundation of China (61420106007, 61671387), and ARC grant DE140100180. The authors are very grateful to NVIDIA for its generous gift of GPUs to ANU used in this research.

Supplementary material

Supplementary material 1: 474176_1_En_7_MOESM1_ESM.pdf (PDF, 154 KB)

Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. Australian National University, Canberra, Australia
  2. Northwestern Polytechnical University, Xi’an, China
  3. Australian Centre for Robotic Vision, Canberra, Australia
  4. Data61, CSIRO, Canberra, Australia