Learning Stereo from Single Images

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12346)


Supervised deep networks are among the best methods for finding correspondences in stereo image pairs. Like all supervised approaches, these networks require ground truth data during training. However, collecting large quantities of accurate dense correspondence data is very challenging. We propose that it is unnecessary to have such a high reliance on ground truth depths or even corresponding stereo pairs. Inspired by recent progress in monocular depth estimation, we generate plausible disparity maps from single images. In turn, we use those flawed disparity maps in a carefully designed pipeline to generate stereo training pairs. Training in this manner makes it possible to convert any collection of single RGB images into stereo training data. This results in a significant reduction in human effort, with no need to collect real depths or to hand-design synthetic data. We can consequently train a stereo matching network from scratch on datasets like COCO, which were previously hard to exploit for stereo. Through extensive experiments we show that our approach outperforms stereo networks trained with standard synthetic datasets, when evaluated on KITTI, ETH3D, and Middlebury. Code to reproduce our results is available at
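The core idea in the abstract, turning a single image plus an estimated disparity map into a stereo pair, can be illustrated with a small forward-warping sketch. This is not the authors' pipeline, which the abstract only describes as "carefully designed"; it is a minimal, dependency-free illustration under stated assumptions. The function names `scale_to_disparity` and `forward_warp`, the list-of-rows image layout, and the `hole` fill value are all hypothetical choices made for this example.

```python
import math

def scale_to_disparity(inv_depth, d_min, d_max):
    """Linearly rescale a relative inverse-depth map (list of rows) to [d_min, d_max] pixels."""
    flat = [v for row in inv_depth for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:  # flat scene: every pixel gets the minimum disparity
        return [[float(d_min)] * len(row) for row in inv_depth]
    return [[d_min + (v - lo) / (hi - lo) * (d_max - d_min) for v in row]
            for row in inv_depth]

def forward_warp(left, disparity, hole=None):
    """Synthesize a right view: the left pixel at column x lands at column x - d.

    Returns (right, holes); holes[y][x] is True where no left pixel landed.
    """
    right, holes = [], []
    for row, disp_row in zip(left, disparity):
        w = len(row)
        out = [hole] * w
        zbuf = [-math.inf] * w  # largest disparity written to each target so far
        for x, (pixel, d) in enumerate(zip(row, disp_row)):
            xt = math.floor(x - d + 0.5)  # nearest target column
            # Near pixels (large disparity) occlude far ones at the same target.
            if 0 <= xt < w and d > zbuf[xt]:
                zbuf[xt] = d
                out[xt] = pixel
        right.append(out)
        holes.append([z == -math.inf for z in zbuf])
    return right, holes

# Toy example (assumed data): one image row with a two-pixel foreground object.
left = [[1, 2, 3, 4, 5]]
disparity = [[0.0, 0.0, 2.0, 2.0, 0.0]]
right, holes = forward_warp(left, disparity, hole=0)
print(right)
print(holes)
```

In this toy row the two foreground pixels shift left by two columns and overwrite the background, leaving two disoccluded holes behind them. A practical pipeline would also need to fill such holes and cope with errors in the estimated disparity, which this sketch does not attempt.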


Stereo matching, Correspondence training data



Large thanks to Aron Monszpart for help with baseline code, to Grace Tsai for feedback, and to Sara Vicente for suggestions for baseline implementations. Thanks also to the authors who shared models and images.




Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. Niantic, San Francisco, USA
  2. University of Edinburgh, Edinburgh, UK
  3. UCL, London, UK
