Reversing the Cycle: Self-supervised Deep Stereo Through Enhanced Monocular Distillation

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12356)

Abstract

In many fields, self-supervised learning solutions are evolving rapidly and closing the gap with supervised approaches. This is true of depth estimation from either monocular or stereo images, with the latter often providing a valid source of self-supervision for the former. In contrast, to soften typical stereo artefacts, we propose a novel self-supervised paradigm that reverses the link between the two. Specifically, to train deep stereo networks, we distill knowledge through a monocular completion network. This architecture exploits single-image cues and a few sparse points, sourced from traditional stereo algorithms, to estimate dense yet accurate disparity maps by means of a consensus mechanism over multiple estimations. We thoroughly evaluate the impact of different supervisory signals on popular stereo datasets, showing that stereo networks trained with our paradigm outperform existing self-supervised frameworks. Finally, our proposal achieves notable generalization capabilities when dealing with domain shift. Code is available at https://github.com/FilippoAleotti/Reversing.
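
The abstract does not spell out the consensus mechanism. One plausible reading, given here only as a minimal sketch, is to keep a pixel when several disparity estimates agree and to discard it otherwise. The function name `consensus_disparity`, the variance test, and the `max_variance` threshold are all assumptions made for illustration and are not taken from the paper.

```python
from statistics import mean, pvariance


def consensus_disparity(estimates, max_variance=1.0):
    """Fuse several equally sized disparity maps into one proxy label map.

    estimates: list of 2D lists (rows of floats), one per estimation.
    Returns a 2D list holding the per-pixel mean where the estimates agree
    (population variance <= max_variance) and None where they disagree.
    The variance criterion is an illustrative assumption.
    """
    if not estimates:
        raise ValueError("need at least one disparity map")
    height, width = len(estimates[0]), len(estimates[0][0])
    fused = []
    for y in range(height):
        row = []
        for x in range(width):
            values = [est[y][x] for est in estimates]
            # Keep the pixel only if the estimates are mutually consistent.
            agree = pvariance(values) <= max_variance
            row.append(mean(values) if agree else None)
        fused.append(row)
    return fused


# Three toy 2x2 disparity maps; only the bottom-right pixel disagrees.
maps = [
    [[10.0, 20.0], [30.0, 5.0]],
    [[10.5, 20.0], [30.0, 15.0]],
    [[9.5, 20.0], [30.0, 40.0]],
]
proxy = consensus_disparity(maps, max_variance=1.0)
```

In this toy case the three consistent pixels are averaged and the inconsistent one is left unlabeled, so a stereo network supervised with `proxy` would receive no loss signal at that location.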

Keywords

Stereo matching, Self-supervised learning, Distillation

Notes

Acknowledgments.

We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research.

Supplementary material

Supplementary material 1: 504452_1_En_36_MOESM1_ESM.pdf (PDF, 28968 KB)

Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. University of Bologna, Bologna, Italy
  2. China Agricultural University, Beijing, China
