Abstract
In many fields, self-supervised learning solutions are rapidly evolving and filling the gap with supervised approaches. This fact occurs for depth estimation based on either monocular or stereo, with the latter often providing a valid source of self-supervision for the former. In contrast, to soften typical stereo artefacts, we propose a novel self-supervised paradigm reversing the link between the two. Purposely, in order to train deep stereo networks, we distill knowledge through a monocular completion network. This architecture exploits single-image clues and few sparse points, sourced by traditional stereo algorithms, to estimate dense yet accurate disparity maps by means of a consensus mechanism over multiple estimations. We thoroughly evaluate with popular stereo datasets the impact of different supervisory signals showing how stereo networks trained with our paradigm outperform existing self-supervised frameworks. Finally, our proposal achieves notable generalization capabilities dealing with domain shift issues. Code available at https://github.com/FilippoAleotti/Reversing.
F. Aleotti and F. Tosi—Joint first authorship
L. Zhang—Work done while at University of Bologna.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
References
Chang, J.R., Chen, Y.S.: Pyramid stereo matching network. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE (2018)
Chen, Y., Yang, B., Liang, M., Urtasun, R.: Learning joint 2D–3D representations for depth completion. In: IEEE International Conference on Computer Vision (ICCV), pp. 10023–10032. IEEE (2019)
Chen, Z., Sun, X., Wang, L., Yu, Y., Huang, C.: A deep visual correspondence embedding model for stereo matching costs. In: The IEEE International Conference on Computer Vision (ICCV). IEEE (2015)
Cheng, X., Wang, P., Yang, R.: Depth estimation via affinity learned with convolutional spatial propagation network. In: European Conference on Computer Vision (ECCV), pp. 103–119. Springer, Heidlelberg (2018)
Dovesi, P.L., et al.: Real-time semantic stereo matching. In: IEEE International Conference on Robotics and Automation (ICRA). IEEE (2020)
Eigen, D., Puhrsch, C., Fergus, R.: Depth map prediction from a single image using a multi-scale deep network. In: Advances in Neural Information Processing Systems, pp. 2366–2374. MIT Press (2014)
Eldesokey, A., Felsberg, M., Khan, F.S.: Propagating confidences through cnns for sparse data regression. arXiv preprint arXiv:1805.11913 (2018)
Gidaris, S., Komodakis, N.: Detect, replace, refine: deep structured prediction for pixel wise labeling. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE (2017)
Godard, C., Mac Aodha, O., Brostow, G.J.: Unsupervised monocular depth estimation with left-right consistency. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE (2017)
Godard, C., Mac Aodha, O., Brostow, G.J.: Digging into self-supervised monocular depth estimation. In: IEEE International Conference on Computer Vision (ICCV). IEEE (2019)
Guo, X., Yang, K., Yang, W., Wang, X., Li, H.: Group-wise correlation stereo network. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 3273–3282. IEEE (2019)
Hirschmuller, H.: Accurate and efficient stereo processing by semi-global matching and mutual information. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2005. CVPR 2005, vol. 2, pp. 807–814. IEEE (2005)
Hirschmuller, H.: Stereo processing by semiglobal matching and mutual information. IEEE TPAMI 30(2), 328–341 (2008)
Huang, Z., Fan, J., Cheng, S., Yi, S., Wang, X., Li, H.: Hms-net: hierarchicalmulti-scale sparsity-invariant network for sparse depth completion. IEEE Trans. Image Process. 29, 3429–3441 (2019)
Ilg, E., Saikia, T., Keuper, M., Brox, T.: Occlusions, motion and depth boundaries with a generic network for disparity, optical flow or scene flow estimation. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11216, pp. 626–643. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01258-8_38
Joung, S., Kim, S., Park, K., Sohn, K.: Unsupervised stereo matching usingconfidential correspondence consistency. IEEE Trans. Intell. Transp. Syst. 21, 2190–2203 (2019)
Kendall, A., et al.: End-to-end learning of geometry and context for deep stereo regression. In: The IEEE International Conference on Computer Vision (ICCV). IEEE (2017)
Kingma, D., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
Ku, J., Harakeh, A., Waslander, S.L.: In defense of classical image processing: fast depth completion on the cpu. In: 2018 15th Conference on Computer and Robot Vision (CRV), pp. 16–22. IEEE (2018)
Lai, H.Y., Tsai, Y.H., Chiu, W.C.: Bridging stereo matching and optical flow via spatiotemporal correspondence. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE (2019)
Laina, I., Rupprecht, C., Belagiannis, V., Tombari, F., Navab, N.: Deeper depth prediction with fully convolutional residual networks. In: 3DV. IEEE (2016)
Li, A., Yuan, Z.: Occlusion aware stereo matching via cooperative unsupervised learning. In: Jawahar, C.V., Li, H., Mori, G., Schindler, K. (eds.) ACCV 2018. LNCS, vol. 11366, pp. 197–213. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-20876-9_13
Liang, Z., et al.: Learning for disparity estimation through feature constancy. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE (2018)
Liu, F., Shen, C., Lin, G., Reid, I.: Learning depth from single monocular images using deep convolutional neural fields. IEEE Trans Pattern Anal. Mach. Intell. 38(10), 2024–2039 (2016)
Liu, L.K., Chan, S.H., Nguyen, T.Q.: Depth reconstruction from sparse samples: representation, algorithm, and sampling. IEEE Trans. Image Process. 24(6), 1983–1996 (2015)
Luo, W., Schwing, A.G., Urtasun, R.: Efficient deep learning for stereo matching. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 5695–5703. IEEE (2016)
Ma, F., Cavalheiro, G.V., Karaman, S.: Self-supervised sparse-to-dense: self-supervised depth completion from lidar and monocular camera. In: 2019 International Conference on Robotics and Automation (ICRA), pp. 3288–3295. IEEE (2019)
Mayer, N., et al.: A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE (2016)
Menze, M., Geiger, A.: Object scene flow for autonomous vehicles. In: Conference on Computer Vision and Pattern Recognition (CVPR). IEEE (2015)
Pang, J., Sun, W., Ren, J.S., Yang, C., Yan, Q.: Cascade residual learning: a two-stage convolutional neural network for stereo matching. In: The IEEE International Conference on Computer Vision (ICCV) Workshops. IEEE (2017)
Paszke, A., et al.: Pytorch: an imperative style, high-performance deep learning library. In: Advances in Neural Information Processing Systems, pp. 8024–8035. MIT Press (2019)
Poggi, M., Tosi, F., Mattoccia, S.: Learning monocular depth estimation with unsupervised trinocular assumptions. In: 6th International Conference on 3D Vision (3DV). IEEE (2018)
Scharstein, D., et al.: High-resolution stereo datasets with subpixel-accurate ground truth. In: Jiang, X., Hornegger, J., Koch, R. (eds.) GCPR 2014. LNCS, vol. 8753, pp. 31–42. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-11752-2_3
Scharstein, D., Szeliski, R.: A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. Int. J. Comput. Vis. 47(1–3), 7–42 (2002)
Schops, T., et al.: A multi-view stereo benchmark with high-resolution images and multi-camera videos. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 3260–3269. IEEE (2017)
Seki, A., Pollefeys, M.: Patch based confidence prediction for dense disparity map. In: BMVC. BMVA (2016)
Shaked, A., Wolf, L.: Improved stereo matching with constant highway networks and reflective confidence learning. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE (2017)
Smolyanskiy, N., Kamenev, A., Birchfield, S.: On the importance of stereo for accurate depth estimation: an efficient semi-supervised deep neural network approach. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops. IEEE (2018)
Song, X., Zhao, X., Fang, L., Hu, H., Yu, Y.: Edgestereo: an effective multi-task learning network for stereo matching and edge detection. Int. J. Comput. Vis. 128, 1–21 (2020)
Song, X., Zhao, X., Hu, H., Fang, L.: EdgeStereo: a context integrated residual pyramid network for stereo matching. In: Jawahar, C.V., Li, H., Mori, G., Schindler, K. (eds.) ACCV 2018. LNCS, vol. 11365, pp. 20–35. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-20873-8_2
Tonioni, A., Poggi, M., Mattoccia, S., Di Stefano, L.: Unsupervised adaptation for deep stereo. In: The IEEE International Conference on Computer Vision (ICCV). IEEE (2017)
Tonioni, A., Poggi, M., Mattoccia, S., Di Stefano, L.: Unsupervised domain adaptation for depth prediction from images. IEEE Trans. Pattern Anal. Mach. Intell. 42, 2396–2409 (2019)
Tonioni, A., Rahnama, O., Joy, T., Di Stefano, L., Thalaiyasingam, A., Torr, P.: Learning to adapt for stereo. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE (2019)
Tonioni, A., Tosi, F., Poggi, M., Mattoccia, S., Stefano, L.D.: Real-time self-adaptive deep stereo. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE (2019)
Tosi, F., Aleotti, F., Poggi, M., Mattoccia, S.: Learning monocular depth estimation infusing traditional stereo knowledge. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE (2019)
Tosi, F., Poggi, M., Tonioni, A., Di Stefano, L., Mattoccia, S.: Learning confidence measures in the wild. In: BMVC. BMVA (2017)
Tulyakov, S., Ivanov, A., Fleuret, F.: Weakly supervised learning of deep metrics for stereo reconstruction. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 1339–1348. IEEE (2017)
Uhrig, J., Schneider, N., Schneider, L., Franke, U., Brox, T., Geiger, A.: Sparsity invariant CNNs. In: International Conference on 3D Vision (3DV). IEEE (2017)
Wang, Y., Wang, P., Yang, Z., Luo, C., Yang, Y., Xu, W.: Unos: unified unsupervised optical-flow and stereo-depth estimation by watching videos. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 8071–8081. IEEE (2019)
Watson, J., Firman, M., Brostow, G.J., Turmukhambetov, D.: Self-supervised monocular depth hints. In: IEEE International Conference on Computer Vision (ICCV). IEEE (2019)
Watson, J., Mac Aodha, O., Turmukhambetov, D., Brostow, G.J., Firman, M.: Learning stereo from single images. In: European Conference on Computer Vision (ECCV). Springer, Heidelberg (2020)
Yang, G., Song, X., Huang, C., Deng, Z., Shi, J., Zhou, B.: Drivingstereo: a large-scale dataset for stereo matching in autonomous driving scenarios. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE (2019)
Yang, G., Zhao, H., Shi, J., Deng, Z., Jia, J.: SegStereo: exploiting semantic information for disparity estimation. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11211, pp. 660–676. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01234-2_39
Yang, Q., Yang, R., Davis, J., Nistér, D.: Spatial-depth super resolution for range images. In: 2007 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8. IEEE (2007)
Yu, L., Wang, Y., Wu, Y., Jia, Y.: Deep stereo matching with explicit cost aggregation sub-architecture. In: Thirty-Second AAAI Conference on Artificial Intelligence. AAAI Press (2018)
Zabih, R., Woodfill, J.: Non-parametric local transforms for computing visual correspondence. In: Eklundh, J.-O. (ed.) ECCV 1994. LNCS, vol. 801, pp. 151–158. Springer, Heidelberg (1994). https://doi.org/10.1007/BFb0028345
Zbontar, J., LeCun, Y.: Stereo matching by training a convolutional neural network to compare image patches. J. Mach. Learn. Res. 17(1–32), 2 (2016)
Zhang, F., Prisacariu, V., Yang, R., Torr, P.H.: Ga-net: guided aggregation net for end-to-end stereo matching. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 185–194. IEEE (2019)
Zhong, Y., Li, H., Dai, Y.: Self-supervised learning for stereo matching with self-improving ability. arXiv preprint arXiv:1709.00930 (2017)
Zhong, Y., Li, H., Dai, Y.: Open-world stereo video matching with deep RNN. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11206, pp. 104–119. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01216-8_7
Zhou, C., Zhang, H., Shen, X., Jia, J.: Unsupervised learning of stereo matching. In: The IEEE International Conference on Computer Vision (ICCV). IEEE (2017)
Zhou, T., Brown, M., Snavely, N., Lowe, D.G.: Unsupervised learning of depth and ego-motion from video. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE (2017)
Acknowledgments.
We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research.
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
1 Electronic supplementary material
Below is the link to the electronic supplementary material.
Rights and permissions
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Aleotti, F., Tosi, F., Zhang, L., Poggi, M., Mattoccia, S. (2020). Reversing the Cycle: Self-supervised Deep Stereo Through Enhanced Monocular Distillation. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, JM. (eds) Computer Vision – ECCV 2020. ECCV 2020. Lecture Notes in Computer Science(), vol 12356. Springer, Cham. https://doi.org/10.1007/978-3-030-58621-8_36
Download citation
DOI: https://doi.org/10.1007/978-3-030-58621-8_36
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-58620-1
Online ISBN: 978-3-030-58621-8
eBook Packages: Computer ScienceComputer Science (R0)