
Exploring the Usage of Pre-trained Features for Stereo Matching

Published in: International Journal of Computer Vision

Abstract

For many vision tasks, utilizing pre-trained features improves performance and consistently benefits from the rapid advancement of pre-training technologies. In the field of stereo matching, however, the use of pre-trained features has not been extensively researched. In this paper, we present the first systematic exploration of the utilization of pre-trained features for stereo matching. To flexibly support any combination of pre-trained backbone and stereo matching network, we develop the deformable neck (DN), which decouples the network architectures of these two components. The core idea of DN is to use the deformable attention mechanism to iteratively fuse pre-trained features from shallow to deep layers. Empirically, our exploration reveals the crucial factors that influence the use of pre-trained features for stereo matching. We further investigate the role of instance-level information in pre-trained features, demonstrating that it benefits stereo matching but can be suppressed during convolution-based feature fusion. Built on the attention mechanism, the proposed DN module effectively exploits the instance-level information in pre-trained features. In addition, we analyze the efficiency-accuracy tradeoff, concluding that using pre-trained features can also be a good alternative when efficiency is a consideration.
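The DN idea described above — iteratively fusing pre-trained backbone features from shallow to deep layers, with attention weights computed at a small set of sampled (deformed) locations per query — can be illustrated with a deliberately simplified sketch. This is not the paper's implementation: it uses a single attention head, fixed integer offsets in place of learned fractional offsets with bilinear sampling, and plain numpy; the function names (`deformable_fuse`, `deformable_neck`) and the residual-fusion form are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def deformable_fuse(query, feat, offsets):
    """Fuse one backbone level into the running query map.

    query:   (H, W, C) current fused features (the DN state)
    feat:    (H, W, C) pre-trained features at this level
    offsets: (K, 2) integer sampling offsets per query location
             (a stand-in for the learned fractional offsets of
             deformable attention)
    """
    H, W, C = query.shape
    out = np.empty_like(query)
    for y in range(H):
        for x in range(W):
            # Sample K deformed locations, clamped to the feature map.
            ys = np.clip(y + offsets[:, 0], 0, H - 1)
            xs = np.clip(x + offsets[:, 1], 0, W - 1)
            vals = feat[ys, xs]                    # (K, C) sampled values
            attn = softmax(vals @ query[y, x])     # (K,) attention weights
            out[y, x] = query[y, x] + attn @ vals  # residual fusion
    return out

def deformable_neck(levels, offsets):
    """Iteratively fuse pre-trained features from shallow to deep layers."""
    fused = levels[0]
    for feat in levels[1:]:
        fused = deformable_fuse(fused, feat, offsets)
    return fused

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Three backbone levels, shallow to deep, all resampled to one resolution.
    levels = [rng.standard_normal((4, 4, 8)) for _ in range(3)]
    offsets = np.array([[0, 0], [1, 0], [0, 1], [-1, -1]])
    fused = deformable_neck(levels, offsets)
    print(fused.shape)  # (4, 4, 8)
```

Because attention attends to only K sampled points rather than the full feature map, this style of fusion can preserve instance-level cues from the pre-trained features that a fixed convolutional fusion would average away — which is the motivation the abstract gives for building DN on attention rather than convolution.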


Data Availability

The datasets used for evaluation during the current study are all publicly available. SceneFlow is available at https://lmb.informatik.uni-freiburg.de/resources/datasets/SceneFlowDatasets.en.html. KITTI 2012 is available at https://www.cvlibs.net/datasets/kitti/eval_stereo_flow.php?benchmark=stereo. KITTI 2015 is available at http://www.cvlibs.net/datasets/kitti/eval_scene_flow.php?benchmark=stereo. Middlebury is available at https://vision.middlebury.edu/stereo/submit3/. ETH3D is available at https://www.eth3d.net/datasets#low-res-two-view-test-data.


Acknowledgements

This work was supported by the National Natural Science Foundation of China (Grants 62276016, 62372029, and 62106012). Dr. Lin Gu was supported by JST Moonshot R&D Grant Number JPMJMS2011, Japan.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Xiao Bai.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Communicated by Jifeng Dai.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Zhang, J., Huang, L., Bai, X. et al. Exploring the Usage of Pre-trained Features for Stereo Matching. Int J Comput Vis (2024). https://doi.org/10.1007/s11263-024-02090-y

