Pyramid Multi-view Stereo Net with Self-adaptive View Aggregation

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12354)


In this paper, we propose an effective and efficient pyramid multi-view stereo (MVS) net with self-adaptive view aggregation for accurate and complete dense point cloud reconstruction. Unlike previous deep-learning-based MVS methods, which generate the cost volume with a mean-square variance metric, our VA-MVSNet incorporates the cost variances of different views at small extra memory cost by introducing two novel self-adaptive view aggregations: pixel-wise view aggregation and voxel-wise view aggregation. To further boost the robustness and completeness of 3D point cloud reconstruction, we extend VA-MVSNet with pyramid multi-scale image input as PVA-MVSNet, where multi-metric constraints are leveraged to aggregate reliable depth estimates from the coarser scale to fill in mismatched regions at the finer scale. Experimental results show that our approach establishes a new state of the art on the DTU dataset, with significant improvements in completeness and overall quality, and generalizes well, achieving performance comparable to state-of-the-art methods on the Tanks and Temples benchmark. Our codebase is at
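The pixel-wise view aggregation described above can be sketched as a re-weighted cost-volume construction: each source view's matching cost against the reference is scaled by a per-pixel weight before aggregation, so occluded or unreliable views contribute less than a plain mean-variance metric would allow. The sketch below is illustrative only, assuming NumPy arrays and a hypothetical `view_weights` input standing in for the network-predicted aggregation weights; it is not the authors' implementation.

```python
import numpy as np

def aggregate_cost_volume(ref_feat, src_feats, view_weights=None):
    """Aggregate per-view matching costs into one cost volume.

    ref_feat    : (C, D, H, W) reference feature volume.
    src_feats   : (V, C, D, H, W) source features warped to each depth plane.
    view_weights: (V, D, H, W) hypothetical per-pixel view weights;
                  None falls back to the plain mean-variance baseline.
    """
    # Per-view squared difference to the reference (the "cost variance").
    per_view_cost = (src_feats - ref_feat[None]) ** 2  # (V, C, D, H, W)

    if view_weights is None:
        # Baseline: uniform mean over views, as in variance-based cost metrics.
        return per_view_cost.mean(axis=0)  # (C, D, H, W)

    # Pixel-wise view aggregation: normalize weights across views, then
    # down-weight each view's cost where it is predicted to be unreliable.
    w = view_weights / (view_weights.sum(axis=0, keepdims=True) + 1e-8)
    return (w[:, None] * per_view_cost).sum(axis=0)  # (C, D, H, W)
```

With uniform weights the result reduces to the baseline mean, which is one way to see the aggregation as a strict generalization of the variance-based cost volume.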


Keywords: Multi-view stereo · Deep learning · Self-adaptive view aggregation · Multi-metric pyramid aggregation



This project was supported by the National Key R&D Program of China (No. 2017YFB1002705, No. 2017YFB1002601) and NSFC of China (No. 61632003, No. 61661146002, No. 61872398).

Supplementary material

Supplementary material 1 (PDF, 6.7 MB)

Supplementary material 2 (MP4, 74.7 MB)



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. PKU, Beijing, China
  2. HKU, Shatin, Hong Kong
  3. Tencent, Shenzhen, China
  4. Kwai Inc., Beijing, China
