PlaneFormers: From Sparse View Planes to 3D Reconstruction

  • Conference paper
Computer Vision – ECCV 2022 (ECCV 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13663)

Abstract

We present an approach for the planar surface reconstruction of a scene from images with limited overlap. This reconstruction task is challenging since it requires jointly reasoning about single image 3D reconstruction, correspondence between images, and the relative camera pose between images. Past work has proposed optimization-based approaches. We introduce a simpler approach, the PlaneFormer, that uses a transformer applied to 3D-aware plane tokens to perform 3D reasoning. Our experiments show that our approach is substantially more effective than prior work, and that several 3D-specific design decisions are crucial for its success. Code is available at https://github.com/samiragarwala/PlaneFormers.
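To make the idea of attention over plane tokens concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the token layout (an appearance embedding concatenated with a plane normal and offset), the helper names `make_plane_token` and `self_attention`, the identity query/key/value projections, and all numeric values are assumptions chosen for illustration. A trained model would use learned projections and many more feature dimensions.

```python
import math


def make_plane_token(appearance, normal, offset):
    """Build a hypothetical plane token: appearance embedding + normal + offset."""
    return list(appearance) + list(normal) + [offset]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Queries, keys, and values are the tokens themselves (identity
    projections), which is a simplification made for this sketch.
    """
    d = len(tokens[0])
    scale = 1.0 / math.sqrt(d)
    weights = []
    for q in tokens:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in tokens]
        weights.append(softmax(scores))
    outputs = [
        [sum(w * v[j] for w, v in zip(row, tokens)) for j in range(d)]
        for row in weights
    ]
    return outputs, weights


# Made-up planes: two detected in view 1 and two in view 2, all assumed to be
# expressed in the camera frame of view 1.
tokens = [
    make_plane_token([1.0, 0.0], [0.0, 0.0, 1.0], 2.0),  # view 1, wall
    make_plane_token([0.0, 1.0], [0.0, 1.0, 0.0], 1.0),  # view 1, floor
    make_plane_token([0.9, 0.1], [0.0, 0.0, 1.0], 2.1),  # view 2, similar wall
    make_plane_token([0.1, 0.9], [0.0, 1.0, 0.0], 1.1),  # view 2, similar floor
]

outputs, weights = self_attention(tokens)
```

Each row of `weights` is a distribution over all planes from both views, so a plane can attend to candidates in the other image. In this toy setup the wall in view 1 puts more weight on the similar wall in view 2 than on either floor plane, which is the kind of signal a correspondence head could build on.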

Acknowledgements

This work was supported by the DARPA Machine Common Sense Program. We would like to thank Richard Higgins and members of the Fouhey lab for helpful discussions and feedback.

Author information

Corresponding author

Correspondence to Samir Agarwala.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 16939 KB)

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Agarwala, S., Jin, L., Rockwell, C., Fouhey, D.F. (2022). PlaneFormers: From Sparse View Planes to 3D Reconstruction. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13663. Springer, Cham. https://doi.org/10.1007/978-3-031-20062-5_12

  • DOI: https://doi.org/10.1007/978-3-031-20062-5_12

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-20061-8

  • Online ISBN: 978-3-031-20062-5

  • eBook Packages: Computer Science (R0)
