
Associative3D: Volumetric Reconstruction from Sparse Views

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12360)

Abstract

This paper studies the problem of 3D volumetric reconstruction from two views of a scene with an unknown camera. While seemingly easy for humans, this problem poses many challenges for computers, since it requires simultaneously reconstructing the objects in each view and figuring out the relationship between the views. We propose a new approach that estimates reconstructions, distributions over the camera/object and camera/camera transformations, and an inter-view object affinity matrix. This information is then jointly reasoned over to produce the most likely explanation of the scene. We train and test our approach on a dataset of indoor scenes and rigorously evaluate the merits of our joint reasoning approach. Our experiments show that the approach recovers reasonable scenes from sparse views, although the problem remains challenging. Project site: https://jasonqsy.github.io/Associative3D.
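To make the joint-reasoning idea concrete, the following is a minimal sketch of the affinity-and-stitching step: object embeddings from the two views are compared to form an inter-view affinity matrix, and each candidate relative-camera transformation is scored by how well it brings high-affinity objects into agreement. All function names, array shapes, and the scoring rule below are illustrative assumptions for exposition, not the paper's actual implementation.

```python
# Illustrative sketch only: names, shapes, and the scoring rule are assumed,
# not taken from the Associative3D codebase.
import numpy as np

def affinity_matrix(emb_view1, emb_view2):
    """Pairwise affinity between object embeddings from the two views.

    emb_view1: (n1, d) array, one embedding per detected object in view 1.
    emb_view2: (n2, d) array, likewise for view 2.
    Returns an (n1, n2) matrix of sigmoid-squashed dot products.
    """
    logits = emb_view1 @ emb_view2.T
    return 1.0 / (1.0 + np.exp(-logits))

def score_camera_hypothesis(T, objs1, objs2, affinity):
    """Score one relative-camera hypothesis T (a 4x4 rigid transform) by how
    well it aligns high-affinity object pairs across the views.

    objs1: (n1, 3) object centroids in view-1 coordinates.
    objs2: (n2, 3) object centroids in view-2 coordinates.
    """
    # Transform view-2 centroids into view-1 coordinates (homogeneous form).
    objs2_h = np.concatenate([objs2, np.ones((len(objs2), 1))], axis=1)
    objs2_in_1 = (T @ objs2_h.T).T[:, :3]
    # Affinity-weighted negative distance: likely matches should coincide.
    dists = np.linalg.norm(objs1[:, None, :] - objs2_in_1[None, :, :], axis=-1)
    return -(affinity * dists).sum()

def select_best(hypotheses, objs1, objs2, affinity):
    """Pick the most likely explanation among candidate transformations."""
    return max(hypotheses,
               key=lambda T: score_camera_hypothesis(T, objs1, objs2, affinity))
```

The point of the affinity weighting is that correspondence and relative pose are resolved together: a camera hypothesis is only rewarded for aligning object pairs that the embeddings already consider likely matches.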

Keyword

3D reconstruction 

Acknowledgments

We thank Nilesh Kulkarni and Shubham Tulsiani for their help with 3D-RelNet; Zhengyuan Dong for his help with visualization; Tianning Zhu for his help with the video; and Richard Higgins, Dandan Shan, Chris Rockwell, and Tongan Cai for their feedback on the draft. Toyota Research Institute ("TRI") provided funds to assist the authors with their research, but this article solely reflects the opinions and conclusions of its authors and not TRI or any other Toyota entity.


Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

University of Michigan, Ann Arbor, USA