CoReNet: Coherent 3D Scene Reconstruction from a Single RGB Image

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12347)


Advances in deep learning techniques have allowed recent work to reconstruct the shape of a single object given only one RGB image as input. Building on common encoder-decoder architectures for this task, we propose three extensions: (1) ray-traced skip connections that propagate local 2D information to the output 3D volume in a physically correct manner; (2) a hybrid 3D volume representation that enables building translation equivariant models, while at the same time encoding fine object details without an excessive memory footprint; (3) a reconstruction loss tailored to capture overall object geometry. Furthermore, we adapt our model to address the harder task of reconstructing multiple objects from a single image. We reconstruct all objects jointly in one pass, producing a coherent reconstruction in which all objects live in a single consistent 3D coordinate frame relative to the camera and do not intersect in 3D space. We also handle occlusions and resolve them by hallucinating the missing object parts in the 3D volume. We validate the impact of our contributions experimentally on both synthetic data from ShapeNet and real images from Pix3D. Our method improves over the state-of-the-art single-object methods on both datasets. Finally, we evaluate performance quantitatively on multiple object reconstruction with synthetic scenes assembled from ShapeNet objects.
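
To make the idea behind extension (1) concrete, the following is a minimal sketch of how 2D image features could be carried into a 3D volume along camera rays. It is an illustration only, not the authors' implementation: it assumes a simple pinhole camera looking down +z, a nearest-pixel lookup, and zero features for voxels that fall outside the image. The names project_point and gather_along_rays, and the parameters focal, cx and cy, are hypothetical.

```python
import math
from typing import List, Optional, Sequence, Tuple

Vec3 = Tuple[float, float, float]
FeatureMap = Sequence[Sequence[Sequence[float]]]  # indexed [row][col][channel]


def project_point(p: Vec3, focal: float, cx: float, cy: float) -> Optional[Tuple[int, int]]:
    """Project a camera-space point to integer pixel indices (col, row).

    Assumes a pinhole camera looking down +z (an assumption of this sketch).
    Returns None for points at or behind the camera plane.
    """
    x, y, z = p
    if z <= 0.0:
        return None
    u = focal * x / z + cx  # horizontal pixel coordinate
    v = focal * y / z + cy  # vertical pixel coordinate
    return math.floor(u), math.floor(v)


def gather_along_rays(feature_map: FeatureMap, voxel_centers: Sequence[Vec3],
                      focal: float, cx: float, cy: float) -> List[List[float]]:
    """Give every voxel the 2D feature of the pixel its centre projects to.

    Voxels on the same camera ray receive the same feature vector.
    Voxels that project outside the image receive a zero vector.
    """
    height = len(feature_map)
    width = len(feature_map[0])
    channels = len(feature_map[0][0])
    gathered: List[List[float]] = []
    for center in voxel_centers:
        pixel = project_point(center, focal, cx, cy)
        if pixel is not None and 0 <= pixel[1] < height and 0 <= pixel[0] < width:
            col, row = pixel
            gathered.append(list(feature_map[row][col]))
        else:
            gathered.append([0.0] * channels)
    return gathered
```

In this sketch, two voxel centres such as (1, 1, 1) and (2, 2, 2) lie on the same ray through the camera centre and therefore pick up the same pixel feature, which is the sense in which local 2D information is propagated into the volume consistently with the camera geometry.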




Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. Google Research, Zürich, Switzerland
