Layer-Structured 3D Scene Inference via View Synthesis

  • Shubham Tulsiani
  • Richard Tucker
  • Noah Snavely
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11211)


We present an approach to infer a layer-structured 3D representation of a scene from a single input image. This representation allows us not only to infer the depth of the visible pixels, but also to capture the texture and depth of content in the scene that is not directly visible. We overcome the challenge posed by the lack of direct supervision by instead leveraging a more naturally available multi-view supervisory signal. Our insight is to use view synthesis as a proxy task: we enforce that our representation, inferred from a single image, matches the true observed image when rendered from a novel viewpoint. We present a learning framework that operationalizes this insight using a new, differentiable novel view renderer. We provide qualitative and quantitative validation of our approach in two different settings, and demonstrate that we can learn to capture the hidden aspects of a scene. The project website can be found at
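To make the proxy-task idea concrete, the following is a minimal, hypothetical sketch (not the paper's actual layered renderer) of view synthesis as supervision: a source image is forward-splatted into a target view using a single predicted depth map and a known relative camera pose, and the result is scored with a photometric loss against the observed target image. All names, the nearest-neighbour splatting, and the omission of occlusion handling are simplifying assumptions; the paper's method instead renders multiple texture and depth layers with a differentiable renderer.

```python
import numpy as np

def synthesis_loss(src_img, src_depth, K, rel_pose, tgt_img):
    """Warp src_img into the target view via per-pixel depth and relative
    pose (R, t), then compare against the observed target image.
    Nearest-neighbour forward splat, no z-buffering (illustrative only)."""
    H, W, _ = src_img.shape
    R, t = rel_pose
    ys, xs = np.mgrid[0:H, 0:W]
    # Homogeneous pixel coordinates, shape 3 x (H*W).
    pix = np.stack([xs.ravel(), ys.ravel(), np.ones(H * W)])
    pts = np.linalg.inv(K) @ pix * src_depth.ravel()  # back-project with depth
    pts = R @ pts + t[:, None]                        # move into target frame
    uvw = K @ pts                                     # reproject to target pixels
    u = np.round(uvw[0] / uvw[2]).astype(int)
    v = np.round(uvw[1] / uvw[2]).astype(int)
    ok = (u >= 0) & (u < W) & (v >= 0) & (v < H) & (uvw[2] > 0)
    pred = np.zeros_like(tgt_img)
    pred[v[ok], u[ok]] = src_img[ys.ravel()[ok], xs.ravel()[ok]]
    # Photometric (L1) loss: the learning signal used as proxy supervision.
    return np.abs(pred - tgt_img).mean()
```

With an identity pose and unit depth the warp is the identity, so the loss against the source image itself is zero; during training, minimizing this loss over many (source, target) pairs supervises the predicted geometry without any direct 3D labels.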



We would like to thank Tinghui Zhou and John Flynn for helpful discussions and comments. This work was done while ST was an intern at Google.

Supplementary material

Supplementary material 1 (PDF, 1.7 MB)



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Shubham Tulsiani (1)
  • Richard Tucker (2)
  • Noah Snavely (2)

  1. University of California, Berkeley, Berkeley, USA
  2. Google, Menlo Park, USA
