
Complete 3D Scene Parsing from an RGBD Image


One major goal of vision is to infer physical models of objects, surfaces, and their layout from sensors. In this paper, we aim to interpret indoor scenes from one RGBD image. Our representation encodes the layout of orthogonal walls and the extent of objects, modeled with CAD-like 3D shapes. We parse both the visible and occluded portions of the scene and all observable objects, producing a complete 3D parse. Such a scene interpretation is useful for robotics and visual reasoning, but difficult to produce due to the well-known challenge of segmentation, the high degree of occlusion, and the diversity of objects in indoor scenes. We take a data-driven approach, generating sets of potential object regions, matching to regions in training images, and transferring and aligning associated 3D models while encouraging fit to observations and spatial consistency. We use support inference to aid interpretation and propose a retrieval scheme that uses convolutional neural networks to classify regions and retrieve objects with similar shapes. We demonstrate the performance of our method on our newly annotated NYUd v2 dataset (Silberman et al. 2012) with detailed 3D shapes.
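The retrieval step described above (matching a candidate region to training regions and transferring the associated 3D model) can be illustrated with a minimal nearest-neighbor sketch. This is not the authors' implementation: the function names, the use of cosine similarity, and the toy feature vectors and model identifiers are all assumptions made for illustration. In the paper the region features come from a convolutional neural network, which is not reproduced here.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_models(
    query: Sequence[float],
    train_features: Sequence[Sequence[float]],
    train_model_ids: Sequence[str],
    k: int = 1,
) -> List[Tuple[str, float]]:
    """Return the k training 3D model ids whose region features best match the query.

    Hypothetical helper: each training region is assumed to carry one feature
    vector and the id of its annotated 3D model.
    """
    scored = [
        (model_id, cosine_similarity(query, feature))
        for feature, model_id in zip(train_features, train_model_ids)
    ]
    # Highest similarity first; ties keep their original order (stable sort).
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy usage with made-up 2D features and made-up model ids.
candidates = retrieve_models(
    query=[1.0, 0.0],
    train_features=[[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]],
    train_model_ids=["chair_01", "table_07", "chair_02"],
    k=2,
)
```

In a full pipeline the retrieved models would then be aligned to the depth observations and checked for spatial consistency; that stage is outside the scope of this sketch.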




  1. Aubry, M., Maturana, D., Efros, A. A., Russell, B. C., & Sivic, J. (2014). Seeing 3D chairs: Exemplar part-based 2D–3D alignment using a large dataset of CAD models. In CVPR.

  2. Banica, D., & Sminchisescu, C. (2013). CPMC-3D-O2P: Semantic segmentation of RGB-D images using CPMC and second order pooling. In CoRR.

  3. Carreira, J., & Sminchisescu, C. (2012). CPMC: Automatic object segmentation using constrained parametric min-cuts. PAMI, 34(7), 1312–1328.

  4. Chang, A. X., Funkhouser, T., Guibas, L., Hanrahan, P., Huang, Q., Li, Z., et al. (2015). ShapeNet: An information-rich 3D model repository. Technical Report. arXiv:1512.03012. Stanford University—Princeton University—Toyota Technological Institute at Chicago.

  5. Dai, A., Chang, A. X., Savva, M., Halber, M., Funkhouser, T., & Nießner, M. (2017). ScanNet: Richly-annotated 3D reconstructions of indoor scenes. In Proceedings computer vision and pattern recognition (CVPR). IEEE.

  6. Dasgupta, S., Fang, K., Chen, K., & Savarese, S. (2016). DeLay: Robust spatial layout estimation for cluttered indoor scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 616–624).

  7. Delage, E., Lee, H., & Ng, A. Y. (2006). A dynamic Bayesian network model for autonomous 3D reconstruction from a single indoor image. In CVPR.

  8. Deng, Z., Todorovic, S., & Jan Latecki, L. (2015). Semantic segmentation of RGBD images with mutex constraints. In Proceedings of the IEEE international conference on computer vision (pp. 1733–1741).

  9. Dollár, P., & Zitnick, C. L. (2013). Structured forests for fast edge detection. In ICCV.

  10. Endres, I., & Hoiem, D. (2010). Category independent object proposals. In ECCV.

  11. Flint, A., Murray, D. W., & Reid, I. (2011). Manhattan scene understanding using monocular, stereo, and 3D features. In ICCV.

  12. Furukawa, Y., Curless, B., Seitz, S. M., & Szeliski, R. (2009). Reconstructing building interiors from images. In ICCV.

  13. Gong, Y., Ke, Q., Isard, M., & Lazebnik, S. (2013). A multi-view embedding space for internet images, tags, and their semantics. In IJCV.

  14. Guo, R., & Hoiem, D. (2012). Beyond the line of sight: Labeling the underlying surfaces. In ECCV.

  15. Guo, R., & Hoiem, D. (2013). Support surface prediction in indoor scenes. In ICCV.

  16. Guo, R., Zou, C., & Hoiem, D. (2015). Predicting complete 3D models of indoor scenes. arXiv preprint arXiv:1504.02437.

  17. Gupta, S., Arbelaez, P., & Malik, J. (2013). Perceptual organization and recognition of indoor scenes from RGB-D images. In CVPR.

  18. Gupta, S., Girshick, R., Arbeláez, P., & Malik, J. (2014). Learning rich features from RGB-D images for object detection and segmentation. In European conference on computer vision (pp. 345–360). Springer.

  19. Gupta, S., Arbeláez, P., Girshick, R., & Malik, J. (2015). Aligning 3D models to RGB-D images of cluttered scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 4731–4740).

  20. Hariharan, B., Arbeláez, P., Girshick, R., & Malik, J. (2014). Simultaneous detection and segmentation. In Computer vision–ECCV 2014 (pp. 297–312). Springer.

  21. Hedau, V., Hoiem, D., & Forsyth, D. (2009). Recovering the spatial layout of cluttered rooms. In ICCV.

  22. Hedau, V., Hoiem, D., & Forsyth, D. (2010). Thinking inside the box: Using appearance models and context based on room geometry. In ECCV.

  23. Hoiem, D., Efros, A. A., & Hebert, M. (2008). Putting objects in perspective. International Journal of Computer Vision, 80(1), 3–15.

  24. Karsch, K., Liu, C., & Kang, S. B. (2012). Depth extraction from video using non-parametric sampling. In ECCV.

  25. Kingma, D., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

  26. Lee, D. C., Hebert, M., & Kanade, T. (2009). Geometric reasoning for single image structure recovery. In CVPR.

  27. Li, Y., Su, H., Qi, C. R., Fish, N., Cohen-Or, D., & Guibas, L. J. (2015). Joint embeddings of shapes and images via CNN image purification. ACM Transactions on Graphics, 34(6), 234.

  28. Lim, J. J., Pirsiavash, H., & Torralba, A. (2013). Parsing IKEA objects: Fine pose estimation. In ICCV.

  29. Lim, J. J., Khosla, A., & Torralba, A. (2014). FPM: Fine pose parts-based model with 3D CAD models. In European conference on computer vision (pp. 478–493). Springer.

  30. Lin, D., Fidler, S., & Urtasun, R. (2013). Holistic scene understanding for 3D object detection with RGBD cameras. In ICCV.

  31. Long, J., Shelhamer, E., & Darrell, T. (2015). Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3431–3440).

  32. Mallya, A., & Lazebnik, S. (2015). Learning informative edge maps for indoor scene layout prediction. In Proceedings of the IEEE international conference on computer vision (pp. 936–944).

  33. Manen, S., Guillaumin, M., & Van Gool, L. (2013). Prime object proposals with randomized Prim’s algorithm. In Proceedings of the IEEE international conference on computer vision (pp. 2536–2543).

  34. Ren, X., Bo, L., & Fox, D. (2012). RGB-(D) scene labeling: Features and algorithms. In CVPR.

  35. Roberts, L. G. (1963). Machine perception of 3-D solids. Ph.D. thesis, Massachusetts Institute of Technology.

  36. Rock, J., Gupta, T., Thorsen, J., Gwak, J., Shin, D., & Hoiem, D. (2015). Completing 3D object shape from one depth image. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2484–2493).

  37. Satkin, S., & Hebert, M. (2013). 3DNN: Viewpoint invariant 3D geometry matching for scene understanding. In ICCV.

  38. Schwing, A. G., & Urtasun, R. (2012). Efficient exact inference for 3D indoor scene understanding. In ECCV.

  39. Silberman, N., Hoiem, D., Kohli, P., & Fergus, R. (2012). Indoor segmentation and support inference from RGBD images. In Computer vision–ECCV 2012 (pp. 746–760).

  40. Song, S., & Xiao, J. (2014). Sliding shapes for 3D object detection in depth images. In European conference on computer vision (pp. 634–651). Springer.

  41. Song, S., & Xiao, J. (2016). Deep sliding shapes for amodal 3D object detection in RGB-D images. In The IEEE conference on computer vision and pattern recognition (CVPR).

  42. Song, S., Lichtenberg, S. P., & Xiao, J. (2015). SUN RGB-D: A RGB-D scene understanding benchmark suite. In CVPR.

  43. Song, S., Yu, F., Zeng, A., Chang, A. X., Savva, M., & Funkhouser, T. (2017). Semantic scene completion from a single depth image. In Proceedings of 30th IEEE conference on computer vision and pattern recognition.

  44. Tighe, J., & Lazebnik, S. (2010). Superparsing: Scalable nonparametric image parsing with superpixels. In ECCV.

  45. Urtasun, R., Fergus, R., Hoiem, D., Torralba, A., Geiger, A., Lenz, P., et al. (2013). Reconstruction meets recognition challenge.

  46. Walk, S., Schindler, K., & Schiele, B. (2010). Disparity statistics for pedestrian detection: Combining appearance, motion and stereo. In Computer Vision–ECCV 2010 (pp. 182–195). Springer.

  47. Wu, Z., Song, S., Khosla, A., Yu, F., Zhang, L., Tang, X., et al. (2015). 3D shape nets: A deep representation for volumetric shapes. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1912–1920).

  48. Xiang, Y., Kim, W., Chen, W., Ji, J., Choy, C., Su, H., Mottaghi, R., et al. (2016). ObjectNet3D: A large scale database for 3D object recognition. In European conference on computer vision (ECCV).

  49. Xiao, J., & Furukawa, Y. (2012). Reconstructing the world’s museums. In ECCV.

  50. Xiao, J., Russell, B. C., & Torralba, A. (2012). Localizing 3D cuboids in single-view images. In NIPS.

  51. Yamaguchi, K., Kiapour, M. H., & Berg, T. L. (2013). Paper doll parsing: Retrieving similar styles to parse clothing items. In ICCV.

  52. Yih, W. T., Toutanova, K., Platt, J. C., & Meek, C. (2011). Learning discriminative projections for text similarity measures. In Proceedings of the 15th conference on computational natural language learning, Association for Computational Linguistics (pp. 247–256).

  53. Zhang, J., Kan, C., Schwing, A. G., & Urtasun, R. (2013). Estimating the 3D layout of indoor scenes and its clutter from depth sensors. In ICCV.

  54. Zhang, Y., Song, S., Tan, P., & Xiao, J. (2014). PanoContext: A whole-room 3D context model for panoramic scene understanding. In European conference on computer vision (pp. 668–686). Springer.

  55. Zhang, Y., Bai, M., Kohli, P., Izadi, S., & Xiao, J. (2016). Deepcontext: Context-encoding neural pathways for 3D holistic scene understanding. In ICCV.

  56. Zhao, Y., & Zhu, S. C. (2013). Scene parsing by integrating function, geometry and appearance models. In CVPR.



This research is supported in part by ONR MURI Grant N000141010934 and ONR MURI Grant N000141612007. We thank David Forsyth for insightful comments and discussion and Saurabh Singh, Kevin Shih and Tanmay Gupta for their comments on an earlier version of the manuscript.

Author information



Corresponding author

Correspondence to Chuhang Zou.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Communicated by Ping Tan.


About this article


Cite this article

Zou, C., Guo, R., Li, Z. et al. Complete 3D Scene Parsing from an RGBD Image. Int J Comput Vis 127, 143–162 (2019).



  • Visual scene understanding
  • 3D parsing
  • Single image reconstruction