International Journal of Computer Vision, Volume 127, Issue 2, pp 143–162

Complete 3D Scene Parsing from an RGBD Image

  • Chuhang Zou
  • Ruiqi Guo
  • Zhizhong Li
  • Derek Hoiem


One major goal of vision is to infer physical models of objects, surfaces, and their layout from sensors. In this paper, we aim to interpret indoor scenes from one RGBD image. Our representation encodes the layout of orthogonal walls and the extent of objects, modeled with CAD-like 3D shapes. We parse both the visible and occluded portions of the scene and all observable objects, producing a complete 3D parse. Such a scene interpretation is useful for robotics and visual reasoning, but difficult to produce due to the well-known challenge of segmentation, the high degree of occlusion, and the diversity of objects in indoor scenes. We take a data-driven approach, generating sets of potential object regions, matching to regions in training images, and transferring and aligning associated 3D models while encouraging fit to observations and spatial consistency. We use support inference to aid interpretation and propose a retrieval scheme that uses convolutional neural networks to classify regions and retrieve objects with similar shapes. We demonstrate the performance of our method on our newly annotated NYUd v2 dataset (Silberman et al. 2012) with detailed 3D shapes.
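The matching-and-transfer step described above can be illustrated with a minimal sketch. It assumes each region is already summarized by a fixed-length feature vector (for instance, a CNN descriptor) and that each training region carries the identifier of an associated 3D model. The function names, the cosine-similarity ranking, and the toy data are assumptions made for illustration only; they are not the retrieval scheme or the features used in the paper.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an all-zero descriptor matches nothing
    return dot / (norm_a * norm_b)


def retrieve_models(
    query: Sequence[float],
    training_set: List[Tuple[Sequence[float], str]],
    k: int = 2,
) -> List[Tuple[str, float]]:
    """Return the k (model_id, score) pairs whose training-region
    features are most similar to the query region's features."""
    scored = [
        (cosine_similarity(query, features), model_id)
        for features, model_id in training_set
    ]
    # Highest similarity first; the model ids of the top matches are
    # the candidates that would be "transferred" to the query region.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(model_id, score) for score, model_id in scored[:k]]


# Hypothetical training regions: (feature vector, associated 3D model id).
TRAINING_REGIONS = [
    ([1.0, 0.0, 0.0], "chair_01"),
    ([0.0, 1.0, 0.0], "table_03"),
    ([0.9, 0.1, 0.0], "chair_07"),
]

if __name__ == "__main__":
    for model_id, score in retrieve_models([1.0, 0.05, 0.0], TRAINING_REGIONS):
        print(f"{model_id}: {score:.3f}")
```

In a full pipeline, the retrieved candidates would still need to be aligned to the observed depth and checked for spatial consistency; this sketch covers only the retrieval step.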


  • Visual scene understanding
  • 3D parsing
  • Single image reconstruction



This research is supported in part by ONR MURI Grant N000141010934 and ONR MURI Grant N000141612007. We thank David Forsyth for insightful comments and discussion and Saurabh Singh, Kevin Shih and Tanmay Gupta for their comments on an earlier version of the manuscript.


  1. Aubry, M., Maturana, D., Efros, A. A., Russell, B. C., & Sivic, J. (2014). Seeing 3D chairs: Exemplar part-based 2D–3D alignment using a large dataset of CAD models. In CVPR.
  2. Banica, D., & Sminchisescu, C. (2013). CPMC-3D-O2P: Semantic segmentation of RGB-D images using CPMC and second order pooling. In CoRR.
  3. Carreira, J., & Sminchisescu, C. (2012). CPMC: Automatic object segmentation using constrained parametric min-cuts. PAMI, 34(7), 1312–1328.
  4. Chang, A. X., Funkhouser, T., Guibas, L., Hanrahan, P., Huang, Q., Li, Z., et al. (2015). ShapeNet: An information-rich 3D model repository. Technical Report. arXiv:1512.03012. Stanford University—Princeton University—Toyota Technological Institute at Chicago.
  5. Dai, A., Chang, A. X., Savva, M., Halber, M., Funkhouser, T., & Nießner, M. (2017). ScanNet: Richly-annotated 3D reconstructions of indoor scenes. In Proceedings computer vision and pattern recognition (CVPR). IEEE.
  6. Dasgupta, S., Fang, K., Chen, K., & Savarese, S. (2016). DeLay: Robust spatial layout estimation for cluttered indoor scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 616–624).
  7. Delage, E., Lee, H., & Ng, A. Y. (2006). A dynamic Bayesian network model for autonomous 3D reconstruction from a single indoor image. In CVPR.
  8. Deng, Z., Todorovic, S., & Jan Latecki, L. (2015). Semantic segmentation of RGBD images with mutex constraints. In Proceedings of the IEEE international conference on computer vision (pp. 1733–1741).
  9. Dollár, P., & Zitnick, C. L. (2013). Structured forests for fast edge detection. In ICCV.
  10. Endres, I., & Hoiem, D. (2010). Category independent object proposals. In ECCV.
  11. Flint, A., Murray, D. W., & Reid, I. (2011). Manhattan scene understanding using monocular, stereo, and 3D features. In ICCV.
  12. Furukawa, Y., Curless, B., Seitz, S. M., & Szeliski, R. (2009). Reconstructing building interiors from images. In ICCV.
  13. Gong, Y., Ke, Q., Isard, M., & Lazebnik, S. (2013). A multi-view embedding space for internet images, tags, and their semantics. In IJCV.
  14. Guo, R., & Hoiem, D. (2012). Beyond the line of sight: Labeling the underlying surfaces. In ECCV.
  15. Guo, R., & Hoiem, D. (2013). Support surface prediction in indoor scenes. In ICCV.
  16. Guo, R., Zou, C., & Hoiem, D. (2015). Predicting complete 3D models of indoor scenes. arXiv preprint arXiv:1504.02437.
  17. Gupta, S., Arbelaez, P., & Malik, J. (2013). Perceptual organization and recognition of indoor scenes from RGB-D images. In CVPR.
  18. Gupta, S., Girshick, R., Arbeláez, P., & Malik, J. (2014). Learning rich features from RGB-D images for object detection and segmentation. In European conference on computer vision (pp. 345–360). Springer.
  19. Gupta, S., Arbeláez, P., Girshick, R., & Malik, J. (2015). Aligning 3D models to RGB-D images of cluttered scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 4731–4740).
  20. Hariharan, B., Arbeláez, P., Girshick, R., & Malik, J. (2014). Simultaneous detection and segmentation. In Computer vision–ECCV 2014 (pp. 297–312). Springer.
  21. Hedau, V., Hoiem, D., & Forsyth, D. (2009). Recovering the spatial layout of cluttered rooms. In ICCV.
  22. Hedau, V., Hoiem, D., & Forsyth, D. (2010). Thinking inside the box: Using appearance models and context based on room geometry. In ECCV.
  23. Hoiem, D., Efros, A. A., & Hebert, M. (2008). Putting objects in perspective. International Journal of Computer Vision, 80(1), 3–15.
  24. Karsch, K., Liu, C., & Kang, S. B. (2012). Depth extraction from video using non-parametric sampling. In ECCV.
  25. Kingma, D., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  26. Lee, D. C., Hebert, M., & Kanade, T. (2009). Geometric reasoning for single image structure recovery. In CVPR.
  27. Li, Y., Su, H., Qi, C. R., Fish, N., Cohen-Or, D., & Guibas, L. J. (2015). Joint embeddings of shapes and images via CNN image purification. ACM Transactions on Graphics, 34(6), 234.
  28. Lim, J. J., Pirsiavash, H., & Torralba, A. (2013). Parsing IKEA objects: Fine pose estimation. In ICCV.
  29. Lim, J. J., Khosla, A., & Torralba, A. (2014). FPM: Fine pose parts-based model with 3D CAD models. In European conference on computer vision (pp. 478–493). Springer.
  30. Lin, D., Fidler, S., & Urtasun, R. (2013). Holistic scene understanding for 3D object detection with RGBD cameras. In ICCV.
  31. Long, J., Shelhamer, E., & Darrell, T. (2015). Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3431–3440).
  32. Mallya, A., & Lazebnik, S. (2015). Learning informative edge maps for indoor scene layout prediction. In Proceedings of the IEEE international conference on computer vision (pp. 936–944).
  33. Manen, S., Guillaumin, M., & Van Gool, L. (2013). Prime object proposals with randomized Prim’s algorithm. In Proceedings of the IEEE international conference on computer vision (pp. 2536–2543).
  34. Ren, X., Bo, L., & Fox, D. (2012). RGB-(D) scene labeling: Features and algorithms. In CVPR.
  35. Roberts, L. G. (1963). Machine perception of 3-D solids. Ph.D. thesis, Massachusetts Institute of Technology.
  36. Rock, J., Gupta, T., Thorsen, J., Gwak, J., Shin, D., & Hoiem, D. (2015). Completing 3D object shape from one depth image. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2484–2493).
  37. Satkin, S., & Hebert, M. (2013). 3DNN: Viewpoint invariant 3D geometry matching for scene understanding. In ICCV.
  38. Schwing, A. G., & Urtasun, R. (2012). Efficient exact inference for 3D indoor scene understanding. In ECCV.
  39. Silberman, N., Hoiem, D., Kohli, P., & Fergus, R. (2012). Indoor segmentation and support inference from RGBD images. In Computer vision–ECCV 2012 (pp. 746–760).
  40. Song, S., & Xiao, J. (2014). Sliding shapes for 3D object detection in depth images. In European conference on computer vision (pp. 634–651). Springer.
  41. Song, S., & Xiao, J. (2016). Deep sliding shapes for amodal 3D object detection in RGB-D images. In The IEEE conference on computer vision and pattern recognition (CVPR).
  42. Song, S., Lichtenberg, S. P., & Xiao, J. (2015). SUN RGB-D: A RGB-D scene understanding benchmark suite. CVPR, Vol. 5, p. 6.
  43. Song, S., Yu, F., Zeng, A., Chang, A. X., Savva, M., & Funkhouser, T. (2017). Semantic scene completion from a single depth image. In Proceedings of 30th IEEE conference on computer vision and pattern recognition.
  44. Tighe, J., & Lazebnik, S. (2010). Superparsing: Scalable nonparametric image parsing with superpixels. In ECCV.
  45. Urtasun, R., Fergus, R., Hoiem, D., Torralba, A., Geiger, A., Lenz, P., et al. (2013). Reconstruction meets recognition challenge.
  46. Walk, S., Schindler, K., & Schiele, B. (2010). Disparity statistics for pedestrian detection: Combining appearance, motion and stereo. In Computer Vision–ECCV 2010 (pp. 182–195). Springer.
  47. Wu, Z., Song, S., Khosla, A., Yu, F., Zhang, L., Tang, X., et al. (2015). 3D shape nets: A deep representation for volumetric shapes. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1912–1920).
  48. Xiang, Y., Kim, W., Chen, W., Ji, J., Choy, C., Su, H., Mottaghi, R., et al. (2016). ObjectNet3D: A large scale database for 3D object recognition. In European conference on computer vision (ECCV).
  49. Xiao, J., & Furukawa, Y. (2012). Reconstructing the world’s museums. In ECCV.
  50. Xiao, J., Russell, B. C., & Torralba, A. (2012). Localizing 3D cuboids in single-view images. In NIPS.
  51. Yamaguchi, K., Kiapour, M. H., & Berg, T. L. (2013). Paper doll parsing: Retrieving similar styles to parse clothing items. In ICCV.
  52. Yih, W. T., Toutanova, K., Platt, J. C., & Meek, C. (2011). Learning discriminative projections for text similarity measures. In Proceedings of the 15th conference on computational natural language learning, Association for Computational Linguistics (pp. 247–256).
  53. Zhang, J., Kan, C., Schwing, A. G., & Urtasun, R. (2013). Estimating the 3D layout of indoor scenes and its clutter from depth sensors. In ICCV.
  54. Zhang, Y., Song, S., Tan, P., & Xiao, J. (2014). PanoContext: A whole-room 3D context model for panoramic scene understanding. In European conference on computer vision (pp. 668–686). Springer.
  55. Zhang, Y., Bai, M., Kohli, P., Izadi, S., & Xiao, J. (2016). DeepContext: Context-encoding neural pathways for 3D holistic scene understanding. In ICCV.
  56. Zhao, Y., & Zhu, S. C. (2013). Scene parsing by integrating function, geometry and appearance models. In CVPR.

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. Department of Computer Science, University of Illinois at Urbana-Champaign, Champaign, USA
  2. Google Research, New York, USA
