International Journal of Computer Vision, Volume 112, Issue 2, pp 133–149

Indoor Scene Understanding with RGB-D Images: Bottom-up Segmentation, Object Detection and Semantic Segmentation

  • Saurabh Gupta
  • Pablo Arbeláez
  • Ross Girshick
  • Jitendra Malik


In this paper, we address the problems of contour detection, bottom-up grouping, object detection and semantic segmentation on RGB-D data. We focus on the challenging setting of cluttered indoor scenes, and evaluate our approach on the recently introduced NYU-Depth V2 (NYUD2) dataset (Silberman et al., ECCV, 2012). We propose algorithms for object boundary detection and hierarchical segmentation that generalize the \(gPb-ucm\) approach of Arbelaez et al. (TPAMI, 2011) by making effective use of depth information. We show that our system can label each contour with its type (depth, normal or albedo). We also propose a generic method for long-range amodal completion of surfaces and show its effectiveness in grouping. We train RGB-D object detectors by computing histograms of oriented gradients on the depth image and using them with deformable part models (Felzenszwalb et al., TPAMI, 2010). We observe that this simple strategy for training object detectors significantly outperforms more complicated models in the literature. We then turn to the problem of semantic segmentation, for which we propose an approach that classifies superpixels into the dominant object categories in the NYUD2 dataset. We design generic and class-specific features to encode the appearance and geometry of objects. We also show that additional features computed from RGB-D object detectors and scene classifiers further improve semantic segmentation accuracy. In all of these tasks, we report significant improvements over the state of the art.
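
To make the depth-based detector features more concrete, the following is a minimal sketch of a histogram-of-oriented-gradients style descriptor computed on a depth map. It is an illustration only and not the authors' implementation: the function names `orientation_histogram` and `hog_cells`, the 8-pixel cell size, the 9 unsigned orientation bins, and the per-cell L2 normalization are all assumptions chosen for simplicity, and the sketch omits the block normalization and feature variant used with deformable part models.

```python
import numpy as np

def orientation_histogram(patch, n_bins=9):
    """Magnitude-weighted histogram of unsigned gradient orientations.

    Illustrative helper (hypothetical name), not taken from the paper.
    """
    patch = np.asarray(patch, dtype=np.float64)
    gy, gx = np.gradient(patch)            # derivatives along rows (y) and columns (x)
    magnitude = np.hypot(gx, gy)
    # Fold orientations into [0, pi) so opposite directions share a bin.
    angle = np.mod(np.arctan2(gy, gx), np.pi)
    bins = np.minimum((angle / np.pi * n_bins).astype(int), n_bins - 1)
    hist = np.bincount(bins.ravel(), weights=magnitude.ravel(), minlength=n_bins)
    norm = np.linalg.norm(hist)
    return hist / norm if norm > 0 else hist

def hog_cells(depth, cell=8, n_bins=9):
    """Split a depth map into non-overlapping cells and describe each one.

    The cell size and bin count are assumed defaults, not values from the paper.
    """
    depth = np.asarray(depth, dtype=np.float64)
    rows, cols = depth.shape[0] // cell, depth.shape[1] // cell
    features = np.zeros((rows, cols, n_bins))
    for r in range(rows):
        for c in range(cols):
            patch = depth[r * cell:(r + 1) * cell, c * cell:(c + 1) * cell]
            features[r, c] = orientation_histogram(patch, n_bins)
    return features

if __name__ == "__main__":
    # Synthetic depth map: a plane whose depth increases from left to right.
    ramp = np.tile(np.arange(16, dtype=np.float64), (16, 1))
    print(hog_cells(ramp).shape)
```

On a real depth image, missing depth readings would need to be filled or masked before taking gradients; the sketch assumes a dense, valid depth map.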


RGB-D contour detection, RGB-D image segmentation, RGB-D object detection, RGB-D semantic segmentation, RGB-D scene classification



We are thankful to Jon Barron, Bharath Hariharan, and Pulkit Agrawal for the useful discussions. This work was sponsored by ONR SMARTS MURI N00014-09-1-1051, ONR MURI N00014-10-10933, and a Berkeley Fellowship.


  1. Arbelaez, P., Hariharan, B., Gu, C., Gupta, S., Bourdev, L., & Malik, J. (2012). Semantic segmentation using regions and parts. In CVPR.
  2. Arbelaez, P., Maire, M., Fowlkes, C., & Malik, J. (2011). Contour detection and hierarchical image segmentation. In TPAMI.
  3. Barron, J. T., & Malik, J. (2013). Intrinsic scene properties from a single RGB-D image. In CVPR.
  4. Bourdev, L., Maji, S., Brox, T., & Malik, J. (2010). Detecting people using mutually consistent poselet activations. In ECCV.
  5. Breiman, L. (2001). Random forests. Machine Learning, 45, 5–32.
  6. Carreira, J., Caseiro, R., Batista, J., & Sminchisescu, C. (2012). Semantic segmentation with second-order pooling. In ECCV.
  7. Carreira, J., Li, F., & Sminchisescu, C. (2012). Object recognition by sequential figure-ground ranking. In IJCV.
  8. Carreira, J., Caseiro, R., Batista, J., & Sminchisescu, C. (2012). Semantic segmentation with second-order pooling. In ECCV. Berlin, Heidelberg: Springer.
  9. Criminisi, A., Shotton, J., & Konukoglu, E. (2012). Decision forests: A unified framework for classification, regression, density estimation, manifold learning and semi-supervised learning. Foundations and Trends in Computer Graphics and Vision.
  10. Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. In CVPR.
  11. Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. In CVPR.
  12. Dollár, P., & Zitnick, C. L. (2013). Structured forests for fast edge detection. In ICCV.
  13. Endres, I., Shih, K. J., Jiaa, J., & Hoiem, D. (2013). Learning collections of part models for object recognition. In CVPR.
  14. Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., & Zisserman, A. (2010). The PASCAL Visual Object Classes (VOC) Challenge. In IJCV.
  15. Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., & Zisserman, A. (2012). The PASCAL Visual Object Classes Challenge (VOC2012) Results.
  16. Felzenszwalb, P., Girshick, R., McAllester, D., & Ramanan, D. (2010). Object detection with discriminatively trained part based models. In TPAMI.
  17. Frome, A., Huber, D., Kolluri, R., Bülow, T., & Malik, J. (2004). Recognizing objects in range data using regional point descriptors. In ECCV.
  18. Girshick, R., Donahue, J., Darrell, T., & Malik, J. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR.
  19. Gupta, S., Arbelaez, P., & Malik, J. (2013). Perceptual organization and recognition of indoor scenes from RGB-D images. In CVPR.
  20. Gupta, A., Efros, A., & Hebert, M. (2010). Blocks world revisited: Image understanding using qualitative geometry and mechanics. In ECCV.
  21. Gupta, S., Girshick, R., Arbeláez, P., & Malik, J. (2014). Learning rich features from RGB-D images for object detection and segmentation. In ECCV.
  22. Gupta, A., Satkin, S., Efros, A., & Hebert, M. (2011). From 3D scene geometry to human workspace. In CVPR.
  23. Hedau, V., Hoiem, D., & Forsyth, D. (2012). Recovering free space of indoor scenes from a single image. In CVPR.
  24. Hoiem, D., Efros, A., & Hebert, M. (2007). Recovering surface layout from an image. In IJCV.
  25. Hoiem, D., Efros, A., & Hebert, M. (2011). Recovering occlusion boundaries from an image. In IJCV.
  26. Izadi, S., Kim, D., Hilliges, O., Molyneaux, D., Newcombe, R., Kohli, P., Shotton, J., Hodges, S., Freeman, D., Davison, A., & Fitzgibbon, A. (2011). KinectFusion: Real-time 3D reconstruction and interaction using a moving depth camera. In UIST.
  27. Janoch, A., Karayev, S., Jia, Y., Barron, J. T., Fritz, M., Saenko, K., et al. (2013). A category-level 3D object dataset: Putting the Kinect to work. In Consumer Depth Cameras for Computer Vision (pp. 141–165). Berlin: Springer.
  28. Johnson, A., & Hebert, M. (1999). Using spin images for efficient object recognition in cluttered 3D scenes. In TPAMI.
  29. Kanizsa, G. (1979). Organization in Vision: Essays on Gestalt Perception. New York: Praeger Publishers.
  30. Koppula, H., Anand, A., Joachims, T., & Saxena, A. (2011). Semantic labeling of 3D point clouds for indoor scenes. In NIPS.
  31. Ladicky, L., Russell, C., Kohli, P., & Torr, P. H. S. (2010). Graph cut based inference with co-occurrence statistics. In ECCV.
  32. Lai, K., Bo, L., Ren, X., & Fox, D. (2011). A large-scale hierarchical multi-view RGB-D object dataset. In ICRA.
  33. Lai, K., Bo, L., Ren, X., & Fox, D. (2013). RGB-D object recognition: Features, algorithms, and a large scale benchmark. In A. Fossati, J. Gall, H. Grabner, X. Ren, & K. Konolige (Eds.), Consumer Depth Cameras for Computer Vision: Research Topics and Applications (pp. 167–192). Berlin: Springer.
  34. Lazebnik, S., Schmid, C., & Ponce, J. (2006). Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR.
  35. Lee, D., Gupta, A., Hebert, M., & Kanade, T. (2010). Estimating spatial layout of rooms using volumetric reasoning about objects and surfaces. In NIPS.
  36. Lee, D., Hebert, M., & Kanade, T. (2009). Geometric reasoning for single image structure recovery. In CVPR.
  37. Maji, S., Berg, A. C., & Malik, J. (2013). Efficient classification for additive kernel SVMs. In TPAMI.
  38. Martin, D., Fowlkes, C., & Malik, J. (2004). Learning to detect natural image boundaries using local brightness, color and texture cues. In TPAMI.
  39. Reconstruction Meets Recognition Challenge, ICCV 2013. (2013).
  40. Ren, X., & Bo, L. (2012). Discriminatively trained sparse code gradients for contour detection. In NIPS.
  41. Ren, X., Bo, L., & Fox, D. (2012). RGB-(D) scene labeling: Features and algorithms. In CVPR.
  42. Rusu, R. B., Blodow, N., & Beetz, M. (2009). Fast point feature histograms (FPFH) for 3D registration. In ICRA.
  43. Savitzky, A., & Golay, M. (1964). Smoothing and differentiation of data by simplified least squares procedures. Analytical Chemistry, 36(8), 1627–1639.
  44. Saxena, A., Chung, S., & Ng, A. (2008). 3-D depth reconstruction from a single still image. In IJCV.
  45. Shotton, J., Fitzgibbon, A. W., Cook, M., Sharp, T., Finocchio, M., Moore, R., Kipman, A., & Blake, A. (2011). Real-time human pose recognition in parts from single depth images. In CVPR.
  46. Silberman, N., Hoiem, D., Kohli, P., & Fergus, R. (2012). Indoor segmentation and support inference from RGBD images. In ECCV.
  47. Kim, B.-S., Xu, S., & Savarese, S. (2013). Accurate localization of 3D objects from RGB-D data using segmentation hypotheses. In CVPR.
  48. Tang, S., Wang, X., Lv, X., Han, T. X., Keller, J., He, Z., Skubic, M., & Lao, S. (2012). Histogram of oriented normal vectors for object recognition with a depth sensor. In ACCV.
  49. van de Sande, K. E. A., Gevers, T., & Snoek, C. G. M. (2010). Evaluating color descriptors for object and scene recognition. In TPAMI.
  50. Viola, P., & Jones, M. (2001). Rapid object detection using a boosted cascade of simple features. In CVPR.
  51. Ye, E. S. (2013). Object detection in RGB-D indoor scenes. Master's thesis, EECS Department, University of California, Berkeley.

Copyright information

© Springer Science+Business Media New York 2014

Authors and Affiliations

  • Saurabh Gupta (1)
  • Pablo Arbeláez (1, 2)
  • Ross Girshick (1)
  • Jitendra Malik (1)

  1. University of California at Berkeley, Berkeley, USA
  2. Universidad de los Andes, Bogotá, Colombia
