
Holistic indoor scene understanding by context-supported instance segmentation


We propose a new pipeline that uses pixel-level labeling information for instance-level object detection in indoor scenes from RGB-D data. Semantic labeling and instance segmentation are two different paradigms for indoor scene understanding that are usually carried out separately and independently. We integrate the two tasks synergistically to exploit their complementary nature for comprehensive scene understanding. Our approach can build on any deep learning network used for semantic labeling by treating the intermediate layer as the category-wise local detection output. From this output, instance segmentation is optimized by jointly considering the spatial fitness and the relational context encoded by three graphical models: the vertical placement model (VPM), the horizontal placement model (HPM) and the non-placement model (NPM). VPM, HPM and NPM represent three common but distinct indoor object placement configurations: vertical, horizontal and hanging relationships, respectively. Experimental results on two standard RGB-D datasets show that our method can significantly improve small object segmentation while achieving overall performance that is competitive with state-of-the-art methods.






This work is supported in part by the US National Institutes of Health (NIH) Grant R15 AG061833 and the Oklahoma Center for the Advancement of Science and Technology (OCAST) Health Research Grant HR18-069.

Author information


Corresponding author

Correspondence to Guoliang Fan.


About this article


Cite this article

Guo, L., Fan, G. Holistic indoor scene understanding by context-supported instance segmentation. Multimed Tools Appl 81, 35751–35773 (2022).
