Abstract
Digitalizing indoor scenes into a 3D virtual world enables people to visit and roam in their daily-life environments through remote devices. However, reconstructing indoor geometry with enriched semantics (e.g. the room layout, object categories and support relationships) requires computers to parse and holistically understand the scene context, which is challenging given the complexity and clutter of our living surroundings. With the rapid development of deep learning techniques, modeling indoor scenes from a single RGB image has become feasible. In this chapter, we introduce an automatic method for semantic indoor scene modeling based on deep convolutional features. Specifically, we decompose indoor scene modeling into a hierarchy of scene understanding subtasks that parse semantic and geometric contents from scene images (i.e. object masks, the scene depth map and the room layout). On top of these semantic and geometric contents, we deploy data-driven support relation inference to estimate the physical contact between indoor objects. Under this support context, we adopt an image-CAD matching strategy that retrieves an indoor scene by proceeding from global search to local fine-tuning. The experiments show that this method can retrieve CAD models efficiently with enriched semantics, and demonstrate its feasibility in handling severe object occlusions.
References
Antol S, Agrawal A, Lu J, Mitchell M, Batra D, Lawrence Zitnick C, Parikh D (2015) VQA: visual question answering. In: Proceedings of the IEEE international conference on computer vision, pp 2425–2433
Chen K, Lai Y-K, Hu S-M (2015) 3D indoor scene modeling from RGB-D data: a survey. Comput Visual Media 1(4):267–278
Eigen D, Fergus R (2015) Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In: Proceedings of the IEEE international conference on computer vision, pp 2650–2658
Hariharan B, Arbeláez P, Girshick R, Malik J (2014) Simultaneous detection and segmentation. In: European conference on computer vision. Springer, pp 297–312
Hedau V, Hoiem D, Forsyth D (2009) Recovering the spatial layout of cluttered rooms. In: 2009 IEEE 12th international conference on computer vision. IEEE, pp 1849–1856
Hua B-S, Pham Q-H, Nguyen DT, Tran M-K, Yu L-F, Yeung S-K (2016) SceneNN: a scene meshes dataset with annotations. In: 2016 fourth international conference on 3D vision (3DV). IEEE, pp 92–101
Huang S, Qi S, Zhu Y, Xiao Y, Xu Y, Zhu S-C (2018) Holistic 3D scene parsing and reconstruction from a single RGB image. In: Proceedings of the European conference on computer vision (ECCV), pp 187–203
Izadinia H, Shan Q, Seitz SM (2017) IM2CAD. In: 2017 IEEE conference on computer vision and pattern recognition (CVPR). IEEE, pp 2422–2431
Jia Z, Gallagher A, Saxena A, Chen T (2013) 3D-based reasoning with blocks, support, and stability. In: 2013 IEEE conference on computer vision and pattern recognition, pp 1–8
Jones DR, Perttunen CD, Stuckman BE (1993) Lipschitzian optimization without the Lipschitz constant. J Optim Theory Appl 79(1):157–181
Konolige K, Mihelich P (2011) Technical description of kinect calibration. http://www.ros.org/wiki/kinect_calibration/technical
Laina I, Rupprecht C, Belagiannis V, Tombari F, Navab N (2016) Deeper depth prediction with fully convolutional residual networks. In: 2016 fourth international conference on 3D vision (3DV). IEEE, pp 239–248
Li Y, Qi H, Dai J, Ji X, Wei Y (2016) Fully convolutional instance-aware semantic segmentation. arXiv:1611.07709
Liu M, Guo Y, Wang J (2017) Indoor scene modeling from a single image using normal inference and edge features. Visual Comput 1–14
Liu T, Chaudhuri S, Kim VG, Huang Q, Mitra NJ, Funkhouser T (2014) Creating consistent scene graphs using a probabilistic grammar. ACM Trans Graph (TOG) 33(6):211
Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3431–3440
Mallya A, Lazebnik S (2015) Learning informative edge maps for indoor scene layout prediction. In: Proceedings of the IEEE international conference on computer vision, pp 936–944
Newcombe RA, Izadi S, Hilliges O, Molyneaux D, Kim D, Davison AJ, Kohi P, Shotton J, Hodges S, Fitzgibbon A (2011) KinectFusion: real-time dense surface mapping and tracking. In: 2011 10th IEEE international symposium on mixed and augmented reality (ISMAR). IEEE, pp 127–136
Nie Y, Chang J, Chaudhry E, Guo S, Smart A, Zhang JJ (2018) Semantic modeling of indoor scenes with support inference from a single photograph. Comput Animat Virtual Worlds 29(3–4)
Nie Y, Guo S, Chang J, Han X, Huang J, Hu S-M, Zhang JJ (2020) Shallow2Deep: indoor scene modeling by single image understanding. Pattern Recognit 103
Powell MJ (2009) The BOBYQA algorithm for bound constrained optimization without derivatives. Cambridge NA report NA2009/06, University of Cambridge, Cambridge
Ren S, He K, Girshick R, Sun J (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In: Advances in neural information processing systems, pp 91–99
Rother C, Kolmogorov V, Blake A (2004) GrabCut: interactive foreground extraction using iterated graph cuts. ACM Trans Graph (TOG) 23:309–314
Salas-Moreno RF, Newcombe RA, Strasdat H, Kelly PH, Davison AJ (2013) SLAM++: simultaneous localisation and mapping at the level of objects. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1352–1359
Santoro A, Raposo D, Barrett DG, Malinowski M, Pascanu R, Battaglia P, Lillicrap T (2017) A simple neural network module for relational reasoning. In: Advances in neural information processing systems, pp 4967–4976
Silberman N, Hoiem D, Kohli P, Fergus R (2012) Indoor segmentation and support inference from RGBD images. In: Computer vision-ECCV 2012, pp 746–760
Song S, Lichtenberg SP, Xiao J (2015) SUN RGB-D: an RGB-D scene understanding benchmark suite. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 567–576
Xu K, Chen K, Fu H, Sun W-L, Hu S-M (2013) Sketch2Scene: sketch-based co-retrieval and co-placement of 3D models. ACM Trans Graph (TOG) 32(4):123
Xue F, Xu S, He C, Wang M, Hong R (2015) Towards efficient support relation extraction from RGBD images. Inf Sci 320:320–332
Zhang Y, Liu Z, Miao Z, Wu W, Liu K, Sun Z (2015) Single image-based data-driven indoor scene modeling. Comput Graph 53:210–223
Zheng B, Zhao Y, Yu J, Ikeuchi K, Zhu S-C (2015) Scene understanding by reasoning stability and safety. Int J Comput Vision 112(2):221–238
Acknowledgements
The research leading to these results has been partially supported by the VISTA AR project (funded by the Interreg France (Channel) England, ERDF), the China Scholarship Council and Bournemouth University.
Appendix: Parameter Decision
In image segmentation and room corner searching, the training configurations and parameter setups follow Li et al. (2016) and Nie et al. (2018). In object modeling, we set \(\mathbf {d}_{1}=[0.5, 0.5, 0.5]^{\text {T}}\) (in meters, the same below) for normal objects. For objects supported by a wall, \(\mathbf {d}_{1}\) is set to \([0.2, \infty , \infty ]^{\text {T}}\) or \([\infty , 0.2, \infty ]^{\text {T}}\), depending on the orientation of the wall. \(\mathbf {d}_{2}\) is set to \([1.0, 1.0, 0.5]^{\text {T}}\) because the point cloud is noisier in the horizontal plane than in the vertical direction (see Fig. 8.8). For model scales, we set \(\rho _{1}^{\text {L}}=\rho _{2}^{\text {L}}=\rho _{3}^{\text {L}}=0.8\), \(\rho _{1}^{\text {U}}=\rho _{2}^{\text {U}}=1.2\), and \(\rho _{3}^{\text {U}}=1.0\). For objects whose top part is occluded, the point cloud could underestimate the model height. We hence change the lower bounds to \(\rho _{1}^{\text {L}}=\rho _{2}^{\text {L}}=\rho _{3}^{\text {L}}=1.0\), and the upper bounds to \(\rho _{1}^{\text {U}}=\rho _{2}^{\text {U}}=\rho _{3}^{\text {U}}=2.0\) or more. In global searching, the maximal number of iterations is limited to 50. In local matching, we generally do not set a maximal iteration number, to ensure convergence; the only stopping criterion is an absolute tolerance of \(10^{-3}\).
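The global-then-local scheme above can be sketched in a few lines. The snippet below is a minimal, self-contained illustration, not the chapter's implementation: it substitutes a budgeted random global search (capped at 50 evaluations, as in the global stage) followed by a bound-constrained coordinate-descent refinement that stops at an absolute tolerance of \(10^{-3}\). The objective `fit_error` and its target scales are hypothetical stand-ins for the actual image-CAD matching cost.

```python
import random

def fit_error(scales):
    # Hypothetical matching cost: squared distance to an assumed
    # "true" scale vector, standing in for the image-CAD error.
    target = [0.93, 1.07, 0.88]
    return sum((s - t) ** 2 for s, t in zip(scales, target))

def global_search(lower, upper, max_iters=50, seed=0):
    # Global stage: best of at most 50 random samples inside the bounds.
    rng = random.Random(seed)
    best_x, best_f = None, float("inf")
    for _ in range(max_iters):
        x = [rng.uniform(lo, hi) for lo, hi in zip(lower, upper)]
        f = fit_error(x)
        if f < best_f:
            best_x, best_f = x, f
    return best_x

def local_refine(x, lower, upper, tol=1e-3):
    # Local stage: coordinate descent with a shrinking step; no
    # iteration cap, stopping once the step falls below the tolerance.
    step = 0.1
    while step > tol:
        improved = False
        for i in range(len(x)):
            for delta in (step, -step):
                cand = list(x)
                cand[i] = min(max(cand[i] + delta, lower[i]), upper[i])
                if fit_error(cand) < fit_error(x):
                    x, improved = cand, True
        if not improved:
            step /= 2.0
    return x

# Scale bounds for a normal object: rho^L = 0.8, rho^U = [1.2, 1.2, 1.0].
lower = [0.8, 0.8, 0.8]
upper = [1.2, 1.2, 1.0]
x = local_refine(global_search(lower, upper), lower, upper)
```

In the chapter's actual pipeline the global and local stages correspond to derivative-free optimizers (cf. Jones et al. 1993; Powell 2009) rather than this toy search, but the control flow (bounded global exploration, then tolerance-terminated local polishing) is the same.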
Copyright information
© 2021 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this chapter
Nie, Y., Chang, J., Zhang, J.J. (2021). Content-Aware Semantic Indoor Scene Modeling from a Single Image. In: Thalmann, N.M., Zhang, J.J., Ramanathan, M., Thalmann, D. (eds) Intelligent Scene Modeling and Human-Computer Interaction. Human–Computer Interaction Series. Springer, Cham. https://doi.org/10.1007/978-3-030-71002-6_8
Print ISBN: 978-3-030-71001-9
Online ISBN: 978-3-030-71002-6