Integrating Geometrical Context for Semantic Labeling of Indoor Scenes using RGBD Images

Published in: International Journal of Computer Vision

Abstract

Inexpensive structured light sensors can capture rich information from indoor scenes, and scene labeling problems provide a compelling opportunity to make use of this information. In this paper we present a novel conditional random field (CRF) model to effectively utilize depth information for semantic labeling of indoor scenes. At the core of the model, we propose a novel and efficient plane detection algorithm which is robust to erroneous depth maps. Our CRF formulation defines local, pairwise and higher order interactions between image pixels. At the local level, we propose a novel scheme to combine energies derived from appearance, depth and geometry-based cues. The proposed local energy also encodes the location of each object class by considering the approximate geometry of a scene. For the pairwise interactions, we learn a boundary measure which defines the spatial discontinuity of object classes across an image. To model higher-order interactions, the proposed energy treats smooth surfaces as cliques and encourages all the pixels on a surface to take the same label. We show that the proposed higher-order energies can be decomposed into pairwise sub-modular energies and efficient inference can be made using the graph-cuts algorithm. We follow a systematic approach which uses structured learning to fine-tune the model parameters. We rigorously test our approach on SUN3D and both versions of the NYU-Depth database. Experimental results show that our work achieves superior performance to state-of-the-art scene labeling techniques.

Notes

  1. In this work we set \(r=3\) and \({{\varvec{\kappa }}}\) is set to [0.25, 0.75], [0.5, 0.5] and [0.75, 0.25] respectively in each case. This choice is based on the validation set (see Sect. 6.2).

  2. Plane detection code is available at author’s webpage: http://www.csse.uwa.edu.au/~salman.

  3. The development of this section is similar to Kohli et al. (2009). We also used the same notation - wherever possible - to allow the reader to easily sort out differences and commonalities.

References

  • Arbelaez, P., Maire, M., Fowlkes, C., & Malik, J. (2011). Contour detection and hierarchical image segmentation. TPAMI, 33(5), 898–916.

  • Blake, A., Kohli, P., & Rother, C. (2011). Markov random fields for vision and image processing. Cambridge: The MIT Press.

  • Boykov, Y., & Funka-Lea, G. (2006). Graph cuts and efficient nd image segmentation. IJCV, 70(2), 109–131.

  • Boykov, Y., Veksler, O., & Zabih, R. (2001). Fast approximate energy minimization via graph cuts. TPAMI, 23(11), 1222–1239.

  • Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32.

  • Cadena, C., & Košecká, J. (2014). Semantic segmentation with heterogeneous sensor coverages.

  • Carreira, J., & Sminchisescu, C. (2012). Cpmc: Automatic object segmentation using constrained parametric min-cuts. TPAMI, 34(7), 1312–1328.

  • Couprie, C., Farabet, C., Najman, L., & LeCun, Y. (2013). Indoor semantic segmentation using depth information. ICLR.

  • Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2005, vol 1 (pp 886–893).

  • Edwards, W., Miles, R. F, Jr, & Von Winterfeldt, D. (2007). Advances in decision analysis: from foundations to applications. Cambridge: Cambridge University Press.

  • Farabet, C., Couprie, C., Najman, L., & LeCun, Y. (2013). Learning hierarchical features for scene labeling. TPAMI, 35(8), 1915–1929. doi:10.1109/TPAMI.2012.231.

  • Felzenszwalb, P. F., & Huttenlocher, D. P. (2004). Efficient graph-based image segmentation. IJCV, 59(2), 167–181.

  • Fukunaga, K., & Hostetler, L. (1975). The estimation of the gradient of a density function, with applications in pattern recognition. TIT, 21(1), 32–40.

  • Gould, S., Fulton, R., & Koller, D. (2009). Decomposing a scene into geometric and semantically consistent regions. In IEEE ICCV (pp 1–8).

  • Gulshan, V., Rother, C., Criminisi, A., Blake, A., & Zisserman, A. (2010). Geodesic star convexity for interactive image segmentation. In IEEE CVPR (pp 3129–3136).

  • Gupta, S., Arbelaez, P., & Malik, J. (2013). Perceptual organization and recognition of indoor scenes from rgb-d images. In IEEE CVPR (pp. 564–571).

  • Gupta, S., Girshick, R., Arbeláez, P., & Malik, J. (2014). Learning rich features from rgb-d images for object detection and segmentation. In Computer Vision–ECCV 2014 (pp. 345–360). Springer.

  • Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., & Witten, I. H. (2009). The weka data mining software: An update. ACM SIGKDD, 11(1), 10–18.

  • Hayat, M., Bennamoun, M., & An, S. (2015). Deep reconstruction models for image set classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(4), 713–727. doi:10.1109/TPAMI.2014.2353635.

  • He, X., Zemel, R. S., & Carreira-Perpinán, M. A. (2004). Multiscale conditional random fields for image labeling. In IEEE CVPR, vol 2 (pp II–695).

  • Huang, Q., Han, M., Wu, B., & Ioffe, S. (2011). A hierarchical conditional random field model for labeling and segmenting images of street scenes. In IEEE CVPR (pp. 1953–1960).

  • Izadi, S., Kim, D., Hilliges, O., Molyneaux, D., Newcombe, R., Kohli, P., Shotton, J., Hodges, S., Freeman, D., Davison, A., et al (2011). Kinectfusion: real-time 3d reconstruction and interaction using a moving depth camera. In ACM Proceedings of the 24th annual ACM symposium on User interface software and technology (pp. 559–568).

  • Jiang, Y., Lim, M., Zheng, C., & Saxena, A. (2012). Learning to place new objects in a scene. IJRR, 31(9), 1021–1043.

  • Joachims, T., Finley, T., & Yu, C. N. J. (2009). Cutting-plane training of structural svms. JML, 77(1), 27–59.

  • Johnson, A. E., & Hebert, M. (1999). Using spin images for efficient object recognition in cluttered 3d scenes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(5), 433–449.

  • Khan, S., Bennamoun, M., Sohel, F., & Togneri, R. (2014a). Automatic feature learning for robust shadow detection. In IEEE CVPR.

  • Khan, S., He, X., Bennamoun, M., Sohel, F., & Togneri, R. (2015). Separating objects and clutter in indoor scenes. In IEEE CVPR.

  • Khan, S. H., Bennamoun, M., Sohel, F., & Togneri, R. (2014b). Geometry driven semantic labeling of indoor scenes. In Computer Vision–ECCV 2014 (pp. 679–694). Springer.

  • Kohli, P., Kumar, M. P., & Torr, P. H. (2007). P3 & beyond: Solving energies with higher order cliques. In IEEE CVPR (pp. 1–8).

  • Kohli, P., Torr, P. H., et al. (2009). Robust higher order potentials for enforcing label consistency. IJCV, 82(3), 302–324.

  • Koppula, H. S., Anand, A., Joachims, T., & Saxena, A. (2011). Semantic labeling of 3d point clouds for indoor scenes. In NIPS (pp. 244–252).

  • Krähenbühl, P., & Koltun, V. (2011). Efficient inference in fully connected crfs with gaussian edge potentials. In NIPS (pp. 109–117).

  • Ladicky, L., Russell, C., Kohli, P., & Torr, P. H. (2009). Associative hierarchical crfs for object class image segmentation. In IEEE ICCV (pp. 739–746).

  • Ladickỳ, L., Russell, C., Kohli, P., & Torr, P. H. (2013). Inference methods for crfs with co-occurrence statistics. In IJCV (pp. 1–13).

  • Lai, K., Bo, L., Ren, X., & Fox, D. (2011). A large-scale hierarchical multi-view rgb-d object dataset. In IEEE ICRA (pp. 1817–1824).

  • Lempitsky, V., Vedaldi, A., & Zisserman, A. (2011). Pylon model for semantic segmentation. In NIPS (pp. 1485–1493).

  • Li, Y., Tarlow, D., & Zemel, R. (2013). Exploring compositional high order pattern potentials for structured output learning. In IEEE CVPR (pp. 49–56).

  • Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2), 91–110.

  • Ojala, T., Pietikainen, M., & Maenpaa, T. (2002). Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(7), 971–987.

  • Quattoni, A., & Torralba, A. (2009). Recognizing indoor scenes. In CVPR (pp. 413–420). doi:10.1109/CVPR.2009.5206537.

  • Quigley, M., Batra, S., Gould, S., Klingbeil, E., Le, Q., Wellman, A., & Ng, A. Y. (2009). High-accuracy 3d sensing for mobile manipulation: Improving object detection and door opening. In IEEE ICRA (pp. 2816–2822).

  • Rabbani, T., van Den Heuvel, F., & Vosselmann, G. (2006). Segmentation of point clouds using smoothness constraint. IAPR SSIS, 36(5), 248–253.

  • Rao, D., Le, Q. V., Phoka, T., Quigley, M., Sudsang, A., & Ng, A. Y. (2010). Grasping novel objects with depth segmentation. In IEEE IROS (pp. 2578–2585).

  • Ren, X., Bo, L., & Fox, D. (2012). Rgb-(d) scene labeling: Features and algorithms. In IEEE CVPR (pp. 2759–2766).

  • Rother, C., Kolmogorov, V., & Blake, A. (2004). Grabcut: Interactive foreground extraction using iterated graph cuts. TOG, ACM, 23, 309–314.

  • Shotton, J., Winn, J., Rother, C., & Criminisi, A. (2009). Textonboost for image understanding: Multi-class object recognition and segmentation by jointly modeling texture, layout, and context. IJCV, 81(1), 2–23.

  • Silberman, N., & Fergus, R. (2011). Indoor scene segmentation using a structured light sensor. In IEEE ICCV Workshops (pp. 601–608).

  • Silberman, N., Hoiem, D., Kohli, P., & Fergus, R. (2012). Indoor segmentation and support inference from rgbd images. In ECCV (pp. 746–760). Springer.

  • Szummer, M., Kohli, P., & Hoiem, D. (2008). Learning crfs using graph cuts. In ECCV (pp 582–595). Springer.

  • Tsochantaridis, I., Hofmann, T., Joachims, T., & Altun, Y. (2004). Support vector machine learning for interdependent and structured output spaces. In ACM ICML (p 104).

  • Van De Weijer, J., & Schmid, C. (2006). Coloring local feature extraction. In ECCV (pp 334–348). Springer.

  • Von Gioi, R. G., Jakubowicz, J., Morel, J. M., & Randall, G. (2010). Lsd: A fast line segment detector with a false detection control. TPAMI, 32(4), 722–732.

  • Woodford, O. J., Rother, C., & Kolmogorov, V. (2009). A global perspective on map inference for low-level vision. In IEEE ICCV (pp. 2319–2326).

  • Xiao, J., Owens, A., & Torralba, A. (2013). Sun3d: A database of big spaces reconstructed using sfm and object labels. In IEEE ICCV.

  • Xiong, X., & Huber, D. (2010). Using context to create semantic 3d models of indoor environments. In BMVC (pp. 45–1).

Acknowledgments

This research was supported by the IPRS scholarship from The University of Western Australia and the Australian Research Council (ARC) Grants DP110102166, DP150104251 and DE120102960. The authors would especially like to thank the anonymous reviewers and the Associate Editor for their valuable comments and suggestions to improve the quality of the manuscript.

Author information

Corresponding author

Correspondence to Salman H. Khan.

Additional information

Communicated by Derek Hoiem.

Appendix: Disintegration of Higher-Order Energies

In this appendix, we show how the higher-order energies can be minimized using graph cuts. Since graph cuts can efficiently minimize submodular functions, we transform our higher-order energy function (Eq. 9) into a submodular second-order energy function. We explain this transformation, and the computation of optimal moves, for both the \(\alpha \beta \)-swap and \(\alpha \)-expansion move-making algorithms (see Note 3). All previously defined notation is used in the same context, and all newly introduced symbols are defined in this section. The function that counts the (weighted) number of nodes in a clique that take the label \(\ell \) is defined as:

$$\begin{aligned} n_{\ell }({y}_{\mathbf {c}}) = \sum \limits _{i \in {\mathbf {c}}} w_i^{\ell } \mathbf {1}_{y_i = \ell } \end{aligned}$$

The function \( \mathbf {1}_{y_i = \ell }\) is a zero-one indicator function that returns a unit value when \(y_i = \ell \). We suppose here that weights are symmetric for all labels \(\ell \in {\mathcal {L}}\) i.e., \(w_i^{\ell } = w_i\). Further, for our implementation we set \(w_i=1 \;\; \forall i \in {\mathbf {c}}\). This setting satisfies the required constraints for these parameters, i.e.,

$$\begin{aligned} w_i^{\ell } \ge 0 \quad \text {and}\quad \sum \limits _{i \in {\mathbf {c}}} w_i^{\ell } = \# {\mathbf {c}} \;\; \forall \ell \in {\mathcal {L}}. \end{aligned}$$

We define a summation function that adds the weights for a subset \(\mathbf {s}\) of \({\mathbf {c}}\),

$$\begin{aligned} W(\mathbf {s}) = \sum \limits _{i\in \mathbf {s}} w_i^{\ell } = \# \mathbf {s} \quad \forall \ell \in {\mathcal {L}}. \end{aligned}$$
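The counting functions above can be illustrated with a minimal Python sketch. It assumes unit weights (\(w_i = 1\)) as stated in the text; the function names `n_label` and `W`, and the example labels, are hypothetical and chosen only for illustration.

```python
def n_label(y_c, label, w=None):
    """Weighted number of clique nodes whose label equals `label` (n_l(y_c))."""
    # Assumption: unit weights w_i = 1 unless weights are passed explicitly.
    w = [1.0] * len(y_c) if w is None else w
    return sum(w_i for y_i, w_i in zip(y_c, w) if y_i == label)


def W(indices, w=None):
    """Sum of weights over a subset of node indices (W(s)); equals #s for unit weights."""
    indices = list(indices)
    if w is None:
        return float(len(indices))
    return sum(w[i] for i in indices)


# Hypothetical clique of four pixels on one smooth surface.
y_c = ["wall", "wall", "floor", "wall"]
agree = n_label(y_c, "wall")            # nodes taking the label "wall"
disagree = W(range(len(y_c))) - agree   # nodes not taking the label "wall"
```

With unit weights, `W(range(len(y_c)))` is simply the clique size, so the constraint \(\sum _{i \in {\mathbf {c}}} w_i^{\ell } = \# {\mathbf {c}}\) holds by construction.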

1.1 Disintegration of Higher-Order Energies to Second-Order Sub-modular Energies for Swap Moves

Suppose that, in a clique ‘\({\mathbf {c}}\)’, the locations of the active nodes are represented by a set of indices \({\mathbf {c}}_{a}\). The nodes which remain inactive during the move-making process are termed the passive nodes. Their locations are denoted by \(\bar{{\mathbf {c}}}_{a} = {\mathbf {c}}\setminus {\mathbf {c}}_{a}\). The set of moves available to the swap move-making algorithm is encoded in the form of a vector \(\mathbf {t}_{c_a}\). For the sake of a simple demonstration, let us focus on the two-class labeling problem, i.e., \(\ell \in \{0,1\}\). The induced labeling is the combination of the old labeling for the inactive nodes and the new labeling for the active nodes, i.e., \({y}^n_c = {y}^{\circ }_{\bar{c}_{a}} \cup T_{\alpha \beta }({y}^{\circ }_{{c}_{a}}, \mathbf {t}_{c_a})\). If \({y}^n_c\) denotes the new labeling induced by the move \(\mathbf {t}_{c_a}\) and \({y}^{\circ }_c\) denotes the old labeling, we can define the move energy for an \(\alpha \beta \)-swap as:

$$\begin{aligned}&\psi ^m_{{\mathbf {c}}}(\mathbf {t}_{c_a}) = \psi _{{\mathbf {c}}}({y}^n_c) = \psi _{{\mathbf {c}}}({y}^{\circ }_{\bar{c}_{a}} \cup T_{\alpha \beta }({y}^{\circ }_{c_{a}}, \mathbf {t}_{c_a}))\\&\quad = \underset{\ell \in {\mathcal {L}}}{{\text {min}}} \left\{ \lambda _{max} - (\lambda _{max} - \lambda _{\ell })\right. \\&\left. \quad \text {exp} {\left( - \frac{W({\mathbf {c}}) - n_{\ell }({y}^{\circ }_{\bar{c}_{a}} \cup T_{\alpha \beta }({y}^{\circ }_{c_{a}}, \mathbf {t}_{c_a}))}{Q_{\ell }}\right) }\right\} \\&\quad = \underset{\ell \in {\mathcal {L}}}{{\text {min}}} \left\{ \lambda _{max} - (\lambda _{max} - \lambda _{\alpha })\text {exp} { \left( - \frac{W({\mathbf {c}}) - n_{0}^m(\mathbf {t}_{c_a})}{Q_{\alpha }}\right) }, \right. \\&\left. \quad \lambda _{max} - (\lambda _{max} - \lambda _{\beta })\text {exp} {\left( - \frac{W({\mathbf {c}} - {\mathbf {c}}_a) + n_{0}^m(\mathbf {t}_{c_a})}{Q_{\beta }}\right) } \right\} , \end{aligned}$$

where \(W({\mathbf {c}}_a) = n_0^m(\mathbf {t}_{c_a}) + n_1^m(\mathbf {t}_{c_a})\). The minimization operation in the above equation can be replaced by defining a piecewise function:

$$\begin{aligned} \psi _{{\mathbf {c}}}^m(\mathbf {t}_{c_a}) = \left\{ \begin{array}{l} \lambda _{max} - (\lambda _{max} - \lambda _{\alpha })\text {exp} { \left( - \frac{W({\mathbf {c}}) - n_{0}^m(\mathbf {t}_{c_a})}{Q_{\alpha }}\right) } \\ \qquad \text {if}\quad n_{0}^m(\mathbf {t}_{c_a}) > \varrho _{\alpha \beta }\left( \frac{W({\mathbf {c}})}{Q_{\alpha }} - \frac{W({\mathbf {c}} - {\mathbf {c}}_a)}{Q_{\beta }} \right. \\ \qquad \qquad \qquad \qquad \left. - \log \left( \frac{\lambda _{max} - \lambda _{\alpha }}{\lambda _{max} - \lambda _{\beta }}\right) \right) ,\\ \lambda _{max} - (\lambda _{max} - \lambda _{\beta })\text {exp} {\left( - \frac{W({\mathbf {c}} - {\mathbf {c}}_a) + n_{0}^m(\mathbf {t}_{c_a})}{Q_{\beta }}\right) }\\ \qquad \text {if}\quad n_{0}^m(\mathbf {t}_{c_a}) < \varrho _{\alpha \beta }\left( \frac{W({\mathbf {c}})}{Q_{\alpha }} - \frac{W({\mathbf {c}} - {\mathbf {c}}_a)}{Q_{\beta }} \right. \\ \qquad \qquad \qquad \qquad \left. - \log \left( \frac{\lambda _{max} - \lambda _{\alpha }}{\lambda _{max} - \lambda _{\beta }}\right) \right) , \end{array}\right. \end{aligned}$$

where \(\varrho _{\alpha \beta } = \frac{Q_{\alpha }Q_{\beta }}{Q_{\alpha }+Q_{\beta }}\). The function \(n^m_{\ell }(\mathbf {t}_{c_a})\) is defined as:

$$\begin{aligned} n^m_{\ell }(\mathbf {t}_{c_a}) = \sum \limits _{i \in {\mathbf {c}}_{a}}w_i \delta _{\ell }(\mathbf {t}_i). \end{aligned}$$
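As a rough numerical check of the swap-move energy, the sketch below evaluates both the min form and the piecewise form and compares them. It assumes unit weights, reads \(n_0^m\) as the number of active nodes assigned to \(\alpha \), and requires \(\lambda _{max}\) to be strictly larger than \(\lambda _{\alpha }\) and \(\lambda _{\beta }\) so that the logarithm is defined. The function names and all numeric parameter values are hypothetical and are not taken from the paper.

```python
import math


def swap_terms(n0, W_c, W_passive, lam_max, lam_a, lam_b, Q_a, Q_b):
    """Return the alpha and beta terms of the swap-move energy."""
    e_alpha = lam_max - (lam_max - lam_a) * math.exp(-(W_c - n0) / Q_a)
    e_beta = lam_max - (lam_max - lam_b) * math.exp(-(W_passive + n0) / Q_b)
    return e_alpha, e_beta


def swap_energy_min(n0, W_c, W_passive, lam_max, lam_a, lam_b, Q_a, Q_b):
    """Min form: the smaller of the two label terms."""
    return min(swap_terms(n0, W_c, W_passive, lam_max, lam_a, lam_b, Q_a, Q_b))


def swap_energy_piecewise(n0, W_c, W_passive, lam_max, lam_a, lam_b, Q_a, Q_b):
    """Piecewise form: pick the branch by comparing n0 against the threshold."""
    rho = Q_a * Q_b / (Q_a + Q_b)
    threshold = rho * (W_c / Q_a - W_passive / Q_b
                       - math.log((lam_max - lam_a) / (lam_max - lam_b)))
    e_alpha, e_beta = swap_terms(n0, W_c, W_passive, lam_max, lam_a, lam_b, Q_a, Q_b)
    return e_alpha if n0 > threshold else e_beta


# Hypothetical clique: 10 nodes, 6 active and 4 passive, illustrative parameters.
params = dict(W_c=10.0, W_passive=4.0, lam_max=1.0, lam_a=0.2, lam_b=0.4,
              Q_a=3.0, Q_b=2.0)
gaps = [abs(swap_energy_min(n0, **params) - swap_energy_piecewise(n0, **params))
        for n0 in range(7)]  # n0 ranges over 0..W(c_a)
```

The threshold follows from equating the two exponential terms and solving for \(n_0^m\); at the threshold itself both branches give the same value, so the choice of branch there does not matter.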

From Theorem 1 in Kohli et al. (2009), the energy defined above can be transformed into a submodular quadratic pseudo-boolean function with two binary meta variables. In this form, the \(\alpha \beta \)-swap algorithm can be used to minimize the energy function.

1.2 Disintegration of Higher-Order Energies to Second-Order Sub-modular Energies for Expansion Moves

Suppose that, in a clique ‘\({\mathbf {c}}\)’, the locations of the nodes with label \(\ell \) are represented by a set of indices \({\mathbf {c}}_{\ell }\). The current labeling solution is denoted by \({y}_{{\mathbf {c}}}^{\circ }\).

Suppose that a dominant label \(d \in {\mathcal {L}}\) exists in the current labeling \({y}_{{\mathbf {c}}}^{\circ }\), i.e.,

$$\begin{aligned} W({\mathbf {c}}_d) > W({\mathbf {c}}) - Q_d \quad \text {where} \;d \ne \alpha . \end{aligned}$$

There can be at most one such dominant label, provided that

$$\begin{aligned} Q_a + Q_b < W({\mathbf {c}}) \qquad \forall a \ne b \in {\mathcal {L}}. \end{aligned}$$

The energy of an \(\alpha \)-expansion move \(t_c\) is then:

$$\begin{aligned} \begin{array}{l} \psi _{{\mathbf {c}}}^{m}(t_c) = \psi _{{\mathbf {c}}} (T_{\alpha }({y}_c^{\circ }, t_c)) \\ = \underset{\ell \in {\mathcal {L}}}{{\text {min}}} \left\{ \lambda _{max} - (\lambda _{max} - \lambda _{\alpha }) {\text {exp}} \left( - \frac{\sum \limits _{i\in c} w_i t_i}{Q_{\alpha }}\right) , \right. \\ \left. \lambda _{max} - (\lambda _{max} - \lambda _{d}) \text {exp} \left( - \frac{W({\mathbf {c}}) - \sum \limits _{i\in c} w_i t_i}{Q_{d}}\right) \right\} . \end{array} \end{aligned}$$
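The dominant-label condition and the constraint on the truncation parameters can be checked with a short Python sketch. It assumes unit weights, so that \(W({\mathbf {c}}_d)\) is the number of clique nodes currently labeled \(d\); the function names, the labels and the values of \(Q_{\ell }\) are hypothetical.

```python
from collections import Counter
from itertools import combinations


def truncation_constraint_holds(Q, W_c):
    """Check Q_a + Q_b < W(c) for every pair of distinct labels."""
    return all(Q[a] + Q[b] < W_c for a, b in combinations(Q, 2))


def dominant_labels(y_c, Q, alpha):
    """Labels d != alpha with W(c_d) > W(c) - Q_d, assuming unit weights."""
    W_c = len(y_c)
    counts = Counter(y_c)
    return [d for d, n_d in counts.items() if d != alpha and n_d > W_c - Q[d]]


# Hypothetical clique of 10 nodes and illustrative truncation parameters.
y_c = ["wall"] * 7 + ["floor"] * 2 + ["bed"]
Q = {"wall": 4, "floor": 4, "bed": 4}

ok = truncation_constraint_holds(Q, len(y_c))   # 4 + 4 < 10 for every pair
dominant = dominant_labels(y_c, Q, alpha="bed")  # only "wall" exceeds 10 - 4
```

When the constraint holds, two distinct labels cannot both satisfy the dominance inequality: adding the two inequalities would give a combined count larger than \(W({\mathbf {c}})\), which is impossible. The list returned by `dominant_labels` therefore has at most one entry in that case.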

The minimization operator in the above function can be replaced by a piecewise function:

$$\begin{aligned} \psi _{{\mathbf {c}}}^m(\mathbf {t}_{c}, \mathbf {t}_{c_d}) = \left\{ \begin{array}{l} \lambda _{max} - (\lambda _{max} - \lambda _{\alpha })\text {exp} {\left( - \frac{n_{0}^m(\mathbf {t}_{c})}{Q_{\alpha }}\right) } \\ \qquad \text {if}\quad n_{0}^m(\mathbf {t}_{c}) > \varrho _{\alpha d}\left( \frac{W({\mathbf {c}})}{Q_{\alpha }} \right. \\ \qquad \qquad \qquad \qquad \left. - \log \left( \frac{\lambda _{max} - \lambda _{\alpha }}{\lambda _{max} - \lambda _{d}}\right) \right) , \\ \lambda _{max} - (\lambda _{max} - \lambda _{d})\text {exp} {\left( - \frac{W({\mathbf {c}}) - n_{0}^m(\mathbf {t}_{c_d})}{Q_{d}}\right) }\\ \qquad \text {if}\quad n_{0}^m(\mathbf {t}_{c}) < \varrho _{\alpha d}\left( \frac{W({\mathbf {c}})}{Q_{\alpha }} \right. \\ \qquad \qquad \qquad \qquad \left. - \log \left( \frac{\lambda _{max} - \lambda _{\alpha }}{\lambda _{max} - \lambda _{d}}\right) \right) , \end{array}\right. \end{aligned}$$

where \(\varrho _{\alpha d} = \frac{Q_{\alpha }Q_{d}}{Q_{\alpha }+Q_{d}}\) and the function \(n^m_{\ell }(\mathbf {t}_c)\) is defined as:

$$\begin{aligned} n^m_{\ell }(\mathbf {t}_c) = \sum \limits _{i\in {\mathbf {c}}}w_i \delta _{\ell }(\mathbf {t}_i). \end{aligned}$$

From Theorem 2 in Kohli et al. (2009), the energy defined above can be transformed into a submodular quadratic pseudo-boolean function with two binary meta variables. In this form, the \(\alpha \)-expansion algorithm can be used to minimize the energy function.

Cite this article

Khan, S.H., Bennamoun, M., Sohel, F. et al. Integrating Geometrical Context for Semantic Labeling of Indoor Scenes using RGBD Images. Int J Comput Vis 117, 1–20 (2016). https://doi.org/10.1007/s11263-015-0843-8

  • DOI: https://doi.org/10.1007/s11263-015-0843-8
