Abstract
Inexpensive structured light sensors can capture rich information from indoor scenes, and scene labeling problems provide a compelling opportunity to make use of this information. In this paper we present a novel conditional random field (CRF) model to effectively utilize depth information for semantic labeling of indoor scenes. At the core of the model, we propose a novel and efficient plane detection algorithm which is robust to erroneous depth maps. Our CRF formulation defines local, pairwise and higher-order interactions between image pixels. At the local level, we propose a novel scheme to combine energies derived from appearance, depth and geometry-based cues. The proposed local energy also encodes the location of each object class by considering the approximate geometry of a scene. For the pairwise interactions, we learn a boundary measure which defines the spatial discontinuity of object classes across an image. To model higher-order interactions, the proposed energy treats smooth surfaces as cliques and encourages all the pixels on a surface to take the same label. We show that the proposed higher-order energies can be decomposed into pairwise submodular energies and that efficient inference can be performed using the graph-cuts algorithm. We follow a systematic approach which uses structured learning to fine-tune the model parameters. We rigorously test our approach on SUN3D and both versions of the NYU-Depth database. Experimental results show that our work achieves superior performance to state-of-the-art scene labeling techniques.
Notes
In this work we set \(r=3\) and \({{\varvec{\kappa }}}\) is set to [0.25, 0.75], [0.5, 0.5] and [0.75, 0.25] respectively in each case. This choice is based on the validation set (see Sect. 6.2).
Plane detection code is available at author’s webpage: http://www.csse.uwa.edu.au/~salman.
The development of this section is similar to Kohli et al. (2009). We also use the same notation wherever possible, so that the reader can easily identify the differences and commonalities.
References
Arbelaez, P., Maire, M., Fowlkes, C., & Malik, J. (2011). Contour detection and hierarchical image segmentation. TPAMI, 33(5), 898–916.
Blake, A., Kohli, P., & Rother, C. (2011). Markov random fields for vision and image processing. Cambridge: The MIT Press.
Boykov, Y., & Funka-Lea, G. (2006). Graph cuts and efficient nd image segmentation. IJCV, 70(2), 109–131.
Boykov, Y., Veksler, O., & Zabih, R. (2001). Fast approximate energy minimization via graph cuts. TPAMI, 23(11), 1222–1239.
Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32.
Cadena, C., & Košecká, J. (2014). Semantic segmentation with heterogeneous sensor coverages.
Carreira, J., & Sminchisescu, C. (2012). Cpmc: Automatic object segmentation using constrained parametric min-cuts. TPAMI, 34(7), 1312–1328.
Couprie, C., Farabet, C., Najman, L., & LeCun, Y. (2013). Indoor semantic segmentation using depth information. ICLR.
Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2005, vol 1 (pp 886–893).
Edwards, W., Miles, R. F, Jr, & Von Winterfeldt, D. (2007). Advances in decision analysis: from foundations to applications. Cambridge: Cambridge University Press.
Farabet, C., Couprie, C., Najman, L., & LeCun, Y. (2013). Learning hierarchical features for scene labeling. TPAMI, 35(8), 1915–1929. doi:10.1109/TPAMI.2012.231.
Felzenszwalb, P. F., & Huttenlocher, D. P. (2004). Efficient graph-based image segmentation. IJCV, 59(2), 167–181.
Fukunaga, K., & Hostetler, L. (1975). The estimation of the gradient of a density function, with applications in pattern recognition. TIT, 21(1), 32–40.
Gould, S., Fulton, R., & Koller, D. (2009). Decomposing a scene into geometric and semantically consistent regions. In IEEE ICCV (pp 1–8).
Gulshan, V., Rother, C., Criminisi, A., Blake, A., & Zisserman, A. (2010). Geodesic star convexity for interactive image segmentation. In IEEE CVPR (pp 3129–3136).
Gupta, S., Arbelaez, P., & Malik, J. (2013). Perceptual organization and recognition of indoor scenes from rgb-d images. In IEEE CVPR (pp. 564–571).
Gupta, S., Girshick, R., Arbeláez, P., & Malik, J. (2014). Learning rich features from rgb-d images for object detection and segmentation. In Computer Vision–ECCV 2014 (pp. 345–360). Springer.
Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., & Witten, I. H. (2009). The weka data mining software: An update. ACM SIGKDD, 11(1), 10–18.
Hayat, M., Bennamoun, M., & An, S. (2015). Deep reconstruction models for image set classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(4), 713–727. doi:10.1109/TPAMI.2014.2353635.
He, X., Zemel, R. S., & Carreira-Perpinán, M. A. (2004). Multiscale conditional random fields for image labeling. In IEEE CVPR, vol 2 (pp II–695).
Huang, Q., Han, M., Wu, B., & Ioffe, S. (2011). A hierarchical conditional random field model for labeling and segmenting images of street scenes. In IEEE CVPR (pp. 1953–1960).
Izadi, S., Kim, D., Hilliges, O., Molyneaux, D., Newcombe, R., Kohli, P., Shotton, J., Hodges, S., Freeman, D., Davison, A., et al (2011). Kinectfusion: real-time 3d reconstruction and interaction using a moving depth camera. In ACM Proceedings of the 24th annual ACM symposium on User interface software and technology (pp. 559–568).
Jiang, Y., Lim, M., Zheng, C., & Saxena, A. (2012). Learning to place new objects in a scene. IJRR, 31(9), 1021–1043.
Joachims, T., Finley, T., & Yu, C. N. J. (2009). Cutting-plane training of structural svms. JML, 77(1), 27–59.
Johnson, A. E., & Hebert, M. (1999). Using spin images for efficient object recognition in cluttered 3d scenes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(5), 433–449.
Khan, S., Bennamoun, M., Sohel, F., & Togneri, R. (2014a). Automatic feature learning for robust shadow detection. In IEEE CVPR.
Khan, S., He, X., Bennamoun, M., Sohel, F., & Togneri, R. (2015). Separating objects and clutter in indoor scenes. In IEEE CVPR.
Khan, S. H., Bennamoun, M., Sohel, F., & Togneri, R. (2014b). Geometry driven semantic labeling of indoor scenes. In Computer Vision–ECCV 2014 (pp. 679–694). Springer.
Kohli, P., Kumar, M. P., & Torr, P. H. (2007). P3 & beyond: Solving energies with higher order cliques. In IEEE CVPR (pp. 1–8).
Kohli, P., Torr, P. H., et al. (2009). Robust higher order potentials for enforcing label consistency. IJCV, 82(3), 302–324.
Koppula, H. S., Anand, A., Joachims, T., & Saxena, A. (2011). Semantic labeling of 3d point clouds for indoor scenes. In NIPS (pp. 244–252).
Krähenbühl, P., & Koltun, V. (2011). Efficient inference in fully connected crfs with gaussian edge potentials. In NIPS (pp. 109–117).
Ladicky, L., Russell, C., Kohli, P., & Torr, P. H. (2009). Associative hierarchical crfs for object class image segmentation. In IEEE ICCV (pp. 739–746).
Ladickỳ, L., Russell, C., Kohli, P., & Torr, P. H. (2013). Inference methods for crfs with co-occurrence statistics. In IJCV (pp. 1–13).
Lai, K., Bo, L., Ren, X., & Fox, D. (2011). A large-scale hierarchical multi-view rgb-d object dataset. In IEEE ICRA (pp. 1817–1824).
Lempitsky, V., Vedaldi, A., & Zisserman, A. (2011). Pylon model for semantic segmentation. In NIPS (pp. 1485–1493).
Li, Y., Tarlow, D., & Zemel, R. (2013). Exploring compositional high order pattern potentials for structured output learning. In IEEE CVPR (pp. 49–56).
Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2), 91–110.
Ojala, T., Pietikainen, M., & Maenpaa, T. (2002). Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(7), 971–987.
Quattoni, A., & Torralba, A. (2009). Recognizing indoor scenes. In CVPR (pp. 413–420). doi:10.1109/CVPR.2009.5206537.
Quigley, M., Batra, S., Gould, S., Klingbeil, E., Le, Q., Wellman, A., & Ng, A. Y. (2009). High-accuracy 3d sensing for mobile manipulation: Improving object detection and door opening. In IEEE ICRA (pp. 2816–2822).
Rabbani, T., van Den Heuvel, F., & Vosselmann, G. (2006). Segmentation of point clouds using smoothness constraint. IAPR SSIS, 36(5), 248–253.
Rao, D., Le, Q. V., Phoka, T., Quigley, M., Sudsang, A., & Ng, A. Y. (2010). Grasping novel objects with depth segmentation. In IEEE IROS (pp. 2578–2585).
Ren, X., Bo, L., & Fox, D. (2012). Rgb-(d) scene labeling: Features and algorithms. In IEEE CVPR (pp. 2759–2766).
Rother, C., Kolmogorov, V., & Blake, A. (2004). Grabcut: Interactive foreground extraction using iterated graph cuts. TOG, ACM, 23, 309–314.
Shotton, J., Winn, J., Rother, C., & Criminisi, A. (2009). Textonboost for image understanding: Multi-class object recognition and segmentation by jointly modeling texture, layout, and context. IJCV, 81(1), 2–23.
Silberman, N., & Fergus, R. (2011). Indoor scene segmentation using a structured light sensor. In IEEE ICCV Workshops (pp. 601–608).
Silberman, N., Hoiem, D., Kohli, P., & Fergus, R. (2012). Indoor segmentation and support inference from rgbd images. In ECCV (pp. 746–760). Springer.
Szummer, M., Kohli, P., & Hoiem, D. (2008). Learning crfs using graph cuts. In ECCV (pp 582–595). Springer.
Tsochantaridis, I., Hofmann, T., Joachims, T., & Altun, Y. (2004). Support vector machine learning for interdependent and structured output spaces. In ACM ICML (p 104).
Van De Weijer, J., & Schmid, C. (2006). Coloring local feature extraction. In ECCV (pp. 334–348). Springer.
Von Gioi, R. G., Jakubowicz, J., Morel, J. M., & Randall, G. (2010). Lsd: A fast line segment detector with a false detection control. TPAMI, 32(4), 722–732.
Woodford, O. J., Rother, C., & Kolmogorov, V. (2009). A global perspective on map inference for low-level vision. In IEEE ICCV (pp. 2319–2326).
Xiao, J., Owens, A., & Torralba, A. (2013). Sun3d: A database of big spaces reconstructed using sfm and object labels. In IEEE ICCV.
Xiong, X., & Huber, D. (2010). Using context to create semantic 3d models of indoor environments. In BMVC (pp. 45–1).
Acknowledgments
This research was supported by the IPRS scholarship from The University of Western Australia and the Australian Research Council (ARC) Grants DP110102166, DP150104251 and DE120102960. The authors would especially like to thank the anonymous reviewers and the Associate Editor for their valuable comments and suggestions to improve the quality of the manuscript.
Additional information
Communicated by Derek Hoiem.
Appendix: Disintegration of Higher-Order Energies
In this appendix, we show how the higher-order energies can be minimized using graph cuts. Since graph cuts can efficiently minimize submodular functions, we transform our higher-order energy function (Eq. 9) into a submodular second-order energy function. We explain this transformation, and the computation of the optimal moves, for both the \(\alpha \beta \)-swap and the \(\alpha \)-expansion move-making algorithms (see footnote 3). All previously defined notation is used in the same context, and all newly introduced symbols are defined in this section. The function that accounts for the number of disagreeing nodes in a clique is defined as:
The function \( \mathbf {1}_{y_i = \ell }\) is a zero-one indicator function that returns 1 when \(y_i = \ell \) and 0 otherwise. We assume here that the weights are symmetric for all labels \(\ell \in {\mathcal {L}}\), i.e., \(w_i^{\ell } = w_i\). Further, in our implementation we set \(w_i=1 \;\; \forall i \in {\mathbf {c}}\). This setting satisfies the required constraints for these parameters, i.e.,
We define a summation function that adds the weights for a subset \(\mathbf {s}\) of \({\mathbf {c}}\),
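The weighted label count and the subset summation described above can be illustrated with a minimal Python sketch. The function names `weighted_label_count` and `subset_weight`, the example labels, and the reading of "disagreeing nodes" as the total clique weight minus the weight carried by a chosen label are assumptions made for illustration; they are not taken from the paper's equations.

```python
from typing import Hashable, Iterable, Sequence


def weighted_label_count(labels: Sequence[Hashable],
                         weights: Sequence[float],
                         label: Hashable) -> float:
    """Sum w_i over clique nodes whose label equals `label` (indicator 1_{y_i = label})."""
    return sum(w for y, w in zip(labels, weights) if y == label)


def subset_weight(weights: Sequence[float], subset: Iterable[int]) -> float:
    """W(s): add up the weights of the nodes indexed by the subset s of the clique."""
    return sum(weights[i] for i in subset)


# Hypothetical clique of four pixels; w_i = 1 for every node, as stated in the text.
clique_labels = ["wall", "wall", "floor", "wall"]
clique_weights = [1.0] * len(clique_labels)

n_wall = weighted_label_count(clique_labels, clique_weights, "wall")
total = subset_weight(clique_weights, range(len(clique_labels)))

# Assumed reading: nodes disagreeing with "wall" carry the remaining weight.
disagreeing = total - n_wall
```

With unit weights, `subset_weight` simply returns the size of the subset, so the weighted counts reduce to plain node counts.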
1.1 Disintegration of Higher-Order Energies to Second-Order Sub-modular Energies for Swap Moves
Suppose that, in a clique ‘\({\mathbf {c}}\)’, the locations of the active nodes are represented by a set of indices \({\mathbf {c}}_{a}\). The nodes which remain inactive during the move-making process are termed the passive nodes, and their locations are denoted by \(\bar{{\mathbf {c}}}_{a} = {\mathbf {c}}\setminus {\mathbf {c}}_{a}\). The moves available to the swap move-making algorithm are encoded as a vector \(\mathbf {t}_{c_a}\). For the sake of a simple demonstration, let us focus on the two-class labeling problem, i.e., \(\ell \in \{0,1\}\). The induced labeling combines the old labeling of the inactive nodes with the new labeling of the active nodes, i.e., \({y}^n_c = {y}^{\circ }_{\bar{c}_{a}} \cup T_{\alpha \beta }({y}^{\circ }_{{c}_{a}}, \mathbf {t}_{c_a})\). If \({y}^n_c\) denotes the new labeling induced by the move \(\mathbf {t}_{c_a}\) and \({y}^{\circ }_c\) denotes the old labeling, we can define the energy of a move for an \(\alpha \beta \)-swap as:
where \(W({\mathbf {c}}_a) = n_0^m(\mathbf {t}_{c_a}) + n_1^m(\mathbf {t}_{c_a})\). The minimization operation in the above equation can be replaced by defining a piecewise function:
where \(\varrho _{\alpha \beta } = \frac{Q_{\alpha }Q_{\beta }}{Q_{\alpha }+Q_{\beta }}\). The function \(n^m_{\ell }(\mathbf {t}_{c_a})\) is defined as:
From Theorem 1 in Kohli et al. (2009), the energy defined above can be transformed into a submodular quadratic pseudo-boolean function with two binary meta-variables. In this form, the \(\alpha \beta \)-swap algorithm can be used to minimize the energy function.
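As a small numerical illustration of the swap-move quantities, the sketch below computes the two move counts for a binary move vector over the active nodes and checks the identity \(W({\mathbf {c}}_a) = n_0^m + n_1^m\), together with \(\varrho _{\alpha \beta }\). It assumes symmetric weights and that \(n_0^m\) collects the weight of active nodes with \(t_i = 0\) while \(n_1^m\) collects those with \(t_i = 1\); the function names `move_counts` and `rho` and the numeric values are hypothetical.

```python
from typing import Sequence, Tuple


def move_counts(t_active: Sequence[int],
                weights_active: Sequence[float]) -> Tuple[float, float]:
    """Return (n_0^m, n_1^m) for a binary move vector over the active nodes.

    Assumption: n_0^m sums the weights of nodes with t_i = 0 and
    n_1^m sums the weights of nodes with t_i = 1.
    """
    n0 = sum(w * (1 - t) for t, w in zip(t_active, weights_active))
    n1 = sum(w * t for t, w in zip(t_active, weights_active))
    return n0, n1


def rho(q_alpha: float, q_beta: float) -> float:
    """rho_{alpha beta} = Q_alpha * Q_beta / (Q_alpha + Q_beta), as given in the text."""
    return q_alpha * q_beta / (q_alpha + q_beta)


# Hypothetical move over five active nodes with unit weights (w_i = 1).
t = [0, 1, 1, 0, 1]
w = [1.0] * len(t)

n0, n1 = move_counts(t, w)
w_active = sum(w)            # W(c_a) for the active set

rho_example = rho(1.0, 3.0)  # illustrative Q values, not from the paper
```

Because every active node contributes its weight to exactly one of the two counts, their sum equals \(W({\mathbf {c}}_a)\) for any move vector.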
1.2 Disintegration of Higher-Order Energies to Second-Order Sub-modular Energies for Expansion Moves
Suppose that, in a clique ‘\({\mathbf {c}}\)’, the locations of the nodes with label \(\ell \) are represented by a set of indices \({\mathbf {c}}_{\ell }\). The current labeling solution is denoted by \({y}_{{\mathbf {c}}}^{\circ }\).
If \(d \in {\mathcal {L}}\) denotes the dominant label in the current labeling \({y}_{{\mathbf {c}}}^{\circ }\), i.e.,
then there must be one dominant label:
The minimization operator in the above function can be replaced by a piecewise function:
where \(\varrho _{\alpha d} = \frac{Q_{\alpha }Q_{d}}{Q_{\alpha }+Q_{d}}\) and the function \(n^m_{\ell }(\mathbf {t}_c)\) is defined as:
From Theorem 2 in Kohli et al. (2009), the energy defined above can be transformed into a submodular quadratic pseudo-boolean function with two binary meta-variables. In this form, the \(\alpha \)-expansion algorithm can be used to minimize the energy function.
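To make the notion of a dominant label concrete, the following sketch searches a clique for a label \(d\) whose weighted support exceeds a threshold. The test used here, \(W({\mathbf {c}}_d) > W({\mathbf {c}}) - Q_d\), is an assumption borrowed from the robust higher-order potentials of Kohli et al. (2009), with \(Q_{\ell }\) treated as a per-label truncation parameter; it is not reproduced from this appendix. The function name `dominant_label` and all numeric values are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Hashable, Optional, Sequence


def dominant_label(labels: Sequence[Hashable],
                   weights: Sequence[float],
                   truncation: Dict[Hashable, float]) -> Optional[Hashable]:
    """Return a label d with W(c_d) > W(c) - Q_d, or None if no label qualifies.

    Assumed condition (after Kohli et al., 2009). If Q_a + Q_b < W(c) for every
    pair of distinct labels, at most one label can satisfy it.
    """
    total = sum(weights)
    mass = defaultdict(float)
    for y, w in zip(labels, weights):
        mass[y] += w  # W(c_l): weight of the nodes currently labelled l
    for label, support in mass.items():
        if support > total - truncation[label]:
            return label
    return None


# Hypothetical cliques of five pixels with unit weights and Q_l = 2 for every label.
q = {"wall": 2.0, "floor": 2.0, "bed": 2.0}
mostly_wall = ["wall", "wall", "wall", "wall", "floor"]
mixed = ["wall", "wall", "floor", "floor", "bed"]

d_consistent = dominant_label(mostly_wall, [1.0] * 5, q)
d_mixed = dominant_label(mixed, [1.0] * 5, q)
```

In the first clique the support of "wall" is 4, which exceeds 5 - 2 = 3, so "wall" is returned; in the second clique no label reaches the threshold and the function returns `None`.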
Cite this article
Khan, S.H., Bennamoun, M., Sohel, F. et al. Integrating Geometrical Context for Semantic Labeling of Indoor Scenes using RGBD Images. Int J Comput Vis 117, 1–20 (2016). https://doi.org/10.1007/s11263-015-0843-8
DOI: https://doi.org/10.1007/s11263-015-0843-8