The Visual Computer, Volume 33, Issue 12, pp 1587–1600

Learning aggregated features and optimizing model for semantic labeling

  • Jianhua Wang
  • Chuanxia Zheng
  • Weihai Chen (corresponding author)
  • Xingming Wu
Original Article


Semantic labeling of indoor scenes has advanced rapidly with the wide availability of affordable RGB-D sensors. However, multi-class recognition remains challenging, especially for “small” objects. This paper proposes a novel semantic labeling model based on aggregated features and contextual information. Given an RGB-D image, the proposed model first creates a hierarchical segmentation using an adapted gPb/UCM algorithm. Then, a support vector machine is trained to predict initial labels from aggregated features, which fuse small-scale appearance features, mid-scale geometric features, and large-scale scene features. Finally, a joint multi-label conditional random field (CRF) model that exploits both spatial and attributive contextual relations is constructed to refine the initial semantic and attribute predictions. Experimental results on the public NYU v2 dataset demonstrate that the proposed model outperforms the existing state-of-the-art methods on the challenging 40 dominant classes task, and the model also performs well on the more recent SUN RGB-D dataset. In particular, the prediction accuracy for “small” classes improves significantly.
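
To make the last two stages of the pipeline concrete, the following is a minimal sketch under stated assumptions, not the authors' implementation. It assumes that feature aggregation is a plain concatenation of per-region vectors, and it swaps the joint multi-label CRF for a much simpler Potts model over adjacent regions, optimized with iterated conditional modes (ICM). The function names, the cost convention (lower is better), and the `weight` parameter are all hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def aggregate_features(appearance: Sequence[float],
                       geometry: Sequence[float],
                       scene: Sequence[float]) -> List[float]:
    """Fuse features from three scales by concatenation (an assumption)."""
    return list(appearance) + list(geometry) + list(scene)


def potts_energy(labels: Sequence[int],
                 unary: Sequence[Sequence[float]],
                 edges: Sequence[Tuple[int, int]],
                 weight: float) -> float:
    """Sum of per-region costs plus a penalty per disagreeing neighbor pair."""
    data = sum(unary[i][lab] for i, lab in enumerate(labels))
    smooth = sum(1 for i, j in edges if labels[i] != labels[j])
    return data + weight * smooth


def icm_refine(unary: Sequence[Sequence[float]],
               edges: Sequence[Tuple[int, int]],
               weight: float = 1.0,
               max_sweeps: int = 10) -> List[int]:
    """Refine initial labels with ICM, a simple local energy minimizer."""
    n = len(unary)
    neighbors: Dict[int, List[int]] = {i: [] for i in range(n)}
    for i, j in edges:
        neighbors[i].append(j)
        neighbors[j].append(i)

    # Initial labels: cheapest class per region, ignoring context
    # (stands in for the classifier's initial prediction).
    labels = [min(range(len(c)), key=c.__getitem__) for c in unary]

    for _ in range(max_sweeps):
        changed = False
        for i in range(n):
            def local_cost(lab: int, i: int = i) -> float:
                disagreements = sum(labels[j] != lab for j in neighbors[i])
                return unary[i][lab] + weight * disagreements

            best = min(range(len(unary[i])), key=local_cost)
            # Accept only strict improvements so the loop terminates.
            if local_cost(best) < local_cost(labels[i]):
                labels[i] = best
                changed = True
        if not changed:
            break
    return labels


# Three regions in a chain; region 1 weakly prefers class 1 on its own.
unary = [[0.1, 0.9], [0.6, 0.4], [0.2, 0.8]]
edges = [(0, 1), (1, 2)]
refined = icm_refine(unary, edges, weight=0.5)
```

In this toy chain, the context term flips region 1 to class 0 because both of its neighbors strongly prefer that class. With `weight=0.0` the initial per-region labels are returned unchanged. A model closer to the one described above would add attribute variables and learned pairwise terms in place of the uniform Potts penalty.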


Keywords: Semantic scene understanding · Aggregated features · Object attribute · Joint optimizing model · Conditional random field



The work described in this paper was supported by the National Natural Science Foundation of China under Grant Nos. 61573048 and 61620106012, and by the International Scientific and Technological Cooperation Projects of China under Grant No. 2015DFG12650.



Copyright information

© Springer-Verlag Berlin Heidelberg 2016

Authors and Affiliations

  • Jianhua Wang (1)
  • Chuanxia Zheng (1)
  • Weihai Chen (1), corresponding author
  • Xingming Wu (1)
  1. School of Automation Science and Electrical Engineering, Beihang University, Beijing, China
