Learning Rich Features from RGB-D Images for Object Detection and Segmentation

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8695)


In this paper, we study the problem of object detection for RGB-D images using semantically rich image and depth features. We propose a new geocentric embedding for depth images that encodes height above ground and angle with gravity for each pixel in addition to the horizontal disparity. We demonstrate that this geocentric embedding works better than using raw depth images for learning feature representations with convolutional neural networks. Our final object detection system achieves an average precision of 37.3%, which is a 56% relative improvement over existing methods. We then focus on the task of instance segmentation, where we label pixels belonging to object instances found by our detector. For this task, we propose a decision forest approach that classifies pixels in the detection window as foreground or background using a family of unary and binary tests that query shape and geocentric pose features. Finally, we use the output from our object detectors in an existing superpixel classification framework for semantic scene segmentation and achieve a 24% relative improvement over the current state of the art for the object categories that we study. We believe advances such as those represented in this paper will facilitate the use of perception in fields like robotics.
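The geocentric embedding described in the abstract can be illustrated with a minimal NumPy sketch that turns a depth map into three per-pixel channels: horizontal disparity, height, and angle with gravity. This is not the authors' implementation. The function name `geocentric_channels`, the pinhole intrinsics `fx, fy, cx, cy`, and the `baseline` value are hypothetical; the gravity direction is assumed to be given rather than estimated; and the ground is approximated by the lowest point in the scene.

```python
import numpy as np

def geocentric_channels(depth, fx, fy, cx, cy, gravity, baseline=0.075):
    """Illustrative three-channel encoding of a depth map.

    depth   : (H, W) array of positive depths in metres.
    gravity : 3-vector pointing "down" in the camera frame (assumed known).
    baseline: assumed stereo baseline in metres (hypothetical value).
    Returns an (H, W, 3) array: disparity, height, angle with gravity (degrees).
    """
    h, w = depth.shape
    g = np.asarray(gravity, dtype=float)
    g = g / np.linalg.norm(g)

    # Back-project every pixel to a 3D point with a pinhole camera model.
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    points = np.stack([x, y, depth], axis=-1)

    # Channel 1: horizontal disparity is inversely proportional to depth.
    disparity = fx * baseline / depth

    # Channel 2: height along the "up" direction (-g), measured from the
    # lowest observed point, a crude stand-in for the ground (assumption).
    up_coord = -(points @ g)
    height = up_coord - up_coord.min()

    # Channel 3: angle between the local surface normal and gravity.
    # Normals come from finite differences of neighbouring 3D points.
    du = np.gradient(points, axis=1)
    dv = np.gradient(points, axis=0)
    normals = np.cross(du, dv)
    normals /= np.linalg.norm(normals, axis=-1, keepdims=True) + 1e-12
    # Orient each normal so that it faces the camera at the origin.
    facing_away = np.sum(normals * points, axis=-1) > 0
    normals[facing_away] *= -1.0
    angle = np.degrees(np.arccos(np.clip(normals @ g, -1.0, 1.0)))

    return np.stack([disparity, height, angle], axis=-1)
```

With these conventions, a vertical wall facing the camera gets an angle of 90 degrees everywhere, and pixels lower in the scene get smaller height values. The resulting three-channel array could then be scaled to an image-like range before being fed to a convolutional network, although the scaling step is not specified in the abstract and is left out here.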


Keywords: RGB-D perception, object detection, object segmentation

Supplementary material

Electronic Supplementary Material: 978-3-319-10584-0_23_MOESM1_ESM.pdf (PDF, 11.9 MB)



Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  1. University of California, Berkeley, USA
  2. Universidad de los Andes, Colombia
