Multimedia Tools and Applications, Volume 76, Issue 3, pp 4273–4290

Modality-specific and hierarchical feature learning for RGB-D hand-held object recognition

  • Xiong Lv
  • Xinda Liu
  • Xiangyang Li
  • Xue Li
  • Shuqiang Jiang
  • Zhiqiang He


Hand-held object recognition is an important research topic in image understanding and plays an essential role in human-machine interaction. With easily available RGB-D devices, depth information greatly improves object segmentation and provides an additional channel of information. How to extract a representative and discriminative feature from the object region, and how to exploit the depth information efficiently, are therefore key to improving hand-held object recognition accuracy and the eventual human-machine interaction experience. In this paper, we focus on a special but important task, RGB-D hand-held object recognition, and propose a hierarchical feature learning framework for it. First, the framework learns modality-specific features from RGB and depth images using CNN architectures with different network depths and learning strategies. Second, a high-level feature learning network produces a comprehensive feature representation. Unlike previous work on feature learning and representation, this hierarchical method fully exploits the characteristics of each modality and efficiently fuses them in a unified framework. Experimental results on the HOD dataset demonstrate the effectiveness of the proposed method.
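The two-stage pipeline described above can be sketched in a few lines of numpy. This is a minimal illustration only: the feature dimensions, the random placeholder weights, and the single affine-plus-ReLU layers standing in for the modality-specific CNNs are all assumptions for demonstration, not the paper's actual architecture or learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions -- the paper's actual layer sizes are not given here.
IN_DIM = 256                      # stand-in for a raw per-image input vector
RGB_DIM, DEPTH_DIM = 4096, 4096   # modality-specific feature sizes
FUSED_DIM, N_CLASSES = 1024, 30

def modality_feature(x, W):
    """Stand-in for a modality-specific CNN: one affine layer + ReLU."""
    return np.maximum(W @ x, 0.0)

# Random weights as placeholders for the learned networks.
W_rgb = rng.standard_normal((RGB_DIM, IN_DIM)) * 0.01
W_depth = rng.standard_normal((DEPTH_DIM, IN_DIM)) * 0.01
W_fuse = rng.standard_normal((FUSED_DIM, RGB_DIM + DEPTH_DIM)) * 0.01
W_cls = rng.standard_normal((N_CLASSES, FUSED_DIM)) * 0.01

def classify(rgb_vec, depth_vec):
    # Stage 1: learn each modality's feature separately.
    f_rgb = modality_feature(rgb_vec, W_rgb)
    f_depth = modality_feature(depth_vec, W_depth)
    # Stage 2: a high-level network fuses both into one representation.
    fused = np.maximum(W_fuse @ np.concatenate([f_rgb, f_depth]), 0.0)
    logits = W_cls @ fused
    e = np.exp(logits - logits.max())   # softmax over object classes
    return e / e.sum()

probs = classify(rng.standard_normal(IN_DIM), rng.standard_normal(IN_DIM))
```

The point of the hierarchy is that fusion happens on learned modality-specific features rather than on raw RGB and depth channels, so each branch can use a network depth and training strategy suited to its modality before the joint layer combines them.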


Keywords: Feature learning · RGB-D object recognition · Multiple modalities



Copyright information

© Springer Science+Business Media New York 2016

Authors and Affiliations

  • Xiong Lv (1)
  • Xinda Liu (2)
  • Xiangyang Li (1)
  • Xue Li (1, 3)
  • Shuqiang Jiang (1)
  • Zhiqiang He (4)

  1. Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
  2. School of Mathematics and Computer Science, Ningxia University, Ningxia, China
  3. College of Information Science and Engineering, Shandong University of Science and Technology, Qingdao, China
  4. Lenovo Corporate Research, Beijing, China
