Multimedia Tools and Applications, Volume 76, Issue 3, pp 4427–4443

Indoor scene recognition via multi-task metric multi-kernel learning from RGB-D images

Abstract

Traditional scene analysis has focused mainly on outdoor scene recognition rather than indoor scene understanding. However, the widespread use of depth cameras offers a new opportunity to tackle indoor scene recognition. In this paper, we propose a multi-task metric multi-kernel learning algorithm that exploits the inter-source similarities and complementarities between color images and depth images to perform indoor scene recognition. Specifically, our method uses multi-task metric learning to learn a Mahalanobis metric for RGB-D images. Multi-task metric learning extracts the properties common to color images and depth images in order to learn better metrics. The learned metrics are then used to transform features into a corrected feature space that yields a better representation. By exploiting multi-kernel learning, our method leverages multiple feature representations to train a more discriminative classifier. We conduct experiments on the NYU Depth Dataset and the B3DO Dataset to evaluate the effectiveness of our approach. The experimental results demonstrate that the proposed method leads to better indoor scene recognition.
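
As a rough illustration of two ingredients named in the abstract, the Python sketch below maps features into a space where Euclidean distance equals a Mahalanobis distance, then combines one kernel per modality into a single kernel. It is not the authors' algorithm: the metric matrices are hand-built stand-ins rather than the output of multi-task metric learning, the kernel weights are fixed rather than optimized by a multi-kernel solver, and every function name, feature dimension, and parameter value is a hypothetical choice made for the example.

```python
import numpy as np


def mahalanobis_transform(X, M):
    """Map rows of X so that squared Euclidean distance in the new space
    equals the squared Mahalanobis distance (x - y)^T M (x - y).

    M must be symmetric positive definite (an assumption of this sketch).
    """
    # Cholesky gives lower-triangular C with M = C C^T, so x -> C^T x.
    C = np.linalg.cholesky(M)
    return X @ C  # each row x becomes (C^T x)^T


def rbf_kernel(A, B, gamma):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    sq = (
        np.sum(A ** 2, axis=1)[:, None]
        + np.sum(B ** 2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-gamma * np.maximum(sq, 0.0))


def combine_kernels(kernels, weights):
    """Convex combination of kernel matrices (weights are normalized to sum to 1)."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    return sum(w * K for w, K in zip(weights, kernels))


# Toy data standing in for color and depth features of the same 5 images.
rng = np.random.default_rng(0)
X_rgb = rng.normal(size=(5, 4))
X_depth = rng.normal(size=(5, 3))

# Stand-in metrics: any symmetric positive definite matrices will do here.
B_rgb = rng.normal(size=(4, 4))
B_depth = rng.normal(size=(3, 3))
M_rgb = B_rgb.T @ B_rgb + np.eye(4)
M_depth = B_depth.T @ B_depth + np.eye(3)

# Transform each modality, build one kernel per modality, then combine them.
Z_rgb = mahalanobis_transform(X_rgb, M_rgb)
Z_depth = mahalanobis_transform(X_depth, M_depth)
K = combine_kernels(
    [rbf_kernel(Z_rgb, Z_rgb, gamma=0.1), rbf_kernel(Z_depth, Z_depth, gamma=0.1)],
    weights=[0.5, 0.5],
)
```

A combined matrix such as `K` could be passed to any classifier that accepts a precomputed kernel. In a real multi-kernel learning setup the weights would be optimized jointly with the classifier instead of being fixed by hand.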

Keywords

Multi-task learning · Multi-kernel learning · Metric learning · RGB-D · Scene recognition

Copyright information

© Springer Science+Business Media New York 2016

Authors and Affiliations

  1. School of Electronic Engineering, Xidian University, Xi’an, China
