Semantic-aware visual scene representation

  • Regular Paper
International Journal of Multimedia Information Retrieval

Abstract

Scene classification is a mature yet still active computer vision task, owing to the inherent ambiguity of visual scenes. The task aims to classify scene images into predefined categories based on their ambient content, objects, and layout. Inspired by human visual scene understanding, visual scenes can be divided into two categories: (1) object-based scenes, which consist of scene-specific objects and can be understood through those objects; and (2) layout-based scenes, which are understood through the layout and ambient content of the scene images. Scene-specific objects help to understand object-based scenes semantically, whereas the layout and ambient content help to understand layout-based scenes by representing the visual appearance of the images. Hence, one of the main challenges in scene classification is to create a discriminative representation that provides a high-level perception of visual scenes. Accordingly, we present a discriminative hybrid representation of visual scenes in which semantic features extracted from scene-specific objects are fused with visual features extracted from a deep CNN. The proposed representation is used for scene classification and applied to three benchmark scene datasets: MIT67, SUN397, and UIUC Sports. In addition, we introduce a new scene dataset, called "Scene40," and report the results of the proposed method on it. Experimental results show that the proposed method achieves strong performance on the scene classification task and outperforms the state-of-the-art methods.
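
The idea of a hybrid representation can be illustrated with a minimal sketch. The abstract does not state how the semantic and visual features are fused, so the example below assumes the simplest option: each part is L2-normalized and the two are concatenated. The function names, the object vocabulary, and the toy values are all hypothetical and are not taken from the paper.

```python
import math
from typing import Dict, List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit Euclidean length (all-zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return [float(x) for x in vec]
    return [x / norm for x in vec]


def semantic_vector(detections: Dict[str, float], vocabulary: Sequence[str]) -> List[float]:
    """Place detected object confidences (label -> score) in a fixed vocabulary order."""
    return [float(detections.get(label, 0.0)) for label in vocabulary]


def fuse(semantic: Sequence[float], visual: Sequence[float]) -> List[float]:
    """Assumed fusion rule: normalize each part, then concatenate.

    This is an illustrative stand-in, not the fusion used in the paper.
    """
    return l2_normalize(semantic) + l2_normalize(visual)


# Toy inputs, invented for illustration only.
VOCABULARY = ["bed", "lamp", "oven", "sink"]   # hypothetical scene-specific objects
detections = {"bed": 0.9, "lamp": 0.4}         # hypothetical detector output
cnn_features = [3.0, 0.0, 4.0]                 # stand-in for a deep CNN feature vector

hybrid = fuse(semantic_vector(detections, VOCABULARY), cnn_features)
print(len(hybrid))
```

Normalizing each part before concatenation keeps either feature type from dominating purely because of its scale. The resulting vector could then be passed to any standard classifier.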

Author information

Corresponding author

Correspondence to Mohammad Rahmanimanesh.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

About this article

Cite this article

Parseh, M.J., Rahmanimanesh, M., Keshavarzi, P. et al. Semantic-aware visual scene representation. Int J Multimed Info Retr 11, 619–638 (2022). https://doi.org/10.1007/s13735-022-00246-5

  • DOI: https://doi.org/10.1007/s13735-022-00246-5