Semantic-aware visual scene representation

Parseh, Mohammad Javad; Rahmanimanesh, Mohammad; Keshavarzi, Parviz; Azimifar, Zohreh

doi:10.1007/s13735-022-00246-5

Regular Paper
Published: 30 August 2022

Volume 11, pages 619–638 (2022)

International Journal of Multimedia Information Retrieval

Abstract

Scene classification is a mature yet still active computer vision task, owing to the inherent ambiguity of visual scenes. It aims to assign scene images to predefined categories based on their ambient content, objects, and layout. Inspired by human visual scene understanding, visual scenes can be divided into two categories: (1) object-based scenes, which contain scene-specific objects and can be understood through those objects; and (2) layout-based scenes, which are understood through the layout and ambient content of the image. Scene-specific objects help to understand object-based scenes semantically, whereas layout and ambient content, which capture the visual appearance of the image, are effective for understanding layout-based scenes. Hence, one of the main challenges in scene classification is to create a discriminative representation that provides a high-level perception of visual scenes. Accordingly, we present a discriminative hybrid representation of visual scenes in which semantic features extracted from scene-specific objects are fused with visual features extracted from a deep CNN. The proposed representation is used for scene classification and is evaluated on three benchmark scene datasets: MIT67, SUN397, and UIUC Sports. Moreover, we introduce a new scene dataset, called "Scene40," and report the results of the proposed method on it. Experimental results show that the proposed method achieves remarkable performance in scene classification and outperforms state-of-the-art methods.
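
The idea of a hybrid representation can be illustrated with a minimal sketch. The abstract does not specify how the semantic and visual features are computed or fused, so everything below is an assumption made for illustration: the semantic feature is taken to be a count histogram of detected object labels over a fixed vocabulary, the visual feature is a stand-in vector in place of real CNN activations, and fusion is plain concatenation of L2-normalized vectors. The names `object_histogram`, `l2_normalize`, and `fuse` are hypothetical and do not come from the article.

```python
import math


def object_histogram(detected_labels, vocabulary):
    """Count detected object labels over a fixed vocabulary.

    Assumption: this histogram stands in for the 'semantic features'
    of scene-specific objects; labels outside the vocabulary are ignored.
    """
    index = {name: i for i, name in enumerate(vocabulary)}
    hist = [0.0] * len(vocabulary)
    for label in detected_labels:
        if label in index:
            hist[index[label]] += 1.0
    return hist


def l2_normalize(vector, eps=1e-12):
    """Scale a vector to unit Euclidean length (eps guards against zero vectors)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / max(norm, eps) for x in vector]


def fuse(semantic, visual):
    """Assumed fusion rule: concatenate the two L2-normalized feature vectors."""
    return l2_normalize(semantic) + l2_normalize(visual)


# Hypothetical inputs: a tiny object vocabulary, detector output for one image,
# and a short vector standing in for deep CNN activations.
vocabulary = ["bed", "lamp", "oven", "sink"]
semantic = object_histogram(["bed", "lamp", "lamp", "dog"], vocabulary)
visual = [0.5, 0.0, 1.5, 2.0, 0.0, 1.0]

hybrid = fuse(semantic, visual)
print(len(hybrid))
```

Normalizing each part before concatenation keeps either feature type from dominating purely because of its scale; a downstream classifier would then be trained on vectors like `hybrid`.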

Author information

Authors and Affiliations

Department of Electrical and Computer Engineering, Semnan University, Semnan, Iran
Mohammad Javad Parseh, Mohammad Rahmanimanesh & Parviz Keshavarzi
Department of Computer Science and Engineering, Shiraz University, Shiraz, Iran
Zohreh Azimifar

Corresponding author

Correspondence to Mohammad Rahmanimanesh.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Parseh, M.J., Rahmanimanesh, M., Keshavarzi, P. et al. Semantic-aware visual scene representation. Int J Multimed Info Retr 11, 619–638 (2022). https://doi.org/10.1007/s13735-022-00246-5

Received: 22 March 2022
Revised: 21 July 2022
Accepted: 25 July 2022
Published: 30 August 2022
Issue Date: December 2022
DOI: https://doi.org/10.1007/s13735-022-00246-5
