Abstract
Scene classification is a challenging and active computer vision task because of its inherent ambiguity. The task aims to assign scene images to predefined categories based on their ambient content, objects, and layout. Inspired by human visual scene understanding, visual scenes can be divided into two categories: (1) object-based scenes, which consist of scene-specific objects and can be understood through those objects; and (2) layout-based scenes, which are understood through the layout and ambient content of the image. Scene-specific objects help to understand object-based scenes semantically, whereas the layout and ambient content are effective for understanding layout-based scenes because they represent the visual appearance of the image. Hence, one of the main challenges in scene classification is to create a discriminative representation that provides a high-level perception of visual scenes. Accordingly, we present a discriminative hybrid representation of visual scenes in which semantic features extracted from scene-specific objects are fused with visual features extracted from a deep CNN. The proposed representation is used for scene classification and is applied to three benchmark scene datasets: MIT67, SUN397, and UIUC Sports. In addition, we introduce a new scene dataset, called "Scene40," and report the results of the proposed method on it. Experimental results show that the proposed method achieves remarkable performance in scene classification and outperforms the state-of-the-art methods.
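The idea of a hybrid representation can be illustrated with a minimal sketch. The abstract does not specify how the semantic and visual features are built or fused, so everything below is an assumption made for illustration: the semantic part is a simple count histogram over a hypothetical object vocabulary, the visual part is a stand-in list of numbers in place of real CNN activations, and fusion is plain concatenation of L2-normalized vectors rather than the authors' actual fusion scheme.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def object_histogram(detected_labels, vocabulary):
    """Count how often each vocabulary object occurs among the detected labels.

    Labels outside the vocabulary are ignored.
    """
    index = {name: i for i, name in enumerate(vocabulary)}
    hist = [0.0] * len(vocabulary)
    for label in detected_labels:
        if label in index:
            hist[index[label]] += 1.0
    return hist


def fuse(semantic_vec, visual_vec):
    """Assumed fusion: normalize each part separately, then concatenate."""
    return l2_normalize(semantic_vec) + l2_normalize(visual_vec)


# Hypothetical inputs: a tiny object vocabulary, detector output for one image,
# and a stand-in for a deep CNN feature vector.
vocabulary = ["bed", "lamp", "oven", "sink"]
detections = ["bed", "lamp", "lamp", "car"]
visual_features = [0.5, 1.5, 0.0]

semantic_features = object_histogram(detections, vocabulary)
hybrid = fuse(semantic_features, visual_features)
```

Normalizing each part before concatenation keeps either feature type from dominating purely because of its scale; the resulting vector could then be passed to any standard classifier.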
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
About this article
Cite this article
Parseh, M.J., Rahmanimanesh, M., Keshavarzi, P. et al. Semantic-aware visual scene representation. Int J Multimed Info Retr 11, 619–638 (2022). https://doi.org/10.1007/s13735-022-00246-5
DOI: https://doi.org/10.1007/s13735-022-00246-5