Abstract
The performance of multimedia document indexing systems has improved significantly in recent years, especially since the adoption of deep learning approaches. However, this progress remains insufficient given the evolution of users' needs, whose queries have become more complex both in their semantics and in the number of words they contain. It is therefore important to index images by groups of concepts simultaneously (multi-concepts) rather than by single concepts alone; this would allow systems to better answer queries composed of several terms. This task is considerably harder than single-concept indexing, and multi-concept detection in images has received little attention in the state of the art compared to the detection of single visual concepts. On the other hand, the use of context has proved effective in multimedia semantic indexing. In this work, we propose two approaches that exploit semantic context for multi-concept detection in still images. We tested and evaluated our proposals on the standard international Pascal VOC corpus for the detection of concept pairs and concept triplets. Our contributions show that context is useful and improves multi-concept detection in images. Combining semantic context with deep learning-based features yielded results well above the state of the art, with a relative gain in mean average precision reaching +70% for concept pairs and +34% for concept triplets.
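The reported figures rest on mean average precision (mAP) over concept pairs or triplets and on the relative gain between two systems. As a minimal sketch of how these quantities are computed — with illustrative scores and labels, not the paper's data — one could write:

```python
# Hedged sketch: average precision (AP) per concept group, mAP over
# groups, and the relative gain between a context-aware system and a
# baseline. Inputs below are illustrative assumptions.

def average_precision(scores, labels):
    """AP for one concept (pair/triplet): mean of precision@k at each
    rank k where a relevant image appears in the score-sorted list."""
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    hits, precisions = 0, []
    for k, (_, relevant) in enumerate(ranked, start=1):
        if relevant:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / max(hits, 1)

def mean_ap(ap_values):
    """mAP: unweighted mean of per-group AP values."""
    return sum(ap_values) / len(ap_values)

def relative_gain(map_new, map_baseline):
    """Relative gain in percent, as used for the +70% / +34% figures."""
    return 100.0 * (map_new - map_baseline) / map_baseline
```

For example, a system reaching a mAP of 0.34 against a baseline at 0.20 would report `relative_gain(0.34, 0.20)`, i.e. a +70% relative gain.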
Notes
http://host.robots.ox.ac.uk/pascal/VOC/voc2012/. Last checked: November 29, 2017.
https://www.tensorflow.org/. Last checked: November 29, 2017.
Cite this article
Hamadi, A., Lattar, H., Khoussa, M.E.B. et al. Using semantic context for multiple concepts detection in still images. Pattern Anal Applic 23, 27–44 (2020). https://doi.org/10.1007/s10044-018-0761-9