Abstract
The performance of multimedia document indexing systems has improved significantly in recent years, especially since the adoption of deep learning approaches. However, this progress remains insufficient given the evolution of users' needs, whose queries have become more complex both in their semantics and in the number of words they contain. It is therefore important to index images by groups of concepts simultaneously (multi-concepts) rather than by single concepts alone; this would allow systems to better answer queries composed of several terms. This task is considerably harder than single-concept indexing, and multi-concept detection in images has received little attention in the state of the art compared to the detection of single visual concepts. On the other hand, the use of context has proved effective in multimedia semantic indexing. In this work, we propose two approaches that exploit semantic context for multi-concept detection in still images. We tested and evaluated our proposals on the standard international Pascal VOC corpus for the detection of concept pairs and concept triplets. Our contributions show that context is useful and improves multi-concept detection in images. Combining semantic context with deep learning-based features yielded results well above the state of the art, with a relative gain in mean average precision reaching +70% for concept pairs and +34% for concept triplets.
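The reported figures rest on mean average precision (mAP) over concept pairs or triplets and on the relative gain between two systems. As a minimal sketch of how these quantities are computed — with illustrative scores and labels, not the paper's data — one could write:

```python
# Hedged sketch: average precision (AP) per concept group, mAP over
# groups, and the relative gain between a context-aware system and a
# baseline. Inputs below are illustrative assumptions.

def average_precision(scores, labels):
    """AP for one concept (pair/triplet): mean of precision@k at each
    rank k where a relevant image appears in the score-sorted list."""
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    hits, precisions = 0, []
    for k, (_, relevant) in enumerate(ranked, start=1):
        if relevant:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / max(hits, 1)

def mean_ap(ap_values):
    """mAP: unweighted mean of per-group AP values."""
    return sum(ap_values) / len(ap_values)

def relative_gain(map_new, map_baseline):
    """Relative gain in percent, as used for the +70% / +34% figures."""
    return 100.0 * (map_new - map_baseline) / map_baseline
```

For example, a system reaching a mAP of 0.34 against a baseline at 0.20 would report `relative_gain(0.34, 0.20)`, i.e. a +70% relative gain.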
Notes
http://host.robots.ox.ac.uk/pascal/VOC/voc2012/. Last checked: November 29, 2017.
https://www.tensorflow.org/. Last checked: November 29, 2017.
Cite this article
Hamadi, A., Lattar, H., Khoussa, M.E.B. et al. Using semantic context for multiple concepts detection in still images. Pattern Anal Applic 23, 27–44 (2020). https://doi.org/10.1007/s10044-018-0761-9