Multimedia Tools and Applications, Volume 76, Issue 21, pp 22383–22403

Multimedia retrieval based on non-linear graph-based fusion and partial least squares regression

  • Ilias Gialampoukidis
  • Anastasia Moumtzidou
  • Dimitris Liparas
  • Theodora Tsikrika
  • Stefanos Vrochidis
  • Ioannis Kompatsiaris


Heterogeneous sources of information, such as images, videos, text and metadata, are often used to describe different or complementary views of the same multimedia object, especially in the online news domain and in large annotated image collections. Retrieving multimedia objects for a multimodal query requires combining several sources of information in an efficient and scalable way. To this end, we propose a novel unsupervised framework for the multimodal fusion of visual and textual similarities, based on visual features, visual concepts and textual metadata, which integrates non-linear graph-based fusion with Partial Least Squares Regression. The fusion strategy constructs a multimodal contextual similarity matrix and non-linearly combines relevance scores from query-based similarity vectors. Our framework can incorporate more than two modalities as well as high-level information, without an increase in memory complexity compared to state-of-the-art baseline methods. The experimental comparison is conducted on three public multimedia collections for the multimedia retrieval task. The results show that the proposed method outperforms the baseline methods in terms of Mean Average Precision and Precision@20.
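The abstract describes non-linearly combining relevance scores from per-modality, query-based similarity vectors. The sketch below is only an illustration of that general idea, not the authors' formula: the fusion function `nonlinear_fusion`, the weight `w` and the multiplicative cross-modal term are assumptions chosen to show how a non-linear interaction term differs from a purely linear weighted sum.

```python
import numpy as np

def nonlinear_fusion(s_text, s_visual, w=0.5):
    """Fuse two per-document relevance score vectors non-linearly.

    The cross-product term rewards documents that are relevant in
    *both* modalities, which a linear weighted sum cannot express.
    (Illustrative only; not the fusion function from the paper.)
    """
    s_text = np.asarray(s_text, dtype=float)
    s_visual = np.asarray(s_visual, dtype=float)
    linear = w * s_text + (1 - w) * s_visual  # baseline linear fusion
    cross = s_text * s_visual                 # non-linear interaction term
    return linear + cross

# Query-based similarity vectors for four candidate documents
s_text = [0.9, 0.1, 0.5, 0.0]
s_visual = [0.8, 0.9, 0.5, 0.1]
scores = nonlinear_fusion(s_text, s_visual)
ranking = np.argsort(-scores)  # best-first document indices
```

Note how document 0 (strong in both modalities) is ranked above document 1, even though document 1 has the single highest visual score: the interaction term promotes cross-modal agreement.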


Keywords: Multimedia retrieval · Non-linear fusion · Graph-based models



This work was partially supported by the European Commission through the projects MULTISENSOR (FP7-610411) and KRISTINA (H2020-645012).



Copyright information

© Springer Science+Business Media New York 2017

Authors and Affiliations

  • Ilias Gialampoukidis (1)
  • Anastasia Moumtzidou (1)
  • Dimitris Liparas (1)
  • Theodora Tsikrika (1)
  • Stefanos Vrochidis (1)
  • Ioannis Kompatsiaris (1)

  1. Information Technologies Institute, Centre for Research and Technology Hellas, Thessaloniki, Greece
