Multimedia retrieval based on non-linear graph-based fusion and partial least squares regression


Heterogeneous sources of information, such as images, videos, text and metadata, often describe different or complementary views of the same multimedia object, especially in the online news domain and in large annotated image collections. Retrieving multimedia objects given a multimodal query requires combining several sources of information in an efficient and scalable way. To this end, we provide a novel unsupervised framework for the multimodal fusion of visual and textual similarities, which are based on visual features, visual concepts and textual metadata; the framework integrates non-linear graph-based fusion and Partial Least Squares Regression. The fusion strategy is based on the construction of a multimodal contextual similarity matrix and the non-linear combination of relevance scores from query-based similarity vectors. Our framework can employ more than two modalities and high-level information without increasing memory complexity compared to state-of-the-art baseline methods. The experimental comparison is carried out on three public multimedia collections in the multimedia retrieval task. The results show that the proposed method outperforms the baseline methods in terms of Mean Average Precision and Precision@20.
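The two evaluation measures named in the abstract can be illustrated with a minimal sketch. It uses the standard textbook definitions of Precision@k and (Mean) Average Precision; the function names, the string item identifiers and the convention of dividing by k even when fewer than k results are returned are assumptions made for illustration, not details taken from the article's evaluation protocol.

```python
from typing import Iterable, Sequence, Set, Tuple


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 20) -> float:
    """Fraction of the top-k ranked items that are relevant (assumed convention: divide by k)."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / k


def average_precision(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Mean of the precision values at each rank where a relevant item appears,
    normalised by the total number of relevant items."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(runs: Iterable[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average of per-query average precision over (ranking, relevant set) pairs."""
    runs = list(runs)
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


if __name__ == "__main__":
    # Hypothetical toy rankings, not data from the article.
    query_1 = (["a", "b", "c", "d"], {"a", "c"})
    query_2 = (["x", "y"], {"y"})
    print(precision_at_k(*query_1, k=2))
    print(mean_average_precision([query_1, query_2]))
```

With k set to 20, `precision_at_k` corresponds to the Precision@20 measure mentioned above; averaging `average_precision` over all queries of a collection gives Mean Average Precision.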






This work was partially supported by the European Commission through the projects MULTISENSOR (FP7-610411) and KRISTINA (H2020-645012).

Author information



Corresponding author

Correspondence to Ilias Gialampoukidis.


Cite this article

Gialampoukidis, I., Moumtzidou, A., Liparas, D. et al. Multimedia retrieval based on non-linear graph-based fusion and partial least squares regression. Multimed Tools Appl 76, 22383–22403 (2017).



  • Multimedia retrieval
  • Non-linear fusion
  • Graph-based models