ImageCLEF pp 315-342

Leveraging Image, Text and Cross–media Similarities for Diversity–focused Multimedia Retrieval

  • Julien Ah-Pine
  • Stephane Clinchant
  • Gabriela Csurka
  • Florent Perronnin
  • Jean-Michel Renders
Part of The Information Retrieval Series book series (INRE, volume 32)


This chapter summarizes the cross–modal information retrieval techniques that Xerox Research Centre implemented during three years of participation in the ImageCLEF Photo tasks. The main challenge remained constant: how to optimally couple visual and textual similarities when they capture information at different semantic levels and when one of the media (the textual one) gives, most of the time, much better retrieval performance. Some core components proved very effective throughout those years: the visual similarity metrics based on the Fisher Vector representation of images, and the cross–media similarity principle based on relevance models. Other components were introduced to solve additional issues. We tried different query– and document–enrichment methods that exploit auxiliary resources such as Flickr or open–source thesauri, or that apply statistical ‘semantic smoothing’. We also implemented clustering mechanisms to promote diversity in the top results and to provide faster access to relevant information. This chapter describes, analyses and assesses each of these components, namely: the monomodal similarity measures, the different cross–media similarities, the query and document enrichment, and the mechanisms that ensure diversity in the results proposed to the user. To conclude, we discuss the lessons we have learnt over the years in trying to solve this very challenging task.


Image Retrieval, Relevance Feedback, Query Expansion, Visual Similarity, Late Fusion
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Julien Ah-Pine (1; corresponding author)
  • Stephane Clinchant (1)
  • Gabriela Csurka (1)
  • Florent Perronnin (1)
  • Jean-Michel Renders (1)
  1. Xerox Research Centre Europe, Meylan, France
