Leveraging Image, Text and Cross–media Similarities for Diversity–focused Multimedia Retrieval

Ah-Pine, Julien; Clinchant, Stephane; Csurka, Gabriela; Perronnin, Florent; Renders, Jean-Michel

doi:10.1007/978-3-642-15181-1_17

Part of the book series: The Information Retrieval Series (INRE, volume 32)

Abstract

This chapter summarizes the cross–modal information retrieval techniques that Xerox Research Centre implemented during three years of participation in the ImageCLEF Photo tasks. The main challenge remained constant: how to optimally couple visual and textual similarities when they capture information at different semantic levels and when one of the media (the textual one) gives, most of the time, much better retrieval performance. Two core components proved very effective throughout these years: the visual similarity metrics based on the Fisher Vector representation of images, and the cross–media similarity principle based on relevance models. Other components were introduced to solve additional issues. We tried different query– and document–enrichment methods, either by exploiting auxiliary resources such as Flickr or open–source thesauri, or by applying statistical ‘semantic smoothing’. We also implemented clustering mechanisms in order to promote diversity in the top results and to provide faster access to relevant information. This chapter describes, analyses and assesses each of these components, namely: the monomodal similarity measures, the different cross–media similarities, the query and document enrichment, and finally the mechanisms that ensure diversity in what is proposed to the user. To conclude, we discuss the lessons we have learnt over the years in trying to solve this very challenging task.
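
To make the idea of coupling textual and visual similarities more concrete, the following is a minimal sketch, not the chapter's actual formulation. It assumes a simple pseudo-relevance-feedback style of cross–media scoring: the documents that are textually closest to the query are taken as pseudo-relevant, and every document is then scored by its visual similarity to them. The function names, the parameters `k` and `alpha`, the linear fusion step and the toy data are all hypothetical and chosen only for illustration.

```python
from typing import List


def top_k_indices(scores: List[float], k: int) -> List[int]:
    """Return the indices of the k highest scores, best first.

    Ties are broken in favour of the lower index.
    """
    return sorted(range(len(scores)), key=lambda i: (-scores[i], i))[:k]


def cross_media_scores(
    text_scores: List[float],
    visual_sim: List[List[float]],
    k: int,
) -> List[float]:
    """Illustrative cross-media score (an assumption, not the chapter's formula).

    The k documents with the best textual score act as pseudo-relevant
    neighbours. Each document d is scored by the sum of its visual
    similarities to those neighbours, weighted by their textual scores.
    """
    neighbours = top_k_indices(text_scores, k)
    n_docs = len(text_scores)
    return [
        sum(text_scores[j] * visual_sim[j][d] for j in neighbours)
        for d in range(n_docs)
    ]


def fuse(
    text_scores: List[float],
    cross_scores: List[float],
    alpha: float,
) -> List[float]:
    """Linear late fusion; alpha is a hypothetical weight on the textual score."""
    return [alpha * t + (1.0 - alpha) * c for t, c in zip(text_scores, cross_scores)]


# Toy data: four documents, textual scores for one query, and a symmetric
# visual similarity matrix between the documents' images.
text_scores = [0.9, 0.1, 0.0, 0.8]
visual_sim = [
    [1.0, 0.2, 0.7, 0.3],
    [0.2, 1.0, 0.1, 0.0],
    [0.7, 0.1, 1.0, 0.4],
    [0.3, 0.0, 0.4, 1.0],
]

cross = cross_media_scores(text_scores, visual_sim, k=2)
fused = fuse(text_scores, cross, alpha=0.5)
ranking = top_k_indices(fused, len(fused))
```

In this toy setting, document 2 has a textual score of zero but looks like the two textually best documents, so the cross–media score lifts it above document 1 in the fused ranking. This is the kind of effect the coupling is meant to produce when the textual medium is the stronger one.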

Author information

Authors and Affiliations

Xerox Research Centre Europe, 6 ch. de Maupertuis, 38240, Meylan, France
Julien Ah-Pine, Stephane Clinchant, Gabriela Csurka, Florent Perronnin & Jean-Michel Renders

Corresponding author

Correspondence to Julien Ah-Pine.

Editor information

Editors and Affiliations

HES-SO Business Information Systems, TechnoArk 3, Sierre, 3960, Switzerland
Henning Müller
Dept. Information Studies, University of Sheffield, Portobello Street 211, Sheffield, S1 4DP, United Kingdom
Paul Clough
Computer Vision Lab/ETF-C 113.2, ETH Zürich, Zürich, 8092, Switzerland
Thomas Deselaers
Idiap Research Institute, rue Marconi 19, Martigny, 1920, Switzerland
Barbara Caputo

About this chapter

Cite this chapter

Ah-Pine, J., Clinchant, S., Csurka, G., Perronnin, F., Renders, JM. (2010). Leveraging Image, Text and Cross–media Similarities for Diversity–focused Multimedia Retrieval. In: Müller, H., Clough, P., Deselaers, T., Caputo, B. (eds) ImageCLEF. The Information Retrieval Series, vol 32. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-15181-1_17

DOI: https://doi.org/10.1007/978-3-642-15181-1_17
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-15180-4
Online ISBN: 978-3-642-15181-1
eBook Packages: Computer Science, Computer Science (R0)
