Cross-specificity: modelling data semantics for cross-modal matching and retrieval

Verma, Yashaswi; Jha, Abhishek; Jawahar, C. V.

doi:10.1007/s13735-017-0138-7

Short Paper
Published: 10 November 2017

Volume 7, pages 139–146, (2018)

International Journal of Multimedia Information Retrieval

Abstract

In multi-modal data such as pairs of images and text, individual samples may be inherently heterogeneous in content, yet they are usually coupled with each other through higher-level concepts such as their categories. This shared information can be useful in measuring the semantics of samples across modalities in a relative manner. In this paper, we investigate the problem of analysing the degree of specificity in the semantic content of a sample in one modality with respect to semantically similar samples in another modality. Samples that have high similarity with semantically similar samples from another modality are considered to be specific, while others are considered to be relatively ambiguous. To model this property, we propose a novel notion of “cross-specificity”. We present two mechanisms to measure cross-specificity: one based on human judgement and the other based on an automated approach. We analyse different aspects of cross-specificity and demonstrate its utility in the cross-modal retrieval task. Experiments show that, though conceptually simple, it can benefit several existing cross-modal retrieval techniques and provide a significant boost in their performance.
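The intuition in the abstract can be illustrated with a small sketch. This is not the authors' measure, whose definition is in the full article; it only assumes that both modalities have already been embedded in a common vector space and that "semantically similar" means "shares a category label". The function names, the use of cosine similarity, the averaging step, and the toy vectors are all assumptions made for illustration.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def cross_specificity_sketch(query, query_label, other_samples, other_labels):
    """Mean cosine similarity between `query` and other-modality samples with its label.

    Hypothetical stand-in for an automated score: a high value suggests a
    "specific" sample, a low value a relatively ambiguous one.
    """
    sims = [
        cosine(query, sample)
        for sample, label in zip(other_samples, other_labels)
        if label == query_label
    ]
    if not sims:
        raise ValueError("no other-modality sample shares the query's category")
    return sum(sims) / len(sims)


# Toy 2-D embeddings, invented purely for illustration.
texts = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
text_labels = ["dog", "cat", "dog"]
specific_image = [1.0, 0.0]   # aligned with the "dog" texts
ambiguous_image = [1.0, 1.0]  # halfway between the "dog" and "cat" texts

score_specific = cross_specificity_sketch(specific_image, "dog", texts, text_labels)
score_ambiguous = cross_specificity_sketch(ambiguous_image, "dog", texts, text_labels)
```

In this toy setting the image aligned with its category's texts scores higher than the one lying between two categories, which mirrors the specific-versus-ambiguous distinction drawn in the abstract.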

Notes

During our analysis, we found that the similarity scores assigned by humans were quite sensitive to subjective aspects such as perceptibility and the presence of multiple objects, which led to relatively lower scores for human-computed cross-specificity than for automated cross-specificity.
The reader is referred to Sect. 3.1 of [10] for further details on the notion of image specificity and for a better appreciation of how it differs from the proposed cross-specificity.
All the correlations were found to be statistically significant at \(p<0.0001\).

References

Andrew G, Arora R, Bilmes J, Livescu K (2013) Deep canonical correlation analysis. In: ICML
Berg AC, Berg TL, Daumé H, Dodge J, Goyal A, Han X, Mensch A, Mitchell M, Sood A, Stratos K, Yamaguchi K (2012) Understanding and predicting importance in images. In: CVPR
Blei D, Ng A, Jordan M (2003) Latent Dirichlet allocation. JMLR 3:993–1022
Chen X, Hero A, Savarese S (2012) Multimodal video indexing and retrieval using directed information. IEEE Trans Multimed 14(1):3–16
Duan K, Crandall D, Batra D (2014) Multimodal learning in loosely-organized web images. In: CVPR
Gong Y, Ke Q, Isard M, Lazebnik S (2013) A multi-view embedding space for modeling internet images, tags, and their semantics. IJCV 106(2):210–233
Guillaumin M, Verbeek J, Schmid C (2010) Multimodal semi-supervised learning for image classification. In: CVPR
Hardoon DR, Szedmak S, Shawe-Taylor J (2004) Canonical correlation analysis: an overview with application to learning methods. Neural Comput 16(12):2639–2664
Itti L, Koch C, Niebur E (1998) A model of saliency-based visual attention for rapid scene analysis. IEEE TPAMI 20(11):1254–1259
Jas M, Parikh D (2015) Image specificity. In: CVPR
Judd T, Ehinger K, Durand F, Torralba A (2009) Learning to predict where humans look. In: ICCV
Kang C, Xiang S, Liao S, Xu C, Pan C (2015) Learning consistent feature representation for cross-modal multimedia retrieval. IEEE Trans Multimed 17(3):370–381
Li Y, Crandall D, Huttenlocher D (2009) Landmark classification in large-scale image collections. In: ICCV
McAuley JJ, Leskovec J (2012) Image labeling on a network: using social-network metadata for image classification. In: ECCV
Rashtchian C, Young P, Hodosh M, Hockenmaier J (2010) Collecting image annotations using Amazon’s Mechanical Turk. In: NAACL HLT Workshop. http://vision.cs.uiuc.edu/pascal-sentences/
Rasiwasia N, Mahajan D, Mahadevan V, Aggarwal G (2014) Cluster canonical correlation analysis. In: AISTATS
Rasiwasia N, Pereira JC, Coviello E, Doyle G, Lanckriet GRG, Levy R, Vasconcelos N (2010) A new approach to cross-modal multimedia retrieval. In: ACM MM
Sharma A, Kumar A, Daumé H III, Jacobs DW (2012) Generalized multiview analysis: a discriminative latent space. In: CVPR
Simonyan K, Zisserman A (2015) Very deep convolutional networks for large-scale image recognition. In: ICLR
Socher R, Karpathy A, Le QV, Manning CD, Ng AY (2013) Grounded compositional semantics for finding and describing images with sentences. TACL
Spain M, Perona P (2011) Measuring and predicting object importance. IJCV 91(1):59–76
Srivastava N, Salakhutdinov R (2012) Multimodal learning with deep Boltzmann machines. In: NIPS
Verma Y, Jawahar CV (2017) A support vector approach for cross-modal search of images and texts. Comput Vis Image Underst 154:48–63

Acknowledgements

YV would like to thank the Department of Science and Technology (India) for the INSPIRE Faculty Award.

Author information

Authors and Affiliations

Indian Institute of Science, Bengaluru, India
Yashaswi Verma
IIIT-Hyderabad, Hyderabad, India
Abhishek Jha & C. V. Jawahar

Corresponding author

Correspondence to Yashaswi Verma.

Additional information

Yashaswi Verma contributed to this work while he was a graduate student at IIIT-Hyderabad.

About this article

Cite this article

Verma, Y., Jha, A. & Jawahar, C.V. Cross-specificity: modelling data semantics for cross-modal matching and retrieval. Int J Multimed Info Retr 7, 139–146 (2018). https://doi.org/10.1007/s13735-017-0138-7

Received: 18 August 2017
Revised: 22 October 2017
Accepted: 31 October 2017
Published: 10 November 2017
Issue Date: June 2018
DOI: https://doi.org/10.1007/s13735-017-0138-7
