Abstract
In multi-modal data such as image-text pairs, individual samples may be inherently heterogeneous in content, yet they are usually coupled through higher-level concepts such as their categories. This shared information can be used to measure the semantics of samples across modalities in a relative manner. In this paper, we investigate the problem of analysing how specific the semantic content of a sample in one modality is with respect to semantically similar samples in another modality. Samples that have high similarity with semantically similar samples from another modality are considered specific, while the others are considered relatively ambiguous. To model this property, we propose a novel notion of “cross-specificity”. We present two mechanisms to measure cross-specificity: one based on human judgement and the other based on an automated approach. We analyse different aspects of cross-specificity and demonstrate its utility in the cross-modal retrieval task. Experiments show that, though conceptually simple, it can benefit several existing cross-modal retrieval techniques and provide a significant boost in their performance.
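The idea of scoring a sample by its similarity to semantically similar samples in the other modality can be illustrated with a minimal sketch. It is not the paper's exact formulation. It assumes that both modalities have already been projected into a common vector space, that sharing a category label is what makes two samples "semantically similar", and that cosine similarity averaged over those samples is an acceptable score. The function names `cosine` and `cross_specificity` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def cross_specificity(sample, sample_label, other_vectors, other_labels):
    """Illustrative score: mean cosine similarity between `sample` (one modality)
    and the samples of the other modality that share its category label.

    Assumption: all vectors already live in a common embedding space.
    Returns 0.0 when the other modality has no sample with the same label.
    """
    sims = [
        cosine(sample, vec)
        for vec, label in zip(other_vectors, other_labels)
        if label == sample_label
    ]
    if not sims:
        return 0.0
    return sum(sims) / len(sims)


# Toy data: one image embedding scored against three text embeddings.
image_vec = [1.0, 0.0]
text_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_labels = ["cat", "cat", "dog"]

score = cross_specificity(image_vec, "cat", text_vecs, text_labels)
print(score)
```

Under these assumptions, a higher score marks a more specific sample and a lower score a more ambiguous one. Any other similarity measure or aggregation could be substituted without changing the overall idea.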
Notes
During our analysis, we found that the similarity scores assigned by humans were quite sensitive to subjective aspects such as perceptibility and the presence of multiple objects, which led to relatively lower scores for human-computed cross-specificity than for automated cross-specificity.
All the correlations were found to be statistically significant at \(p<0.0001\).
Acknowledgements
YV would like to thank the Department of Science and Technology (India) for the INSPIRE Faculty Award.
Additional information
Yashaswi Verma contributed to this work while he was a graduate student at IIIT-Hyderabad.
About this article
Cite this article
Verma, Y., Jha, A. & Jawahar, C.V. Cross-specificity: modelling data semantics for cross-modal matching and retrieval. Int J Multimed Info Retr 7, 139–146 (2018). https://doi.org/10.1007/s13735-017-0138-7
DOI: https://doi.org/10.1007/s13735-017-0138-7