
Cross-specificity: modelling data semantics for cross-modal matching and retrieval


When dealing with multi-modal data such as paired images and text, individual samples may exhibit inherent heterogeneity in their content, yet they are usually coupled with each other through higher-level concepts such as their categories. This shared information can be useful for measuring the semantics of samples across modalities in a relative manner. In this paper, we investigate the problem of analysing the degree of specificity in the semantic content of a sample in one modality with respect to semantically similar samples in another modality. Samples that have high similarity with semantically similar samples from another modality are considered specific, while others are considered relatively ambiguous. To model this property, we propose a novel notion of “cross-specificity”. We present two mechanisms to measure cross-specificity: one based on human judgement and the other based on an automated approach. We analyse different aspects of cross-specificity and demonstrate its utility in the cross-modal retrieval task. Experiments show that, though conceptually simple, it can benefit several existing cross-modal retrieval techniques and provide a significant boost in their performance.
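As an informal illustration of the automated variant described above, the cross-specificity of a sample can be scored as its average similarity to semantically related samples from the other modality, assuming both modalities have already been embedded in a common space. The function below is a minimal sketch of this idea, not the paper's exact formulation; the choice of cosine similarity and the function names are assumptions for illustration only.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cross_specificity(sample, cross_modal_neighbors):
    """Score a sample by its mean similarity to semantically similar
    samples from the other modality (all vectors assumed to lie in a
    shared embedding space). Higher score => more specific; lower
    score => relatively ambiguous."""
    sims = [cosine(sample, n) for n in cross_modal_neighbors]
    return sum(sims) / len(sims)

# An image embedding aligned with its text neighbours scores high;
# one orthogonal to them scores low.
img = [1.0, 0.0]
texts = [[1.0, 0.0], [0.0, 1.0]]
score = cross_specificity(img, texts)  # mean of 1.0 and 0.0 = 0.5
```

In this sketch, a sample whose cross-modal neighbours all point in the same direction in the shared space receives a score near 1, matching the paper's intuition that such samples are "specific".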



Notes

  1. During our analysis, we found that the similarity scores assigned by humans were quite sensitive to subjective aspects such as perceptibility and the presence of multiple objects, which led to relatively lower scores for human-computed cross-specificity compared to automated cross-specificity.

  2. The reader is suggested to refer to Sect. 3.1 of [10] for further details on the notion of image specificity and to better appreciate its distinctions from the proposed cross-specificity.

  3. All the correlations were found to be statistically significant at \(p<0.0001\).


References

  1. Andrew G, Arora R, Bilmes J, Livescu K (2013) Deep canonical correlation analysis. In: ICML

  2. Berg AC, Berg TL, Daumé H, Dodge J, Goyal A, Han X, Mensch A, Mitchell M, Sood A, Stratos K, Yamaguchi K (2012) Understanding and predicting importance in images. In: CVPR

  3. Blei DM, Ng AY, Jordan MI (2003) Latent Dirichlet allocation. JMLR 3:993–1022

  4. Chen X, Hero A, Savarese S (2012) Multimodal video indexing and retrieval using directed information. IEEE Trans Multimed 14(1):3–16

  5. Duan K, Crandall D, Batra D (2014) Multimodal learning in loosely-organized web images. In: CVPR

  6. Gong Y, Ke Q, Isard M, Lazebnik S (2013) A multi-view embedding space for modeling internet images, tags, and their semantics. IJCV 106(2):210–233

  7. Guillaumin M, Verbeek J, Schmid C (2010) Multimodal semi-supervised learning for image classification. In: CVPR

  8. Hardoon DR, Szedmak S, Shawe-Taylor J (2004) Canonical correlation analysis: an overview with application to learning methods. Neural Comput 16(12):2639–2664

  9. Itti L, Koch C, Niebur E (1998) A model of saliency-based visual attention for rapid scene analysis. IEEE TPAMI 20(11):1254–1259

  10. Jas M, Parikh D (2015) Image specificity. In: CVPR

  11. Judd T, Ehinger K, Durand F, Torralba A (2009) Learning to predict where humans look. In: ICCV

  12. Kang C, Xiang S, Liao S, Xu C, Pan C (2015) Learning consistent feature representation for cross-modal multimedia retrieval. IEEE Trans Multimed 17(3):370–381

  13. Li Y, Crandall D, Huttenlocher D (2009) Landmark classification in large-scale image collections. In: ICCV

  14. McAuley JJ, Leskovec J (2012) Image labeling on a network: using social-network metadata for image classification. In: ECCV

  15. Rashtchian C, Young P, Hodosh M, Hockenmaier J (2010) Collecting image annotations using Amazon's Mechanical Turk. In: NAACL-HLT Workshop

  16. Rasiwasia N, Mahajan D, Mahadevan V, Aggarwal G (2014) Cluster canonical correlation analysis. In: AISTATS

  17. Rasiwasia N, Pereira JC, Coviello E, Doyle G, Lanckriet GRG, Levy R, Vasconcelos N (2010) A new approach to cross-modal multimedia retrieval. In: ACM MM

  18. Sharma A, Kumar A, Daumé III H, Jacobs DW (2012) Generalized multiview analysis: a discriminative latent space. In: CVPR

  19. Simonyan K, Zisserman A (2015) Very deep convolutional networks for large-scale image recognition. In: ICLR

  20. Socher R, Karpathy A, Le QV, Manning CD, Ng AY (2013) Grounded compositional semantics for finding and describing images with sentences. TACL

  21. Spain M, Perona P (2011) Measuring and predicting object importance. IJCV 91(1):59–76

  22. Srivastava N, Salakhutdinov R (2012) Multimodal learning with deep Boltzmann machines. In: NIPS

  23. Verma Y, Jawahar CV (2017) A support vector approach for cross-modal search of images and texts. Comput Vis Image Underst 154:48–63



Acknowledgements

YV would like to thank the Department of Science and Technology (India) for the INSPIRE Faculty Award.

Author information



Corresponding author

Correspondence to Yashaswi Verma.

Additional information

Yashaswi Verma contributed to this work while he was a graduate student at IIIT-Hyderabad.


About this article


Cite this article

Verma, Y., Jha, A. & Jawahar, C.V. Cross-specificity: modelling data semantics for cross-modal matching and retrieval. Int J Multimed Info Retr 7, 139–146 (2018).



Keywords

  • Cross-media analysis
  • Semantic matching
  • Cross-modal retrieval