Estimating the information gap between textual and visual representations

  • Christian Henning
  • Ralph Ewerth
Regular Paper


To convey a complex matter, it is often beneficial to leverage two or more modalities. For example, slides supplement an oral presentation, and photographs, drawings, and figures complement textual information in online news or scientific publications. However, the ways in which different modalities are used and interrelated can be quite diverse. Sometimes the combination may even hinder the transfer of information or knowledge, for instance, when the modalities convey contradictory information. Previous work has not addressed the variety of possible interrelations between textual and graphical information, nor the question of how these interrelations can be described and automatically estimated. In this paper, we present several contributions to close this gap. First, we introduce two measures that describe two different dimensions of cross-modal interrelations: cross-modal mutual information (CMI) and semantic correlation (SC). Second, we propose two novel deep learning systems to estimate the CMI and SC of textual and visual information. The first deep neural network consists of an autoencoder that maps images and texts onto a multimodal embedding space. This representation is then used to train classifiers for SC and CMI. An advantage of this representation is that only a small set of labeled training examples is required for the supervised learning process. Third, three large, distinct datasets are combined for autoencoder training to increase the diversity of (unlabeled) image–text pairs so that they properly capture the broad range of possible interrelations. Fourth, experimental results are reported for a challenging dataset. Finally, we discuss several applications for the proposed system and outline areas for future work.
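The two-stage idea described above (learn a joint embedding from unlabeled image–text pairs, then train a classifier on a few labeled examples) can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and not the authors' system: the features are synthetic, the autoencoder is a plain linear one trained by gradient descent, a nearest-centroid rule stands in for the classifiers, and the binary label is a hypothetical placeholder for an SC or CMI class.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for image and text features (hypothetical data):
# both modalities are noisy views of the same 3 latent factors.
n, img_dim, txt_dim, emb_dim = 200, 8, 6, 3
latent = rng.normal(size=(n, 3))
img = latent @ rng.normal(size=(3, img_dim)) + 0.1 * rng.normal(size=(n, img_dim))
txt = latent @ rng.normal(size=(3, txt_dim)) + 0.1 * rng.normal(size=(n, txt_dim))

# Concatenate the two modalities and standardize each feature.
x = np.hstack([img, txt])
x = (x - x.mean(axis=0)) / x.std(axis=0)


def recon_loss(x, w_enc, w_dec):
    """Half the mean squared reconstruction error per sample."""
    return 0.5 * np.mean(np.sum((x @ w_enc @ w_dec - x) ** 2, axis=1))


def train_autoencoder(x, emb_dim, steps=500, lr=0.01):
    """Fit a linear encoder/decoder pair by gradient descent (no labels used)."""
    n_samples, d = x.shape
    w_enc = rng.normal(scale=0.1, size=(d, emb_dim))
    w_dec = rng.normal(scale=0.1, size=(emb_dim, d))
    for _ in range(steps):
        z = x @ w_enc                       # joint embedding
        err = z @ w_dec - x                 # reconstruction residual
        grad_dec = z.T @ err / n_samples
        grad_enc = x.T @ (err @ w_dec.T) / n_samples
        w_dec -= lr * grad_dec
        w_enc -= lr * grad_enc
    return w_enc, w_dec


# Stage 1: unsupervised embedding of all (unlabeled) pairs.
w_enc, w_dec = train_autoencoder(x, emb_dim)
baseline_loss = recon_loss(x, np.zeros_like(w_enc), np.zeros_like(w_dec))
final_loss = recon_loss(x, w_enc, w_dec)
emb = x @ w_enc

# Stage 2: a classifier on the embedding, using only 20 labeled pairs.
labels = (latent[:, 0] > 0).astype(int)     # hypothetical binary label
labeled = np.concatenate([np.flatnonzero(labels == c)[:10] for c in (0, 1)])
centroids = np.stack(
    [emb[labeled][labels[labeled] == c].mean(axis=0) for c in (0, 1)]
)
dists = np.linalg.norm(emb[:, None, :] - centroids[None, :, :], axis=2)
pred = dists.argmin(axis=1)

# Evaluate only on pairs that were not used as labeled examples.
unlabeled = np.setdiff1d(np.arange(n), labeled)
accuracy = np.mean(pred[unlabeled] == labels[unlabeled])

print(f"reconstruction loss: {baseline_loss:.3f} -> {final_loss:.3f}")
print(f"accuracy on {unlabeled.size} unlabeled pairs: {accuracy:.2f}")
```

The point of the sketch is the division of labor: the embedding is fit without any labels, so the supervised step only has to estimate two centroids in a three-dimensional space, which is why a handful of labeled pairs can suffice.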


Text–image relations · Multimodal embeddings · Deep learning · Visual/verbal divide



Copyright information

© Springer-Verlag London Ltd., part of Springer Nature 2017

Authors and Affiliations

  1. Institute of Distributed Systems, and L3S Research Center, Leibniz Universität Hannover, Hannover, Germany
  2. Department of Research and Development, Research Group Visual Analytics, Leibniz Information Centre for Science and Technology (TIB), Hannover, Germany
  3. Institute of Neuroinformatics, ETH Zurich, Zurich, Switzerland
