Estimating the information gap between textual and visual representations

  • Regular Paper
  • International Journal of Multimedia Information Retrieval

Abstract

To convey a complex matter, it is often beneficial to leverage two or more modalities. For example, slides supplement an oral presentation, and photographs, drawings, figures, etc. complement the textual information in online news or scientific publications. However, the ways in which different modalities are used, and how they interrelate, can be quite diverse. Sometimes the transfer of information or knowledge may even be hampered, for instance, by contradictory information. The variety of possible interrelations between textual and graphical information, and the question of how they can be described and automatically estimated, have not been addressed by previous work. In this paper, we present several contributions to close this gap. First, we introduce two measures that describe two different dimensions of cross-modal interrelations: cross-modal mutual information (CMI) and semantic correlation (SC). Second, two novel deep learning systems are suggested to estimate the CMI and SC of textual and visual information. The first deep neural network consists of an autoencoder that maps images and texts onto a multimodal embedding space. This representation is then exploited to train classifiers for SC and CMI. An advantage of this representation is that only a small set of labeled training examples is required for the supervised learning process. Third, three different and large datasets are combined for autoencoder training to increase the diversity of (unlabeled) image–text pairs such that they properly capture the broad range of possible interrelations. Fourth, experimental results are reported for a challenging dataset. Finally, we discuss several applications for the proposed system and outline areas for future work.
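The following is a minimal, hypothetical sketch of the idea described above: an autoencoder maps image and text features onto a shared multimodal embedding, and small classifiers for CMI and SC are trained on top of that embedding so that only a small labeled set is needed. It is not the authors' implementation; the feature dimensions, layer sizes, fusion strategy, and number of classes are assumptions chosen for illustration only.

```python
# Hypothetical sketch, not the authors' architecture: a multimodal autoencoder
# plus small classification heads trained on its embedding. All dimensions and
# the number of CMI/SC classes are assumed for illustration.
import torch
import torch.nn as nn


class MultimodalAutoencoder(nn.Module):
    def __init__(self, img_dim=2048, txt_dim=300, emb_dim=256):
        super().__init__()
        # Encoders project each modality into the shared embedding space.
        self.img_enc = nn.Sequential(nn.Linear(img_dim, emb_dim), nn.ReLU())
        self.txt_enc = nn.Sequential(nn.Linear(txt_dim, emb_dim), nn.ReLU())
        # Decoders reconstruct the original features from the joint embedding.
        self.img_dec = nn.Linear(emb_dim, img_dim)
        self.txt_dec = nn.Linear(emb_dim, txt_dim)

    def forward(self, img_feat, txt_feat):
        # Fuse the two modalities by averaging their embeddings (one simple choice).
        z = 0.5 * (self.img_enc(img_feat) + self.txt_enc(txt_feat))
        return self.img_dec(z), self.txt_dec(z), z


class InterrelationClassifier(nn.Module):
    """Small head trained on the multimodal embedding, e.g. one instance for
    CMI levels and one for SC levels."""
    def __init__(self, emb_dim=256, num_classes=3):
        super().__init__()
        self.head = nn.Linear(emb_dim, num_classes)

    def forward(self, z):
        return self.head(z)


if __name__ == "__main__":
    ae = MultimodalAutoencoder()
    img = torch.randn(4, 2048)   # e.g. CNN image features (assumed)
    txt = torch.randn(4, 300)    # e.g. averaged word embeddings (assumed)
    img_rec, txt_rec, z = ae(img, txt)
    # Unsupervised reconstruction loss for autoencoder pre-training on unlabeled pairs.
    recon_loss = (nn.functional.mse_loss(img_rec, img)
                  + nn.functional.mse_loss(txt_rec, txt))
    # Supervised step: a small labeled set suffices to train the CMI/SC heads.
    cmi_head = InterrelationClassifier(num_classes=3)
    labels = torch.randint(0, 3, (4,))
    cls_loss = nn.functional.cross_entropy(cmi_head(z.detach()), labels)
    print(recon_loss.item(), cls_loss.item())
```

Training the heads on a frozen (detached) embedding, as in the sketch, is one way to exploit the representation with few labels; end-to-end fine-tuning would be an alternative design choice.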

Fig. 1 a Source: https://en.wikinews.org/wiki/Turing_test_beaten_by_Russian_chatterbot (Accessed: 2/3/17). b Source: https://simple.wikipedia.org/wiki/Air (Accessed: 2/3/17)


Notes

  1. In fact, there was even an Oscar awarded for Best Writing—Title Cards in the first Academy Awards ceremony in 1929, but there was never again an award for intertitles.

  2. It is assumed that image and text are jointly placed on purpose.

  3. This would also allow us to view the CMI relation of captioning samples as an inclusion, since the text does not express concepts that are not covered by the image.

  4. https://simple.wikipedia.org.

  5. The remaining samples are allocated for future usage.

  6. A suitable value for weight decay was found via grid search (a sketch of such a search follows these notes).

  7. Note that concepts should be weighted by their visualness (cf. Sect. 4.4).
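
The following is a minimal, hypothetical illustration of the grid search over weight decay mentioned in note 6. The toy data, model, candidate grid, and training budget are assumptions; the point is only the selection loop: train once per candidate value and keep the one with the best validation score.

```python
# Hypothetical grid search over weight decay (cf. note 6). Data and model are
# toy placeholders; only the selection procedure is illustrated.
import torch
import torch.nn as nn

torch.manual_seed(0)
x_train, y_train = torch.randn(200, 16), torch.randint(0, 2, (200,))
x_val, y_val = torch.randn(50, 16), torch.randint(0, 2, (50,))


def validation_accuracy(weight_decay, epochs=20):
    # Train a small classifier with the given weight decay and report
    # its accuracy on the held-out validation split.
    model = nn.Linear(16, 2)
    opt = torch.optim.Adam(model.parameters(), lr=1e-2, weight_decay=weight_decay)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.cross_entropy(model(x_train), y_train)
        loss.backward()
        opt.step()
    with torch.no_grad():
        return (model(x_val).argmax(dim=1) == y_val).float().mean().item()


# Keep the candidate value with the best validation accuracy.
best_wd = max([1e-5, 1e-4, 1e-3, 1e-2], key=validation_accuracy)
print("selected weight decay:", best_wd)
```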


Author information

Correspondence to Christian Henning.


About this article


Cite this article

Henning, C., Ewerth, R. Estimating the information gap between textual and visual representations. Int J Multimed Info Retr 7, 43–56 (2018). https://doi.org/10.1007/s13735-017-0142-y

