Computer Vision -- ACCV 2014, pp. 613-627

Leveraging High Level Visual Information for Matching Images and Captions

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9003)

Abstract

In this paper we investigate the problem of matching images and captions. We exploit kernel canonical correlation analysis (KCCA) to learn a similarity function between images and texts, and propose methods to build improved visual and text kernels. The visual kernels are based on visual classifiers that use the responses of a deep convolutional neural network as features, while the text kernel improves on the Bag-of-Words (BoW) representation by learning a vision-based lexical similarity between words. We consider two application scenarios: one where only an external image set weakly related to the evaluation dataset is available for training the visual classifiers, and one where visual data closely related to the evaluation set can be used. We evaluate our visual and text kernels on a large, publicly available benchmark and show that our proposed methods substantially improve upon the state of the art.
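To make the matching pipeline concrete, the following is a minimal sketch (in Python, assuming NumPy and SciPy) of regularized KCCA over precomputed image and text kernel matrices, followed by cosine ranking of captions in the learned correlation space. It follows the standard dual formulation of KCCA (cf. Hardoon et al., Neural Computation, 2004); the function names, the regularizer reg, the given word-similarity matrix S, and the omission of test-kernel centering are illustrative assumptions rather than the authors' implementation. Kx and Ky stand in for the CNN-based visual kernel and the similarity-weighted BoW text kernel described above.

import numpy as np
from scipy.linalg import eigh

def center_kernel(K):
    # Center a kernel matrix in feature space: H K H with H = I - (1/n) 11^T.
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def similarity_bow_kernel(X, S):
    # BoW text kernel smoothed by a word-to-word similarity matrix S
    # (the paper learns S from visual data; here S is assumed given and
    # positive semi-definite so that K(d_i, d_j) = x_i^T S x_j is valid).
    return X @ S @ X.T

def kcca(Kx, Ky, reg=1e-3, n_components=10):
    # Regularized KCCA on precomputed training kernels (dual formulation).
    # Solves the symmetric generalized eigenproblem
    #   [ 0      Kx Ky ] v = rho [ (Kx + reg I)^2        0         ] v
    #   [ Ky Kx  0     ]         [       0         (Ky + reg I)^2  ]
    # and returns the top dual coefficients (alpha, beta) for each view.
    n = Kx.shape[0]
    Kx, Ky = center_kernel(Kx), center_kernel(Ky)
    I = np.eye(n)
    A = np.zeros((2 * n, 2 * n))
    A[:n, n:] = Kx @ Ky
    A[n:, :n] = Ky @ Kx
    B = np.zeros((2 * n, 2 * n))
    B[:n, :n] = (Kx + reg * I) @ (Kx + reg * I)
    B[n:, n:] = (Ky + reg * I) @ (Ky + reg * I)
    vals, vecs = eigh(A, B)                      # ascending eigenvalues
    top = np.argsort(vals)[::-1][:n_components]  # largest correlations first
    return vecs[:n, top], vecs[n:, top]

def rank_captions(K_img_test_train, K_txt_test_train, alpha, beta):
    # Project test images and captions into the shared space via their
    # test-vs-training kernels, then rank captions per image by cosine.
    u = K_img_test_train @ alpha
    v = K_txt_test_train @ beta
    u /= np.linalg.norm(u, axis=1, keepdims=True)
    v /= np.linalg.norm(v, axis=1, keepdims=True)
    return np.argsort(-(u @ v.T), axis=1)        # caption indices, best first

Retrieval quality can then be summarized by the rank of the ground-truth caption for each image, e.g. the median rank listed among the keywords below.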

Keywords

Training Image, Canonical Correlation Analysis, Convolutional Neural Network, Median Rank, Text Kernel

Notes

Acknowledgement

This work has been supported by the EU Chist-Era Visual Sense project (EPSRC grant EP/K01904X/1).


Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

Centre for Vision, Speech and Signal Processing, University of Surrey, Guildford, UK
