Computer Vision -- ACCV 2014, pp. 613-627
Leveraging High Level Visual Information for Matching Images and Captions
Abstract
In this paper we investigate the problem of matching images and captions. We exploit kernel canonical correlation analysis (KCCA) to learn a similarity between images and texts, and we propose methods to build improved visual and text kernels. The visual kernels are based on visual classifiers that use the responses of a deep convolutional neural network as features, and the text kernel improves the Bag-of-Words (BoW) representation by learning a vision-based lexical similarity between words. We consider two application scenarios: one where only an external image set weakly related to the evaluation dataset is available for training the visual classifiers, and one where visual data closely related to the evaluation set can be used. We evaluate our visual and text kernels on a large, publicly available benchmark, where we show that our proposed methods substantially improve upon the state-of-the-art.
Keywords
Training Image · Canonical Correlation Analysis · Convolutional Neural Network · Median Rank · Text Kernel
Acknowledgement
This work has been supported by the EU CHIST-ERA EPSRC EP/K01904X/1 Visual Sense project.