
Attributes as Semantic Units Between Natural Language and Visual Recognition

  • Marcus Rohrbach
Chapter
Part of the Advances in Computer Vision and Pattern Recognition book series (ACVPR)

Abstract

Impressive progress has been made in the fields of computer vision and natural language processing. However, it remains a challenge to find the best point of interaction for these very different modalities. In this chapter, we discuss how attributes allow us to exchange information between the two modalities and in this way lead to an interaction on a semantic level. Specifically, we discuss how attributes allow us to use knowledge mined from language resources to recognize novel visual categories, how we can generate sentence descriptions of images and videos, how we can ground natural language in visual content, and finally, how we can answer natural language questions about images.
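
The central idea summarized above is that attributes act as an intermediate semantic layer: per-attribute classifiers are trained on images, while class-attribute associations can be mined from language resources, allowing novel (unseen) categories to be recognized. The Python snippet below is a minimal, hypothetical sketch of this zero-shot scheme (in the spirit of direct attribute prediction); all attribute names, class names, and numbers are purely illustrative and not taken from the chapter.

    import numpy as np

    # Hypothetical attribute probabilities predicted by per-attribute classifiers
    # for one test image (e.g. "striped", "four-legged", "white").
    p_attr = np.array([0.9, 0.8, 0.7])

    # Class-attribute associations for unseen classes, e.g. mined from language
    # resources (1 = class has the attribute, 0 = it does not). Illustrative only.
    class_attr = {
        "zebra":      np.array([1, 1, 0]),
        "polar bear": np.array([0, 1, 1]),
    }

    def zero_shot_score(p, a):
        # Probability that every attribute matches the class signature,
        # assuming independent attribute predictions.
        return np.prod(np.where(a == 1, p, 1.0 - p))

    scores = {c: zero_shot_score(p_attr, a) for c, a in class_attr.items()}
    print(max(scores, key=scores.get))  # -> "zebra" for these illustrative values

The test image is assigned to the unseen class whose attribute signature best explains the predicted attribute probabilities, without any training images of that class.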

Keywords

Visual Recognition · Semantic Level · Statistical Machine Translation · Composite Activity · Language Resource

Acknowledgements

I would like to thank all my coauthors, especially those whose publications are presented in this chapter, namely Sikandar Amin, Jacob Andreas, Mykhaylo Andriluka, Trevor Darrell, Sandra Ebert, Jiashi Feng, Annemarie Friedrich, Iryna Gurevych, Lisa Anne Hendricks, Ronghang Hu, Dan Klein, Raymond Mooney, Manfred Pinkal, Wei Qiu, Michaela Regneri, Anna Rohrbach, Kate Saenko, Michael Stark, Bernt Schiele, György Szarvas, Stefan Thater, Ivan Titov, Subhashini Venugopalan, and Huazhe Xu. Marcus Rohrbach was supported by a fellowship within the FITweltweit-Program of the German Academic Exchange Service (DAAD).


Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  1. UC Berkeley EECS and ICSI, Berkeley, USA
