International Journal of Computer Vision, Volume 125, Issue 1–3, pp. 110–135

Ask Your Neurons: A Deep Learning Approach to Visual Question Answering

Abstract

We propose a Deep Learning approach to the visual question answering task, where machines answer questions about real-world images. By combining the latest advances in image representation and natural language processing, we propose Ask Your Neurons, a scalable, jointly trained, end-to-end formulation of this problem. In contrast to previous efforts, we face a multi-modal problem where the language output (answer) is conditioned on visual and natural language inputs (image and question). We evaluate our approaches on the DAQUAR as well as the VQA dataset, where we also report various baselines, including an analysis of how much information is contained in the language part alone. To study human consensus, we propose two novel metrics and collect additional answers which extend the original DAQUAR dataset to DAQUAR-Consensus. Finally, we evaluate a rich set of design choices for how to encode, combine, and decode information in our proposed Deep Learning formulation.
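
To make the multi-modal conditioning concrete, the sketch below shows one common way to wire up such a model: a CNN encodes the image, an LSTM encodes the question, and an answer is predicted from the fused representation. This is a minimal illustrative sketch only, not the paper's exact architecture: the ResNet-18 encoder, the layer dimensions, and the classification-over-a-fixed-answer-vocabulary head are assumptions introduced for this example.

```python
# Minimal sketch of a jointly trained CNN+LSTM visual question answering model,
# in the spirit of the end-to-end formulation described in the abstract.
# All module choices and dimensions are illustrative assumptions, not the
# paper's exact configuration.
import torch
import torch.nn as nn
import torchvision.models as models

class VqaSketch(nn.Module):
    def __init__(self, vocab_size, num_answers, embed_dim=300, hidden_dim=512):
        super().__init__()
        # Image encoder: a CNN whose final classifier is replaced by an
        # identity mapping, so it yields a global image feature vector.
        cnn = models.resnet18()  # pretrained weights omitted here for brevity
        feat_dim = cnn.fc.in_features
        cnn.fc = nn.Identity()
        self.cnn = cnn
        # Question encoder: word embeddings fed to an LSTM; the final hidden
        # state summarizes the question.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        # Answer head: the answer is conditioned on both modalities by
        # concatenating image and question features.
        self.classifier = nn.Linear(feat_dim + hidden_dim, num_answers)

    def forward(self, image, question_tokens):
        img_feat = self.cnn(image)                        # (B, feat_dim)
        _, (h_n, _) = self.lstm(self.embed(question_tokens))
        q_feat = h_n[-1]                                  # (B, hidden_dim)
        joint = torch.cat([img_feat, q_feat], dim=1)      # multi-modal fusion
        return self.classifier(joint)                     # answer scores

# Example forward pass with dummy data.
model = VqaSketch(vocab_size=10000, num_answers=1000)
scores = model(torch.randn(2, 3, 224, 224), torch.randint(0, 10000, (2, 12)))
print(scores.shape)  # torch.Size([2, 1000])
```

Note that predicting an answer from a fixed vocabulary is only one design choice; the paper also considers generating multi-word answers, which would replace the classifier head with a sequential decoder.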

Keywords

Computer vision · Scene understanding · Deep learning · Natural language processing · Visual Turing test · Visual question answering

Acknowledgements

Marcus Rohrbach was supported by a fellowship within the FITweltweit program of the German Academic Exchange Service (DAAD). The project was supported in part by the Collaborative Research Center (CRC) 1223 of the German Research Foundation (DFG).

Copyright information

© Springer Science+Business Media, LLC 2017

Authors and Affiliations

  • Mateusz Malinowski (1)
  • Marcus Rohrbach (2)
  • Mario Fritz (1)

  1. Max Planck Institute for Informatics, Saarland Informatics Campus, Saarbrücken, Germany
  2. UC Berkeley EECS, Berkeley, USA
