Ask Your Neurons: A Deep Learning Approach to Visual Question Answering

Abstract

We propose a Deep Learning approach to the visual question answering task, in which machines answer questions about real-world images. Combining the latest advances in image representation and natural language processing, we propose Ask Your Neurons, a scalable, jointly trained, end-to-end formulation of this problem. In contrast to previous efforts, we face a multi-modal problem where the language output (answer) is conditioned on visual and natural language inputs (image and question). We evaluate our approaches on the DAQUAR as well as the VQA dataset, where we also report various baselines, including an analysis of how much information is contained in the language part alone. To study human consensus, we propose two novel metrics and collect additional answers that extend the original DAQUAR dataset to DAQUAR-Consensus. Finally, we evaluate a rich set of design choices for how to encode, combine, and decode information in our proposed Deep Learning formulation.
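The core idea of conditioning the answer on both image and question can be sketched as follows: a global CNN image representation is fed, together with each question word embedding, into a recurrent network, and answers are scored from the final hidden state. This is a minimal NumPy sketch with toy dimensions and random weights; all sizes, names, and the single-layer LSTM cell are illustrative assumptions, not the paper's trained model.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class TinyLSTM:
    """A minimal LSTM cell (hypothetical sizes, random init)."""
    def __init__(self, input_dim, hidden_dim):
        # One stacked weight matrix for the input, forget, output, and cell gates.
        self.W = rng.normal(0.0, 0.1, (4 * hidden_dim, input_dim + hidden_dim))
        self.b = np.zeros(4 * hidden_dim)
        self.hidden_dim = hidden_dim

    def step(self, x, h, c):
        z = self.W @ np.concatenate([x, h]) + self.b
        H = self.hidden_dim
        i, f, o = sigmoid(z[:H]), sigmoid(z[H:2*H]), sigmoid(z[2*H:3*H])
        g = np.tanh(z[3*H:])
        c = f * c + i * g          # update the cell state
        h = o * np.tanh(c)         # expose the gated hidden state
        return h, c

def answer_logits(image_feat, question_embs, lstm, W_out):
    """Feed [CNN image feature; word embedding] into the LSTM at every
    time step, then score answer classes from the final hidden state."""
    h = np.zeros(lstm.hidden_dim)
    c = np.zeros(lstm.hidden_dim)
    for w in question_embs:
        h, c = lstm.step(np.concatenate([image_feat, w]), h, c)
    return W_out @ h

# Toy dimensions: 8-d stand-in "CNN" feature, 6-d word embeddings,
# 16-d hidden state, 5 candidate answers.
img = rng.normal(size=8)
question = [rng.normal(size=6) for _ in range(4)]  # a 4-word question
lstm = TinyLSTM(input_dim=8 + 6, hidden_dim=16)
W_out = rng.normal(0.0, 0.1, (5, 16))
logits = answer_logits(img, question, lstm, W_out)
print(logits.shape)  # (5,)
```

In the actual system the image feature would come from a pretrained CNN and the whole pipeline would be trained jointly end-to-end; this sketch only shows how the two modalities are combined at each recurrent step.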


Notes

  1. http://mpii.de/visual_turing_test.

  2. https://github.com/mateuszmalinowski/visual_turing_test-tutorial.


Acknowledgements

Marcus Rohrbach was supported by a fellowship within the FITweltweit-Program of the German Academic Exchange Service (DAAD). The project was in part supported by the Collaborative Research Center (CRC) 1223 from the German Research Foundation (DFG).

Author information

Corresponding author

Correspondence to Marcus Rohrbach.

Additional information

Communicated by Rene Vidal, Katsushi Ikeuchi, Josef Sivic, Christoph Schnoerr.

About this article

Cite this article

Malinowski, M., Rohrbach, M. & Fritz, M. Ask Your Neurons: A Deep Learning Approach to Visual Question Answering. Int J Comput Vis 125, 110–135 (2017). https://doi.org/10.1007/s11263-017-1038-2

Keywords

  • Computer vision
  • Scene understanding
  • Deep learning
  • Natural language processing
  • Visual Turing test
  • Visual question answering