VQA-E: Explaining, Elaborating, and Enhancing Your Answers for Visual Questions

  • Qing Li
  • Qingyi Tao
  • Shafiq Joty
  • Jianfei Cai
  • Jiebo Luo
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11211)


Most existing works in visual question answering (VQA) are dedicated to improving the accuracy of predicted answers while disregarding explanations. We argue that the explanation for an answer is as important as, or even more important than, the answer itself, since it makes the question-answering process more understandable and traceable. To this end, we propose a new task, VQA-E (VQA with Explanation), in which a model is required to generate an explanation along with its predicted answer. We first construct a new dataset and then frame the VQA-E problem in a multi-task learning architecture. Our VQA-E dataset is automatically derived from the VQA v2 dataset by intelligently exploiting the available captions. We also conduct a user study to validate the quality of the synthesized explanations. We quantitatively show that the additional supervision from explanations not only produces insightful textual sentences to justify the answers, but also improves the performance of answer prediction. Our model outperforms the state-of-the-art methods by a clear margin on the VQA v2 dataset.


Keywords: Visual question answering · Model with explanation



We thank Qianyi Wu and others for helpful feedback on the user study. This research is partially supported by an NTU-CoE Grant and the Data Science & Artificial Intelligence Research Centre@NTU (DSAIR). Jiebo Luo would like to thank the support of Adobe and NSF Award #1704309.



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Qing Li (1)
  • Qingyi Tao (2, 3)
  • Shafiq Joty (2)
  • Jianfei Cai (2)
  • Jiebo Luo (4)

  1. University of Science and Technology of China, Hefei, China
  2. Nanyang Technological University, Singapore
  3. NVIDIA AI Technology Center, Westford, USA
  4. University of Rochester, Rochester, USA
