International Journal of Computer Vision

, Volume 127, Issue 10, pp 1385–1412 | Cite as

Video Question Answering with Spatio-Temporal Reasoning

  • Yunseok Jang
  • Yale Song
  • Chris Dongjoo Kim
  • Youngjae Yu
  • Youngjin Kim
  • Gunhee KimEmail author


Vision and language understanding has emerged as a subject undergoing intense study in Artificial Intelligence. Among many tasks in this line of research, visual question answering (VQA) has been one of the most successful ones, where the goal is to learn a model that understands visual content at region-level details and finds their associations with pairs of questions and answers in the natural language form. Despite the rapid progress in the past few years, most existing work in VQA have focused primarily on images. In this paper, we focus on extending VQA to the video domain and contribute to the literature in three important ways. First, we propose three new tasks designed specifically for video VQA, which require spatio-temporal reasoning from videos to answer questions correctly. Next, we introduce a new large-scale dataset for video VQA named TGIF-QA that extends existing VQA work with our new tasks. Finally, we propose a dual-LSTM based approach with both spatial and temporal attention and show its effectiveness over conventional VQA techniques through empirical evaluations.


VQA Spatio-temporal reasoning Large-scale video QA dataset Spatial and temporal attention 



This work was supported by IITP Grant (No. 2019-0-01082, SW StarLab) (No.2017-0-01772, Video Turing Test), Brain Research Program through the NRF (2017M3C7A1047860) funded by the Korea government (MSIT) and Academic Research Program in Yahoo Research. Gunhee Kim is the corresponding author.


  1. Agrawal, A., Batra, D., Parikh, D., & Kembhavi, A. (2018). Don’t just assume; look and answer: Overcoming priors for visual question answering. In: CVPRGoogle Scholar
  2. Andreas, J., Rohrbach, M., Darrel, T., & Klein, D. (2016a). Learning to compose neural networks for question answering. In: NAACLGoogle Scholar
  3. Andreas, J., Rohrbach, M., Darrell, T., & Klein, D. (2016b). Neural module networks. In: CVPRGoogle Scholar
  4. Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Zitnick, C.L., & Parikh, D. (2015). VQA: Visual question answering. In: ICCVGoogle Scholar
  5. Ba, J.L., Kiros, J.R., & Hinton, G.E. (2016). Layer normalization. In: arXiv:1607.06450
  6. Bahdanau, D., Cho, K., & Bengio, Y. (2015). Neural machine translation by jointly learning to align and translate. In: ICLRGoogle Scholar
  7. Bakhshi, S., Shamma, D.A., Kennedy, L., Song, Y., de Juan, P., & Kaye, J.J. (2016). Fast, Cheap, and Good—Why animated GIFs engage us. In: CHIGoogle Scholar
  8. Chomsky, N. (1971). Conditions on transformations. In: Indiana University Linguistics ClubGoogle Scholar
  9. Daiber, J., Jakob, M., Hokamp, C., & Mendes, P.N. (2013). Improving efficiency and accuracy in multilingual entity extraction. In: I-SemanticsGoogle Scholar
  10. Das, A., Kottur, S., Gupta, K., Singh, A., Yadav, D., Moura, J.M.F., Parikh, D., & Batra, D. (2017). Visual dialog. In: CVPRGoogle Scholar
  11. Davis, E., & Marcus, G. (2015). Commonsense reasoning and commonsense knowledge in artificial intelligence. CACM, 58, 92–103.CrossRefGoogle Scholar
  12. Denkowski, M., & Lavie, A. (2011). Meteor universal: Language specific translation evaluation for any target language. In: EMNLPGoogle Scholar
  13. Farneback, G. (2003). Two-frame motion estimation based on polynomial expansion. In: SCIAGoogle Scholar
  14. Fellbaum, C. (1998). WordNet: An electronic lexical database. Cambridge: MIT Press.CrossRefzbMATHGoogle Scholar
  15. Fukui, A., Park, D.H., Yang, D., Rohrbach, A., Darrell, T., & Rohrbach, M. (2016). Multimodal compact bilinear pooling for visual question answering and visual grounding. In: EMNLPGoogle Scholar
  16. Gao, J., & Ge, R. (2018). Motion-appearance co-memory networks for video question answering. In: CVPRGoogle Scholar
  17. Gao, H., Mao, J., Zhou, J., Huang, Z., Wang, L., & Xu, W. (2015). Are you talking to a machine? Dataset and methods for multilingual image question answering. In: NIPSGoogle Scholar
  18. Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., & Parikh, D. (2017). Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In: CVPRGoogle Scholar
  19. Gygli, M., Song, Y., & Cao, L. (2016). Video2GIF: Automatic generation of animated GIFs from video. In: CVPRGoogle Scholar
  20. He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In: CVPRGoogle Scholar
  21. Isola, P., Lim, J.J., & Adelson, E.H. (2015). Discovering states and transformations in image collections. In: CVPRGoogle Scholar
  22. Jang, Y., Song, Y., Yu, Y., Kim, Y., & Kim, G. (2017). TGIF-QA: Toward spatio-temporal reasoning in visual question answering. In: CVPRGoogle Scholar
  23. Johnson, J., Hariharan, B., van der Maaten, L., Fei-Fei, L., Zitnick, C.L., & Girshick, R. (2017). CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In: CVPRGoogle Scholar
  24. Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Li, F.F. (2014). Large-scale video classification with convolutional neural networks. In: CVPRGoogle Scholar
  25. Kim, K., Heo, M., Choi, S., & Zhang, B. (2017). DeepStory: Video story QA by deep embedded memory networks. In: IJCAIGoogle Scholar
  26. Kim, J.H., Lee, S.W,, Kwak, D.H., Heo, M.O., Kim, J., Ha, J.W., & Zhang, B.T. (2016). Multimodal residual learning for visual QA. In: NIPSGoogle Scholar
  27. Kingma, D.P., & Ba, J.L. (2015). ADAM: A method for stochastic optimization. In: ICLRGoogle Scholar
  28. Kipper-Schuler, K. (2005). VerbNet: A broad-coverage, comprehensive verb lexicon. PhD thesis, UPenn CISGoogle Scholar
  29. Kiros, J.R., Zhu, Y., Salakhutdinov, R., Zemel, R.S., Torralba, A., Urtasun, R., & Fidler, S. (2015). Skip-thought vectors. In: NIPSGoogle Scholar
  30. Lei, J., Yu, L., Bansal, M., & Berg, T. (2018). TVQA: Localized, compositional video question answering. In: EMNLPGoogle Scholar
  31. Levi, G., & Hassner, T. (2015). Emotion recognition in the wild via convolutional neural networks and mapped binary patterns. In: ICMIGoogle Scholar
  32. Levy, O., & Wolf, L. (2015). Live repetition counting. In: ICCVGoogle Scholar
  33. Li, Y., Song, Y., Cao, L., Tetreault, J., Goldberg, L., Jaimes, A., & Luo, J. (2016). TGIF: A new dataset and benchmark on animated GIF description. In: CVPRGoogle Scholar
  34. Lin, X., & Parikh, D. (2015). Don’t just listen, use your imagination: Leveraging visual common sense for non-visual tasks. In: CVPRGoogle Scholar
  35. Lin, T.Y., Maire, M., Belongie, S.J., Hays, J., Perona, P., Ramanan, D., Dollár, P., & Zitnick, C.L. (2014). Microsoft COCO—common objects in context. In: ECCVGoogle Scholar
  36. Maharaj, T., Ballas, N., Rohrbach, A., Courville, A., & Pal, C. (2017). A dataset and exploration of models for understanding video data through fill-in-the-blank question-answering. In: CVPRGoogle Scholar
  37. Malinowski, M., & Fritz, M. (2014). A multi-world approach to question answering about real-world scenes based on uncertain input. In: NIPSGoogle Scholar
  38. Malinowski, M., Rohrbach, M., & Fritz, M. (2015). Ask your neurons: A neural-based approach to answering questions about images. In: ICCVGoogle Scholar
  39. Manning, C.D., Surdeanu, M., Bauer, J., Finkel, J., Bethard, S.J., & McClosky, D. (2014). The Stanford CoreNLP natural language processing toolkit. In: ACLGoogle Scholar
  40. Mun, J., Seo, P.H., Jung, I., & Han, B. (2017). MarioQA: Answering questions by watching gameplay videos. In: ICCVGoogle Scholar
  41. Na, S., Lee, S., Kim, J., & Kim, G. (2018). A read-write memory network for movie story understanding. In: ICCVGoogle Scholar
  42. Papineni, K., Roukos, S., Ward, T., & Zhu, W.J. (2002). Bleu: A method for automatic evaluation of machine translation. In: ACLGoogle Scholar
  43. Pennington, J., Socher, R., & Manning, C.D. (2014). Glove—Global vectors for word representation. In: EMNLPGoogle Scholar
  44. Pham, V., Bluche, T., Kermorvant, C., & Louradour, J. (2014). Dropout improves recurrent neural networks for handwriting recognition. In: ICFHRGoogle Scholar
  45. Piotr Bojanwoski, E.G., & Armand Joulin, T.M. (2017). Enriching word vectors with subword information. In: TACLGoogle Scholar
  46. Ren, M., Kiros, R., & Zemel, R. (2015). Exploring models and data for image question answering. In: NIPSGoogle Scholar
  47. Rohrbach, A., Torabi, A., Rohrbach, M., Tandon, N., Pal, C., Larochelle, H., et al. (2017). Movie description. IJCV, 123, 94–120.CrossRefGoogle Scholar
  48. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., et al. (2015). ImageNet large scale visual recognition challenge. IJCV, 115, 211–252.MathSciNetCrossRefGoogle Scholar
  49. Soomro, K., Zamir, A.R., & Shah, M. (2012). UCF101: A dataset of 101 human actions classes from videos in the wild. In: CRCV-TR-12-01Google Scholar
  50. Srivastava, N., Mansimov, E., & Salakhutdinov, R. (2015). Unsupervised learning of video representations using LSTMs. In: ICMLGoogle Scholar
  51. Sutskever, I., Vinyals, O., & Le, Q. (2014). Sequence to sequence learning with neural networks. In: NIPSGoogle Scholar
  52. Tapaswi, M., Zhu, Y., Stiefelhagen, R., Torralba, A., Urtasun, R., & Fidler, S. (2016). MovieQA: Understanding stories in movies through question-answering. In: CVPRGoogle Scholar
  53. Tran, D., Bourdev, L.D., Fergus, R., Torresani, L., & Paluri, M. (2015). Learning spatiotemporal features with 3D convolutional networks. In: ICCVGoogle Scholar
  54. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, l., Gomez, A., Kaiser, L., & Polosukhin, I. (2017). Attention is all you need. In: NIPSGoogle Scholar
  55. Venugopalan, S., Rohrbach, M., Donahue, J., Mooney, R., Darrell, T., & Saenko, K. (2015). Sequence to sequence—video to text. In: ICCVGoogle Scholar
  56. Wu, Q., Shen, C., Liu, L., Dick, A., & van den Hengel, A. (2016). What value do explicit high level concepts have in vision to language problems? In: CVPRGoogle Scholar
  57. Xie, S., Chen, S., Huang, J., Tu, Z., & Murphy, K. (2018). Rethinking spatiotemporal feature learning for video understanding. In: ECCVGoogle Scholar
  58. Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., Zemel, R., & Bengio, Y. (2015). Show, attend and tell: Neural image caption generation with visual attention. In: ICMLGoogle Scholar
  59. Yang, Z., Xiadong, H., Jianfeng, G., Li, D., & Smola, A.J. (2015). Stacked attention networks for image question answering. In: CVPRGoogle Scholar
  60. You, Q., Jin, H., Wang, Z., Fang, C., & Luo, J. (2016). Image captioning with semantic attention. In: CVPRGoogle Scholar
  61. Yu, Y., Ko, H., Choi, J., & Kim, G. (2017). End-to-end concept word detection for video captioning, retrieval, and question answering. In: CVPRGoogle Scholar
  62. Yu, L., Park, E., Berg, A.C., & Berg, T.L. (2015). Visual madlibs: Fill in the blank description generation and question answering. In: ICCVGoogle Scholar
  63. Zhao, Z., Yang, Q., Cai, D., He, X., & Zhuang, Y. (2017). Video question answering via hierarchical spatio-temporal attention networks. In: IJCAIGoogle Scholar
  64. Zhou, B., Lapedriza, A., Xiao, J., Torralba, A., & Oliva, A. (2014). Learning deep features for scene recognition using places database. In: NIPSGoogle Scholar
  65. Zhu, L., Xu, Z., Yang, Y., & Hauptmann, A. (2017). Uncovering temporal context for video question answering. IJCV, 124, 409–421.MathSciNetCrossRefGoogle Scholar
  66. Zhu, Y., Groth, O., Bernstein, M., & Fei-Fei, L. (2016). Visual7W: Grounded question answering in images. In: CVPRGoogle Scholar
  67. Zhu, Y., Kiros, R., Zemel, R., Salakhutdinov, R., Urtasun, R., Torralba, A., & Fidler, S. (2015). Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In: ICCVGoogle Scholar

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  1. 1.Seoul National UniversitySeoulSouth Korea
  2. 2.Microsoft AI & ResearchRedmondUSA

Personalised recommendations