Skip to main content
Log in

Video Question Answering with Spatio-Temporal Reasoning

  • Published:
International Journal of Computer Vision Aims and scope Submit manuscript


Vision and language understanding has emerged as a subject undergoing intense study in Artificial Intelligence. Among many tasks in this line of research, visual question answering (VQA) has been one of the most successful ones, where the goal is to learn a model that understands visual content at region-level details and finds their associations with pairs of questions and answers in the natural language form. Despite the rapid progress in the past few years, most existing work in VQA have focused primarily on images. In this paper, we focus on extending VQA to the video domain and contribute to the literature in three important ways. First, we propose three new tasks designed specifically for video VQA, which require spatio-temporal reasoning from videos to answer questions correctly. Next, we introduce a new large-scale dataset for video VQA named TGIF-QA that extends existing VQA work with our new tasks. Finally, we propose a dual-LSTM based approach with both spatial and temporal attention and show its effectiveness over conventional VQA techniques through empirical evaluations.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Subscribe and save

Springer+ Basic
EUR 32.99 /Month
  • Get 10 units per month
  • Download Article/Chapter or Ebook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
Subscribe now

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9
Fig. 10
Fig. 11
Fig. 12
Fig. 13
Fig. 14
Fig. 15

Similar content being viewed by others








  • Agrawal, A., Batra, D., Parikh, D., & Kembhavi, A. (2018). Don’t just assume; look and answer: Overcoming priors for visual question answering. In: CVPR

  • Andreas, J., Rohrbach, M., Darrel, T., & Klein, D. (2016a). Learning to compose neural networks for question answering. In: NAACL

  • Andreas, J., Rohrbach, M., Darrell, T., & Klein, D. (2016b). Neural module networks. In: CVPR

  • Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Zitnick, C.L., & Parikh, D. (2015). VQA: Visual question answering. In: ICCV

  • Ba, J.L., Kiros, J.R., & Hinton, G.E. (2016). Layer normalization. In: arXiv:1607.06450

  • Bahdanau, D., Cho, K., & Bengio, Y. (2015). Neural machine translation by jointly learning to align and translate. In: ICLR

  • Bakhshi, S., Shamma, D.A., Kennedy, L., Song, Y., de Juan, P., & Kaye, J.J. (2016). Fast, Cheap, and Good—Why animated GIFs engage us. In: CHI

  • Chomsky, N. (1971). Conditions on transformations. In: Indiana University Linguistics Club

  • Daiber, J., Jakob, M., Hokamp, C., & Mendes, P.N. (2013). Improving efficiency and accuracy in multilingual entity extraction. In: I-Semantics

  • Das, A., Kottur, S., Gupta, K., Singh, A., Yadav, D., Moura, J.M.F., Parikh, D., & Batra, D. (2017). Visual dialog. In: CVPR

  • Davis, E., & Marcus, G. (2015). Commonsense reasoning and commonsense knowledge in artificial intelligence. CACM, 58, 92–103.

    Article  Google Scholar 

  • Denkowski, M., & Lavie, A. (2011). Meteor universal: Language specific translation evaluation for any target language. In: EMNLP

  • Farneback, G. (2003). Two-frame motion estimation based on polynomial expansion. In: SCIA

  • Fellbaum, C. (1998). WordNet: An electronic lexical database. Cambridge: MIT Press.

    Book  MATH  Google Scholar 

  • Fukui, A., Park, D.H., Yang, D., Rohrbach, A., Darrell, T., & Rohrbach, M. (2016). Multimodal compact bilinear pooling for visual question answering and visual grounding. In: EMNLP

  • Gao, J., & Ge, R. (2018). Motion-appearance co-memory networks for video question answering. In: CVPR

  • Gao, H., Mao, J., Zhou, J., Huang, Z., Wang, L., & Xu, W. (2015). Are you talking to a machine? Dataset and methods for multilingual image question answering. In: NIPS

  • Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., & Parikh, D. (2017). Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In: CVPR

  • Gygli, M., Song, Y., & Cao, L. (2016). Video2GIF: Automatic generation of animated GIFs from video. In: CVPR

  • He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In: CVPR

  • Isola, P., Lim, J.J., & Adelson, E.H. (2015). Discovering states and transformations in image collections. In: CVPR

  • Jang, Y., Song, Y., Yu, Y., Kim, Y., & Kim, G. (2017). TGIF-QA: Toward spatio-temporal reasoning in visual question answering. In: CVPR

  • Johnson, J., Hariharan, B., van der Maaten, L., Fei-Fei, L., Zitnick, C.L., & Girshick, R. (2017). CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In: CVPR

  • Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Li, F.F. (2014). Large-scale video classification with convolutional neural networks. In: CVPR

  • Kim, K., Heo, M., Choi, S., & Zhang, B. (2017). DeepStory: Video story QA by deep embedded memory networks. In: IJCAI

  • Kim, J.H., Lee, S.W,, Kwak, D.H., Heo, M.O., Kim, J., Ha, J.W., & Zhang, B.T. (2016). Multimodal residual learning for visual QA. In: NIPS

  • Kingma, D.P., & Ba, J.L. (2015). ADAM: A method for stochastic optimization. In: ICLR

  • Kipper-Schuler, K. (2005). VerbNet: A broad-coverage, comprehensive verb lexicon. PhD thesis, UPenn CIS

  • Kiros, J.R., Zhu, Y., Salakhutdinov, R., Zemel, R.S., Torralba, A., Urtasun, R., & Fidler, S. (2015). Skip-thought vectors. In: NIPS

  • Lei, J., Yu, L., Bansal, M., & Berg, T. (2018). TVQA: Localized, compositional video question answering. In: EMNLP

  • Levi, G., & Hassner, T. (2015). Emotion recognition in the wild via convolutional neural networks and mapped binary patterns. In: ICMI

  • Levy, O., & Wolf, L. (2015). Live repetition counting. In: ICCV

  • Li, Y., Song, Y., Cao, L., Tetreault, J., Goldberg, L., Jaimes, A., & Luo, J. (2016). TGIF: A new dataset and benchmark on animated GIF description. In: CVPR

  • Lin, X., & Parikh, D. (2015). Don’t just listen, use your imagination: Leveraging visual common sense for non-visual tasks. In: CVPR

  • Lin, T.Y., Maire, M., Belongie, S.J., Hays, J., Perona, P., Ramanan, D., Dollár, P., & Zitnick, C.L. (2014). Microsoft COCO—common objects in context. In: ECCV

  • Maharaj, T., Ballas, N., Rohrbach, A., Courville, A., & Pal, C. (2017). A dataset and exploration of models for understanding video data through fill-in-the-blank question-answering. In: CVPR

  • Malinowski, M., & Fritz, M. (2014). A multi-world approach to question answering about real-world scenes based on uncertain input. In: NIPS

  • Malinowski, M., Rohrbach, M., & Fritz, M. (2015). Ask your neurons: A neural-based approach to answering questions about images. In: ICCV

  • Manning, C.D., Surdeanu, M., Bauer, J., Finkel, J., Bethard, S.J., & McClosky, D. (2014). The Stanford CoreNLP natural language processing toolkit. In: ACL

  • Mun, J., Seo, P.H., Jung, I., & Han, B. (2017). MarioQA: Answering questions by watching gameplay videos. In: ICCV

  • Na, S., Lee, S., Kim, J., & Kim, G. (2018). A read-write memory network for movie story understanding. In: ICCV

  • Papineni, K., Roukos, S., Ward, T., & Zhu, W.J. (2002). Bleu: A method for automatic evaluation of machine translation. In: ACL

  • Pennington, J., Socher, R., & Manning, C.D. (2014). Glove—Global vectors for word representation. In: EMNLP

  • Pham, V., Bluche, T., Kermorvant, C., & Louradour, J. (2014). Dropout improves recurrent neural networks for handwriting recognition. In: ICFHR

  • Piotr Bojanwoski, E.G., & Armand Joulin, T.M. (2017). Enriching word vectors with subword information. In: TACL

  • Ren, M., Kiros, R., & Zemel, R. (2015). Exploring models and data for image question answering. In: NIPS

  • Rohrbach, A., Torabi, A., Rohrbach, M., Tandon, N., Pal, C., Larochelle, H., et al. (2017). Movie description. IJCV, 123, 94–120.

    Article  Google Scholar 

  • Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., et al. (2015). ImageNet large scale visual recognition challenge. IJCV, 115, 211–252.

    Article  MathSciNet  Google Scholar 

  • Soomro, K., Zamir, A.R., & Shah, M. (2012). UCF101: A dataset of 101 human actions classes from videos in the wild. In: CRCV-TR-12-01

  • Srivastava, N., Mansimov, E., & Salakhutdinov, R. (2015). Unsupervised learning of video representations using LSTMs. In: ICML

  • Sutskever, I., Vinyals, O., & Le, Q. (2014). Sequence to sequence learning with neural networks. In: NIPS

  • Tapaswi, M., Zhu, Y., Stiefelhagen, R., Torralba, A., Urtasun, R., & Fidler, S. (2016). MovieQA: Understanding stories in movies through question-answering. In: CVPR

  • Tran, D., Bourdev, L.D., Fergus, R., Torresani, L., & Paluri, M. (2015). Learning spatiotemporal features with 3D convolutional networks. In: ICCV

  • Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, l., Gomez, A., Kaiser, L., & Polosukhin, I. (2017). Attention is all you need. In: NIPS

  • Venugopalan, S., Rohrbach, M., Donahue, J., Mooney, R., Darrell, T., & Saenko, K. (2015). Sequence to sequence—video to text. In: ICCV

  • Wu, Q., Shen, C., Liu, L., Dick, A., & van den Hengel, A. (2016). What value do explicit high level concepts have in vision to language problems? In: CVPR

  • Xie, S., Chen, S., Huang, J., Tu, Z., & Murphy, K. (2018). Rethinking spatiotemporal feature learning for video understanding. In: ECCV

  • Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., Zemel, R., & Bengio, Y. (2015). Show, attend and tell: Neural image caption generation with visual attention. In: ICML

  • Yang, Z., Xiadong, H., Jianfeng, G., Li, D., & Smola, A.J. (2015). Stacked attention networks for image question answering. In: CVPR

  • You, Q., Jin, H., Wang, Z., Fang, C., & Luo, J. (2016). Image captioning with semantic attention. In: CVPR

  • Yu, Y., Ko, H., Choi, J., & Kim, G. (2017). End-to-end concept word detection for video captioning, retrieval, and question answering. In: CVPR

  • Yu, L., Park, E., Berg, A.C., & Berg, T.L. (2015). Visual madlibs: Fill in the blank description generation and question answering. In: ICCV

  • Zhao, Z., Yang, Q., Cai, D., He, X., & Zhuang, Y. (2017). Video question answering via hierarchical spatio-temporal attention networks. In: IJCAI

  • Zhou, B., Lapedriza, A., Xiao, J., Torralba, A., & Oliva, A. (2014). Learning deep features for scene recognition using places database. In: NIPS

  • Zhu, L., Xu, Z., Yang, Y., & Hauptmann, A. (2017). Uncovering temporal context for video question answering. IJCV, 124, 409–421.

    Article  MathSciNet  Google Scholar 

  • Zhu, Y., Groth, O., Bernstein, M., & Fei-Fei, L. (2016). Visual7W: Grounded question answering in images. In: CVPR

  • Zhu, Y., Kiros, R., Zemel, R., Salakhutdinov, R., Urtasun, R., Torralba, A., & Fidler, S. (2015). Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In: ICCV

Download references


This work was supported by IITP Grant (No. 2019-0-01082, SW StarLab) (No.2017-0-01772, Video Turing Test), Brain Research Program through the NRF (2017M3C7A1047860) funded by the Korea government (MSIT) and Academic Research Program in Yahoo Research. Gunhee Kim is the corresponding author.

Author information

Authors and Affiliations


Corresponding author

Correspondence to Gunhee Kim.

Additional information

Communicated by Christoph H. Lampert.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.



Table 7 shows the statistics of unique questions (Q), answers (A) and words (W) for each task. Near half of the question sentences in each task are unique, except for the Repeating Action task. This is because the templates used for question generation allow only the variation in the subject of the question sentences. However, since each question consists of a pair of (query sentence, video) (not just question sentence only) and no identical video is used for different questions, all questions are virtually unique to one another in our dataset. Note that the unique answers for the Repetition Action task is 10 because it allows a limited number of answers only (i.e.0 or 2 10+).

Figures 161718 and 19 shows screenshots of the instructions and tasks for the Repletion and State Transition. They are the actual interfaces for the workers of Amazon Mechanical Turk.

Fig. 16
figure 16

Instructions for the Repetition task

Fig. 17
figure 17

The main task page for the Repetition task

Fig. 18
figure 18

Instructions for the State Transition task

Fig. 19
figure 19

The main task page the State Transition task

Fig. 20
figure 20

The question/answer word distribution of each task in the TGIF-QA. We select verbs for video questions and nouns for frame questions. The inner and outer labels for each task indicates (verbs, answers) for the Repetition Count, (repetition counts, answers) for the Repeating Action, (verbs, answers) for the State Transition and (nouns, answer) for the FrameQA

In Fig. 20, multi-pie graphs display the question/answer word distribution of each task in TGIF-QA. The graphs show that each task includes a diverse set of words and thus it is hard for models to take advantage of any bias in data.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Jang, Y., Song, Y., Kim, C.D. et al. Video Question Answering with Spatio-Temporal Reasoning. Int J Comput Vis 127, 1385–1412 (2019).

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: