
Uncovering the Temporal Context for Video Question Answering

Published in: International Journal of Computer Vision

Abstract

In this work, we introduce Video Question Answering in the temporal domain, where the goal is to infer the past, describe the present, and predict the future. We present an encoder–decoder approach using Recurrent Neural Networks to learn the temporal structure of videos, and introduce a dual-channel ranking loss to answer multiple-choice questions. We further explore a finer-grained understanding of video content through "fill-in-the-blank" questions, and collect our Video Context QA dataset, which consists of 109,895 video clips with a total duration of more than 1,000 hours drawn from the existing TACoS, MPII-MD and MEDTest 14 datasets. In addition, 390,744 corresponding questions are generated from the annotations. Extensive experiments demonstrate that our approach significantly outperforms the compared baselines.
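
The abstract names the main components but not their exact form. The sketch below is a minimal, illustrative PyTorch implementation of the general idea: GRU encoders embed the video frames and each candidate answer into a shared space, and a two-channel margin ranking loss pushes the correct answer above distractors. Every detail here (module names, feature dimensions, the margin, and the precise definition of the two channels) is an assumption made for illustration; it is not the authors' implementation.

```python
# Minimal sketch (PyTorch): sequence encoders plus a two-channel margin ranking
# loss for multiple-choice video QA. All names and hyper-parameters below are
# illustrative assumptions, not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SequenceEncoder(nn.Module):
    """Encode a sequence (video frame features or word embeddings) with a GRU."""

    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.gru = nn.GRU(input_dim, hidden_dim, batch_first=True)

    def forward(self, x):
        _, h = self.gru(x)                         # h: (1, batch, hidden_dim)
        return F.normalize(h.squeeze(0), dim=-1)   # unit-norm embedding


def dual_channel_ranking_loss(video_emb, pos_emb, neg_emb, margin=0.2):
    """Hinge ranking loss applied in two channels:
    (1) video-anchored: the correct answer must outrank a distractor answer;
    (2) answer-anchored: the true video must outrank a mismatched video
        (approximated here by shuffling videos within the batch)."""
    pos_score = (video_emb * pos_emb).sum(dim=-1)  # cosine similarity
    neg_score = (video_emb * neg_emb).sum(dim=-1)
    loss_video = F.relu(margin - pos_score + neg_score).mean()

    shuffled = video_emb[torch.randperm(video_emb.size(0))]
    neg_score_text = (shuffled * pos_emb).sum(dim=-1)
    loss_text = F.relu(margin - pos_score + neg_score_text).mean()
    return loss_video + loss_text


if __name__ == "__main__":
    batch, frames, words = 8, 20, 12
    video_enc = SequenceEncoder(input_dim=2048, hidden_dim=512)  # CNN frame features
    text_enc = SequenceEncoder(input_dim=300, hidden_dim=512)    # word vectors

    video = video_enc(torch.randn(batch, frames, 2048))
    pos = text_enc(torch.randn(batch, words, 300))   # correct answers
    neg = text_enc(torch.randn(batch, words, 300))   # distractor answers
    print(dual_channel_ranking_loss(video, pos, neg).item())
```

In the multiple-choice setting described in the abstract, the same scoring function would be applied to every candidate answer and the highest-scoring candidate selected.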

Notes

  1. https://wordnet.princeton.edu.

  2. http://www.nltk.org/.

References

  • Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Lawrence Zitnick, C., & Parikh, D. (2015). VQA: Visual question answering. In International conference on computer vision (ICCV).

  • Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R., & Ives, Z. (2007). Dbpedia: A nucleus for a web of open data. In The semantic web (pp. 722–735). Springer.

  • Bahdanau, D., Cho, K., & Bengio, Y. (2015). Neural machine translation by jointly learning to align and translate. In International conference on learning representations (ICLR).

  • Cho, K., Van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., & Bengio, Y. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the conference on empirical methods in natural language processing (EMNLP).

  • Chung, J., Gulcehre, C., Cho, K., & Bengio, Y. (2014). Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.

  • Collobert, R., Kavukcuoglu, K., & Farabet, C. (2011). Torch7: A matlab-like environment for machine learning. In Conference on neural information processing systems workshops (NIPS workshops).

  • Donahue, J., Anne Hendricks, L., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko, K., & Darrell, T. (2015). Long-term recurrent convolutional networks for visual recognition and description. In Conference on computer vision and pattern recognition (CVPR).

  • Elliott, D., & Keller, F. (2014). Comparing automatic evaluation measures for image description. In Proceedings of the annual meeting of the Association for Computational Linguistics (ACL).

  • Frome, A., Corrado, G. S., Shlens, J., Bengio, S., Dean, J., & Mikolov, T. (2013). DeViSE: A deep visual-semantic embedding model. In Conference on neural information processing systems (NIPS).

  • Gan, C., Yang, Y., Zhu, L., Zhao, D., & Zhuang, Y. (2016). Recognizing an action using its name: A knowledge-based approach. International Journal of Computer Vision (IJCV), 120, 61–77.

  • Gao, H., Mao, J., Zhou, J., Huang, Z., Wang, L., & Xu, W. (2015). Are you talking to a machine? Dataset and methods for multilingual image question answering. In Conference on neural information processing systems (NIPS).

  • Girshick, R., Donahue, J., Darrell, T., & Malik, J. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. In Conference on computer vision and pattern recognition (CVPR).

  • Gong, Y., Ke, Q., Isard, M., & Lazebnik, S. (2014). A multi-view embedding space for modeling internet images, tags, and their semantics. International Journal of Computer Vision (IJCV), 106(2), 210–233.

  • Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780.

  • Hodosh, M., Young, P., & Hockenmaier, J. (2013). Framing image description as a ranking task: Data, models and evaluation metrics. Journal of Artificial Intelligence Research (JAIR), 47, 853–899.

  • Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International conference on machine learning (ICML).

  • Jabri, A., Joulin, A., & van der Maaten, L. (2016). Revisiting visual question answering baselines. In European conference on computer vision (ECCV). Springer.

  • Karpathy, A., & Fei-Fei, L. (2015). Deep visual-semantic alignments for generating image descriptions. In Conference on computer vision and pattern recognition (CVPR).

  • Kiros, R., Zhu, Y., Salakhutdinov, R. R., Zemel, R., Urtasun, R., Torralba, A., & Fidler, S. (2015). Skip-thought vectors. In Conference on neural information processing systems (NIPS).

  • Klein, D., & Manning, C. D. (2003). Accurate unlexicalized parsing. In Proceedings of the annual meeting of the Association for Computational Linguistics (ACL).

  • Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Conference on neural information processing systems (NIPS).

  • Kulkarni, G., Premraj, V., Dhar, S., Li, S., Choi, Y., Berg, A. C., & Berg, T. L. (2011). Baby talk: Understanding and generating image descriptions. In Conference on computer vision and pattern recognition (CVPR).

  • Lebret, R., Pinheiro, P. O., & Collobert, R. (2015). Phrase-based image captioning. In International conference on machine learning (ICML).

  • Lin, T.-Y., Maire, M., Belongie, S., Perona, P., Ramanan, D., Hays, J., et al. (2014). Microsoft COCO: Common objects in context. In European conference on computer vision (ECCV).

  • Lin, X., & Parikh, D. (2015). Don’t just listen, use your imagination: Leveraging visual common sense for non-visual tasks. In Conference on computer vision and pattern recognition (CVPR).

  • Malinowski, M., & Fritz, M. (2014). A multi-world approach to question answering about real-world scenes based on uncertain input. In Conference on neural information processing systems (NIPS).

  • Malinowski, M., Rohrbach, M., & Fritz, M. (2015). Ask your neurons: A neural-based approach to answering questions about images. In International conference on computer vision (ICCV).

  • Mao, J., Huang, J., Toshev, A., Camburu, O., Yuille, A. L., & Murphy, K. (2015). Generation and comprehension of unambiguous object descriptions. In Conference on computer vision and pattern recognition (CVPR).

  • MED. (2014). TRECVID MED 14. http://nist.gov/itl/iad/mig/med14.cfm.

  • Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Conference on neural information processing systems (NIPS).

  • Ordonez, V., Han, X., Kuznetsova, P., Kulkarni, G., Mitchell, M., Yamaguchi, K., et al. (2015). Large scale retrieval and generation of image descriptions. International Journal of Computer Vision (IJCV), 119, 46–59.

  • Pan, P., Xu, Z., Yang, Y., Wu, F., & Zhuang, Y. (2016). Hierarchical recurrent neural encoder for video representation with application to captioning. In Conference on computer vision and pattern recognition (CVPR).

  • Papineni, K., Roukos, S., Ward, T., & Zhu, W. J. (2002). BLEU: A method for automatic evaluation of machine translation. In Proceedings of the annual meeting of the Association for Computational Linguistics (ACL).

  • Regneri, M., Rohrbach, M., Wetzel, D., Thater, S., Schiele, B., & Pinkal, M. (2013). Grounding action descriptions in videos. Transactions of the Association for Computational Linguistics (TACL), 1, 25–36.

  • Ren, M., Kiros, R., & Zemel, R. (2015). Exploring models and data for image question answering. In Conference on neural information processing systems (NIPS).

  • Rohrbach, A., Rohrbach, M., Tandon, N., & Schiele, B. (2015). A dataset for movie description. In Conference on computer vision and pattern recognition (CVPR).

  • Rohrbach, M., Qiu, W., Titov, I., Thater, S., Pinkal, M., & Schiele, B. (2013). Translating video content to natural language descriptions. In International conference on computer vision (ICCV).

  • Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., et al. (2015). ImageNet large scale visual recognition challenge. International Journal of Computer Vision (IJCV), 115(3), 211–252.

  • Simonyan, K., & Zisserman, A. (2014). Two-stream convolutional networks for action recognition in videos. In Conference on neural information processing systems (NIPS).

  • Srivastava, N., Mansimov, E., & Salakhutdinov, R. (2015). Unsupervised learning of video representations using LSTMs. In International conference on machine learning (ICML).

  • Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. In Conference on neural information processing systems (NIPS).

  • Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., et al. (2015). Going deeper with convolutions. In Conference on computer vision and pattern recognition (CVPR).

  • Tapaswi, M., Zhu, Y., Stiefelhagen, R., Torralba, A., Urtasun, R., & Fidler, S. (2016). MovieQA: Understanding stories in movies through question-answering. In Conference on computer vision and pattern recognition (CVPR). arXiv preprint arXiv:1512.02902.

  • Tieleman, T., & Hinton, G. (2012). Lecture 6.5-RMSprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning.

  • Tran, D., Bourdev, L., Fergus, R., Torresani, L., & Paluri, M. (2015). Learning spatiotemporal features with 3D convolutional networks. In International conference on computer vision (ICCV).

  • Tu, K., Meng, M., Lee, M. W., Choe, T. E., & Zhu, S. C. (2014). Joint video and text parsing for understanding events and answering queries. IEEE MultiMedia, 21(2), 42–70.

  • Van der Maaten, L., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research (JMLR), 9, 2579–2605.

  • Vedantam, R., Lawrence Zitnick, C., & Parikh, D. (2015). CIDEr: Consensus-based image description evaluation. In Conference on computer vision and pattern recognition (CVPR).

  • Venugopalan, S., Rohrbach, M., Donahue, J., Mooney, R., Darrell, T., & Saenko, K. (2015). Sequence to sequence—video to text. In International conference on computer vision (ICCV).

  • Vinyals, O., Toshev, A., Bengio, S., & Erhan, D. (2015). Show and tell: A neural image caption generator. In Conference on computer vision and pattern recognition (CVPR).

  • Vondrick, C., Pirsiavash, H., & Torralba, A. (2015). Anticipating the future by watching unlabeled video. Conference on computer vision and pattern recognition (CVPR).

  • Wang, H., Kläser, A., Schmid, C., & Liu, C. L. (2013). Dense trajectories and motion boundary descriptors for action recognition. International Journal of Computer Vision (IJCV), 103(1), 60–79.

  • Wu, Q., Wang, P., Shen, C., Dick, A., & van den Hengel, A. (2016). Ask me anything: Free-form visual question answering based on knowledge from external sources. In Conference on computer vision and pattern recognition (CVPR).

  • Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., & Bengio, Y. (2015a). Show, attend and tell: Neural image caption generation with visual attention. In International conference on machine learning (ICML).

  • Xu, Z., Yang, Y., & Hauptmann, A. G. (2015b). A discriminative CNN video representation for event detection. In Conference on computer vision and pattern recognition (CVPR).

  • Yan, Y., Nie, F., Li, W., Gao, C., Yang, Y., & Xu, D. (2016). Image classification by cross-media active learning with privileged information. IEEE Transactions on Multimedia, 18(12), 2494–2502.

  • Yang, Y., Xu, D., Nie, F., Luo, J., & Zhuang, Y. (2009). Ranking with local regression and global alignment for cross media retrieval. In Proceedings of the 17th ACM international conference on multimedia (pp. 175–184). ACM.

  • Yao, L., Torabi, A., Cho, K., Ballas, N., Pal, C., Larochelle, H., & Courville, A. (2015). Describing videos by exploiting temporal structure. In International conference on computer vision (ICCV).

  • Young, P., Lai, A., Hodosh, M., & Hockenmaier, J. (2014). From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics (TACL), 2, 67–78.

  • Yu, H., & Siskind, J. M. (2013). Grounded language learning from video described with sentences. In Proceedings of the annual meeting of the Association for Computational Linguistics (ACL).

  • Yu, L., Park, E., Berg, A. C., & Berg, T. L. (2015). Visual Madlibs: Fill in the blank image generation and question answering. In International conference on computer vision (ICCV).

  • Zaremba, W., Sutskever, I., & Vinyals, O. (2014). Recurrent neural network regularization. arXiv preprint arXiv:1409.2329.

  • Zhu, Y., Groth, O., Bernstein, M., & Fei-Fei, L. (2016). Visual7W: Grounded question answering in images. In Conference on computer vision and pattern recognition (CVPR).

  • Zhu, Y., Kiros, R., Zemel, R., Salakhutdinov, R., Urtasun, R., Torralba, A., & Fidler, S. (2015). Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In International conference on computer vision (ICCV).

Acknowledgements

Our work is partially supported by the Data to Decisions Cooperative Research Centre (www.d2dcrc.com.au), a Google Faculty Award, and an Australian Government Research Training Program Scholarship. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the TITAN X (Pascal) GPU used for this research.

Author information

Corresponding author

Correspondence to Yi Yang.

Additional information

Communicated by Bernt Schiele.

About this article

Cite this article

Zhu, L., Xu, Z., Yang, Y. et al. Uncovering the Temporal Context for Video Question Answering. Int J Comput Vis 124, 409–421 (2017). https://doi.org/10.1007/s11263-017-1033-7
