International Journal of Computer Vision, Volume 124, Issue 3, pp 409–421

Uncovering the Temporal Context for Video Question Answering

  • Linchao Zhu
  • Zhongwen Xu
  • Yi Yang
  • Alexander G. Hauptmann

Abstract

In this work, we introduce Video Question Answering in the temporal domain to infer the past, describe the present, and predict the future. We present an encoder–decoder approach using Recurrent Neural Networks to learn the temporal structure of videos and introduce a dual-channel ranking loss to answer multiple-choice questions. We explore a finer-grained understanding of video content through "fill-in-the-blank" questions, and we collect our Video Context QA dataset, consisting of 109,895 video clips with a total duration of more than 1,000 hours, drawn from the existing TACoS, MPII-MD, and MEDTest 14 datasets. In addition, 390,744 corresponding questions are generated from annotations. Extensive experiments demonstrate that our approach significantly outperforms the compared baselines.
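The abstract describes the approach only at a high level. As a rough illustration of the two components it names, the sketch below pairs a GRU video encoder with a margin-based ranking loss over candidate answer embeddings. All class names, dimensions, and the pairwise-margin formulation are assumptions for illustration; the paper's actual "dual-channel" model is not specified in the abstract and may differ.

```python
# Illustrative sketch only (PyTorch): a GRU encoder over per-frame CNN features,
# scored against candidate answer embeddings with a hinge-style ranking loss.
# Names, dimensions, and loss details are assumptions, not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class VideoEncoder(nn.Module):
    def __init__(self, feat_dim=2048, hidden_dim=512):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden_dim, batch_first=True)

    def forward(self, frame_feats):            # (B, T, feat_dim)
        _, h = self.gru(frame_feats)           # h: (1, B, hidden_dim)
        return h.squeeze(0)                    # (B, hidden_dim)


def ranking_loss(video_emb, pos_ans, neg_ans, margin=0.2):
    """The correct answer should score higher (cosine similarity) than each
    distractor by at least `margin`."""
    v = F.normalize(video_emb, dim=-1)
    pos = F.cosine_similarity(v, F.normalize(pos_ans, dim=-1), dim=-1)        # (B,)
    neg = F.cosine_similarity(v.unsqueeze(1),
                              F.normalize(neg_ans, dim=-1), dim=-1)           # (B, K)
    return F.relu(margin - pos.unsqueeze(1) + neg).mean()


if __name__ == "__main__":
    B, T, K = 4, 30, 3                         # batch, frames, distractors
    enc = VideoEncoder()
    video = enc(torch.randn(B, T, 2048))       # stand-in CNN frame features
    pos = torch.randn(B, 512)                  # stand-in sentence embeddings
    neg = torch.randn(B, K, 512)
    print(ranking_loss(video, pos, neg).item())
```

At inference time, a multiple-choice question would be answered by scoring each candidate's sentence embedding against the encoded video and picking the highest-scoring one.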

Keywords

Video sequence modeling · Video question answering · Video prediction · Cross-media


Copyright information

© Springer Science+Business Media, LLC 2017

Authors and Affiliations

  1. CAI, University of Technology Sydney, Sydney, Australia
  2. SCS, Carnegie Mellon University, Pittsburgh, USA