Instance-sequence reasoning for video question answering

  • Research Article
  • Published in: Frontiers of Computer Science

Abstract

Video question answering (Video QA) requires a thorough understanding of both video content and question language, as well as the grounding of textual semantics in the visual content of videos. To answer questions more accurately, each semantic entity in the question should be associated with a specific visual instance in the video frames, and the action or event mentioned in the question should be localized to its corresponding temporal slot. This makes Video QA a challenging task that demands reasoning over correlations between instances across temporal frames. In this paper, we propose an instance-sequence reasoning network for video question answering with instance grounding and temporal localization. In our model, both visual instances and textual representations are first embedded as graph nodes, which facilitates the integration of intra- and inter-modality information. We then propose graph causal convolution (GCC), which operates on the graph-structured sequence with a large receptive field to capture more causal connections, a property vital for visual grounding and instance-sequence reasoning. Finally, we evaluate our model on the TVQA+ dataset, which provides ground truth for instance grounding and temporal localization, as well as on three other Video QA datasets and three multimodal language processing datasets. Extensive experiments demonstrate the effectiveness and generalization ability of the proposed method; in particular, it outperforms state-of-the-art methods on these benchmarks.

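To make the idea of graph causal convolution concrete, below is a minimal PyTorch sketch of one GCC-style layer that combines a per-frame graph convolution over instance nodes with a dilated, causally padded convolution along the temporal axis. This is only an illustration under assumed tensor shapes, not the authors' released implementation; the class name GraphCausalConv and all parameter names are hypothetical.

    # Minimal sketch of a graph causal convolution (GCC) layer, assuming
    # node features X of shape (batch, time, nodes, dim) and a shared
    # row-normalized adjacency matrix A of shape (nodes, nodes).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class GraphCausalConv(nn.Module):
        def __init__(self, dim: int, kernel_size: int = 3, dilation: int = 1):
            super().__init__()
            # Spatial step: one graph-convolution weight shared across frames.
            self.graph_weight = nn.Linear(dim, dim)
            # Temporal step: dilated 1D convolution; left padding keeps it
            # causal, so frame t only attends to frames <= t.
            self.pad = (kernel_size - 1) * dilation
            self.temporal = nn.Conv1d(dim, dim, kernel_size, dilation=dilation)

        def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
            # x: (B, T, N, D) node features; adj: (N, N) adjacency.
            b, t, n, d = x.shape
            # Aggregate neighbor features within each frame: A @ X @ W.
            x = torch.einsum('ij,btjd->btid', adj, self.graph_weight(x))
            x = F.relu(x)
            # Fold nodes into the batch so Conv1d runs over the time axis only.
            x = x.permute(0, 2, 3, 1).reshape(b * n, d, t)
            x = self.temporal(F.pad(x, (self.pad, 0)))  # causal left padding
            return x.reshape(b, n, d, t).permute(0, 3, 1, 2)

Stacking such layers with exponentially growing dilations (1, 2, 4, ...) enlarges the temporal receptive field while preserving causality, which is the property the abstract highlights for instance-sequence reasoning.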


Acknowledgements

This work was supported by the National Natural Science Foundation of China (Grant Nos. 61876130, 61932009).

Author information

Corresponding author

Correspondence to Yahong Han.

Additional information

Rui Liu received his BS degree from Northeastern University, China in 2019. He is currently pursuing his MS degree with the College of Intelligence and Computing, Tianjin University, China. His research interests include computer vision and federated learning.

Yahong Han received his PhD degree from Zhejiang University, China in 2012. He is currently a professor with the College of Intelligence and Computing, Tianjin University, China. From November 2014 to November 2015, he visited Professor Bin Yu’s group at UC Berkeley, USA as a visiting scholar. His current research interests include multimedia analysis, computer vision and machine learning.

About this article

Cite this article

Liu, R., Han, Y. Instance-sequence reasoning for video question answering. Front. Comput. Sci. 16, 166708 (2022). https://doi.org/10.1007/s11704-021-1248-1

