Abstract
Video question answering (Video QA) requires a thorough understanding of both video content and question language, as well as grounding the textual semantics in the visual content of videos. Thus, to answer questions more accurately, not only should each semantic entity be associated with a specific visual instance in the video frames, but the action or event mentioned in the question should also be localized to the corresponding temporal segment. This makes Video QA a challenging task that demands reasoning over correlations between instances across temporal frames. In this paper, we propose an instance-sequence reasoning network for video question answering with instance grounding and temporal localization. In our model, both visual instances and textual representations are first embedded as graph nodes, which facilitates the integration of intra- and inter-modality information. We then propose graph causal convolution (GCC), which operates on graph-structured sequences with a large receptive field to capture more causal connections, a property vital for visual grounding and instance-sequence reasoning. Finally, we evaluate our model on the TVQA+ dataset, which provides ground truth for instance grounding and temporal localization, as well as on three other Video QA datasets and three multimodal language processing datasets. Extensive experiments demonstrate the effectiveness and generalization of the proposed method. In particular, our method outperforms state-of-the-art methods on these benchmarks.
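To make the idea of graph causal convolution concrete, the following is a minimal illustrative sketch, not the paper's implementation: per frame, node features are aggregated over a symmetrically normalized adjacency (as in a standard GCN layer), and then a dilated temporal convolution is applied causally, so the output at frame t depends only on frames at or before t. All function names, tensor shapes, and weight layouts here are assumptions made for illustration.

```python
import numpy as np

def graph_causal_conv(X, A, W_g, W_t, dilation=1):
    """Illustrative graph causal convolution (assumed interface).

    X: (T, N, D) node features over T frames, N graph nodes, D dims.
    A: (N, N) adjacency over visual-instance and text nodes.
    W_g: (D, D) graph-convolution weight.
    W_t: (K, D, D) temporal kernel with K causal taps.
    """
    T, N, D = X.shape
    # Symmetrically normalized adjacency with self-loops: D^-1/2 (A+I) D^-1/2
    A_hat = A + np.eye(N)
    deg = A_hat.sum(axis=1)
    A_norm = A_hat / np.sqrt(np.outer(deg, deg))
    # Graph aggregation within each frame (standard GCN propagation).
    H = np.einsum('ij,tjd,de->tie', A_norm, X, W_g)
    # Causal dilated temporal convolution: output at t only sees
    # frames t, t - dilation, ..., t - (K-1) * dilation.
    K = W_t.shape[0]
    Y = np.zeros_like(H)
    for t in range(T):
        for k in range(K):
            tk = t - k * dilation
            if tk >= 0:
                Y[t] += H[tk] @ W_t[k]
    return np.maximum(Y, 0.0)  # ReLU nonlinearity
```

Dilation enlarges the temporal receptive field without extra parameters, and the strictly one-sided (causal) kernel is what lets the model attribute an answer to evidence at or before a given frame.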
Acknowledgements
This work was supported by the National Natural Science Foundation of China (Grant Nos. 61876130, 61932009).
Author information
Rui Liu received his BS degree from Northeastern University, China in 2019. He is currently pursuing his MS degree with the College of Intelligence and Computing, Tianjin University, China. His research interests include computer vision and federated learning.
Yahong Han received his PhD degree from Zhejiang University, China in 2012. He is currently a professor with the College of Intelligence and Computing, Tianjin University, China. From November 2014 to November 2015, he visited Professor Bin Yu’s group at UC Berkeley, USA as a visiting scholar. His current research interests include multimedia analysis, computer vision and machine learning.
Cite this article
Liu, R., Han, Y. Instance-sequence reasoning for video question answering. Front. Comput. Sci. 16, 166708 (2022). https://doi.org/10.1007/s11704-021-1248-1