Abstract
This paper proposes a Video Graph Transformer (VGT) model for Video Question Answering (VideoQA). VGT's uniqueness is two-fold: 1) it designs a dynamic graph transformer module that encodes video by explicitly capturing the visual objects, their relations, and their dynamics for complex spatio-temporal reasoning; and 2) it exploits disentangled video and text Transformers for relevance comparison between the video and text to perform QA, instead of an entangled cross-modal Transformer for answer classification. Vision-text communication is done by additional cross-modal interaction modules. With more reasonable video encoding and a more reasonable QA solution, we show that VGT achieves much better performance than prior arts on VideoQA tasks that challenge dynamic relation reasoning, in the pretraining-free scenario. Its performance even surpasses that of models pretrained with millions of external data. We further show that VGT also benefits greatly from self-supervised cross-modal pretraining, yet with orders of magnitude less data. These results clearly demonstrate the effectiveness and superiority of VGT, and reveal its potential for more data-efficient pretraining. With comprehensive analyses and some heuristic observations, we hope that VGT can promote VQA research beyond coarse recognition/description towards fine-grained relation reasoning in realistic videos. Our code is available at https://github.com/sail-sg/VGT.
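The relevance-comparison formulation mentioned above can be pictured with a small amount of code. Below is a minimal, PyTorch-style sketch of the idea only, not the released VGT implementation: the video encoder and the answer-text encoder are kept disentangled, a lightweight cross-modal attention module lets the question interact with the video, and the answer is chosen by similarity between the pooled video and answer representations. All class names, pooling choices, and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DisentangledVideoTextQA(nn.Module):
    """Toy dual-encoder QA head: pick the answer whose text embedding best matches the video."""

    def __init__(self, dim=512, num_heads=8, num_layers=2):
        super().__init__()
        # Stand-in for the dynamic graph transformer over object-level video tokens.
        self.video_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, num_heads, batch_first=True), num_layers)
        # Stand-in for the text Transformer over answer-candidate tokens.
        self.text_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, num_heads, batch_first=True), num_layers)
        # Lightweight cross-modal interaction: question tokens attend to video tokens.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, video_tokens, question_tokens, answer_tokens):
        # video_tokens:    (B, N, D)      object/clip features
        # question_tokens: (B, Lq, D)     embedded question words
        # answer_tokens:   (B, A, La, D)  embedded candidate answers
        v = self.video_encoder(video_tokens)                             # (B, N, D)
        q_ctx, _ = self.cross_attn(question_tokens, v, v)                # question-conditioned video cue
        v_vec = F.normalize(v.mean(dim=1) + q_ctx.mean(dim=1), dim=-1)   # (B, D)

        B, A, La, D = answer_tokens.shape
        a = self.text_encoder(answer_tokens.reshape(B * A, La, D))
        a_vec = F.normalize(a.mean(dim=1).reshape(B, A, D), dim=-1)      # (B, A, D)

        # Relevance comparison: dot-product similarity; the highest-scoring answer wins.
        return torch.einsum('bd,bad->ba', v_vec, a_vec)                  # (B, A) similarity logits
```

In such a formulation, training would typically apply a cross-entropy or contrastive loss over the per-candidate similarity scores, rather than learning a classifier over a fixed answer vocabulary.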
Notes
1. The model demands less training data to achieve good performance.
2. We assume that the group of objects does not change within a short video clip (see the sketch below).
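As a loose illustration of how this assumption could be put to work (an assumption of ours for exposition, not the paper's exact procedure), the snippet below greedily links the K objects detected in a clip's anchor frame to detections in the remaining frames by appearance similarity, producing one feature sequence per object that could serve as a graph node over time. The function name, cosine-similarity criterion, and greedy matching scheme are all illustrative.

```python
import numpy as np

def link_objects(clip_feats):
    """clip_feats: list of (K, D) arrays, one per frame; frame 0 is the anchor frame."""
    anchor = clip_feats[0]                                      # (K, D)
    tracks = [anchor]
    for feats in clip_feats[1:]:
        # Cosine similarity between anchor objects and current-frame detections.
        a = anchor / np.linalg.norm(anchor, axis=1, keepdims=True)
        b = feats / np.linalg.norm(feats, axis=1, keepdims=True)
        sim = a @ b.T                                           # (K, K)
        # Greedy one-to-one matching on the flattened similarity matrix.
        match, used_rows, used_cols = {}, set(), set()
        for idx in np.argsort(-sim, axis=None):
            r, c = divmod(int(idx), sim.shape[1])
            if r not in used_rows and c not in used_cols:
                match[r] = c
                used_rows.add(r)
                used_cols.add(c)
            if len(match) == len(anchor):
                break
        tracks.append(feats[[match[r] for r in range(len(anchor))]])
    return np.stack(tracks, axis=1)                             # (K, T, D): one sequence per object
```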
Acknowledgements
This research is supported by the Sea-NExT joint Lab. The major work was done when Junbin was a research intern at Sea AI Lab. We sincerely thank Angela Yao as well as the anonymous reviewers for their thoughtful comments, which helped improve this work.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Xiao, J., Zhou, P., Chua, T.S., Yan, S. (2022). Video Graph Transformer for Video Question Answering. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) Computer Vision – ECCV 2022. Lecture Notes in Computer Science, vol. 13696. Springer, Cham. https://doi.org/10.1007/978-3-031-20059-5_3