Gated Multi-modal Fusion with Cross-modal Contrastive Learning for Video Question Answering

  • Conference paper
Artificial Neural Networks and Machine Learning – ICANN 2023 (ICANN 2023)

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 14260))


Video Question Answering (VideoQA) is a challenging task that requires a model to understand complex video data and to handle the wide variety of questions that can be asked about it. Existing approaches often produce ambiguous answer candidates with low relevance to the visual and auditory content of the video, which limits the performance of VideoQA systems. In this paper, we introduce a novel approach that combines multi-modal fusion with cross-modal contrastive learning to exploit multi-modal information and improve the relevance of answer candidates in VideoQA. First, we introduce a gated multi-modal fusion network that learns to combine different modalities, such as visual and speech information, according to their relevance to the question; this enriches the video representation and improves the accuracy of finding the correct answer. Second, we introduce cross-modal contrastive learning, which increases the similarity between positive example pairs (i.e., correct answers and their corresponding video clips) while decreasing the similarity between negative example pairs (i.e., incorrect answers and unpaired video clips). Specifically, we use three-way contrastive learning: between answers and video frames, between answers and audio, and between answers and cross-modal features. We evaluate the proposed approach on two benchmark audio-aware VideoQA datasets, AVQA and Music-AVQA, and compare it with several state-of-the-art methods. The results show that our approach significantly improves VideoQA performance and achieves new state-of-the-art results on both benchmarks.
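
The two components named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not give the gate equation or the loss, so the sigmoid gate over concatenated features, the InfoNCE-style objective, the temperature of 0.07, the equal weighting of the three terms, and all function and variable names below are assumptions chosen for illustration only.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gated_fusion(visual, speech, question, w_gate, b_gate):
    """Assumed gate form: a question-conditioned convex mix of two modalities.

    g = sigmoid([visual; speech; question] @ w_gate + b_gate)
    fused = g * visual + (1 - g) * speech
    """
    gate_in = np.concatenate([visual, speech, question], axis=-1)
    g = sigmoid(gate_in @ w_gate + b_gate)  # shape (batch, dim), values in (0, 1)
    return g * visual + (1.0 - g) * speech


def info_nce(answers, contexts, temperature=0.07):
    """InfoNCE-style loss (an assumption, not taken from the paper).

    Row i of `answers` is the positive for row i of `contexts`;
    every other row in the batch serves as a negative.
    """
    logits = l2_normalize(answers) @ l2_normalize(contexts).T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))


def three_way_loss(answer, frame, audio, fused):
    """Sum of answer-frame, answer-audio and answer-fused terms (equal weights assumed)."""
    return info_nce(answer, frame) + info_nce(answer, audio) + info_nce(answer, fused)


# Toy usage with random features standing in for encoder outputs.
rng = np.random.default_rng(0)
batch, dim = 4, 8
visual = rng.normal(size=(batch, dim))
speech = rng.normal(size=(batch, dim))
question = rng.normal(size=(batch, dim))
answer = rng.normal(size=(batch, dim))
w_gate = rng.normal(size=(3 * dim, dim)) * 0.1  # hypothetical gate weights
b_gate = np.zeros(dim)

fused = gated_fusion(visual, speech, question, w_gate, b_gate)
loss = three_way_loss(answer, visual, speech, fused)
print(fused.shape, loss)
```

In this sketch the gate yields an element-wise convex combination, so each fused feature lies between the corresponding visual and speech values. The contrastive term is small when each answer embedding is closest to its own paired context and large when it is closer to another item in the batch.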

C. Lyu and W. Li—Equal contribution.

Author information

Corresponding author

Correspondence to Tianbo Ji.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Lyu, C., Li, W., Ji, T., Zhou, L., Gurrin, C. (2023). Gated Multi-modal Fusion with Cross-modal Contrastive Learning for Video Question Answering. In: Iliadis, L., Papaleonidas, A., Angelov, P., Jayne, C. (eds) Artificial Neural Networks and Machine Learning – ICANN 2023. ICANN 2023. Lecture Notes in Computer Science, vol 14260. Springer, Cham.

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-44194-3

  • Online ISBN: 978-3-031-44195-0

  • eBook Packages: Computer Science (R0)
