Hierarchical Temporal Fusion of Multi-grained Attention Features for Video Question Answering

  • Shaoning Xiao
  • Yimeng Li
  • Yunan Ye
  • Long Chen
  • Shiliang Pu
  • Zhou Zhao
  • Jian Shao
  • Jun Xiao


This work addresses video question answering (VideoQA) with a novel model and a new open-ended VideoQA dataset. VideoQA is a challenging task in visual information retrieval that aims to generate an answer from the video content and the question. Ultimately, VideoQA is a video understanding task, and efficiently combining multi-grained representations is the key to understanding a video. Existing works mostly rely on overall frame-level visual understanding, which neglects finer-grained and temporal information inside the video, or combine the multi-grained representations simply by concatenation or addition. We therefore propose a multi-granularity temporal attention network that can locate the specific frames in a video that are holistically and locally related to the answer. We first learn mutual attention representations of the multi-grained visual content and the question. The mutually attended features are then combined hierarchically with a double-layer LSTM to generate the answer. Furthermore, we compare several multi-grained fusion configurations to demonstrate the advantage of this hierarchical architecture. The effectiveness of our model is demonstrated on a large-scale video question answering dataset built on the ActivityNet dataset.
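The core idea described above can be illustrated with a minimal NumPy sketch of question-guided temporal attention over two feature granularities. The feature dimensions, the random features, and the simple two-stream stacking below are illustrative assumptions, not the paper's implementation; in particular, the hierarchical double-layer LSTM fusion is abbreviated here to stacking the two attended context vectors.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def temporal_attention(feats, question_vec):
    """Question-guided temporal attention.

    feats: (T, d) per-timestep visual features (one granularity)
    question_vec: (d,) question representation
    Returns the attended (d,) summary and the (T,) attention weights.
    """
    scores = feats @ question_vec      # relevance of each timestep to the question
    weights = softmax(scores)          # attention distribution over time
    attended = weights @ feats         # weighted temporal summary
    return attended, weights

# Hypothetical setup: a frame-level (coarse) and an object-level (fine)
# feature stream, each attended by the same question vector.
rng = np.random.default_rng(0)
T, d = 8, 16
frames = rng.standard_normal((T, d))    # coarse frame-level features
objects = rng.standard_normal((T, d))   # finer-grained features
question = rng.standard_normal(d)

frame_ctx, w_f = temporal_attention(frames, question)
object_ctx, w_o = temporal_attention(objects, question)

# In the paper the two attended streams are fused hierarchically by a
# double-layer LSTM; here we simply stack them to show the interface.
fused = np.stack([frame_ctx, object_ctx])   # shape (2, d)
```

Each attention distribution sums to one over the T timesteps, so the model effectively searches for the frames most related to the question at each granularity before fusing them.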


Keywords: Video question answering · Multi-grained representation · Temporal co-attention



This work was supported by Zhejiang Natural Science Foundation (LR19F020002, LZ17F020001), National Natural Science Foundation of China (61572431), Key R&D Program of Zhejiang Province (2018C01006), Chinese Knowledge Center for Engineering Sciences and Technology and Joint Research Program of ZJU and Hikvision Research Institute.


Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  • Shaoning Xiao (1)
  • Yimeng Li (1)
  • Yunan Ye (1)
  • Long Chen (1)
  • Shiliang Pu (1)
  • Zhou Zhao (1)
  • Jian Shao (1)
  • Jun Xiao (1)

  1. Zhejiang University, Hangzhou, China
