Abstract
Video question answering (VideoQA) is the task of answering a natural language question about the content of a video. Existing methods that exploit fine-grained object information have achieved significant improvements; however, they either rely on costly external object detectors or fail to exploit the rich structure of videos. In this work, we propose to understand video along two dimensions: temporal and semantic. In semantic space, a video is organized hierarchically (pixels, objects, activities, events). In temporal space, a video can be viewed as a sequence of events, each containing multiple objects and activities. Based on this insight, we propose a reusable neural unit called recurrent contextual attention (RCA). RCA takes a 2D grid feature and conditional features as input and computes multiple high-order compositional semantic representations. We stack these units to build our hierarchy and use recurrent attention to generate diverse representations for different views of each subsequence. Without bells and whistles, our model achieves excellent performance on three VideoQA datasets (TGIF-QA, MSVD-QA, and MSRVTT-QA) using only grid features. Visualization results further validate the effectiveness of our method.
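The abstract describes the RCA unit only at a high level: it attends over a grid of visual features under a conditioning signal (e.g., a question embedding) and recurrently produces multiple compositional summaries. The following is a minimal NumPy sketch of that idea, not the authors' implementation; the step count, the additive context update, and all function names here are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def rca_unit(grid, cond, steps=3):
    """Sketch of a recurrent contextual attention unit (simplified assumption).

    grid: (N, d) array of flattened 2D grid features.
    cond: (d,) conditioning feature (e.g., a question embedding).
    Returns a list of `steps` attended summaries, one per recurrent step,
    each intended as a different compositional view of the grid.
    """
    n, d = grid.shape
    ctx = cond.copy()
    outputs = []
    for _ in range(steps):
        scores = grid @ ctx / np.sqrt(d)   # context-conditioned attention scores
        alpha = softmax(scores)            # attention distribution over grid cells
        summary = alpha @ grid             # weighted sum -> one compositional representation
        ctx = ctx + summary                # recurrent update: next step attends differently
        outputs.append(summary)
    return outputs

rng = np.random.default_rng(0)
grid = rng.normal(size=(49, 16))   # e.g., a 7x7 grid of 16-d features
cond = rng.normal(size=16)         # conditioning (question) feature
reps = rca_unit(grid, cond)
print(len(reps), reps[0].shape)
```

Stacking such units, with one level's summaries serving as the next level's inputs, would yield the pixel-to-event hierarchy the abstract describes; the real model presumably uses learned projections rather than raw dot products.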
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Zhou, F., Han, Y. (2022). Hierarchical Recurrent Contextual Attention Network for Video Question Answering. In: Fang, L., Povey, D., Zhai, G., Mei, T., Wang, R. (eds) Artificial Intelligence. CICAI 2022. Lecture Notes in Computer Science(), vol 13605. Springer, Cham. https://doi.org/10.1007/978-3-031-20500-2_23
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-20499-9
Online ISBN: 978-3-031-20500-2
eBook Packages: Computer Science (R0)