Abstract
Video question answering (VideoQA) is the task of answering a natural language question about the content of a video. Existing methods that exploit fine-grained object information have achieved significant improvements; however, they either rely on costly external object detectors or fail to exploit the rich structure of videos. In this work, we propose to understand video along two dimensions: temporal and semantic. In semantic space, a video is organized hierarchically (pixels, objects, activities, events). In temporal space, a video can be viewed as a sequence of events, each containing multiple objects and activities. Based on this insight, we propose a reusable neural unit called recurrent contextual attention (RCA). RCA takes a 2D grid feature and conditional features as input and computes multiple high-order compositional semantic representations. We stack these units to build our hierarchy and use recurrent attention to generate diverse representations for different views of each subsequence. Without bells and whistles, our model achieves excellent performance on three VideoQA datasets (TGIF-QA, MSVD-QA, and MSRVTT-QA) using only grid features. Visualization results further validate the effectiveness of our method.
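The abstract describes the RCA unit only at a high level: it attends over a grid of visual features under a conditioning signal (e.g., a question embedding) and recurrently produces multiple compositional summaries. The following is a minimal NumPy sketch of that idea, not the authors' implementation; the step count, the additive context update, and all function names here are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def rca_unit(grid, cond, steps=3):
    """Sketch of a recurrent contextual attention unit (simplified assumption).

    grid: (N, d) array of flattened 2D grid features.
    cond: (d,) conditioning feature (e.g., a question embedding).
    Returns a list of `steps` attended summaries, one per recurrent step,
    each intended as a different compositional view of the grid.
    """
    n, d = grid.shape
    ctx = cond.copy()
    outputs = []
    for _ in range(steps):
        scores = grid @ ctx / np.sqrt(d)   # context-conditioned attention scores
        alpha = softmax(scores)            # attention distribution over grid cells
        summary = alpha @ grid             # weighted sum -> one compositional representation
        ctx = ctx + summary                # recurrent update: next step attends differently
        outputs.append(summary)
    return outputs

rng = np.random.default_rng(0)
grid = rng.normal(size=(49, 16))   # e.g., a 7x7 grid of 16-d features
cond = rng.normal(size=16)         # conditioning (question) feature
reps = rca_unit(grid, cond)
print(len(reps), reps[0].shape)
```

Stacking such units, with one level's summaries serving as the next level's inputs, would yield the pixel-to-event hierarchy the abstract describes; the real model presumably uses learned projections rather than raw dot products.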
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Zhou, F., Han, Y. (2022). Hierarchical Recurrent Contextual Attention Network for Video Question Answering. In: Fang, L., Povey, D., Zhai, G., Mei, T., Wang, R. (eds) Artificial Intelligence. CICAI 2022. Lecture Notes in Computer Science(), vol 13605. Springer, Cham. https://doi.org/10.1007/978-3-031-20500-2_23
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-20499-9
Online ISBN: 978-3-031-20500-2
eBook Packages: Computer Science (R0)