Abstract
To understand movies, humans constantly reason over the dialogues and actions shown in specific scenes and relate them to the overall storyline already seen. Inspired by this behaviour, we design ROLL, a model for knowledge-based video story question answering that leverages three crucial aspects of movie understanding: dialog comprehension, scene reasoning, and storyline recalling. In ROLL, each of these tasks is in charge of extracting rich and diverse information by 1) processing scene dialogues, 2) generating unsupervised video scene descriptions, and 3) obtaining external knowledge in a weakly supervised fashion. To answer a given question correctly, the information generated by each cognition-inspired task is encoded via Transformers and fused through a modality weighting mechanism, which balances the information from the different sources. Exhaustive evaluation demonstrates the effectiveness of our approach, which yields a new state of the art on two challenging video question answering datasets: KnowIT VQA and TVQA+.
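To make the fusion step concrete, below is a minimal, hypothetical PyTorch sketch of a modality weighting mechanism of the kind the abstract describes. It assumes each branch (dialog, scene description, storyline knowledge) has already produced one score per candidate answer; the class and parameter names are illustrative and do not come from the authors' implementation.

```python
# A minimal sketch of modality-weighted score fusion (not the authors' code).
# Assumption: three Transformer-based branches each emit one logit per
# candidate answer, and a learned softmax weighting balances the sources.
import torch
import torch.nn as nn

class ModalityWeightedFusion(nn.Module):
    def __init__(self, num_modalities: int = 3):
        super().__init__()
        # One learnable scalar weight per modality
        # (dialog, scene description, external knowledge).
        self.weights = nn.Parameter(torch.zeros(num_modalities))

    def forward(self, branch_scores: torch.Tensor) -> torch.Tensor:
        # branch_scores: (batch, num_modalities, num_candidates)
        w = torch.softmax(self.weights, dim=0)            # balance the sources
        fused = (w[None, :, None] * branch_scores).sum(dim=1)
        return fused                                       # (batch, num_candidates)

# Toy usage: 3 modalities, 4 candidate answers per question.
fusion = ModalityWeightedFusion(num_modalities=3)
scores = torch.randn(2, 3, 4)   # e.g. logits from three Transformer branches
answer = fusion(scores).argmax(dim=-1)
```

The design choice illustrated here is that the fusion weights are learned jointly with the rest of the model, so a branch that is more reliable for a dataset (e.g., dialog for dialog-heavy questions) can dominate the final answer score without hand-tuning.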
Notes
- 1. 'Do androids dream of electric sheep?' (Philip K. Dick, 1968).
- 2.
- 3.
- 4. For example, https://bigbangtrans.wordpress.com/.
- 5. Boy, girl, guy, lady, man, person, player, woman.
- 6. For example, https://the-big-bang-theory.com/.
- 7. Generating video plot summaries automatically from the whole video story is a challenging task in itself and beyond the scope of this work. However, it is an interesting problem that we aim to study in future work.
- 8. In The Big Bang Theory, the longest summary contains 1,605 words.
Acknowledgement
This work was supported by a project commissioned by the New Energy and Industrial Technology Development Organization (NEDO), and by JSPS KAKENHI Nos. 18H03264 and 20K19822. We would also like to thank the anonymous reviewers for their insightful comments, which helped improve the paper.
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Garcia, N., Nakashima, Y. (2020). Knowledge-Based Video Question Answering with Unsupervised Scene Descriptions. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M. (eds) Computer Vision – ECCV 2020. ECCV 2020. Lecture Notes in Computer Science, vol 12363. Springer, Cham. https://doi.org/10.1007/978-3-030-58523-5_34
DOI: https://doi.org/10.1007/978-3-030-58523-5_34
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-58522-8
Online ISBN: 978-3-030-58523-5