
Knowledge-Based Video Question Answering with Unsupervised Scene Descriptions

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12363)

Abstract

To understand movies, humans constantly reason over the dialogues and actions shown in specific scenes and relate them to the overall storyline already seen. Inspired by this behaviour, we design ROLL, a model for knowledge-based video story question answering that leverages three crucial aspects of movie understanding: dialog comprehension, scene reasoning, and storyline recalling. In ROLL, each of these tasks is in charge of extracting rich and diverse information by 1) processing scene dialogues, 2) generating unsupervised video scene descriptions, and 3) obtaining external knowledge in a weakly supervised fashion. To answer a given question correctly, the information generated by each cognitively inspired task is encoded via Transformers and fused through a modality weighting mechanism, which balances the information from the different sources. Exhaustive evaluation demonstrates the effectiveness of our approach, which yields a new state-of-the-art on two challenging video question answering datasets: KnowIT VQA and TVQA+.
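As a rough illustration of the modality weighting idea described in the abstract, the following is a minimal sketch of fusing per-branch (dialog, scene, knowledge) Transformer outputs into answer scores with learned softmax weights. It assumes a PyTorch-style setup; the module, variable names, and exact weighting scheme are hypothetical and do not reproduce the ROLL implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityWeightedFusion(nn.Module):
    """Hypothetical sketch: score answer candidates from each branch's
    encoding and combine the per-branch scores with a learned weighting."""

    def __init__(self, hidden_dim: int, num_branches: int = 3):
        super().__init__()
        # One score head per branch, applied to that branch's pooled encoding.
        self.score_heads = nn.ModuleList(
            nn.Linear(hidden_dim, 1) for _ in range(num_branches)
        )
        # Unnormalized branch weights, turned into a distribution via softmax.
        self.branch_logits = nn.Parameter(torch.zeros(num_branches))

    def forward(self, branch_feats):
        # branch_feats: list of tensors, each (batch, num_answers, hidden_dim)
        scores = torch.stack(
            [head(feat).squeeze(-1) for head, feat in zip(self.score_heads, branch_feats)],
            dim=0,
        )  # (num_branches, batch, num_answers)
        weights = F.softmax(self.branch_logits, dim=0).view(-1, 1, 1)
        fused = (weights * scores).sum(dim=0)  # (batch, num_answers)
        return fused

# Usage sketch: pick the answer with the highest fused score.
# fused = fusion([dialog_feat, scene_feat, knowledge_feat])
# prediction = fused.argmax(dim=-1)
```

The softmax over branch weights is one simple way to let training balance how much each information source contributes; other schemes (e.g. question-conditioned weights) are equally plausible under this description.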

Keywords

Video question answering · Video description · Knowledge bases

Notes

Acknowledgement

This work was supported by a project commissioned by the New Energy and Industrial Technology Development Organization (NEDO), and JSPS KAKENHI Nos. 18H03264 and 20K19822. We would also like to thank the anonymous reviewers for their insightful comments to improve the paper.

Supplementary material

Supplementary material 1: 504473_1_En_34_MOESM1_ESM.pdf (PDF, 531 KB)


Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

Osaka University, Suita, Japan
