
A2A: Attention to Attention Reasoning for Movie Question Answering

  • Chao-Ning Liu
  • Ding-Jie Chen
  • Hwann-Tzong Chen
  • Tyng-Luh Liu
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11366)

Abstract

This paper presents the Attention to Attention (A2A) reasoning mechanism to address the challenging task of movie question answering (MQA). By focusing on various aspects of attention cues, we establish the technique of attention propagation to uncover latent but useful information for the underlying QA task. In addition, the proposed A2A reasoning seamlessly leads to effective fusion of different representation modalities of the data, and can also be conveniently constructed with popular neural network architectures. To tackle the out-of-vocabulary issue caused by the diverse language usage in contemporary movies, we adopt the GloVe mapping as a teacher model and learn a new and flexible word embedding based on character n-grams. Our method is evaluated on the MovieQA benchmark dataset and achieves state-of-the-art accuracy for the “Video+Subtitles” entry.
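The abstract only sketches the character n-gram embedding distilled from GloVe. The snippet below is a minimal, hedged illustration of that general idea (a student that sums hashed character n-gram vectors and regresses onto pre-trained teacher word vectors), not the authors' implementation; the module names, bucket size, toy vocabulary, and random stand-in for the GloVe table are assumptions made for the example.

```python
# Minimal sketch (not the authors' code): learn word embeddings from character
# n-grams with GloVe as the teacher.  Each word is decomposed into character
# n-grams, each n-gram indexes a hashed embedding table, and the summed n-gram
# vectors are trained to match the word's pre-trained GloVe vector.  Words
# outside the training vocabulary can then still be embedded from n-grams alone.
import torch
import torch.nn as nn


def char_ngrams(word, n_min=3, n_max=5):
    """Return the character n-grams of a word, padded with boundary marks."""
    padded = f"<{word}>"
    return [padded[i:i + n] for n in range(n_min, n_max + 1)
            for i in range(len(padded) - n + 1)]


class NgramEmbedding(nn.Module):
    """Embed a word as the sum of hashed character n-gram vectors."""

    def __init__(self, dim=300, num_buckets=100_000):
        super().__init__()
        self.num_buckets = num_buckets
        self.table = nn.Embedding(num_buckets, dim)

    def forward(self, words):
        vecs = []
        for w in words:
            idx = torch.tensor([hash(g) % self.num_buckets
                                for g in char_ngrams(w)])
            vecs.append(self.table(idx).sum(dim=0))
        return torch.stack(vecs)


# Stand-in for pre-trained GloVe vectors; a real setup would load the published
# 300-d GloVe table.  Vocabulary and sizes here are purely illustrative.
vocab = ["movie", "question", "answer", "attention"]
glove = {w: torch.randn(300) for w in vocab}

student = NgramEmbedding(dim=300)
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)

for step in range(100):
    pred = student(vocab)                            # student embeddings
    target = torch.stack([glove[w] for w in vocab])  # teacher (GloVe) embeddings
    loss = nn.functional.mse_loss(pred, target)      # regress onto the teacher
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# An out-of-vocabulary word still receives a vector from its n-grams alone.
oov_vec = student(["moviegoer"])
print(oov_vec.shape)  # torch.Size([1, 300])
```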


Acknowledgement

This work was supported in part by MOST Grants 107-2634-F-001-002 and 106-2221-E-007-080-MY3 in Taiwan.

References

  1. Agrawal, A., et al.: VQA: visual question answering. Int. J. Comput. Vis. 123(1), 4–31 (2017). www.visualqa.org
  2. Anderson, P., et al.: Bottom-up and top-down attention for image captioning and visual question answering. In: CVPR (2018)
  3. Arora, S., Liang, Y., Ma, T.: A simple but tough-to-beat baseline for sentence embeddings. In: ICLR (2017)
  4. Azab, M., Wang, M., Smith, M., Kojima, N., Deng, J., Mihalcea, R.: Speaker naming in movies. In: NAACL-HLT, pp. 2206–2216 (2018)
  5. Bello, I., Zoph, B., Vasudevan, V., Le, Q.V.: Neural optimizer search with reinforcement learning. In: ICML, pp. 459–468 (2017)
  6. Fukui, A., Park, D.H., Yang, D., Rohrbach, A., Darrell, T., Rohrbach, M.: Multimodal compact bilinear pooling for visual question answering and visual grounding. In: EMNLP, pp. 457–468 (2016)
  7. Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward neural networks. In: AISTATS, pp. 249–256 (2010)
  8. Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., Parikh, D.: Making the V in VQA matter: elevating the role of image understanding in visual question answering. In: CVPR (2017)
  9. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385 (2015)
  10. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)
  11. Hu, H., Chao, W.L., Sha, F.: Learning answer embeddings for visual question answering. In: CVPR (2018)
  12. Huang, J., et al.: Speed/accuracy trade-offs for modern convolutional object detectors. In: CVPR, pp. 3296–3297 (2017)
  13. Jang, Y., Song, Y., Yu, Y., Kim, Y., Kim, G.: TGIF-QA: toward spatio-temporal reasoning in visual question answering. In: CVPR (2017)
  14. Johnson, J., et al.: Inferring and executing programs for visual reasoning. In: ICCV, pp. 3008–3017 (2017)
  15. Kazemi, V., Elqursh, A.: Show, ask, attend, and answer: a strong baseline for visual question answering. CoRR abs/1704.03162 (2017)
  16. Kim, K., Heo, M., Choi, S., Zhang, B.: DeepStory: video story QA by deep embedded memory networks. In: IJCAI, pp. 2016–2022 (2017)
  17. Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48
  18. Liu, F., Perez, J.: Gated end-to-end memory networks. In: EACL (2017)
  19. Lu, J., Yang, J., Batra, D., Parikh, D.: Hierarchical question-image co-attention for visual question answering. In: NIPS, pp. 289–297 (2016)
  20. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. CoRR (2013)
  21. Miller, A.H., Fisch, A., Dodge, J., Karimi, A., Bordes, A., Weston, J.: Key-value memory networks for directly reading documents. In: EMNLP, pp. 1400–1409 (2016)
  22. Mun, J., Seo, P.H., Jung, I., Han, B.: MarioQA: answering questions by watching gameplay videos. In: ICCV, pp. 2886–2894 (2017)
  23. Na, S., Lee, S., Kim, J., Kim, G.: A read-write memory network for movie story understanding. In: ICCV (2017)
  24. Pennington, J., Socher, R., Manning, C.D.: GloVe: global vectors for word representation. In: EMNLP, pp. 1532–1543 (2014)
  25. Rajpurkar, P., Zhang, J., Lopyrev, K., Liang, P.: SQuAD: 100,000+ questions for machine comprehension of text. In: EMNLP, pp. 2383–2392 (2016)
  26. Ren, S., He, K., Girshick, R.B., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 39(6), 1137–1149 (2017)
  27. Rohrbach, A., et al.: Movie description. Int. J. Comput. Vis. (2017)
  28. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: ICLR (2015)
  29. Sukhbaatar, S., Szlam, A., Weston, J., Fergus, R.: End-to-end memory networks. In: NIPS (2015)
  30. Tapaswi, M., Zhu, Y., Stiefelhagen, R., Torralba, A., Urtasun, R., Fidler, S.: MovieQA: understanding stories in movies through question-answering. In: CVPR (2016)
  31. Wang, B., Xu, Y., Han, Y., Hong, R.: Movie question answering: remembering the textual cues for layered visual contents. In: AAAI (2018)
  32. Weston, J., Bordes, A., Chopra, S., Mikolov, T.: Towards AI-complete question answering: a set of prerequisite toy tasks. CoRR abs/1502.05698 (2015)
  33. Weston, J., Chopra, S., Bordes, A.: Memory networks. CoRR abs/1410.3916 (2014)
  34. Wu, Q., Teney, D., Wang, P., Shen, C., Dick, A.R., van den Hengel, A.: Visual question answering: a survey of methods and datasets. Comput. Vis. Image Underst. 163, 21–40 (2017)

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Chao-Ning Liu (1)
  • Ding-Jie Chen (2)
  • Hwann-Tzong Chen (1)
  • Tyng-Luh Liu (2)
  1. Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan
  2. Institute of Information Science, Academia Sinica, Taipei, Taiwan
