Remember and forget: video and text fusion for video question answering

Abstract

Video question answering (Video QA) has received much attention in recent years. It aims to answer questions according to the visual content of a video clip. A Video QA task can be solved from the video data alone, but when a video clip is accompanied by relevant text, it can also be solved by fusing the video and text data. Two problems must be addressed: how to select the useful region features from the video frames and the useful text features from the accompanying text, and how to fuse the video and text features. We therefore propose a forget memory network to solve these problems. The forget memory network with the video framework solves the Video QA task from the video data alone: it selects the region features that are useful for the question and forgets the irrelevant region features in the video frames. The forget memory network with the video-and-text framework additionally extracts the useful text features, forgets the irrelevant ones, and fuses the video and text data to solve the Video QA task. The fused video and text features help improve the experimental performance.
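The remember/forget idea described above can be pictured as question-conditioned soft weighting of region and text features followed by a simple fusion step. The NumPy sketch below illustrates this reading; the projection matrices, the softmax gating, and the concatenation-based fusion are illustrative assumptions, not the authors' exact formulation.

    # A minimal sketch of question-conditioned "remember/forget" weighting and
    # video-text fusion. Shapes, names, and the gating form are assumptions
    # made for illustration only.
    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def forget_gate_fusion(region_feats, text_feats, question,
                           rng=np.random.default_rng(0)):
        """Weight region and text features by their relevance to the question,
        suppress (forget) the irrelevant ones, and fuse the two summaries.

        region_feats: (num_regions, d) region features from the video frames
        text_feats:   (num_tokens, d)  features of the associated text
        question:     (d,)             encoded question vector
        """
        d = question.shape[0]
        # Hypothetical projection matrices (learned parameters in practice).
        W_v = rng.standard_normal((d, d)) / np.sqrt(d)
        W_t = rng.standard_normal((d, d)) / np.sqrt(d)

        # Relevance scores of each region / token with respect to the question.
        region_scores = region_feats @ W_v @ question   # (num_regions,)
        text_scores = text_feats @ W_t @ question       # (num_tokens,)

        # Soft "remember" weights; low-scoring features are effectively forgotten.
        region_weights = softmax(region_scores)
        text_weights = softmax(text_scores)

        video_summary = region_weights @ region_feats   # (d,)
        text_summary = text_weights @ text_feats        # (d,)

        # Fuse the attended video and text summaries for answer prediction.
        return np.concatenate([video_summary, text_summary])

    # Toy usage: 5 regions, 8 text tokens, feature size 16.
    fused = forget_gate_fusion(np.random.rand(5, 16),
                               np.random.rand(8, 16),
                               np.random.rand(16))
    print(fused.shape)  # (32,)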

Keywords

Video QA · Forget memory network · Fused video and text features


Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. School of Computer and Information Engineering, Anyang Normal University, Anyang, China
  2. Henan Key Laboratory of Oracle Bone Inscriptions Information Processing, Anyang Normal University, Anyang, China
  3. Collaborative Innovation Center of International Dissemination of Chinese Language Henan Province (HNIDCL), Henan, China
  4. School of Computer Science and Technology, Tianjin University, Tianjin, China
