
Multimedia Tools and Applications, Volume 78, Issue 3, pp 3843–3858

Word-to-region attention network for visual question answering

  • Liang Peng
  • Yang Yang (corresponding author)
  • Yi Bin
  • Ning Xie
  • Fumin Shen
  • Yanli Ji
  • Xing Xu

Abstract

Visual attention, which concentrates processing on the image regions relevant to a reference question, brings remarkable performance improvements in Visual Question Answering (VQA). Most VQA attention models employ the entire question representation to query relevant image regions. Nonetheless, only certain salient words of the question play an effective role in the attention operation. In this paper, we propose a novel Word-to-Region Attention Network (WRAN), which can (1) simultaneously locate pertinent object regions, rather than a uniform grid of equal-sized image regions, and identify the corresponding words of the reference question; and (2) enforce consistency between image object regions and the core semantics of the question. We evaluate the proposed model on the VQA v1.0 and VQA v2.0 datasets. Experimental results demonstrate the superiority of the proposed model over state-of-the-art methods.
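
The abstract describes the mechanism only at a high level. As a rough illustration of how a word-to-region co-attention step could be wired up, the following is a minimal PyTorch sketch under our own assumptions: the class and variable names (WordToRegionAttention, word_proj, region_proj) and the max-pooling of the word-region affinity matrix are illustrative choices, not the authors' released implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class WordToRegionAttention(nn.Module):
        """Illustrative sketch of a word-to-region co-attention step.

        Word features are assumed to come from an RNN over the question;
        region features from an object detector's proposals (e.g. Faster
        R-CNN). Each word is scored against each region, and both word-level
        and region-level attention weights are derived from the same
        affinity matrix, so salient words and pertinent regions are
        selected jointly.
        """

        def __init__(self, word_dim, region_dim, hidden_dim):
            super().__init__()
            self.word_proj = nn.Linear(word_dim, hidden_dim)
            self.region_proj = nn.Linear(region_dim, hidden_dim)

        def forward(self, words, regions):
            # words:   (batch, n_words, word_dim)
            # regions: (batch, n_regions, region_dim)
            w = torch.tanh(self.word_proj(words))       # (B, T, H)
            r = torch.tanh(self.region_proj(regions))   # (B, K, H)

            # Affinity between every word and every region.
            affinity = torch.bmm(w, r.transpose(1, 2))  # (B, T, K)

            # Region attention: pool each region's scores over words,
            # then normalise over regions.
            region_scores = affinity.max(dim=1).values           # (B, K)
            region_weights = F.softmax(region_scores, dim=-1)    # (B, K)

            # Word attention: pool each word's scores over regions,
            # then normalise over words.
            word_scores = affinity.max(dim=2).values              # (B, T)
            word_weights = F.softmax(word_scores, dim=-1)         # (B, T)

            # Attended summaries of each modality.
            attended_regions = torch.bmm(region_weights.unsqueeze(1), regions).squeeze(1)
            attended_words = torch.bmm(word_weights.unsqueeze(1), words).squeeze(1)
            return attended_words, attended_regions, word_weights, region_weights

In such a design, the attended word and region summaries would typically be fused (for example by element-wise product) and passed to an answer classifier; the paper's actual fusion and pooling choices may differ from this sketch.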

Keywords

Visual question answering · Word attention · Image attention · Word-to-region

Notes

Acknowledgements

This work was supported in part by the National Natural Science Foundation of China under Project 61572108, Project 61632007 and the 111 Project No. B17008.

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. Center for Future Media and School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China
