Object-aware semantics of attention for image captioning

  • Shiwei Wang
  • Long Lan
  • Xiang Zhang
  • Guohua Dong
  • Zhigang Luo

Abstract

In image captioning, exploiting high-level semantic concepts is important for boosting captioning performance. Although much progress has been made in this regard, most existing image captioning models neglect the interrelationships between objects in an image, which are a key factor in accurately understanding image content. In this paper, we propose an object-aware semantic attention (OSA) based captioning model to address this issue. Specifically, our attention model captures explicit associations between objects by coupling the attention mechanism with three types of semantic concepts, i.e., the category information, the relative sizes of the objects, and the relative distances between objects. These concepts are easy to compute and integrate seamlessly into the well-known encoder-decoder captioning framework. In our empirical analysis, the three concepts favor different aspects of the image content, such as the number of objects in each category, the main focus of an image, and the closeness between objects. Importantly, they cooperate with visual features to help the attention model highlight the image regions of interest, yielding significant performance gains. By leveraging the three types of semantic concepts, we derive four semantic attention models for image captioning. Extensive experiments on the MSCOCO dataset show that our attention models, within the encoder-decoder image captioning framework, perform favorably compared with representative captioning models.
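A minimal, illustrative sketch of one such object-aware attention step is given below (in PyTorch). It assumes Faster R-CNN-style region features and normalized bounding boxes; the module name, dimensions, and the way the category, relative-size, and relative-distance cues are encoded are our own assumptions for illustration, not the paper's exact OSA formulation.

import torch
import torch.nn as nn
import torch.nn.functional as F


class ObjectSemanticAttention(nn.Module):
    """Soft attention over detected objects, augmented with simple semantic cues.

    Illustrative sketch only; hyper-parameters and cue encodings are assumptions.
    """

    def __init__(self, feat_dim=2048, hidden_dim=512, num_classes=80, cat_dim=64):
        super().__init__()
        self.cat_embed = nn.Embedding(num_classes, cat_dim)
        # +2 for the two scalar cues: relative size and mean relative distance
        self.region_proj = nn.Linear(feat_dim + cat_dim + 2, hidden_dim)
        self.state_proj = nn.Linear(hidden_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, feats, cats, boxes, h):
        # feats: (N, feat_dim) region features, cats: (N,) category ids,
        # boxes: (N, 4) as (x1, y1, x2, y2) normalized to [0, 1],
        # h: (hidden_dim,) current decoder hidden state.
        wh = boxes[:, 2:] - boxes[:, :2]
        rel_size = (wh[:, 0] * wh[:, 1]).unsqueeze(1)       # area fraction of the image
        centers = (boxes[:, :2] + boxes[:, 2:]) / 2.0
        rel_dist = torch.cdist(centers, centers).mean(dim=1, keepdim=True)  # closeness cue
        sem = torch.cat([self.cat_embed(cats), rel_size, rel_dist], dim=1)
        regions = self.region_proj(torch.cat([feats, sem], dim=1))
        scores = self.score(torch.tanh(regions + self.state_proj(h))).squeeze(1)
        alpha = F.softmax(scores, dim=0)                    # attention over objects
        context = (alpha.unsqueeze(1) * regions).sum(dim=0)
        return context, alpha

In this sketch the attended context vector would replace the purely visual context of a standard soft-attention decoder; the actual model couples these cues with the encoder-decoder framework described above.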

Keywords

High-level semantic concepts · Semantic attention · Image captioning

Notes

Acknowledgements

This work was supported by the National Natural Science Foundation of China [61806213, U1435222].

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  1. Science and Technology on Parallel and Distributed Processing, National University of Defense Technology, Changsha, China
  2. Institute for Quantum Information, State Key Laboratory of High Performance Computing, National University of Defense Technology, Changsha, China
  3. College of Computer, National University of Defense Technology, Changsha, China
