Boosted Attention: Leveraging Human Attention for Image Captioning

  • Shi Chen
  • Qi Zhao
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11215)

Abstract

Visual attention has proven useful in image captioning, enabling a captioning model to selectively focus on regions of interest. Existing models typically rely on top-down language information and learn attention implicitly by optimizing the captioning objective. While somewhat effective, such learned top-down attention can fail to focus on the correct regions of interest without direct supervision of attention. Inspired by the human visual system, which is driven not only by task-specific top-down signals but also by the visual stimuli themselves, we propose in this work to use both types of attention for image captioning. In particular, we highlight the complementary nature of the two types of attention and develop a model (Boosted Attention) that integrates them for image captioning. We validate the proposed approach with state-of-the-art performance across various evaluation metrics.
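To make the idea of combining stimulus-driven and top-down attention concrete, below is a minimal PyTorch sketch, not the paper's implementation. All names (BoostedAttentionSketch, the multiplicative 1 + saliency fusion, the tensor shapes) are illustrative assumptions: the abstract states only that the two attention types are integrated, not how.

```python
# Minimal sketch: fuse a precomputed stimulus-driven saliency map with learned
# top-down attention over spatial image features. Hypothetical names and fusion rule.
import torch
import torch.nn as nn
import torch.nn.functional as F


class BoostedAttentionSketch(nn.Module):
    def __init__(self, feat_dim: int, hidden_dim: int):
        super().__init__()
        # Top-down attention scores conditioned on the decoder's hidden state.
        self.feat_proj = nn.Linear(feat_dim, hidden_dim)
        self.state_proj = nn.Linear(hidden_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, feats, saliency, h):
        # feats:    (B, R, feat_dim)  spatial image features over R regions
        # saliency: (B, R)            stimulus-driven saliency, e.g. from a saliency model
        # h:        (B, hidden_dim)   decoder hidden state (top-down signal)
        e = self.score(torch.tanh(self.feat_proj(feats) + self.state_proj(h).unsqueeze(1)))
        top_down = F.softmax(e.squeeze(-1), dim=1)                    # (B, R)
        boosted = top_down * (1.0 + saliency)                          # assumed fusion rule
        boosted = boosted / boosted.sum(dim=1, keepdim=True)           # renormalize to a distribution
        context = torch.bmm(boosted.unsqueeze(1), feats).squeeze(1)    # (B, feat_dim) attended context
        return context, boosted


# Example usage with random tensors (shapes are illustrative):
B, R, D, H = 2, 49, 2048, 512
attn = BoostedAttentionSketch(D, H)
context, weights = attn(torch.rand(B, R, D), torch.rand(B, R), torch.rand(B, H))
```

In this sketch the saliency map simply rescales the top-down weights before renormalization; other fusion choices (e.g., additive or log-space boosting) are equally plausible given only the abstract.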

Keywords

Image captioning · Visual attention · Human attention

Notes

Acknowledgements

This work is supported by NSF Grant 1763761 and the University of Minnesota Department of Computer Science and Engineering start-up fund (QZ).

Supplementary material

Supplementary material 1 (PDF, 828 KB): 474198_1_En_5_MOESM1_ESM.pdf

Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. Department of Computer Science and Engineering, University of Minnesota, Minneapolis, USA
