Exploring Visual Relationship for Image Captioning

  • Ting Yao
  • Yingwei Pan
  • Yehao Li
  • Tao Mei
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11218)


It is well believed that modeling the relationships between objects helps in representing and eventually describing an image. Nevertheless, there has been little evidence supporting this idea for image description generation. In this paper, we introduce a new design that explores the connections between objects for image captioning within an attention-based encoder-decoder framework. Specifically, we present a Graph Convolutional Networks plus Long Short-Term Memory (dubbed GCN-LSTM) architecture that integrates both semantic and spatial object relationships into the image encoder. Technically, we build graphs over the objects detected in an image based on their spatial and semantic connections. The representation of each object region proposal is then refined by leveraging the graph structure through GCN. With the learnt region-level features, GCN-LSTM capitalizes on an LSTM-based captioning framework with an attention mechanism for sentence generation. Extensive experiments are conducted on the COCO image captioning dataset, and superior results are reported in comparison to state-of-the-art approaches. More remarkably, GCN-LSTM increases the CIDEr-D score from 120.1% to 128.7% on the COCO testing set.
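
To make the idea concrete, below is a minimal sketch, not the authors' implementation, of how detector-produced region features could be refined over a relationship graph with a single graph-convolution layer and then attended by an LSTM decoder at each decoding step. The shared weight matrix, the mean-style aggregation, and the names `RegionGCN`, `AttnLSTMDecoder`, and `step` are illustrative assumptions; the paper builds separate semantic and spatial graphs and uses direction- and label-aware transformations inside the GCN.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RegionGCN(nn.Module):
    """One graph-convolution layer over detected regions.

    Regions come from an object detector (e.g. Faster R-CNN); the adjacency
    matrix marks which region pairs are linked by a semantic or spatial
    relationship. This is a simplified, relation-agnostic layer.
    """

    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)  # single shared transform (simplification)

    def forward(self, regions, adj):
        # regions: (K, dim) region features; adj: (K, K) 0/1 relation graph
        adj = adj + torch.eye(adj.size(0))        # keep each region's own feature
        deg = adj.sum(dim=1, keepdim=True)        # simple mean aggregation
        refined = (adj @ self.proj(regions)) / deg
        return F.relu(refined)                    # relation-aware region features


class AttnLSTMDecoder(nn.Module):
    """LSTM decoder with soft attention over the refined region features."""

    def __init__(self, dim, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.cell = nn.LSTMCell(2 * dim, dim)     # input: word embedding + context
        self.att = nn.Linear(dim, dim)
        self.out = nn.Linear(dim, vocab_size)

    def step(self, word, state, regions):
        # word: (1,) previous word id; state: (h, c), each (1, dim); regions: (K, dim)
        h, c = state
        scores = (self.att(regions) @ h.squeeze(0)).softmax(dim=0)   # (K,)
        context = (scores.unsqueeze(1) * regions).sum(dim=0)         # (dim,)
        x = torch.cat([self.embed(word), context.unsqueeze(0)], dim=1)
        h, c = self.cell(x, (h, c))
        return self.out(h), (h, c)                # word logits, new state
```

A full forward pass would run `RegionGCN` once over the K detected regions and the relation adjacency matrix, then call `step` repeatedly, feeding back the previously generated word, as in standard attention-based captioning decoders.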


Keywords: Image captioning · Graph convolutional networks · Visual relationship · Long short-term memory



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. JD AI Research, Beijing, China
  2. Sun Yat-sen University, Guangzhou, China
