Towards Unique and Informative Captioning of Images

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12352)


Despite considerable progress, state-of-the-art image captioning models produce generic captions that leave out important image details. Furthermore, these systems may even misrepresent the image in order to produce a simpler caption consisting of common concepts. In this paper, we first analyze both modern captioning systems and evaluation metrics through empirical experiments to quantify these phenomena. We find that modern captioning systems assign higher likelihoods to incorrect distractor sentences than to ground-truth captions, and that evaluation metrics like SPICE can be ‘topped’ using simple captioning systems relying on object detectors. Inspired by these observations, we design a new metric (SPICE-U) by introducing a notion of uniqueness over the concepts generated in a caption. We show that SPICE-U correlates better with human judgements than SPICE does, and effectively captures notions of diversity and descriptiveness. Finally, we also demonstrate a general technique to improve any existing captioning model: using mutual information as a re-ranking objective during decoding. Empirically, this results in more unique and informative captions, and improves three different state-of-the-art models on SPICE-U as well as on the average score over existing metrics (Code is available at
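The re-ranking idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a generic pointwise mutual information score, log p(caption | image) minus a weighted log p(caption), in the spirit of diversity-promoting objectives from the dialogue and translation literature. The function name `rerank_by_mutual_information`, the weight `lam`, and all log-probability values are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple


def rerank_by_mutual_information(
    candidates: List[Tuple[str, float, float]],
    lam: float = 1.0,
) -> List[Tuple[str, float]]:
    """Re-rank candidate captions by a pointwise mutual information score.

    Each candidate is (caption, log p(caption | image), log p(caption)).
    The score log p(c | I) - lam * log p(c) penalizes captions that are
    likely regardless of the image, i.e. generic ones.
    `lam` is an assumed trade-off weight, not a value from the paper.
    """
    scored = [
        (caption, lp_cond - lam * lp_prior)
        for caption, lp_cond, lp_prior in candidates
    ]
    # Highest score first.
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy beam with made-up log-probabilities (assumptions, not real model output).
beam = [
    ("a man riding a wave", -4.0, -5.0),
    ("a man in a red wetsuit surfing a large wave", -6.5, -12.0),
]

ranked = rerank_by_mutual_information(beam)
likelihood_only = rerank_by_mutual_information(beam, lam=0.0)
```

With `lam=0.0` the ranking reduces to plain likelihood and the generic caption comes first; with `lam=1.0` the more specific caption is preferred, because its low prior probability outweighs its lower conditional likelihood in these toy numbers.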



This work is partially supported by KAUST under Award No. OSRCRG2017-3405, by Samsung and by the Princeton CSML DataX award. We would like to thank Arjun Mani, Vikram Ramaswamy and Angelina Wang for their helpful feedback on the paper.



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. Princeton University, Princeton, USA
  2. California Institute of Technology, Pasadena, USA
