Improving Image Captioning by Concept-Based Sentence Reranking

  • Xirong Li
  • Qin Jin
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9917)


This paper describes our winning entry in the ImageCLEF 2015 image sentence generation task. We improve Google's CNN-LSTM model by introducing concept-based sentence reranking, a data-driven approach which exploits the large amounts of concept-level annotations on Flickr. Unlike previous uses of concept detection, which are tailored to specific image captioning models, the proposed approach reranks predicted sentences by how well they match the detected concepts, essentially treating the underlying model as a black box. This property makes the approach applicable to a number of existing solutions. We also experiment with fine-tuning the deep language model, which further improves performance. Scoring a METEOR of 0.1875 on the ImageCLEF 2015 test set, our system outperforms the runner-up (METEOR of 0.1687) by a clear margin.
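The core idea of the abstract can be illustrated with a minimal sketch: given candidate captions (with log-probabilities) from any black-box captioner and a set of concepts detected for the image, rerank candidates by interpolating the model score with a concept-overlap score. Note this is not the authors' exact scoring function; the interpolation weight `alpha` and the word-level matching are assumptions for illustration only.

```python
# Illustrative sketch of concept-based sentence reranking.
# The scoring function and alpha weight are assumptions, not the
# paper's exact formulation; the captioner is treated as a black box.

def rerank(candidates, detected_concepts, alpha=0.5):
    """Rerank (sentence, log_prob) pairs by concept overlap.

    candidates: list of (sentence, log_prob) from any captioning model.
    detected_concepts: set of concept words predicted for the image.
    alpha: assumed interpolation weight between the two signals.
    """
    def concept_match(sentence):
        # Fraction of detected concepts mentioned in the sentence.
        words = set(sentence.lower().split())
        if not detected_concepts:
            return 0.0
        return len(words & detected_concepts) / len(detected_concepts)

    scored = [
        (alpha * log_prob + (1 - alpha) * concept_match(sent), sent)
        for sent, log_prob in candidates
    ]
    scored.sort(reverse=True)
    return [sent for _, sent in scored]

candidates = [("a man rides a bike", -1.2), ("a person on a street", -1.0)]
concepts = {"bike", "man"}
print(rerank(candidates, concepts)[0])  # → a man rides a bike
```

Because the reranker only consumes the candidate list, it can be bolted onto any existing captioning model without retraining, which is the black-box property the abstract emphasizes.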


Keywords: Image captioning · Sentence reranking · Neural language modeling · ImageCLEF 2015 benchmark evaluation



The authors are grateful to the ImageCLEF coordinators for the benchmark organization efforts [14, 23]. This research was supported by the Fundamental Research Funds for the Central Universities and the Research Funds of Renmin University of China (No. 16XNQ013).


  1. Krizhevsky, A., Sutskever, I., Hinton, G.: ImageNet classification with deep convolutional neural networks. In: Proceedings of NIPS (2012)
  2. Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. In: Proceedings of NIPS (2014)
  3. Vinyals, O., Toshev, A., Bengio, S., Erhan, D.: Show and tell: a neural image caption generator. In: Proceedings of CVPR (2015)
  4. Mao, J., Xu, W., Yang, Y., Wang, J., Huang, Z., Yuille, A.: Deep captioning with multimodal recurrent neural networks (m-RNN). In: Proceedings of ICLR (2015)
  5. Karpathy, A., Fei-Fei, L.: Deep visual-semantic alignments for generating image descriptions. In: Proceedings of CVPR (2015)
  6. Li, X., Snoek, C., Worring, M.: Learning social tag relevance by neighbor voting. IEEE Trans. Multimedia 11(7), 1310–1322 (2009)
  7. Hodosh, M., Young, P., Hockenmaier, J.: Framing image description as a ranking task: data, models and evaluation metrics. J. Artif. Int. Res. 47(1), 853–899 (2013)
  8. Young, P., Lai, A., Hodosh, M., Hockenmaier, J.: From image descriptions to visual denotations: new similarity metrics for semantic inference over event descriptions. TACL 2, 67–78 (2014)
  9. Lin, T., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: common objects in context. CoRR abs/1405.0312 (2014)
  10. Li, X., Lan, W., Dong, J., Liu, H.: Adding Chinese captions to images. In: Proceedings of ICMR (2016)
  11. Fang, H., Gupta, S., Iandola, F., Srivastava, R., Deng, L., Dollar, P., Gao, J., He, X., Mitchell, M., Platt, J., Zitnick, L., Zweig, G.: From captions to visual concepts and back. In: Proceedings of CVPR (2015)
  12. Gong, Y., Wang, L., Hodosh, M., Hockenmaier, J., Lazebnik, S.: Improving image-sentence embeddings using large weakly annotated photo collections. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8692, pp. 529–545. Springer, Heidelberg (2014). doi: 10.1007/978-3-319-10593-2_35
  13. Li, X., Jin, Q., Liao, S., Liang, J., He, X., Huo, Y., Lan, W., Xiao, B., Lu, Y., Xu, J.: RUC-Tencent at ImageCLEF 2015: concept detection, localization and sentence generation. In: CLEF Working Notes (2015)
  14. Gilbert, A., Piras, L., Wang, J., Yan, F., Dellandrea, E., Gaizauskas, R., Villegas, M., Mikolajczyk, K.: Overview of the ImageCLEF 2015 scalable image annotation, localization and sentence generation task. In: CLEF Working Notes (2015)
  15. Frome, A., Corrado, G., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: a deep visual-semantic embedding model. In: Proceedings of NIPS (2013)
  16. Mikolov, T., Sutskever, I., Chen, K., Corrado, G., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Proceedings of NIPS (2013)
  17. Dong, J., Li, X., Snoek, C.G.M.: Word2VisualVec: cross-media retrieval by visual feature prediction. CoRR abs/1604.06838 (2016)
  18. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: Proceedings of ICLR (2015)
  19. Agrawal, P., Girshick, R., Malik, J.: Analyzing the performance of multilayer neural networks for object recognition. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8695, pp. 329–344. Springer, Heidelberg (2014). doi: 10.1007/978-3-319-10584-0_22
  20. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L.: ImageNet large scale visual recognition challenge. Int. J. Comput. Vision 115(3), 211–252 (2015)
  21. Li, X., Liao, S., Lan, W., Du, X., Yang, G.: Zero-shot image tagging by hierarchical semantic embedding. In: Proceedings of SIGIR (2015)
  22. Li, X., Uricchio, T., Ballan, L., Bertini, M., Snoek, C.G.M., Del Bimbo, A.: Socializing the semantic gap: a comparative survey on image tag assignment, refinement and retrieval. ACM Comput. Surv. 49(1), 14:1–14:39 (2016)
  23. Villegas, M., et al.: General overview of ImageCLEF at the CLEF 2015 Labs. In: Mothe, J., Savoy, J., Kamps, J., Pinel-Sauvagnat, K., Jones, G.J.F., SanJuan, E., Cappellato, L., Ferro, N. (eds.) CLEF 2015. LNCS, vol. 9283, pp. 444–461. Springer, Heidelberg (2015). doi: 10.1007/978-3-319-24027-5_45

Copyright information

© Springer International Publishing AG 2016

Authors and Affiliations

  1. Key Lab of DEKE, Renmin University of China, Beijing, China
  2. Multimedia Computing Lab, Renmin University of China, Beijing, China