Improving Image Captioning by Concept-Based Sentence Reranking

  • Conference paper
  • In: Advances in Multimedia Information Processing - PCM 2016 (PCM 2016)
  • Part of the book series: Lecture Notes in Computer Science (volume 9917)

Abstract

This paper describes our winning entry in the ImageCLEF 2015 image sentence generation task. We improve Google’s CNN-LSTM model by introducing concept-based sentence reranking, a data-driven approach that exploits the large amounts of concept-level annotations on Flickr. Unlike previous uses of concept detection, which are tailored to specific image captioning models, the proposed approach reranks predicted sentences according to their matches with detected concepts, essentially treating the underlying model as a black box. This property makes the approach applicable to a number of existing solutions. We also experiment with fine-tuning the deep language model, which further improves performance. Scoring a METEOR of 0.1875 on the ImageCLEF 2015 test set, our system outperforms the runner-up (METEOR of 0.1687) by a clear margin.
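The black-box nature of the reranking step described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the combination weight `alpha` and the word-overlap concept-match score are hypothetical simplifications, and in the paper the candidate sentences would come from the CNN-LSTM model and the concepts from Flickr-trained detectors.

```python
# Hypothetical sketch of concept-based sentence reranking: candidate
# captions from any captioning model (treated as a black box) are
# rescored by how well they match concepts detected in the image.

def rerank(candidates, detected_concepts, alpha=0.5):
    """candidates: list of (sentence, model_log_prob) pairs.
    detected_concepts: set of concept words detected in the image.
    Returns sentences sorted by a combined score, best first."""
    def concept_match(sentence):
        # Fraction of detected concepts mentioned in the sentence.
        words = set(sentence.lower().split())
        if not detected_concepts:
            return 0.0
        return len(words & detected_concepts) / len(detected_concepts)

    # Interpolate the black-box model score with the concept-match score.
    scored = [(s, alpha * p + (1 - alpha) * concept_match(s))
              for s, p in candidates]
    return [s for s, _ in sorted(scored, key=lambda x: -x[1])]

candidates = [("a man rides a horse", -1.2),
              ("a dog runs on grass", -1.0)]
print(rerank(candidates, {"dog", "grass"})[0])  # → a dog runs on grass
```

Because the reranker only consumes (sentence, score) pairs, swapping in a different captioning model requires no change to the reranking code, which is the portability property the abstract emphasizes.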

Notes

  1. Over 6 million images tagged with ‘dog’ on Flickr, https://www.flickr.com/search/?tags=dog, retrieved 29 April 2016.

References

  1. Krizhevsky, A., Sutskever, I., Hinton, G.: ImageNet classification with deep convolutional neural networks. In: Proceedings of NIPS (2012)

  2. Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. In: Proceedings of NIPS (2014)

  3. Vinyals, O., Toshev, A., Bengio, S., Erhan, D.: Show and tell: a neural image caption generator. In: Proceedings of CVPR (2015)

  4. Mao, J., Xu, W., Yang, Y., Wang, J., Huang, Z., Yuille, A.: Deep captioning with multimodal recurrent neural networks (m-RNN). In: Proceedings of ICLR (2015)

  5. Karpathy, A., Fei-Fei, L.: Deep visual-semantic alignments for generating image descriptions. In: Proceedings of CVPR (2015)

  6. Li, X., Snoek, C., Worring, M.: Learning social tag relevance by neighbor voting. IEEE Trans. Multimedia 11(7), 1310–1322 (2009)

  7. Hodosh, M., Young, P., Hockenmaier, J.: Framing image description as a ranking task: data, models and evaluation metrics. J. Artif. Int. Res. 47(1), 853–899 (2013)

  8. Young, P., Lai, A., Hodosh, M., Hockenmaier, J.: From image descriptions to visual denotations: new similarity metrics for semantic inference over event descriptions. TACL 2, 67–78 (2014)

  9. Lin, T., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: common objects in context. CoRR abs/1405.0312 (2014)

  10. Li, X., Lan, W., Dong, J., Liu, H.: Adding Chinese captions to images. In: Proceedings of ICMR (2016)

  11. Fang, H., Gupta, S., Iandola, F., Srivastava, R., Deng, L., Dollar, P., Gao, J., He, X., Mitchell, M., Platt, J., Zitnick, L., Zweig, G.: From captions to visual concepts and back. In: Proceedings of CVPR (2015)

  12. Gong, Y., Wang, L., Hodosh, M., Hockenmaier, J., Lazebnik, S.: Improving image-sentence embeddings using large weakly annotated photo collections. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8692, pp. 529–545. Springer, Heidelberg (2014). doi:10.1007/978-3-319-10593-2_35

  13. Li, X., Jin, Q., Liao, S., Liang, J., He, X., Huo, Y., Lan, W., Xiao, B., Lu, Y., Xu, J.: RUC-tencent at ImageCLEF 2015: concept detection, localization and sentence generation. In: CLEF Working Notes (2015)

  14. Gilbert, A., Piras, L., Wang, J., Yan, F., Dellandrea, E., Gaizauskas, R., Villegas, M., Mikolajczyk, K.: Overview of the ImageCLEF 2015 scalable image annotation, localization and sentence generation task. In: CLEF Working Notes (2015)

  15. Frome, A., Corrado, G., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: a deep visual-semantic embedding model. In: Proceedings of NIPS (2013)

  16. Mikolov, T., Sutskever, I., Chen, K., Corrado, G., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Proceedings of NIPS (2013)

  17. Dong, J., Li, X., Snoek, C.G.M.: Word2VisualVec: cross-media retrieval by visual feature prediction. CoRR abs/1604.06838 (2016)

  18. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: Proceedings of ICLR (2015)

  19. Agrawal, P., Girshick, R., Malik, J.: Analyzing the performance of multilayer neural networks for object recognition. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8695, pp. 329–344. Springer, Heidelberg (2014). doi:10.1007/978-3-319-10584-0_22

  20. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L.: ImageNet large scale visual recognition challenge. Int. J. Comput. Vision 115(3), 211–252 (2015)

  21. Li, X., Liao, S., Lan, W., Du, X., Yang, G.: Zero-shot image tagging by hierarchical semantic embedding. In: Proceedings of SIGIR (2015)

  22. Li, X., Uricchio, T., Ballan, L., Bertini, M., Snoek, C.G.M., Del Bimbo, A.: Socializing the semantic gap: a comparative survey on image tag assignment, refinement and retrieval. ACM Comput. Surv. 49(1), 14:1–14:39 (2016)

  23. Villegas, M., et al.: General overview of ImageCLEF at the CLEF 2015 Labs. In: Mothe, J., Savoy, J., Kamps, J., Pinel-Sauvagnat, K., Jones, G.J.F., SanJuan, E., Cappellato, L., Ferro, N. (eds.) CLEF 2015. LNCS, vol. 9283, pp. 444–461. Springer, Heidelberg (2015). doi:10.1007/978-3-319-24027-5_45

Acknowledgements

The authors are grateful to the ImageCLEF coordinators for the benchmark organization efforts [14, 23]. This research was supported by the Fundamental Research Funds for the Central Universities and the Research Funds of Renmin University of China (No. 16XNQ013).

Author information

Corresponding author

Correspondence to Qin Jin.


Copyright information

© 2016 Springer International Publishing AG

About this paper

Cite this paper

Li, X., Jin, Q. (2016). Improving Image Captioning by Concept-Based Sentence Reranking. In: Chen, E., Gong, Y., Tie, Y. (eds.) Advances in Multimedia Information Processing - PCM 2016. PCM 2016. Lecture Notes in Computer Science, vol. 9917. Springer, Cham. https://doi.org/10.1007/978-3-319-48896-7_23

  • DOI: https://doi.org/10.1007/978-3-319-48896-7_23

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-48895-0

  • Online ISBN: 978-3-319-48896-7
