
Fusion Models for Improved Image Captioning

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 12666)

Abstract

Visual captioning aims to generate textual descriptions for given images or videos. Traditionally, image captioning models are trained on human-annotated datasets such as Flickr30k and MS-COCO, which are limited in size and diversity. This limitation hinders the generalization capabilities of these models and leaves them prone to errors. Language models, however, can be trained on vast amounts of freely available unlabelled data and have recently emerged as successful language encoders [10] and coherent text generators [4]. Meanwhile, several unimodal and multimodal fusion techniques have proven to work well for natural language generation [11] and automatic speech recognition [30]. Building on these recent developments, and with the aim of improving the quality of generated captions, the contribution of this paper is two-fold. First, we propose a generic multimodal model fusion framework for caption generation as well as emendation, in which we utilize different fusion strategies to integrate a pretrained Auxiliary Language Model (AuxLM) within traditional encoder-decoder visual captioning frameworks. Next, we employ the same fusion strategies to integrate a pretrained Masked Language Model (MLM), namely BERT, with a visual captioning model, viz. Show, Attend, and Tell, for emending both syntactic and semantic errors in captions. Our caption emendation experiments on three benchmark image captioning datasets, viz. Flickr8k, Flickr30k, and MS-COCO, show improvements over the baseline, indicating the usefulness of our proposed multimodal fusion strategies. Further, we perform a preliminary qualitative analysis of the emended captions and identify error categories based on the type of corrections.
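To make the fusion strategies concrete, below is a minimal PyTorch sketch of two ways a frozen AuxLM state could be combined with the hidden state of a visual captioning decoder (e.g., the LSTM decoder of Show, Attend, and Tell [35]) before predicting the next token: a simple concatenation-based fusion and a cold-fusion-style gated variant in the spirit of Sriram et al. [30]. All class names, dimensions, and the use of a single pooled AuxLM vector per step are illustrative assumptions, not the authors' implementation.

    import torch
    import torch.nn as nn

    class SimpleFusion(nn.Module):
        """Concatenate the caption-decoder state with the (frozen) AuxLM state
        and project the result to the vocabulary. Illustrative sketch only."""
        def __init__(self, dec_dim: int, lm_dim: int, vocab_size: int):
            super().__init__()
            self.proj = nn.Linear(dec_dim + lm_dim, vocab_size)

        def forward(self, dec_state, lm_state):
            # dec_state: (batch, dec_dim)  hidden state of the captioning decoder
            # lm_state:  (batch, lm_dim)   hidden state of the pretrained AuxLM (kept frozen)
            fused = torch.cat([dec_state, lm_state.detach()], dim=-1)
            return self.proj(fused)  # next-token logits: (batch, vocab_size)

    class ColdFusion(nn.Module):
        """Cold-fusion-style gating: a sigmoid gate computed from both states
        modulates how much AuxLM information flows into the prediction."""
        def __init__(self, dec_dim: int, lm_dim: int, vocab_size: int):
            super().__init__()
            self.lm_proj = nn.Linear(lm_dim, dec_dim)
            self.gate = nn.Linear(2 * dec_dim, dec_dim)
            self.out = nn.Linear(2 * dec_dim, vocab_size)

        def forward(self, dec_state, lm_state):
            h_lm = torch.relu(self.lm_proj(lm_state.detach()))
            g = torch.sigmoid(self.gate(torch.cat([dec_state, h_lm], dim=-1)))
            fused = torch.cat([dec_state, g * h_lm], dim=-1)
            return self.out(fused)

    # Illustrative shapes only: a 512-d decoder state fused with a 768-d BERT state.
    head = ColdFusion(dec_dim=512, lm_dim=768, vocab_size=10000)
    logits = head(torch.randn(4, 512), torch.randn(4, 768))  # -> (4, 10000)

In both variants the AuxLM state is detached from the computation graph, so only the fusion layers (and the captioning model) receive gradients; in the gated variant the learned gate lets the decoder decide, per dimension, how much to rely on the language model.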


Notes

  1. https://www.imageclef.org/2020/medical/caption/.

  2. https://github.com/google-research/bert.

  3. https://cs.stanford.edu/people/karpathy/deepimagesent.

References

  1. Anderson, P., Fernando, B., Johnson, M., Gould, S.: SPICE: semantic propositional image caption evaluation. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9909, pp. 382–398. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46454-1_24

  2. Banerjee, S., Lavie, A.: METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. In: Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pp. 65–72. Association for Computational Linguistics (2005)

  3. Bengio, Y., Ducharme, R., Vincent, P., Janvin, C.: A neural probabilistic language model. J. Mach. Learn. Res. 3, 1137–1155 (2003)

  4. Brown, T.B., et al.: Language models are few-shot learners. CoRR abs/2005.14165 (2020). https://arxiv.org/abs/2005.14165

  5. Callison-Burch, C., Osborne, M., Koehn, P.: Re-evaluating the role of BLEU in machine translation research. In: 11th Conference of the EACL, Trento, Italy (2006)

  6. Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-end object detection with transformers. CoRR abs/2005.12872 (2020)

  7. Cho, J., et al.: Language model integration based on memory control for sequence to sequence speech recognition. In: ICASSP, Brighton, United Kingdom, pp. 6191–6195 (2019)

  8. Cho, K., et al.: Learning phrase representations using RNN encoder-decoder for statistical machine translation. In: Proceedings of EMNLP, pp. 1724–1734 (2014)

  9. Dauphin, Y.N., Fan, A., Auli, M., Grangier, D.: Language modeling with gated convolutional networks. In: Proceedings of the 34th ICML, pp. 933–941 (2017)

  10. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of NAACL 2019, Minneapolis, pp. 4171–4186 (2019). https://doi.org/10.18653/v1/n19-1423

  11. Fan, A., Lewis, M., Dauphin, Y.: Hierarchical neural story generation. In: Proceedings of the 56th Annual Meeting of the ACL, Melbourne, pp. 889–898 (2018)

  12. Gülçehre, Ç., et al.: On using monolingual corpora in neural machine translation. CoRR abs/1503.03535 (2015). http://arxiv.org/abs/1503.03535

  13. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR 2016, pp. 770–778. IEEE (2016)

  14. Kalimuthu, M., Nunnari, F., Sonntag, D.: A competitive deep neural network approach for the ImageCLEFmed caption 2020 task. In: Working Notes of CLEF 2020, Thessaloniki. CEUR Workshop Proceedings, vol. 2696. CEUR-WS.org (2020)

  15. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: 3rd International Conference on Learning Representations, ICLR, San Diego (2015)

  16. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: NIPS 2012, Lake Tahoe, pp. 1106–1114 (2012)

  17. Lin, C.Y.: ROUGE: a package for automatic evaluation of summaries. In: Text Summarization Branches Out, pp. 74–81. Association for Computational Linguistics (2004)

  18. Lu, X., Wang, B., Zheng, X., Li, X.: Exploring models and data for remote sensing image caption generation. IEEE Trans. Geosci. Remote Sens. 56, 2183–2195 (2017)

  19. Merity, S., Keskar, N.S., Socher, R.: Regularizing and optimizing LSTM language models. In: Proceedings of ICLR 2018, Vancouver, Conference Track Proceedings (2018)

  20. Mikolov, T., Karafiát, M., Burget, L., Cernocký, J., Khudanpur, S.: Recurrent neural network based language model. In: INTERSPEECH, pp. 1045–1048 (2010)

  21. Mogadala, A., Kalimuthu, M., Klakow, D.: Trends in integration of vision and language research: a survey of tasks, datasets, and methods. CoRR abs/1907.09358 (2019). http://arxiv.org/abs/1907.09358 (accepted at the Journal of AI Research)

  22. Mosbach, M., Andriushchenko, M., Klakow, D.: On the stability of fine-tuning BERT: misconceptions, explanations, and strong baselines. CoRR abs/2006.04884 (2020). https://arxiv.org/abs/2006.04884

  23. Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: BLEU: a method for automatic evaluation of machine translation. In: Proceedings of ACL, pp. 311–318 (2002)

  24. Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al.: PyTorch: an imperative style, high-performance deep learning library. In: NeurIPS 2019, pp. 8026–8037 (2019)

  25. Pelka, O., Friedrich, C.M., García Seco de Herrera, A., Müller, H.: Overview of the ImageCLEFmed 2020 concept prediction task: medical image understanding. In: CLEF 2020 Working Notes. CEUR Workshop Proceedings, vol. 2696 (2020)

  26. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I.: Language models are unsupervised multitask learners. OpenAI Blog 1(8), 9 (2019)

  27. Rohrbach, A., Hendricks, L.A., Burns, K., Darrell, T., Saenko, K.: Object hallucination in image captioning. In: Proceedings of EMNLP, Brussels, pp. 4035–4045 (2018)

  28. Sammani, F., Elsayed, M.: Look and modify: modification networks for image captioning. In: 30th BMVC 2019, Cardiff, UK, p. 75. BMVA Press (2019)

  29. Sammani, F., Melas-Kyriazi, L.: Show, edit and tell: a framework for editing image captions. In: Proceedings of CVPR 2020, Seattle, pp. 4808–4816. IEEE (2020)

  30. Sriram, A., Jun, H., Satheesh, S., Coates, A.: Cold fusion: training seq2seq models together with language models. In: Proceedings of Interspeech 2018, pp. 387–391 (2018)

  31. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS 2017, pp. 5998–6008 (2017)

  32. Vedantam, R., Lawrence Zitnick, C., Parikh, D.: CIDEr: consensus-based image description evaluation. In: Proceedings of CVPR 2015, pp. 4566–4575. IEEE (2015)

  33. Vinyals, O., Toshev, A., Bengio, S., Erhan, D.: Show and tell: a neural image caption generator. In: CVPR, pp. 3156–3164. IEEE Computer Society (2015)

  34. Wu, Y., et al.: Google's neural machine translation system: bridging the gap between human and machine translation. CoRR abs/1609.08144 (2016). http://arxiv.org/abs/1609.08144

  35. Xu, K., et al.: Show, attend and tell: neural image caption generation with visual attention. In: International Conference on Machine Learning, pp. 2048–2057 (2015)


Acknowledgements

This work was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – project-id 232722074 – SFB 1102. We extend our thanks to Matthew Kuhn for painstakingly proofing the whole manuscript.

Author information

Correspondence to Marimuthu Kalimuthu.


A Appendix

Here we provide examples of (token) corrections made by the fusion models and categorize the edits, based on the nature of the change, into one of the following five categories: (i) Gender, (ii) Color, (iii) Specificity, (iv) Syntactic, (v) Semantic.

The above classification is provided only as a preliminary illustration. To thoroughly understand the trends in caption emendations and to draw firm conclusions, a detailed human evaluation should be performed on all three datasets; we leave this for future work. In the following examples, we color-code the incorrect tokens, their correct replacements, and tokens that are equally valid alternatives.

Semantic Correction. This section presents examples where the fusion models corrected a few tokens in the baseline captions so as to make them semantically valid with respect to the image. Edits for semantic correctness may include emendation of visual attributes such as colors, objects, and object size (Figs. 3 and 4).

Fig. 3. An example illustrating the correction of semantic errors in the captions by our simple fusion model.

Fig. 4. Another example showing the correction of semantic errors with cold fusion.

Gender Alteration. This section illustrates a case where the fusion models corrected an incorrect gender reference in a caption produced by the baseline model (Fig. 5).

Fig. 5. An example of the cold fusion approach achieving gender correction.

Specificity. This category covers emendations by the fusion models where the corrected captions describe the images more precisely than the baseline captions (Fig. 6).

Fig. 6. An example showing improved specificity with hierarchical fusion.

Syntactic Correction. In this section, we show an example where syntactic errors, such as token repetitions in the baseline captions, are correctly emended by the fusion models (Fig. 7).

Fig. 7. Replacement of repetitive tokens with a correct alternative.

Color Correction. Here we show an example where the fusion models emended color attributes in the baseline model's captions (Fig. 8).

Fig. 8. An example of cold fusion emending a color attribute.


Copyright information

© 2021 Springer Nature Switzerland AG

About this paper


Cite this paper

Kalimuthu, M., Mogadala, A., Mosbach, M., Klakow, D. (2021). Fusion Models for Improved Image Captioning. In: Del Bimbo, A., et al. Pattern Recognition. ICPR International Workshops and Challenges. ICPR 2021. Lecture Notes in Computer Science, vol 12666. Springer, Cham. https://doi.org/10.1007/978-3-030-68780-9_32


  • DOI: https://doi.org/10.1007/978-3-030-68780-9_32


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-68779-3

  • Online ISBN: 978-3-030-68780-9

