
Fusion Models for Improved Image Captioning

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 12666)

Abstract

Visual captioning aims to generate textual descriptions for given images or videos. Traditionally, image captioning models are trained on human-annotated datasets such as Flickr30k and MS-COCO, which are limited in size and diversity. This limitation hinders the generalization capabilities of these models and leaves them prone to errors. Language models, however, can be trained on vast amounts of freely available unlabelled data and have recently emerged as successful language encoders [10] and coherent text generators [4]. Meanwhile, several unimodal and multimodal fusion techniques have proven to work well for natural language generation [11] and automatic speech recognition [30]. Building on these recent developments, and with the aim of improving the quality of generated captions, the contribution of this paper is two-fold. First, we propose a generic multimodal model fusion framework for caption generation as well as emendation, in which we utilize different fusion strategies to integrate a pretrained Auxiliary Language Model (AuxLM) within traditional encoder-decoder visual captioning frameworks. Next, we employ the same fusion strategies to integrate a pretrained Masked Language Model (MLM), namely BERT, with a visual captioning model, viz. Show, Attend, and Tell, for emending both syntactic and semantic errors in captions. Our caption emendation experiments on three benchmark image captioning datasets, viz. Flickr8k, Flickr30k, and MS-COCO, show improvements over the baseline, indicating the usefulness of our proposed multimodal fusion strategies. Further, we perform a preliminary qualitative analysis of the emended captions and identify error categories based on the type of corrections.
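To make the fusion strategies concrete, below is a minimal PyTorch sketch of two ways a frozen AuxLM state could be combined with the hidden state of a visual captioning decoder (e.g., the LSTM decoder of Show, Attend, and Tell [35]) before predicting the next token: a simple concatenation-based fusion and a cold-fusion-style gated variant in the spirit of Sriram et al. [30]. All class names, dimensions, and the use of a single pooled AuxLM vector per step are illustrative assumptions, not the authors' implementation.

    import torch
    import torch.nn as nn

    class SimpleFusion(nn.Module):
        """Concatenate the caption-decoder state with the (frozen) AuxLM state
        and project the result to the vocabulary. Illustrative sketch only."""
        def __init__(self, dec_dim: int, lm_dim: int, vocab_size: int):
            super().__init__()
            self.proj = nn.Linear(dec_dim + lm_dim, vocab_size)

        def forward(self, dec_state, lm_state):
            # dec_state: (batch, dec_dim)  hidden state of the captioning decoder
            # lm_state:  (batch, lm_dim)   hidden state of the pretrained AuxLM (kept frozen)
            fused = torch.cat([dec_state, lm_state.detach()], dim=-1)
            return self.proj(fused)  # next-token logits: (batch, vocab_size)

    class ColdFusion(nn.Module):
        """Cold-fusion-style gating: a sigmoid gate computed from both states
        modulates how much AuxLM information flows into the prediction."""
        def __init__(self, dec_dim: int, lm_dim: int, vocab_size: int):
            super().__init__()
            self.lm_proj = nn.Linear(lm_dim, dec_dim)
            self.gate = nn.Linear(2 * dec_dim, dec_dim)
            self.out = nn.Linear(2 * dec_dim, vocab_size)

        def forward(self, dec_state, lm_state):
            h_lm = torch.relu(self.lm_proj(lm_state.detach()))
            g = torch.sigmoid(self.gate(torch.cat([dec_state, h_lm], dim=-1)))
            fused = torch.cat([dec_state, g * h_lm], dim=-1)
            return self.out(fused)

    # Illustrative shapes only: a 512-d decoder state fused with a 768-d BERT state.
    head = ColdFusion(dec_dim=512, lm_dim=768, vocab_size=10000)
    logits = head(torch.randn(4, 512), torch.randn(4, 768))  # -> (4, 10000)

In both variants the AuxLM state is detached from the computation graph, so only the fusion layers (and the captioning model) receive gradients; in the gated variant the learned gate lets the decoder decide, per dimension, how much to rely on the language model.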


Notes

  1. https://www.imageclef.org/2020/medical/caption/.

  2. https://github.com/google-research/bert.

  3. https://cs.stanford.edu/people/karpathy/deepimagesent.

References

  1. Anderson, P., Fernando, B., Johnson, M., Gould, S.: SPICE: semantic propositional image caption evaluation. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9909, pp. 382–398. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46454-1_24

  2. Banerjee, S., Lavie, A.: METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. In: Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pp. 65–72. Association for Computational Linguistics (2005)

  3. Bengio, Y., Ducharme, R., Vincent, P., Janvin, C.: A neural probabilistic language model. J. Mach. Learn. Res. 3, 1137–1155 (2003)

  4. Brown, T.B., et al.: Language models are few-shot learners. CoRR abs/2005.14165 (2020). https://arxiv.org/abs/2005.14165

  5. Callison-Burch, C., Osborne, M., Koehn, P.: Re-evaluating the role of BLEU in machine translation research. In: 11th Conference of the EACL, Trento, Italy (2006)

  6. Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-end object detection with transformers. CoRR abs/2005.12872 (2020)

  7. Cho, J., et al.: Language model integration based on memory control for sequence to sequence speech recognition. In: ICASSP, Brighton, United Kingdom, pp. 6191–6195 (2019)

  8. Cho, K., et al.: Learning phrase representations using RNN encoder-decoder for statistical machine translation. In: Proceedings of EMNLP, pp. 1724–1734 (2014)

  9. Dauphin, Y.N., Fan, A., Auli, M., Grangier, D.: Language modeling with gated convolutional networks. In: Proceedings of the 34th ICML, pp. 933–941 (2017)

  10. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of NAACL 2019, Minneapolis, pp. 4171–4186 (2019). https://doi.org/10.18653/v1/n19-1423

  11. Fan, A., Lewis, M., Dauphin, Y.: Hierarchical neural story generation. In: Proceedings of the 56th Annual Meeting of the ACL, Melbourne, pp. 889–898 (2018)

  12. Gülçehre, Ç., et al.: On using monolingual corpora in neural machine translation. CoRR abs/1503.03535 (2015). http://arxiv.org/abs/1503.03535

  13. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of CVPR 2016, pp. 770–778. IEEE (2016)

  14. Kalimuthu, M., Nunnari, F., Sonntag, D.: A competitive deep neural network approach for the ImageCLEFmed caption 2020 task. In: Working Notes of CLEF 2020, Thessaloniki. CEUR Workshop Proceedings, vol. 2696. CEUR-WS.org (2020)

  15. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: 3rd International Conference on Learning Representations, ICLR, San Diego (2015)

  16. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: NIPS 2012, Lake Tahoe, pp. 1106–1114 (2012)

  17. Lin, C.Y.: ROUGE: a package for automatic evaluation of summaries. In: Text Summarization Branches Out, pp. 74–81. Association for Computational Linguistics (2004)

  18. Lu, X., Wang, B., Zheng, X., Li, X.: Exploring models and data for remote sensing image caption generation. IEEE Trans. Geosci. Remote Sens. 56, 2183–2195 (2017)

  19. Merity, S., Keskar, N.S., Socher, R.: Regularizing and optimizing LSTM language models. In: Proceedings of ICLR 2018, Vancouver, Conference Track Proceedings (2018)

  20. Mikolov, T., Karafiát, M., Burget, L., Cernocký, J., Khudanpur, S.: Recurrent neural network based language model. In: INTERSPEECH, pp. 1045–1048 (2010)

  21. Mogadala, A., Kalimuthu, M., Klakow, D.: Trends in integration of vision and language research: a survey of tasks, datasets, and methods. CoRR abs/1907.09358 (2019). http://arxiv.org/abs/1907.09358 (accepted at the Journal of AI Research)

  22. Mosbach, M., Andriushchenko, M., Klakow, D.: On the stability of fine-tuning BERT: misconceptions, explanations, and strong baselines. CoRR abs/2006.04884 (2020). https://arxiv.org/abs/2006.04884

  23. Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: BLEU: a method for automatic evaluation of machine translation. In: Proceedings of ACL, pp. 311–318 (2002)

  24. Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al.: PyTorch: an imperative style, high-performance deep learning library. In: NeurIPS 2019, pp. 8026–8037 (2019)

  25. Pelka, O., Friedrich, C.M., García Seco de Herrera, A., Müller, H.: Overview of the ImageCLEFmed 2020 concept prediction task: medical image understanding. In: CLEF 2020 Working Notes. CEUR Workshop Proceedings, vol. 2696 (2020)

  26. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I.: Language models are unsupervised multitask learners. OpenAI Blog 1(8), 9 (2019)

  27. Rohrbach, A., Hendricks, L.A., Burns, K., Darrell, T., Saenko, K.: Object hallucination in image captioning. In: Proceedings of EMNLP, Brussels, pp. 4035–4045 (2018)

  28. Sammani, F., Elsayed, M.: Look and modify: modification networks for image captioning. In: 30th BMVC 2019, Cardiff, UK, p. 75. BMVA Press (2019)

  29. Sammani, F., Melas-Kyriazi, L.: Show, edit and tell: a framework for editing image captions. In: Proceedings of CVPR 2020, Seattle, pp. 4808–4816. IEEE (2020)

  30. Sriram, A., Jun, H., Satheesh, S., Coates, A.: Cold fusion: training seq2seq models together with language models. In: Proceedings of Interspeech 2018, pp. 387–391 (2018)

  31. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. In: NIPS 2017, pp. 5998–6008 (2017)

  32. Vedantam, R., Lawrence Zitnick, C., Parikh, D.: CIDEr: consensus-based image description evaluation. In: Proceedings of CVPR 2015, pp. 4566–4575. IEEE (2015)

  33. Vinyals, O., Toshev, A., Bengio, S., Erhan, D.: Show and tell: a neural image caption generator. In: CVPR, pp. 3156–3164. IEEE Computer Society (2015)

  34. Wu, Y., et al.: Google's neural machine translation system: bridging the gap between human and machine translation. CoRR abs/1609.08144 (2016). http://arxiv.org/abs/1609.08144

  35. Xu, K., et al.: Show, attend and tell: neural image caption generation with visual attention. In: International Conference on Machine Learning, pp. 2048–2057 (2015)


Acknowledgements

This work was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – project-id 232722074 – SFB 1102. We extend our thanks to Matthew Kuhn for painstakingly proofing the whole manuscript.

Author information

Correspondence to Marimuthu Kalimuthu.


A Appendix

Here we provide examples of (token) corrections made by the fusion models and categorize the edits, based on the nature of the change, into one of the following five categories: (i) Gender, (ii) Color, (iii) Specificity, (iv) Syntactic, (v) Semantic.

The above classification is provided only as a preliminary illustration. To thoroughly understand the trends in caption emendations and to draw firm conclusions, a detailed human evaluation should be performed on all three datasets; we leave this for future work. In the following examples, we color-code the incorrect tokens, their correct replacements, and tokens that are equally valid alternatives.

Semantic Correction. This section presents examples where the fusion models corrected a few tokens in the baseline captions so as to make them semantically valid with respect to the image. Edits for semantic correctness may include emendation of visual attributes such as colors, objects, and object size (Figs. 3 and 4).

Fig. 3. An example illustrating the correction of semantic errors in the captions by our simple fusion model.

Fig. 4. Another example showing the correction of semantic errors with cold fusion.

Gender Alteration. This section illustrates a case where the fusion models corrected an incorrect gender reference in a caption produced by the baseline model (Fig. 5).

Fig. 5. An example of the cold fusion approach achieving gender correction.

Specificity. This category covers emendations by the fusion models where the corrected captions describe the images more precisely than the baseline captions (Fig. 6).

Fig. 6. An example showing improved specificity with hierarchical fusion.

Syntactic Correction. In this section, we show an example where syntactic errors, such as token repetitions in the baseline captions, are correctly emended by the fusion models (Fig. 7).

Fig. 7. Replacement of repetitive tokens with a correct alternative.

Color Correction. Here we show an example where the fusion models emended color attributes in the baseline model's captions (Fig. 8).

Fig. 8. An example of cold fusion emending a color attribute.


Copyright information

© 2021 Springer Nature Switzerland AG

About this paper


Cite this paper

Kalimuthu, M., Mogadala, A., Mosbach, M., Klakow, D. (2021). Fusion Models for Improved Image Captioning. In: Del Bimbo, A., et al. Pattern Recognition. ICPR International Workshops and Challenges. ICPR 2021. Lecture Notes in Computer Science, vol 12666. Springer, Cham. https://doi.org/10.1007/978-3-030-68780-9_32


  • DOI: https://doi.org/10.1007/978-3-030-68780-9_32


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-68779-3

  • Online ISBN: 978-3-030-68780-9

