Abstract
Visual captioning aims to generate textual descriptions of images or videos. Traditionally, image captioning models are trained on human-annotated datasets such as Flickr30k and MS-COCO, which are limited in size and diversity. This limitation hinders the generalization capabilities of these models and makes them prone to errors. Language models, by contrast, can be trained on vast amounts of freely available unlabelled data and have recently emerged as successful language encoders [10] and coherent text generators [4]. Meanwhile, several unimodal and multimodal fusion techniques have proven effective for natural language generation [11] and automatic speech recognition [30]. Building on these developments, and with the aim of improving the quality of generated captions, our contribution is two-fold. First, we propose a generic multimodal model fusion framework for caption generation and emendation, in which we use different fusion strategies to integrate a pretrained Auxiliary Language Model (AuxLM) into traditional encoder-decoder visual captioning frameworks. Second, we employ the same fusion strategies to integrate a pretrained Masked Language Model (MLM), namely BERT, with a visual captioning model, namely Show, Attend, and Tell, to emend both syntactic and semantic errors in captions. Our caption emendation experiments on three benchmark image captioning datasets (Flickr8k, Flickr30k, and MS-COCO) show improvements over the baseline, indicating the usefulness of the proposed multimodal fusion strategies. Finally, we perform a preliminary qualitative analysis of the emended captions and identify error categories based on the type of corrections.
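To make the idea of fusing a captioning decoder with an auxiliary language model concrete, the following is a minimal sketch of one simple possibility: a log-linear interpolation of the two next-token distributions at decoding time. The abstract does not specify which fusion strategies are used, so this should not be read as the paper's method; the function names, the `lm_weight` parameter, and the toy logits are all assumptions made for illustration.

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax over a list of raw scores."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def fuse_log_probs(caption_logits, lm_logits, lm_weight=0.3):
    """Combine two next-token distributions over the same vocabulary.

    Assumption: a simple log-linear interpolation, where `lm_weight`
    (hypothetical hyperparameter) scales the language model's contribution.
    """
    if len(caption_logits) != len(lm_logits):
        raise ValueError("both models must score the same vocabulary")
    cap = log_softmax(caption_logits)
    lm = log_softmax(lm_logits)
    fused = [c + lm_weight * l for c, l in zip(cap, lm)]
    # Renormalize so the result is again a valid log-distribution.
    return log_softmax(fused)


def argmax_token(vocab, log_probs):
    """Return the vocabulary entry with the highest log-probability."""
    return vocab[max(range(len(vocab)), key=log_probs.__getitem__)]


# Toy example: the captioning model slightly prefers "man",
# while the language model strongly prefers "woman" in this context.
vocab = ["man", "woman", "dog"]
caption_logits = [2.0, 1.9, 0.1]
lm_logits = [0.0, 2.0, 0.0]

without_lm = argmax_token(vocab, fuse_log_probs(caption_logits, lm_logits, lm_weight=0.0))
with_lm = argmax_token(vocab, fuse_log_probs(caption_logits, lm_logits, lm_weight=0.5))
```

With `lm_weight=0.0` the captioning model's own preference is kept, whereas a non-zero weight lets the language model change the selected token. Fusion schemes that combine hidden states inside the network, rather than output scores, require trained parameters and are not covered by this sketch.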
References
Anderson, P., Fernando, B., Johnson, M., Gould, S.: SPICE: semantic propositional image caption evaluation. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9909, pp. 382–398. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46454-1_24
Banerjee, S., Lavie, A.: METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. In: Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pp. 65–72. Association for Computational Linguistics (2005)
Bengio, Y., Ducharme, R., Vincent, P., Janvin, C.: A neural probabilistic language model. J. Mach. Learn. Res. 3, 1137–1155 (2003)
Brown, T.B., et al.: Language models are few-shot learners. CoRR abs/2005.14165 (2020), https://arxiv.org/abs/2005.14165
Callison-Burch, C., Osborne, M., Koehn, P.: Re-evaluating the role of Bleu in machine translation research. In: 11th Conference EACL, Trento, Italy (2006)
Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-end object detection with transformers. CoRR abs/2005.12872 (2020)
Cho, J., et al.: Language model integration based on memory control for sequence to sequence speech recognition. In: ICASSP, Brighton, United Kingdom, pp. 6191–6195 (2019)
Cho, K., et al.: Learning phrase representations using RNN encoder-decoder for statistical machine translation. In: Proceedings of EMNLP, pp. 1724–1734 (2014)
Dauphin, Y.N., Fan, A., Auli, M., Grangier, D.: Language modeling with gated convolutional networks. In: Proceedings of the 34th ICML, pp. 933–941 (2017)
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of NAACL 2019, Minneapolis, pp. 4171–4186 (2019). https://doi.org/10.18653/v1/n19-1423
Fan, A., Lewis, M., Dauphin, Y.: Hierarchical neural story generation. In: Proceedings of the 56th Annual Meeting of ACL 2018, Melbourne, pp. 889–898 (2018)
Gülçehre, Ç., et al.: On using monolingual corpora in neural machine translation. CoRR abs/1503.03535 (2015), http://arxiv.org/abs/1503.03535
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on CVPR 2016, pp. 770–778 (2016)
Kalimuthu, M., Nunnari, F., Sonntag, D.: A competitive deep neural network approach for the ImageCLEFmed caption 2020 task. In: Working Notes of CLEF 2020, Thessaloniki. CEUR Workshop Proceedings, vol. 2696. CEUR-WS.org (2020)
Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: 3rd International Conference on Learning Representations, ICLR, San Diego (2015)
Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: NIPS 2012, Lake Tahoe, pp. 1106–1114 (2012)
Lin, C.Y.: ROUGE: a package for automatic evaluation of summaries. In: Text Summarization Branches Out, pp. 74–81. Assoc. for Comp. Linguistics (2004)
Lu, X., Wang, B., Zheng, X., Li, X.: Exploring models and data for remote sensing image caption generation. Trans. Geosci. Remote Sens. 56, 2183–2195 (2017)
Merity, S., Keskar, N.S., Socher, R.: Regularizing and optimizing LSTM language models. In: Proceedings of ICLR 2018, Vancouver, Conference Track Proceedings (2018)
Mikolov, T., Karafiát, M., Burget, L., Cernocký, J., Khudanpur, S.: Recurrent neural network based language model. In: INTERSPEECH, pp. 1045–1048 (2010)
Mogadala, A., Kalimuthu, M., Klakow, D.: Trends in integration of vision and language research: a survey of tasks, datasets, and methods. CoRR abs/1907.09358 (2019), http://arxiv.org/abs/1907.09358 (accepted at the Journal of AI Research)
Mosbach, M., Andriushchenko, M., Klakow, D.: On the stability of fine-tuning BERT: misconceptions, explanations, and strong baselines. CoRR abs/2006.04884 (2020), https://arxiv.org/abs/2006.04884
Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: BLEU: a method for automatic evaluation of machine translation. In: Proceedings of ACL, pp. 311–318 (2002)
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al.: PyTorch: an imperative style, high-performance deep learning library. NeurIPS 2019, 8026–8037 (2019)
Pelka, O., Friedrich, C.M., García Seco de Herrera, A., Müller, H.: Overview of the ImageCLEFmed 2020 concept prediction task: medical image understanding. In: CLEF2020 Working Notes, CEUR Workshop Proceedings, vol. 2696 (2020)
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I.: Language models are unsupervised multitask learners. OpenAI blog 1(8), 9 (2019)
Rohrbach, A., Hendricks, L.A., Burns, K., Darrell, T., Saenko, K.: Object hallucination in image captioning. In: Proceedings of EMNLP Brussels, pp. 4035–4045 (2018)
Sammani, F., Elsayed, M.: Look and modify: modification networks for image captioning. In: 30th BMVC 2019, Cardiff, UK, p. 75. BMVA Press (2019)
Sammani, F., Melas-Kyriazi, L.: Show, edit and tell: a framework for editing image captions. In: Proceedings of CVPR 2020, Seattle, pp. 4808–4816. IEEE (2020)
Sriram, A., Jun, H., Satheesh, S., Coates, A.: Cold fusion: training seq2seq models together with language models. In: Proceedings of Interspeech 2018, pp. 387–391 (2018)
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. NIPS 2017, 5998–6008 (2017)
Vedantam, R., Lawrence Zitnick, C., Parikh, D.: CIDEr: consensus-based image description evaluation. In: Proceedings of CVPR 2015, pp. 4566–4575. IEEE CS (2015)
Vinyals, O., Toshev, A., Bengio, S., Erhan, D.: Show and tell: a neural image caption generator. In: CVPR, pp. 3156–3164. IEEE Computer Society (2015)
Wu, Y., et al.: Google’s neural machine translation system: bridging the gap between human and machine translation. CoRR abs/1609.08144 (2016), http://arxiv.org/abs/1609.08144
Xu, K., et al.: Show, attend and tell: neural image caption generation with visual attention. In: International Conference on Machine Learning, pp. 2048–2057 (2015)
Acknowledgements
This work was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – project-id 232722074 – SFB 1102. We extend our thanks to Matthew Kuhn for painstakingly proofing the whole manuscript.
A Appendix
Here we provide examples of token-level corrections made by the fusion models and assign each edit to one of the following five categories, based on the nature of the change: (i) Gender, (ii) Color, (iii) Specificity, (iv) Syntactic, and (v) Semantic.
The above classification is provided only for preliminary illustration. To understand the trends in caption emendations thoroughly and to draw conclusions, a detailed study using human evaluation should be performed on all three datasets. We leave this for future work. In the following examples, we color-code the incorrect tokens, the correct replacements, and the equally valid tokens.
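As a rough illustration of how such token-level corrections could be extracted before any manual categorization, the sketch below aligns a baseline caption with its emended version and lists the differing spans. It relies on Python's standard `difflib` module; the small word lists, the function names, and the example captions are hypothetical and serve only to illustrate the idea. Only the Gender and Color categories lend themselves to a simple lexicon lookup here; Specificity, Syntactic, and Semantic edits would still need human judgment.

```python
from difflib import SequenceMatcher

# Hypothetical toy lexicons, for illustration only.
GENDER_WORDS = {"man", "woman", "boy", "girl", "he", "she", "his", "her"}
COLOR_WORDS = {"red", "green", "blue", "black", "white", "yellow",
               "brown", "orange", "pink", "purple", "gray", "grey"}


def token_edits(baseline, emended):
    """Return (operation, old_tokens, new_tokens) for every non-matching span."""
    old = baseline.lower().split()
    new = emended.lower().split()
    matcher = SequenceMatcher(a=old, b=new, autojunk=False)
    edits = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            edits.append((tag, old[i1:i2], new[j1:j2]))
    return edits


def coarse_category(old_tokens, new_tokens):
    """Assign a coarse label using the toy lexicons above."""
    touched = set(old_tokens) | set(new_tokens)
    if touched & GENDER_WORDS:
        return "gender"
    if touched & COLOR_WORDS:
        return "color"
    return "needs manual review"


# Made-up captions, not taken from the paper's figures.
baseline_caption = "a man in a red shirt"
emended_caption = "a woman in a blue shirt"

edits = token_edits(baseline_caption, emended_caption)
labels = [coarse_category(old, new) for _, old, new in edits]
```

For the example pair, `edits` holds two replacements (`man` to `woman` and `red` to `blue`), labelled `gender` and `color` respectively.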
Semantic Correction. This section presents examples where the fusion models have corrected a few tokens in the baseline captions to make them semantically valid with respect to the image. Edits for semantic correctness may include emendation of visual attributes such as colors, objects, and object size (Figs. 3 and 4).
Gender Alteration. This section illustrates the case where the fusion models corrected an incorrect gender term in a caption produced by the baseline model (Fig. 5).
Specificity. This category covers emendations where the corrected captions describe the images more precisely than the baseline captions (Fig. 6).
Syntactic Correction. In this section, we show an example to demonstrate the case where syntactic errors such as token repetitions in the baseline captions are correctly emended by the fusion models (Fig. 7).
Color Correction. In this part, we show an example to illustrate the case where the fusion models emended color attributes in the captions of the baseline model (Fig. 8).
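Token repetitions of the kind mentioned under Syntactic Correction can be flagged automatically before inspecting the emended captions. The following is a minimal sketch, assuming whitespace tokenization and a hypothetical helper `repeated_spans`; it only detects an n-gram that is immediately followed by an identical n-gram and is not part of the paper's pipeline.

```python
def repeated_spans(tokens, max_n=3):
    """Find immediately repeated n-grams.

    Returns a list of (start_index, n) pairs such that
    tokens[start:start + n] == tokens[start + n:start + 2 * n].
    """
    hits = []
    for n in range(1, max_n + 1):
        for start in range(len(tokens) - 2 * n + 1):
            if tokens[start:start + n] == tokens[start + n:start + 2 * n]:
                hits.append((start, n))
    return hits


# Made-up captions, for illustration only.
unigram_hits = repeated_spans("a dog dog runs on the the grass".split())
bigram_hits = repeated_spans("on the on the beach".split())
clean_hits = repeated_spans("a dog runs on the grass".split())
```

In the first caption the repeated single tokens `dog` and `the` are reported, in the second the repeated bigram `on the`, and the clean caption yields no hits.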
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Kalimuthu, M., Mogadala, A., Mosbach, M., Klakow, D. (2021). Fusion Models for Improved Image Captioning. In: Del Bimbo, A., et al. Pattern Recognition. ICPR International Workshops and Challenges. ICPR 2021. Lecture Notes in Computer Science, vol 12666. Springer, Cham. https://doi.org/10.1007/978-3-030-68780-9_32
DOI: https://doi.org/10.1007/978-3-030-68780-9_32
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-68779-3
Online ISBN: 978-3-030-68780-9
eBook Packages: Computer Science, Computer Science (R0)