
Transformer with Prior Language Knowledge for Image Captioning

  • Conference paper
Neural Information Processing (ICONIP 2021)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 13109)

Abstract

The Transformer architecture represents the state of the art in image captioning. However, even though the Transformer uses positional encodings to encode sentences, the grammaticality of its generated captions is still unsatisfactory. To improve image captioning performance, we present the Prior Language Knowledge Transformer (PLKT), a Transformer-based model that integrates learned prior language knowledge into image captioning. In our proposal, when the model predicts the next word, it depends not only on the previously generated sequence but also on prior language knowledge. To obtain this prior language knowledge, we embed a learnable memory vector inside the self-attention mechanism. In addition, we use reinforcement learning to fine-tune the model during training. To demonstrate the effectiveness of PLKT, we compare our approach with other recent image captioning methods. In the objective evaluation, our proposal improves the CIDEr score of the baseline by 0.6 points on the "Karpathy" test split of the COCO 2014 dataset. In the subjective evaluation, the sentences generated by our approach are clearly more grammatical than those of the baseline.
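For readers who want a concrete picture of the mechanism sketched in the abstract, the following is a minimal, hypothetical sketch (in PyTorch) of one way a learnable memory can be embedded inside self-attention: a small set of trainable key/value slots is appended to the keys and values computed from the input, so every query can also attend to knowledge that is independent of the current sequence. The module name, slot count, and dimensions here are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MemoryAugmentedSelfAttention(nn.Module):
    """Self-attention with learnable memory slots appended to keys/values.

    Hypothetical sketch of the idea described in the abstract: the memory
    slots are trained jointly with the model and act as prior language
    knowledge that every query can attend to, independent of the input.
    """

    def __init__(self, d_model: int = 512, n_heads: int = 8, n_memory: int = 40):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)
        # Learnable memory: extra key/value slots shared across all inputs.
        self.mem_k = nn.Parameter(torch.randn(1, n_memory, d_model) * 0.02)
        self.mem_v = nn.Parameter(torch.randn(1, n_memory, d_model) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        b, n, _ = x.shape
        q = self.q_proj(x)
        # Append the memory slots to the projected keys and values, so the
        # attention distribution covers both the sequence and the memory.
        k = torch.cat([self.k_proj(x), self.mem_k.expand(b, -1, -1)], dim=1)
        v = torch.cat([self.v_proj(x), self.mem_v.expand(b, -1, -1)], dim=1)

        def split(t: torch.Tensor) -> torch.Tensor:
            # (b, len, d_model) -> (b, n_heads, len, d_head)
            return t.view(b, t.size(1), self.n_heads, self.d_head).transpose(1, 2)

        q, k, v = split(q), split(k), split(v)
        attn = F.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, n, -1)
        return self.out_proj(out)
```

Under this reading, next-word prediction in the decoder depends both on the previously generated tokens (through the ordinary attention terms) and on the learned memory slots (the prior language knowledge). The reinforcement-learning fine-tuning mentioned in the abstract would then optimize a sequence-level reward such as CIDEr, although the abstract does not specify the exact reward scheme.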

D. Yan and W. Yu contributed equally to this work.

Acknowledgment

This research was supported by the Sichuan Science and Technology Program (No. 2020YFS0307, No. 2020YFG0430, No. 2019YFS0146) and the Mianyang Science and Technology Program (No. 2020YFZJ016).

Author information


Corresponding author

Correspondence to Wenxin Yu.



Copyright information

© 2021 Springer Nature Switzerland AG

About this paper


Cite this paper

Yan, D., Yu, W., Zhang, Z., Gong, J. (2021). Transformer with Prior Language Knowledge for Image Captioning. In: Mantoro, T., Lee, M., Ayu, M.A., Wong, K.W., Hidayanto, A.N. (eds) Neural Information Processing. ICONIP 2021. Lecture Notes in Computer Science, vol. 13109. Springer, Cham. https://doi.org/10.1007/978-3-030-92270-2_4


  • DOI: https://doi.org/10.1007/978-3-030-92270-2_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-92269-6

  • Online ISBN: 978-3-030-92270-2

  • eBook Packages: Computer Science, Computer Science (R0)
