
Adaptive Syncretic Attention for Constrained Image Captioning

Neural Processing Letters (2019)

Abstract

Recently, deep learning approaches to image captioning have attracted considerable attention and achieved remarkable progress. In this paper, we propose a novel model that simultaneously explores a better representation of images and the relationship between visual and semantic information. The model consists of three parts: an Adaptive Syncretic Attention (ASA) mechanism, an LSTM + MLP mimic constraint network, and a multimodal layer. In the ASA, we integrate local semantic features captured by a region proposal network with time-varying global visual features through an attention mechanism. The LSTM + MLP mimic constraint network combines a Multilayer Perceptron (MLP) with a Long Short-Term Memory (LSTM) model; at test time, it generates a Mimic Constraint Vector for each test image. Finally, we combine textual and visual information in the multimodal layer. Based on these three parts, our full model is capable of both capturing meaningful local features and generating sentences that are more relevant to the image content. We evaluate our model on two popular datasets (Flickr30k and MSCOCO). The results show that each module improves performance, and the full model is on par with or better than state-of-the-art methods.
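As a rough illustration of the ASA idea described above, the following Python sketch (using PyTorch, which is an assumption; the abstract does not name a framework) shows how local region features from a region proposal network might be fused with a time-varying global context, such as the decoder LSTM hidden state, through soft attention. All module names, dimensions, and the exact scoring function are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' released code) of the Adaptive Syncretic
# Attention idea: local region features from a region proposal network are
# fused with a time-varying global context (here, the decoder LSTM hidden
# state) via soft attention. Names and dimensions are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AdaptiveSyncreticAttention(nn.Module):
    def __init__(self, region_dim: int, hidden_dim: int, attn_dim: int = 512):
        super().__init__()
        self.proj_region = nn.Linear(region_dim, attn_dim)  # project local (RPN) features
        self.proj_hidden = nn.Linear(hidden_dim, attn_dim)  # project time-varying global context
        self.score = nn.Linear(attn_dim, 1)                 # scalar attention score per region

    def forward(self, region_feats: torch.Tensor, hidden: torch.Tensor) -> torch.Tensor:
        # region_feats: (batch, num_regions, region_dim) from the region proposal network
        # hidden:       (batch, hidden_dim) decoder LSTM state at the current time step
        e = torch.tanh(self.proj_region(region_feats) + self.proj_hidden(hidden).unsqueeze(1))
        alpha = F.softmax(self.score(e).squeeze(-1), dim=1)          # (batch, num_regions)
        attended = (alpha.unsqueeze(-1) * region_feats).sum(dim=1)   # weighted local context
        return attended


# Toy usage: 36 region proposals of dimension 2048, LSTM hidden size 512.
if __name__ == "__main__":
    attn = AdaptiveSyncreticAttention(region_dim=2048, hidden_dim=512)
    regions = torch.randn(4, 36, 2048)
    h_t = torch.randn(4, 512)
    context = attn(regions, h_t)
    print(context.shape)  # torch.Size([4, 2048])
```

In this sketch the attention weights are recomputed at every decoding step because the LSTM state changes, which is what makes the global context time-varying; how the attended local context is then combined with textual features in the multimodal layer is left out here.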

Acknowledgements

This work was supported in part by the National Natural Science Foundation of China under Grants 61673402, 61273270, and 60802069, the National Key R&D Program of China under Grant 2018YFB1601101, the Natural Science Foundation of Guangdong Province under Grants 2017A030311029 and 2016B010109002, the Science and Technology Program of Guangzhou, China, under Grant 201704020180, and the Fundamental Research Funds for the Central Universities of China.

Author information

Correspondence to Haifeng Hu.

Cite this article

Yang, L., Hu, H. Adaptive Syncretic Attention for Constrained Image Captioning. Neural Process Lett 50, 549–564 (2019). https://doi.org/10.1007/s11063-019-10045-5
