Abstract
Image captioning is the task of generating a caption for a given image based on its content. Describing an image effectively requires extracting as much information from it as possible. Beyond detecting the objects present and their relative arrangement, the underlying topic of the image is another vital piece of information that can be incorporated into a model to improve caption generation. The aim is to place extra emphasis on the context of the image, imitating the human approach: objects that are unrelated to the context of the image should not appear in the generated caption. In this work, the focus is on detecting the topic of the image so as to guide a novel deep learning-based encoder–decoder framework that generates the caption. The method is compared with earlier state-of-the-art models on the MSCOCO 2017 training data set. BLEU, CIDEr, ROUGE-L, and METEOR scores are used to measure the efficacy of the model and show an improvement in the performance of the caption generation process.
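The topic signal described above can be obtained from caption text with a standard topic model such as LDA. The sketch below is a minimal, hypothetical illustration (not the paper's actual pipeline) of inferring a per-caption topic distribution with scikit-learn; the example captions and the choice of two topics are assumptions for demonstration only.

```python
# Hypothetical sketch: infer a topic distribution from caption text with LDA.
# A distribution like this could then condition an encoder-decoder captioner.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

captions = [
    "a man riding a wave on a surfboard",
    "a surfer rides a large wave in the ocean",
    "a plate of pasta with tomato sauce",
    "a bowl of noodles and vegetables on a table",
]

# Bag-of-words counts over the caption vocabulary (stop words removed).
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(captions)

# Fit a 2-topic LDA model; fit_transform returns one topic mixture per caption.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
topic_dist = lda.fit_transform(X)

# Each row sums to 1; the argmax gives the caption's dominant topic.
print(topic_dist.shape)  # (4, 2)
```

At inference time, a captioning system of this kind would have to predict the topic mixture from the image itself (e.g. with a CNN trained against these LDA targets), since no ground-truth caption is available.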
Cite this article
Dash, S.K., Acharya, S., Pakray, P. et al. Topic-Based Image Caption Generation. Arab J Sci Eng 45, 3025–3034 (2020). https://doi.org/10.1007/s13369-019-04262-2