Abstract
Description models aim to generate captions that describe the content of an image. Despite recent advances in machine learning and computer vision, generating discriminative captions remains a challenging problem. Traditional approaches imitate frequent language patterns without considering the semantic alignment of words. In this work, an image captioning framework is proposed that generates topic-sensitive descriptions. The model captures the semantic relations and the polysemous nature of the words that describe an image and, as a result, generates superior descriptions for the target images. The efficacy of the proposed model is demonstrated through evaluation on standard captioning benchmarks, where it shows promising performance compared to description models proposed in the recent literature.
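To illustrate the intuition behind topic-sensitive representations of polysemous words, the following toy sketch (an illustrative assumption, not the paper's actual model) mixes per-topic word vectors by a topic distribution inferred from context, so the same word receives different embeddings under different caption topics:

```python
import numpy as np

# Hypothetical example: a polysemous word (e.g. "bank") has one vector per
# topic; its context-dependent embedding is a topic-weighted mixture.
rng = np.random.default_rng(0)
n_topics, dim = 3, 4

# Per-topic embeddings of a single word (one row per topic).
word_topic_vecs = rng.normal(size=(n_topics, dim))

def topic_sensitive_embedding(word_topic_vecs, topic_dist):
    """Weighted average of per-topic vectors under p(topic | context)."""
    topic_dist = np.asarray(topic_dist, dtype=float)
    topic_dist = topic_dist / topic_dist.sum()   # normalize to a distribution
    return topic_dist @ word_topic_vecs          # (n_topics,) @ (n_topics, dim)

# Two different contexts yield two different embeddings for the same word.
river_context = topic_sensitive_embedding(word_topic_vecs, [0.9, 0.05, 0.05])
money_context = topic_sensitive_embedding(word_topic_vecs, [0.05, 0.9, 0.05])
```

Conditioning the embedding on a topic distribution in this way is one simple route to disambiguating word senses before caption generation.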
Zia, U., Riaz, M.M., Ghafoor, A. et al. Topic sensitive image descriptions. Neural Comput & Applic 32, 10471–10479 (2020). https://doi.org/10.1007/s00521-019-04587-x