Integration of textual cues for fine-grained image captioning using deep CNN and LSTM

  • Neeraj Gupta
  • Anand Singh Jalal
Developing nature-inspired intelligence by neural systems


The automatic narration of a natural scene is an important trait in artificial intelligence that unites computer vision and natural language processing. Caption generation is a challenging task in scene understanding. Most state-of-the-art methods use deep convolutional neural network models to extract visual features of the entire image, and then exploit the parallel structures between images and sentences with recurrent neural networks for caption generation. However, such models exploit only visual features. This work investigates whether fusing the text available in an image can yield more fine-grained captioning of a scene. In this paper, we propose a model that combines a deep convolutional neural network and long short-term memory to boost the accuracy of image captioning by fusing the textual features available in an image with the visual features extracted in state-of-the-art methods. We validate the effectiveness of the proposed model on the benchmark Flickr8k and Flickr30k datasets. The experimental outcomes show that the proposed model outperforms the state-of-the-art methods for image captioning.
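The fusion step described above can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the dimensions (a 4096-d CNN visual feature, as in VGG's fc7 layer, and a 300-d word2vec-style embedding of the detected scene text) and the single projection layer are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical inputs: the paper's exact feature sizes are not given here.
visual_feat = rng.standard_normal(4096)  # CNN visual feature of the whole image
text_feat = rng.standard_normal(300)     # embedding of text detected in the image


def fuse(visual, text, out_dim=512, seed=1):
    """Concatenate visual and textual features and project them into a
    common embedding that could seed an LSTM caption decoder.
    The random projection weights stand in for learned parameters."""
    rng = np.random.default_rng(seed)
    fused = np.concatenate([visual, text])            # (4396,)
    W = rng.standard_normal((out_dim, fused.size)) * 0.01
    b = np.zeros(out_dim)
    return np.tanh(W @ fused + b)                     # bounded activation


x0 = fuse(visual_feat, text_feat)
print(x0.shape)  # (512,)
```

In a trained model, `W` and `b` would be learned jointly with the LSTM decoder, and `x0` would serve as the decoder's initial input or state.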


Text saliency · Image captioning · Convolutional neural network · Long short-term memory


Compliance with ethical standards

Conflict of interest

The authors declare that they have no conflict of interest.



Copyright information

© Springer-Verlag London Ltd., part of Springer Nature 2019

Authors and Affiliations

  1. Department of Computer Engineering and Applications, GLA University, Mathura, India
