Automated Image Captioning for Flickr8K Dataset

  • K. Anitha Kumari
  • C. Mouneeshwari
  • R. B. Udhaya
  • R. Jasmitha
Conference paper


Automated, accurate image captioning is an active research topic in deep learning. A captioning model must generate human-readable sentences for the regions in an image: it must understand the image well enough to select words and string them together into a coherent description. To achieve this, this research work applies a Convolutional Neural Network (CNN) and a Recurrent Neural Network (RNN) to the Flickr8K dataset. A region-based CNN (RCNN) is used to identify regions in the image and to recognize the objects within those regions, and an RNN generates the caption most relevant to the image. The bilingual evaluation understudy (BLEU) score is used as the evaluation metric.
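The BLEU metric mentioned above scores a generated caption by its modified n-gram precision against a reference caption, with a brevity penalty for captions shorter than the reference. The abstract does not specify the implementation used; the following is a minimal, illustrative sentence-level BLEU sketch (the `bleu` helper and its defaults are this sketch's own, not the paper's):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU of a candidate caption against one reference.

    Uses clipped (modified) n-gram precision for n = 1..max_n,
    combined by a geometric mean, times a brevity penalty.
    """
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(candidate, n))
        ref_counts = Counter(ngrams(reference, n))
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0  # no smoothing in this sketch
    # Brevity penalty: penalize captions shorter than the reference.
    if len(candidate) >= len(reference):
        bp = 1.0
    else:
        bp = math.exp(1 - len(reference) / len(candidate))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

A caption identical to the reference scores 1.0, and any caption with no 4-gram overlap scores 0.0 under this unsmoothed variant; production evaluations typically average over multiple references and apply smoothing.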


Keywords: Automated image captioning, Region-based CNN, Recurrent neural network, Flickr8K dataset

Abbreviations

  • BLEU: Bilingual evaluation understudy
  • CNN: Convolutional neural network
  • NLP: Natural language processing
  • RCNN: Region-based convolutional neural network
  • RNN: Recurrent neural network


Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  • K. Anitha Kumari (1)
  • C. Mouneeshwari (1)
  • R. B. Udhaya (1)
  • R. Jasmitha (1)
  1. PSG College of Technology, Coimbatore, India
