Cluster Computing, Volume 22, Supplement 3, pp 6143–6155

Fast image captioning using LSTM

  • Meng Han
  • Wenyu Chen (corresponding author)
  • Alemu Dagmawi Moges


Computer vision and natural language processing have long been among the central challenges in artificial intelligence. In this paper, we explore a generative automatic image annotation model that draws on recent advances on both fronts. Our approach uses a deep convolutional neural network to detect image regions, which are then fed to a recurrent neural network trained to maximize the likelihood of the target sentence describing the given image. In our experiments, we found that accuracy and training both improved when the image representation from our detection model was coupled with the input word embedding. We also found that most of the information from the last layer of the detection model vanishes when it is fed as the thought vector to our LSTM decoder. This is mainly because the last fully connected layer of the YOLO model encodes only the class probabilities of the detected objects and their bounding boxes, and this information is not rich enough. We trained our model on the COCO benchmark for 60 h, using a 64,000-sample training set and a 12,800-sample validation set, and achieved 23% accuracy. We also observed a significant drop in training speed when we increased the number of hidden units in the LSTM layer from 1470 to 4096.
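The coupling described in the abstract, where the image representation is joined with the word embedding at the decoder input and the decoder is trained to maximize the likelihood of the target sentence, can be illustrated with a minimal sketch. This is not the authors' implementation: the names `lstm_step`, `caption_nll` and `init_params`, the toy dimensions, and the random stand-in for the detector's image feature are all assumptions made for illustration, and the code only evaluates the negative log-likelihood of one caption rather than training anything.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def lstm_step(x, h, c, W, b):
    """One LSTM step; W maps the concatenation [x; h] to four stacked gates."""
    n = len(h)
    z = [zi + bi for zi, bi in zip(matvec(W, x + h), b)]  # x + h concatenates lists
    i = [sigmoid(v) for v in z[0:n]]            # input gate
    f = [sigmoid(v) for v in z[n:2 * n]]        # forget gate
    o = [sigmoid(v) for v in z[2 * n:3 * n]]    # output gate
    g = [math.tanh(v) for v in z[3 * n:4 * n]]  # candidate cell state
    c_new = [fv * cv + iv * gv for fv, cv, iv, gv in zip(f, c, i, g)]
    h_new = [ov * math.tanh(cv) for ov, cv in zip(o, c_new)]
    return h_new, c_new

def log_softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]

def caption_nll(image_feat, token_ids, params):
    """Negative log-likelihood of a caption given an image feature vector."""
    E, W, b, W_out, b_out = params
    n = len(b) // 4
    h, c = [0.0] * n, [0.0] * n
    nll = 0.0
    # Predict each next token from the current token and the image feature.
    for cur, nxt in zip(token_ids[:-1], token_ids[1:]):
        x = E[cur] + image_feat  # concatenate word embedding and image feature
        h, c = lstm_step(x, h, c, W, b)
        logits = [v + bv for v, bv in zip(matvec(W_out, h), b_out)]
        nll -= log_softmax(logits)[nxt]
    return nll

def init_params(vocab, embed, feat, hidden, seed=0):
    """Small random parameters; all sizes here are hypothetical toy values."""
    rng = random.Random(seed)
    def mat(rows, cols):
        return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]
    E = mat(vocab, embed)                       # word embedding table
    W = mat(4 * hidden, embed + feat + hidden)  # LSTM gate weights
    b = [0.0] * (4 * hidden)
    W_out = mat(vocab, hidden)                  # hidden state -> vocabulary logits
    b_out = [0.0] * vocab
    return E, W, b, W_out, b_out

params = init_params(vocab=10, embed=8, feat=6, hidden=16)
feat_rng = random.Random(1)
image_feat = [feat_rng.gauss(0.0, 1.0) for _ in range(6)]  # stand-in for a detector feature
caption = [0, 3, 5, 2, 1]                                  # hypothetical token ids
loss = caption_nll(image_feat, caption, params)
print(loss)
```

Minimizing this quantity over the parameters is equivalent to maximizing the likelihood of the target sentence. Because the LSTM weight matrix has 4 × hidden rows and embed + feat + hidden columns, its size grows roughly quadratically with the number of hidden units, which is consistent with the slowdown the abstract reports when moving from 1470 to 4096 units.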


Keywords: Computer vision; Natural language processing; Image annotation; LSTM



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. School of Political Science and Public Administration, University of Electronic Science & Technology of China, Chengdu, People’s Republic of China
  2. School of Computer Science and Engineering, University of Electronic Science & Technology of China, Chengdu, People’s Republic of China
