Automatic Judgement of Neural Network-Generated Image Captions

  • Rajarshi Biswas
  • Aditya Mogadala
  • Michael Barz
  • Daniel Sonntag
  • Dietrich Klakow
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11816)


Manual evaluation of individual results in natural language generation tasks is a major bottleneck: it is time consuming and, when crowdsourced for example, expensive. In this work, we address this problem for the specific task of automatic image captioning. We automatically generate human-like judgements on the grammatical correctness, image relevance, and diversity of captions obtained from a neural image caption generator. For this purpose, we use pool-based active learning with uncertainty sampling and represent the captions as fixed-size vectors from Google’s Universal Sentence Encoder. In addition, we test common metrics, such as BLEU, ROUGE, METEOR, Levenshtein distance, and n-gram counts, and report the F1 score of the classifiers used under the active learning scheme for this task. To the best of our knowledge, our work is the first in this direction and promises to reduce time, cost, and human effort.
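As an illustration of pool-based active learning with uncertainty sampling, the following is a minimal sketch and not the implementation used in this work. It makes several assumptions: two-dimensional toy vectors stand in for the sentence-encoder representations of captions, a small logistic regression stands in for the classifiers, and a labelling function (`oracle`) stands in for the human annotator. All function names (`make_toy_pool`, `train_logreg`, `most_uncertain`, `active_learning`) are hypothetical.

```python
import math
import random


def predict_proba(w, b, x):
    """Probability of the positive class under a logistic model."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def train_logreg(X, y, epochs=200, lr=0.5):
    """Fit a binary logistic regression with per-sample gradient descent."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            err = predict_proba(w, b, xi) - yi
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b


def most_uncertain(w, b, pool):
    """Uncertainty sampling: index of the pool item closest to p = 0.5."""
    return min(range(len(pool)),
               key=lambda i: abs(predict_proba(w, b, pool[i]) - 0.5))


def active_learning(pool, oracle, seed_idx, budget):
    """Label a few seed items, then repeatedly query the most uncertain one."""
    pool = list(pool)  # work on a copy of the unlabelled pool
    X, y = [], []
    for i in sorted(seed_idx, reverse=True):  # pop high indices first
        x = pool.pop(i)
        X.append(x)
        y.append(oracle(x))
    w, b = train_logreg(X, y)
    for _ in range(budget):
        if not pool:
            break
        x = pool.pop(most_uncertain(w, b, pool))
        X.append(x)
        y.append(oracle(x))          # ask the (simulated) annotator
        w, b = train_logreg(X, y)    # retrain on the enlarged labelled set
    return w, b, X, y


def make_toy_pool(n_per_class=20, seed=0):
    """Toy data (an assumption): two Gaussian clusters and a linear oracle."""
    rng = random.Random(seed)
    pool = []
    for _ in range(n_per_class):
        pool.append([rng.gauss(-2, 0.5), rng.gauss(-2, 0.5)])  # class 0 region
        pool.append([rng.gauss(2, 0.5), rng.gauss(2, 0.5)])    # class 1 region
    oracle = lambda x: 1 if x[0] + x[1] > 0 else 0
    return pool, oracle


if __name__ == "__main__":
    pool, oracle = make_toy_pool()
    w, b, X, y = active_learning(pool, oracle, seed_idx=[0, 1], budget=5)
    print("labelled items:", len(X))
```

In this sketch the annotator is asked for only `budget` labels beyond the seed set, each chosen where the current classifier is least sure, which is the mechanism by which such a scheme can reduce labelling effort.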


Active learning · NLP · NLG · Automated human judgement · Image captioning · Neural networks



This research was funded in part by the German Federal Ministry of Education and Research (BMBF) under grant number 01IS17043 (project SciBot). Aditya Mogadala was supported by the German Research Foundation (DFG) as part of SFB1102.



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Rajarshi Biswas (1)
  • Aditya Mogadala (3)
  • Michael Barz (1, 2)
  • Daniel Sonntag (1)
  • Dietrich Klakow (3)

  1. German Research Center for Artificial Intelligence (DFKI), Saarbrücken, Germany
  2. Saarbrücken Graduate School of Computer Science, Saarbrücken, Germany
  3. Spoken Language Systems (LSV), Saarbrücken, Germany
