Knowledge Guided Attention and Inference for Describing Images Containing Unseen Objects

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10843)


Images on the Web encapsulate diverse knowledge about varied abstract concepts. They cannot be sufficiently described with models learned from image-caption pairs that mention only a small number of visual object categories. In contrast, large-scale knowledge graphs contain many more concepts that can be detected by image recognition models. Hence, to assist description generation for images that contain visual objects unseen in image-caption pairs, we propose a two-step process that leverages large-scale knowledge graphs. In the first step, a multi-entity recognition model is built to annotate images with concepts not mentioned in any caption. In the second step, those annotations are used as external semantic attention and as constraints during inference in the image description generation model. Evaluations show that our models outperform most prior work on out-of-domain MSCOCO image description generation and also scale better to broad domains with more unseen objects.
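The second step can be illustrated with a toy sketch. This is not the model from the paper: the function names `semantic_attention` and `constrained_pick`, the dot-product scoring, and the rule of picking the best candidate caption that mentions a detected label are all assumptions made for illustration. The sketch only shows the general idea of weighting entity-label embeddings by relevance and then preferring captions that mention a recognized concept.

```python
import math


def semantic_attention(query, label_vectors):
    """Toy dot-product attention over entity-label embeddings (hypothetical).

    Returns the softmax weights and the weighted sum of the label vectors.
    """
    scores = [sum(q * v for q, v in zip(query, vec)) for vec in label_vectors]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    context = [
        sum(w * vec[i] for w, vec in zip(weights, label_vectors))
        for i in range(len(query))
    ]
    return weights, context


def constrained_pick(candidates, required_labels):
    """Pick the best-scoring caption that mentions a required label (hypothetical).

    `candidates` is a list of (caption, log_probability) pairs, e.g. the
    final beam of a decoder. If no caption mentions any required label,
    the best unconstrained caption is returned instead.
    """
    satisfying = [
        (score, caption)
        for caption, score in candidates
        if any(label in caption.split() for label in required_labels)
    ]
    pool = satisfying or [(score, caption) for caption, score in candidates]
    return max(pool)[1]


# Hypothetical decoder output: captions with log-probabilities.
beam = [
    ("a dog sitting on a couch", -1.2),
    ("a zebra standing in a field", -1.5),
    ("an animal in a field", -1.0),
]
# Suppose the recognition step detected "zebra", a word absent from paired captions.
caption = constrained_pick(beam, {"zebra"})
```

In this sketch the constraint is applied only after decoding, which keeps the example short. Constrained beam search instead enforces such constraints while the caption is being generated.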



The first author is grateful to KHYS at KIT for their research travel grant and to the Computational Media Lab at ANU for providing access to their K40x GPUs.



Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  1. Institute of Applied Informatics and Formal Description Methods (AIFB), Karlsruhe Institute of Technology (KIT), Karlsruhe, Germany
  2. Computational Media Lab, Australian National University (ANU), Canberra, Australia
