A Two-Step Retrieval Method for Image Captioning

  • Luis Pellegrin
  • Jorge A. Vanegas
  • John Arevalo
  • Viviana Beltrán
  • Hugo Jair Escalante
  • Manuel Montes-y-Gómez
  • Fabio A. González
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9822)

Abstract

Image captioning is the task of assigning phrases to images that describe their visual content. Two main approaches are commonly used. On the one hand, traditional approaches transfer captions from the images most similar to the query image. On the other hand, recent methods generate captions with sentence-generation systems that learn a joint distribution of captions and images from a training set. The main limitation is that both approaches require a large number of manually captioned images. This paper presents an unsupervised approach for image captioning based on a two-step image-text retrieval process. First, given a query image, visually related words are retrieved from a multimodal index. The multimodal index is built from a large dataset of web pages containing images: a vocabulary of words is extracted from the web pages, and for each word a feature model is learned from the visual representation of its associated images, so that query images can be matched to words simply by measuring visual similarity. Second, a textual query is formed with the retrieved words, and candidate captions are retrieved from a reference dataset of sentences. Despite its simplicity, the method removes the need for manually captioned images and instead takes advantage of noisy data derived from the Web, i.e. web pages. The proposed approach has been evaluated on the Generation of Textual Descriptions of Images task at ImageCLEF 2015. Experimental results show the competitiveness of the proposed approach. In addition, we report preliminary results on the use of our method for the auto-illustration task.
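
As a rough illustration of the two-step process described above, the sketch below uses toy random vectors in place of CNN visual features and TF-IDF cosine similarity as a stand-in for the caption-retrieval component; the helper names and data structures (build_prototypes, retrieve_words, retrieve_captions, web_pages) are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch of the two-step retrieval method (not the authors' code).
# Assumptions: toy random vectors stand in for CNN visual features, and TF-IDF
# cosine similarity stands in for the paper's caption-retrieval component.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def build_prototypes(web_pages):
    """Indexing: average the visual features of all images co-occurring with a
    word across web pages, yielding one visual prototype per vocabulary word."""
    sums, counts = {}, {}
    for words, image_feature in web_pages:  # image_feature: 1-D feature vector
        for w in words:
            sums[w] = sums.get(w, 0) + image_feature
            counts[w] = counts.get(w, 0) + 1
    return {w: sums[w] / counts[w] for w in sums}


def retrieve_words(query_feature, prototypes, k=5):
    """Step 1: rank vocabulary words by visual similarity to the query image."""
    scored = [(w, cosine_similarity(query_feature[None, :], p[None, :])[0, 0])
              for w, p in prototypes.items()]
    return [w for w, _ in sorted(scored, key=lambda x: -x[1])[:k]]


def retrieve_captions(words, reference_sentences, k=3):
    """Step 2: use the retrieved words as a textual query against a reference
    set of sentences and return the top-ranked candidate captions."""
    vectorizer = TfidfVectorizer()
    sentence_matrix = vectorizer.fit_transform(reference_sentences)
    query_vector = vectorizer.transform([" ".join(words)])
    scores = cosine_similarity(query_vector, sentence_matrix).ravel()
    return [reference_sentences[i] for i in scores.argsort()[::-1][:k]]


# Toy usage: each web page contributes a word set and one image feature vector.
rng = np.random.default_rng(0)
web_pages = [({"dog", "grass"}, rng.random(16)),
             ({"dog", "ball"}, rng.random(16)),
             ({"beach", "sea"}, rng.random(16))]
prototypes = build_prototypes(web_pages)
query_image_feature = rng.random(16)
words = retrieve_words(query_image_feature, prototypes, k=2)
captions = retrieve_captions(words, ["a dog plays with a ball on the grass",
                                     "people walk along the beach at sunset"],
                             k=1)
print(words, captions)
```

In the paper's setting the prototypes would be built from CNN features of web-page images and the reference sentences would come from a large caption corpus; the snippet only shows the control flow of the two retrieval steps.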

Keywords

Image captioning · Auto-illustration · Image retrieval · Multimodal indexing · Visual prototypes

Notes

Acknowledgments

This work was supported by CONACyT under project grant CB-2014-241306 (Clasificación y recuperación de imágenes mediante técnicas de minería de textos). The first author was supported by CONACyT under scholarship No. 214764.

Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  • Luis Pellegrin (1)
  • Jorge A. Vanegas (2)
  • John Arevalo (2)
  • Viviana Beltrán (2)
  • Hugo Jair Escalante (1)
  • Manuel Montes-y-Gómez (1)
  • Fabio A. González (2)
  1. Instituto Nacional de Astrofísica, Óptica y Electrónica (INAOE), Tonantzintla, Mexico
  2. MindLab Research Group, Universidad Nacional de Colombia (UNAL), Bogotá, Colombia