Every Picture Tells a Story: Generating Sentences from Images

  • Ali Farhadi
  • Mohsen Hejrati
  • Mohammad Amin Sadeghi
  • Peter Young
  • Cyrus Rashtchian
  • Julia Hockenmaier
  • David Forsyth
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6314)


Humans can prepare concise descriptions of pictures, focusing on what they find important. We demonstrate that automatic methods can do so too. We describe a system that can compute a score linking an image to a sentence. This score can be used to attach a descriptive sentence to a given image, or to obtain images that illustrate a given sentence. The score is obtained by comparing an estimate of meaning obtained from the image to one obtained from the sentence. Each estimate of meaning comes from a discriminative procedure that is learned using data. We evaluate on a novel dataset consisting of human-annotated images. While our underlying estimate of meaning is impoverished, it is sufficient to produce very good quantitative results, evaluated with a novel score that can account for synecdoche.
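The scoring idea above can be sketched in a few lines. This is a minimal toy, not the authors' implementation: it assumes meaning is represented as an ⟨object, action, scene⟩ triple predicted separately from the image and from the sentence, and it replaces the learned discriminative predictors and the WordNet-based word similarity with hand-filled stand-ins (`sim_table`, the fixed triples below are all hypothetical).

```python
def slot_similarity(a, b, sim_table):
    """Similarity between two slot words: 1.0 on exact match, otherwise a
    looked-up score (a stand-in for a learned or WordNet-style similarity)."""
    if a == b:
        return 1.0
    return sim_table.get((a, b), sim_table.get((b, a), 0.0))

def triple_score(image_triple, sentence_triple, sim_table):
    """Score linking an image to a sentence: compare the two meaning
    estimates slot by slot and sum the similarities."""
    return sum(slot_similarity(a, b, sim_table)
               for a, b in zip(image_triple, sentence_triple))

# Hypothetical predicted meanings; in the paper these come from
# discriminative models learned on image features and parsed sentences.
image_meaning = ("horse", "ride", "field")
sentences = {
    "A person rides a horse in a field.": ("horse", "ride", "field"),
    "A dog sleeps on a sofa.": ("dog", "sleep", "room"),
}
sim = {("field", "meadow"): 0.8}  # illustrative partial-match entry

# Annotation direction: pick the sentence whose meaning best matches the image.
best = max(sentences, key=lambda s: triple_score(image_meaning, sentences[s], sim))
```

The same score runs in the other direction for illustration: given a sentence, rank images by `triple_score` against their predicted triples.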


Keywords: Machine Translation · Head Noun · Node Feature · Generate Sentence · Edge Potential
(These keywords were added by machine, not by the authors; the process is experimental and the keywords may change as the learning algorithm improves.)





Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Ali Farhadi (1)
  • Mohsen Hejrati (2)
  • Mohammad Amin Sadeghi (2)
  • Peter Young (1)
  • Cyrus Rashtchian (1)
  • Julia Hockenmaier (1)
  • David Forsyth (1)
  1. Computer Science Department, University of Illinois at Urbana-Champaign
  2. Computer Vision Group, School of Mathematics, Institute for Studies in Theoretical Physics and Mathematics (IPM)
