Translating Images to Words for Recognizing Objects in Large Image and Video Collections

  • Pınar Duygulu
  • Muhammet Baştan
  • David Forsyth
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4170)


We present a new approach to the object recognition problem, motivated by the recent availability of large annotated image and video collections. This approach treats object recognition as the translation of visual elements to words, analogous to the translation of text from one language to another. The visual elements, represented in feature space, are quantized into a finite set of blobs. The correspondences between blobs and words are learned using a method adapted from Statistical Machine Translation. Once learned, these correspondences can be used to predict words for particular image regions (region naming), to predict words for entire images (auto-annotation), or to associate speech transcript text with the correct video frames (video alignment). We present results on the Corel data set, which consists of annotated images, and on the TRECVID 2004 data set, which consists of video frames associated with speech transcript text and manual annotations.
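The blob-to-word learning step the abstract describes can be sketched with IBM Model 1-style EM, the basic alignment model from the Statistical Machine Translation literature the paper builds on. The toy data, blob ids, and function names below are illustrative assumptions, not the authors' code or features:

```python
# Minimal sketch of learning a blob-to-word lexicon t(word | blob) with
# IBM Model 1-style EM. Each training pair is (blobs, words) for one
# annotated image; blob ids stand in for vector-quantized region features.
from collections import defaultdict

def train_translation_table(pairs, iterations=20):
    """Estimate t(word | blob) by EM over (blobs, words) pairs."""
    word_vocab = {w for _, ws in pairs for w in ws}
    blob_vocab = {b for bs, _ in pairs for b in bs}
    # Uniform initialization of t(w | b).
    t = {b: {w: 1.0 / len(word_vocab) for w in word_vocab} for b in blob_vocab}
    for _ in range(iterations):
        counts = defaultdict(lambda: defaultdict(float))
        totals = defaultdict(float)
        for blobs, words in pairs:
            for w in words:
                # E-step: distribute each word over the blobs that may emit it.
                z = sum(t[b][w] for b in blobs)
                for b in blobs:
                    frac = t[b][w] / z
                    counts[b][w] += frac
                    totals[b] += frac
        # M-step: renormalize expected counts into probabilities.
        for b in counts:
            for w in counts[b]:
                t[b][w] = counts[b][w] / totals[b]
    return t

def name_region(t, blob):
    """Region naming: the most probable word for a single blob."""
    return max(t[blob], key=t[blob].get)

# Toy annotated collection (illustrative blob ids and keywords).
data = [
    (["b_sky", "b_grass"], ["sky", "grass"]),
    (["b_sky", "b_plane"], ["sky", "plane"]),
    (["b_grass", "b_tiger"], ["grass", "tiger"]),
]
t = train_translation_table(data)
print(name_region(t, "b_plane"), name_region(t, "b_tiger"))
```

Because "sky" is already explained by the blob it always co-occurs with, EM pushes the ambiguous mass toward "plane" for the plane blob; auto-annotation then follows by pooling t(word | blob) over all blobs in an image. The paper's actual models and features differ; this only illustrates the alignment idea.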


Keywords: Machine Translation · Automatic Speech Recognition · News Video · Statistical Machine Translation · Correspondence Problem




Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Pınar Duygulu (1)
  • Muhammet Baştan (1)
  • David Forsyth (2)
  1. Department of Computer Engineering, Bilkent University, Ankara, Turkey
  2. University of Illinois, Urbana, USA
