Combining Textual and Visual Information for Semantic Labeling of Images and Videos

  • Pinar Duygulu
  • Muhammet Baştan
  • Derya Ozkan
Part of the Cognitive Technologies book series (COGTECH)


Semantic labeling of large volumes of image and video archives is difficult, if not impossible, with traditional supervised methods because of the huge amount of human effort required for manual labeling. Recently, semi-supervised techniques that make use of annotated image and video collections have been proposed as an alternative to reduce this effort. In this direction, different techniques, mostly adapted from the information retrieval literature, are applied to learn the unknown one-to-one associations between visual structures and semantic descriptions. Once these links are learned, the range of application areas is wide, including improved retrieval and automatic annotation of images and videos, labeling of image regions as a way of large-scale object recognition, and association of names with faces as a way of large-scale face recognition. In this chapter, after reviewing and discussing a variety of related studies, we present two methods in detail: the so-called “translation approach,” which translates visual structures into semantic descriptors using the ideas of statistical machine translation, and a graph-based approach, which finds the densest component of a graph, corresponding to the largest group of similar visual structures associated with a semantic description.
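The second method rests on finding the densest component of a similarity graph. As a minimal illustrative sketch (not the chapter's actual implementation), the greedy peeling heuristic for densest subgraph repeatedly removes the minimum-degree vertex and keeps the intermediate subgraph with the highest edge-to-vertex ratio; the graph and vertex names below are hypothetical.

```python
def densest_subgraph(edges):
    """Greedy peeling sketch: return the vertex set S seen during peeling
    that maximizes the density |E(S)| / |S| of the induced subgraph."""
    # Build adjacency lists for an undirected, unweighted similarity graph.
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    vertices = set(adj)
    num_edges = len(edges)
    best_density, best_set = 0.0, set(vertices)

    while vertices:
        density = num_edges / len(vertices)
        if density > best_density:
            best_density, best_set = density, set(vertices)
        # Peel the vertex of minimum degree and drop its incident edges.
        u = min(vertices, key=lambda x: len(adj[x]))
        for v in adj[u]:
            adj[v].discard(u)
        num_edges -= len(adj[u])
        vertices.remove(u)
        del adj[u]

    return best_set
```

In a labeling setting, the vertices would be visual structures (e.g., face detections) co-occurring with a semantic description (e.g., a name), edges would link sufficiently similar structures, and the densest component would be taken as the group of instances truly depicting that description.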


Keywords: Machine Translation · Automatic Speech Recognition · Mean Average Precision · Image Annotation · News Video





Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Pinar Duygulu¹
  • Muhammet Baştan¹
  • Derya Ozkan¹

  1. Bilkent University, Turkey
