
Combining Textual and Visual Information for Semantic Labeling of Images and Videos


Part of the book series: Cognitive Technologies ((COGTECH))

Abstract

Semantic labeling of large volumes of image and video archives is difficult, if not impossible, with traditional supervised methods, because of the enormous human effort that manual labeling requires. Recently, semi-supervised techniques that exploit annotated image and video collections have been proposed as an alternative that reduces this effort. In this direction, various techniques, mostly adapted from the information retrieval literature, are applied to learn the unknown one-to-one associations between visual structures and semantic descriptions. Once these links are learned, the range of applications is wide: better retrieval and automatic annotation of images and videos, labeling of image regions as a route to large-scale object recognition, and associating names with faces as a route to large-scale face recognition. In this chapter, after reviewing and discussing a variety of related studies, we present two methods in detail: the so-called “translation approach”, which translates visual structures into semantic descriptors using ideas from statistical machine translation, and a graph-based approach, which finds the densest component of a graph corresponding to the largest group of similar visual structures associated with a semantic description.
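
To make the two ideas concrete, the sketches below illustrate them in Python. Neither is the chapter's implementation; the data structures (lists of quantized region "blob" tokens paired with caption words, and a toy similarity graph) are assumptions made purely for illustration.

The first sketch shows the flavor of the translation approach: IBM Model 1 style EM estimates the probability that a visual blob translates into a word, in the spirit of statistical machine translation.

```python
# Minimal sketch (not the chapter's code) of blob-to-word translation via IBM Model 1 EM.
from collections import defaultdict

def train_translation_table(corpus, n_iters=10):
    """corpus: list of (blobs, words) pairs from annotated images.
    Returns p[word][blob], the probability that a blob translates to a word."""
    blob_vocab = {b for blobs, _ in corpus for b in blobs}
    # Uniform initialization of the translation table.
    p = defaultdict(lambda: defaultdict(lambda: 1.0 / len(blob_vocab)))
    for _ in range(n_iters):
        count = defaultdict(lambda: defaultdict(float))
        total = defaultdict(float)
        for blobs, words in corpus:              # E-step: expected alignment counts
            for w in words:
                norm = sum(p[w][b] for b in blobs)
                for b in blobs:
                    frac = p[w][b] / norm
                    count[w][b] += frac
                    total[b] += frac
        for w in count:                          # M-step: renormalize per blob
            for b in count[w]:
                p[w][b] = count[w][b] / total[b]
    return p

# Hypothetical toy corpus: annotate a new region by the word w maximizing p[w][blob].
corpus = [(["sky_blob", "grass_blob"], ["sky", "grass"]),
          (["sky_blob", "water_blob"], ["sky", "sea"])]
p = train_translation_table(corpus)
print(max(p, key=lambda w: p[w]["sky_blob"]))    # expected: "sky"
```

The second sketch illustrates the densest-component idea, assuming a generic similarity graph over visual structures (for example, faces detected near a name): a Charikar-style greedy peeling repeatedly removes the lowest-degree node and keeps the densest subgraph encountered, whose nodes are taken as the structures truly associated with the semantic description. The node ids and toy graph are hypothetical.

```python
# Minimal sketch (an assumption, not the chapter's code) of greedy densest-subgraph peeling.
def densest_subgraph(nodes, edges):
    """nodes: iterable of ids; edges: set of frozenset({u, v}) similarity links."""
    nodes, edges = set(nodes), set(edges)
    best_nodes, best_density = set(nodes), len(edges) / max(len(nodes), 1)
    while len(nodes) > 1:
        deg = {v: 0 for v in nodes}
        for e in edges:
            for v in e:
                deg[v] += 1
        v_min = min(nodes, key=lambda v: deg[v])   # peel the lowest-degree node
        nodes.remove(v_min)
        edges = {e for e in edges if v_min not in e}
        density = len(edges) / len(nodes)          # edges-per-node density
        if density > best_density:
            best_nodes, best_density = set(nodes), density
    return best_nodes

faces = {"f1", "f2", "f3", "f4"}
sim = {frozenset(pair) for pair in [("f1", "f2"), ("f1", "f3"), ("f2", "f3")]}
print(densest_subgraph(faces, sim))                # the triangle f1, f2, f3
```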




Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Duygulu, P., Baştan, M., Ozkan, D. (2008). Combining Textual and Visual Information for Semantic Labeling of Images and Videos. In: Cord, M., Cunningham, P. (eds) Machine Learning Techniques for Multimedia. Cognitive Technologies. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-75171-7_9


  • DOI: https://doi.org/10.1007/978-3-540-75171-7_9

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-75170-0

  • Online ISBN: 978-3-540-75171-7

  • eBook Packages: Computer Science, Computer Science (R0)
