Combining Textual and Visual Information for Semantic Labeling of Images and Videos

  • Pinar Duygulu
  • Muhammet Baştan
  • Derya Ozkan
Part of the Cognitive Technologies book series (COGTECH)


Semantic labeling of large volumes of image and video archives is difficult, if not impossible, with traditional supervised methods because of the huge amount of human effort required for manual labeling. Recently, semi-supervised techniques that make use of annotated image and video collections have been proposed as an alternative to reduce this effort. In this direction, different techniques, mostly adapted from the information retrieval literature, are applied to learn the unknown one-to-one associations between visual structures and semantic descriptions. Once these links are learned, the range of application areas is wide, including improved retrieval and automatic annotation of images and videos, labeling of image regions as a way of large-scale object recognition, and association of names with faces as a way of large-scale face recognition. In this chapter, after reviewing and discussing a variety of related studies, we present two methods in detail: the so-called “translation approach,” which translates visual structures into semantic descriptors using the ideas of statistical machine translation, and a graph-based approach, which finds the densest component of a graph, corresponding to the largest group of similar visual structures associated with a semantic description.
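The second method rests on finding the densest component of a similarity graph. As a minimal illustrative sketch (not the chapter's actual implementation), the greedy peeling heuristic for densest subgraph repeatedly removes the minimum-degree vertex and keeps the intermediate subgraph with the highest edge-to-vertex ratio; the graph and vertex names below are hypothetical.

```python
def densest_subgraph(edges):
    """Greedy peeling sketch: return the vertex set S seen during peeling
    that maximizes the density |E(S)| / |S| of the induced subgraph."""
    # Build adjacency lists for an undirected, unweighted similarity graph.
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    vertices = set(adj)
    num_edges = len(edges)
    best_density, best_set = 0.0, set(vertices)

    while vertices:
        density = num_edges / len(vertices)
        if density > best_density:
            best_density, best_set = density, set(vertices)
        # Peel the vertex of minimum degree and drop its incident edges.
        u = min(vertices, key=lambda x: len(adj[x]))
        for v in adj[u]:
            adj[v].discard(u)
        num_edges -= len(adj[u])
        vertices.remove(u)
        del adj[u]

    return best_set
```

In a labeling setting, the vertices would be visual structures (e.g., face detections) co-occurring with a semantic description (e.g., a name), edges would link sufficiently similar structures, and the densest component would be taken as the group of instances truly depicting that description.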


Keywords: Machine Translation · Automatic Speech Recognition · Mean Average Precision · Image Annotation · News Video





Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Pinar Duygulu¹
  • Muhammet Baştan¹
  • Derya Ozkan¹

  1. Bilkent University, Turkey
