Combining Textual and Visual Information for Semantic Labeling of Images and Videos

Duygulu, Pinar; Baştan, Muhammet; Ozkan, Derya

doi:10.1007/978-3-540-75171-7_9

Chapter

Part of the book series: Cognitive Technologies (COGTECH)

Abstract

Semantic labeling of large image and video archives is difficult, if not impossible, with traditional methods because of the huge human effort that manual labeling requires in a supervised setting. Recently, semi-supervised techniques that make use of annotated image and video collections have been proposed as an alternative that reduces this effort. In this direction, different techniques, mostly adapted from the information retrieval literature, have been applied to learn the unknown one-to-one associations between visual structures and semantic descriptions. Once these links are learned, the range of applications is wide: better retrieval and automatic annotation of images and videos, labeling of image regions as a form of large-scale object recognition, and association of names with faces as a form of large-scale face recognition. In this chapter, after reviewing and discussing a variety of related studies, we present two methods in detail: the so-called “translation approach”, which translates visual structures to semantic descriptors using ideas from statistical machine translation, and a second approach, which finds the densest component of a graph, corresponding to the largest group of similar visual structures associated with a semantic description.
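The two methods named in the abstract can be illustrated with a minimal Python sketch. The abstract gives no algorithmic details, so everything below is an assumption rather than the chapter's implementation: the first function uses an IBM Model 1 style EM loop as one common way to estimate word-given-region translation probabilities, and the second uses greedy peeling, a well-known approximation for the densest subgraph under the density |E| / |V|. The function names, the discrete "blob" tokens, and the toy data are hypothetical.

```python
from collections import defaultdict


def train_translation_table(pairs, iterations=20):
    """IBM Model 1 style EM: estimate t(word | blob) from (blobs, words) pairs.

    Assumption: image regions were already quantized into discrete blob tokens.
    """
    blobs = {b for bs, _ in pairs for b in bs}
    words = {w for _, ws in pairs for w in ws}
    # Start from a uniform table over the word vocabulary.
    t = {b: {w: 1.0 / len(words) for w in words} for b in blobs}
    for _ in range(iterations):
        count = defaultdict(lambda: defaultdict(float))
        total = defaultdict(float)
        for bs, ws in pairs:
            for w in ws:
                # E-step: share each word among the blobs of the same image.
                norm = sum(t[b][w] for b in bs)
                for b in bs:
                    frac = t[b][w] / norm
                    count[b][w] += frac
                    total[b] += frac
        # M-step: renormalize the expected counts of each blob.
        for b in blobs:
            for w in words:
                t[b][w] = count[b][w] / total[b] if total[b] else 0.0
    return t


def densest_subgraph(edges):
    """Greedy peeling: repeatedly drop a minimum-degree node and keep the
    intermediate node set with the highest density |E| / |V|."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            adj[u].add(v)
            adj[v].add(u)
    nodes = set(adj)
    n_edges = sum(len(nb) for nb in adj.values()) // 2
    best_nodes, best_density = set(), 0.0
    while nodes:
        density = n_edges / len(nodes)
        if density > best_density:
            best_nodes, best_density = set(nodes), density
        # Ties between minimum-degree nodes are broken arbitrarily.
        u = min(nodes, key=lambda x: len(adj[x]))
        for v in adj[u]:
            adj[v].discard(u)
        n_edges -= len(adj[u])
        nodes.remove(u)
        del adj[u]
    return best_nodes, best_density


if __name__ == "__main__":
    # Hypothetical annotated images: blob tokens paired with caption words.
    pairs = [
        (["blob_sky", "blob_sun"], ["sky", "sun"]),
        (["blob_sky", "blob_sea"], ["sky", "sea"]),
        (["blob_grass", "blob_sea"], ["grass", "sea"]),
    ]
    table = train_translation_table(pairs)
    for blob in sorted(table):
        print(blob, "->", max(table[blob], key=table[blob].get))

    # Hypothetical similarity graph over faces that co-occur with one name.
    edges = [("f1", "f2"), ("f1", "f3"), ("f1", "f4"), ("f2", "f3"),
             ("f2", "f4"), ("f3", "f4"), ("f4", "f5"), ("f5", "f6")]
    print(densest_subgraph(edges))
```

In the toy data, co-occurrence alone is ambiguous (every image carries two blobs and two words), and the EM iterations resolve it by letting blobs that appear in several images claim the words they share. In the graph example, the four mutually similar faces form the dense core, while the loosely attached faces are peeled away first.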

Author information

Authors and Affiliations

Bilkent University, Ankara, Turkey
Pinar Duygulu, Muhammet Baştan & Derya Ozkan

Editor information

Editors and Affiliations

UPMC University, CNRS (UMR 7606) Lab. LIP6, 104 Avenue du Président Kennedy, 75016 Paris, France
Matthieu Cord
University College Dublin, School of Computer Science & Informatics, Belfield, Dublin 2, Ireland
Pádraig Cunningham

About this chapter

Cite this chapter

Duygulu, P., Baştan, M., Ozkan, D. (2008). Combining Textual and Visual Information for Semantic Labeling of Images and Videos. In: Cord, M., Cunningham, P. (eds) Machine Learning Techniques for Multimedia. Cognitive Technologies. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-75171-7_9

DOI: https://doi.org/10.1007/978-3-540-75171-7_9
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-75170-0
Online ISBN: 978-3-540-75171-7
eBook Packages: Computer Science, Computer Science (R0)
