Abstract
The semantic gap, the disconnect between low-level multimedia data and high-level semantics, is a major obstacle to building real-world multimedia systems. Recently developed methods that exploit large volumes of loosely labeled data for automatic image annotation are promising approaches to narrowing this gap. In this paper, we investigate how some of these methods can be applied to semantic gap problems that appear in application domains beyond image annotation. Specifically, we introduce new problems that arise in videos, such as linking keyframes with speech transcript text and linking faces with names. We formulate these problems in a common framework as the problem of finding missing correspondences between visual and semantic data, and apply the multimedia translation method. We evaluate the multimedia translation method on these problems and compare its performance against other auto-annotation and classifier-based methods. The experiments, carried out on over 300 hours of news videos from the TRECVid 2004 and TRECVid 2006 corpora, show that the multimedia translation method performs comparably to other auto-annotation methods and outperforms classifier-based methods.
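The multimedia translation approach mentioned in the abstract treats annotation as aligning a "visual vocabulary" with words, in the spirit of statistical machine translation. As a rough illustration only (not the authors' code, and with hypothetical toy data), the following sketch runs an IBM Model 1 style EM loop that learns translation probabilities p(word | visual token) from loosely paired keyframe/text data, then ranks candidate words for a new set of visual tokens:

```python
# Minimal sketch of translation-based annotation (IBM Model 1 style EM).
# Illustrative only: token names and the toy corpus below are hypothetical.
from collections import defaultdict

def train_translation_table(pairs, iterations=10):
    """pairs: list of (visual_tokens, words), one pair per keyframe/caption."""
    vocab_w = {w for _, ws in pairs for w in ws}
    # Uniform initialization of t(w | v).
    t = defaultdict(lambda: 1.0 / len(vocab_w))
    for _ in range(iterations):
        count = defaultdict(float)   # expected (word, visual token) counts
        total = defaultdict(float)   # normalizer per visual token
        for vs, ws in pairs:
            for w in ws:
                z = sum(t[(w, v)] for v in vs)   # alignment normalizer
                for v in vs:
                    c = t[(w, v)] / z            # soft alignment weight
                    count[(w, v)] += c
                    total[v] += c
        for (w, v), c in count.items():          # M-step: renormalize
            t[(w, v)] = c / total[v]
    return t

def annotate(t, visual_tokens, candidate_words, top_k=2):
    """Rank candidate words by summed translation probability."""
    scores = {w: sum(t[(w, v)] for v in visual_tokens)
              for w in candidate_words}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Toy corpus: visual tokens (e.g., quantized region features) paired with words.
corpus = [
    (["blob_sky", "blob_grass"], ["sky", "grass"]),
    (["blob_sky", "blob_water"], ["sky", "water"]),
    (["blob_grass", "blob_water"], ["grass", "water"]),
]
table = train_translation_table(corpus)
best = annotate(table, ["blob_sky"], ["sky", "grass", "water"], top_k=1)
```

The same machinery applies to the video problems the paper introduces: the "visual tokens" can be keyframe region features paired with transcript words, or detected faces paired with names. In practice, such alignments are trained with tools like GIZA++ (cited in the reference list) rather than a hand-rolled loop.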
Duygulu, P., Baştan, M. Multimedia translation for linking visual data to semantics in videos. Machine Vision and Applications 22, 99–115 (2011). https://doi.org/10.1007/s00138-009-0217-8