
Multimedia translation for linking visual data to semantics in videos

  • Original Paper
  • Published in: Machine Vision and Applications

Abstract

The semantic gap, i.e., the disconnect between low-level multimedia data and high-level semantics, is a major obstacle to building real-world multimedia systems. Recently developed methods that exploit large volumes of loosely labeled data for automatic image annotation are promising approaches toward closing this gap. In this paper, we investigate how some of these methods can be applied to semantic gap problems that arise in application domains beyond image annotation. Specifically, we introduce new problems that appear in videos, such as linking keyframes with speech transcript text and linking faces with names. We formulate these problems in a common framework as the problem of finding missing correspondences between visual and semantic data, and we apply the multimedia translation method. We evaluate the multimedia translation method on these problems and compare it against other auto-annotation and classifier-based methods. Experiments carried out on over 300 h of news videos from the TRECVid 2004 and TRECVid 2006 corpora show that multimedia translation performs comparably to the other auto-annotation methods and outperforms the classifier-based methods.
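To make the correspondence formulation concrete, the sketch below illustrates the statistical translation idea that the multimedia translation method builds on: an IBM Model 1-style EM procedure that estimates a translation table p(word | visual token) from loosely paired units, such as keyframe regions paired with speech transcript words, or detected faces paired with names. This is a minimal illustration under stated assumptions, not the paper's exact implementation; all identifiers and the toy data are hypothetical.

from collections import defaultdict

def train_translation_table(pairs, iterations=10):
    """EM estimation of p(word | visual token) from loosely paired units.

    pairs: list of (visual_tokens, words) tuples, one per keyframe/transcript
    unit (or per face/name co-occurrence window).
    """
    visual_vocab = {v for vs, _ in pairs for v in vs}
    word_vocab = {w for _, ws in pairs for w in ws}
    # Start from a uniform translation table p(w | v).
    t = {v: {w: 1.0 / len(word_vocab) for w in word_vocab} for v in visual_vocab}
    for _ in range(iterations):
        count = defaultdict(lambda: defaultdict(float))  # expected counts c(w, v)
        total = defaultdict(float)                       # per-token normalizers
        for vs, ws in pairs:
            for w in ws:
                # E-step: soft-assign each word to the co-occurring visual tokens.
                z = sum(t[v][w] for v in vs)
                for v in vs:
                    c = t[v][w] / z
                    count[v][w] += c
                    total[v] += c
        # M-step: renormalize expected counts into probabilities.
        for v in visual_vocab:
            for w in word_vocab:
                if total[v] > 0:
                    t[v][w] = count[v][w] / total[v]
    return t

# Toy usage with hypothetical tokens: two news keyframes and transcript words.
pairs = [(["region_sky", "region_face"], ["weather", "anchor"]),
         (["region_face"], ["anchor"])]
table = train_translation_table(pairs)
best = max(table["region_face"], key=table["region_face"].get)
print(best)  # "anchor": the face token aligns with the name-like word

Annotation then amounts to translating the visual tokens of an unseen keyframe into their highest-probability words; the same table, read in the other direction, supports retrieval of visual data by text.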

Author information

Correspondence to Pınar Duygulu.

About this article

Cite this article

Duygulu, P., Baştan, M. Multimedia translation for linking visual data to semantics in videos. Machine Vision and Applications 22, 99–115 (2011). https://doi.org/10.1007/s00138-009-0217-8
