Machine Vision and Applications, Volume 22, Issue 1, pp 99–115

Multimedia translation for linking visual data to semantics in videos

  • Pınar Duygulu
  • Muhammet Baştan
Original Paper


The semantic gap problem, the disconnect between low-level multimedia data and high-level semantics, is an important obstacle to building real-world multimedia systems. Recently developed methods that use large volumes of loosely labeled data for automatic image annotation are promising approaches to this problem. In this paper, we investigate how some of these methods can be applied to semantic gap problems in application domains beyond image annotation. Specifically, we introduce new problems that appear in videos, such as linking keyframes with speech transcript text and linking faces with names. In a common framework, we formulate these problems as finding missing correspondences between visual and semantic data, and we apply the multimedia translation method. We evaluate the multimedia translation method on these problems and compare it against other auto-annotation and classifier-based methods. Experiments on over 300 h of news videos from the TRECVid 2004 and TRECVid 2006 corpora show that the multimedia translation method performs comparably to the other auto-annotation methods and outperforms the classifier-based methods.
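As a rough illustration of what "finding missing correspondences between visual and semantic data" can look like, the sketch below runs a few rounds of expectation-maximization over loosely labeled pairs, in the spirit of a word-alignment model from statistical machine translation. The abstract does not specify the exact model, so the formulation, the function names (`train_translation_table`, `best_word`), and the toy tokens are all assumptions made for illustration only.

```python
from collections import defaultdict


def train_translation_table(pairs, iterations=20):
    """Estimate t[visual_token][word] from loosely labeled pairs.

    pairs: list of (visual_tokens, words); the tokens of one keyframe are
    only known to co-occur with its words, not which token matches which word.
    This is an illustrative alignment-style EM, not the paper's exact method.
    """
    visual_vocab = {v for vis, _ in pairs for v in vis}
    word_vocab = {w for _, words in pairs for w in words}

    # Start from a uniform table: every word is equally likely for every token.
    t = {v: {w: 1.0 / len(word_vocab) for w in word_vocab} for v in visual_vocab}

    for _ in range(iterations):
        counts = defaultdict(lambda: defaultdict(float))
        totals = defaultdict(float)

        # E-step: share each word among the visual tokens it co-occurs with,
        # in proportion to the current table.
        for vis, words in pairs:
            if not vis:
                continue  # nothing to align against
            for w in words:
                norm = sum(t[v][w] for v in vis)
                for v in vis:
                    frac = t[v][w] / norm
                    counts[v][w] += frac
                    totals[v] += frac

        # M-step: renormalize the expected counts per visual token.
        for v in visual_vocab:
            for w in word_vocab:
                t[v][w] = counts[v][w] / totals[v] if totals[v] > 0 else 0.0
    return t


def best_word(t, visual_token):
    """Return the most probable word for a visual token."""
    return max(t[visual_token], key=t[visual_token].get)


# Hypothetical toy data: quantized region labels paired with annotation words.
pairs = [
    (["blob_sky", "blob_grass"], ["sky", "grass"]),
    (["blob_sky", "blob_water"], ["sky", "water"]),
    (["blob_grass"], ["grass"]),
]
table = train_translation_table(pairs)
for token in sorted(table):
    print(token, "->", best_word(table, token))
```

The third pair contains a single token and a single word, which anchors that correspondence and lets the remaining ambiguity in the other pairs resolve over the iterations. The same loop applies unchanged if the tokens are keyframe clusters and the words come from a speech transcript, or if the tokens are face clusters and the words are names.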


Keywords: Machine Translation, Automatic Speech Recognition, Visual Data, Image Annotation, Visual Content





Copyright information

© Springer-Verlag 2009

Authors and Affiliations

  1. Department of Computer Engineering, Bilkent University, Ankara, Turkey
