Abstract
The semantic gap, the disconnect between low-level multimedia data and high-level semantics, is a major obstacle to building real-world multimedia systems. Recently developed methods that exploit large volumes of loosely labeled data for automatic image annotation are promising approaches to narrowing this gap. In this paper, we investigate how some of these methods can be applied to semantic gap problems that appear in application domains beyond image annotation. Specifically, we introduce new problems that arise in videos, such as linking keyframes with speech transcript text and linking faces with names. We formulate these problems in a common framework as the problem of finding missing correspondences between visual and semantic data, and apply the multimedia translation method. We evaluate the multimedia translation method on these problems and compare its performance against other auto-annotation and classifier-based methods. The experiments, carried out on over 300 hours of news videos from the TRECVid 2004 and TRECVid 2006 corpora, show that the multimedia translation method performs comparably to other auto-annotation methods and outperforms classifier-based methods.
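The multimedia translation approach mentioned in the abstract treats annotation as aligning a "visual vocabulary" with words, in the spirit of statistical machine translation. As a rough illustration only (not the authors' code, and with hypothetical toy data), the following sketch runs an IBM Model 1 style EM loop that learns translation probabilities p(word | visual token) from loosely paired keyframe/text data, then ranks candidate words for a new set of visual tokens:

```python
# Minimal sketch of translation-based annotation (IBM Model 1 style EM).
# Illustrative only: token names and the toy corpus below are hypothetical.
from collections import defaultdict

def train_translation_table(pairs, iterations=10):
    """pairs: list of (visual_tokens, words), one pair per keyframe/caption."""
    vocab_w = {w for _, ws in pairs for w in ws}
    # Uniform initialization of t(w | v).
    t = defaultdict(lambda: 1.0 / len(vocab_w))
    for _ in range(iterations):
        count = defaultdict(float)   # expected (word, visual token) counts
        total = defaultdict(float)   # normalizer per visual token
        for vs, ws in pairs:
            for w in ws:
                z = sum(t[(w, v)] for v in vs)   # alignment normalizer
                for v in vs:
                    c = t[(w, v)] / z            # soft alignment weight
                    count[(w, v)] += c
                    total[v] += c
        for (w, v), c in count.items():          # M-step: renormalize
            t[(w, v)] = c / total[v]
    return t

def annotate(t, visual_tokens, candidate_words, top_k=2):
    """Rank candidate words by summed translation probability."""
    scores = {w: sum(t[(w, v)] for v in visual_tokens)
              for w in candidate_words}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Toy corpus: visual tokens (e.g., quantized region features) paired with words.
corpus = [
    (["blob_sky", "blob_grass"], ["sky", "grass"]),
    (["blob_sky", "blob_water"], ["sky", "water"]),
    (["blob_grass", "blob_water"], ["grass", "water"]),
]
table = train_translation_table(corpus)
best = annotate(table, ["blob_sky"], ["sky", "grass", "water"], top_k=1)
```

The same machinery applies to the video problems the paper introduces: the "visual tokens" can be keyframe region features paired with transcript words, or detected faces paired with names. In practice, such alignments are trained with tools like GIZA++ (cited in the reference list) rather than a hand-rolled loop.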
Duygulu, P., Baştan, M. Multimedia translation for linking visual data to semantics in videos. Machine Vision and Applications 22, 99–115 (2011). https://doi.org/10.1007/s00138-009-0217-8