What’s News, What’s Not? Associating News Videos with Words

  • Pınar Duygulu
  • Alexander Hauptmann
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3115)


Text retrieval from broadcast news video is unsatisfactory because a transcript word frequently does not directly ‘describe’ the shot during which it was spoken. Extending the retrieved region to a window around the matching keyword improves recall but yields low precision. We improve on text retrieval with the following approach. First, we segment the visual stream into coherent story-like units using a set of visual news story delimiters. After filtering out clearly irrelevant classes of shots, we are still left with ambiguity about how words in the transcript relate to the visual content of the remaining shots in the story. Using a limited set of visual features at different semantic levels, ranging from color histograms to faces, cars, and outdoors, an association matrix captures the correlation of these visual features with specific transcript words. This matrix is then refined using an EM approach. Preliminary results show that this approach has the potential to significantly improve retrieval performance from text queries.
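
The association step can be illustrated with a minimal sketch. It assumes each story is reduced to a bag of discrete visual tokens and a bag of transcript words, and it refines a co-occurrence table with an EM loop in the spirit of IBM Model 1 from statistical machine translation. The abstract does not specify the exact formulation, so the token names, the toy data, and the function names below are hypothetical and not taken from the paper.

```python
from collections import defaultdict


def normalize(counts):
    """Turn per-token count rows into probability rows p(word | visual_token)."""
    table = {}
    for token, row in counts.items():
        total = sum(row.values())
        table[token] = {word: c / total for word, c in row.items()}
    return table


def cooccurrence_init(stories):
    """Initial association table from raw story-level co-occurrence counts."""
    counts = defaultdict(lambda: defaultdict(float))
    for tokens, words in stories:
        for token in tokens:
            for word in words:
                counts[token][word] += 1.0
    return normalize(counts)


def em_refine(stories, table, iterations=10):
    """Refine the table with EM (IBM Model 1 style; an assumed formulation)."""
    for _ in range(iterations):
        expected = defaultdict(lambda: defaultdict(float))
        for tokens, words in stories:
            for word in words:
                # E-step: share one unit of credit for this word among the
                # story's visual tokens, in proportion to current estimates.
                denom = sum(table.get(t, {}).get(word, 0.0) for t in tokens)
                if denom == 0.0:
                    continue
                for t in tokens:
                    expected[t][word] += table.get(t, {}).get(word, 0.0) / denom
        # M-step: renormalize the expected counts per visual token.
        table = normalize(expected)
    return table


# Hypothetical toy data: (visual tokens in a story, transcript words).
stories = [
    (["face", "outdoor"], ["president", "garden"]),
    (["face", "car"], ["president", "traffic"]),
]
table = em_refine(stories, cooccurrence_init(stories))
best_word_for_face = max(table["face"], key=table["face"].get)
```

In this toy setting, "face" co-occurs with "president" in both stories, so EM shifts probability mass toward that pair, while the words seen only once are increasingly attributed to "outdoor" and "car".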


Visual Feature, Machine Translation, News Story, News Video, Statistical Machine Translation
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.




Copyright information

© Springer-Verlag Berlin Heidelberg 2004

Authors and Affiliations

  • Pınar Duygulu (1)
  • Alexander Hauptmann (2)
  1. Department of Computer Engineering, Bilkent University, Ankara, Turkey
  2. Informedia Project, Carnegie Mellon University, Pittsburgh, USA
