Modal Keywords, Ontologies, and Reasoning for Video Understanding

  • Alejandro Jaimes
  • Belle L. Tseng
  • John R. Smith
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2728)


We propose a novel framework for video content understanding that uses rules constructed from knowledge bases and multimedia ontologies. The framework consists of an expert system built on a rule-based engine, domain knowledge, visual detectors (for objects and scenes), and metadata (text from automatic speech recognition, related text, etc.). We introduce the idea of modal keywords, keywords that represent perceptual concepts in five categories: visual (e.g., sky), aural (e.g., scream), olfactory (e.g., vanilla), tactile (e.g., feather), and taste (e.g., candy). A method is presented to automatically classify keywords from speech recognition, queries, or related text into these categories using WordNet and TGM I. For video understanding, the following operations are performed automatically: scene cut detection, automatic speech recognition, feature extraction, and visual detection (e.g., sky, face, indoor). The results of these operations are fed to a rule-based engine that uses context information (e.g., text from speech) to enhance the visual detection results. We discuss semi-automatic construction of multimedia ontologies and present experiments in which visual detector outputs are modified by simple rules that use context information available with the video.
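The modal-keyword classification can be pictured as walking up a lexical hypernym ("is-a") hierarchy until a perceptual-category root is reached. The sketch below illustrates that idea; the tiny hand-coded hypernym graph and category roots are invented stand-ins for the WordNet and TGM I resources the paper actually uses, so the specific entries are assumptions, not the authors' data.

```python
from collections import deque

# Toy hypernym graph (word -> more general terms). This is an
# illustrative stand-in for WordNet's hypernym hierarchy; every
# entry here is a made-up example, not WordNet data.
HYPERNYMS = {
    "sky": ["scene_element"],
    "scene_element": ["visual_phenomenon"],
    "scream": ["vocalization"],
    "vocalization": ["sound"],
    "vanilla": ["fragrance", "flavoring"],  # ambiguous: smell and taste
    "fragrance": ["odor"],
    "flavoring": ["taste_sensation"],
    "feather": ["soft_surface"],
    "soft_surface": ["touch_sensation"],
    "candy": ["sweet"],
    "sweet": ["taste_sensation"],
}

# Hypothetical roots that mark each perceptual category.
CATEGORY_ROOTS = {
    "visual_phenomenon": "visual",
    "sound": "aural",
    "odor": "olfactory",
    "touch_sensation": "tactile",
    "taste_sensation": "taste",
}

def modal_categories(word):
    """Breadth-first walk up the hypernym graph, collecting every
    perceptual category whose root is reachable from the word."""
    seen = {word}
    queue = deque([word])
    categories = set()
    while queue:
        term = queue.popleft()
        if term in CATEGORY_ROOTS:
            categories.add(CATEGORY_ROOTS[term])
        for parent in HYPERNYMS.get(term, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return categories
```

Note that a word can land in more than one category (e.g., "vanilla" reaches both an olfactory and a taste root), which matches the paper's observation that modal keywords need not be mutually exclusive.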


Keywords: Confidence Score · Automatic Speech Recognition · Video Shot · Related Text · Reasoning Engine
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2003

Authors and Affiliations

  • Alejandro Jaimes (1)
  • Belle L. Tseng (1)
  • John R. Smith (1)
  1. Pervasive Media Management, IBM T.J. Watson Research Center, Hawthorne, USA
