A Self-Referential Perceptual Inference Framework for Video Interpretation

  • Christopher Town
  • David Sinclair
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2626)


This paper presents an extensible architectural model for general content-based analysis and indexing of video data, which can be customised for a given problem domain. Video interpretation is approached as a joint inference problem that can be solved using modern machine learning and probabilistic inference techniques. An important aspect of the work is a novel active knowledge representation methodology based on an ontological query language. This representation allows the problem of video analysis to be posed as queries expressed in a visual language that incorporates prior hierarchical knowledge of the syntactic and semantic structure of the entities, relationships, and events of interest in a video sequence. Perceptual inference then takes place within an ontological domain defined by the structure of the problem and the current goal set.


Keywords: Video Analysis, Perceptual Inference, Content Base Image Retrieval, Ontological Language, Symbol Grounding Problem
These keywords were added by machine and not by the authors.





Copyright information

© Springer-Verlag Berlin Heidelberg 2003

Authors and Affiliations

  • Christopher Town (1)
  • David Sinclair (2)
  1. University of Cambridge Computer Laboratory, Cambridge, UK
  2. Waimara Ltd, Cambridge, UK
