
Multi-sensory and Multi-modal Fusion for Sentient Computing


This paper presents an approach to multi-sensory and multi-modal fusion in which computer vision information obtained from calibrated cameras is integrated with a large-scale sentient computing system known as “SPIRIT”. The SPIRIT system employs an ultrasonic location infrastructure to track people and devices in an office building and to model their state. The vision techniques comprise modules for background and object appearance modelling, face detection, segmentation, and tracking. Integration is achieved at the system level through the metaphor of shared perceptions: the different modalities are guided by, and provide updates to, a shared world model. This model incorporates aspects of both the static environment (e.g. positions of office walls and doors) and the dynamic environment (e.g. location and appearance of devices and people).
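The idea of a shared world model that each modality both reads from and writes to can be illustrated with a small data structure. The following is only a sketch under stated assumptions, not the SPIRIT implementation: the class names `Wall`, `TrackedEntity` and `WorldModel`, the methods `predict_position` and `update`, and all coordinates are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Wall:
    """Static element: a wall segment in floor-plan coordinates (metres)."""
    start: tuple[float, float]
    end: tuple[float, float]


@dataclass
class TrackedEntity:
    """Dynamic element: a person or device with a position estimate."""
    name: str
    position: tuple[float, float]
    last_source: str = "unknown"


@dataclass
class WorldModel:
    """Hypothetical shared world model read and updated by every modality."""
    walls: list[Wall] = field(default_factory=list)
    entities: dict[str, TrackedEntity] = field(default_factory=dict)

    def predict_position(self, name):
        """Let a modality read the current estimate to guide its own search."""
        entity = self.entities.get(name)
        return entity.position if entity else None

    def update(self, name, position, source):
        """Let a modality write back a new observation."""
        self.entities[name] = TrackedEntity(name, position, source)


world = WorldModel(walls=[Wall((0.0, 0.0), (5.0, 0.0))])

# The ultrasonic location system reports a coarse position (made-up values).
world.update("alice", (1.0, 2.0), source="ultrasound")

# A vision module starts its search from that estimate...
hint = world.predict_position("alice")

# ...and writes its refined position back to the same model.
world.update("alice", (hint[0] + 0.1, hint[1]), source="vision")
```

In this sketch the static part (walls) is set once, while the dynamic part (entities) is overwritten by whichever modality observed the entity most recently. A probabilistic treatment would weight the sources rather than overwrite them.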

Fusion and inference are performed by Bayesian networks that model the probabilistic dependencies and reliabilities of different sources of information over time. It is shown that the fusion process significantly enhances the capabilities and robustness of both sensory modalities, thus enabling the system to maintain a richer and more accurate world model.
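To make the notion of probabilistic fusion with per-source reliabilities concrete, here is a minimal sketch of a discrete Bayes filter that combines two sensors over time. It is an illustration only and not the Bayesian networks used in the paper: the room names, the reliability values, the `stay_prob` motion model and the assumption that the sensors are conditionally independent given the true state are all assumptions made for this example.

```python
def normalise(dist):
    """Scale a dict of non-negative weights so that they sum to one."""
    total = sum(dist.values())
    return {state: p / total for state, p in dist.items()}


def predict(belief, stay_prob):
    """Time update: stay in place with stay_prob, else move uniformly."""
    move = (1.0 - stay_prob) / (len(belief) - 1)
    return {
        s: stay_prob * belief[s] + move * (1.0 - belief[s])
        for s in belief
    }


def likelihood(reading, state, reliability, n_states):
    """P(reading | state) for a sensor that is correct with prob. reliability."""
    if reading == state:
        return reliability
    return (1.0 - reliability) / (n_states - 1)


def fuse(belief, readings, reliabilities):
    """Measurement update, assuming conditionally independent sensors."""
    posterior = {}
    for state, prior in belief.items():
        p = prior
        for sensor, reading in readings.items():
            p *= likelihood(reading, state, reliabilities[sensor], len(belief))
        posterior[state] = p
    return normalise(posterior)


rooms = ["office", "corridor", "kitchen"]
belief = {room: 1.0 / len(rooms) for room in rooms}  # uniform prior
reliabilities = {"ultrasound": 0.9, "vision": 0.7}   # assumed values

observations = [
    {"ultrasound": "office", "vision": "office"},
    {"ultrasound": "office", "vision": "corridor"},  # the sensors disagree
    {"vision": "office"},                            # ultrasound reading missing
]
for readings in observations:
    belief = fuse(predict(belief, stay_prob=0.8), readings, reliabilities)

best_room = max(belief, key=belief.get)
```

When the two sensors disagree, the more reliable source dominates the posterior. When one reading is missing, the remaining sensor and the prediction from the previous time step carry the estimate. This is one simple way in which fusion can make a combined system more robust than either modality alone.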




Author information

Correspondence to Christopher Town.


About this article

Cite this article

Town, C. Multi-sensory and Multi-modal Fusion for Sentient Computing. Int J Comput Vision 71, 235–253 (2007).



  • multi-sensory fusion
  • multi-modal fusion
  • sentient computing
  • object tracking
  • Bayesian networks