Audio-Video Integration for Background Modelling

  • Marco Cristani
  • Manuele Bicego
  • Vittorio Murino
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3022)


This paper introduces a new concept in surveillance, namely, audio-visual data integration for background modelling. Visual data acquired by a fixed camera can be naturally complemented by audio information, allowing a more complete analysis of the monitored scene. The key idea is to build a multimodal model of the scene background, able to promptly detect isolated auditory or visual events, as well as simultaneous audio and visual foreground situations. In this way, it is also possible to tackle some open problems of standard visual surveillance systems (e.g., the sleeping foreground problem), provided the foreground is also characterized by an audio component. The method is based on the probabilistic modelling of the audio and video data streams using separate sets of adaptive Gaussian mixture models, and on their integration using a coupled audio-video adaptive model operating on the frame histogram and the audio frequency spectrum. This framework has been shown able to evaluate the temporal causality between visual and audio foreground entities. To the best of our knowledge, this is the first attempt at on-line multimodal scene modelling using a single static camera and a single microphone. Preliminary results show the effectiveness of the approach in addressing problems still unsolved by purely visual monitoring approaches.
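The per-modality background models described above are adaptive Gaussian mixtures updated on-line. The following is a minimal sketch of a Stauffer-Grimson-style adaptive mixture update for a single scalar feature stream (e.g., one frame-histogram bin or one audio frequency band); all class and parameter names, defaults, and update details are illustrative assumptions, not the authors' exact formulation:

```python
import math

class AdaptiveGMM:
    """On-line Gaussian mixture model for one scalar feature stream
    (e.g., a single frame-histogram bin or one audio frequency band).
    Illustrative Stauffer-Grimson-style sketch; parameters are assumptions."""

    def __init__(self, n_components=3, alpha=0.05, match_sigmas=2.5,
                 init_var=10.0, min_var=0.5):
        self.alpha = alpha                              # learning rate
        self.k = match_sigmas                           # match threshold (std devs)
        self.init_var = init_var
        self.min_var = min_var
        self.w = [1.0 / n_components] * n_components    # mixture weights
        self.mu = [0.0] * n_components                  # component means
        self.var = [init_var] * n_components            # component variances

    def update(self, x):
        """Fold sample x into the model; return True if x is explained
        by the background (i.e., it matches an existing component)."""
        # Distance of x to each component, in standard deviations.
        z = [abs(x - m) / math.sqrt(v) for m, v in zip(self.mu, self.var)]
        m = min(range(len(z)), key=z.__getitem__)
        matched = z[m] < self.k
        if matched:
            # Reinforce the matched component and decay the others.
            self.w = [(1 - self.alpha) * wi for wi in self.w]
            self.w[m] += self.alpha
            self.mu[m] += self.alpha * (x - self.mu[m])
            self.var[m] = max(self.min_var, self.var[m] +
                              self.alpha * ((x - self.mu[m]) ** 2 - self.var[m]))
        else:
            # Foreground sample: replace the least-probable component.
            j = min(range(len(self.w)), key=self.w.__getitem__)
            self.mu[j], self.var[j], self.w[j] = x, self.init_var, self.alpha
        s = sum(self.w)
        self.w = [wi / s for wi in self.w]
        return matched
```

Feeding a stable background value lets the model adapt to it, while a sudden deviation is flagged as foreground until the model eventually absorbs it:

```python
bg = AdaptiveGMM()
for _ in range(50):
    bg.update(5.0)          # background settles around 5
print(bg.update(100.0))     # -> False (a foreground event)
print(bg.update(5.0))       # -> True  (background again)
```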


Keywords: Gaussian Mixture Model · Background Modelling · Audio Signal · Blind Source Separation · Multimodal Model



Copyright information

© Springer-Verlag Berlin Heidelberg 2004

Authors and Affiliations

  • Marco Cristani (1)
  • Manuele Bicego (1)
  • Vittorio Murino (1)

  1. Dipartimento di Informatica, University of Verona, Verona, Italy
