Speaker Tracking Using Multi-modal Fusion Framework

  • Anwar Saeed
  • Ayoub Al-Hamadi
  • Michael Heuer
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7340)


Abstract

This paper introduces a framework by which multi-modal sensory data can be efficiently and meaningfully combined for the application of speaker tracking. The framework fuses four different observation types drawn from multi-modal sensors. The advantage of this fusion is that weak sensory data from either modality can be reinforced and the influence of noise reduced. We propose to combine these modalities with a particle filter, and the resulting method achieves satisfactory real-time performance.
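The fusion idea can be sketched with a bootstrap particle filter in which each modality contributes an independent likelihood and the combined particle weight is their product, so a weak cue from one sensor is reinforced by the others. The Gaussian observation models, the one-dimensional state, and all numbers below are illustrative stand-ins, not the paper's actual models:

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_likelihood(particles, measurement, sigma):
    """Likelihood of each particle under one modality's measurement."""
    return np.exp(-0.5 * ((particles - measurement) / sigma) ** 2)

def particle_filter_step(particles, measurements, sigmas, motion_std=0.05):
    """One predict-update-resample cycle fusing several modalities."""
    # Predict: random-walk motion model.
    particles = particles + rng.normal(0.0, motion_std, size=particles.shape)

    # Update: fuse modalities by multiplying their likelihoods.
    weights = np.ones_like(particles)
    for z, sigma in zip(measurements, sigmas):
        weights *= gaussian_likelihood(particles, z, sigma)
    weights /= weights.sum()

    # Systematic resampling to counter weight degeneracy.
    positions = (rng.random() + np.arange(len(particles))) / len(particles)
    indices = np.searchsorted(np.cumsum(weights), positions)
    indices = np.minimum(indices, len(particles) - 1)  # guard float round-off
    return particles[indices]

# Usage: three noisy modality readings (e.g. face, skin, TDOA cues)
# around a true speaker position of about 0.4.
particles = rng.uniform(-1.0, 1.0, size=500)
for _ in range(20):
    particles = particle_filter_step(
        particles,
        measurements=[0.42, 0.38, 0.45],
        sigmas=[0.1, 0.2, 0.15])
estimate = particles.mean()
```

Because the per-modality likelihoods multiply, a noisy reading with a large `sigma` flattens its contribution and the sharper modalities dominate, which is one way the fusion suppresses noise.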


Keywords

Speaker tracking · Skin detection · Face detection · Particle filter · Time difference of arrival


References

  1. Shivappa, S., Trivedi, M., Rao, B.: Audiovisual information fusion in human computer interfaces and intelligent environments: a survey. Proceedings of the IEEE 98, 1692–1715 (2010)
  2. Vermaak, J., Gangnet, M., Blake, A., Perez, P.: Sequential Monte Carlo fusion of sound and vision for speaker tracking. In: ICCV, pp. 741–746 (2001)
  3. Zhou, H., Taj, M., Cavallaro, A.: Target detection and tracking with heterogeneous sensors. IEEE Journal of Selected Topics in Signal Processing 2, 503–513 (2008)
  4. Saeed, A., Niese, R., Al-Hamadi, A., Michaelis, B.: Coping with hand-hand overlapping in bimanual movements. In: 2011 IEEE International Conference on Signal and Image Processing Applications (ICSIPA), pp. 238–243 (2011)
  5. Schettini, R., Gasparini, F.: Skin segmentation using multiple thresholding. In: Internet Imaging VII, IS&T/SPIE, pp. 60610F-1–60610F-8. SPIE (2006)
  6. Rahman, N.A., Wei, K.C., See, J.: RGB-H-CbCr skin colour model for human face detection. In: Proceedings of the MMU International Symposium on Information & Communications Technologies, M2USIC 2006 (2006)
  7. Saeed, A., Niese, R., Al-Hamadi, A., Panning, A.: Hand-face-touch measure: a cue for human behavior analysis. In: 2011 IEEE International Conference on Intelligent Computing and Intelligent Systems (ICIS), vol. 3, pp. 605–609 (2011)
  8. Bradski, G.: The OpenCV Library. Dr. Dobb's Journal of Software Tools (2000)
  9. Viola, P., Jones, M.: Rapid object detection using a boosted cascade of simple features. In: CVPR 2001, pp. 511–518 (2001)
  10. Knapp, C.H., Carter, G.C.: The generalized correlation method for estimation of time delay. IEEE Trans. Acoust., Speech, Signal Processing 24, 320–327 (1976)
  11. Bachu, R.G., Kopparthi, S., Adapa, B., Barkana, B.D.: Separation of voiced and unvoiced using zero crossing rate and energy of the speech signal. In: American Society for Engineering Education ASEE Zone Conference Proceedings, pp. 1–7 (2008)
  12. Blake, A., Isard, M.: The CONDENSATION algorithm - conditional density propagation and applications to visual tracking. In: NIPS, pp. 361–367 (1996)
  13. Steer, M., Al-Hamadi, A., Michaelis, B.: Audio-visual data fusion using a particle filter in the application of face recognition. In: 2010 20th International Conference on Pattern Recognition (ICPR), pp. 4392–4395 (2010)
  14. Doucet, A., De Freitas, N., Gordon, N. (eds.): Sequential Monte Carlo Methods in Practice. Springer (2001)

Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Anwar Saeed¹
  • Ayoub Al-Hamadi¹
  • Michael Heuer¹

  1. Institute for Electronics, Signal Processing and Communications (IESK), Otto-von-Guericke-University Magdeburg, Magdeburg, Germany
