Multimodal identification and localization of users in a smart environment


Detecting the location and identity of users is a first step in creating context-aware applications for technologically endowed environments. We propose a system that makes use of motion detection, person tracking, face identification, feature-based identification, audio-based localization, and audio-based identification modules, fusing information with particle filters to obtain robust localization and identification. The data streams are processed with the help of the generic client-server middleware SmartFlow, resulting in a flexible architecture that runs across different platforms.
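To make the fusion idea concrete, the following is a minimal sketch of a bootstrap particle filter that combines a video-based and an audio-based position estimate for a single person on the floor plane. It is not the system described in the article: the random-walk motion model, the isotropic Gaussian likelihoods, the noise values, the room size, and all function names (`fuse_step`, `gaussian_likelihood`, `estimate`) are assumptions chosen for illustration, and SmartFlow is not involved.

```python
import math
import random


def gaussian_likelihood(particle, measurement, sigma):
    """Unnormalized isotropic Gaussian likelihood of a 2D measurement."""
    dx = particle[0] - measurement[0]
    dy = particle[1] - measurement[1]
    return math.exp(-(dx * dx + dy * dy) / (2.0 * sigma * sigma))


def fuse_step(particles, video_obs, audio_obs, rng,
              motion_sigma=0.1, video_sigma=0.2, audio_sigma=0.5):
    """Run one predict / update / resample cycle.

    video_obs and audio_obs are (x, y) estimates in metres, or None when
    that modality produced no reading for this frame.
    All sigma values are illustrative assumptions, not values from the article.
    """
    # Predict: diffuse each particle with a random-walk motion model.
    predicted = [(x + rng.gauss(0.0, motion_sigma),
                  y + rng.gauss(0.0, motion_sigma)) for x, y in particles]

    # Update: multiply the per-modality likelihoods, which assumes the
    # modalities are conditionally independent given the true position.
    weights = []
    for p in predicted:
        w = 1.0
        if video_obs is not None:
            w *= gaussian_likelihood(p, video_obs, video_sigma)
        if audio_obs is not None:
            w *= gaussian_likelihood(p, audio_obs, audio_sigma)
        weights.append(w)

    # If every weight underflowed to zero, fall back to uniform weights.
    if sum(weights) <= 0.0:
        weights = [1.0] * len(predicted)

    # Resample: draw particles with probability proportional to weight.
    return rng.choices(predicted, weights=weights, k=len(predicted))


def estimate(particles):
    """Return the mean particle position as the location estimate."""
    n = len(particles)
    return (sum(p[0] for p in particles) / n,
            sum(p[1] for p in particles) / n)


# Usage: 500 particles spread over a hypothetical 5 m x 4 m room.
rng = random.Random(0)
particles = [(rng.uniform(0.0, 5.0), rng.uniform(0.0, 4.0))
             for _ in range(500)]
for _ in range(10):
    particles = fuse_step(particles, video_obs=(2.0, 1.0),
                          audio_obs=(2.2, 1.1), rng=rng)
print(estimate(particles))
```

Because the video likelihood is given a smaller standard deviation than the audio likelihood, the estimate is pulled more strongly toward the video reading. Passing `None` for a modality lets the filter keep running when, for example, nobody is speaking.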




Author information



Corresponding author

Correspondence to Albert Ali Salah.

Additional information

This work is supported by Spanish projects SAPIRE (TEC2007-65470) and PROVEC (TEC2007-66858/TCM) and Dutch projects BRICKS/BSIK and BASIS IOP GenCom.


About this article

Cite this article

Salah, A.A., Morros, R., Luque, J. et al. Multimodal identification and localization of users in a smart environment. J Multimodal User Interfaces 2, 75–91 (2008).



  • Multimodal tracking
  • Multimodal identification
  • Particle filters
  • Smart rooms