Journal on Multimodal User Interfaces, Volume 2, Issue 2, pp 75–91

Multimodal identification and localization of users in a smart environment

  • Albert Ali Salah (corresponding author)
  • Ramon Morros
  • Jordi Luque
  • Carlos Segura
  • Javier Hernando
  • Onkar Ambekar
  • Ben Schouten
  • Eric Pauwels
Original Paper


Detecting the location and identity of users is a first step in creating context-aware applications for technologically endowed environments. We propose a system that combines motion detection, person tracking, face identification, feature-based identification, audio-based localization, and audio-based identification modules, fusing their outputs with particle filters to obtain robust localization and identification. The data streams are processed with the generic client-server middleware SmartFlow, resulting in a flexible architecture that runs across different platforms.
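As a rough illustration of how a particle filter can fuse localization cues from several modalities, the following is a minimal sketch and not the system described in the paper. It assumes a single user on a 2D floor plan, a random-walk motion model, and isotropic Gaussian observation noise for a video-based and an audio-based position estimate. All function names, noise levels, and the room size are hypothetical choices made for this example.

```python
import math
import random


def gaussian_likelihood(particle, observation, sigma):
    """Unnormalised isotropic Gaussian likelihood of a 2D observation."""
    dx = particle[0] - observation[0]
    dy = particle[1] - observation[1]
    return math.exp(-(dx * dx + dy * dy) / (2.0 * sigma * sigma))


def particle_filter_step(particles, observations, motion_sigma, rng):
    """One predict/update/resample cycle of a bootstrap particle filter.

    observations is a list of ((x, y), sigma) pairs, one per modality.
    A modality with no detection in this frame is simply left out.
    """
    # Predict: diffuse each particle with a random-walk motion model (assumed).
    predicted = [
        (x + rng.gauss(0.0, motion_sigma), y + rng.gauss(0.0, motion_sigma))
        for x, y in particles
    ]
    # Update: multiply per-modality likelihoods, which assumes the
    # modalities are conditionally independent given the true position.
    weights = []
    for p in predicted:
        w = 1.0
        for obs, sigma in observations:
            w *= gaussian_likelihood(p, obs, sigma)
        weights.append(w)
    # Fall back to uniform weights if every likelihood underflowed to zero.
    if sum(weights) <= 0.0:
        weights = [1.0] * len(predicted)
    # Resample: draw particles with probability proportional to weight.
    return rng.choices(predicted, weights=weights, k=len(predicted))


def estimate(particles):
    """Posterior mean of the particle set."""
    n = len(particles)
    return (sum(p[0] for p in particles) / n, sum(p[1] for p in particles) / n)


def track_demo(seed=0, steps=30, n_particles=500):
    """Track a stationary user at (2, 3) in a hypothetical 5 m x 5 m room."""
    rng = random.Random(seed)
    true_pos = (2.0, 3.0)
    particles = [
        (rng.uniform(0.0, 5.0), rng.uniform(0.0, 5.0)) for _ in range(n_particles)
    ]
    for _ in range(steps):
        # Simulated detections: video is assumed more precise than audio.
        video = (true_pos[0] + rng.gauss(0.0, 0.1), true_pos[1] + rng.gauss(0.0, 0.1))
        audio = (true_pos[0] + rng.gauss(0.0, 0.3), true_pos[1] + rng.gauss(0.0, 0.3))
        particles = particle_filter_step(
            particles, [(video, 0.1), (audio, 0.3)], motion_sigma=0.05, rng=rng
        )
    return estimate(particles)


if __name__ == "__main__":
    print(track_demo())
```

Multiplying the per-modality likelihoods lets the more precise cue dominate while the noisier one still contributes, and a frame with a missing detection needs no special handling beyond omitting that observation.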


Keywords: Multimodal tracking, Multimodal identification, Particle filters, Smart rooms





Copyright information

© OpenInterface Association 2008

Authors and Affiliations

  • Albert Ali Salah (1, corresponding author)
  • Ramon Morros (2)
  • Jordi Luque (2)
  • Carlos Segura (2)
  • Javier Hernando (2)
  • Onkar Ambekar (1)
  • Ben Schouten (1)
  • Eric Pauwels (1)

  1. Signals and Images, Centre for Mathematics and Computer Science, Amsterdam, The Netherlands
  2. Technical University of Catalonia, Barcelona, Spain
