Multimedia Systems, Volume 16, Issue 6, pp 345–379

Multimodal fusion for multimedia analysis: a survey

  • Pradeep K. Atrey
  • M. Anwar Hossain
  • Abdulmotaleb El Saddik
  • Mohan S. Kankanhalli
Regular Paper


This survey aims to provide multimedia researchers with a state-of-the-art overview of fusion strategies used to combine multiple modalities for various multimedia analysis tasks. The existing literature on multimodal fusion research is organized through several classifications based on the fusion methodology and the level of fusion (feature, decision, and hybrid). The fusion methods are described in terms of their basic concept, advantages, weaknesses, and usage in various analysis tasks as reported in the literature. Moreover, several distinctive issues that influence a multimodal fusion process, such as the use of correlation and independence, confidence level, contextual information, synchronization between different modalities, and optimal modality selection, are also highlighted. Finally, we present the open issues for further research in the area of multimodal fusion.
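The distinction between fusion levels mentioned in the abstract can be illustrated with a minimal sketch (illustrative only, not taken from the survey): feature-level fusion concatenates per-modality feature vectors before a single classifier, while decision-level fusion combines the scores of per-modality classifiers; here the combination is a hypothetical confidence-weighted average.

```python
# Sketch of feature-level vs. decision-level fusion for two modalities.
# All feature values, scores, and confidences below are made-up examples.

def feature_level_fuse(audio_feat, video_feat):
    """Feature-level fusion: concatenate modality features into one
    joint vector, to be fed to a single classifier."""
    return audio_feat + video_feat

def decision_level_fuse(scores, confidences):
    """Decision-level fusion: combine per-modality classifier scores
    by a confidence-weighted average."""
    total = sum(confidences)
    return sum(s * c for s, c in zip(scores, confidences)) / total

audio_feat = [0.2, 0.8]        # hypothetical audio features
video_feat = [0.5, 0.1, 0.4]   # hypothetical video features

# Feature level: one joint 5-dimensional vector
joint = feature_level_fuse(audio_feat, video_feat)

# Decision level: per-modality scores for "event present", with
# modality confidences 0.7 (audio) and 0.3 (video)
fused_score = decision_level_fuse([0.9, 0.6], [0.7, 0.3])
```

Hybrid fusion, the third level the survey classifies, would apply both operations at different stages of the same pipeline.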


Keywords: Multimodal information fusion · Multimedia analysis



Acknowledgments

The authors would like to thank the editor and the anonymous reviewers for their valuable comments, which helped improve the content of this paper. This work is partially supported by the Natural Sciences and Engineering Research Council (NSERC) of Canada.



Copyright information

© Springer-Verlag 2010

Authors and Affiliations

  • Pradeep K. Atrey (1)
  • M. Anwar Hossain (2)
  • Abdulmotaleb El Saddik (2)
  • Mohan S. Kankanhalli (3)

  1. Department of Applied Computer Science, University of Winnipeg, Winnipeg, Canada
  2. Multimedia Communications Research Laboratory, University of Ottawa, Ottawa, Canada
  3. School of Computing, National University of Singapore, Singapore
