
Multimodal fusion for multimedia analysis: a survey

  • Regular Paper
  • Published in: Multimedia Systems 16, 345–379 (2010)

Abstract

This survey provides multimedia researchers with a state-of-the-art overview of fusion strategies, which are used to combine multiple modalities in order to accomplish various multimedia analysis tasks. The existing literature on multimodal fusion research is presented through several classifications based on the fusion methodology and the level of fusion (feature, decision, and hybrid). The fusion methods are described in terms of their basic concept, advantages, weaknesses, and usage in various analysis tasks as reported in the literature. Moreover, several distinctive issues that influence a multimodal fusion process, such as the use of correlation and independence, confidence level, contextual information, synchronization between different modalities, and optimal modality selection, are also highlighted. Finally, we present open issues for further research in the area of multimodal fusion.
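The fusion levels named above (feature, decision, and hybrid) can be made concrete with a short sketch. The Python snippet below is illustrative only and not taken from the paper: the feature vectors, the toy logistic scorer, and the per-modality confidence weights are all hypothetical stand-ins for the audio and visual modalities discussed in the survey.

```python
import numpy as np

# Hypothetical per-modality features for one media segment (illustrative only).
audio_features = np.array([0.2, 0.7, 0.1])   # e.g., energy, pitch, zero-crossing rate
visual_features = np.array([0.9, 0.3])       # e.g., motion activity, color statistic

def classify(features, weights):
    """Toy logistic scorer standing in for any trained unimodal or multimodal classifier."""
    return float(1.0 / (1.0 + np.exp(-features @ weights)))

# Feature-level (early) fusion: concatenate the features, then classify once.
fused = np.concatenate([audio_features, visual_features])
early_score = classify(fused, np.array([0.5, 1.0, -0.3, 0.8, 0.4]))

# Decision-level (late) fusion: classify each modality separately, then combine decisions.
audio_score = classify(audio_features, np.array([0.5, 1.0, -0.3]))
visual_score = classify(visual_features, np.array([0.8, 0.4]))

# Confidence-weighted combination (weights sum to 1), reflecting the survey's point
# that per-modality confidence levels influence the fusion outcome.
confidence = {"audio": 0.4, "visual": 0.6}
late_score = confidence["audio"] * audio_score + confidence["visual"] * visual_score

print(f"early fusion score: {early_score:.3f}, late fusion score: {late_score:.3f}")
```

In this toy setting, a hybrid scheme would combine both paths, for example by averaging early_score and late_score, trading the correlation-preserving early path against the more modular and robust late path.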


Notes

  1. To maintain consistency, we will use these notations for modalities in the rest of this paper.


Acknowledgments

The authors would like to thank the editor and the anonymous reviewers for their valuable comments, which helped improve the content of this paper. This work is partially supported by the Natural Sciences and Engineering Research Council (NSERC) of Canada.

Author information


Corresponding author

Correspondence to Pradeep K. Atrey.

Additional information

Communicated by Wu-chi Feng.


About this article

Cite this article

Atrey, P.K., Hossain, M.A., El Saddik, A. et al. Multimodal fusion for multimedia analysis: a survey. Multimedia Systems 16, 345–379 (2010). https://doi.org/10.1007/s00530-010-0182-0

