Multimodal Fusion in Surveillance Applications

  • Virginia Fernandez Arguedas
  • Qianni Zhang
  • Ebroul Izquierdo
Part of the Advances in Computer Vision and Pattern Recognition book series (ACVPR)


The recent outbreak of vandalism, accidents and criminal activities has increased the general public's awareness of safety and security, demanding improved security measures. Smart surveillance video systems have become a ubiquitous platform for monitoring private and public environments, ensuring citizens' well-being. Their universal deployment integrates diverse media and acquisition systems, generating an enormous amount of multimodal data every day. Nowadays, numerous surveillance applications exploit multiple types of data and features, benefitting from their uncorrelated contributions. Hence, the analysis, standardisation and fusion of complex content, especially visual content, have become fundamental problems in enhancing surveillance systems by increasing their accuracy, robustness and reliability. This chapter provides an exhaustive survey of existing multimodal fusion techniques and their applications in surveillance. Addressing some of the challenges revealed by the state of the art, the chapter then focuses on the development of a multimodal fusion technique for automatic surveillance object classification. The proposed fusion technique exploits a Bayesian inference scheme to enhance surveillance systems' performance. The chapter ends with an evaluation of the proposed Bayesian-based multimodal object classifier against two state-of-the-art object classifiers, demonstrating the benefits of multimodal fusion in surveillance applications.
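As background for the abstract's reference to a Bayesian inference scheme, the general idea of Bayesian late fusion can be sketched as follows. This is a minimal illustrative sketch, not the chapter's actual classifier: the function name, the conditional-independence assumption between modalities, and the example numbers are all assumptions introduced here.

```python
def bayesian_fusion(likelihoods, priors):
    """Fuse per-modality class likelihoods into a class posterior.

    likelihoods: one list per modality, each giving P(observation_m | class c).
    priors: class priors P(c).
    Assumes modalities are conditionally independent given the class, so the
    posterior is proportional to the prior times the product of likelihoods.
    """
    posterior = list(priors)
    for modality in likelihoods:
        posterior = [p * l for p, l in zip(posterior, modality)]
    total = sum(posterior)          # normalising constant P(observations)
    return [p / total for p in posterior]

# Two hypothetical modalities (e.g. appearance, motion) scoring two classes
# (e.g. vehicle, pedestrian), with uniform priors:
fused = bayesian_fusion([[0.7, 0.3], [0.6, 0.4]], [0.5, 0.5])
# fused ≈ [0.78, 0.22]: both modalities reinforce the first class
```

The appeal of this scheme for surveillance is that weak, uncorrelated cues from different modalities combine multiplicatively, so agreement between modalities sharpens the decision while disagreement tempers it.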


Keywords: Surveillance Video · Object Classification · Fusion Technique · Semantic Concept · Surveillance Network



This research was partially supported by the European Commission under contract FP7-261743 VideoSense.



Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Virginia Fernandez Arguedas (1, 2)
  • Qianni Zhang (1)
  • Ebroul Izquierdo (1)

  1. Multimedia and Vision Research Group, School of Electronic Engineering and Computer Science, Queen Mary, University of London, London, UK
  2. European Commission - Joint Research Centre (JRC), Ispra, Italy
