Audio and Video Feature Fusion for Activity Recognition in Unconstrained Videos

  • José Lopes
  • Sameer Singh
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4224)


Combining audio and image processing for understanding video content has several benefits compared to using each modality on its own. For the task of context and activity recognition in video sequences, it is important to exploit both data streams to gather relevant information. In this paper we describe a video context and activity recognition model. Our work extracts a range of audio and visual features, followed by feature reduction and information fusion. We show that combining audio with video-based decision making improves the quality of context and activity recognition in videos by 4% over audio data alone and 18% over image data alone.
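The decision-level fusion step described above can be sketched as a weighted combination of per-class scores from separate audio and visual classifiers. This is an illustrative sketch only, not the authors' implementation: the weight, class names, and score values below are all hypothetical, and the weighted-sum rule is one of the standard combining rules discussed by Kittler et al.

```python
def fuse_decisions(audio_scores, video_scores, audio_weight=0.6):
    """Combine two per-class score dicts with a weighted sum.

    audio_scores / video_scores map class labels to posterior-like
    scores from the audio and visual classifiers respectively.
    """
    classes = set(audio_scores) | set(video_scores)
    fused = {
        c: audio_weight * audio_scores.get(c, 0.0)
           + (1.0 - audio_weight) * video_scores.get(c, 0.0)
        for c in classes
    }
    # The predicted activity is the class with the highest fused score.
    return max(fused, key=fused.get), fused

# Hypothetical posteriors for a three-activity problem:
audio = {"walking": 0.2, "talking": 0.7, "typing": 0.1}
video = {"walking": 0.5, "talking": 0.3, "typing": 0.2}

label, scores = fuse_decisions(audio, video)
print(label)  # -> talking
```

With the audio modality weighted more heavily (reflecting its stronger individual performance reported in the abstract), the fused decision here follows the audio classifier's preference even though the visual classifier disagrees.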


Keywords: Activity Recognition, Feature Fusion, Audio Feature, Decision Fusion, Decision-Level Fusion





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • José Lopes¹
  • Sameer Singh¹
  1. Research School of Informatics, Loughborough University, Loughborough, UK
