
Classifying web videos using a global video descriptor

  • Original Paper
  • Published in Machine Vision and Applications

Abstract

Computing descriptors for videos is a crucial task in computer vision. In this paper, we propose a global video descriptor for the classification of videos. Our method bypasses the detection of interest points, the extraction of local video descriptors, and the quantization of descriptors into a code book; it represents each video sequence as a single feature vector. Our global descriptor is computed by applying a bank of 3-D spatio-temporal filters to the frequency spectrum of a video sequence; hence, it integrates information about both motion and scene structure. We tested our approach on three datasets, KTH (Schuldt et al., Proceedings of the 17th international conference on pattern recognition (ICPR’04), vol. 3, pp. 32–36, 2004), UCF50 (http://vision.eecs.ucf.edu/datasetsActions.html) and HMDB51 (Kuehne et al., HMDB: a large video database for human motion recognition, 2011), and obtained promising results which demonstrate the robustness and the discriminative power of our global video descriptor for classifying videos of various actions. In addition, the combination of our global descriptor with a local descriptor resulted in the highest classification accuracies on the UCF50 and HMDB51 datasets.
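To make the idea concrete, below is a minimal sketch (not the authors' implementation) of a global descriptor built from the 3-D frequency spectrum of a clip: the power spectrum is pooled through a small bank of spatio-temporal band-pass filters, and the pooled energies form a single fixed-length feature vector. The Gaussian band-pass filter shapes, the band centers, the grayscale input, and the L2 normalization are illustrative assumptions; the paper's actual filter bank and pooling differ in detail.

```python
import numpy as np


def frequency_grid(shape):
    """Normalized 3-D frequency coordinates (fy, fx, ft) for an (H, W, T) video."""
    fy = np.fft.fftfreq(shape[0])[:, None, None]
    fx = np.fft.fftfreq(shape[1])[None, :, None]
    ft = np.fft.fftfreq(shape[2])[None, None, :]
    return fy, fx, ft


def filter_bank(shape, n_spatial=4, n_temporal=3, bandwidth=0.08):
    """Illustrative bank of Gaussian band-pass filters over spatial radius and
    temporal frequency (an assumption; the paper's filter design differs)."""
    fy, fx, ft = frequency_grid(shape)
    r = np.sqrt(fy ** 2 + fx ** 2)                    # spatial frequency magnitude
    bank = []
    for r0 in np.linspace(0.05, 0.4, n_spatial):      # spatial frequency bands
        for t0 in np.linspace(0.0, 0.4, n_temporal):  # temporal bands (0 ~ static)
            bank.append(np.exp(-((r - r0) ** 2 + (np.abs(ft) - t0) ** 2)
                               / (2 * bandwidth ** 2)))
    return bank


def global_descriptor(video):
    """video: (H, W, T) grayscale array -> single L2-normalized feature vector."""
    video = (video - video.mean()) / (video.std() + 1e-8)  # remove DC / contrast bias
    power = np.abs(np.fft.fftn(video)) ** 2                # 3-D power spectrum
    feats = np.array([np.sum(f * power) for f in filter_bank(video.shape)])
    return feats / (np.linalg.norm(feats) + 1e-8)


if __name__ == "__main__":
    clip = np.random.rand(64, 64, 32)       # stand-in for a grayscale video clip
    print(global_descriptor(clip).shape)    # (12,) = n_spatial * n_temporal
```

Each clip yields a fixed-length vector regardless of its duration, so classification reduces to training an off-the-shelf multi-class SVM (e.g., LIBSVM [25]) on these vectors.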


References

  1. Schuldt, C., Laptev, I., Caputo, B.: Recognizing human actions: a local SVM approach. In: Proceedings of the 17th international conference on pattern recognition (ICPR’04), vol. 3, pp. 32–36 (2004)

  2. UCF50 dataset. http://vision.eecs.ucf.edu/datasetsActions.html

  3. Kuehne, H., Jhuang, H., Garrote, E., Poggio, T., Serre, T.: HMDB: a large video database for human motion recognition. In: ICCV (2011)

  4. Poppe, R.: A survey on vision-based human action recognition. Image Vision Comput. 28, 976–990 (2010)


  5. Weinland, D., Ronfard, R., Boyer, E.: A survey of vision-based methods for action representation, segmentation and recognition. Comput. Vis. Image Underst. 115, 224–241 (2011)


  6. Laptev, I., Marszalek, M., Schmid, C., Rozenfeld, B.: Learning realistic human actions from movies. In: IEEE Computer Society conference on computer vision and pattern recognition (CVPR ’08) (2008)

  7. Liu, J., Luo, J., Shah, M.: Recognizing realistic actions from videos in the wild. In: IEEE conference on computer vision and pattern recognition (CVPR ’09), pp. 1996–2003 (2009)

  8. Bobick, A., Davis, J.: The recognition of human movement using temporal templates. IEEE Trans. Pattern Anal. Mach. Intell. 23, 257–267 (2001)


  9. Yilmaz, A., Shah, M.: A differential geometric approach to representing the human actions. Comput. Vis. Image Underst. 109, 335–351 (2008)


  10. Black, M.: Explaining optical flow events with parameterized spatio-temporal models. In: IEEE Computer Society conference on computer vision and pattern recognition (CVPR ’99), vol. 1, pp. 326–332 (1999)

  11. Polana, R., Nelson, R.C.: Detection and recognition of periodic, non-rigid motion. Int. J. Comput. Vision 23, 261–282 (1997)


  12. Wu, S., Oreifej, O., Shah, M.: Action recognition in videos acquired by a moving camera using motion decomposition of lagrangian particle trajectories. In: IEEE international conference on computer vision (ICCV ’11), pp. 1419–1426 (2011)

  13. Wang, H., Klaser, A., Schmid, C., Liu, C.L.: Action recognition by dense trajectories. In: IEEE conference on computer vision and pattern recognition (CVPR ’11), pp. 3169–3176 (2011)

  14. Laptev, I.: On space-time interest points. Int. J. Comput. Vision 64, 107–123 (2005)


  15. Harris, C., Stephens, M.: A combined corner and edge detector. In: Proceedings of fourth Alvey vision conference, pp. 147–151 (1988)

  16. Dollar, P., Rabaud, V., Cottrell, G., Belongie, S.: Behavior recognition via sparse spatio-temporal features. In: VS-PETS, pp. 65–72 (2005)

  17. Klaser, A., Marszalek, M., Schmid, C.: A spatio-temporal descriptor based on 3D-gradients. In: BMVC (2008)


  18. Scovanner, P., Ali, S., Shah, M.: A 3-dimensional SIFT descriptor and its application to action recognition. In: ACM multimedia, pp. 357–360 (2007)

  19. Ikizler-Cinbis, N., Sclaroff, S.: Object, scene and actions: combining multiple features for human action recognition. In: Proceedings of the 11th European conference on computer vision (ECCV ’10), pp. 494–507 (2010)

  20. Oliva, A., Torralba, A.B., Guerin-Dugue, A., Herault, J.: Global semantic classification of scenes using power spectrum templates. In: The challenge of image retrieval, pp. 1–12 (1999)

  21. Oliva, A., Torralba, A.: Modeling the shape of the scene: a holistic representation of the spatial envelope. Int. J. Comput. Vision 42, 145–175 (2001)


  22. Heeger, D.J.: Notes on motion estimation. http://white.stanford.edu/~heeger (1998)

  23. Maaten, L.V.D., Postma, E.O., Herik, H.J.V.D.: Dimensionality reduction: a comparative review (2008)

  24. Wang, H., Ullah, M.M., Klaser, A., Laptev, I., Schmid, C.: Evaluation of local spatio-temporal features for action recognition. In: BMVC, p. 127 (2009)

  25. Chang, C.C., Lin, C.J.: LIBSVM: A library for support vector machines. ACM Trans. Intell. Syst. Technol. 2, 27:1–27:27 (2011)


  26. Jhuang, H., Serre, T., Wolf, L., Poggio, T.: A biologically inspired system for action recognition. In: IEEE 11th international conference on computer vision (ICCV’07), pp. 1–8 (2007)

  27. Gilbert, A., Illingworth, J., Bowden, R.: Action recognition using mined hierarchical compound features. IEEE Trans. Pattern Anal. Mach. Intell. 33, 883–897 (2011)



Acknowledgments

The research presented in this paper is supported by the Intelligence Advanced Research Projects Activity (IARPA) via the Department of Interior National Business Center, contract number D11PC20071. The US government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoI/NBC or the US government.

Author information

Corresponding author

Correspondence to Berkan Solmaz.

Electronic supplementary material


Supplementary material (AVI 9508KB)


About this article

Cite this article

Solmaz, B., Assari, S.M. & Shah, M. Classifying web videos using a global video descriptor. Machine Vision and Applications 24, 1473–1485 (2013). https://doi.org/10.1007/s00138-012-0449-x
