Machine Vision and Applications, Volume 24, Issue 7, pp 1473–1485

Classifying web videos using a global video descriptor

  • Berkan Solmaz
  • Shayan Modiri Assari
  • Mubarak Shah
Original Paper


Computing descriptors for videos is a crucial task in computer vision. In this paper, we propose a global video descriptor for classifying videos. Our method bypasses the detection of interest points, the extraction of local video descriptors, and the quantization of descriptors into a codebook; instead, it represents each video sequence as a single feature vector. Our global descriptor is computed by applying a bank of 3-D spatio-temporal filters to the frequency spectrum of a video sequence; hence, it integrates information about both motion and scene structure. We tested our approach on three datasets, KTH (Schuldt et al., Proceedings of the 17th International Conference on Pattern Recognition (ICPR’04), vol. 3, pp. 32–36, 2004), UCF50, and HMDB51 (Kuehne et al., HMDB: a large video database for human motion recognition, 2011), and obtained promising results that demonstrate the robustness and discriminative power of our global video descriptor for classifying videos of various actions. In addition, combining our global descriptor with a local descriptor resulted in the highest classification accuracies on the UCF50 and HMDB51 datasets.
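The abstract only summarizes how the descriptor is built, so the following is a minimal sketch of the general idea under stated assumptions, not the authors' implementation: take the 3-D discrete Fourier transform of a mean-subtracted clip, weight its power spectrum with a bank of 3-D band-pass filters, and keep one energy value per filter, so that a whole clip becomes one fixed-length vector. The Gaussian filter shape, the hypothetical `RADII` and `DIRECTIONS` parameters, the L2 normalisation and every function name here are illustrative assumptions; the paper's actual filter bank, pooling and classifier are not described in this abstract. A naive pure-Python DFT keeps the example dependency-free; real code would use an FFT library.

```python
import cmath
import math
from typing import List, Sequence, Tuple

Volume = List[List[List[float]]]  # grayscale clip indexed [t][y][x]
Vec3 = Tuple[float, float, float]  # (f_t, f_y, f_x) in cycles per sample

# Hypothetical filter-bank parameters (assumptions, not taken from the paper).
RADII = (0.125, 0.25)  # distance of each filter centre from the origin
DIRECTIONS = [(0, 0, 1), (0, 1, 0), (1, 0, 1), (1, 0, -1), (1, 1, 0), (1, -1, 0)]


def dft1d(seq: Sequence[complex]) -> List[complex]:
    """Naive O(n^2) discrete Fourier transform of a 1-D signal."""
    n = len(seq)
    return [sum(seq[k] * cmath.exp(-2j * math.pi * f * k / n) for k in range(n))
            for f in range(n)]


def dft3d(vol: Volume) -> List[List[List[complex]]]:
    """Separable 3-D DFT: transform along x, then y, then t."""
    T, H, W = len(vol), len(vol[0]), len(vol[0][0])
    spec = [[dft1d(vol[t][y]) for y in range(H)] for t in range(T)]
    for t in range(T):
        for x in range(W):
            col = dft1d([spec[t][y][x] for y in range(H)])
            for y in range(H):
                spec[t][y][x] = col[y]
    for y in range(H):
        for x in range(W):
            tube = dft1d([spec[t][y][x] for t in range(T)])
            for t in range(T):
                spec[t][y][x] = tube[t]
    return spec


def signed_freq(k: int, n: int) -> float:
    """Map DFT bin k to a signed frequency in [-0.5, 0.5)."""
    return (k if k < n / 2 else k - n) / n


def make_filter_bank(radii: Sequence[float],
                     directions: Sequence[Tuple[int, int, int]]) -> List[Tuple[Vec3, float]]:
    """Gaussian band-pass filters as (centre, sigma); all directions per radius, in order."""
    bank = []
    for r in radii:
        for d in directions:
            norm = math.sqrt(sum(c * c for c in d))
            centre = tuple(r * c / norm for c in d)
            bank.append((centre, 0.5 * r))  # bandwidth grows with radius (assumption)
    return bank


def global_descriptor(vol: Volume, bank: List[Tuple[Vec3, float]]) -> List[float]:
    """One value per filter: power-spectrum energy passed by that filter, L2-normalised."""
    T, H, W = len(vol), len(vol[0]), len(vol[0][0])
    mean = sum(v for frame in vol for row in frame for v in row) / (T * H * W)
    centred = [[[v - mean for v in row] for row in frame] for frame in vol]  # remove DC
    spec = dft3d(centred)
    feats = []
    for (ct, cy, cx), sigma in bank:
        energy = 0.0
        for t in range(T):
            ft = signed_freq(t, T)
            for y in range(H):
                fy = signed_freq(y, H)
                for x in range(W):
                    fx = signed_freq(x, W)
                    dist2 = (ft - ct) ** 2 + (fy - cy) ** 2 + (fx - cx) ** 2
                    energy += abs(spec[t][y][x]) ** 2 * math.exp(-dist2 / (2 * sigma ** 2))
        feats.append(energy)
    norm = math.sqrt(sum(f * f for f in feats)) or 1.0
    return [f / norm for f in feats]


def grating(T: int, H: int, W: int, ft: float, fy: float, fx: float) -> Volume:
    """Hypothetical synthetic clip: a sinusoid drifting with frequency (ft, fy, fx)."""
    return [[[math.cos(2 * math.pi * (ft * t + fy * y + fx * x)) for x in range(W)]
             for y in range(H)] for t in range(T)]


if __name__ == "__main__":
    bank = make_filter_bank(RADII, DIRECTIONS)
    static = global_descriptor(grating(8, 8, 8, 0.0, 0.0, 0.25), bank)
    moving = global_descriptor(grating(8, 8, 8, 0.25, 0.0, 0.25), bank)
    print(max(range(len(bank)), key=static.__getitem__))  # 6: centre (0, 0, 0.25)
    print(max(range(len(bank)), key=moving.__getitem__))  # 8: centre on the (1, 0, 1) axis
```

The two synthetic clips show why a spectral descriptor can carry both kinds of information: a static pattern keeps its energy in the zero-temporal-frequency plane, while motion tilts that energy toward filters with a temporal component. Because the power spectrum of a real-valued clip is symmetric, the directions only need to cover half of the frequency sphere.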


Keywords: Video descriptors · Action recognition · Frequency spectrum · Spatio-temporal analysis



The research presented in this paper is supported by the Intelligence Advanced Research Projects Activity (IARPA) via the Department of Interior National Business Center, contract number D11PC20071. The US government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoI/NBC or the US government.

Supplementary material

Supplementary material (AVI 9508KB)


References

  1. Schuldt, C., Laptev, I., Caputo, B.: Recognizing human actions: a local SVM approach. In: Proceedings of the 17th International Conference on Pattern Recognition (ICPR’04), vol. 3, pp. 32–36 (2004)
  2.
  3. Kuehne, H., Jhuang, H., Garrote, E., Poggio, T., Serre, T.: HMDB: a large video database for human motion recognition. In: ICCV (2011)
  4. Poppe, R.: A survey on vision-based human action recognition. Image Vision Comput. 28, 976–990 (2010)
  5. Weinland, D., Ronfard, R., Boyer, E.: A survey of vision-based methods for action representation, segmentation and recognition. Comput. Vis. Image Underst. 115, 224–241 (2011)
  6. Laptev, I., Marszalek, M., Schmid, C., Rozenfeld, B.: Learning realistic human actions from movies. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR ’08) (2008)
  7. Liu, J., Luo, J., Shah, M.: Recognizing realistic actions from videos in the wild. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR ’09), pp. 1996–2003 (2009)
  8. Bobick, A., Davis, J.: The recognition of human movement using temporal templates. IEEE Trans. Pattern Anal. Mach. Intell. 23, 257–267 (2001)
  9. Yilmaz, A., Shah, M.: A differential geometric approach to representing the human actions. Comput. Vis. Image Underst. 109, 335–351 (2008)
  10. Black, M.: Explaining optical flow events with parameterized spatio-temporal models. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR ’99), vol. 1, pp. 326–332 (1999)
  11. Polana, R., Nelson, R.C.: Detection and recognition of periodic, non-rigid motion. Int. J. Comput. Vision 23, 261–282 (1997)
  12. Wu, S., Oreifej, O., Shah, M.: Action recognition in videos acquired by a moving camera using motion decomposition of Lagrangian particle trajectories. In: IEEE International Conference on Computer Vision (ICCV ’11), pp. 1419–1426 (2011)
  13. Wang, H., Klaser, A., Schmid, C., Liu, C.L.: Action recognition by dense trajectories. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR ’11), pp. 3169–3176 (2011)
  14. Laptev, I.: On space-time interest points. Int. J. Comput. Vision 64, 107–123 (2005)
  15. Harris, C., Stephens, M.: A combined corner and edge detector. In: Proceedings of the Fourth Alvey Vision Conference, pp. 147–151 (1988)
  16. Dollar, P., Rabaud, V., Cottrell, G., Belongie, S.: Behavior recognition via sparse spatio-temporal features. In: VS-PETS, pp. 65–72 (2005)
  17. Klaser, A., Marszalek, M., Schmid, C.: A spatio-temporal descriptor based on 3D-gradients. In: BMVC (2008)
  18. Scovanner, P., Ali, S., Shah, M.: A 3-dimensional SIFT descriptor and its application to action recognition. In: ACM Multimedia, pp. 357–360 (2007)
  19. Ikizler-Cinbis, N., Sclaroff, S.: Object, scene and actions: combining multiple features for human action recognition. In: Proceedings of the 11th European Conference on Computer Vision (ECCV ’10), pp. 494–507 (2010)
  20. Oliva, A., Torralba, A.B., Guerin-Dugue, A., Herault, J.: Global semantic classification of scenes using power spectrum templates. In: Challenge of Image Retrieval, pp. 1–12 (1999)
  21. Oliva, A., Torralba, A.: Modeling the shape of the scene: a holistic representation of the spatial envelope. Int. J. Comput. Vision 42, 145–175 (2001)
  22. Heeger, D.J.: Notes on motion estimation (1998)
  23. Maaten, L.V.D., Postma, E.O., Herik, H.J.V.D.: Dimensionality reduction: a comparative review (2008)
  24. Wang, H., Ullah, M.M., Klaser, A., Laptev, I., Schmid, C.: Evaluation of local spatio-temporal features for action recognition. In: BMVC, p. 127 (2009)
  25. Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol. 2, 27:1–27:27 (2011)
  26. Jhuang, H., Serre, T., Wolf, L., Poggio, T.: A biologically inspired system for action recognition. In: IEEE 11th International Conference on Computer Vision (ICCV ’07), pp. 1–8 (2007)
  27. Gilbert, A., Illingworth, J., Bowden, R.: Action recognition using mined hierarchical compound features. IEEE Trans. Pattern Anal. Mach. Intell. 33, 883–897 (2011)

Copyright information

© Springer-Verlag 2012

Authors and Affiliations

  • Berkan Solmaz (1)
  • Shayan Modiri Assari (1)
  • Mubarak Shah (1)
  1. University of Central Florida, Orlando, USA
