Human Action Classification Using N-Grams Visual Vocabulary

  • Ruber Hernández-García
  • Edel García-Reyes
  • Julián Ramos-Cózar
  • Nicolás Guil
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8827)


Human action classification is an important task in computer vision. The Bag-of-Words model is a representation widely used in action classification techniques. In this work we propose an approach based on a mid-level feature representation for human action description. First, an optimal vocabulary is created without fixing the number of visual words in advance, a known limitation of the K-means method. We then introduce a graph-based video representation built from the relationships between interest points, in order to take the spatial and temporal layout into account. Finally, a second visual vocabulary based on n-grams is used for classification, combining the representational power of graphs with the efficiency of the bag-of-words representation. The representation method was tested on the KTH dataset using STIP and MoSIFT descriptors and a multi-class SVM with a chi-square kernel. The experimental results show that our approach with the STIP descriptor outperforms the best results reported in the state of the art, while results with the MoSIFT descriptor are comparable to them.


Keywords: Human Action Classification · Bag-of-Words · Visual Words · Frequent Subgraphs · KTH dataset
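The pipeline sketched in the abstract — sequences of visual-word IDs turned into an n-gram vocabulary, histogram features, and a multi-class SVM with a chi-square kernel — can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not the authors' implementation: the visual-word sequences, class labels, and the `ngram_histogram` helper are hypothetical, and scikit-learn's `chi2_kernel` with a precomputed-kernel SVM stands in for the paper's classifier.

```python
import numpy as np
from sklearn.metrics.pairwise import chi2_kernel
from sklearn.svm import SVC

def ngram_histogram(word_sequence, n, vocab):
    """Count n-grams of visual-word IDs against a fixed n-gram vocabulary."""
    grams = zip(*(word_sequence[i:] for i in range(n)))
    hist = np.zeros(len(vocab))
    for g in grams:
        if g in vocab:
            hist[vocab[g]] += 1
    total = hist.sum()
    # L1-normalise so videos of different lengths stay comparable
    return hist / total if total > 0 else hist

# Hypothetical visual-word sequences for two action classes
seqs = [[0, 1, 0, 1, 0, 1], [1, 0, 1, 0, 1, 0],   # class 0
        [2, 3, 2, 3, 2, 3], [3, 2, 3, 2, 3, 2]]   # class 1
labels = np.array([0, 0, 1, 1])

# Second vocabulary: all bigrams observed in the training sequences
bigrams = sorted({g for s in seqs for g in zip(s, s[1:])})
vocab = {g: i for i, g in enumerate(bigrams)}

X = np.stack([ngram_histogram(s, 2, vocab) for s in seqs])

# Multi-class SVM with a chi-square kernel (precomputed Gram matrix)
K = chi2_kernel(X, gamma=1.0)
clf = SVC(kernel="precomputed").fit(K, labels)

K_new = chi2_kernel(X, X, gamma=1.0)  # classify the training videos back
print(clf.predict(K_new))             # → [0 0 1 1]
```

In practice the n-gram vocabulary would be mined from the graph representation of interest-point relationships rather than from raw ID sequences, and the kernel's `gamma` would be tuned by cross-validation.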



Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Ruber Hernández-García (1)
  • Edel García-Reyes (2)
  • Julián Ramos-Cózar (3)
  • Nicolás Guil (3)
  1. Digital Signals Department, University of Informatics Sciences, Cuba
  2. Pattern Recognition Department, Advanced Technologies Application Center, Cuba
  3. Dept. of Computer Architecture, University of Málaga, Spain
