Classification of Human Actions Using 3-D Convolutional Neural Networks: A Hierarchical Approach

Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 841)


In this paper, we present a hierarchical approach to human action classification using 3-D convolutional neural networks (3-D CNNs). Human actions generally involve the positioning and movement of hands and legs, and can therefore be grouped into those performed by the hands, by the legs, or, in some cases, by both. This observation is the intuition behind our hierarchical classification. In this work, we treat actions as tasks performed by either hand or leg movements. Instead of using a single 3-D CNN to classify the given actions, we use multiple networks that classify hierarchically: we first perform binary classification to separate hand actions from leg actions, and then use two separate networks, one for hand actions and one for leg actions, to classify among the target action categories. For example, for the KTH dataset, we train three networks to classify six different actions, comprising three hand actions and three leg actions. The novelty of our approach lies in separating hand and leg actions first, so that each subsequent classifier receives only features corresponding to either hands or legs. This leads to better classification accuracy. In addition, the 3-D CNN extracts features automatically in both the spatial and temporal domains, avoiding the need for hand-crafted features, which makes it well suited to video classification. We evaluate our method on the KTH, Weizmann, and UCF-sports datasets; a comparison with state-of-the-art methods shows that our approach outperforms most of them.
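
The two-stage routing described above can be illustrated with a minimal sketch. This is not the authors' implementation: the three networks are represented by arbitrary callables that return one score per class, and the function name, the label strings, and the convention that index 0 of the binary stage means "hand" are assumptions made only for illustration. In the paper, each stage would be a trained 3-D CNN.

```python
from typing import Any, Callable, Sequence

# The six KTH actions, grouped by the limb that dominates the motion.
# The exact label strings are an assumption for this sketch.
HAND_ACTIONS = ("boxing", "handclapping", "handwaving")
LEG_ACTIONS = ("walking", "jogging", "running")

# A scorer stands in for a trained network: it maps a video clip
# (any representation) to one score per class.
Scorer = Callable[[Any], Sequence[float]]


def argmax(scores: Sequence[float]) -> int:
    """Return the index of the highest score (first index on ties)."""
    return max(range(len(scores)), key=lambda i: scores[i])


def classify_hierarchically(
    clip: Any, root_net: Scorer, hand_net: Scorer, leg_net: Scorer
) -> str:
    """Route a clip through the binary stage, then through one branch."""
    # Stage 1: binary decision. Assumed convention: 0 = hand, 1 = leg.
    branch = argmax(root_net(clip))
    # Stage 2: only the network for the chosen limb sees the clip.
    if branch == 0:
        return HAND_ACTIONS[argmax(hand_net(clip))]
    return LEG_ACTIONS[argmax(leg_net(clip))]
```

With stub scorers in place of trained networks, a call such as `classify_hierarchically(clip, root, hand, leg)` returns a single action label. One consequence of this design is that an error in the binary stage cannot be corrected downstream, so the accuracy of the first network bounds the accuracy of the whole pipeline.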



Copyright information

© Springer Nature Singapore Pte Ltd. 2018

Authors and Affiliations

  1. Dhirubhai Ambani Institute of Information and Communication Technology, Gandhinagar, India
