
Human action recognition with bag of visual words using different machine learning methods and hyperparameter optimization


Human activity recognition (HAR) has a wide range of applications, and this widespread use has motivated new studies aimed at improving HAR performance. In this study, HAR is carried out on the commonly used KTH and Weizmann datasets, as well as on a dataset that we created. Speeded up robust features (SURF) are extracted from these datasets, and these features are then reinforced with bag of visual words (BoVW). Unlike studies in the literature that use similar methods, SURF descriptors are extracted from binary images as well as grayscale images. Four machine learning (ML) methods, namely k-nearest neighbors, decision tree, support vector machine and naive Bayes, are used to classify the BoVW features, and hyperparameter optimization is used to set the hyperparameters of these methods. The ML methods are then compared with each other in terms of the activity recognition performance obtained with binary and grayscale image features. The results show that if the contrast of the environment decreases when a human enters the frame, SURF extracted from the binary image is more effective for HAR than SURF extracted from the grayscale image.
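The pipeline summarized above (local descriptors, a BoVW encoding, a classifier and a hyperparameter search) can be illustrated with a small toy sketch. It is not the authors' implementation. The descriptors are synthetic 4-dimensional vectors standing in for real SURF descriptors, the codebook is built with a plain k-means that is assumed here for illustration, and a leave-one-out search over k for a k-nearest-neighbors classifier stands in for the hyperparameter optimization. All function names, class labels and parameter values are hypothetical.

```python
import math
import random

def nearest(codebook, d):
    """Index of the codeword closest to descriptor d (squared Euclidean)."""
    return min(range(len(codebook)),
               key=lambda j: sum((a - b) ** 2 for a, b in zip(codebook[j], d)))

def kmeans(descriptors, k, iters=20, seed=0):
    """Plain k-means; the returned centres act as the visual vocabulary."""
    rng = random.Random(seed)
    codebook = [list(c) for c in rng.sample(descriptors, k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for d in descriptors:
            groups[nearest(codebook, d)].append(d)
        for j, g in enumerate(groups):
            if g:  # keep the old centre if a cluster is empty
                codebook[j] = [sum(col) / len(g) for col in zip(*g)]
    return codebook

def bovw_histogram(codebook, descriptors):
    """Normalized count of descriptors assigned to each visual word."""
    hist = [0.0] * len(codebook)
    for d in descriptors:
        hist[nearest(codebook, d)] += 1.0
    total = sum(hist) or 1.0
    return [h / total for h in hist]

def knn_predict(train_x, train_y, x, k):
    """Majority vote among the k nearest training histograms."""
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], x))
    votes = [train_y[i] for i in order[:k]]
    return max(sorted(set(votes)), key=votes.count)

def loo_accuracy(xs, ys, k):
    """Leave-one-out accuracy of k-NN for a given k."""
    hits = 0
    for i in range(len(xs)):
        rest_x, rest_y = xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:]
        hits += knn_predict(rest_x, rest_y, xs[i], k) == ys[i]
    return hits / len(xs)

# Synthetic stand-in for per-frame SURF descriptors (assumption, not real data).
rng = random.Random(42)
protos = [[0, 0, 0, 0], [4, 4, 0, 0], [0, 0, 4, 4], [4, 0, 4, 0]]

def toy_frame(label, n=40):
    """'walking' favours prototypes 0/1, 'boxing' favours prototypes 2/3."""
    favoured = (0, 1) if label == "walking" else (2, 3)
    frame = []
    for _ in range(n):
        if rng.random() < 0.8:
            p = protos[rng.choice(favoured)]
        else:
            p = protos[rng.randrange(4)]
        frame.append([rng.gauss(c, 0.3) for c in p])
    return frame

labels = ["walking"] * 10 + ["boxing"] * 10
frames = [toy_frame(lab) for lab in labels]

# 1) vocabulary from all descriptors, 2) one histogram per frame,
# 3) pick k by leave-one-out accuracy over a small candidate grid.
codebook = kmeans([d for f in frames for d in f], k=4)
features = [bovw_histogram(codebook, f) for f in frames]
scores = {k: loo_accuracy(features, labels, k) for k in (1, 3, 5)}
best_k = max(scores, key=scores.get)
print("leave-one-out accuracy per k:", scores, "selected k:", best_k)
```

In a real setting the synthetic frames would be replaced by descriptors computed from binary or grayscale video frames, and the same histogram features could be fed to the other classifiers named above, each with its own search space.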






The authors are thankful to RAC-LAB for providing the trial version of their commercial software for this study.

Author information


Corresponding author

Correspondence to Muhammet Fatih Aslan.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.


About this article


Cite this article

Aslan, M.F., Durdu, A. & Sabanci, K. Human action recognition with bag of visual words using different machine learning methods and hyperparameter optimization. Neural Comput & Applic 32, 8585–8597 (2020).



Keywords

  • Human activity recognition
  • Image processing
  • Speeded up robust features
  • Bag of visual words
  • Machine learning
  • k-Nearest neighbors
  • Decision tree
  • Support vector machine
  • Naive Bayes
  • Hyperparameter optimization