A Study of Action Recognition Problems: Dataset and Architectures Perspectives

  • Bassel S. Chawky
  • A. S. Elons
  • A. Ali
  • Howida A. Shedeed
Part of the Studies in Computational Intelligence book series (SCI, volume 730)


The field of action recognition has grown dramatically in recent years owing to its importance in applications such as smart surveillance, human–computer interaction, assistance for elderly people, and web-video search and retrieval. Many research efforts have tackled action recognition as an open problem, and various datasets have been built to evaluate architecture variations. This survey explores action recognition datasets to highlight their ability to evaluate different models. In addition, a usage is proposed for each dataset based on the content and format of the data it includes, the number of classes, and the challenges it covers. On the other hand, different architectures are also explored, showing the contribution of each in handling the challenges of the action recognition problem and the scientific explanation behind their results. In total, 21 datasets and 13 architectures, both shallow and deep models, are covered.


Action/activity recognition · Machine learning · Computer vision · Action recognition architectures · Shallow models · Deep learning models



Copyright information

© Springer International Publishing AG 2018

Authors and Affiliations

  • Bassel S. Chawky (1)
  • A. S. Elons (1)
  • A. Ali (1)
  • Howida A. Shedeed (1)

  1. Faculty of Computer and Information Sciences, Scientific Computing Department, Ain Shams University, Cairo, Egypt
