A comprehensive survey of human action recognition with spatio-temporal interest point (STIP) detector

  • Survey
  • Published in The Visual Computer (2016)

Abstract

Over the past two decades, human action recognition from video has been an important area of research in computer vision. Its applications include surveillance systems, human–computer interaction and other real-world scenarios in which one of the actors is a human being. Several review papers have addressed human action recognition, but there remains a gap in the literature concerning methods built on spatio-temporal interest point (STIP) detectors. This paper presents a comprehensive review of STIP-based methods for human action recognition; STIP detectors are robust at detecting interest points from video in the spatio-temporal domain. The paper also summarizes related public datasets that are useful for comparing the performance of the various techniques.
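
As a brief illustration (a sketch in standard notation, not this paper's own), the best-known STIP detector, due to Laptev and Lindeberg, extends the Harris corner detector to space–time: from the first-order derivatives L_x, L_y, L_t of a video smoothed at spatial scale \sigma^2 and temporal scale \tau^2, a 3×3 spatio-temporal second-moment matrix is formed and a cornerness response is evaluated,

  \mu = g(\cdot;\, s\sigma^2, s\tau^2) \ast
        \begin{pmatrix}
          L_x^2   & L_x L_y & L_x L_t \\
          L_x L_y & L_y^2   & L_y L_t \\
          L_x L_t & L_y L_t & L_t^2
        \end{pmatrix},
  \qquad
  H = \det(\mu) - k\,\operatorname{trace}^3(\mu),

with k \approx 0.005. Space–time interest points are the positive local maxima of H over position and scale; other detectors covered in this survey differ mainly in the response function they use in place of H.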

Author information

Corresponding author: Debapratim Das Dawn.

About this article

Cite this article

Das Dawn, D., Shaikh, S.H. A comprehensive survey of human action recognition with spatio-temporal interest point (STIP) detector. Vis Comput 32, 289–306 (2016). https://doi.org/10.1007/s00371-015-1066-2
