Abstract
Over the past two decades, human action recognition from video has been an important area of research in computer vision. Its applications include surveillance systems, human–computer interaction, and other real-world scenarios in which one of the actors is a human being. Several researchers have published reviews in the context of human action recognition. However, there is a gap in the literature when it comes to the methodologies of STIP-based detectors for human action recognition. This paper presents a comprehensive review of STIP-based methods for human action recognition; STIP-based detectors are robust in detecting interest points from video in the spatio-temporal domain. The paper also summarizes related public datasets useful for comparing the performance of the various techniques.
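To illustrate how a STIP detector operates on a video volume, the following is a minimal sketch of one classic approach, the periodic (cuboid) detector of Dollár et al.: spatial Gaussian smoothing followed by a quadrature pair of 1-D temporal Gabor filters, with interest points taken at local maxima of the response. This NumPy/SciPy sketch and its parameter choices (`sigma`, `tau`, and the frequency tied to `tau`) are illustrative assumptions, not a reference implementation of any method surveyed here.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter

def cuboid_response(video, sigma=2.0, tau=1.5):
    """Response of the periodic (cuboid) detector: spatial Gaussian
    smoothing, then a quadrature pair of 1-D temporal Gabor filters,
    R = (I * g * h_ev)^2 + (I * g * h_od)^2."""
    # video: grayscale volume of shape (T, H, W)
    smoothed = np.stack([gaussian_filter(f, sigma) for f in video])
    t = np.arange(-int(4 * tau), int(4 * tau) + 1)
    omega = 4.0 / tau                       # temporal frequency tied to the scale tau
    env = np.exp(-t ** 2 / tau ** 2)        # Gaussian envelope of the Gabor pair
    h_ev = -np.cos(2 * np.pi * t * omega) * env
    h_od = -np.sin(2 * np.pi * t * omega) * env
    h_ev -= h_ev.mean()                     # remove DC so static regions give zero response
    # convolve along the temporal axis only
    ev = np.apply_along_axis(np.convolve, 0, smoothed, h_ev, mode="same")
    od = np.apply_along_axis(np.convolve, 0, smoothed, h_od, mode="same")
    return ev ** 2 + od ** 2

def detect_stips(video, sigma=2.0, tau=1.5, n_points=10):
    """Return (t, y, x) locations of the strongest local maxima of R."""
    R = cuboid_response(video, sigma, tau)
    # local maxima of R in the 3-D spatio-temporal volume
    peaks = (R == maximum_filter(R, size=5)) & (R > 1e-9)
    coords = np.argwhere(peaks)             # row-major order matches R[peaks]
    order = np.argsort(R[peaks])[::-1][:n_points]
    return coords[order]
```

On a synthetic video containing a briefly flashing blob, the top-ranked points cluster around the blob's spatio-temporal location, which is the behavior STIP detectors exploit: motion events, not static appearance, trigger the response.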
Das Dawn, D., Shaikh, S.H. A comprehensive survey of human action recognition with spatio-temporal interest point (STIP) detector. Vis Comput 32, 289–306 (2016). https://doi.org/10.1007/s00371-015-1066-2