
Human Action Recognition and Prediction: A Survey

Published in the International Journal of Computer Vision

Abstract

Driven by rapid advances in computer vision and machine learning, video analysis tasks have been moving from inferring the present state to predicting the future state. Vision-based action recognition and prediction from videos are two such tasks: action recognition infers human actions (the present state) from complete action executions, whereas action prediction forecasts human actions (the future state) from incomplete action executions. These two tasks have recently become particularly prevalent because of their rapidly emerging real-world applications, such as visual surveillance, autonomous driving, entertainment, and video retrieval. Over the last few decades, many efforts have been devoted to building robust and effective frameworks for action recognition and prediction. In this paper, we survey state-of-the-art techniques in action recognition and prediction, and systematically discuss existing models, popular algorithms, technical difficulties, popular action databases, evaluation protocols, and promising future directions.

Notes

  1. In this paper, action prediction refers to the task of predicting the action category, and motion prediction refers to the task of predicting the motion trajectory. Video prediction is not discussed in this paper because it focuses on motion in videos rather than the motion of humans.


Author information

Corresponding author

Correspondence to Yu Kong.

Additional information

Communicated by Boxin Shi.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Kong, Y., Fu, Y. Human Action Recognition and Prediction: A Survey. Int J Comput Vis 130, 1366–1401 (2022). https://doi.org/10.1007/s11263-022-01594-9

