International Journal of Computer Vision, Volume 119, Issue 3, pp 307–328

First-Person Activity Recognition: Feature, Temporal Structure, and Prediction

  • M. S. Ryoo
  • Larry Matthies


This paper discusses the problem of recognizing interaction-level human activities from a first-person viewpoint. The goal is to enable an observer (e.g., a robot or a wearable camera) to understand ‘what activity others are performing to it’ from continuous video inputs. These include friendly interactions such as ‘a person hugging the observer’ as well as hostile interactions like ‘punching the observer’ or ‘throwing objects at the observer’, whose videos involve a large amount of camera ego-motion caused by physical interactions. The paper investigates multi-channel kernels to integrate global and local motion information, and presents a new activity learning/recognition methodology that explicitly considers temporal structures displayed in first-person activity videos. Furthermore, we present a novel algorithm for early recognition (i.e., prediction) of activities from first-person videos, which allows us to infer ongoing activities at their early stage. In our experiments, we not only show classification results with segmented videos, but also confirm that our new approach is able to detect activities from continuous videos and perform early recognition reliably.
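As a rough illustration of the multi-channel kernel idea described above — combining a global-motion channel with a local-feature channel into a single kernel for classification — the sketch below sums per-channel χ² kernels over histogram features. This is a minimal, generic formulation, not the authors' exact method; the channel contents, the χ² kernel form, and the equal weighting are all assumptions for illustration.

```python
import numpy as np

def chi2_kernel(X, Y, eps=1e-10):
    """Exponential chi-square kernel between rows of X and rows of Y,
    where each row is a (non-negative) histogram descriptor."""
    K = np.zeros((X.shape[0], Y.shape[0]))
    for i, x in enumerate(X):
        for j, y in enumerate(Y):
            dist = np.sum((x - y) ** 2 / (x + y + eps))
            K[i, j] = np.exp(-0.5 * dist)
    return K

def multi_channel_kernel(channels_X, channels_Y, weights=None):
    """Combine per-channel kernels (e.g., one channel of global camera-motion
    histograms and one of local spatio-temporal feature histograms) into a
    single kernel by weighted summation. Weights default to uniform."""
    if weights is None:
        weights = [1.0 / len(channels_X)] * len(channels_X)
    return sum(w * chi2_kernel(X, Y)
               for w, X, Y in zip(weights, channels_X, channels_Y))

# Hypothetical example: 3 videos, two feature channels of histograms.
rng = np.random.default_rng(0)
global_motion = rng.random((3, 8))   # e.g., global optical-flow histograms
local_features = rng.random((3, 16)) # e.g., local spatio-temporal BoW histograms
K = multi_channel_kernel([global_motion, local_features],
                         [global_motion, local_features])
```

A summed kernel of valid per-channel kernels remains a valid kernel, so `K` can be handed directly to a kernel SVM as a precomputed Gram matrix.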





The research described in this paper was carried out at the Jet Propulsion Laboratory, California Institute of Technology, under a contract with the National Aeronautics and Space Administration. This research was sponsored by the Army Research Laboratory and was accomplished under Cooperative Agreement Number W911NF-10-2-0016.

Supplementary material

Supplementary material 1 (avi 10088 KB)

Supplementary material 2 (pdf 220 KB)



Copyright information

© Springer Science+Business Media New York 2015

Authors and Affiliations

  1. Jet Propulsion Laboratory, California Institute of Technology, Pasadena, USA
