First-Person Activity Recognition: Feature, Temporal Structure, and Prediction


This paper addresses the problem of recognizing interaction-level human activities from a first-person viewpoint. The goal is to enable an observer (e.g., a robot or a wearable camera) to understand ‘what activity others are performing to it’ from continuous video inputs. These activities include friendly interactions such as ‘a person hugging the observer’ as well as hostile interactions like ‘punching the observer’ or ‘throwing objects at the observer’, whose videos contain a large amount of camera ego-motion caused by the physical interaction. We investigate multi-channel kernels that integrate global and local motion information, and present a new activity learning/recognition methodology that explicitly considers the temporal structures displayed in first-person activity videos. Furthermore, we present a novel algorithm for early recognition (i.e., prediction) of activities from first-person videos, which allows us to infer ongoing activities at an early stage. Our experiments not only report classification results on segmented videos, but also confirm that the approach detects activities in continuous videos and performs early recognition reliably.
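To make the idea of a multi-channel kernel concrete, the following is a minimal sketch and not the paper's implementation. It assumes the chi-square multi-channel form K(x, y) = exp(-Σ_c D_c(x, y) / A_c) described by Zhang et al. (2007), where D_c is the chi-square distance between the histograms of channel c and A_c is a per-channel normalizer (typically the mean training distance). The channel names "global" and "local", the toy histograms, and the normalizer values are all hypothetical.

```python
import math

def chi_square_distance(h1, h2):
    """Chi-square distance between two equal-length histograms."""
    total = 0.0
    for a, b in zip(h1, h2):
        if a + b > 0:  # skip empty bins to avoid division by zero
            total += (a - b) ** 2 / (a + b)
    return 0.5 * total

def multi_channel_kernel(x, y, mean_dists):
    """Combine per-channel chi-square distances into one kernel value.

    x, y       : dict mapping channel name -> histogram (list of floats)
    mean_dists : dict mapping channel name -> normalizer A_c (assumed > 0)
    """
    exponent = sum(
        chi_square_distance(x[c], y[c]) / mean_dists[c] for c in mean_dists
    )
    return math.exp(-exponent)

# Hypothetical toy descriptors: one global-motion and one local-motion
# histogram per video.
video_a = {"global": [0.5, 0.5, 0.0], "local": [0.2, 0.8]}
video_b = {"global": [0.0, 0.5, 0.5], "local": [0.2, 0.8]}
mean_dists = {"global": 0.5, "local": 0.5}  # assumed normalizers

k_aa = multi_channel_kernel(video_a, video_a, mean_dists)
k_ab = multi_channel_kernel(video_a, video_b, mean_dists)
print(k_aa, k_ab)
```

In this sketch, identical videos yield a kernel value of 1, and the value decreases as the histograms of any channel diverge. A kernel matrix built this way could be passed to an SVM that accepts precomputed kernels.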


The research described in this paper was carried out at the Jet Propulsion Laboratory, California Institute of Technology, under a contract with the National Aeronautics and Space Administration. This research was sponsored by the Army Research Laboratory and was accomplished under Cooperative Agreement Number W911NF-10-2-0016.

Author information

Correspondence to M. S. Ryoo.

Additional information

Communicated by Ivan Laptev, Josef Sivic, and Deva Ramanan.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (avi 10088 KB)

Supplementary material 2 (pdf 220 KB)

Cite this article

Ryoo, M.S., Matthies, L. First-Person Activity Recognition: Feature, Temporal Structure, and Prediction. Int J Comput Vis 119, 307–328 (2016).


  • Activity Recognition
  • Video Segment
  • Onset Activity
  • Human Activity Recognition
  • Training Video