First-Person Activity Recognition: Feature, Temporal Structure, and Prediction

Published in: International Journal of Computer Vision

Abstract

This paper discusses the problem of recognizing interaction-level human activities from a first-person viewpoint. The goal is to enable an observer (e.g., a robot or a wearable camera) to understand ‘what activity others are performing to it’ from continuous video input. These include friendly interactions such as ‘a person hugging the observer’ as well as hostile interactions like ‘punching the observer’ or ‘throwing objects at the observer’, whose videos involve a large amount of camera ego-motion caused by the physical interactions. The paper investigates multi-channel kernels that integrate global and local motion information, and presents a new activity learning/recognition methodology that explicitly considers the temporal structures displayed in first-person activity videos. Furthermore, we present a novel algorithm for early recognition (i.e., prediction) of activities from first-person videos, which allows us to infer ongoing activities at an early stage. In our experiments, we not only show classification results on segmented videos, but also confirm that our new approach is able to detect activities from continuous videos and perform early recognition reliably.
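To make the multi-channel idea concrete, the sketch below shows one way per-channel kernels over global motion descriptors (e.g., optical-flow histograms) and local motion descriptors (e.g., spatio-temporal bag-of-words histograms) can be combined into a single kernel for an SVM. This is an illustrative reconstruction, not the authors' exact formulation: the choice of an exponentiated χ² kernel, the uniform channel weights, and all function and variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.svm import SVC

def chi2_kernel(X, Y, gamma=1.0):
    """Exponentiated chi-square kernel between rows of X (n, d) and Y (m, d).

    Inputs are assumed to be nonnegative, L1-normalized histograms
    (e.g., global optical-flow histograms or local bag-of-words counts).
    """
    num = (X[:, None, :] - Y[None, :, :]) ** 2
    den = X[:, None, :] + Y[None, :, :] + 1e-10  # avoid division by zero
    dist = 0.5 * (num / den).sum(axis=2)         # pairwise chi-square distance
    return np.exp(-gamma * dist)

def multichannel_kernel(chans_a, chans_b, weights=None):
    """Weighted sum of per-channel kernels, one channel per descriptor type
    (hypothetical combination rule; weights default to uniform)."""
    if weights is None:
        weights = [1.0 / len(chans_a)] * len(chans_a)
    return sum(w * chi2_kernel(Xa, Xb)
               for w, Xa, Xb in zip(weights, chans_a, chans_b))

# Hypothetical usage: global_tr/local_tr are per-video histogram matrices.
# K_tr = multichannel_kernel([global_tr, local_tr], [global_tr, local_tr])
# clf = SVC(kernel="precomputed").fit(K_tr, labels)
# K_te = multichannel_kernel([global_te, local_te], [global_tr, local_tr])
# predictions = clf.predict(K_te)
```

Under the same assumed setup, early recognition could be sketched by computing the channel histograms over only the observed prefix of a video and matching them against models trained at the corresponding observation ratios; the paper itself should be consulted for the precise temporal-structure matching and prediction formulation.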



Acknowledgments

The research described in this paper was carried out at the Jet Propulsion Laboratory, California Institute of Technology, under a contract with the National Aeronautics and Space Administration. This research was sponsored by the Army Research Laboratory and was accomplished under Cooperative Agreement Number W911NF-10-2-0016.

Author information


Correspondence to M. S. Ryoo.

Additional information

Communicated by Ivan Laptev, Josef Sivic, and Deva Ramanan.

Electronic supplementary material

Below are the links to the electronic supplementary material.

Supplementary material 1 (avi 10088 KB)

Supplementary material 2 (pdf 220 KB)


About this article


Cite this article

Ryoo, M.S., Matthies, L. First-Person Activity Recognition: Feature, Temporal Structure, and Prediction. Int J Comput Vis 119, 307–328 (2016). https://doi.org/10.1007/s11263-015-0847-4

