Abstract
This paper discusses the problem of recognizing interaction-level human activities from a first-person viewpoint. The goal is to enable an observer (e.g., a robot or a wearable camera) to understand ‘what activity others are performing to it’ from continuous video inputs. These activities include friendly interactions such as ‘a person hugging the observer’ as well as hostile interactions like ‘punching the observer’ or ‘throwing objects at the observer’, whose videos involve a large amount of camera ego-motion caused by physical interactions. We investigate multi-channel kernels to integrate global and local motion information, and present a new activity learning/recognition methodology that explicitly considers the temporal structures displayed in first-person activity videos. Furthermore, we present a novel algorithm for early recognition (i.e., prediction) of activities from first-person videos, which allows us to infer ongoing activities at their early stage. In our experiments, we not only show classification results with segmented videos, but also confirm that our new approach is able to detect activities from continuous videos and perform early recognition reliably.
Acknowledgments
The research described in this paper was carried out at the Jet Propulsion Laboratory, California Institute of Technology, under a contract with the National Aeronautics and Space Administration. This research was sponsored by the Army Research Laboratory and was accomplished under Cooperative Agreement Number W911NF-10-2-0016.
Additional information
Communicated by Ivan Laptev, Josef Sivic, and Deva Ramanan.
Cite this article
Ryoo, M.S., Matthies, L. First-Person Activity Recognition: Feature, Temporal Structure, and Prediction. Int J Comput Vis 119, 307–328 (2016). https://doi.org/10.1007/s11263-015-0847-4