Abstract
In this paper, we study the problem of recognizing human actions in the presence of a single egocentric camera and multiple static cameras. Some actions are better captured by static cameras, where the whole body of an actor and the context of the action are visible. Other actions are better recognized in egocentric cameras, where subtle hand movements and complex object interactions are visible. We introduce a model that benefits from the best of both worlds by learning to predict the importance of each camera for recognizing the action in each frame. By jointly and discriminatively learning latent camera-importance variables and action classifiers, our model achieves strong results on the challenging CMU-MMAC dataset. Our experiments show a significant gain from learning to use the cameras according to their predicted importance. The learned latent variables provide a level of scene understanding that enables automatic cinematography: smoothly switching between cameras so as to maximize the amount of relevant information in each frame.
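To make the latent camera-importance idea concrete, the sketch below shows one way per-frame camera selection could be inferred once per-camera action classifiers are available. This is a minimal illustration under stated assumptions, not the authors' implementation: the linear classifiers, the array shapes, and the switch_cost penalty (which encourages the smooth camera transitions used for automatic cinematography) are all assumed for the example.

```python
import numpy as np

# Hypothetical setup: T frames, C cameras, D-dim features per camera view.
# w[c] is an assumed per-camera linear classifier for one action class; the
# latent variable z_t in {0..C-1} selects which camera scores frame t.

def infer_camera_sequence(feats, w, switch_cost=1.0):
    """Viterbi-style inference of per-frame camera choices.

    feats: (T, C, D) array of per-frame, per-camera features.
    w:     (C, D) array of per-camera classifier weights.
    Returns the total score and the latent camera index for each frame,
    penalizing camera switches so the resulting "cut" stays smooth.
    """
    T, C, _ = feats.shape
    unary = np.einsum('tcd,cd->tc', feats, w)   # frame-by-camera scores
    best = unary[0].copy()                      # best path score ending at camera c
    back = np.zeros((T, C), dtype=int)          # backpointers for decoding
    for t in range(1, T):
        # trans[i, j]: score of being at camera i, then moving to camera j.
        trans = best[:, None] - switch_cost * (1 - np.eye(C))
        back[t] = trans.argmax(axis=0)
        best = trans.max(axis=0) + unary[t]
    # Backtrack the highest-scoring sequence of camera choices.
    z = np.zeros(T, dtype=int)
    z[-1] = best.argmax()
    for t in range(T - 1, 0, -1):
        z[t - 1] = back[t, z[t]]
    return best.max(), z
```

In such a formulation, training would treat the camera indices z as latent variables optimized jointly with the classifier weights, in the spirit of latent discriminative learning; at test time the same inference yields both an action score and a smooth camera cut for automatic cinematography.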