Cost-Sensitive Top-Down/Bottom-Up Inference for Multiscale Activity Recognition
This paper addresses a new problem, that of multiscale activity recognition. Our goal is to detect and localize a wide range of activities, including individual actions and group activities, which may co-occur in high-resolution video. The video resolution allows for digital zoom-in (or zoom-out) to examine fine details (or coarser scales), as needed for recognition. The key challenge is to avoid running a multitude of detectors at all spatiotemporal scales while still arriving at a holistically consistent video interpretation. To this end, we use a three-layered AND-OR graph to jointly model group activities, individual actions, and participating objects. The AND-OR graph allows a principled formulation of efficient, cost-sensitive inference via an explore-exploit strategy. Our inference optimally schedules the following computational processes: 1) direct application of activity detectors (the α process); 2) bottom-up inference based on detecting activity parts (the β process); and 3) top-down inference based on detecting activity context (the γ process). The scheduling iteratively maximizes the log-posteriors of the resulting parse graphs. For evaluation, we have compiled and benchmarked a new dataset of high-resolution videos of group and individual activities co-occurring in a courtyard of the UCLA campus.
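The scheduling idea in the abstract can be illustrated with a minimal sketch. This is not the paper's actual inference algorithm; it is a hypothetical greedy scheduler in which each candidate step (an α detector run, a β bottom-up step, or a γ top-down step) is ranked by its expected gain in log-posterior per unit of computational cost, and steps are executed in that order within a fixed budget. All step names, gains, and costs below are made-up numbers for illustration.

```python
import heapq


def schedule(steps, budget):
    """Greedily execute inference steps with the best gain/cost ratio.

    steps: list of (name, expected_gain, cost) tuples, where expected_gain
    is a hypothetical increase in the parse graph's log-posterior.
    Returns the names of executed steps and the total accumulated gain.
    """
    # heapq is a min-heap, so negate the gain/cost ratio for max-first order.
    heap = [(-gain / cost, name, gain, cost) for name, gain, cost in steps]
    heapq.heapify(heap)
    executed, total_gain, spent = [], 0.0, 0.0
    while heap:
        _, name, gain, cost = heapq.heappop(heap)
        if spent + cost > budget:
            continue  # this step no longer fits in the remaining budget
        executed.append(name)
        total_gain += gain
        spent += cost
    return executed, total_gain


# Illustrative-only numbers: the alpha process is accurate but expensive,
# while beta and gamma are cheaper evidence-propagation steps.
steps = [
    ("alpha: run activity detector", 2.0, 4.0),
    ("beta: detect parts, infer bottom-up", 1.5, 1.0),
    ("gamma: detect context, infer top-down", 1.2, 1.5),
]
done, gain = schedule(steps, budget=3.0)
# With this budget, the cheaper beta and gamma steps are chosen and the
# expensive alpha step is skipped.
```

The paper's actual scheduler is formulated as explore-exploit inference over the AND-OR graph rather than this simple knapsack-style heuristic; the sketch only conveys why cheap β/γ steps can be preferred over exhaustively running α detectors at every scale.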