Cost-Sensitive Top-Down/Bottom-Up Inference for Multiscale Activity Recognition

  • Mohamed R. Amer
  • Dan Xie
  • Mingtian Zhao
  • Sinisa Todorovic
  • Song-Chun Zhu
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7575)


This paper addresses a new problem, that of multiscale activity recognition. Our goal is to detect and localize a wide range of activities, including individual actions and group activities, which may simultaneously co-occur in high-resolution video. The video resolution allows for digital zoom-in (or zoom-out) for examining fine details (or coarser scales), as needed for recognition. The key challenge is how to avoid running a multitude of detectors at all spatiotemporal scales, and yet arrive at a holistically consistent video interpretation. To this end, we use a three-layered AND-OR graph to jointly model group activities, individual actions, and participating objects. The AND-OR graph allows a principled formulation of efficient, cost-sensitive inference via an explore-exploit strategy. Our inference optimally schedules the following computational processes: 1) direct application of activity detectors – called α process; 2) bottom-up inference based on detecting activity parts – called β process; and 3) top-down inference based on detecting activity context – called γ process. The scheduling iteratively maximizes the log-posteriors of the resulting parse graphs. For evaluation, we have compiled and benchmarked a new dataset of high-resolution videos of group and individual activities co-occurring in a courtyard of the UCLA campus.
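The explore-exploit scheduling of the α, β, and γ processes can be illustrated with a small sketch. The abstract does not specify the scheduling rule, so the code below substitutes a generic epsilon-greedy policy: it usually runs the process with the best observed log-posterior gain per unit cost and occasionally tries another one. The `Process` class, the `schedule` function, the cost values, and the simulated gain lists are all hypothetical names and numbers chosen for illustration; they are not taken from the paper.

```python
import random
from dataclasses import dataclass


@dataclass
class Process:
    name: str     # "alpha", "beta", or "gamma"
    cost: float   # hypothetical compute cost of one run
    gains: list   # simulated log-posterior gains of successive runs (assumed)


def schedule(processes, budget, epsilon=0.1, seed=0):
    """Epsilon-greedy, cost-sensitive scheduling (a stand-in, not the paper's rule).

    Returns the order in which processes were run and the accumulated
    log-posterior gain, stopping when no remaining run fits the budget.
    """
    rng = random.Random(seed)
    totals = {p.name: 0.0 for p in processes}  # summed gain per process
    runs = {p.name: 0 for p in processes}      # number of runs per process
    total_gain = 0.0
    trace = []
    while True:
        affordable = [
            p for p in processes
            if p.cost <= budget and runs[p.name] < len(p.gains)
        ]
        if not affordable:
            break
        untried = [p for p in affordable if runs[p.name] == 0]
        if untried:
            # Run each process once before comparing them.
            choice = untried[0]
        elif rng.random() < epsilon:
            # Explore: pick any affordable process at random.
            choice = rng.choice(affordable)
        else:
            # Exploit: best mean gain per unit cost observed so far.
            choice = max(
                affordable,
                key=lambda p: totals[p.name] / runs[p.name] / p.cost,
            )
        gain = choice.gains[runs[choice.name]]
        totals[choice.name] += gain
        runs[choice.name] += 1
        budget -= choice.cost
        total_gain += gain
        trace.append(choice.name)
    return trace, total_gain


if __name__ == "__main__":
    demo = [
        Process("alpha", cost=1.0, gains=[0.5, 0.4, 0.3]),  # direct detector
        Process("beta", cost=2.0, gains=[2.0, 1.5]),        # bottom-up, from parts
        Process("gamma", cost=4.0, gains=[1.0, 0.5]),       # top-down, from context
    ]
    order, gained = schedule(demo, budget=10.0, epsilon=0.0)
    print(order, round(gained, 3))
```

Setting `epsilon=0.0` makes the run purely greedy and deterministic, which is convenient for checking the behaviour; a positive `epsilon` adds the exploration step. In a real system the gains would come from running detectors on video rather than from fixed lists.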


Keywords (machine-generated, not supplied by the authors): Group Activity, Activity Recognition, Child Node, Coarse Scale, Descriptor Vector



Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Mohamed R. Amer (1)
  • Dan Xie (2)
  • Mingtian Zhao (2)
  • Sinisa Todorovic (1)
  • Song-Chun Zhu (2)

  1. Oregon State University, Corvallis, USA
  2. University of California, Los Angeles, USA
