International Journal of Computer Vision, Volume 121, Issue 1, pp 5–25

Spatially Coherent Interpretations of Videos Using Pattern Theory

  • Fillipe D. M. de Souza
  • Sudeep Sarkar
  • Anuj Srivastava
  • Jingyong Su


Activity interpretation in videos results not only in recognition or labeling of dominant activities, but also in semantic descriptions of scenes. Towards this broader goal, we present a combinatorial approach that assumes the availability of algorithms for detecting and labeling objects and basic actions in videos, albeit with some errors. Given these uncertain labels and detected objects, we link them into interpretable structures using domain knowledge, under the framework of Grenander’s general pattern theory. Here a semantic description is built from basic units, termed generators, that represent either objects or actions. These generators have multiple out-bonds, each associated with different types of domain semantics, spatial constraints, and image evidence. The generators combine, according to a set of pre-defined combination rules that capture domain semantics, to form larger configurations that represent video interpretations. This framework derives its representational power from flexibility in the size and structure of configurations. We impose a probability distribution on the configuration space, with inferences generated using a Markov chain Monte Carlo-based simulated annealing process. The primary advantage of the approach is that it handles known challenges (appearance variabilities, errors in object labels, object clutter, simultaneous events, etc.) without the need for exponentially large labeled training data. Experimental results demonstrate its ability to provide interpretations under clutter and simultaneity of events. They show: (1) a performance increase of more than 30% over other state-of-the-art approaches using more than 5000 video units from the Breakfast Actions dataset, and (2) an overall recall and precision improvement of more than 50% and 100%, respectively, on the YouCook dataset.
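The inference idea in the abstract can be illustrated with a small toy sketch. This is not the authors' implementation: the candidate labels, confidence scores, affinity values, bond layout, energy function, and cooling schedule below are all hypothetical, chosen only to show how uncertain labels ("generators") joined by bonds can be scored with an energy and searched by Metropolis-style simulated annealing.

```python
import math
import random

# Hypothetical detector output: each site (object region or action) has
# candidate labels with confidence scores, i.e. uncertain labels.
CANDIDATES = {
    "region_a": {"knife": 0.5, "spoon": 0.5},
    "region_b": {"bread": 0.6, "bowl": 0.4},
    "action_1": {"cut": 0.55, "stir": 0.45},
}

# Hypothetical domain knowledge: affinity of a bond (action label, object label).
AFFINITY = {
    ("cut", "knife"): 1.0, ("cut", "bread"): 1.0,
    ("stir", "spoon"): 1.0, ("stir", "bowl"): 1.0,
}

# Bonds link the action generator to each object generator.
BONDS = [("action_1", "region_a"), ("action_1", "region_b")]


def energy(config):
    """Lower is better: weak evidence and unsupported bonds raise the energy."""
    e = -sum(math.log(CANDIDATES[site][label]) for site, label in config.items())
    for a, b in BONDS:
        # Label pairs absent from the affinity table are penalised.
        e -= AFFINITY.get((config[a], config[b]), -1.0)
    return e


def anneal(steps=2000, t_start=2.0, t_end=0.05, seed=0):
    """Search label configurations with Metropolis moves under geometric cooling."""
    rng = random.Random(seed)
    config = {s: rng.choice(sorted(labels)) for s, labels in CANDIDATES.items()}
    best = dict(config)
    for step in range(steps):
        temp = t_start * (t_end / t_start) ** (step / (steps - 1))
        # Propose relabelling one randomly chosen site.
        site = rng.choice(sorted(CANDIDATES))
        proposal = dict(config)
        proposal[site] = rng.choice(sorted(CANDIDATES[site]))
        delta = energy(proposal) - energy(config)
        # Always accept improvements; accept worse moves with prob exp(-delta/T).
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            config = proposal
        if energy(config) < energy(best):
            best = dict(config)
    return best
```

Calling `anneal()` returns a dictionary mapping each site to a label. In this toy setting the ambiguous object label in `region_a` is resolved by its bonds to the action rather than by its own evidence, which is the kind of effect the abstract attributes to linking labels through domain knowledge. The configuration size here is fixed, whereas the paper emphasises configurations of flexible size and structure.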


Keywords: Activity detection, Pattern theory, Graphical methods, Compositional approach



This research was supported in part by NSF Grants 1217515 and 1217676.



Copyright information

© Springer Science+Business Media New York 2016

Authors and Affiliations

  • Fillipe D. M. de Souza (1)
  • Sudeep Sarkar (1)
  • Anuj Srivastava (2)
  • Jingyong Su (3)

  1. Department of Computer Science & Engineering, University of South Florida, Tampa, USA
  2. Department of Statistics, Florida State University, Tallahassee, USA
  3. Department of Mathematics & Statistics, Texas Tech University, Lubbock, USA
