
Learning Latent Constituents for Recognition of Group Activities in Video

  • Borislav Antic
  • Björn Ommer
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8689)

Abstract

The collective activity of a group of persons is more than a mere sum of individual actions, since interactions and the context of the overall group behavior have a crucial influence. Consequently, the current standard paradigm for group activity recognition is to model the spatiotemporal pattern of individual person bounding boxes and their interactions. Despite this trend towards increasingly global representations, activities are often defined by semi-local characteristics and their interrelations between different persons. Capturing the large visual variability with small semi-local parts requires a large number of them, which renders manual annotation infeasible. To automatically learn activity constituents that are meaningful for the collective activity, we sample local parts and group related ones not merely by visual similarity but by the function they fulfill on a set of validation images. Max-margin multiple instance learning is then employed to jointly i) remove clutter from these groups and focus on only the relevant samples, ii) learn the activity constituents, and iii) train the multi-class activity classifier. Experiments on standard activity benchmarks show the advantage of this joint procedure and demonstrate the benefit of functionally grouped latent activity constituents for group activity recognition.
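The functional-grouping step described above can be sketched as follows: each candidate part is represented by its detector's response profile over a validation set, and parts with similar profiles — parts that fire on the same images, whatever they look like — are clustered together. The code below is an illustrative sketch of this idea under simplifying assumptions (k-means with cosine similarity over normalized response profiles, farthest-point initialization), not the authors' exact procedure; `functional_grouping` and its parameters are hypothetical names introduced here for illustration.

```python
import numpy as np

def functional_grouping(part_scores, n_groups, n_iters=20):
    """Group candidate parts by function: parts whose detectors respond to
    the same validation images end up in the same group, irrespective of
    their visual similarity.

    part_scores : (n_parts, n_validation) array of detector responses.
    Returns an array of group indices, one per part.
    """
    # Represent each part by its L2-normalized response profile.
    profiles = part_scores / (np.linalg.norm(part_scores, axis=1,
                                             keepdims=True) + 1e-8)

    # Farthest-point initialization: start from part 0, then repeatedly
    # pick the part least similar to every center chosen so far.
    centers = [profiles[0]]
    for _ in range(1, n_groups):
        sims = np.max(profiles @ np.array(centers).T, axis=1)
        centers.append(profiles[np.argmin(sims)])
    centers = np.array(centers)

    # Plain k-means over the response profiles (cosine similarity).
    for _ in range(n_iters):
        assign = np.argmax(profiles @ centers.T, axis=1)
        for k in range(n_groups):
            members = profiles[assign == k]
            if len(members):
                c = members.mean(axis=0)
                centers[k] = c / (np.linalg.norm(c) + 1e-8)
    return assign
```

Note that two visually dissimilar parts (say, a raised arm and a volleyball net) would land in the same group if they consistently respond to the same validation frames — which is precisely the point of grouping by function rather than appearance.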

Keywords

Group Activity Recognition · Latent Parts · Multiple-Instance Learning · Functional Grouping · Video Retrieval

Supplementary material

978-3-319-10590-1_3_MOESM1_ESM.avi: Electronic Supplementary Material (AVI 2,386 KB)
978-3-319-10590-1_3_MOESM2_ESM.avi: Electronic Supplementary Material (AVI 4,139 KB)
978-3-319-10590-1_3_MOESM3_ESM.avi: Electronic Supplementary Material (AVI 2,917 KB)
978-3-319-10590-1_3_MOESM4_ESM.avi: Electronic Supplementary Material (AVI 6,056 KB)
978-3-319-10590-1_3_MOESM5_ESM.avi: Electronic Supplementary Material (AVI 1,608 KB)
978-3-319-10590-1_3_MOESM6_ESM.avi: Electronic Supplementary Material (AVI 9,601 KB)


Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Borislav Antic¹
  • Björn Ommer¹
  1. HCI & IWR, University of Heidelberg, Germany
