International Journal of Computer Vision, Volume 122, Issue 2, pp 334–370

Complex Activity Recognition Via Attribute Dynamics

Abstract

The problem of modeling the dynamic structure of human activities is considered. Video is mapped to a semantic feature space that encodes activity attribute probabilities over time. The binary dynamic system (BDS) model is proposed to jointly learn the distribution and dynamics of activities in this space. This is a non-linear dynamic system that combines binary observation variables with a hidden Gauss–Markov state process, extending both binary principal component analysis and the classical linear dynamical system. A BDS learning algorithm, inspired by the popular dynamic texture model, and a dissimilarity measure between BDSs, which generalizes the Binet–Cauchy kernel, are introduced. To enable the recognition of highly non-stationary activities, the BDS is embedded in a bag-of-words representation. An algorithm is introduced for learning a BDS codebook, enabling the use of the BDS as a visual word for attribute dynamics (WAD). Short-term video segments are then quantized with a WAD codebook, allowing the representation of video as a bag of words for attribute dynamics. Video sequences are finally encoded as vectors of locally aggregated descriptors, which summarize the first moments of video snippets on the BDS manifold. Experiments show that this representation achieves state-of-the-art performance on the tasks of complex activity recognition and event identification.
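The generative structure sketched in the abstract — a hidden Gauss–Markov state driving binary attribute observations — can be illustrated with a minimal simulation. All parameter values below (the transition matrix A, observation map C, bias b, and noise scale) are illustrative assumptions, not the model parameters learned in the paper:

```python
# Minimal sketch of the BDS generative process:
# x_{t+1} = A x_t + v_t,  v_t ~ N(0, s^2 I)     (hidden Gauss-Markov state)
# y_t ~ Bernoulli(sigmoid(C x_t + b))           (binary attribute vector)
import numpy as np

rng = np.random.default_rng(0)

def sample_bds(A, C, b, x0, T, state_noise=0.1):
    """Sample T steps of hidden states and binary attribute observations."""
    x = x0
    states, observations = [], []
    for _ in range(T):
        p = 1.0 / (1.0 + np.exp(-(C @ x + b)))       # attribute probabilities
        y = (rng.random(p.shape) < p).astype(int)    # binary attribute draw
        states.append(x)
        observations.append(y)
        x = A @ x + state_noise * rng.standard_normal(x.shape)
    return np.array(states), np.array(observations)

# Toy instance: 2-d hidden state, 4 binary attributes, 10 time steps.
A = np.array([[0.9, -0.1], [0.1, 0.9]])   # stable state transition
C = rng.standard_normal((4, 2))           # state-to-attribute map
b = np.zeros(4)
states, obs = sample_bds(A, C, b, x0=np.ones(2), T=10)
print(obs.shape)  # (10, 4): one binary attribute vector per time step
```

In the paper's setting the attribute probabilities come from semantic classifiers applied to video, and A, C, b are learned from data; this toy draws them at random purely to show the two-layer structure (Gaussian state dynamics, Bernoulli observations).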

Keywords

Complex activity · Attribute · Dynamical model · Variational inference · Fisher score


Copyright information

© Springer Science+Business Media New York 2016

Authors and Affiliations

ECE Department, University of California, San Diego, La Jolla, USA
