International Journal of Computer Vision, Volume 103, Issue 1, pp 60–79

Dense Trajectories and Motion Boundary Descriptors for Action Recognition

  • Heng Wang
  • Alexander Kläser
  • Cordelia Schmid
  • Cheng-Lin Liu
Abstract

This paper introduces a video representation based on dense trajectories and motion boundary descriptors. Trajectories capture the local motion information of the video. A dense representation ensures good coverage of foreground motion as well as of the surrounding context. A state-of-the-art optical flow algorithm enables robust and efficient extraction of dense trajectories. As descriptors we extract features aligned with the trajectories to characterize shape (point coordinates), appearance (histograms of oriented gradients) and motion (histograms of optical flow). Additionally, we introduce a descriptor based on motion boundary histograms (MBH), which rely on differential optical flow. The MBH descriptor is shown to consistently outperform other state-of-the-art descriptors, in particular on real-world videos that contain a significant amount of camera motion. We evaluate our video representation in the context of action classification on nine datasets, namely KTH, YouTube, Hollywood2, UCF sports, IXMAS, UIUC, Olympic Sports, UCF50 and HMDB51. On all datasets our approach outperforms current state-of-the-art results.
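The core idea behind MBH can be illustrated compactly: instead of histogramming the optical flow itself, one histograms the spatial derivatives of each flow component, so that constant (camera-induced) motion cancels out while motion boundaries around moving objects remain. The sketch below is illustrative only and not the authors' implementation; the function names (`mbh_histogram`, `mbh_descriptor`) are hypothetical, the flow field is synthetic, and the paper computes these histograms over spatio-temporal cells along trajectories rather than over the whole frame.

```python
import numpy as np

def mbh_histogram(flow_comp, n_bins=8):
    """Orientation histogram of the spatial derivatives of one optical-flow
    component: one half (MBHx or MBHy) of a frame-level MBH descriptor."""
    gy, gx = np.gradient(flow_comp)        # flow derivatives = motion boundaries
    mag = np.hypot(gx, gy)                 # boundary strength
    ang = np.arctan2(gy, gx) % (2 * np.pi) # boundary orientation in [0, 2*pi)
    bins = np.minimum((ang / (2 * np.pi) * n_bins).astype(int), n_bins - 1)
    h = np.bincount(bins.ravel(), weights=mag.ravel(), minlength=n_bins)
    return h / (h.sum() + 1e-8)            # L1-normalised, safe when all-zero

def mbh_descriptor(flow, n_bins=8):
    """Concatenate MBHx and MBHy histograms for a dense flow field (H, W, 2)."""
    return np.concatenate([mbh_histogram(flow[..., 0], n_bins),
                           mbh_histogram(flow[..., 1], n_bins)])

# Synthetic flow: uniform translation (camera motion) plus a moving square.
# The uniform part has zero spatial derivative, so only the square's
# boundaries contribute to the histogram.
flow = np.zeros((32, 32, 2), dtype=np.float32)
flow[..., 0] = 2.0             # constant horizontal flow: suppressed by MBH
flow[8:24, 8:24, 0] += 3.0     # moving square: creates motion boundaries
desc = mbh_descriptor(flow)    # 16-dim descriptor (MBHx + MBHy)
```

Because only flow derivatives enter the histogram, the uniform 2.0-pixel translation contributes nothing, which is exactly the camera-motion robustness the abstract attributes to MBH.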

Keywords

Action recognition · Dense trajectories · Motion boundary histograms

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China (NSFC) Grant 60825301, the National Basic Research Program of China (973 Program) Grant 2012CB316302, the Strategic Priority Research Program of the Chinese Academy of Sciences (Grant XDA06030300), as well as the joint Microsoft/INRIA project and the European integrated project AXES.


Copyright information

© Springer Science+Business Media New York 2013

Authors and Affiliations

  • Heng Wang (1)
  • Alexander Kläser (2)
  • Cordelia Schmid (2)
  • Cheng-Lin Liu (1)

  1. National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, China
  2. LEAR Team, INRIA Grenoble Rhône-Alpes, Montbonnot, France