
Dense Trajectories and Motion Boundary Descriptors for Action Recognition

Published in: International Journal of Computer Vision

Abstract

This paper introduces a video representation based on dense trajectories and motion boundary descriptors. Trajectories capture the local motion information of the video. A dense representation guarantees good coverage of foreground motion as well as of the surrounding context. A state-of-the-art optical flow algorithm enables robust and efficient extraction of dense trajectories. As descriptors we extract features aligned with the trajectories to characterize shape (point coordinates), appearance (histograms of oriented gradients) and motion (histograms of optical flow). Additionally, we introduce a descriptor based on motion boundary histograms (MBH), which relies on differential optical flow. The MBH descriptor is shown to consistently outperform other state-of-the-art descriptors, in particular on real-world videos that contain a significant amount of camera motion. We evaluate our video representation in the context of action classification on nine datasets, namely KTH, YouTube, Hollywood2, UCF Sports, IXMAS, UIUC, Olympic Sports, UCF50 and HMDB51. On all datasets our approach outperforms current state-of-the-art results.
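The core property of MBH described above is that it histograms the spatial gradients of each optical flow component separately, so locally constant flow (such as smooth camera motion) contributes nothing. A minimal NumPy sketch of this idea, assuming a precomputed dense flow field; the function name, bin count and normalisation are illustrative, not the paper's exact implementation:

```python
import numpy as np

def mbh_histograms(flow, n_bins=8):
    """Motion boundary histograms for a dense flow field.

    flow: (H, W, 2) array, u = flow[..., 0], v = flow[..., 1].
    Returns (mbh_x, mbh_y): one orientation histogram per flow
    component, built from the spatial gradients of that component,
    so constant (camera-induced) motion cancels out.
    """
    hists = []
    for c in range(2):                      # u component -> MBHx, v -> MBHy
        gy, gx = np.gradient(flow[..., c])  # spatial derivatives of the component
        mag = np.hypot(gx, gy)              # gradient magnitude (histogram weights)
        ang = np.arctan2(gy, gx) % (2 * np.pi)
        bins = np.minimum((ang / (2 * np.pi) * n_bins).astype(int), n_bins - 1)
        hist = np.bincount(bins.ravel(), weights=mag.ravel(), minlength=n_bins)
        s = hist.sum()
        hists.append(hist / s if s > 0 else hist)  # L1-normalise when non-empty
    return hists[0], hists[1]

# A purely translational flow field has zero motion boundaries:
const_flow = np.ones((32, 32, 2))
mbh_x, mbh_y = mbh_histograms(const_flow)
print(mbh_x.sum())  # 0.0 -- constant camera motion is suppressed
```

In the paper the histograms are computed per cell of a spatio-temporal grid around each trajectory rather than over the whole frame, but the gradient-of-flow construction is the same.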


Figs. 1–13 (available in the full text).


Notes

  1. http://lear.inrialpes.fr/software.

  2. http://opencv.willowgarage.com/wiki/.

  3. The code of SIFT detector and descriptor is from http://blogs.oregonstate.edu/hess/code/sift/.

  4. Note that splitting MBH into MBHx and MBHy results in slightly better performance.

  5. http://www.nada.kth.se/cvap/actions/.

  6. http://www.cs.ucf.edu/~liujg/YouTube_Action_dataset.html.

  7. Note that here we use the same dataset as Liu et al. (2009), whereas in Wang et al. (2011) we used a different version. This explains the difference in performance on the YouTube dataset.

  8. http://lear.inrialpes.fr/data.

  9. http://server.cs.ucf.edu/~vision/data.html.

  10. http://4drepository.inrialpes.fr/public/viewgroup/6.

  11. http://vision.stanford.edu/Datasets/OlympicSports/.

  12. http://vision.cs.uiuc.edu/projects/activity/.

  13. http://server.cs.ucf.edu/~vision/data/UCF50.rar.

  14. http://serre-lab.clps.brown.edu/resources/HMDB/.

  15. Note that we only consider the performance of the trajectory itself. Other information, such as gradient or optical flow, is not included.

  16. http://lmb.informatik.uni-freiburg.de/resources/binaries/pami2010Linux64.zip.

  17. http://www.irisa.fr/vista/Equipe/People/Laptev/download.html.

References

  • Anjum, N., & Cavallaro, A. (2008). Multifeature object trajectory clustering for video analysis. IEEE Transactions on Multimedia, 18, 1555–1564.

  • Bay, H., Tuytelaars, T., & Gool, L. V. (2006). SURF: Speeded up robust features. In European conference on computer vision.

  • Bhattacharya, S., Sukthankar, R., Jin, R., & Shah, M. (2011). A probabilistic representation for efficient large scale visual recognition tasks. In IEEE conference on computer vision and pattern recognition.

  • Bregonzio, M., Gong, S., & Xiang, T. (2009). Recognising action as clouds of space-time interest points. In IEEE conference on computer vision and pattern recognition.

  • Brendel, W., & Todorovic, S. (2010). Activities as time series of human postures. In European conference on computer vision.

  • Brendel, W., & Todorovic, S. (2011). Learning spatiotemporal graphs of human activities. In IEEE international conference on computer vision.

  • Brox, T., & Malik, J. (2010). Object segmentation by long term analysis of point trajectories. In European conference on computer vision.

  • Brox, T., & Malik, J. (2011). Large displacement optical flow: Descriptor matching in variational motion estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(3), 500–513.

  • Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. In IEEE conference on computer vision and pattern recognition.

  • Dalal, N., Triggs, B., & Schmid, C. (2006). Human detection using oriented histograms of flow and appearance. In European conference on computer vision.

  • Dollár, P., Rabaud, V., Cottrell, G., & Belongie, S. (2005). Behavior recognition via sparse spatio-temporal features. In IEEE workshop visual surveillance and performance evaluation of tracking and surveillance.

  • Farnebäck, G. (2003). Two-frame motion estimation based on polynomial expansion. In Proceedings of the Scandinavian conference on image analysis.

  • Fei-Fei, L., & Perona, P. (2005). A Bayesian hierarchical model for learning natural scene categories. In IEEE conference on computer vision and pattern recognition.

  • Gaidon, A., Harchaoui, Z., & Schmid, C. (2012). Recognizing activities with cluster-trees of tracklets. In British machine vision conference.

  • Gilbert, A., Illingworth, J., & Bowden, R. (2011). Action recognition using mined hierarchical compound features. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(5), 883–897.

  • Hervieu, A., Bouthemy, P., & Cadre, J. P. L. (2008). A statistical video content recognition method using invariant features on object trajectories. IEEE Transactions on Circuits and Systems for Video Technology, 18, 1533–1543.

  • Ikizler-Cinbis, N., & Sclaroff, S. (2010). Object, scene and actions: Combining multiple features for human action recognition. In European conference on computer vision.

  • Johnson, N., & Hogg, D. (1996). Learning the distribution of object trajectories for event recognition. Image and Vision Computing, 14, 609–615.

  • Junejo, I. N., Dexter, E., Laptev, I., & Pérez, P. (2011). View-independent action recognition from temporal self-similarities. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(1), 172–185.

  • Jung, C. R., Hennemann, L., & Musse, S. R. (2008). Event detection using trajectory clustering and 4-D histograms. IEEE Transactions on Circuits and Systems for Video Technology, 18, 1565–1575.

  • Kläser, A., Marszałek, M., & Schmid, C. (2008). A spatio-temporal descriptor based on 3D-gradients. In British machine vision conference.

  • Kläser, A., Marszałek, M., Laptev, I., & Schmid, C. (2010). Will person detection help bag-of-features action recognition? Tech. Rep. RR-7373, INRIA.

  • Kliper-Gross, O., Gurovich, Y., Hassner, T., & Wolf, L. (2012). Motion interchange patterns for action recognition in unconstrained videos. In European conference on computer vision.

  • Kovashka, A., & Grauman, K. (2010). Learning a hierarchy of discriminative space-time neighborhood features for human action recognition. In IEEE conference on computer vision and pattern recognition.

  • Kuehne, H., Jhuang, H., Garrote, E., Poggio, T., & Serre, T. (2011). HMDB: A large video database for human motion recognition. In IEEE international conference on computer vision (pp. 2556–2563).

  • Laptev, I. (2005). On space-time interest points. International Journal of Computer Vision, 64(2–3), 107–123.

  • Laptev, I., Marszałek, M., Schmid, C., & Rozenfeld, B. (2008). Learning realistic human actions from movies. In IEEE conference on computer vision and pattern recognition.

  • Lazebnik, S., Schmid, C., & Ponce, J. (2006). Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In IEEE conference on computer vision and pattern recognition.

  • Le, Q. V., Zou, W. Y., Yeung, S. Y., & Ng, A. Y. (2011). Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis. In IEEE conference on computer vision and pattern recognition.

  • Liu, J., Luo, J., & Shah, M. (2009). Recognizing realistic actions from videos in the wild. In IEEE conference on computer vision and pattern recognition.

  • Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2), 91–110.

  • Lu, W. C., Wang, Y. C. F., & Chen, C. S. (2010). Learning dense optical-flow trajectory patterns for video object extraction. In IEEE advanced video and signal based surveillance conference.

  • Lucas, B. D., & Kanade, T. (1981). An iterative image registration technique with an application to stereo vision. In International joint conference on artificial intelligence.

  • Marszałek, M., Laptev, I., & Schmid, C. (2009). Actions in context. In IEEE conference on computer vision and pattern recognition.

  • Matikainen, P., Hebert, M., & Sukthankar, R. (2009). Trajectons: Action recognition through the motion analysis of tracked features. In ICCV workshops on video-oriented object and event classification.

  • Messing, R., Pal, C., & Kautz, H. (2009). Activity recognition using the velocity histories of tracked keypoints. In IEEE international conference on computer vision.

  • Niebles, J. C., Chen, C. W., & Fei-Fei, L. (2010). Modeling temporal structure of decomposable motion segments for activity classification. In European conference on computer vision.

  • Nowak, E., Jurie, F., & Triggs, B. (2006). Sampling strategies for bag-of-features image classification. In European conference on computer vision.

  • Ojala, T., Pietikainen, M., & Maenpaa, T. (2002). Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(7), 971–987.

  • Piriou, G., Bouthemy, P., & Yao, J. F. (2006). Recognition of dynamic video contents with global probabilistic models of visual motion. IEEE Transactions on Image Processing, 15, 3418–3431.

  • Raptis, M., & Soatto, S. (2010). Tracklet descriptors for action modeling and video analysis. In European conference on computer vision.

  • Reddy, K. K., & Shah, M. (2012). Recognizing 50 human action categories of web videos. Machine Vision and Applications, 1–11.

  • Rodriguez, M. D., Ahmed, J., & Shah, M. (2008). Action MACH a spatio-temporal maximum average correlation height filter for action recognition. In IEEE conference on computer vision and pattern recognition.

  • Sadanand, S., & Corso, J. J. (2012). Action bank: A high-level representation of activity in video. In IEEE conference on computer vision and pattern recognition.

  • Sand, P., & Teller, S. (2008). Particle video: Long-range motion estimation using point trajectories. International Journal of Computer Vision, 80, 72–91.

  • Schüldt, C., Laptev, I., & Caputo, B. (2004). Recognizing human actions: A local SVM approach. In International conference on pattern recognition.

  • Scovanner, P., Ali, S., & Shah, M. (2007). A 3-dimensional SIFT descriptor and its application to action recognition. In ACM conference on multimedia.

  • Shi, J., & Tomasi, C. (1994). Good features to track. In IEEE conference on computer vision and pattern recognition.

  • Sun, J., Wu, X., Yan, S., Cheong, L. F., Chua, T. S., & Li, J. (2009). Hierarchical spatio-temporal context modeling for action recognition. In IEEE conference on computer vision and pattern recognition.

  • Sun, J., Mu, Y., Yan, S., & Cheong, L. F. (2010). Activity recognition using dense long-duration trajectories. In IEEE international conference on multimedia and expo.

  • Sundaram, N., Brox, T., & Keutzer, K. (2010). Dense point trajectories by GPU-accelerated large displacement optical flow. In European conference on computer vision.

  • Taylor, G. W., Fergus, R., LeCun, Y., & Bregler, C. (2010). Convolutional learning of spatio-temporal features. In European conference on computer vision.

  • Tran, D., & Sorokin, A. (2008). Human activity recognition with metric learning. In European conference on computer vision.

  • Uemura, H., Ishikawa, S., & Mikolajczyk, K. (2008). Feature tracking and motion compensation for action recognition. In British machine vision conference.

  • Ullah, M. M., Parizi, S. N., & Laptev, I. (2010). Improving bag-of-features action recognition with non-local cues. In British machine vision conference.

  • Wang, H., Kläser, A., Schmid, C., & Liu, C. L. (2011). Action recognition by dense trajectories. In IEEE conference on computer vision and pattern recognition.

  • Wang, H., Ullah, M. M., Kläser, A., Laptev, I., & Schmid, C. (2009). Evaluation of local spatio-temporal features for action recognition. In British machine vision conference.

  • Wang, X., Ma, K. T., Ng, G. W., & Grimson, W. E. L. (2008). Trajectory analysis and semantic region modeling using a nonparametric Bayesian model. In IEEE international conference on computer vision.

  • Weinland, D., Boyer, E., & Ronfard, R. (2007). Action recognition from arbitrary views using 3D exemplars. In IEEE international conference on computer vision.

  • Weinland, D., Ronfard, R., & Boyer, E. (2006). Free viewpoint action recognition using motion history volumes. Computer Vision and Image Understanding, 104(2), 249–257.

  • Willems, G., Tuytelaars, T., & Gool, L. V. (2008). An efficient dense and scale-invariant spatio-temporal interest point detector. In European conference on computer vision.

  • Wong, S. F., & Cipolla, R. (2007). Extracting spatiotemporal interest points using global information. In IEEE international conference on computer vision.

  • Wu, S., Oreifej, O., & Shah, M. (2011). Action recognition in videos acquired by a moving camera using motion decomposition of lagrangian particle trajectories. In IEEE international conference on computer vision.

  • Wu, X., Xu, D., Duan, L., & Luo, J. (2011). Action recognition using context and appearance distribution features. In IEEE conference on computer vision and pattern recognition.

  • Yeffet, L., & Wolf, L. (2009). Local trinary patterns for human action recognition. In IEEE international conference on computer vision.

  • Yuan, J., Liu, Z., & Wu, Y. (2011). Discriminative video pattern search for efficient action detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33, 1728–1743.

  • Zhang, J., Marszałek, M., Lazebnik, S., & Schmid, C. (2007). Local features and kernels for classification of texture and object categories: A comprehensive study. International Journal of Computer Vision, 73(2), 213–238.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China (NSFC) Grant 60825301, the National Basic Research Program of China (973 Program) Grant 2012CB316302, the Strategic Priority Research Program of the Chinese Academy of Sciences (Grant XDA06030300), as well as the joint Microsoft/INRIA project and the European integrated project AXES.

Author information

Corresponding author

Correspondence to Heng Wang.


Cite this article

Wang, H., Kläser, A., Schmid, C. et al. Dense Trajectories and Motion Boundary Descriptors for Action Recognition. Int J Comput Vis 103, 60–79 (2013). https://doi.org/10.1007/s11263-012-0594-8
