Dense Trajectories and Motion Boundary Descriptors for Action Recognition

International Journal of Computer Vision

Abstract

This paper introduces a video representation based on dense trajectories and motion boundary descriptors. Trajectories capture the local motion information of the video. A dense representation guarantees good coverage of foreground motion as well as of the surrounding context. A state-of-the-art optical flow algorithm enables robust and efficient extraction of dense trajectories. As descriptors we extract features aligned with the trajectories to characterize shape (point coordinates), appearance (histograms of oriented gradients) and motion (histograms of optical flow). Additionally, we introduce a descriptor based on motion boundary histograms (MBH), which relies on differential optical flow. The MBH descriptor consistently outperforms other state-of-the-art descriptors, in particular on real-world videos that contain a significant amount of camera motion. We evaluate our video representation in the context of action classification on nine datasets, namely KTH, YouTube, Hollywood2, UCF sports, IXMAS, UIUC, Olympic Sports, UCF50 and HMDB51. On all datasets our approach outperforms current state-of-the-art results.
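
The motion boundary idea is simple to sketch in code: spatially differentiate each component of a dense flow field, then build orientation histograms weighted by gradient magnitude, so that locally constant (e.g. camera-induced) motion vanishes and only motion boundaries remain. The Python/OpenCV fragment below is a minimal illustration under stated assumptions, not the authors' released implementation: Farnebäck flow stands in for the flow algorithm used in the paper, the histograms are computed over the whole frame rather than over the cells of a trajectory-aligned space-time volume, and `mbh_histograms` is an illustrative name.

```python
import cv2
import numpy as np

def mbh_histograms(flow, n_bins=8):
    """Orientation histograms of the spatial derivatives of each flow
    component, kept as separate MBHx / MBHy channels (cf. note 4).
    `flow` is an (H, W, 2) dense optical flow field."""
    channels = []
    for c in range(2):  # 0: horizontal component u, 1: vertical component v
        comp = flow[..., c].astype(np.float32)
        gx = cv2.Sobel(comp, cv2.CV_32F, 1, 0)   # d(comp)/dx
        gy = cv2.Sobel(comp, cv2.CV_32F, 0, 1)   # d(comp)/dy
        mag = np.hypot(gx, gy)                   # motion boundary strength
        ang = np.arctan2(gy, gx) % (2 * np.pi)   # motion boundary orientation
        hist, _ = np.histogram(ang, bins=n_bins, range=(0, 2 * np.pi),
                               weights=mag)
        channels.append(hist / (hist.sum() + 1e-9))  # L1 normalisation
    return channels  # [MBHx descriptor, MBHy descriptor]

# Two consecutive grayscale frames (dummy data for illustration);
# Farnebäck flow stands in for the flow algorithm used in the paper.
prev = np.random.randint(0, 256, (120, 160), dtype=np.uint8)
nxt = np.random.randint(0, 256, (120, 160), dtype=np.uint8)
flow = cv2.calcOpticalFlowFarneback(prev, nxt, None,
                                    0.5, 3, 15, 3, 5, 1.2, 0)
mbhx, mbhy = mbh_histograms(flow)
```

In the paper itself, such histograms are computed per cell of a space-time grid aligned with each trajectory and concatenated into the final descriptor.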

Notes

  1. http://lear.inrialpes.fr/software.

  2. http://opencv.willowgarage.com/wiki/.

  3. The code for the SIFT detector and descriptor is from http://blogs.oregonstate.edu/hess/code/sift/.

  4. Note that splitting MBH into separate MBHx and MBHy components results in slightly better performance.

  5. http://www.nada.kth.se/cvap/actions/.

  6. http://www.cs.ucf.edu/~liujg/YouTube_Action_dataset.html.

  7. Note that here we use the same dataset as Liu et al. (2009), whereas in Wang et al. (2011) we used a different version. This explains the difference in performance on the YouTube dataset.

  8. http://lear.inrialpes.fr/data.

  9. http://server.cs.ucf.edu/~vision/data.html.

  10. http://4drepository.inrialpes.fr/public/viewgroup/6.

  11. http://vision.stanford.edu/Datasets/OlympicSports/.

  12. http://vision.cs.uiuc.edu/projects/activity/.

  13. http://server.cs.ucf.edu/~vision/data/UCF50.rar.

  14. http://serre-lab.clps.brown.edu/resources/HMDB/.

  15. Note that we only consider the performance of the trajectory shape descriptor itself; other information, such as gradient or optical flow histograms, is not included. A minimal sketch of this descriptor is given after these notes.

  16. http://lmb.informatik.uni-freiburg.de/resources/binaries/pami2010Linux64.zip.

  17. http://www.irisa.fr/vista/Equipe/People/Laptev/download.html.
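
To make note 15 concrete, here is a minimal NumPy sketch of the trajectory shape descriptor evaluated there: the sequence of frame-to-frame displacement vectors of a tracked point, normalised by the sum of their magnitudes. The function name and the random-walk input are illustrative only; the paper's default trajectory length L = 15 is assumed.

```python
import numpy as np

def trajectory_shape(points):
    """Trajectory shape descriptor: frame-to-frame displacement
    vectors, normalised by the sum of their magnitudes."""
    pts = np.asarray(points, dtype=np.float32)   # (L + 1, 2) tracked positions
    disp = np.diff(pts, axis=0)                  # (L, 2) displacement vectors
    total = np.linalg.norm(disp, axis=1).sum()   # sum of displacement magnitudes
    return (disp / (total + 1e-9)).ravel()       # 2 * L values

# A trajectory of length L = 15 (16 tracked points); random walk as dummy data.
track = np.cumsum(np.random.randn(16, 2), axis=0)
descriptor = trajectory_shape(track)             # 30-dimensional
```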

References

  • Anjum, N., & Cavallaro, A. (2008). Multifeature object trajectory clustering for video analysis. IEEE Transactions on Circuits and Systems for Video Technology, 18, 1555–1564.

  • Bay, H., Tuytelaars, T., & Van Gool, L. (2006). SURF: Speeded up robust features. In European conference on computer vision.

  • Bhattacharya, S., Sukthankar, R., Jin, R., & Shah, M. (2011). A probabilistic representation for efficient large scale visual recognition tasks. In IEEE conference on computer vision and pattern recognition.

  • Bregonzio, M., Gong, S., & Xiang, T. (2009). Recognising action as clouds of space-time interest points. In IEEE conference on computer vision and pattern recognition.

  • Brendel, W., & Todorovic, S. (2010). Activities as time series of human postures. In European conference on computer vision.

  • Brendel, W., & Todorovic, S. (2011). Learning spatiotemporal graphs of human activities. In IEEE international conference on computer vision.

  • Brox, T., & Malik, J. (2010). Object segmentation by long term analysis of point trajectories. In European conference on computer vision.

  • Brox, T., & Malik, J. (2011). Large displacement optical flow: Descriptor matching in variational motion estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(3), 500–513.

  • Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. In IEEE conference on computer vision and pattern recognition.

  • Dalal, N., Triggs, B., & Schmid, C. (2006). Human detection using oriented histograms of flow and appearance. In European conference on computer vision.

  • Dollár, P., Rabaud, V., Cottrell, G., & Belongie, S. (2005). Behavior recognition via sparse spatio-temporal features. In IEEE workshop visual surveillance and performance evaluation of tracking and surveillance.

  • Farnebäck, G. (2003). Two-frame motion estimation based on polynomial expansion. In Proceedings of the Scandinavian conference on image analysis.

  • Fei-Fei, L., & Perona, P. (2005). A Bayesian hierarchical model for learning natural scene categories. In IEEE conference on computer vision and pattern recognition.

  • Gaidon, A., Harchaoui, Z., & Schmid, C. (2012). Recognizing activities with cluster-trees of tracklets. In British machine vision conference.

  • Gilbert, A., Illingworth, J., & Bowden, R. (2011). Action recognition using mined hierarchical compound features. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(5), 883–897.

  • Hervieu, A., Bouthemy, P., & Cadre, J. P. L. (2008). A statistical video content recognition method using invariant features on object trajectories. IEEE Transactions on Circuits and Systems for Video Technology, 18, 1533–1543.

  • Ikizler-Cinbis, N., & Sclaroff, S. (2010). Object, scene and actions: Combining multiple features for human action recognition. In European conference on computer vision.

  • Johnson, N., & Hogg, D. (1996). Learning the distribution of object trajectories for event recognition. Image and Vision Computing, 14, 609–615.

  • Junejo, I. N., Dexter, E., Laptev, I., & Pérez, P. (2011). View-independent action recognition from temporal self-similarities. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(1), 172–185.

  • Jung, C. R., Hennemann, L., & Musse, S. R. (2008). Event detection using trajectory clustering and 4-D histograms. IEEE Transactions on Circuits and Systems for Video Technology, 18, 1565–1575.

  • Kläser, A., Marszałek, M., & Schmid, C. (2008). A spatio-temporal descriptor based on 3D-gradients. In British machine vision conference.

  • Kläser, A., Marszałek, M., Laptev, I., & Schmid, C. (2010). Will person detection help bag-of-features action recognition? Tech. Rep. RR-7373, INRIA.

  • Kliper-Gross, O., Gurovich, Y., Hassner, T., & Wolf, L. (2012). Motion interchange patterns for action recognition in unconstrained videos. In European conference on computer vision.

  • Kovashka, A., & Grauman, K. (2010). Learning a hierarchy of discriminative space-time neighborhood features for human action recognition. In IEEE conference on computer vision and pattern recognition.

  • Kuehne, H., Jhuang, H., Garrote, E., Poggio, T., & Serre, T. (2011). HMDB: A large video database for human motion recognition. In IEEE international conference on computer vision (pp. 2556–2563).

  • Laptev, I. (2005). On space-time interest points. International Journal of Computer Vision, 64(2–3), 107–123.

  • Laptev, I., Marszałek, M., Schmid, C., & Rozenfeld, B. (2008). Learning realistic human actions from movies. In IEEE conference on computer vision and pattern recognition.

  • Lazebnik, S., Schmid, C., & Ponce, J. (2006). Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In IEEE conference on computer vision and pattern recognition.

  • Le, Q. V., Zou, W. Y., Yeung, S. Y., & Ng, A. Y. (2011). Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis. In IEEE conference on computer vision and pattern recognition.

  • Liu, J., Luo, J., & Shah, M. (2009). Recognizing realistic actions from videos in the wild. In IEEE conference on computer vision and pattern recognition.

  • Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2), 91–110.

  • Lu, W. C., Wang, Y. C. F., & Chen, C. S. (2010). Learning dense optical-flow trajectory patterns for video object extraction. In IEEE advanced video and signal based surveillance conference.

  • Lucas, B. D., & Kanade, T. (1981). An iterative image registration technique with an application to stereo vision. In International joint conference on artificial intelligence.

  • Marszałek, M., Laptev, I., & Schmid, C. (2009). Actions in context. In IEEE conference on computer vision and pattern recognition.

  • Matikainen, P., Hebert, M., & Sukthankar, R. (2009). Trajectons: Action recognition through the motion analysis of tracked features. In ICCV workshops on video-oriented object and event classification.

  • Messing, R., Pal, C., & Kautz, H. (2009). Activity recognition using the velocity histories of tracked keypoints. In IEEE international conference on computer vision.

  • Niebles, J. C., Chen, C. W., & Fei-Fei, L. (2010). Modeling temporal structure of decomposable motion segments for activity classification. In European conference on computer vision.

  • Nowak, E., Jurie, F., & Triggs, B. (2006). Sampling strategies for bag-of-features image classification. In European conference on computer vision.

  • Ojala, T., Pietikainen, M., & Maenpaa, T. (2002). Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(7), 971–987.

  • Piriou, G., Bouthemy, P., & Yao, J. F. (2006). Recognition of dynamic video contents with global probabilistic models of visual motion. IEEE Transactions on Image Processing, 15, 3418–3431.

  • Raptis, M., & Soatto, S. (2010). Tracklet descriptors for action modeling and video analysis. In European conference on computer vision.

  • Reddy, K. K., & Shah, M. (2012). Recognizing 50 human action categories of web videos. Machine Vision and Applications, 1–11.

  • Rodriguez, M. D., Ahmed, J., & Shah, M. (2008). Action MACH a spatio-temporal maximum average correlation height filter for action recognition. In IEEE conference on computer vision and pattern recognition.

  • Sadanand, S., & Corso, J. J. (2012). Action bank: A high-level representation of activity in video. In IEEE conference on computer vision and pattern recognition.

  • Sand, P., & Teller, S. (2008). Particle video: Long-range motion estimation using point trajectories. International Journal of Computer Vision, 80, 72–91.

  • Schüldt, C., Laptev, I., & Caputo, B. (2004). Recognizing human actions: A local SVM approach. In International conference on pattern recognition.

  • Scovanner, P., Ali, S., & Shah, M. (2007). A 3-dimensional SIFT descriptor and its application to action recognition. In ACM conference on multimedia.

  • Shi, J., & Tomasi, C. (1994). Good features to track. In IEEE conference on computer vision and pattern recognition.

  • Sun, J., Wu, X., Yan, S., Cheong, L. F., Chua, T. S., & Li, J. (2009). Hierarchical spatio-temporal context modeling for action recognition. In IEEE conference on computer vision and pattern recognition.

  • Sun, J., Mu, Y., Yan, S., & Cheong, L. F. (2010). Activity recognition using dense long-duration trajectories. In IEEE international conference on multimedia and expo.

  • Sundaram, N., Brox, T., & Keutzer, K. (2010). Dense point trajectories by GPU-accelerated large displacement optical flow. In European conference on computer vision.

  • Taylor, G. W., Fergus, R., LeCun, Y., & Bregler, C. (2010). Convolutional learning of spatio-temporal features. In European conference on computer vision.

  • Tran, D., & Sorokin, A. (2008). Human activity recognition with metric learning. In European conference on computer vision.

  • Uemura, H., Ishikawa, S., & Mikolajczyk, K. (2008). Feature tracking and motion compensation for action recognition. In British machine vision conference.

  • Ullah, M. M., Parizi, S. N., & Laptev, I. (2010). Improving bag-of-features action recognition with non-local cues. In British machine vision conference.

  • Wang, H., Kläser, A., Schmid, C., & Liu, C. L. (2011). Action recognition by dense trajectories. In IEEE conference on computer vision and pattern recognition.

  • Wang, H., Ullah, M. M., Kläser, A., Laptev, I., & Schmid, C. (2009). Evaluation of local spatio-temporal features for action recognition. In British machine vision conference.

  • Wang, X., Ma, K. T., Ng, G. W., & Grimson, W. E. L. (2008). Trajectory analysis and semantic region modeling using a nonparametric Bayesian model. In IEEE international conference on computer vision.

  • Weinland, D., Boyer, E., & Ronfard, R. (2007). Action recognition from arbitrary views using 3D exemplars. In IEEE international conference on computer vision.

  • Weinland, D., Ronfard, R., & Boyer, E. (2006). Free viewpoint action recognition using motion history volumes. Computer Vision and Image Understanding, 104(2), 249–257.

  • Willems, G., Tuytelaars, T., & Van Gool, L. (2008). An efficient dense and scale-invariant spatio-temporal interest point detector. In European conference on computer vision.

  • Wong, S. F., & Cipolla, R. (2007). Extracting spatiotemporal interest points using global information. In IEEE international conference on computer vision.

  • Wu, S., Oreifej, O., & Shah, M. (2011). Action recognition in videos acquired by a moving camera using motion decomposition of Lagrangian particle trajectories. In IEEE international conference on computer vision.

  • Wu, X., Xu, D., Duan, L., & Luo, J. (2011). Action recognition using context and appearance distribution features. In IEEE conference on computer vision and pattern recognition.

  • Yeffet, L., & Wolf, L. (2009). Local trinary patterns for human action recognition. In IEEE international conference on computer vision.

  • Yuan, J., Liu, Z., & Wu, Y. (2011). Discriminative video pattern search for efficient action detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33, 1728–1743.

  • Zhang, J., Marszałek, M., Lazebnik, S., & Schmid, C. (2007). Local features and kernels for classification of texture and object categories: A comprehensive study. International Journal of Computer Vision, 73(2), 213–238.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China (NSFC) Grant 60825301, the National Basic Research Program of China (973 Program) Grant 2012CB316302, the Strategic Priority Research Program of the Chinese Academy of Sciences (Grant XDA06030300), as well as the joint Microsoft/INRIA project and the European integrated project AXES.

Author information

Correspondence to Heng Wang.

Cite this article

Wang, H., Kläser, A., Schmid, C. et al. Dense Trajectories and Motion Boundary Descriptors for Action Recognition. Int J Comput Vis 103, 60–79 (2013). https://doi.org/10.1007/s11263-012-0594-8
