Dense Trajectories and Motion Boundary Descriptors for Action Recognition
Abstract
This paper introduces a video representation based on dense trajectories and motion boundary descriptors. Trajectories capture the local motion information of the video. A dense representation guarantees a good coverage of foreground motion as well as of the surrounding context. A state-of-the-art optical flow algorithm enables a robust and efficient extraction of dense trajectories. As descriptors we extract features aligned with the trajectories to characterize shape (point coordinates), appearance (histograms of oriented gradients) and motion (histograms of optical flow). Additionally, we introduce a descriptor based on motion boundary histograms (MBH), which relies on differential optical flow. The MBH descriptor is shown to consistently outperform other state-of-the-art descriptors, in particular on real-world videos that contain a significant amount of camera motion. We evaluate our video representation in the context of action classification on nine datasets, namely KTH, YouTube, Hollywood2, UCF sports, IXMAS, UIUC, Olympic Sports, UCF50 and HMDB51. On all datasets our approach outperforms current state-of-the-art results.
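To make the MBH idea concrete: each component of the optical flow field is treated as a grayscale image, its spatial gradients are computed, and gradient orientations are accumulated into a magnitude-weighted histogram, so that constant (camera-induced) motion cancels out and only motion boundaries contribute. The following is a minimal sketch of that per-channel computation, assuming a precomputed flow field; the full method additionally aggregates such histograms over spatio-temporal cells along each trajectory.

```python
import numpy as np

def motion_boundary_histogram(flow, n_bins=8):
    """Sketch of an MBH descriptor for a single optical flow field.

    flow: H x W x 2 array holding the u and v flow components.
    Returns one L1-normalised n_bins histogram per flow component,
    built from the orientations of that component's spatial
    derivatives, weighted by gradient magnitude.
    """
    hists = []
    for c in range(2):  # treat u and v channels independently
        gy, gx = np.gradient(flow[..., c])            # spatial derivatives
        mag = np.hypot(gx, gy)                        # boundary strength
        ang = np.mod(np.arctan2(gy, gx), 2 * np.pi)   # orientation in [0, 2*pi)
        bins = (ang / (2 * np.pi) * n_bins).astype(int) % n_bins
        h = np.bincount(bins.ravel(), weights=mag.ravel(), minlength=n_bins)
        hists.append(h / (h.sum() + 1e-8))            # L1-normalise
    return hists  # [MBHx, MBHy]
```

Note how a spatially constant flow field (e.g. pure camera translation) yields zero gradients everywhere and hence an empty histogram, which is exactly why MBH is robust to camera motion.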
Keywords
Action recognition · Dense trajectories · Motion boundary histograms
Acknowledgments
This work was supported in part by the National Natural Science Foundation of China (NSFC) Grant 60825301, the National Basic Research Program of China (973 Program) Grant 2012CB316302, the Strategic Priority Research Program of the Chinese Academy of Sciences (Grant XDA06030300), as well as the joint Microsoft/INRIA project and the European integrated project AXES.