International Journal of Computer Vision, Volume 119, Issue 3, pp 239–253

EXMOVES: Mid-level Features for Efficient Action Recognition and Video Analysis

Abstract

In this paper we present EXMOVES, learned exemplar-based features for efficient recognition and analysis of actions in videos. The entries in our descriptor are produced by evaluating a set of movement classifiers over spatio-temporal volumes of the input video sequences. Each movement classifier is a simple exemplar-SVM trained on low-level features, i.e., an SVM learned using a single annotated positive space-time volume and a large number of unannotated videos. Our representation offers several advantages. First, since our mid-level features are learned from individual video exemplars, they require a minimal amount of supervision. Second, we show that simple linear classification models trained on our global video descriptor yield action recognition accuracy approaching the state-of-the-art but at orders of magnitude lower cost, since at test time no sliding window is necessary and linear models are efficient to train and test. This enables scalable action recognition, i.e., efficient classification of a large number of actions even in massive video databases. Third, we show the generality of our approach by training our mid-level descriptors from different low-level features and testing them on two distinct video analysis tasks: human activity recognition as well as action similarity labeling. Experiments on large-scale benchmarks demonstrate the accuracy and efficiency of our proposed method on both tasks.
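To make the described pipeline concrete, below is a minimal sketch in Python of an EXMOVES-style descriptor: a bank of linear exemplar classifiers is evaluated on pre-extracted low-level features of each space-time subvolume, the responses are max-pooled into one fixed-length vector per video, and a plain linear SVM is trained on those vectors. All names and shapes here (`exmoves_descriptor`, `W`, `b`, the feature dimension) are our own illustrative assumptions, not the authors' released code, and the paper's score calibration and pyramid pooling details are omitted.

```python
# Illustrative sketch of an EXMOVES-style pipeline; shapes and names are
# assumptions for exposition, not the authors' implementation.
import numpy as np
from sklearn.svm import LinearSVC

def exmoves_descriptor(volume_feats, W, b):
    """Mid-level descriptor for one video.

    volume_feats : (n_volumes, d) low-level features, one row per
                   space-time subvolume of the video.
    W            : (n_exemplars, d) exemplar-SVM hyperplanes.
    b            : (n_exemplars,) exemplar-SVM bias terms.

    Returns an (n_exemplars,) vector holding the max response of each
    movement classifier over all subvolumes (max pooling).
    """
    responses = volume_feats @ W.T + b   # (n_volumes, n_exemplars)
    return responses.max(axis=0)         # pool over space-time volumes

# Toy usage with random stand-ins for real videos and exemplars.
rng = np.random.default_rng(0)
d, n_exemplars, n_videos = 128, 64, 40
W = rng.standard_normal((n_exemplars, d))   # pretend exemplar-SVM weights
b = rng.standard_normal(n_exemplars)

X = np.stack([exmoves_descriptor(rng.standard_normal((30, d)), W, b)
              for _ in range(n_videos)])    # one descriptor per video
y = rng.integers(0, 2, n_videos)            # pretend binary action labels

# As in the abstract, a simple linear model suffices on the global descriptor.
clf = LinearSVC().fit(X, y)
print("training accuracy:", clf.score(X, y))
```

Note how the test-time cost scales with the number of exemplars and subvolumes rather than with an exhaustive sliding-window search, which is the efficiency argument made in the abstract.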

Keywords

Action recognition · Action similarity labeling · Video representation · Mid-level features

Copyright information

© Springer Science+Business Media New York 2016

Authors and Affiliations

Du Tran and Lorenzo Torresani, Computer Science Department, Dartmouth College, Hanover, USA
