International Journal of Computer Vision

, Volume 119, Issue 3, pp 254–271 | Cite as

MoFAP: A Multi-level Representation for Action Recognition

  • Limin Wang
  • Yu Qiao
  • Xiaoou Tang


This paper proposes a multi-level video representation by stacking the activations of motion features, atoms, and phrases (MoFAP). Motion features refer to those low-level local descriptors, while motion atoms and phrases can be viewed as mid-level “temporal parts”. Motion atom is defined as an atomic part of action, and captures the motion information of video in a short temporal scale. Motion phrase is a temporal composite of multiple motion atoms defined with an AND/OR structure. It further enhances the discriminative capacity of motion atoms by incorporating temporal structure in a longer temporal scale. Specifically, we first design a discriminative clustering method to automatically discover a set of representative motion atoms. Then, we mine effective motion phrases with high discriminative and representative capacity in a bottom-up manner. Based on these basic units of motion features, atoms, and phrases, we construct a MoFAP network by stacking them layer by layer. This MoFAP network enables us to extract the effective representation of video data from different levels and scales. The separate representations from motion features, motion atoms, and motion phrases are concatenated as a whole one, called Activation of MoFAP. The effectiveness of this representation is demonstrated on four challenging datasets: Olympic Sports, UCF50, HMDB51, and UCF101. Experimental results show that our representation achieves the state-of-the-art performance on these datasets.


Action recognition Motion Feature Motion Atom  Motion Phrase 


  1. Aggarwal, J. K., & Ryoo, M. S. (2011). Human activity analysis: A review. ACM Computing Surveys, 43(3), 16.CrossRefGoogle Scholar
  2. Agrawal, R., & Srikant, R. (1994). Fast algorithms for mining association rules in large databases. In VLDB (pp. 487–499).Google Scholar
  3. Amer, M. R., Xie, D., Zhao, M., Todorovic, S., & Zhu, S. C. (2012). Cost-sensitive top-down/bottom-up inference for multiscale activity recognition. In ECCV (pp. 187–200).Google Scholar
  4. Berg, T. L., Berg, A. C., & Shih, J. (2010). Automatic attribute discovery and characterization from noisy web data. In ECCV (pp. 663–676).Google Scholar
  5. Bishop, C. (2006). Pattern recognition and machine learning (Vol. 4). Berlin: Springer.zbMATHGoogle Scholar
  6. Bourdev, L. D., & Malik, J. (2009). Poselets: Body part detectors trained using 3d human pose annotations. In ICCV (pp. 1365–1372).Google Scholar
  7. Cai, Z., Wang, L., Peng, X., & Qiao, Y. (2014). Multi-view super vector for action recognition. In CVPR (pp. 596–603).Google Scholar
  8. Chang, C. C., & Lin, C. J. (2011). Libsvm: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(3), 27.CrossRefGoogle Scholar
  9. Chen, Y., Zhu, L., Lin, C., Yuille, A. L., & Zhang, H. (2007). Rapid inference on a novel and/or graph for object detection, segmentation and parsing. In NIPS.Google Scholar
  10. Csurka, G., Dance, C., Fan, L., Willamowski, J., & Bray, C. (2004). Visual categorization with bags of keypoints. In Workshop on statistical learning in computer vision, ECCV, Prague (Vol. 1, pp. 1–2).Google Scholar
  11. Doersch, C., Gupta, A., & Efros, A. A. (2013). Mid-level visual element discovery as discriminative mode seeking. In NIPS pp. 494–502.Google Scholar
  12. Felzenszwalb, P. F., Girshick, R. B., McAllester, D. A., & Ramanan, D. (2010). Object detection with discriminatively trained part-based models. IEEE Transactions Pattern Analysis and Machine Intelligence, 32(9), 1627–1645.CrossRefGoogle Scholar
  13. Forsyth, D. A., Arikan, O., Ikemoto, L., O’Brien, J. F., & Ramanan, D. (2005). Computational studies of human motion: Part 1, tracking and motion synthesis. Foundations and Trends in Computer Graphics and Vision 1(2/3).Google Scholar
  14. Frey, B. J., & Dueck, D. (2007). Clustering by passing messages between data points. Science, 315, 972–976.MathSciNetCrossRefzbMATHGoogle Scholar
  15. Gaidon, A., Harchaoui, Z., & Schmid, C. (2013). Temporal localization of actions with actoms. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(11), 2782–2795.CrossRefGoogle Scholar
  16. Gaidon, A., Harchaoui, Z., & Schmid, C. (2014). Activity representation with motion hierarchies. International Journal of Computer Vision, 107(3), 219–238.MathSciNetCrossRefGoogle Scholar
  17. Gorelick, L., Blank, M., Shechtman, E., Irani, M., & Basri, R. (2007). Actions as space-time shapes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(12), 2247–2253.CrossRefGoogle Scholar
  18. Jain, A., Gupta, A., Rodriguez, M., & Davis, L. S. (2013a). Representing videos using mid-level discriminative patches. In CVPR (pp. 2571–2578).Google Scholar
  19. Jain, M., Jegou, H., & Bouthemy, P. (2013b). Better exploiting motion for better action recognition. In CVPR (pp. 2555–2562).Google Scholar
  20. Jiang, Y., Dai, Q., Xue, X., Liu, W., & Ngo, C. (2012). Trajectory-based modeling of human actions with motion reference points. In ECCV (pp. 425–438).Google Scholar
  21. Jiang, Y. G., Liu, J., Roshan Zamir, A., Laptev, I., Piccardi, M., Shah, M., & Sukthankar, R. (2013). THUMOS challenge: Action recognition with a large number of classes.
  22. Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. (2014). Large-scale video classification with convolutional neural networks. In CVPR (pp. 1725–1732).Google Scholar
  23. Kliper-Gross, O., Gurovich, Y., Hassner, T., & Wolf, L. (2012). Motion interchange patterns for action recognition in unconstrained videos. In ECCV (pp. 256–269).Google Scholar
  24. Kuehne, H., Jhuang, H., Garrote, E., Poggio, T., & Serre, T. (2011). HMDB: A large video database for human motion recognition. In ICCV (pp. 2556–2563).Google Scholar
  25. Laptev, I. (2005). On space-time interest points. International Journal of Computer Vision, 64(2–3), 107–123.CrossRefGoogle Scholar
  26. Laxton, B., Lim, J., & Kriegman, D. J. (2007). Leveraging temporal, contextual and ordering constraints for recognizing complex activities in video. In CVPR (pp. 1–8).Google Scholar
  27. Liu, J., Kuipers, B., & Savarese, S. (2011). Recognizing human actions by attributes. In CVPR (pp. 3337–3344).Google Scholar
  28. Niebles, J. C., Chen, C. W., & Li, F. F. (2010). Modeling temporal structure of decomposable motion segments for activity classification. In ECCV (pp. 392–405).Google Scholar
  29. Oliver, N., Rosario, B., & Pentland, A. (2000). A bayesian computer vision system for modeling human interactions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8), 831–843.CrossRefGoogle Scholar
  30. Parikh, D., & Grauman, K. (2011). Relative attributes. In ICCV (pp. 503–510).Google Scholar
  31. Pirsiavash, H., & Ramanan, D. (2014). Parsing videos of actions with segmental grammars. In CVPR (pp. 612–619).Google Scholar
  32. Raptis, M., Kokkinos, I., & Soatto, S. (2012). Discovering discriminative action parts from mid-level video representations. In CVPR (pp. 1242–1249).Google Scholar
  33. Reddy, K. K., & Shah, M. (2013). Recognizing 50 human action categories of web videos. Machine Vision and Applications, 24(5), 971–981.CrossRefGoogle Scholar
  34. Rohrbach, M., Regneri, M., Andriluka, M., Amin, S., Pinkal, M., & Schiele, B. (2012). Script data for attribute-based recognition of composite activities. In ECCV.Google Scholar
  35. Sadanand, S., & Corso, J. J. (2012). Action bank: A high-level representation of activity in video. In CVPR (pp. 1234–1241).Google Scholar
  36. Sánchez, J., Perronnin, F., Mensink, T., & Verbeek, J. J. (2013). Image classification with the fisher vector: Theory and practice. International Journal of Computer Vision, 105(3), 222–245.MathSciNetCrossRefzbMATHGoogle Scholar
  37. Sapienza, M., Cuzzolin, F., & Torr, P. H. S. (2012). Learning discriminative space-time actions from weakly labelled videos. In BMVC (pp. 1–12).Google Scholar
  38. Schüldt, C., Laptev, I., & Caputo, B. (2004). Recognizing human actions: A local svm approach. In ICPR.Google Scholar
  39. Si, Z., & Zhu, S. C. (2013). Learning AND-OR templates for object recognition and detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(9), 2189–2205.CrossRefGoogle Scholar
  40. Simonyan, K., & Zisserman, A. (2014). Two-stream convolutional networks for action recognition in videos. In NIPS (pp. 568–576).Google Scholar
  41. Singh, S., Gupta, A., & Efros, A. A. (2012). Unsupervised discovery of mid-level discriminative patches. In ECCV (pp. 73–86).Google Scholar
  42. Sivic, J., & Zisserman, A. (2003). Video google: A text retrieval approach to object matching in videos. In ICCV (pp. 1470–1477).Google Scholar
  43. Soomro, K., Zamir, A. R., & Shah, M. (2012). UCF101: A dataset of 101 human actions classes from videos in the wild. CoRR abs/1212.0402.Google Scholar
  44. Tang, K. D., Li, F. F., & Koller, D. (2012). Learning latent temporal structure for complex event detection. In CVPR (pp. 1250–1257).Google Scholar
  45. Turaga, P. K., Chellappa, R., Subrahmanian, V. S., & Udrea, O. (2008). Machine recognition of human activities: A survey. IEEE Transactions on Circuits and Systems for Video Technology, 18(11), 1473–1488.CrossRefGoogle Scholar
  46. Wang, H., & Schmid, C. (2013a). Action recognition with improved trajectories. In ICCV (pp. 3551–3558).Google Scholar
  47. Wang, H., & Schmid, C. (2013b). Lear-inria submission for the thumos workshop. In: ICCV Workshop on Action Recognition with a Large Number of Classes.Google Scholar
  48. Wang, H., Kläser, A., Schmid, C., & Liu, C. L. (2013a). Dense trajectories and motion boundary descriptors for action recognition. International Journal of Computer Vision, 103(1), 60–79.MathSciNetCrossRefGoogle Scholar
  49. Wang, L., Qiao, Y., & Tang, X. (2013b). Mining motion atoms and phrases for complex action recognition. In ICCV (pp. 2680–2687).Google Scholar
  50. Wang, L., Qiao, Y., & Tang, X. (2013c). Motionlets: Mid-level 3D parts for human motion recognition. In CVPR (pp. 2674–2681).Google Scholar
  51. Wang, L., Qiao, Y., & Tang, X. (2014a). Latent hierarchical model of temporal structure for complex activity classification. IEEE Transactions on Image Processing, 23(2), 810–822.MathSciNetCrossRefGoogle Scholar
  52. Wang, L., Qiao, Y., & Tang, X. (2014b). Video action detection with relational dynamic-poselets. In ECCV (pp. 565–580).Google Scholar
  53. Wang, L., Qiao, Y., & Tang, X. (2015). Action recognition with trjectory-pooled deep-convolutional descriptors. In CVPR (pp. 4305–4314).Google Scholar
  54. Wang, S. B., Quattoni, A., Morency, L. P., Demirdjian, D., & Darrell, T. (2006). Hidden conditional random fields for gesture recognition. In CVPR (pp. 1521–1527).Google Scholar
  55. Wang, X., Wang, L., & Qiao, Y. (2012). A comparative study of encoding, pooling and normalization methods for action recognition. In ACCV (pp. 572–585).Google Scholar
  56. Wu, J., Zhang, Y., & Lin, W. (2014). Towards good practices for action video encoding. In CVPR (pp. 2577–2584).Google Scholar
  57. Yao, B., & Li, F. F. (2010). Grouplet: A structured image representation for recognizing human and object interactions. In CVPR.Google Scholar
  58. Zhang, W., Zhu, M., & Derpanis, K. G. (2013). From actemes to action: A strongly-supervised representation for detailed action understanding. In ICCV (pp. 2248–2255).Google Scholar
  59. Zhao, Y., & Zhu S. C. (2011). Image parsing with stochastic scene grammar. In NIPS (pp. 73–81).Google Scholar
  60. Zhu ,J., Wang, B., Yang, X., Zhang, W., & Tu, Z. (2013). Action recognition with actons. In ICCV (pp. 3559–3566).Google Scholar

Copyright information

© Springer Science+Business Media New York 2015

Authors and Affiliations

  1. 1.Department of Information EngineeringThe Chinese University of Hong KongShatinHong Kong
  2. 2.Shenzhen Institutes of Advanced TechnologyChinese Academy of SciencesShenzhenChina

Personalised recommendations