Action Recognition with Stacked Fisher Vectors

  • Xiaojiang Peng
  • Changqing Zou
  • Yu Qiao
  • Qiang Peng
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8693)


Representation of video is a vital problem in action recognition. This paper proposes Stacked Fisher Vectors (SFV), a new representation based on multi-layer nested Fisher vector encoding, for action recognition. In the first layer, we densely sample large subvolumes from input videos, extract local features, and encode them with Fisher vectors (FVs). The second layer compresses the FVs of the subvolumes obtained in the first layer, and then encodes them again with Fisher vectors. Compared with the standard FV, SFV allows refining the representation and abstracting semantic information in a hierarchical way. Compared with recent mid-level action representations, SFV need not mine discriminative action parts, yet preserves mid-level information through Fisher vector encoding in the higher layer. We evaluate the proposed methods on three challenging datasets, namely YouTube, J-HMDB, and HMDB51. Experimental results demonstrate the effectiveness of SFV, and the combination of the traditional FV and SFV outperforms state-of-the-art methods on these datasets by a large margin.


Action recognition · Fisher vectors · Stacked Fisher vectors · Max-margin dimensionality reduction
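The two-layer pipeline described in the abstract (encode local descriptors of subvolumes as Fisher vectors, compress them, then Fisher-encode the compressed vectors again) can be sketched as follows. This is a minimal illustration, not the authors' implementation: PCA stands in for the paper's max-margin dimensionality reduction, and random arrays stand in for real local video descriptors.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

def fisher_vector(x, gmm):
    """Encode descriptors x (N x d) as a 2*K*d Fisher vector
    (gradients w.r.t. GMM means and standard deviations),
    with power and L2 normalization."""
    gamma = gmm.predict_proba(x)              # (N, K) posteriors
    n, _ = x.shape
    mu, w = gmm.means_, gmm.weights_
    sigma = np.sqrt(gmm.covariances_)         # diagonal covariances: (K, d)
    parts = []
    for j in range(gmm.n_components):
        diff = (x - mu[j]) / sigma[j]         # whitened deviations (N, d)
        g = gamma[:, j][:, None]
        u = (g * diff).sum(0) / (n * np.sqrt(w[j]))
        v = (g * (diff ** 2 - 1)).sum(0) / (n * np.sqrt(2 * w[j]))
        parts.extend([u, v])
    fv = np.concatenate(parts)
    fv = np.sign(fv) * np.sqrt(np.abs(fv))    # power normalization
    return fv / (np.linalg.norm(fv) + 1e-12)  # L2 normalization

rng = np.random.default_rng(0)
# First layer: one descriptor set per densely sampled subvolume
# (random stand-ins for real local features).
subvolumes = [rng.normal(size=(50, 8)) for _ in range(20)]
gmm1 = GaussianMixture(n_components=4, covariance_type="diag",
                       random_state=0).fit(np.vstack(subvolumes))
layer1 = np.array([fisher_vector(d, gmm1) for d in subvolumes])  # (20, 64)

# Compress first-layer FVs (PCA as a simple stand-in for
# max-margin dimensionality reduction).
compressed = PCA(n_components=6).fit_transform(layer1)           # (20, 6)

# Second layer: Fisher-encode the compressed FVs -> the stacked FV.
gmm2 = GaussianMixture(n_components=3, covariance_type="diag",
                       random_state=0).fit(compressed)
sfv = fisher_vector(compressed, gmm2)         # length 2 * 3 * 6 = 36
```

In practice the final video representation concatenates this SFV with the traditional single-layer FV, which is the combination the paper reports as outperforming prior methods.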





Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Xiaojiang Peng (1, 3, 2)
  • Changqing Zou (3, 2)
  • Yu Qiao (2, 4)
  • Qiang Peng (1)

  1. Southwest Jiaotong University, Chengdu, China
  2. Shenzhen Key Lab of CVPR, Shenzhen Institutes of Advanced Technology, CAS, China
  3. Department of Computer Science, Hengyang Normal University, Hengyang, China
  4. The Chinese University of Hong Kong, China
