Action Recognition with Stacked Fisher Vectors

  • Conference paper

Part of the Lecture Notes in Computer Science book series (LNIP, volume 8693)

Abstract

Video representation is a vital problem in action recognition. This paper proposes Stacked Fisher Vectors (SFV), a new representation based on multi-layer nested Fisher vector encoding, for action recognition. In the first layer, we densely sample large subvolumes from input videos, extract local features, and encode them with Fisher vectors (FVs). The second layer compresses the subvolume FVs obtained in the first layer and then encodes them again with Fisher vectors. Compared with the standard FV, SFV refines the representation and abstracts semantic information in a hierarchical way. Compared with recent mid-level action representations, SFV does not need to mine discriminative action parts, yet preserves mid-level information through the Fisher vector encoding in the higher layer. We evaluate the proposed methods on three challenging datasets, namely YouTube, J-HMDB, and HMDB51. Experimental results demonstrate the effectiveness of SFV, and the combination of the traditional FV and SFV outperforms state-of-the-art methods on these datasets by a large margin.
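
The abstract describes a two-layer pipeline. Below is a minimal sketch of that flow, assuming NumPy and scikit-learn; the GMM sizes, the reduced dimension, and the use of PCA in place of the paper's max-margin dimensionality reduction are illustrative assumptions, not the authors' implementation. In practice the GMMs and the projection would be learned on a training set rather than on a single video.

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.mixture import GaussianMixture

    def fisher_vector(descriptors, gmm):
        # Soft-assign each descriptor to the K Gaussians, then accumulate the
        # standard FV gradients w.r.t. the means and (diagonal) variances.
        q = gmm.predict_proba(descriptors)                      # shape (N, K)
        means, covs, weights = gmm.means_, gmm.covariances_, gmm.weights_
        n = descriptors.shape[0]
        diff = (descriptors[:, None, :] - means[None, :, :]) / np.sqrt(covs)[None, :, :]
        g_mu = (q[:, :, None] * diff).sum(0) / (n * np.sqrt(weights))[:, None]
        g_sig = (q[:, :, None] * (diff ** 2 - 1)).sum(0) / (n * np.sqrt(2 * weights))[:, None]
        fv = np.concatenate([g_mu.ravel(), g_sig.ravel()])
        fv = np.sign(fv) * np.sqrt(np.abs(fv))                  # power normalization
        return fv / (np.linalg.norm(fv) + 1e-12)                # L2 normalization

    def stacked_fisher_vector(subvolume_descs, k1=64, k2=32, reduced_dim=64):
        # subvolume_descs: list of (N_i x D) local-descriptor arrays, one per
        # densely sampled subvolume of the video.
        # Layer 1: FV-encode the local descriptors of every subvolume.
        gmm1 = GaussianMixture(n_components=k1, covariance_type="diag")
        gmm1.fit(np.vstack(subvolume_descs))
        layer1 = np.stack([fisher_vector(d, gmm1) for d in subvolume_descs])
        # Compress the subvolume FVs; PCA stands in for the paper's supervised
        # max-margin dimensionality reduction. reduced_dim must not exceed the
        # number of subvolumes.
        compressed = PCA(n_components=reduced_dim).fit_transform(layer1)
        # Layer 2: FV-encode the compressed subvolume FVs into the video-level SFV.
        gmm2 = GaussianMixture(n_components=k2, covariance_type="diag")
        gmm2.fit(compressed)
        return fisher_vector(compressed, gmm2)

The resulting video-level SFV can then be concatenated with a standard FV of the whole video and fed to a linear SVM, mirroring the combination reported in the abstract.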

Keywords

  • Action recognition
  • Fisher vectors
  • Stacked Fisher vectors
  • Max-margin dimensionality reduction


Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Peng, X., Zou, C., Qiao, Y., Peng, Q. (2014). Action Recognition with Stacked Fisher Vectors. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds) Computer Vision – ECCV 2014. ECCV 2014. Lecture Notes in Computer Science, vol 8693. Springer, Cham. https://doi.org/10.1007/978-3-319-10602-1_38

  • DOI: https://doi.org/10.1007/978-3-319-10602-1_38

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-10601-4

  • Online ISBN: 978-3-319-10602-1

  • eBook Packages: Computer Science, Computer Science (R0)