Infinite Gaussian Fisher Vector to Support Video-Based Human Action Recognition

  • Jorge L. Fernández-Ramírez
  • Andrés M. Álvarez-Meza
  • Álvaro A. Orozco-Gutiérrez
  • Julian David Echeverry-Correa
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11845)


Human Action Recognition (HAR) is a computer vision task that attempts to monitor, understand, and characterize humans in videos. Here, we introduce an extension to the conventional Fisher Vector encoding technique to support this task. The methodology, based on the Infinite Gaussian Mixture Model (IGMM), seeks to reveal a set of discriminant local spatio-temporal features that enable precise codification of visual information. Specifically, it is much simpler to handle the infinite limit of the IGMM than to work with traditional Gaussian Mixture Models (GMMs) of unknown size, which would require extensive cross-validation. Under this premise, we developed a fully automatic encoding methodology that avoids heuristically specifying the number of components in the mixture model. This parameter is known to strongly affect recognition performance, and inferring it with conventional methods implies a high computational burden. Moreover, the Markov Chain Monte Carlo implementation of the hierarchical IGMM effectively avoids the local minima that tend to plague mixtures trained by optimization-based methods. Results on the UCF50 and HMDB51 databases demonstrate that our proposal outperforms state-of-the-art encoding approaches in the trade-off between recognition performance and computational complexity, as it drastically reduces both the number of operations and the memory requirements.
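The abstract's idea can be sketched in a few lines. This is a minimal illustration, not the authors' MCMC implementation: it substitutes scikit-learn's variational Dirichlet-process mixture (`BayesianGaussianMixture`) for the hierarchical IGMM, prunes components whose posterior weight falls below a floor (so the effective mixture size is inferred rather than cross-validated), and computes the standard improved Fisher Vector gradients with respect to the surviving means and variances. The `weight_floor` threshold and the toy descriptors are assumptions for the sketch.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

def fisher_vector(X, gmm, weight_floor=1e-2):
    """Improved Fisher Vector over the components the DP mixture keeps active.

    X            : (N, D) array of local spatio-temporal descriptors.
    gmm          : fitted BayesianGaussianMixture with covariance_type='diag'.
    weight_floor : components below this weight are treated as pruned, so the
                   encoding size adapts automatically (hypothetical threshold).
    """
    active = gmm.weights_ > weight_floor            # effective mixture size
    w = gmm.weights_[active]                        # (K,)  component weights
    mu = gmm.means_[active]                         # (K, D) component means
    var = gmm.covariances_[active]                  # (K, D) diagonal variances
    q = gmm.predict_proba(X)[:, active]             # (N, K) responsibilities
    n = X.shape[0]

    diff = (X[:, None, :] - mu[None]) / np.sqrt(var)[None]   # (N, K, D)
    # Normalised gradients w.r.t. means and variances (diagonal-GMM FV).
    g_mu = (q[..., None] * diff).sum(0) / (n * np.sqrt(w)[:, None])
    g_var = (q[..., None] * (diff ** 2 - 1)).sum(0) / (n * np.sqrt(2 * w)[:, None])

    fv = np.hstack([g_mu.ravel(), g_var.ravel()])
    fv = np.sign(fv) * np.sqrt(np.abs(fv))          # power normalisation
    return fv / max(np.linalg.norm(fv), 1e-12)      # L2 normalisation

# Toy stand-in for a vocabulary of local descriptors (D = 8).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
gmm = BayesianGaussianMixture(
    n_components=16,                                # truncation level, not a tuned K
    weight_concentration_prior_type="dirichlet_process",
    covariance_type="diag", max_iter=200, random_state=0).fit(X)

# Encode a held-out "video" against the global vocabulary.
Y = rng.normal(loc=0.5, size=(200, 8))
fv = fisher_vector(Y, gmm)
```

The resulting vector has length 2·K·D, where K is the number of surviving components rather than a cross-validated constant; in the paper's pipeline the analogous encoding would feed a standard classifier such as a linear SVM.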


Keywords: Human Action Recognition · Infinite Gaussian Mixture Model · Fisher Vector · Video processing



This work was developed under grants provided by the project “Prototipo de un sistema de recuperación de información por contenido orientado a la localización y clasificación de grupos de microcalcificaciones en mamografías - PROTOCAM”, CV E6-19-1, from the VIIE-UTP. Also, J. Fernández is partially funded by the Colciencias program Jóvenes investigadores e innovadores - Convocatoria 812 de 2018, and by the project “Sistema de clasificación de videos basado en técnicas de representación utilizando métodos núcleo e inferencia bayesiana”, CV E6-19-2, from the VIIE-UTP.



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Jorge L. Fernández-Ramírez (1)
  • Andrés M. Álvarez-Meza (2)
  • Álvaro A. Orozco-Gutiérrez (1)
  • Julian David Echeverry-Correa (1)
  1. Automatics Research Group, Universidad Tecnológica de Pereira, Pereira, Colombia
  2. Signal Processing and Recognition Group, Universidad Nacional de Colombia - Sede Manizales, Manizales, Colombia
