Joint Recognition and Segmentation of Actions via Probabilistic Integration of Spatio-Temporal Fisher Vectors

  • Johanna Carvajal
  • Chris McCool
  • Brian Lovell
  • Conrad Sanderson
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9794)

Abstract

We propose a hierarchical approach to multi-action recognition that performs joint classification and segmentation. A given video (containing several consecutive actions) is processed via a sequence of overlapping temporal windows. Each frame in a temporal window is represented through selective low-level spatio-temporal features which efficiently capture relevant local dynamics. Features from each window are represented as a Fisher vector, which captures first- and second-order statistics. Instead of directly classifying each Fisher vector, it is converted into a vector of class probabilities. The final classification decision for each frame is then obtained by integrating the class probabilities at the frame level, which exploits the overlap between temporal windows. Experiments were performed on two datasets: s-KTH (a stitched version of the KTH dataset that simulates multi-action videos), and the challenging CMU-MMAC dataset. On s-KTH, the proposed approach achieves an accuracy of 85.0%, significantly outperforming two recent approaches based on GMMs and HMMs, which obtained 78.3% and 71.2%, respectively. On CMU-MMAC, the proposed approach achieves an accuracy of 40.9%, outperforming the GMM and HMM approaches, which obtained 33.7% and 38.4%, respectively. Furthermore, the proposed system is on average 40 times faster than the GMM-based approach.
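
The frame-level integration step described in the abstract can be illustrated with a minimal sketch. The abstract does not state the exact integration rule, so the code below assumes a simple one: each frame's score is the average of the class-probability vectors of all windows that cover it, and the frame label is the class with the highest averaged score. The function name, its parameters, and the toy probabilities are hypothetical and are not taken from the paper.

```python
import numpy as np

def integrate_window_probabilities(window_probs, window_starts, window_len, n_frames):
    """Assumed rule: average the probabilities of all windows covering a frame."""
    window_probs = np.asarray(window_probs, dtype=float)
    n_classes = window_probs.shape[1]
    frame_scores = np.zeros((n_frames, n_classes))
    frame_counts = np.zeros(n_frames)

    # Spread each window's class probabilities over the frames it spans.
    for probs, start in zip(window_probs, window_starts):
        end = min(start + window_len, n_frames)
        frame_scores[start:end] += probs
        frame_counts[start:end] += 1

    # Normalise by the number of windows covering each frame.
    covered = frame_counts > 0
    frame_scores[covered] /= frame_counts[covered][:, None]

    # Frames not covered by any window keep the label -1.
    labels = np.full(n_frames, -1)
    labels[covered] = frame_scores[covered].argmax(axis=1)
    return labels, frame_scores

# Toy example (hypothetical values): 8 frames, 2 classes,
# windows of 4 frames starting every 2 frames.
toy_probs = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]]
toy_labels, toy_scores = integrate_window_probabilities(
    toy_probs, window_starts=[0, 2, 4], window_len=4, n_frames=8)
print(toy_labels)
```

In this toy example, frames 2 and 3 are covered by the first two windows and frames 4 and 5 by the last two, so the label changes where the averaged probabilities cross over. This is how overlapping windows can yield a segmentation as a by-product of classification.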

Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  • Johanna Carvajal (1, 3)
  • Chris McCool (2)
  • Brian Lovell (1)
  • Conrad Sanderson (1, 3, 4)

  1. University of Queensland, Brisbane, Australia
  2. Queensland University of Technology, Brisbane, Australia
  3. NICTA, Brisbane, Australia
  4. Data61, CSIRO, Brisbane, Australia
