Daily Living Activities Recognition via Efficient High and Low Level Cues Combination and Fisher Kernel Representation

  • Negar Rostamzadeh
  • Gloria Zen
  • Ionuţ Mironică
  • Jasper Uijlings
  • Nicu Sebe
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8156)


In this work we propose an efficient method for activity recognition in a daily living scenario. At the feature level, we propose a method to extract and combine low- and high-level information, and we show that the performance of body pose estimation (and consequently of activity recognition) can be significantly improved. In particular, we propose an approach that extends state-of-the-art pictorial deformable models for body pose estimation. We show that including low-level cues (e.g., optical flow and foreground) together with an off-the-shelf body part detector yields better performance without the need to re-train the detectors. Finally, we apply a Fisher Kernel representation that takes temporal variation into account, and we show that we outperform state-of-the-art methods on a public dataset of daily living activities.
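The Fisher Kernel representation mentioned in the abstract encodes a variable-length sequence of frame-level descriptors as the gradient of a diagonal-covariance GMM's log-likelihood with respect to its mean and variance parameters. A minimal sketch with NumPy and scikit-learn follows; the descriptor dimensionality, number of components, random data, and the power/L2 normalisation step are illustrative assumptions, not the paper's actual settings:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fisher_vector(descriptors, gmm):
    """Fisher vector of a set of frame-level descriptors under a
    diagonal-covariance GMM: gradients w.r.t. means and variances,
    followed by power and L2 normalisation (the 'improved' variant)."""
    X = np.atleast_2d(descriptors)            # (T, D) descriptors
    T = X.shape[0]
    w = gmm.weights_                          # (K,) mixture weights
    mu = gmm.means_                           # (K, D) component means
    sigma = np.sqrt(gmm.covariances_)         # (K, D) diag std devs
    gamma = gmm.predict_proba(X)              # (T, K) soft assignments

    # Standardised differences to each component: (T, K, D)
    diff = (X[:, None, :] - mu[None]) / sigma[None]
    # Gradient w.r.t. means and variances, averaged over frames
    g_mu = (gamma[..., None] * diff).sum(0) / (T * np.sqrt(w)[:, None])
    g_sig = (gamma[..., None] * (diff ** 2 - 1)).sum(0) / (T * np.sqrt(2 * w)[:, None])

    fv = np.hstack([g_mu.ravel(), g_sig.ravel()])   # length 2*K*D
    fv = np.sign(fv) * np.sqrt(np.abs(fv))          # power normalisation
    return fv / (np.linalg.norm(fv) + 1e-12)        # L2 normalisation

# Toy usage: 5-D descriptors, K = 3 components (hypothetical sizes)
rng = np.random.default_rng(0)
gmm = GaussianMixture(3, covariance_type="diag", random_state=0)
gmm.fit(rng.normal(size=(200, 5)))
fv = fisher_vector(rng.normal(size=(40, 5)), gmm)
print(fv.shape)  # (30,) = 2 * K * D
```

Because the encoding has a fixed length regardless of the number of frames, clips of different durations map to vectors that a standard linear classifier can compare directly.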


Keywords: Body Part · Gaussian Mixture Model · Activity Recognition · Foreground Pixel · Human Action Recognition



Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Negar Rostamzadeh (1)
  • Gloria Zen (1)
  • Ionuţ Mironică (2)
  • Jasper Uijlings (1)
  • Nicu Sebe (1)
  1. DISI, University of Trento, Trento, Italy
  2. LAPI, University Politehnica of Bucharest, Bucharest, Romania
