Semantics of Human Behavior in Image Sequences

  • Nataliya Shapovalova
  • Carles Fernández
  • F. Xavier Roca
  • Jordi Gonzàlez


Human behavior is contextualized, and understanding the scene in which an action takes place is crucial for assigning proper semantics to that behavior. In this chapter we present a novel approach to scene understanding, with emphasis on the particular case of Human Event Understanding. We introduce a new taxonomy that organizes the different semantic levels of the proposed Human Event Understanding framework. The framework contributes to the scene understanding domain by (i) extracting behavioral patterns from the integrative analysis of spatial, temporal, and contextual evidence and (ii) integrating bottom-up and top-down approaches to Human Event Understanding. We explore how information about interactions between humans and their environment influences the performance of activity recognition, and how this can be extrapolated to the temporal domain in order to draw higher-level inferences from human events observed in image sequences.


Activity Recognition; Scale Invariant Feature Transform; Human Event; Spatial Interaction Model; Scene Understanding



We gratefully acknowledge Marco Pedersoli for providing the detection module. This work was initially supported by the EU Projects FP6 HERMES IST-027110 and VIDI-Video IST-045547. The authors also acknowledge the support of the Spanish Research Programs Consolider-Ingenio 2010: MIPRCV (CSD200700018); Avanza I+D ViCoMo (TSI-020400-2009-133); and CENIT-IMAGENIO 2010 SEGUR@; along with the Spanish projects TIN2009-14501-C02-01 and TIN2009-14501-C02-02.



Copyright information

© Springer-Verlag London Limited 2011

Authors and Affiliations

  • Nataliya Shapovalova (1), corresponding author
  • Carles Fernández (1)
  • F. Xavier Roca (1)
  • Jordi Gonzàlez (1)

  1. Departament de Ciències de la Computació and Computer Vision Center, Universitat Autònoma de Barcelona, Bellaterra, Spain
