Automated Textual Descriptions for a Wide Range of Video Events with 48 Human Actions

  • Patrick Hanckmann
  • Klamer Schutte
  • Gertjan J. Burghouts
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7583)


We present a hybrid method for generating textual descriptions of video based on actions. The method combines an action classifier and a description generator. The action classifier detects and classifies the actions in the video so that they can serve as verbs for the description generator. The description generator (1) finds the actors (objects or persons) in the video and connects them correctly to the verbs, so that they fill the roles of subject, direct object, and indirect object, and (2) generates a sentence from the verb, subject, and direct and indirect objects. The novelty of our method is that we combine the discriminative power of a bag-of-features action detector with the generative power of a rule-based action descriptor. We show that this approach outperforms a homogeneous setup in which the rule-based component serves as both action detector and action descriptor.
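The pipeline above can be sketched in a few lines: a classifier scores candidate verbs, the highest-scoring verb is handed to a rule-based generator that assembles a sentence from the verb and the actors connected to it. The functions and example scores below are hypothetical stand-ins, not the paper's implementation (whose classifier is a bag-of-features detector over 48 human actions); this is only a minimal sketch of the division of labour between the two components.

```python
# Hypothetical sketch of the hybrid description pipeline:
# a discriminative classifier supplies the verb, a rule-based
# generator connects actors to it and produces the sentence.

def rank_verbs(action_scores):
    """Order candidate verbs by classifier confidence, highest first."""
    return sorted(action_scores, key=action_scores.get, reverse=True)

def describe(verb, subject, direct_obj=None, indirect_obj=None):
    """Rule-based sentence assembly from a verb and its connected actors."""
    parts = [subject, verb]
    if direct_obj:
        parts.append(direct_obj)
    if indirect_obj:
        parts.append("to " + indirect_obj)
    return " ".join(parts).capitalize() + "."

# Example: made-up classifier scores for one video clip.
scores = {"gives": 0.83, "carries": 0.41, "walks": 0.22}
best_verb = rank_verbs(scores)[0]
print(describe(best_verb, "the person", "a box", "another person"))
```

In this sketch the sentence template is fixed; the paper's generator instead selects the actors from tracked objects and persons in the video and verifies that they fit the detected verb's roles.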



Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Patrick Hanckmann (1)
  • Klamer Schutte (1)
  • Gertjan J. Burghouts (1)
  1. TNO, The Hague, The Netherlands
