International Journal of Computer Vision, Volume 50, Issue 2, pp. 171–184

Natural Language Description of Human Activities from Video Images Based on Concept Hierarchy of Actions

  • Atsuhiro Kojima
  • Takeshi Tamura
  • Kunio Fukunaga


We propose a method for describing human activities in video images based on a concept hierarchy of actions. A major difficulty in transforming video images into textual descriptions is bridging the semantic gap between them, a problem also known as the inverse Hollywood problem. In general, concepts of human actions or events can be classified by semantic primitives. By associating these concepts with semantic features extracted from video images, appropriate syntactic components such as verbs and objects are determined and then translated into natural language sentences. We also demonstrate the performance of the proposed method through several experiments.
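The pipeline the abstract describes — matching semantic features extracted from video against a concept hierarchy of actions, then filling a Fillmore-style case frame to produce a sentence — can be sketched as follows. This is a minimal illustrative reconstruction, not the authors' implementation: the toy hierarchy, feature names, and frame slots are all assumptions.

```python
# Hypothetical sketch: semantic features detected in video are matched
# against a small concept hierarchy of actions; the most specific matching
# concept supplies a verb and a case frame, which is then rendered as an
# English sentence. All concept/feature names here are illustrative.

# Each entry: (concept, required semantic features, verb phrase, frame slots)
CONCEPT_HIERARCHY = [
    ("move",     {"agent_moving"},                       "moves",     ["agent"]),
    ("walk",     {"agent_moving", "upright_posture"},    "walks",     ["agent", "locative"]),
    ("pick_up",  {"hand_near_object", "object_moving"},  "picks up",  ["agent", "object"]),
    ("put_down", {"hand_near_object", "object_lowered"}, "puts down", ["agent", "object"]),
]

def select_concept(features):
    """Pick the matching concept that requires the most features,
    i.e. the most specific node in the hierarchy."""
    matches = [c for c in CONCEPT_HIERARCHY if c[1] <= features]
    return max(matches, key=lambda c: len(c[1])) if matches else None

def generate_sentence(features, fillers):
    """Fill the selected concept's case frame and render a sentence."""
    concept = select_concept(features)
    if concept is None:
        return None
    _, _, verb, slots = concept
    parts = [fillers.get("agent", "someone"), verb]
    if "object" in slots and "object" in fillers:
        parts.append(fillers["object"])
    if "locative" in slots and "locative" in fillers:
        parts.append("to " + fillers["locative"])
    return " ".join(parts) + "."

print(generate_sentence(
    {"hand_near_object", "object_moving"},
    {"agent": "the person", "object": "a book"},
))  # -> "the person picks up a book."
```

Selecting the most specific matching concept mirrors descending the concept hierarchy: "pick up" is preferred over the more generic "move" whenever its stricter feature set is satisfied.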

Keywords: natural language generation · concept hierarchy · semantic primitive · position/posture estimation of human · case frame





Copyright information

© Kluwer Academic Publishers 2002

Authors and Affiliations

  • Atsuhiro Kojima (1)
  • Takeshi Tamura (1)
  • Kunio Fukunaga (2)
  1. Library and Science Information Center, Osaka Prefecture University, Sakai, Osaka, Japan
  2. Graduate School of Engineering, Osaka Prefecture University, Sakai, Osaka, Japan
