Machine Vision and Applications

, Volume 28, Issue 3–4, pp 243–265 | Cite as

Generating natural language tags for video information management

  • Muhammad Usman Ghani Khan
  • Yoshihiko Gotoh
Original Paper


This exploratory work is concerned with generation of natural language descriptions that can be used for video retrieval applications. It is a step ahead of keyword-based tagging as it captures relations between keywords associated with videos. Firstly, we prepare hand annotations consisting of descriptions for video segments crafted from a TREC Video dataset. Analysis of this data presents insights into human’s interests on video contents. Secondly, we develop a framework for creating smooth and coherent description of video streams. It builds on conventional image processing techniques that extract high-level features from individual video frames. Natural language description is then produced based on high-level features. Although feature extraction processes are erroneous at various levels, we explore approaches to putting them together to produce a coherent, smooth and well-phrased description by incorporating spatial and temporal information. Evaluation is made by calculating ROUGE scores between human-annotated and machine-generated descriptions. Further, we introduce a task-based evaluation by human subjects which provides qualitative evaluation of generated descriptions.


Video information management Video annotation Natural language generation 


  1. 1.
    Abella, A., Kender, J.R., Starren, J.: Description generation of abnormal densities found in radiographs. In: Proceedings of the Annual Symposium on Computer Application in Medical Care, p. 42 (1995)Google Scholar
  2. 2.
    Aker, A., Gaizauskas, R.: Generating image descriptions using dependency relational patterns. In: Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pp. 1250–1258 (2010)Google Scholar
  3. 3.
    Allen, J.F.: Towards a general theory of action and time. Artif. Intell. 23(2), 123–154 (1984)CrossRefzbMATHGoogle Scholar
  4. 4.
    Bai, L., Li, K., Pei, J., Jiang, S.: Main objects interaction activity recognition in real images. Neural Comput. Appl. 1–14 (2015)Google Scholar
  5. 5.
    Baiget, P., Fernández, C., Roca, X., Gonzàlez, J.: Trajectory-Based Abnormality Categorization for Learning Route Patterns in Surveillance. Springer, Berlin (2012)CrossRefGoogle Scholar
  6. 6.
    Chen, D.L., Dolan, W.B.: Collecting highly parallel data for paraphrase evaluation. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, vol. 1, pp. 190–200. Association for Computational Linguistics (2011)Google Scholar
  7. 7.
    Cruz-Perez, C., Starostenko, O., Alarcon-Aquino, V., Rodriguez-Asomoza, J.: Automatic image annotation for description of urban and outdoor scenes. In: Innovations and Advances in Computing, Informatics, Systems Sciences, Networking and Engineering, pp. 139–147. Springer (2015)Google Scholar
  8. 8.
    Das, D.: Human gait classification using combined HMM & SVM hybrid classifier. In: IEEE International Conference on Electronic Design, Computer Networks & Automated Verification (EDCAV), pp. 169–174 (2015)Google Scholar
  9. 9.
    Donahue, J., Anne Hendricks, L., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko, K., Darrell, T.: Long-term recurrent convolutional networks for visual recognition and description. arXiv preprint arXiv:1411.4389 (2014)
  10. 10.
    Feng, Y., Lapata, M.: How many words is a picture worth? automatic caption generation for news images. In: Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pp. 1239–1249 (2010)Google Scholar
  11. 11.
    Filice, S., Da San Martino, G., Moschitti, A.: Structural representations for learning relations between pairs of texts. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics, Beijing, China, July. Association for Computational Linguistics (2015)Google Scholar
  12. 12.
    Gitte, M., Bawaskar, H., Sethi, S., Shinde, A.: Content based video retrieval system. Int. J. Res. Eng. Technol. 3(6), 1 (2014)Google Scholar
  13. 13.
    Guadarrama, S., Krishnamoorthy, N., Malkarnenkar, G., Venugopalan, S., Mooney, R., Darrell, T., Saenko, K.: Youtube2text: recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2013, pp. 2712–2719. IEEE (2013)Google Scholar
  14. 14.
    Hu, W.C., Yang, C.Y., Huang, D.Y., Huang, C.H.: Feature-based face detection against skin-color like backgrounds with varying illumination. J. Inf. Hiding Multimed. Signal Process. 2(2), 123–132 (2011)Google Scholar
  15. 15.
    Khan, M.U.G., Gotoh, Y.: Describing video contents in natural language. In: Proceedings of the EACL Workshop, Avignon (2012)Google Scholar
  16. 16.
    Khan, M.U.G., Al Harbi, N., Gotoh, Y.: A framework for creating natural language descriptions of video streams. Inf. Sci. 303, 61–82 (2015)MathSciNetCrossRefGoogle Scholar
  17. 17.
    Khan, M.U.G., Saeed, A.: Human detection in videos. J. Theor. Appl. Inf. Technol. 5(2), 1 (2009)Google Scholar
  18. 18.
    Khan, M.U.G., Zhang, L., Gotoh, Y.: Towards coherent natural language description of video streams. In: Computer Vision Workshops (ICCV Workshops), 2011 IEEE International Conference on, pp. 664–671. IEEE (2011)Google Scholar
  19. 19.
    Khan, M.U.G., Nawab, R.M.A., Gotoh, Y.: Natural language descriptions of visual scenes: corpus generation and analysis. In: Proceedings of the Joint Workshop on Exploiting Synergies Between Information Retrieval and Machine Translation (ESIRMT) and Hybrid Approaches to Machine Translation (HyTra), pp. 38–47. Association for Computational Linguistics (2012)Google Scholar
  20. 20.
    Kim, W., Park, J., Kim, C.: A novel method for efficient indoor–outdoor image classification. J. Signal Process. Syst. 61, 251–258 (2010)CrossRefGoogle Scholar
  21. 21.
    Klein, D., Manning, C.D.: Accurate unlexicalized parsing. In: Proceedings of the 41st Annual Meeting on Association for Computational Linguistics, pp. 423–430 (2003)Google Scholar
  22. 22.
    Kojima, A., Takaya, M., Aoki, S., Miyamoto, T., Fukunaga, K.: Recognition and textual description of human activities by mobile robot. In: Proceedings of the 3rd International Conference on Innovative Computing Information and Control, pp. 53–53 (2008)Google Scholar
  23. 23.
    Kulkarni, G., Premraj, V., Dhar, S., Li, S., Choi, Y., Berg, A.C., Berg, T.L.: Baby talk: understanding and generating simple image descriptions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1601–1608 (2011)Google Scholar
  24. 24.
    Lee, H., Morariu, V., Davis, L.S.: Clauselets: leveraging temporally related actions for video event analysis. In: Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1161–1168 (2015)Google Scholar
  25. 25.
    Lee, M.W., Hakeem, A., Haering, N., Zhu, S.C.: Save: a framework for semantic annotation of visual events. In: Proceedings of the Computer Vision and Pattern Recognition Workshops, pp. 1–8 (2008)Google Scholar
  26. 26.
    Li, S., Kulkarni, G., Berg, T.L., Berg, A.C., Choi, Y.: Composing simple image descriptions using web-scale n-grams. In: Proceedings of the 15th Conference on Computational Natural Language Learning, pp. 220–228 (2011)Google Scholar
  27. 27.
    Lin, C.Y.: Rouge: a package for automatic evaluation of summaries. In: Proceedings of the ACL-04 Workshop (2004)Google Scholar
  28. 28.
    Lin, D., Kong, C., Fidler, S., Urtasun, R.: Generating multi-sentence lingual descriptions of indoor scenes. arXiv preprint arXiv:1503.00064 (2015)
  29. 29.
    Marcus, M.P., Marcinkiewicz, M.A., Santorini, B.: Building a large annotated corpus of English: The Penn Treebank. Comput. Linguist. 19(2), 313–330 (1993)Google Scholar
  30. 30.
    Muller, P., Reymonet, A.: Using inference for evaluating models of temporal discourse. In: 12th International Symposium on Temporal Representation and Reasoning (2005)Google Scholar
  31. 31.
    Nevatia, R., Zhao, T., Hongeng, S.: Hierarchical language-based representation of events in video streams. In: Proceedings of the Conference on Computer Vision and Pattern Recognition Workshop, vol. 4, pp. 39–39 (2003)Google Scholar
  32. 32.
    Pustejovsky, J., Castano, J., Ingria, R., Saurí, R., Gaizauskas, R., Setzer, A., Katz, G., Radev, D.: TimeML: robust specification of event and temporal expressions in text. In: Proceedings of the 5th International Workshop on Computational Semantics (2003)Google Scholar
  33. 33.
    Pustejovsky, J., Ingria, B., Sauri, R., Castano, J., Littman, J., Gaizauskas, R., Setzer, A., Katz, G., Mani, I.: The Specification Language TimeML. The Language of Time: A Reader. Oxford University Press, Oxford (2004)Google Scholar
  34. 34.
    Rohrbach, A., Rohrbach, M., Qiu, W., Friedrich, A., Pinkal, M., Schiele, B.: Coherent Multi-Sentence Video Description with Variable Level of Detail. Pattern Recognition, Lecture Notes in Computer Science, vol. 8753, pp. 184–195. Springer (2014)Google Scholar
  35. 35.
    Rohrbach, M., Qiu, W., Titov, I., Thater, S., Pinkal, M., Schiele, B.: Translating video content to natural language descriptions. In: Computer Vision (ICCV), 2013 IEEE International Conference on, pp. 433–440. IEEE (2013)Google Scholar
  36. 36.
    Rosani, A., Conci, N., De Natale, F.G.B.: Human behavior understanding for assisted living by means of hierarchical context free grammars. In: IS&T/SPIE Electronic Imaging, pp. 90260E-90260E. International Society for Optics and Photonics (2014)Google Scholar
  37. 37.
    Singh, B., Han, X., Wu, Z., Morariu, V.I., Davis, L.S.: Selecting relevant web trained concepts for automated event retrieval. arXiv preprint arXiv:1509.07845 (2015)
  38. 38.
    Singh, D., Yadav, A.K., Kumar, V.: Human activity tracking using star skeleton and activity recognition using hmms and neural network. Int. J. Sci. Res. Publ. 4(5), 9 (2014)Google Scholar
  39. 39.
    Smeaton, A.F., Over, P., Kraaij, W.: High-level feature detection from video in TRECVid: a 5-year retrospective of achievements. In: Multimedia Content Analysis, pp. 1–24 (2009)Google Scholar
  40. 40.
    Stolcke, A.: SRILM—an extensible language modeling toolkit. In: Proceedings of the International Conference on Spoken Language Processing, vol. 2, pp. 901–904 (2002)Google Scholar
  41. 41.
    Tan, C.C., Jiang, Y.-G., Ngo, C.-W.: Towards textually describing complex video contents with audio-visual concept classifiers. In: Proceedings of the 19th ACM International Conference on Multimedia, pp. 655–658. ACM (2011)Google Scholar
  42. 42.
    Thomason, J., Venugopalan, S., Guadarrama, S., Saenko, K., Mooney, R.: Integrating language and vision to generate natural language descriptions of videos in the wild. In: Proceedings of the 25th International Conference on Computational Linguistics (COLING) (2014)Google Scholar
  43. 43.
    Thomason, J., Venugopalan, S., Guadarrama, S., Saenko, K., Mooney, R.: Integrating language and vision to generate natural language descriptions of videos in the wild. In: Proceedings of the 25th International Conference on Computational Linguistics (COLING), August (2014)Google Scholar
  44. 44.
    Vallacher, R.R., Wegner, D.M.: A Theory of Action Identification. Psychology Press, Hove (2014)Google Scholar
  45. 45.
    Verhagen, M., Mani, I., Sauri, R., Knippen, R., Jang, S.B., Littman, J., Rumshisky, A., Phillips, J., Pustejovsky, J.: Automating temporal annotation with TARSQI. In: Proceedings of the ACL 2005 on Interactive Poster and Demonstration Sessions, pp. 81–84 (2005)Google Scholar
  46. 46.
    Viola, P., Jones, M.: Rapid object detection using a boosted cascade of simple features. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 1 (2001)Google Scholar
  47. 47.
    Wilhelm, T., Böhme, H.-J., Gross, H.-M.: Classification of face images for gender, age, facial expression, and identity. In: Artificial Neural Networks: Biological Inspirations–ICANN 2005, pp. 569–574. Springer (2005)Google Scholar
  48. 48.
    Wise, M.J.: String similarity via greedy string tiling and running karp-rabin matching. Online Preprint, Dec (1993)Google Scholar
  49. 49.
    Yan, F., Mikolajczyk, K.: Leveraging high level visual information for matching images and captions. In: Computer Vision–ACCV 2014, pp. 613–627. Springer (2015)Google Scholar
  50. 50.
    Yang, Y., Teo, C.L., Daumé III, H., Fermüller, C., Aloimonos, Y.: Corpus-guided sentence generation of natural images. In: Proceedings of the EMNLP (2011)Google Scholar
  51. 51.
    Yang, Y., Guha, A., Fermuller, C., Aloimonos, Y.: A cognitive system for understanding human manipulation actions. Adv. Cognit. Syst. 3, 67–86 (2014)Google Scholar
  52. 52.
    Yao, B.Z., Yang, X., Lin, L., Lee, M.W., Zhu, S.C.: I2T: image parsing to text description. Proc. IEEE 98(8), 1485–1508 (2010)CrossRefGoogle Scholar
  53. 53.
    Yu, H., Siskind, J.M.: Grounded language learning from video described with sentences. In: ACL (1), pp. 53–63 (2013)Google Scholar
  54. 54.
    Zhang, L., Khan, M.U.G., Gotoh, Y.: Video scene classification based on natural language description. In: Computer Vision Workshops (ICCV Workshops), 2011 IEEE International Conference on, pp. 942–949. IEEE (2011)Google Scholar

Copyright information

© Springer-Verlag Berlin Heidelberg 2017

Authors and Affiliations

  1. 1.Department of Computer ScienceUniversity of Engineering & TechnologyLahorePakistan
  2. 2.Department of Computer ScienceUniversity of SheffieldSheffieldUnited Kingdom

Personalised recommendations