
Natural Language Description of Surveillance Events

Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 699)

Abstract

This paper presents a novel method for representing hours of surveillance video as a pattern-based text log. We present a tag- and template-based technique that automatically generates natural language descriptions of surveillance events. We combine the outputs of an existing object tracker, deep-learning-guided object and action classifiers, and graph-based scene knowledge to assign hierarchical tags and generate natural language descriptions of surveillance events. Unlike state-of-the-art image and short-video description methods, our approach can describe longer videos, specifically surveillance videos, by combining frame-level, temporal-level, and behavior-level target tags/features. We evaluate our method against two baseline video descriptors, and our analysis suggests that supervised scene knowledge and templates can improve video descriptions, especially for surveillance videos.
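To make the tag-and-template idea concrete, the sketch below shows how hierarchical tags (frame-level object class, temporal-level action, behavior-level event, plus a scene zone) could be slotted into per-behavior sentence templates. This is a minimal illustrative sketch, not the authors' implementation: the field names, template strings, and event structure are assumptions introduced here for exposition only.

```python
# Minimal sketch of a tag-and-template description generator.
# All names (Event fields, TEMPLATES keys, example values) are hypothetical.

from dataclasses import dataclass


@dataclass
class Event:
    frame_tag: str      # e.g. object class from a per-frame classifier
    temporal_tag: str   # e.g. action/motion label over a clip
    behavior_tag: str   # e.g. higher-level behavior inferred with scene knowledge
    zone: str           # scene region (e.g. from a supervised scene graph)
    start: str          # timestamps taken from the video log
    end: str


# One sentence template per behavior-level tag; slots are filled from the event's tags.
TEMPLATES = {
    "loitering": "A {frame_tag} was {temporal_tag} near the {zone} from {start} to {end}.",
    "crossing":  "A {frame_tag} {temporal_tag} across the {zone} between {start} and {end}.",
}


def describe(event: Event) -> str:
    """Select a template by the behavior-level tag and fill its slots."""
    template = TEMPLATES.get(
        event.behavior_tag,
        "A {frame_tag} was observed in the {zone} from {start} to {end}.",
    )
    return template.format(**vars(event))


if __name__ == "__main__":
    e = Event(frame_tag="person", temporal_tag="standing still",
              behavior_tag="loitering", zone="parking entrance",
              start="10:42:05", end="10:47:31")
    print(describe(e))
    # -> A person was standing still near the parking entrance from 10:42:05 to 10:47:31.
```

In such a scheme, hours of footage reduce to one short sentence per detected event, which is what makes a searchable, pattern-based text log of the surveillance stream feasible.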

Keywords

Video to text · Surveillance video description · Video description


Copyright information

© Springer Nature Singapore Pte Ltd. 2019

Authors and Affiliations

  1. National Institute of Technology Durgapur, Durgapur, India
  2. Indian Institute of Technology Bhubaneswar, Bhubaneswar, India
  3. Indian Institute of Technology Roorkee, Roorkee, India
