
Signal, Image and Video Processing, Volume 12, Issue 5, pp 991–999

Using deep features for video scene detection and annotation

  • Stanislav Protasov
  • Adil Mehmood Khan
  • Konstantin Sozykin
  • Muhammad Ahmad
Original Paper

Abstract

The semantic video indexing problem remains underexplored, and solutions to it would significantly enrich video search, monitoring, and surveillance. This paper addresses scene detection and annotation, and specifically the task of video structure mining for video indexing using deep features. We propose and implement a pipeline consisting of feature extraction and filtering, shot clustering, and labeling stages, with a deep convolutional network serving as the source of features. The pipeline is evaluated with metrics for both scene detection and annotation, and the results show high quality on both tasks under various metrics. We also review and analyze contemporary segmentation and annotation metrics. The outcome of this work can be applied to semantic video annotation in real time.
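The pipeline described in the abstract (deep feature extraction, filtering, shot clustering, labeling) can be illustrated with a short sketch. The Python code below is only an illustration of that structure, not the authors' implementation: extract_shot_features is a stub standing in for a real deep convolutional network (the abstract does not name one), and the smoothing window, the agglomerative clustering choice, and the class names are assumptions introduced for this example.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# --- Stage 1: feature extraction (stub) --------------------------------------
# In the real pipeline a deep convolutional network produces one descriptor per
# shot key frame; here random vectors stand in for those deep features.
def extract_shot_features(num_shots: int, dim: int = 512) -> np.ndarray:
    rng = np.random.default_rng(0)
    return rng.normal(size=(num_shots, dim))

# --- Stage 2: filtering -------------------------------------------------------
# A simple moving-average filter smooths the per-shot descriptors so that
# adjacent shots of the same scene look more alike (one plausible filter).
def smooth_features(features: np.ndarray, window: int = 3) -> np.ndarray:
    kernel = np.ones(window) / window
    return np.apply_along_axis(
        lambda col: np.convolve(col, kernel, mode="same"), axis=0, arr=features
    )

# --- Stage 3: shot clustering into scenes -------------------------------------
def cluster_shots(features: np.ndarray, n_scenes: int) -> np.ndarray:
    clusterer = AgglomerativeClustering(n_clusters=n_scenes, linkage="average")
    return clusterer.fit_predict(features)

# --- Stage 4: scene labeling ---------------------------------------------------
# Each scene is annotated with the class whose per-shot scores dominate within
# the scene, i.e. a majority vote over (hypothetical) shot-level predictions.
def label_scenes(scene_ids: np.ndarray, shot_scores: np.ndarray,
                 class_names: list[str]) -> dict[int, str]:
    labels = {}
    for scene in np.unique(scene_ids):
        mean_scores = shot_scores[scene_ids == scene].mean(axis=0)
        labels[int(scene)] = class_names[int(mean_scores.argmax())]
    return labels

if __name__ == "__main__":
    num_shots, classes = 20, ["office", "street", "forest"]  # hypothetical labels
    feats = smooth_features(extract_shot_features(num_shots))
    scenes = cluster_shots(feats, n_scenes=4)
    rng = np.random.default_rng(1)
    scores = rng.random((num_shots, len(classes)))  # stand-in classifier output
    print(label_scenes(scenes, scores, classes))
```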

Keywords

Deep convolutional networks · Scene detection · Image recognition · Semantic mining

Copyright information

© Springer-Verlag London Ltd., part of Springer Nature 2018

Authors and Affiliations

  1. Innopolis University, Innopolis, Russia
