Signal, Image and Video Processing

Volume 12, Issue 5, pp 991–999

Using deep features for video scene detection and annotation

  • Stanislav Protasov
  • Adil Mehmood Khan
  • Konstantin Sozykin
  • Muhammad Ahmad
Original Paper


Semantic video indexing remains underexplored, yet solutions to this problem would significantly enrich video search, monitoring, and surveillance. This paper addresses scene detection and annotation, and specifically the task of mining video structure for indexing using deep features. We propose and implement a pipeline consisting of feature extraction and filtering, shot clustering, and labeling stages, with a deep convolutional network serving as the source of features. The pipeline is evaluated with metrics for both scene detection and annotation, and the results show high quality on both tasks under various metrics. Additionally, we survey and analyze contemporary segmentation and annotation metrics. The outcome of this work can be applied to semantic video annotation in real time.
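The pipeline stages described above can be illustrated with a minimal sketch. This is not the paper's implementation: the feature vectors here stand in for deep CNN descriptors, the similarity threshold and the majority-vote labeling are illustrative assumptions, and consecutive-shot grouping is only one simple clustering strategy.

```python
import numpy as np
from collections import Counter


def cluster_shots(features, threshold=0.8):
    """Group consecutive shots into scenes.

    `features` is an (n_shots, dim) array of per-shot feature vectors
    (e.g. pooled CNN activations). A new scene starts whenever the cosine
    similarity between a shot and its predecessor drops below `threshold`.
    Returns one scene id per shot.
    """
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.clip(norms, 1e-12, None)  # unit-normalize for cosine
    scene_ids = [0]
    for i in range(1, len(unit)):
        sim = float(unit[i] @ unit[i - 1])
        scene_ids.append(scene_ids[-1] if sim >= threshold else scene_ids[-1] + 1)
    return scene_ids


def label_scenes(scene_ids, shot_labels):
    """Annotate each scene with the most frequent shot-level label."""
    per_scene = {}
    for sid, label in zip(scene_ids, shot_labels):
        per_scene.setdefault(sid, []).append(label)
    return {sid: Counter(labels).most_common(1)[0][0]
            for sid, labels in per_scene.items()}


# Example with hypothetical 2-D features: two visually similar shots,
# then a sharp change, then two more similar shots.
feats = np.array([[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.1, 0.95]])
ids = cluster_shots(feats)            # → [0, 0, 1, 1]
annotations = label_scenes(ids, ["road", "road", "kitchen", "kitchen"])
```

Because each shot is compared only with its predecessor, this sketch runs in a single pass and could, in principle, operate online, which is in the spirit of the real-time annotation goal stated in the abstract.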


Keywords: Deep convolutional networks · Scene detection · Image recognition · Semantic mining



Copyright information

© Springer-Verlag London Ltd., part of Springer Nature 2018

Authors and Affiliations

  1. Innopolis University, Innopolis, Russia
