A Video Library System Using Scene Detection and Automatic Tagging

  • Lorenzo Baraldi
  • Costantino Grana
  • Rita Cucchiara
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 733)


We present a novel video browsing and retrieval system for edited videos, based on scene detection and automatic tagging. In the proposed system, database videos are automatically decomposed into meaningful storytelling parts (i.e., scenes) and automatically tagged by leveraging their transcripts. We rely on computer vision and machine learning techniques to learn the optimal scene boundaries: a Triplet Deep Neural Network is trained to distinguish video sequences belonging to the same scene from sequences belonging to different scenes, by exploiting multimodal features from images, audio and captions. The system also features a user interface built as a set of extensions to the eXo Platform Enterprise Content Management System (ECMS). This set of extensions enables the interactive visualization of a video, its automatic and semi-automatic annotation, as well as a keyword-based search inside the video collection. The platform also allows a natural integration with third-party add-ons, so that automatic annotations can be exploited outside the proposed platform.
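To make the training objective mentioned above more concrete, the following is a minimal sketch of a generic triplet margin loss on embedding vectors. It is an illustration under stated assumptions, not the network described in the paper: the function names, the Euclidean distance, the margin value and the toy two-dimensional embeddings are all hypothetical, and the abstract does not specify how the multimodal features are computed or combined.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Generic triplet margin loss (the margin value is an assumption).

    The loss is zero once the anchor is closer to the positive
    (same scene) than to the negative (different scene) by at
    least `margin`; otherwise it grows with the violation.
    """
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Toy embeddings (hypothetical): two sequences from the same scene
# and one sequence from a different scene.
anchor = [0.0, 0.0]
same_scene = [0.0, 1.0]
other_scene = [3.0, 4.0]

# Well-separated triplet: no penalty.
loss_satisfied = triplet_loss(anchor, same_scene, other_scene)

# Roles swapped: the "positive" is far away, so the loss is large.
loss_violated = triplet_loss(anchor, other_scene, same_scene)
```

In a trained model the vectors would be outputs of the network rather than hand-written lists; minimizing this kind of loss pulls same-scene sequences together and pushes different-scene sequences apart in the embedding space.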


Keywords: Scene detection, Tagging, Video browsing, Interfaces



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Lorenzo Baraldi (1)
  • Costantino Grana (1)
  • Rita Cucchiara (1)
  1. Dipartimento di Ingegneria “Enzo Ferrari”, Università degli Studi di Modena e Reggio Emilia, Modena, Italy
