Deep Learning-Based Concept Detection in vitrivr

  • Luca RossettoEmail author
  • Mahnaz Amiri Parian
  • Ralph Gasser
  • Ivan Giangreco
  • Silvan Heller
  • Heiko Schuldt
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11296)


This paper presents the most recent additions to the vitrivr retrieval stack, which will be put to the test in the context of the 2019 Video Browser Showdown (VBS). The vitrivr stack has been extended by approaches for detecting, localizing, or describing concepts and actions in video scenes using various convolutional neural networks. Leveraging those additions, we have added support for searching the video collection based on semantic sketches. Furthermore, vitrivr offers new types of labels for text-based retrieval. In the same vein, we have also improved upon vitrivr’s pre-existing capabilities for extracting text from video through scene text recognition. Moreover, the user interface has received a major overhaul so as to make it more accessible to novice users, especially for query formulation and result exploration.


  1. 1.
    Abadi, M., Barham, P., Chen, J., et al.: Tensorflow: a system for large-scale machine learning. In: Proceedings of the USENIX Symposium on Operating Systems Design and Implementation (OSDI), vol. 16, pp. 265–283. USENIX, Savannah, GA, USA (2016)Google Scholar
  2. 2.
    Chen, L.-C., Zhu, Y., Papandreou, G., Schroff, F., Adam, H.: Encoder-decoder with atrous separable convolution for semantic image segmentation. In: Proceedings of the 15th European Conference on Computer Vision (ECCV), Munich, Germany (2018, page to appear)Google Scholar
  3. 3.
    Cobârzan, C., et al.: Interactive video search tools: a detailed analysis of the video browser showdown 2015. Multimedia Tools and Appl. (MTAP) 76(4), 5539–5571 (2017)Google Scholar
  4. 4.
    Cordts, M., et al.: The cityscapes dataset for semantic urban scene understanding. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3213–3223. IEEE, Las Vegas (2016)Google Scholar
  5. 5.
    Mark Everingham, S.M., Eslami, A., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A.: The Pascal visual object classes challenge: a retrospective. Int. J. Comput. Vis. (IJCV) 111(1), 98–136 (2015)CrossRefGoogle Scholar
  6. 6.
    Furuta, R., Inoue, N., Yamasaki, T.: Efficient and interactive spatial-semantic image retrieval. In: Schoeffmann, K., et al. (eds.) MMM 2018. LNCS, vol. 10704, pp. 190–202. Springer, Cham (2018). Scholar
  7. 7.
    Giangreco, I., Schuldt, H.: ADAM\(_{pro}\): database support for big multimedia retrieval. Datenbank-Spektrum 16(1), 17–26 (2016)Google Scholar
  8. 8.
    Jaderberg, M., Simonyan, K., Vedaldi, A., Zisserman, A.: Synthetic data and artificial neural networks for natural scene text recognition. arXiv preprint arXiv:1406.2227, pp. 1–10 (2014)
  9. 9.
    Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., Li, F.-F.: Large-scale video classification with convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1725–1732. IEEE, Columbus (2014)Google Scholar
  10. 10.
    Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). Scholar
  11. 11.
    van der Maaten, L., Hinton, G.: Visualizing data using t-SNE. J. Mach. Learn. Res. (JMLR) 9(Nov), 2579–2605 (2008)zbMATHGoogle Scholar
  12. 12.
    Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, pp. 1–12 (2013)
  13. 13.
    Rossetto, L., Giangreco, I., Gasser, R., Schuldt, H.: Competitive video retrieval with vitrivr. In: Schoeffmann, K., et al. (eds.) MMM 2018. LNCS, vol. 10705, pp. 403–406. Springer, Cham (2018). Scholar
  14. 14.
    Rossetto, L., et al.: IMOTION – searching for video sequences using multi-shot sketch queries. In: Tian, Q., Sebe, N., Qi, G.-J., Huet, B., Hong, R., Liu, X. (eds.) MMM 2016. LNCS, vol. 9517, pp. 377–382. Springer, Cham (2016). Scholar
  15. 15.
    Rossetto, L., Giangreco, I., Schuldt, H.: Cineast: a multi-feature sketch-based video retrieval engine. In: Proceedings of the International Symposium on Multimedia (ISM), pp. 18–23. IEEE, Taichung, December 2014Google Scholar
  16. 16.
    Rossetto, L., et al.: IMOTION — a content-based video retrieval engine. In: He, X., Luo, S., Tao, D., Xu, C., Yang, J., Hasan, M.A. (eds.) MMM 2015. LNCS, vol. 8936, pp. 255–260. Springer, Cham (2015). Scholar
  17. 17.
    Rossetto, L., Giangreco, I., Tănase, C., Schuldt, H.: vitrivr: a flexible retrieval stack supporting multiple query modesfor searching in multimedia collections. In: Proceedings of the ACM Conference on Multimedia Conference (ACM MM), pp. 1183–1186. ACM, Amsterdam, October 2016Google Scholar
  18. 18.
    Rossetto, L., Giangreco, I., Tănase, C., Schuldt, H., Dupont, S., Seddati, O.: Enhanced retrieval and browsing in the IMOTION system. In: Amsaleg, L., Guðmundsson, G.Þ., Gurrin, C., Jónsson, B.Þ., Satoh, S. (eds.) MMM 2017. LNCS, vol. 10133, pp. 469–474. Springer, Cham (2017). Scholar
  19. 19.
    Shi, B., Bai, X., Yao, C.: An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI) 39(11), 2298–2304 (2017)CrossRefGoogle Scholar
  20. 20.
    Tran, D., Bourdev, L., Fergus, R., Torresani, L., Paluri, M.: Learning spatiotemporal features with 3D convolutional networks. In: Proceedings of the International Conference on Computer Vision (ICCV), pp. 4489–4497. IEEE, Santiago (2015)Google Scholar
  21. 21.
    Vinyals, O., Toshev, A., Bengio, S., Erhan, D.: Show and tell: lessons learned from the 2015 MSCOCO image captioning challenge. IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI) 39(4), 652–663 (2017)CrossRefGoogle Scholar
  22. 22.
    Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ADE20K dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1, p. 4. IEEE, Honolulu (2017)Google Scholar
  23. 23.
    Zhou, X., et al.: East: an efficient and accurate scene text detector. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2642–2651. IEEE, Honolulu (2017)Google Scholar

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Luca Rossetto
    • 1
    Email author
  • Mahnaz Amiri Parian
    • 1
    • 2
  • Ralph Gasser
    • 1
  • Ivan Giangreco
    • 1
  • Silvan Heller
    • 1
  • Heiko Schuldt
    • 1
  1. 1.Databases and Information Systems Research Group, Department of Mathematics and Computer ScienceUniversity of BaselBaselSwitzerland
  2. 2.Numediart InstituteUniversity of MonsMonsBelgium

Personalised recommendations