Deep Learning-Based Video Retrieval Using Object Relationships and Associated Audio Classes

  • Byoungjun Kim
  • Ji Yea Shim
  • Minho Park
  • Yong Man Ro
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11962)


This paper introduces a video retrieval tool for the 2020 Video Browser Showdown (VBS). The tool enhances the user’s video browsing experience by making full use of a video analysis database constructed prior to the Showdown. Deep learning-based object detection, scene text detection, scene color detection, audio classification, and relation detection with scene graph generation were used to build the database. The resulting data comprises visual, textual, and auditory information, broadening the scope of what a user can search for beyond visual information alone. In addition, the tool provides a simple, user-friendly interface that novice users can adapt to in little time.


Keywords: Scene graph · Scene text · Audio classification
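As a rough illustration of how a multimodal per-shot database of this kind might be queried, consider the following minimal sketch. The record fields, the `score_shot` function, and the `search` helper are hypothetical assumptions for illustration, not the authors' implementation:

```python
from dataclasses import dataclass, field

@dataclass
class ShotRecord:
    """Hypothetical per-shot entry: detections from each analysis modality."""
    shot_id: str
    objects: set = field(default_factory=set)        # detected object labels
    relations: set = field(default_factory=set)      # (subject, predicate, object) triples
    scene_text: set = field(default_factory=set)     # words found by scene text detection
    audio_classes: set = field(default_factory=set)  # predicted audio class labels

def score_shot(shot: ShotRecord, query_terms: set) -> int:
    """Count how many query terms match any modality of the shot."""
    haystack = shot.objects | shot.scene_text | shot.audio_classes
    haystack |= {word for triple in shot.relations for word in triple}
    return len(query_terms & haystack)

def search(db: list, query_terms: set, top_k: int = 5) -> list:
    """Return the ids of the top-k shots with at least one matching term."""
    ranked = sorted(db, key=lambda s: score_shot(s, query_terms), reverse=True)
    return [s.shot_id for s in ranked[:top_k] if score_shot(s, query_terms) > 0]
```

Because all modalities are flattened into one term set, a query such as `{"dog", "barking"}` can match a shot through its visual detections and its audio classes at the same time, which is the kind of cross-modal search the abstract describes.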



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. School of Electrical Engineering, KAIST, Daejeon, South Korea
