World Wide Web, Volume 21, Issue 5, pp 1259–1284

Exploiting detected visual objects for frame-level video filtering

  • Xingzhong Du
  • Hongzhi Yin
  • Zi Huang
  • Yi Yang
  • Xiaofang Zhou


Videos are being generated on the web at an unprecedented rate. To make access more efficient, developing new ways to filter videos has become a popular research topic. One ongoing direction is to use visual objects to perform frame-level video filtering. In this direction, existing works create a unique object table and an occurrence table to maintain the connections between videos and objects. However, the creation process is neither scalable nor dynamic because it depends heavily on human labeling. To address this, we propose using detected visual objects to create these two tables for frame-level video filtering. Our study begins by investigating existing object detection techniques. We find that object detection alone lacks the identification and connection abilities needed to complete the creation process. To supply these abilities, we further investigate three candidates, namely recognizing-based, matching-based and tracking-based methods, to work alongside object detection. By analyzing their mechanisms and evaluating their accuracy, we find that each is imperfect at identifying or connecting visual objects. Accordingly, we propose a novel hybrid method that combines the matching-based and tracking-based methods to overcome these limitations. Our experiments show that the proposed method achieves higher accuracy and efficiency than the candidate methods. Subsequent analysis shows that the proposed method can efficiently support frame-level video filtering using visual objects.
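As a rough illustration of the two tables mentioned above, the following Python sketch keeps a unique object table and an occurrence table in memory and answers a frame-level filter query. It is only a sketch under assumptions: the class `ObjectTables`, its fields, and the methods `add_detection` and `frames_with_all` are hypothetical names that do not come from the paper, and the object identifiers are assumed to have already been assigned by some identification step (for example, the hybrid method described in the abstract).

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class ObjectTables:
    """Hypothetical in-memory stand-in for the two tables (not the paper's schema)."""

    # Unique object table: object_id -> label, one entry per distinct object.
    unique_objects: dict = field(default_factory=dict)
    # Occurrence table: object_id -> set of (video_id, frame_no) where it appears.
    occurrences: dict = field(default_factory=lambda: defaultdict(set))

    def add_detection(self, object_id, label, video_id, frame_no):
        """Record that an already-identified object was detected in one frame."""
        self.unique_objects.setdefault(object_id, label)
        self.occurrences[object_id].add((video_id, frame_no))

    def frames_with_all(self, object_ids):
        """Return sorted (video_id, frame_no) pairs that contain every requested object."""
        sets = [self.occurrences.get(oid, set()) for oid in object_ids]
        if not sets:
            return []
        return sorted(set.intersection(*sets))


# Toy data, invented purely for illustration.
tables = ObjectTables()
tables.add_detection(1, "person", "v1", 10)
tables.add_detection(1, "person", "v1", 11)
tables.add_detection(2, "car", "v1", 11)
tables.add_detection(2, "car", "v2", 3)

# Frames in which both object 1 and object 2 occur.
print(tables.frames_with_all([1, 2]))  # [('v1', 11)]
```

Keying the occurrence table by object identifier makes a filter query a set intersection over frames, which is one simple way to read "frame-level video filtering using visual objects"; the paper's own storage layout may differ.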


Keywords: Frame-level video filtering; Visual object; Accuracy and efficiency evaluation



This research is jointly supported by the Australian Research Council (Grant No. DP150103008 and DP17010395), the ARC Discovery Early Career Researcher Award (Grant No. DE160100308), the New Staff Research Grant of the University of Queensland (Grant No. 613134), and the National Natural Science Foundation of China (Grant No. 61572335).



Copyright information

© Springer Science+Business Media, LLC 2017

Authors and Affiliations

  • Xingzhong Du (1)
  • Hongzhi Yin (1)
  • Zi Huang (1)
  • Yi Yang (2)
  • Xiaofang Zhou (1, 3)

  1. School of Information Technology and Electrical Engineering, University of Queensland, St Lucia, Australia
  2. Centre for Artificial Intelligence, University of Technology Sydney, Ultimo, Australia
  3. School of Computer Science and Technology, Soochow University, Suzhou, China
