Webly-Supervised Video Recognition by Mutually Voting for Relevant Web Images and Web Video Frames

  • Chuang Gan
  • Chen Sun
  • Lixin Duan
  • Boqing Gong
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9907)


Video recognition usually requires a large amount of training samples, which are expensive to collect. An alternative and cheap solution is to draw on the large-scale images and videos available on the Web. With modern search engines, the top-ranked images or videos are usually highly correlated with the query, suggesting the potential to harvest labeling-free Web images and videos for video recognition. However, two key difficulties prevent us from using the Web data directly. First, the data are typically noisy and may come from a domain completely different from that of the user’s interest (e.g., cartoons). Second, Web videos are usually untrimmed and very lengthy, with query-relevant frames often hidden among irrelevant ones. A question thus naturally arises: to what extent can such noisy Web images and videos be utilized for labeling-free video recognition? In this paper, we propose a novel approach in which relevant Web images and video frames mutually vote for each other, balancing two forces: aggressive matching and passive video frame selection. We validate our approach on three large-scale video recognition datasets.
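The abstract does not spell out the voting mechanism, but the mutual-voting idea can be illustrated with a minimal sketch. The version below is an assumption for illustration only, not the paper's exact formulation: each Web image votes for its k most similar video frames (aggressive matching), each frame votes back for its k most similar images (passive selection), and items receiving no votes are discarded as noise.

```python
import numpy as np

def mutual_vote(image_feats, frame_feats, k=3, min_votes=1):
    """Illustrative mutual-voting filter (a sketch, not the paper's method).

    Each Web image votes for its k most similar video frames, and each
    frame votes for its k most similar Web images. Items receiving at
    least `min_votes` votes are kept as mutually relevant.
    """
    # Cosine similarity between every image and every frame.
    a = image_feats / np.linalg.norm(image_feats, axis=1, keepdims=True)
    b = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    sim = a @ b.T  # shape: (n_images, n_frames)

    # Images vote for their top-k frames ("aggressive matching").
    frame_votes = np.zeros(b.shape[0], dtype=int)
    for i in range(a.shape[0]):
        frame_votes[np.argsort(-sim[i])[:k]] += 1

    # Frames vote back for their top-k images ("passive selection").
    image_votes = np.zeros(a.shape[0], dtype=int)
    for j in range(b.shape[0]):
        image_votes[np.argsort(-sim[:, j])[:k]] += 1

    keep_images = np.flatnonzero(image_votes >= min_votes)
    keep_frames = np.flatnonzero(frame_votes >= min_votes)
    return keep_images, keep_frames
```

In this toy version, a noisy frame (e.g., from an irrelevant portion of an untrimmed video) attracts no votes from any image and is filtered out, while frames resembling the retrieved Web images survive.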



This work was supported in part by NSF IIS-1566511. Chuang Gan was partially supported by the National Basic Research Program of China under Grants 2011CBA00300 and 2011CBA00301, and by the National Natural Science Foundation of China under Grants 61033001 and 61361136003.



Copyright information

© Springer International Publishing AG 2016

Authors and Affiliations

  1. IIIS, Tsinghua University, Beijing, China
  2. Google Research, Mountain View, USA
  3. Amazon, Seattle, USA
  4. CRCV, University of Central Florida, Orlando, USA
