Multimedia Tools and Applications, Volume 76, Issue 5, pp 6111–6126

Recognizing key segments of videos for video annotation by learning from web image sets



In this paper, we propose an approach that infers the labels of unlabeled consumer videos and simultaneously recognizes the key segments of those videos by learning from Web image sets for video annotation. The key segments are recognized automatically by transferring knowledge learned from related Web image sets to the videos. We introduce an adaptive latent structural SVM method that adapts classifiers pre-learned on Web image sets to an optimal target classifier; the locations of the key segments are modeled as latent variables because ground-truth key segments are not available. We train the annotation models with a limited number of labeled videos and abundant labeled Web images, which significantly alleviates the time-consuming and labor-intensive collection of a large number of labeled training videos. Experiments on two challenging datasets, Columbia’s Consumer Video (CCV) and TRECVID 2014 Multimedia Event Detection (MED2014), show that our method performs better than state-of-the-art methods.
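The abstract does not give the model's equations, so the following is only a minimal sketch of the general idea of treating the key-segment location as a latent variable. It assumes, purely for illustration, that the adapted classifier is a weighted sum of linear classifiers pre-learned on Web image sets plus a learned correction term, and that the key segment is the one with the highest adapted score. The names `adapted_score`, `infer_key_segment`, `gammas`, and `delta_w` are hypothetical and do not come from the paper.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def adapted_score(segment: Vector,
                  source_weights: List[Vector],
                  gammas: Sequence[float],
                  delta_w: Vector) -> float:
    """Score one segment with an assumed adapted linear classifier.

    Assumption: the target classifier combines the pre-learned source
    classifiers (one weight vector per Web image set) with mixing
    weights `gammas`, plus a correction `delta_w` fitted on the videos.
    """
    source_part = sum(g * dot(w, segment)
                      for g, w in zip(gammas, source_weights))
    return source_part + dot(delta_w, segment)


def infer_key_segment(segments: List[Vector],
                      source_weights: List[Vector],
                      gammas: Sequence[float],
                      delta_w: Vector) -> Tuple[int, float]:
    """Latent inference: pick the segment index with the highest score.

    The index plays the role of the latent variable; the maximal score
    is used as the video-level score.
    """
    scores = [adapted_score(s, source_weights, gammas, delta_w)
              for s in segments]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


# Toy data (made up): three segments with 2-D features, two source classifiers.
segments = [[0.1, 0.9], [0.8, 0.2], [0.4, 0.4]]
source_weights = [[1.0, -1.0], [0.5, 0.0]]
gammas = [0.5, 0.5]
delta_w = [0.1, 0.0]

key_index, video_score = infer_key_segment(
    segments, source_weights, gammas, delta_w)
print(key_index, round(video_score, 3))
```

In a full latent SVM training loop, an inference step of this kind would typically alternate with re-fitting the classifier parameters given the currently selected segments; that loop is omitted here.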


Keywords: Video annotation, Key segment, Image set, Transfer learning



This work was supported in part by the 973 Program of China under Grant No. 2012CB720000, the Natural Science Foundation of China (NSFC) under Grant Nos. 61375044 and 61472038, the Specialized Research Fund for the Doctoral Program of Higher Education of China (20121101120029), the Specialized Fund for Joint Building Program of Beijing Municipal Education Commission, and the Excellent Young Scholars Research Fund of BIT (2013).



Copyright information

© Springer Science+Business Media New York 2016

Authors and Affiliations

  1. Beijing Laboratory of Intelligent Information Technology, School of Computer Science, Beijing Institute of Technology, Beijing, China
