
Multimedia Tools and Applications, Volume 75, Issue 19, pp 11961–11976

Video parsing via spatiotemporal analysis with images

  • Xuelong Li
  • Lichao Mou
  • Xiaoqiang Lu (corresponding author)
Article

Abstract

Effective parsing of video through the spatial and temporal domains is vital to many computer vision problems, because it allows objects in video to be labeled automatically rather than by hand, which is tedious. Some works propose to parse semantic information on individual 2D images or individual video frames; however, these approaches use only the spatial information, ignoring temporal continuity and the relevance between frames. On the other hand, approaches that consider only spatial information and attempt to propagate labels through the temporal domain to parse the semantic information of a whole video suffer from the non-injective and non-surjective nature of the propagation, which can cause the black hole effect. In this paper, inspired by annotated image datasets (e.g., Stanford Background Dataset, LabelMe, and SIFT-FLOW), we propose to transfer or propagate such labels from images to videos. The proposed approach consists of three main stages: I) the posterior category probability density function (PDF) is learned by an algorithm that combines frame relevance with label propagation from images; II) the prior contextual-constraint PDF on the map of pixel categories over the whole video is learned with a Markov Random Field (MRF); III) based on both learned PDFs, the final parsing results are obtained by maximum a posteriori (MAP) inference, computed with a very efficient graph-cut-based integer optimization algorithm. The experiments show that the black hole effect can be effectively handled by the proposed approach.
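The core of stage III is MAP inference over a pixel-label field: a per-pixel data cost from the learned posterior PDF plus a Potts-style smoothness prior from the MRF. The sketch below is a minimal, hypothetical illustration of that energy model; the function name, array layout, and the use of iterated conditional modes (ICM) as the minimizer are all assumptions for readability — the paper itself uses a graph-cut-based integer optimization, which reaches stronger optima than ICM.

```python
import numpy as np

def parse_labels(unary, lam=1.0, n_iter=10):
    """MAP labeling of an H x W pixel grid under a Potts MRF (sketch).

    unary : (H, W, K) array of per-pixel label costs, e.g. the negative
            log posterior category probabilities from stage I.
    lam   : weight of the pairwise Potts smoothness term (stage II prior).

    Minimizes E(L) = sum_p unary[p, L_p] + lam * sum_{p~q} [L_p != L_q]
    with ICM, a simple stand-in for the paper's graph-cut solver.
    """
    H, W, K = unary.shape
    labels = unary.argmin(axis=2)              # initialize at the unary MAP
    for _ in range(n_iter):
        changed = False
        for y in range(H):
            for x in range(W):
                best_k, best_e = labels[y, x], None
                for k in range(K):
                    e = unary[y, x, k]
                    # Potts penalty against the 4-connected neighborhood
                    for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < H and 0 <= nx < W:
                            e += lam * (labels[ny, nx] != k)
                    if best_e is None or e < best_e:
                        best_k, best_e = k, e
                if best_k != labels[y, x]:
                    labels[y, x] = best_k
                    changed = True
        if not changed:                        # converged to a local optimum
            break
    return labels
```

With a strong smoothness weight, an isolated pixel whose unary term weakly prefers a dissenting label is overruled by its neighbors — the same mechanism by which the MRF prior suppresses isolated labeling errors (the black hole effect) in the full pipeline.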

Keywords

Semantic video parsing · Transfer learning · Maximum a posteriori (MAP) inference · Markov Random Fields (MRF) · Prior contextual constraint


Acknowledgments

This work was supported in part by the National Basic Research Program of China (973 Program) under Grant 2012CB719905, in part by the National Natural Science Foundation of China under Grant 61472413, in part by Chinese Academy of Sciences under Grant LSIT201408 and in part by the Key Research Program of the Chinese Academy of Sciences under Grant KGZD-EW-T03.


Copyright information

© Springer Science+Business Media New York 2015

Authors and Affiliations

  1. Center for OPTical IMagery Analysis and Learning (OPTIMAL), State Key Laboratory of Transient Optics and Photonics, Xi’an Institute of Optics and Precision Mechanics, Chinese Academy of Sciences, Xi’an, People’s Republic of China
