Frontiers of Computer Science, Volume 13, Issue 2, pp 302–317

Soft video parsing by label distribution learning

  • Miaogen Ling
  • Xin Geng
Research Article

Abstract

In this paper, we tackle the problem of segmenting a sequence of actions out of videos. The videos contain background and actions, which are usually composed of ordered sub-actions. We refer to the sub-actions and the background as semantic units. Considering the possible overlap between two adjacent semantic units, we propose a bidirectional sliding window method to generate label distributions for various segments of the video. Each label distribution covers a certain number of semantic unit labels and represents the degree to which each label describes the video segment. The mapping from a video segment to its label distribution is then learned by a Label Distribution Learning (LDL) algorithm. Based on the LDL model, a soft video parsing method with segmental regular grammars is proposed to construct a tree structure for the video. Each leaf of the tree stands for a video clip of background or a sub-action. The proposed method shows promising results on the THUMOS’14, MSR-II, and UCF101 datasets, and its computational complexity is much lower than that of the compared state-of-the-art video parsing method.
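
To make the notion of a label distribution concrete, the sketch below assigns a video segment a description degree for each semantic unit in proportion to how many of the segment's frames that unit covers. This is only an illustrative assumption, not the bidirectional sliding window procedure of the paper, whose details are not given in the abstract. The annotation format, the function name segment_label_distribution, and the unit names ("run-up", "jump") are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical annotation format: (label, start_frame, end_frame), end exclusive.
Unit = Tuple[str, int, int]


def segment_label_distribution(units: List[Unit], start: int, end: int) -> Dict[str, float]:
    """Return a label distribution for the segment [start, end).

    Assumption for illustration: the description degree of a label is the
    fraction of the segment's annotated frames covered by that semantic unit.
    """
    if end <= start:
        raise ValueError("segment must contain at least one frame")

    overlap: Dict[str, float] = {}
    for label, u_start, u_end in units:
        # Number of frames shared by the segment and this semantic unit.
        frames = min(end, u_end) - max(start, u_start)
        if frames > 0:
            overlap[label] = overlap.get(label, 0.0) + frames

    total = sum(overlap.values())
    if total == 0:
        return {}  # the segment touches no annotated unit
    # Normalize so that the description degrees sum to 1.
    return {label: frames / total for label, frames in overlap.items()}


# A toy video: background followed by two ordered sub-actions (names are made up).
units = [("background", 0, 40), ("run-up", 40, 70), ("jump", 70, 100)]

# A segment straddling the boundary between background and the first sub-action.
dist = segment_label_distribution(units, 30, 50)
print(dist)
```

A segment that straddles a boundary receives non-zero degrees for both neighbouring units, which is the kind of soft labelling that an LDL model is then trained to predict from the segment's features.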

Keywords

video parsing, label distribution learning, subactions, graduality

Notes

Acknowledgements

This research was supported by the National Key Research & Development Plan of China (2017YFB1002801), the National Science Foundation of China (61622203, 61232007), the Jiangsu Natural Science Funds for Distinguished Young Scholar (BK20140022), the Collaborative Innovation Center of Novel Software Technology and Industrialization, and the Collaborative Innovation Center of Wireless Communications Technology.

Supplementary material

11704_2018_8015_MOESM1_ESM.ppt (528 kb)
Supplementary material, approximately 527 KB.

Copyright information

© Higher Education Press and Springer-Verlag GmbH Germany, part of Springer Nature 2018

Authors and Affiliations

  1. Department of Computer Science and Engineering, Southeast University, Nanjing, China