Soft video parsing by label distribution learning

  • Research Article
  • Frontiers of Computer Science

Abstract

In this paper, we tackle the problem of segmenting a sequence of actions out of videos. The videos contain background and actions, and the actions are usually composed of ordered sub-actions. We refer to the sub-actions and the background as semantic units. Because two adjacent semantic units may overlap, we propose a bidirectional sliding window method to generate label distributions for various segments in the video. A label distribution covers a certain number of semantic unit labels and represents the degree to which each label describes the video segment. The mapping from a video segment to its label distribution is then learned by a Label Distribution Learning (LDL) algorithm. Based on the LDL model, a soft video parsing method with segmental regular grammars is proposed to construct a tree structure for the video. Each leaf of the tree stands for a video clip of background or of a sub-action. The proposed method shows promising results on the THUMOS’14, MSR-II and UCF101 datasets, and its computational complexity is much lower than that of the state-of-the-art video parsing method it is compared with.
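To make the notion of a label distribution concrete, the following is a minimal Python sketch. It is not the authors' algorithm: the abstract does not specify how description degrees are computed, so the sketch assumes that the degree of a label is the fraction of the segment's frames covered by that semantic unit, and it reads "bidirectional" as one forward and one backward pass of a fixed-width window. The names `label_distribution` and `sliding_windows`, the annotation format, and all parameters are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical annotation format: (label, start_frame, end_frame), end exclusive.
Unit = Tuple[str, int, int]


def label_distribution(units: List[Unit], seg_start: int, seg_end: int) -> Dict[str, float]:
    """Assumed description degree: fraction of [seg_start, seg_end) covered by each unit."""
    length = seg_end - seg_start
    if length <= 0:
        raise ValueError("segment must have positive length")
    dist: Dict[str, float] = {}
    for label, start, end in units:
        overlap = min(end, seg_end) - max(start, seg_start)
        if overlap > 0:
            dist[label] = dist.get(label, 0.0) + overlap / length
    return dist


def sliding_windows(n_frames: int, width: int, stride: int) -> List[Tuple[int, int]]:
    """Forward pass from the first frame, then backward pass from the last frame
    (an assumed reading of 'bidirectional', not taken from the paper)."""
    forward = [(s, s + width) for s in range(0, n_frames - width + 1, stride)]
    backward = [(e - width, e) for e in range(n_frames, width - 1, -stride)]
    return forward + backward


# Usage: a 30-frame video with background followed by one sub-action.
units = [("background", 0, 10), ("sub_action_1", 10, 30)]
for start, end in sliding_windows(n_frames=30, width=10, stride=5):
    print((start, end), label_distribution(units, start, end))
```

Under this assumption, a window that straddles the boundary between two semantic units receives a distribution with mass on both labels, which is the kind of soft target an LDL model would be trained to predict.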


Acknowledgements

This research was supported by the National Key Research & Development Plan of China (2017YFB1002801), the National Science Foundation of China (61622203, 61232007), the Jiangsu Natural Science Funds for Distinguished Young Scholar (BK20140022), the Collaborative Innovation Center of Novel Software Technology and Industrialization, and the Collaborative Innovation Center of Wireless Communications Technology.

Author information

Corresponding author

Correspondence to Xin Geng.

Additional information

Miaogen Ling received the BS degree in mathematical science from Soochow University, China in 2010, and the MS degree in computer science from Southeast University, Nanjing, China in 2013. He is currently pursuing the PhD degree with the Department of Computer Science and Engineering, Southeast University, China. His research interests include machine learning and its application to computer vision and multimedia analysis.

Xin Geng received the BS and MS degrees in computer science from Nanjing University, China in 2001 and 2004, respectively, and the PhD degree from Deakin University, Australia in 2008. He joined the School of Computer Science and Engineering at Southeast University, China in 2008, and is currently a professor and vice dean of the school. His research interests include pattern recognition, machine learning, and computer vision. He has authored over 50 refereed papers and holds five patents in these areas.

About this article

Cite this article

Ling, M., Geng, X. Soft video parsing by label distribution learning. Front. Comput. Sci. 13, 302–317 (2019). https://doi.org/10.1007/s11704-018-8015-y

  • DOI: https://doi.org/10.1007/s11704-018-8015-y
