Soft video parsing by label distribution learning

Ling, Miaogen; Geng, Xin

doi:10.1007/s11704-018-8015-y

Research Article
Published: 11 April 2019

Volume 13, pages 302–317 (2019)
Frontiers of Computer Science

Abstract

In this paper, we tackle the problem of segmenting out a sequence of actions from videos. The videos contain background and actions, which are usually composed of ordered sub-actions. We refer to the sub-actions and the background as semantic units. Considering the possible overlap between two adjacent semantic units, we propose a bidirectional sliding window method to generate the label distributions for various segments in the video. The label distribution covers a certain number of semantic unit labels, representing the degree to which each label describes the video segment. The mapping from a video segment to its label distribution is then learned by a Label Distribution Learning (LDL) algorithm. Based on the LDL model, a soft video parsing method with segmental regular grammars is proposed to construct a tree structure for the video. Each leaf of the tree stands for a video clip of background or sub-action. The proposed method shows promising results on the THUMOS’14, MSR-II and UCF101 datasets, and its computational complexity is much lower than that of the compared state-of-the-art video parsing method.
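
To make the notion of a label distribution concrete, the following is a minimal sketch of how one could be computed for a single video segment. It is not the bidirectional sliding window procedure of the paper, whose details are not given in the abstract. It assumes, purely for illustration, that the degree to which a label describes a segment is the fraction of the segment's frames that overlap the corresponding semantic unit. The function name `segment_label_distribution`, the frame intervals, and the label names are all hypothetical.

```python
from typing import Dict, List, Tuple

# A semantic unit: (label, start_frame, end_frame), with end_frame exclusive.
Unit = Tuple[str, int, int]


def segment_label_distribution(
    segment: Tuple[int, int], units: List[Unit]
) -> Dict[str, float]:
    """Return a label distribution for one segment (illustrative assumption).

    Each label's description degree is the share of the segment's annotated
    frames that fall inside that semantic unit, so the degrees sum to 1.
    """
    seg_start, seg_end = segment
    if seg_end <= seg_start:
        raise ValueError("segment must contain at least one frame")

    overlaps: Dict[str, int] = {}
    for label, unit_start, unit_end in units:
        # Number of frames shared by the segment and this semantic unit.
        shared = max(0, min(seg_end, unit_end) - max(seg_start, unit_start))
        if shared > 0:
            overlaps[label] = overlaps.get(label, 0) + shared

    total = sum(overlaps.values())
    if total == 0:
        return {}  # the segment overlaps no annotated semantic unit
    return {label: count / total for label, count in overlaps.items()}


# Hypothetical annotation: background followed by two ordered sub-actions.
units = [("background", 0, 40), ("run_up", 40, 70), ("jump", 70, 100)]

# A segment that straddles two unit boundaries covers three labels.
dist = segment_label_distribution((35, 75), units)
```

Under this assumption, a segment lying entirely inside one semantic unit receives a single label with degree 1, while a segment spanning a boundary receives a soft mixture of the adjacent labels. Distributions of this kind would be the targets that an LDL model learns to predict from segment features.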

Acknowledgements

This research was supported by the National Key Research & Development Plan of China (2017YFB1002801), the National Science Foundation of China (61622203, 61232007), the Jiangsu Natural Science Funds for Distinguished Young Scholar (BK20140022), the Collaborative Innovation Center of Novel Software Technology and Industrialization, and the Collaborative Innovation Center of Wireless Communications Technology.

Author information

Authors and Affiliations

Department of Computer Science and Engineering, Southeast University, Nanjing, 211189, China
Miaogen Ling & Xin Geng

Corresponding author

Correspondence to Xin Geng.

Additional information

Miaogen Ling received the BS degree in mathematical science from Soochow University, China in 2010, and the MS degree in computer science from Southeast University, Nanjing, China in 2013. He is currently pursuing the PhD degree with the Department of Computer Science and Engineering, Southeast University, China. His research interests include machine learning and its application to computer vision and multimedia analysis.

Xin Geng received the BS and MS degrees in computer science from Nanjing University, China in 2001 and 2004, respectively, and the PhD degree from Deakin University, Australia in 2008. He joined the School of Computer Science and Engineering at Southeast University, China in 2008, and is currently a professor and vice dean of the school. His research interests include pattern recognition, machine learning, and computer vision. He has authored over 50 refereed papers and holds five patents in these areas.

Electronic supplementary material

Supplementary material, approximately 527 KB.

About this article

Cite this article

Ling, M., Geng, X. Soft video parsing by label distribution learning. Front. Comput. Sci. 13, 302–317 (2019). https://doi.org/10.1007/s11704-018-8015-y

Received: 09 January 2018
Accepted: 17 April 2018
Published: 11 April 2019
Issue Date: April 2019
DOI: https://doi.org/10.1007/s11704-018-8015-y
