Abstract
In this paper, we tackle the problem of segmenting a sequence of actions out of videos. The videos contain background and actions, and the actions are usually composed of ordered sub-actions. We refer to the sub-actions and the background as semantic units. Considering the possible overlap between two adjacent semantic units, we propose a bidirectional sliding window method to generate label distributions for the segments of a video. Each label distribution covers a certain number of semantic unit labels and represents the degree to which each label describes the video segment. The mapping from a video segment to its label distribution is then learned by a Label Distribution Learning (LDL) algorithm. Based on the LDL model, a soft video parsing method with segmental regular grammars is proposed to construct a tree structure for the video. Each leaf of the tree stands for a video clip of background or a sub-action. The proposed method shows promising results on the THUMOS’14, MSR-II and UCF101 datasets, and its computational complexity is much lower than that of the state-of-the-art video parsing method used for comparison.
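To make the idea of a label distribution for a video segment concrete, the sketch below assigns each semantic unit label a degree proportional to how many frames of a window it covers. This is only an illustrative assumption: the abstract does not specify how the degrees are computed, and the function name `overlap_label_distribution`, the `(label, start, end)` annotation format, and the overlap-ratio rule are all hypothetical. The bidirectional generation of windows is not shown.

```python
from typing import Dict, List, Tuple

# Hypothetical annotation format: (label, start_frame, end_frame), end exclusive.
Unit = Tuple[str, int, int]


def overlap_label_distribution(
    window: Tuple[int, int], units: List[Unit]
) -> Dict[str, float]:
    """Return label degrees for a window, proportional to frame overlap.

    Assumption for illustration only: the degree of a label is the share
    of the window's annotated frames that belong to that semantic unit.
    """
    w_start, w_end = window
    if w_end <= w_start:
        raise ValueError("window must contain at least one frame")

    overlaps: Dict[str, float] = {}
    for label, u_start, u_end in units:
        # Number of frames shared by the window and this semantic unit.
        shared = min(w_end, u_end) - max(w_start, u_start)
        if shared > 0:
            overlaps[label] = overlaps.get(label, 0.0) + shared

    total = sum(overlaps.values())
    if total == 0:
        return {}  # the window touches no annotated unit
    # Normalise so that the degrees sum to 1.
    return {label: frames / total for label, frames in overlaps.items()}


if __name__ == "__main__":
    units = [("background", 0, 10), ("sub_action_1", 10, 30)]
    # A window straddling the boundary between two adjacent semantic units.
    print(overlap_label_distribution((5, 15), units))
```

A window that straddles the boundary between two adjacent semantic units receives non-zero degrees for both labels, which is the kind of soft labelling that a single hard label cannot express.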
Acknowledgements
This research was supported by the National Key Research & Development Plan of China (2017YFB1002801), the National Science Foundation of China (61622203, 61232007), the Jiangsu Natural Science Funds for Distinguished Young Scholar (BK20140022), the Collaborative Innovation Center of Novel Software Technology and Industrialization, and the Collaborative Innovation Center of Wireless Communications Technology.
Author information
Miaogen Ling received the BS degree in mathematical science from Soochow University, China in 2010, and the MS degree in computer science from Southeast University, Nanjing, China in 2013. He is currently pursuing the PhD degree with the Department of Computer Science and Engineering, Southeast University, China. His research interests include machine learning and its application to computer vision and multimedia analysis.
Xin Geng received the BS and MS degrees in computer science from Nanjing University, China in 2001 and 2004, respectively, and the PhD degree from Deakin University, Australia in 2008. He joined the School of Computer Science and Engineering at Southeast University, China in 2008, and is currently a professor and vice dean of the school. His research interests include pattern recognition, machine learning, and computer vision. He has authored over 50 refereed papers and holds five patents in these areas.
About this article
Cite this article
Ling, M., Geng, X. Soft video parsing by label distribution learning. Front. Comput. Sci. 13, 302–317 (2019). https://doi.org/10.1007/s11704-018-8015-y
DOI: https://doi.org/10.1007/s11704-018-8015-y