Dimensionality Deduction for Action Proposals: To Extract or to Select?
Action detection is an important task in computer vision. Recent methods based on deep learning frameworks have achieved impressive accuracy, but they still suffer from low speed. Among the remedies researchers have proposed, temporal action proposal methods are among the most effective: fed with features extracted from videos, they propose temporal clips that are likely to contain actions, thereby reducing the computational workload of the detector. A common choice is to extract spatio-temporal features with 3D convolutions (C3D); however, these features are typically high-dimensional, resulting in a sparse distribution along each dimension. It is therefore necessary to apply dimension reduction within the temporal proposal pipeline. In this work, we find experimentally that reducing the feature dimension is important for the action proposal task: it not only accelerates the subsequent temporal proposal stage but also improves its performance. Experimental results on the THUMOS 2014 dataset demonstrate that feature extraction methods are more suitable for temporal action proposals than feature selection methods.
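To make the extraction-versus-selection contrast concrete, the sketch below reduces high-dimensional clip features both ways: PCA (a feature extraction method) and highest-variance selection (a simple feature selection method). This is a minimal illustration, not the paper's pipeline; the matrix sizes (200 clips, 512-dimensional C3D-style features reduced to 64 dimensions) and the specific selection criterion are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 512))  # e.g. 200 video clips, 512-d C3D-style features
k = 64                               # illustrative target dimensionality

# Feature extraction: PCA via SVD — center the data, then project
# onto the top-k principal directions (new dimensions are combinations
# of the original ones).
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
extracted = Xc @ Vt[:k].T

# Feature selection: keep the k original dimensions with the highest
# variance (each kept dimension is one of the original features).
variances = X.var(axis=0)
selected = X[:, np.argsort(variances)[-k:]]

print(extracted.shape)  # (200, 64)
print(selected.shape)   # (200, 64)
```

Both paths shrink the representation that a temporal proposal model would consume; the paper's experiments compare which family works better for this task.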
Keywords: Action detection · 3D convolutions · Action proposals · Spatio-temporal · Dimension reduction
This work was supported by the Hubei Province Training Programs of Innovation and Entrepreneurship for Undergraduates (201710488036) and by the Scientific and Technological Innovation Fund for College Students of Wuhan University of Science and Technology (17ZRC131, 17ZRA116, 17ZRA121).