Dimensionality Deduction for Action Proposals: To Extract or to Select?

  • Jian Jiang
  • Haoyu Wang
  • Laixin Xie
  • Junwen Zhang
  • Chunhua Deng
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10956)


Action detection is an important task in machine vision. Several recent methods based on deep learning achieve impressive accuracy, but they still suffer from low speed. Among the approaches proposed to address this problem, temporal action proposal methods are the most effective. Fed with features extracted from videos, these methods propose time clips that may contain actions, which reduces the computational workload. 3D convolutions (C3D) are commonly used to extract spatio-temporal features from videos; however, these features are generally high-dimensional, resulting in a sparse distribution in each dimension. It is therefore necessary to apply dimension reduction in the temporal proposal process. In this research, we find experimentally that reducing the feature dimension is important in the action proposal task, because it not only accelerates the subsequent temporal proposal process but also improves its performance. Experimental results on the THUMOS 2014 dataset demonstrate that dimension reduction by feature extraction is more suitable for temporal action proposals than feature selection.
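The distinction between the two families of dimension reduction can be illustrated with a minimal sketch. It is not the authors' pipeline: the abstract does not name the specific algorithms, so PCA stands in for feature extraction and a top-variance filter stands in for feature selection. The synthetic matrix, its size (200 clips by 512 dimensions), and the function names `select_top_variance` and `extract_pca` are all assumptions made for illustration.

```python
import numpy as np


def select_top_variance(features, k):
    """Feature selection (illustrative): keep the k original columns
    with the highest variance, leaving their values unchanged."""
    variances = features.var(axis=0)
    keep = np.sort(np.argsort(variances)[-k:])
    return features[:, keep], keep


def extract_pca(features, k):
    """Feature extraction (illustrative): project the centered data onto
    its top-k principal directions, giving new combined features."""
    centered = features - features.mean(axis=0)
    # Rows of vt are the principal directions, ordered by singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:k]
    return centered @ components.T, components


# Hypothetical stand-in for clip-level C3D features: 200 clips x 512 dims,
# generated from 8 latent factors plus a small amount of noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 8))
mixing = rng.normal(size=(8, 512))
features = latent @ mixing + 0.1 * rng.normal(size=(200, 512))

selected, kept_columns = select_top_variance(features, k=32)
extracted, components = extract_pca(features, k=32)
print(selected.shape, extracted.shape)
```

Both outputs have 32 columns, but they differ in kind: `selected` is a subset of the original dimensions, whereas each column of `extracted` is a linear combination of all 512 original dimensions.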


Action detection, 3D convolutions, Action proposals, Spatio-temporal, Dimension reduction



This work was supported by the Hubei Province Training Programs of Innovation and Entrepreneurship for Undergraduates (201710488036) and by the Scientific and Technological Innovation Fund for College Students of Wuhan University of Science and Technology (17ZRC131, 17ZRA116, and 17ZRA121).



Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  • Jian Jiang (1, 2)
  • Haoyu Wang (1, 2)
  • Laixin Xie (1, 2)
  • Junwen Zhang (1, 2)
  • Chunhua Deng (1, 2)
  1. College of Computer Science and Technology, Wuhan University of Science and Technology, Wuhan, China
  2. Hubei Province Key Laboratory of Intelligent Information Processing and Real-Time Industrial System, Wuhan, China
