Abstract
One-stage frameworks have been widely applied to temporal action detection, but they still struggle with action instances that span a wide range of durations. The reason is that these one-stage detectors, e.g., the Single Shot MultiBox Detector (SSD), extract the temporal features for each head from only a single-level layer, which is not discriminative enough for classification and regression. In this paper, we propose a Multi-Level Temporal Pyramid Network (MLTPN) to improve the discrimination of the features. Specifically, we first fuse features from multiple layers with different temporal resolutions to encode multi-layer temporal information. We then apply a multi-level feature pyramid architecture to these features to enhance their discriminative ability. Finally, we design a simple yet effective feature fusion module to fuse the multi-level, multi-scale features. In this way, the proposed MLTPN can learn rich and discriminative features for action instances of different durations. We evaluate MLTPN on two challenging datasets, THUMOS'14 and ActivityNet v1.3. The experimental results show that MLTPN obtains competitive performance on ActivityNet v1.3 and significantly outperforms state-of-the-art approaches on THUMOS'14.
The first author of this paper is a graduate student.
This work is supported by the National Natural Science Foundation of China under grant 61871435 and the Fundamental Research Funds for the Central Universities no. 2019kfyXKJC024.
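The first fusion step described in the abstract, combining feature maps from layers with different temporal resolutions into one base feature, can be sketched as follows. This is only an illustrative sketch, not the paper's implementation: the layer shapes, the nearest-neighbor upsampling, and channel-wise concatenation are assumptions made for the example.

```python
import numpy as np

def upsample_1d(feat, target_len):
    # Nearest-neighbor upsampling along the temporal axis.
    # feat has shape (channels, time).
    idx = (np.arange(target_len) * feat.shape[1] / target_len).astype(int)
    return feat[:, idx]

def fuse_multi_layer(features, target_len):
    # Resample each layer's feature map to a common temporal length,
    # then concatenate along the channel axis so the base feature
    # encodes information from all temporal resolutions.
    resampled = [upsample_1d(f, target_len) for f in features]
    return np.concatenate(resampled, axis=0)

# Three hypothetical layers with decreasing temporal resolution
# (channels, time): finer layers are longer in time.
layers = [
    np.random.rand(64, 128),
    np.random.rand(128, 64),
    np.random.rand(256, 32),
]
base = fuse_multi_layer(layers, target_len=128)
print(base.shape)  # (448, 128): 64 + 128 + 256 channels, common length 128
```

In the paper, this fused base feature is then fed into the multi-level feature pyramid, whose outputs are in turn combined by the feature fusion module before classification and regression.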
References
Jiang, Y.-G., et al.: THUMOS challenge: action recognition with a large number of classes (2014)
Wang, L., Qiao, Y., Tang, X.: Action recognition and detection by combining motion and appearance features. THUMOS14 Action Recognit. Challenge 1(2), 2 (2014)
Simonyan, K., Zisserman, A.: Two-stream convolutional networks for action recognition in videos. In: Advances in Neural Information Processing Systems (2014)
Long, F., et al.: Gaussian temporal awareness networks for action localization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2019)
Li, X., et al.: Deep concept-wise temporal convolutional networks for action localization. arXiv preprint arXiv:1908.09442 (2019)
Shou, Z., Wang, D., Chang, S.-F.: Temporal action localization in untrimmed videos via multi-stage CNNs. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016)
Xu, H., Das, A., Saenko, K.: R-C3D: region convolutional 3D network for temporal activity detection. In: Proceedings of the IEEE International Conference on Computer Vision (2017)
Chao, Y.-W., et al.: Rethinking the faster R-CNN architecture for temporal action localization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018)
Lin, T., et al.: BSN: boundary sensitive network for temporal action proposal generation. In: Proceedings of the European Conference on Computer Vision (ECCV) (2018)
Lin, T., Zhao, X., Shou, Z.: Single shot temporal action detection. In: Proceedings of the 25th ACM International Conference on Multimedia (2017)
Yeung, S., et al.: End-to-end learning of action detection from frame glimpses in videos. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016)
Buch, S., et al.: End-to-end, single-stream temporal action detection in untrimmed videos. In: BMVC, vol. 2 (2017)
Ren, S., et al.: Faster R-CNN: towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems (2015)
Liu, W., et al.: SSD: single shot MultiBox detector. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 21–37. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46448-0_2
Buch, S., et al.: SST: single-stream temporal action proposals. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017)
Gao, J., et al.: TURN TAP: temporal unit regression network for temporal action proposals. In: Proceedings of the IEEE International Conference on Computer Vision (2017)
Gao, J., Chen, K., Nevatia, R.: CTAP: complementary temporal action proposal generation. In: Proceedings of the European Conference on Computer Vision (ECCV) (2018)
Dai, X., et al.: Temporal context network for activity localization in videos. In: Proceedings of the IEEE International Conference on Computer Vision (2017)
Zhao, Y., et al.: Temporal action detection with structured segment networks. In: Proceedings of the IEEE International Conference on Computer Vision (2017)
Kay, W., et al.: The kinetics human action video dataset. arXiv preprint arXiv:1705.06950 (2017)
Shou, Z., et al.: CDC: convolutional-de-convolutional networks for precise temporal action localization in untrimmed videos. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017)
Gao, J., Yang, Z., Nevatia, R.: Cascaded boundary regression for temporal action detection. arXiv preprint arXiv:1705.01180 (2017)
Zhao, Q., et al.: M2Det: a single-shot object detector based on multi-level feature pyramid network. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33 (2019)
Liu, Y., et al.: Multi-granularity generator for temporal action proposal. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2019)
Singh, B., et al.: A multi-stream bi-directional recurrent neural network for fine-grained action detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016)
Zeng, R., et al.: Graph convolutional networks for temporal action localization. In: Proceedings of the IEEE International Conference on Computer Vision (2019)
Lin, T., et al.: BMN: boundary-matching network for temporal action proposal generation. In: Proceedings of the IEEE International Conference on Computer Vision (2019)
Yuan, Z., et al.: Temporal action localization by structured maximal sums. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017)
Caba Heilbron, F., et al.: ActivityNet: a large-scale video benchmark for human activity understanding. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015)
Singh, G., Cuzzolin, F.: Untrimmed video classification for activity detection: submission to ActivityNet challenge. arXiv preprint arXiv:1607.01979 (2016)
Xiong, Y., et al.: A pursuit of temporal accuracy in general activity detection. arXiv preprint arXiv:1703.02716 (2017)
Richard, A., Gall, J.: Temporal action detection using a statistical language model. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016)
Caba Heilbron, F., Carlos Niebles, J., Ghanem, B.: Fast temporal activity proposals for efficient detection of human actions in untrimmed videos. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016)
Escorcia, V., Caba Heilbron, F., Niebles, J.C., Ghanem, B.: DAPs: deep action proposals for action understanding. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9907, pp. 768–784. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46487-9_47
Rezatofighi, H., et al.: Generalized intersection over union: a metric and a loss for bounding box regression. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2019)
Oneata, D., Verbeek, J., Schmid, C.: The LEAR submission at THUMOS 2014 (2014)
Carreira, J., Zisserman, A.: Quo vadis, action recognition? A new model and the kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017)
Yuan, J., et al.: Temporal action localization with pyramid of score distribution features. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016)
Wang, R., Tao, D.: UTS at ActivityNet 2016. ActivityNet Large Scale Activity Recognition Challenge 8, 2016 (2016)
Lin, T., Zhao, X., Shou, Z.: Temporal convolution based action proposal: submission to ActivityNet 2017. arXiv preprint arXiv:1707.06750 (2017)
Zach, C., Pock, T., Bischof, H.: A duality based approach for realtime TV-L\(^{1}\) optical flow. In: Hamprecht, F.A., Schnörr, C., Jähne, B. (eds.) DAGM 2007. LNCS, vol. 4713, pp. 214–223. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-74936-3_22
Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018)
Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
Zhao, Y., et al.: CUHK & ETHZ & SIAT submission to ActivityNet challenge 2017. arXiv preprint arXiv:1710.08011 (2017)
Copyright information
© 2020 Springer Nature Switzerland AG
Cite this paper
Wang, X., Gao, C., Zhang, S., Sang, N. (2020). Multi-level Temporal Pyramid Network for Action Detection. In: Peng, Y., et al. Pattern Recognition and Computer Vision. PRCV 2020. Lecture Notes in Computer Science(), vol 12306. Springer, Cham. https://doi.org/10.1007/978-3-030-60639-8_4
Print ISBN: 978-3-030-60638-1
Online ISBN: 978-3-030-60639-8