Complementary Boundary Estimation Network for Temporal Action Proposal Generation

Abstract

Temporal action detection is an important yet challenging task, in which temporal action proposal generation plays a key part. Since the temporal boundaries of action instances in videos are often ambiguous, it is difficult to locate them precisely. Boundary Sensitive Network (BSN) (Lin et al. in ECCV, 2018) is a state-of-the-art corner-based method that can generate high-quality proposals with a high recall rate. It contains a temporal evaluation network and a proposal evaluation network, which generate and evaluate proposals separately: the former locates the temporal boundaries of action instances directly to produce proposals with flexible temporal intervals, and the latter evaluates the quality of those proposals. However, BSN still has two issues: (1) due to the small receptive field of the temporal evaluation network, it often generates many false temporal boundaries; (2) evaluating the quality of proposals is a difficult task and is not well solved in that paper. To address these issues, we propose the Complementary Boundary Estimation Network (CBEN), an improved approach to temporal action proposal generation built on the BSN framework. Specifically, we improve BSN in two ways. First, since the temporal evaluation network of BSN can only capture local information and tends to respond strongly at background segments, we combine it with a new network with a larger receptive field to better reject false temporal action boundaries. Second, to evaluate the quality of temporal action proposals more accurately, we propose a class-based proposal evaluation network and combine it with a tIoU-based proposal evaluation network to filter out low-quality proposals. Extensive experiments on the THUMOS14 and ActivityNet-1.3 datasets indicate that CBEN outperforms current mainstream methods for temporal action proposal generation. We further combine CBEN with an off-the-shelf action classifier and show consistent performance improvements on the THUMOS14 dataset.


References

1. Bodla N, Singh B, Chellappa R, Davis LS (2017) Soft-NMS: improving object detection with one line of code. In: ICCV, pp 5561–5569

2. Buch S, Escorcia V, Shen C, Ghanem B, Niebles JC (2017) SST: single-stream temporal action proposals. In: CVPR, pp 2911–2920

3. Caba Heilbron F, Escorcia V, Ghanem B, Carlos Niebles J (2015) ActivityNet: a large-scale video benchmark for human activity understanding. In: CVPR, pp 961–970

4. Chao YW, Vijayanarasimhan S, Seybold B, Ross DA, Deng J, Sukthankar R (2018) Rethinking the Faster R-CNN architecture for temporal action localization. In: CVPR, pp 1130–1139

5. Deng J, Dong W, Socher R, Li LJ, Li K, Fei-Fei L (2009) ImageNet: a large-scale hierarchical image database. In: CVPR, pp 248–255

6. Feichtenhofer C, Pinz A, Zisserman A (2016) Convolutional two-stream network fusion for video action recognition. In: CVPR, pp 1933–1941

7. Gao J, Chen K, Nevatia R (2018) CTAP: complementary temporal action proposal generation. In: ECCV, pp 68–83

8. Gao J, Yang Z, Chen S, Kan C, Nevatia R (2017) TURN TAP: temporal unit regression network for temporal action proposals. In: ICCV, pp 3628–3636

9. Gao J, Yang Z, Nevatia R (2017) Cascaded boundary regression for temporal action detection. arXiv preprint arXiv:1705.01180

10. He K, Gkioxari G, Dollár P, Girshick R (2017) Mask R-CNN. In: ICCV, pp 2961–2969

11. He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: CVPR, pp 770–778

12. Jiang YG, Liu J, Zamir AR, Toderici G, Laptev I, Shah M, Sukthankar R (2014) THUMOS challenge: action recognition with a large number of classes

13. Law H, Deng J (2018) CornerNet: detecting objects as paired keypoints. In: ECCV, pp 734–750

14. Li X, Lin T, Liu X, Gan C, Zuo W, Li C, Long X, He D, Li F, Wen S (2019) Deep concept-wise temporal convolutional networks for action localization. In: ICCV

15. Lin T, Zhao X, Shou Z (2017) Single shot temporal action detection. In: ACM international conference on multimedia, pp 988–996

16. Lin T, Zhao X, Su H, Wang C, Yang M (2018) BSN: boundary sensitive network for temporal action proposal generation. In: ECCV, pp 3–19

17. Liu W, Anguelov D, Erhan D, Szegedy C, Reed S, Fu CY, Berg AC (2016) SSD: single shot multibox detector. In: ECCV

18. Lu X, Li B, Yue Y, Li Q, Yan J (2019) Grid R-CNN. In: CVPR, pp 7363–7372

19. Luo W, Li Y, Urtasun R, Zemel RS (2016) Understanding the effective receptive field in deep convolutional neural networks. In: NeurIPS, pp 4898–4906

20. Newell A, Yang K, Jia D (2016) Stacked hourglass networks for human pose estimation. In: ECCV, pp 483–499

21. Ren S, He K, Girshick R, Sun J (2017) Faster R-CNN: towards real-time object detection with region proposal networks. PAMI 39(6):1137–1149

22. Shou Z, Chan J, Zareian A, Miyazawa K, Chang SF (2017) CDC: convolutional-de-convolutional networks for precise temporal action localization in untrimmed videos. In: CVPR, pp 5734–5743

23. Shou Z, Wang D, Chang SF (2016) Temporal action localization in untrimmed videos via multi-stage CNNs. In: CVPR, pp 1049–1058

24. Tran D, Bourdev L, Fergus R, Torresani L, Paluri M (2015) Learning spatiotemporal features with 3D convolutional networks. In: ICCV, pp 4489–4497

25. Xiong Y, Wang L, Zhe W, Zhang B, Hang S, Wei L, Lin D, Yu Q, Gool LV, Tang X (2016) CUHK & ETHZ & SIAT submission to ActivityNet challenge 2016. arXiv preprint arXiv:1608.00797

26. Xiong Y, Yue Z, Wang L, Lin D, Tang X (2017) A pursuit of temporal accuracy in general activity detection. arXiv preprint arXiv:1703.02716

27. Xu H, Das A, Saenko K (2017) R-C3D: region convolutional 3D network for temporal activity detection. In: ICCV, pp 5783–5792

28. Zhao Y, Xiong Y, Wang L, Wu Z, Tang X, Lin D (2017) Temporal action detection with structured segment networks. In: ICCV, pp 2914–2923


Acknowledgements

This work was supported in part by the National Natural Science Foundation of China under Grants 61673402, 61273270, and 60802069; in part by the Natural Science Foundation of Guangdong under Grants 2017A030311029; in part by the Science and Technology Program of Guangzhou under Grants 201704020180; and in part by the Fundamental Research Funds for the Central Universities of China.

Author information

Corresponding author

Correspondence to Haifeng Hu.


Appendices

Appendix A: Calculation of tIoU

tIoU is short for temporal Intersection over Union. Figure 13 shows the beginning and ending times of a temporal action proposal and of the ground truth. For the overlap configuration shown in Fig. 13 (\(t_1 \le t_b \le t_2 \le t_e\)), the tIoU between the proposal and the ground truth is calculated by

$$\begin{aligned} tIoU=\frac{Intersection}{Union}=\frac{t_2-t_b}{t_e-t_1} \end{aligned}$$
(11)

A higher tIoU indicates that the proposal is closer to the ground truth, i.e., that the proposal is of higher quality.

Fig. 13

The schematic diagram of ground truth (blue) and proposal (green). \(t_b\) and \(t_e\) are the beginning and ending time of ground truth, and \(t_1\) and \(t_2\) are the beginning and ending time of the proposal. (Color figure online)
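The computation can be sketched in Python (a minimal illustration; the function name is ours, and we use the general min/max form, which reduces to Eq. (11) for the configuration in Fig. 13 but also handles non-overlapping intervals):

```python
def tiou(proposal, ground_truth):
    """Temporal IoU between two intervals given as (start, end) pairs."""
    t1, t2 = proposal
    tb, te = ground_truth
    # Overlap length, clamped at zero for disjoint intervals.
    intersection = max(0.0, min(t2, te) - max(t1, tb))
    # Length of the union of the two intervals.
    union = max(t2, te) - min(t1, tb)
    return intersection / union if union > 0 else 0.0
```

For example, a proposal (2, 6) against a ground truth (4, 8) overlaps for 2 s out of a 6 s union, giving a tIoU of 1/3.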

Appendix B: Derivative of Mean Square Error

Assume we use mean square error to optimize the tIoU-based PEN; then the first-order derivative with respect to the weights in the output layer can be calculated by

$$\begin{aligned} f(\mathbf {w})= & {} sigmoid(\mathbf {w}^\mathrm {T}\mathbf {x}^{(i)})\nonumber \\ J(\mathbf {w})= & {} \frac{1}{2n}\sum _{i=1}^{n}[y_i-f(\mathbf {w})]^2\nonumber \\ \frac{\partial {J}}{\partial {\mathbf {w}}}= & {} -\frac{1}{n}\sum _{i=1}^{n}(f(\mathbf {w})-y_i)(f(\mathbf {w})-1)f(\mathbf {w})\mathbf {x}^{(i)} \end{aligned}$$
(12)

Note that \(g(x)=(x-y)(x-1)x\) is not a monotonic function for \(x,y\in (0,1)\), so \(\frac{\partial {J}}{\partial {\mathbf {w}}}\) is not monotonic either, which means \(J(\mathbf {w})\) is a non-convex function.
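The non-convexity can be checked numerically with a minimal sketch (our toy setting: a single scalar sample \(x=1\) with label \(y=0\), so \(J(w)=\frac{1}{2}\,\mathrm{sigmoid}(w)^2\)). A convex function would have non-negative discrete second differences everywhere, but here the sign flips:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def J(w):
    # MSE loss for one sample x = 1 with label y = 0:
    # J(w) = 0.5 * (y - sigmoid(w * x))**2 = 0.5 * sigmoid(w)**2
    return 0.5 * sigmoid(w) ** 2

# Discrete second differences of J along w.
# A convex function keeps these >= 0; here the sign depends on where we look.
d2_left = J(-1.0) - 2 * J(-2.0) + J(-3.0)   # positive: locally convex
d2_right = J(5.0) - 2 * J(4.0) + J(3.0)     # negative: locally concave
```

Because the sigmoid saturates, the loss flattens out for large \(|w|\), which is exactly where the curvature turns negative.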

Appendix C: Derivative of Softmax Cross Entropy

Assume softmax cross entropy is used to optimize the class-based PEN; then the first-order and second-order derivatives with respect to the weights in the output layer can be calculated by

$$\begin{aligned} a_j^{(i)}= & {} \frac{e^{\mathbf {w}_j^\mathrm {T}{\mathbf {x}^{(i)}}}}{\sum _{k=1}^{c}e^{\mathbf {w}_k^\mathrm {T}{\mathbf {x}^{(i)}}}}\nonumber \\ J(\mathbf {w})= & {} -\frac{1}{n}\sum _{i=1}^{n}\sum _{j=1}^{c}y_j^{(i)}\log a_j^{(i)}\nonumber \\ \frac{\partial {J}}{\partial {\mathbf {w}_n}}= & {} -\frac{1}{n}\sum _{i=1}^{n}\left( y_n^{(i)}-a_n^{(i)}\right) \mathbf {x}^{(i)}\nonumber \\ \frac{\partial {^2J}}{\partial {\mathbf {w}_n^2}}= & {} \frac{1}{n}\sum _{i=1}^{n}a_n^{(i)}\left( 1-a_n^{(i)}\right) \mathbf {x}^{(i)}\mathbf {x}^{(i)\mathrm {T}} \end{aligned}$$
(13)

Considering \(a_n\in (0,1)\), \(\frac{\partial {^2J}}{\partial {\mathbf {w}_n^2}}\) is a positive semi-definite matrix, which means \(J(\mathbf {w})\) is a convex function.
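The first-order derivative in Eq. (13) can be verified with a central finite-difference check (a minimal sketch; the helper names and the toy numbers are ours):

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def loss(W, x, y):
    # Softmax cross entropy for one sample:
    # W is a list of c weight vectors, x a feature vector, y a one-hot label.
    logits = [sum(wj * xj for wj, xj in zip(w, x)) for w in W]
    a = softmax(logits)
    return -sum(yj * math.log(aj) for yj, aj in zip(y, a))

def grad_wn(W, x, y, n):
    # Analytic gradient from Eq. (13) for one sample: dJ/dw_n = -(y_n - a_n) x
    logits = [sum(wj * xj for wj, xj in zip(w, x)) for w in W]
    a = softmax(logits)
    return [-(y[n] - a[n]) * xj for xj in x]

# Compare against a central finite difference on one toy sample.
W = [[0.1, -0.2], [0.3, 0.5], [-0.4, 0.2]]
x = [1.0, 2.0]
y = [0.0, 1.0, 0.0]
n, d, eps = 1, 0, 1e-6

W_plus = [list(w) for w in W]
W_plus[n][d] += eps
W_minus = [list(w) for w in W]
W_minus[n][d] -= eps

numeric = (loss(W_plus, x, y) - loss(W_minus, x, y)) / (2 * eps)
analytic = grad_wn(W, x, y, n)[d]
```

The two values agree to within finite-difference precision, confirming the gradient expression on which the convexity argument rests.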


Wang, J., Hu, H. Complementary Boundary Estimation Network for Temporal Action Proposal Generation. Neural Process Lett (2020). https://doi.org/10.1007/s11063-020-10349-x


Keywords

  • Temporal action proposal generation
  • Temporal boundary evaluation
  • Proposal evaluation
  • Network fusion