
F2S-Net: learning frame-to-segment prediction for online action detection

  • Research
  • Published:
Journal of Real-Time Image Processing

Abstract

Online action detection (OAD) aims to predict the action of each frame in a streaming untrimmed video in real time. Most existing approaches leverage all historical frames in a sliding window as the temporal context of the current frame, since single-frame prediction is often unreliable. However, this strategy inevitably introduces useless or even noisy video content, which often misleads the action classifier when recognizing the ongoing action in the current frame. To alleviate this difficulty, we propose a concise and novel F2S-Net, which adaptively discovers the contextual segments in the online sliding window and converts current-frame prediction into relevant-segment prediction. More specifically, since the current frame can be either action or background, we design F2S-Net with a distinct two-branch structure, i.e., the action (or background) branch exploits the action (or background) segments. Via multi-level action supervision, the two branches complementarily enhance each other, allowing the model to identify the contextual segments in the sliding window and robustly predict the ongoing action. We evaluate our approach on popular OAD benchmarks, i.e., THUMOS-14, TVSeries, and HDD. Extensive results show that our F2S-Net outperforms recent state-of-the-art approaches.
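To make the frame-to-segment idea more concrete, the sketch below illustrates one way a two-branch head could turn sliding-window features into a segment-level prediction for the current frame. It is a minimal, illustrative example only: the module names, feature sizes, GRU encoder, and the soft segment-weighting scheme are our assumptions for exposition, not the authors' implementation.

# Illustrative sketch of a two-branch frame-to-segment head for online
# action detection, loosely following the abstract. All names, sizes, and
# the soft segment-selection scheme are assumptions, not the paper's code.
import torch
import torch.nn as nn

class TwoBranchF2SHead(nn.Module):
    def __init__(self, feat_dim=2048, hidden_dim=512, num_classes=21):
        super().__init__()
        # Shared temporal encoder over the sliding-window frame features.
        self.encoder = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        # Action branch: scores how action-like each frame in the window is.
        self.action_branch = nn.Linear(hidden_dim, 1)
        # Background branch: scores how background-like each frame is.
        self.background_branch = nn.Linear(hidden_dim, 1)
        # Classifier applied to the segment-pooled context of the current frame.
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, window_feats):
        # window_feats: (batch, T, feat_dim); the last frame is the current frame.
        h, _ = self.encoder(window_feats)                             # (B, T, hidden)
        action_score = torch.sigmoid(self.action_branch(h))           # (B, T, 1)
        background_score = torch.sigmoid(self.background_branch(h))   # (B, T, 1)
        # Soft segment selection: keep frames the action branch favors and the
        # background branch suppresses, so the two branches act complementarily.
        weights = action_score * (1.0 - background_score)             # (B, T, 1)
        weights = weights / (weights.sum(dim=1, keepdim=True) + 1e-6)
        segment_context = (weights * h).sum(dim=1)                    # (B, hidden)
        # Predict the ongoing action of the current frame from the segment context.
        return self.classifier(segment_context)

# Usage: a window of 64 frames, each with a 2048-d feature.
head = TwoBranchF2SHead()
logits = head(torch.randn(2, 64, 2048))   # (2, num_classes)

In this sketch the segment is selected softly (per-frame weights) rather than by hard boundaries; the key point it illustrates is that the current frame is classified from context pooled over the relevant segment instead of from the single frame alone.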


Data availability

No datasets were generated or analysed during the current study.


Acknowledgements

This work was supported by the National Key R&D Program of China (No. 2022ZD0160505) and the National Natural Science Foundation of China under Grant 62272450.

Author information


Contributions

Y.L.: method design, implementation, coding, and writing. Y.Q., Y.W.: method design and writing.

Corresponding author

Correspondence to Yali Wang.

Ethics declarations

Conflict of interest

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Liu, Y., Qiao, Y. & Wang, Y. F2S-Net: learning frame-to-segment prediction for online action detection. J Real-Time Image Proc 21, 73 (2024). https://doi.org/10.1007/s11554-024-01454-4


  • DOI: https://doi.org/10.1007/s11554-024-01454-4
