Spatio-Temporal Action Localization for Pedestrian Action Detection

  • Linchao He
  • Jiong Mu (Email author)
  • Mengting Luo
  • Yunlu Lu
  • Xuefeng Tan
  • Dejun Zhang
Conference paper
Part of the Lecture Notes in Electrical Engineering book series (LNEE, volume 551)


Current state-of-the-art temporal action detection methods focus on untrimmed, multi-target videos. Following the object detection paradigm, these methods classify the action class and detect the temporal extent of multiple instances. However, they are often impractical because, in real-world footage, the actions of different targets are complex and largely uncorrelated. Previous methods rely on optical flow to handle multiple instances, but estimating optical flow dominates the evaluation time. Inspired by spatio-temporal action detection, we improve on previous work with a pedestrian action detection network that runs in real time. We replace the Single Shot MultiBox Detector (SSD) with RFB-Net, which is more efficient. A tube linking algorithm is introduced to link bounding boxes across frames into distinct action instances. Our network processes only RGB frames, which costs far less time than two-stream-based methods. The framework achieves results comparable to the state of the art while detecting in real time.
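The tube linking step mentioned above can be illustrated with a minimal greedy sketch. This is an assumption-laden illustration, not the paper's exact algorithm: it links each active tube to the unmatched detection in the next frame with the highest IoU above a threshold, and starts a new tube for every unmatched detection.

```python
# Hedged sketch of greedy IoU-based tube linking (illustrative only; the
# paper's actual linking criterion may also use class scores or smoothing).

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    iw, ih = max(0.0, ix2 - ix1), max(0.0, iy2 - iy1)
    inter = iw * ih
    if inter <= 0:
        return 0.0
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def link_tubes(frames, iou_thr=0.5):
    """Greedily link per-frame boxes into action tubes.

    frames: list over time; each element is a list of (x1, y1, x2, y2) boxes.
    Returns a list of tubes; each tube is a list of (frame_index, box).
    """
    tubes = []      # all tubes built so far
    active = []     # indices into `tubes` that are still extendable
    for t, boxes in enumerate(frames):
        unmatched = list(range(len(boxes)))
        still_active = []
        for ti in active:
            last_box = tubes[ti][-1][1]
            # pick the unmatched detection with highest IoU to the tube's tail
            best, best_iou = None, iou_thr
            for bi in unmatched:
                s = iou(last_box, boxes[bi])
                if s >= best_iou:
                    best, best_iou = bi, s
            if best is not None:
                tubes[ti].append((t, boxes[best]))
                unmatched.remove(best)
                still_active.append(ti)
        # every unmatched detection seeds a new tube
        for bi in unmatched:
            tubes.append([(t, boxes[bi])])
            still_active.append(len(tubes) - 1)
        active = still_active
    return tubes
```

For example, a box that drifts slightly between two frames is linked into one tube, while a distant box in the second frame starts a second tube.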


Neural network · Object detection · Action recognition



This work was supported by the National Natural Science Foundation of China under Grant 61702350 and by the Sichuan Province Department of Education (No. 17ZA0297).


  1. Girshick, R., Donahue, J., Darrell, T.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: Computer Vision and Pattern Recognition (2014)
  2. Girshick, R.: Fast R-CNN. In: International Conference on Computer Vision (2015)
  3. Ren, S., He, K., Girshick, R.: Faster R-CNN: towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems, pp. 91–99 (2015)
  4. Liu, W., Anguelov, D., Erhan, D.: SSD: single shot multibox detector. In: European Conference on Computer Vision, pp. 21–37 (2016)
  5. Liu, S., Huang, D., Wang, Y.: Receptive field block net for accurate and fast object detection. In: European Conference on Computer Vision (ECCV) (2018)
  6. Redmon, J., Farhadi, A.: YOLOv3: an incremental improvement. arXiv preprint arXiv:1804.02767 (2018)
  7. Redmon, J., Farhadi, A.: YOLO9000: better, faster, stronger. arXiv preprint (2017)
  8. Wang, L., Xiong, Y., Wang, Z.: Temporal segment networks: towards good practices for deep action recognition. In: Lecture Notes in Computer Science (2016)
  9. Carreira, J., Zisserman, A.: Quo vadis, action recognition? A new model and the kinetics dataset. In: Computer Vision and Pattern Recognition, pp. 4724–4733 (2017)
  10. Peng, X., Schmid, C.: Multi-region two-stream R-CNN for action detection. In: European Conference on Computer Vision, pp. 744–759 (2016)
  11. Simonyan, K., Zisserman, A.: Two-stream convolutional networks for action recognition in videos. In: Advances in Neural Information Processing Systems, pp. 568–576 (2014)
  12. Liu, J., Shahroudy, A., Wang, G.: SSNet: scale selection network for online 3D action prediction. In: CVPR (2018)
  13. Singh, G., Saha, S., Sapienza, M.: Online real-time multiple spatiotemporal action localisation and prediction. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3637–3646 (2017)
  14. Saha, S., Singh, G., Sapienza, M.: Deep learning for detecting multiple space-time action tubes in videos. arXiv preprint arXiv:1608.01529 (2016)
  15. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)
  16. Kalogeiton, V., Weinzaepfel, P., Ferrari, V.: Action tubelet detector for spatio-temporal action localization. In: IEEE International Conference on Computer Vision (ICCV) (2017)
  17. Soomro, K., Zamir, A.R., Shah, M.: UCF101: a dataset of 101 human action classes from videos in the wild. arXiv preprint arXiv:1212.0402 (2012)
  18. Weinzaepfel, P., Harchaoui, Z., Schmid, C.: Learning to track for spatio-temporal action localization. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 3164–3172 (2015)
  19. Lin, T.-Y., Maire, M., Belongie, S.: Microsoft COCO: common objects in context. In: European Conference on Computer Vision, pp. 740–755 (2014)
  20. Hou, R., Chen, C., Shah, M.: Tube convolutional neural network (T-CNN) for action detection in videos. In: Computer Vision and Pattern Recognition, pp. 5823–5832 (2017)
  21. He, J., Ibrahim, M.S., Deng, Z.: Generic tubelet proposals for action localization. arXiv preprint arXiv:1705.10861 (2017)

Copyright information

© Springer Nature Singapore Pte Ltd. 2020

Authors and Affiliations

  • Linchao He (1, 2)
  • Jiong Mu (1, 2) (Email author)
  • Mengting Luo (1, 2)
  • Yunlu Lu (1)
  • Xuefeng Tan (1, 2)
  • Dejun Zhang (3)
  1. College of Information and Engineering, Sichuan Agricultural University, Ya’an, China
  2. The Lab of Agricultural Information Engineering, Sichuan Key Laboratory, Ya’an, China
  3. School of Geography and Information Engineering, China University of Geosciences, Wuhan, China
