Online Detection of Action Start in Untrimmed, Streaming Videos

  • Zheng Shou
  • Junting Pan
  • Jonathan Chan
  • Kazuyuki Miyazawa
  • Hassan Mansour
  • Anthony Vetro
  • Xavier Giro-i-Nieto
  • Shih-Fu Chang
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11207)


We tackle a novel task in action detection: Online Detection of Action Start (ODAS) in untrimmed, streaming videos. The goal of ODAS is to detect the start of an action instance with high categorization accuracy and low detection latency. ODAS is important in many applications, such as early alert generation to allow timely security or emergency response. We propose three novel methods that specifically address the challenges in training ODAS models: (1) hard negative sample generation based on a Generative Adversarial Network (GAN) to distinguish ambiguous background, (2) explicit modeling of the temporal consistency between data around the action start and data succeeding the action start, and (3) an adaptive sampling strategy to handle the scarcity of training data. We conduct extensive experiments on THUMOS’14 and ActivityNet. We show that the proposed methods lead to significant performance gains and improve on state-of-the-art methods. An ablation study confirms the effectiveness of each proposed method.
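To make the online setting concrete, the sketch below shows one simple way a start could be flagged from a stream of per-frame class scores using only past and current predictions. It is an illustration of the task, not the authors' method: the function name `detect_action_starts`, the convention that class index 0 is background, and the confidence threshold are all assumptions introduced here.

```python
from typing import Iterable, Iterator, List, Tuple

BACKGROUND = 0  # assumed convention: class index 0 means "no action"


def detect_action_starts(
    score_stream: Iterable[List[float]],
    threshold: float = 0.5,
) -> Iterator[Tuple[int, int, float]]:
    """Yield (time_index, class_index, confidence) for each flagged start.

    Only the current and the previous prediction are used, so the
    function never looks at future frames.
    """
    prev_class = BACKGROUND
    for t, scores in enumerate(score_stream):
        # Predicted class is the index with the highest score.
        cur_class = max(range(len(scores)), key=scores.__getitem__)
        confidence = scores[cur_class]
        # Flag a start when a confident action prediction differs from
        # the previous prediction.
        if (
            cur_class != BACKGROUND
            and cur_class != prev_class
            and confidence >= threshold
        ):
            yield t, cur_class, confidence
        prev_class = cur_class


# Hypothetical scores for three classes: background, action 1, action 2.
stream = [
    [0.9, 0.05, 0.05],
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.8, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.2, 0.7],
]
starts = list(detect_action_starts(stream))
print(starts)
```

Because the generator consumes the stream one element at a time, a start is reported as soon as the corresponding scores arrive, which is the sense in which detection latency matters here.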


Online detection · Action start · Generative Adversarial Network · Evaluation protocol



We appreciate the support from Mitsubishi Electric for this project.



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. Columbia University, New York, USA
  2. Universitat Politecnica de Catalunya, Barcelona, Spain
  3. Mitsubishi Electric, Tokyo, Japan
  4. Mitsubishi Electric Research Laboratories, Inc., Cambridge, USA
