Temporal Segment Networks: Towards Good Practices for Deep Action Recognition

  • Limin Wang
  • Yuanjun Xiong
  • Zhe Wang
  • Yu Qiao
  • Dahua Lin
  • Xiaoou Tang
  • Luc Van Gool
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9912)


Deep convolutional networks have achieved great success for visual recognition in still images. However, for action recognition in videos, the advantage over traditional methods is not so evident. This paper aims to discover the principles to design effective ConvNet architectures for action recognition in videos and to learn these models given limited training samples. Our first contribution is the temporal segment network (TSN), a novel framework for video-based action recognition, which is based on the idea of long-range temporal structure modeling. It combines a sparse temporal sampling strategy and video-level supervision to enable efficient and effective learning using the whole action video. The other contribution is our study of a series of good practices for learning ConvNets on video data with the help of the temporal segment network. Our approach obtains state-of-the-art performance on the HMDB51 (\( 69.4\,\% \)) and UCF101 (\( 94.2\,\% \)) datasets. We also visualize the learned ConvNet models, which qualitatively demonstrates the effectiveness of the temporal segment network and the proposed good practices (Models and code at
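The core idea described in the abstract (split a video into segments, sample one snippet per segment, and aggregate snippet-level predictions into a video-level prediction) can be illustrated with a minimal NumPy sketch. All function names here are hypothetical, the per-snippet scores are stand-ins for real ConvNet outputs, and averaging is just one plausible choice of consensus function:

```python
import numpy as np

def sample_snippet_indices(num_frames, num_segments, rng=None):
    """Sparse temporal sampling: divide the video into equal-duration
    segments and draw one snippet index uniformly from each segment."""
    rng = rng or np.random.default_rng(0)
    seg_len = num_frames / num_segments
    return np.array([
        int(k * seg_len) + int(rng.integers(0, max(int(seg_len), 1)))
        for k in range(num_segments)
    ])

def segmental_consensus(snippet_scores):
    """Aggregate per-snippet class scores into a single video-level
    score (here: simple averaging)."""
    return np.mean(snippet_scores, axis=0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Toy example: a 300-frame video, 3 segments, 4 action classes.
idx = sample_snippet_indices(num_frames=300, num_segments=3)
# Stand-in for running a shared-weight ConvNet on each sampled snippet:
scores = np.stack([np.array([0.2, 1.0, 0.1, -0.5])] * 3)
video_pred = softmax(segmental_consensus(scores))
```

Because the consensus is computed over the whole video, the supervision (and hence the gradient) is video-level rather than snippet-level, which is what lets the model exploit long-range temporal structure while only ever processing a few sampled snippets.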


Keywords: Action recognition · Temporal segment networks · Good practices · ConvNets



This work was supported by the Big Data Collaboration Research grant from SenseTime Group (CUHK Agreement No. TS1610626), an Early Career Scheme (ECS) grant (No. 24204215), ERC Advanced Grant VarCity (No. 273940), the Guangdong Innovative Research Program (2015B010129013, 2014B050505017), the Shenzhen Research Program (KQCX2015033117354153, JSGG20150925164740726, CXZZ20150930104115529), and the External Cooperation Program of BIC, Chinese Academy of Sciences (172644KYSB20150019).

Supplementary material

Supplementary material 1: 419983_1_En_2_MOESM1_ESM.pdf (PDF, 8.7 MB)



Copyright information

© Springer International Publishing AG 2016

Authors and Affiliations

  • Limin Wang (1)
  • Yuanjun Xiong (2)
  • Zhe Wang (3)
  • Yu Qiao (3)
  • Dahua Lin (2)
  • Xiaoou Tang (2)
  • Luc Van Gool (1)
  1. Computer Vision Lab, ETH Zurich, Zurich, Switzerland
  2. Department of Information Engineering, The Chinese University of Hong Kong, Hong Kong, China
  3. Shenzhen Institutes of Advanced Technology, CAS, Shenzhen, China