Spatio-temporal Channel Correlation Networks for Action Classification

  • Ali Diba
  • Mohsen FayyazEmail author
  • Vivek Sharma
  • M. Mahdi Arzani
  • Rahman Yousefzadeh
  • Juergen Gall
  • Luc Van Gool
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11208)


The work in this paper is driven by the question if spatio-temporal correlations are enough for 3D convolutional neural networks (CNN)? Most of the traditional 3D networks use local spatio-temporal features. We introduce a new block that models correlations between channels of a 3D CNN with respect to temporal and spatial features. This new block can be added as a residual unit to different parts of 3D CNNs. We name our novel block ‘Spatio-Temporal Channel Correlation’ (STC). By embedding this block to the current state-of-the-art architectures such as ResNext and ResNet, we improve the performance by 2–3% on the Kinetics dataset. Our experiments show that adding STC blocks to current state-of-the-art architectures outperforms the state-of-the-art methods on the HMDB51, UCF101 and Kinetics datasets. The other issue in training 3D CNNs is about training them from scratch with a huge labeled dataset to get a reasonable performance. So the knowledge learned in 2D CNNs is completely ignored. Another contribution in this work is a simple and effective technique to transfer knowledge from a pre-trained 2D CNN to a randomly initialized 3D CNN for a stable weight initialization. This allows us to significantly reduce the number of training samples for 3D CNNs. Thus, by fine-tuning this network, we beat the performance of generic and recent methods in 3D CNNs, which were trained on large video datasets, e.g. Sports-1M, and fine-tuned on the target datasets, e.g. HMDB51/UCF101.



This work was supported by DBOF PhD scholarship, KU Leuven:CAMETRON project, and KIT:DFG-PLUMCOT project. The authors would like to thank Sensifai engineering team. Mohsen Fayyaz and Juergen Gall have been financially supported by the DFG project GA 1927/4-1 (Research Unit FOR 2535) and the ERC Starting Grant ARCA (677650).


  1. 1.
    Diba, A., Sharma, V., Van Gool, L.: Deep temporal linear encoding networks. In: CVPR (2017)Google Scholar
  2. 2.
    Tran, D., Bourdev, L., Fergus, R., Torresani, L., Paluri, M.: Learning spatiotemporal features with 3D convolutional networks. In: ICCV (2015)Google Scholar
  3. 3.
    Yue-Hei Ng, J., Hausknecht, M., Vijayanarasimhan, S., Vinyals, O., Monga, R., Toderici, G.: Beyond short snippets: Deep networks for video classification. In: CVPR (2015)Google Scholar
  4. 4.
    Hara, K., Kataoka, H., Satoh, Y.: Can spatiotemporal 3D CNNS retrace the history of 2D CNNS and imagenet? In: CVPR (2018)Google Scholar
  5. 5.
    Tran, D., Ray, J., Shou, Z., Chang, S.F., Paluri, M.: Convnet architecture search for spatiotemporal feature learning. arXiv:1708.05038 (2017)
  6. 6.
    Klaser, A., Marszałek, M., Schmid, C.: A spatio-temporal descriptor based on 3D-gradients. In: BMVC (2008)Google Scholar
  7. 7.
    Scovanner, P., Ali, S., Shah, M.: A 3-dimensional sift descriptor and its application to action recognition. In: ACM’MM (2007)Google Scholar
  8. 8.
    Laptev, I., Marszalek, M., Schmid, C., Rozenfeld, B.: Learning realistic human actions from movies. In: CVPR (2008)Google Scholar
  9. 9.
    Willems, G., Tuytelaars, T., Van Gool, L.: An efficient dense and scale-invariant spatio-temporal interest point detector. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008. LNCS, vol. 5303, pp. 650–663. Springer, Heidelberg (2008). Scholar
  10. 10.
    Dalal, N., Triggs, B., Schmid, C.: Human detection using oriented histograms of flow and appearance. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3952, pp. 428–441. Springer, Heidelberg (2006). Scholar
  11. 11.
    Wang, H., Schmid, C.: Action recognition with improved trajectories. In: ICCV (2013)Google Scholar
  12. 12.
    Fernando, B., Gavves, E., Oramas, J.M., Ghodrati, A., Tuytelaars, T.: Modeling video evolution for action recognition. In: CVPR (2015)Google Scholar
  13. 13.
    Donahue, J., et al.: Long-term recurrent convolutional networks for visual recognition and description. In: CVPR (2015)Google Scholar
  14. 14.
    Girdhar, R., Ramanan, D., Gupta, A., Sivic, J., Russell, B.: Actionvlad: Learning spatio-temporal aggregation for action classification. In: CVPR (2017)Google Scholar
  15. 15.
    Tang, P., Wang, X., Shi, B., Bai, X., Liu, W., Tu, Z.: Deep fishernet for object classification. arXiv:1608.00182 (2016)
  16. 16.
    Simonyan, K., Zisserman, A.: Two-stream convolutional networks for action recognition in videos. In: NIPS (2014)Google Scholar
  17. 17.
    Feichtenhofer, C., Pinz, A., Zisserman, A.: Convolutional two-stream network fusion for video action recognition. In: CVPR (2016)Google Scholar
  18. 18.
    Sun, L., Jia, K., Yeung, D.Y., Shi, B.E.: Human action recognition using factorized spatio-temporal convolutional networks. In: ICCV (2015)Google Scholar
  19. 19.
    Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: CVPR (2017)Google Scholar
  20. 20.
    Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: ICML (2015)Google Scholar
  21. 21.
    Wang, X., Girshick, R., Gupta, A., He, K.: Non-local neural networks. In: CVPR (2018)Google Scholar
  22. 22.
    Feichtenhofer, C., Pinz, A., Wildes, R.: Spatiotemporal residual networks for video action recognition. In: Advances in Neural Information Processing Systems, pp. 3468–3476 (2016)Google Scholar
  23. 23.
    Varol, G., Laptev, I., Schmid, C.: Long-term temporal convolutions for action recognition. IEEE Trans. Pattern Anal. Mach. Intell. (2017)Google Scholar
  24. 24.
    Xie, S., Sun, C., Huang, J., Tu, Z., Murphy, K.: Rethinking spatiotemporal feature learning for video understanding. arXiv preprint arXiv:1712.04851 (2017)
  25. 25.
    Miech, A., Laptev, I., Sivic, J.: Learnable pooling with context gating for video classification. arXiv preprint arXiv:1706.06905 (2017)
  26. 26.
    Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. arXiv:1503.02531 (2015)
  27. 27.
    Gupta, S., Hoffman, J., Malik, J.: Cross modal distillation for supervision transfer. In: CVPR (2016)Google Scholar
  28. 28.
    Diba, A., Pazandeh, A.M., Van Gool, L.: Efficient two-stream motion and appearance 3D CNNS for video classification. In: ECCV Workshops (2016)Google Scholar
  29. 29.
    Arandjelović, R., Zisserman, A.: Look, listen and learn. In: ICCV (2017)Google Scholar
  30. 30.
    Limmer, M., Lensch, H.P.: Infrared colorization using deep convolutional neural networks. In: ICMLA (2016)Google Scholar
  31. 31.
    Mansimov, E., Srivastava, N., Salakhutdinov, R.: Initialization strategies of spatio-temporal convolutional neural networks. arXiv preprint arXiv:1503.07274 (2015)
  32. 32.
    Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. arXiv preprint arXiv:1709.01507 (2017)
  33. 33.
    He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In: ICCV (2015)Google Scholar
  34. 34.
    He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)Google Scholar
  35. 35.
    Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: a large-scale hierarchical image database. In: CVPR (2009)Google Scholar
  36. 36.
    Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., Fei-Fei, L.: Large-scale video classification with convolutional neural networks. In: CVPR (2014)Google Scholar
  37. 37.
    Kuehne, H., Jhuang, H., Stiefelhagen, R., Serre, T.: HMDB51: a large video database for human motion recognition. In: Nagel, W., Kröner, D., Resch, M. (eds.) High Performance Computing in Science and Engineering’12. Springer, Heidelberg (2013). Scholar
  38. 38.
    Soomro, K., Zamir, A.R., Shah, M.: Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv:1212.0402 (2012)
  39. 39.
    Abu-El-Haija, S., et al.: Youtube-8m: A large-scale video classification benchmark. arXiv:1609.08675 (2016)
  40. 40.
    Cai, Z., Wang, L., Peng, X., Qiao, Y.: Multi-view super vector for action recognition. In: CVPR (2014)Google Scholar
  41. 41.
    Wang, L., Qiao, Y., Tang, X.: Action recognition with trajectory-pooled deep-convolutional descriptors. In: CVPR (2015)Google Scholar
  42. 42.
    Wang, L., et al.: Temporal segment networks: towards good practices for deep action recognition. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9912, pp. 20–36. Springer, Cham (2016). Scholar
  43. 43.
    Qiu, Z., Yao, T., Mei, T.: Learning spatio-temporal representation with pseudo-3D residual networks. In: 2017 IEEE International Conference on Computer Vision (ICCV) (2017)Google Scholar

Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Ali Diba
    • 1
    • 4
  • Mohsen Fayyaz
    • 2
    Email author
  • Vivek Sharma
    • 3
  • M. Mahdi Arzani
    • 4
  • Rahman Yousefzadeh
    • 4
  • Juergen Gall
    • 2
  • Luc Van Gool
    • 1
    • 4
  1. 1.ESAT-PSI, KU LeuvenLeuvenBelgium
  2. 2.University of BonnBonnGermany
  3. 3.CV:HCI, KITKarlsruheGermany
  4. 4.SensifaiBruxellesBelgium

Personalised recommendations