International Journal of Computer Vision

, Volume 127, Issue 4, pp 340–362 | Cite as

Second-order Temporal Pooling for Action Recognition

  • Anoop CherianEmail author
  • Stephen Gould


Deep learning models for video-based action recognition usually generate features for short clips (consisting of a few frames); such clip-level features are aggregated to video-level representations by computing statistics on these features. Typically zero-th (max) or the first-order (average) statistics are used. In this paper, we explore the benefits of using second-order statistics.Specifically, we propose a novel end-to-end learnable feature aggregation scheme, dubbed temporal correlation pooling that generates an action descriptor for a video sequence by capturing the similarities between the temporal evolution of clip-level CNN features computed across the video. Such a descriptor, while being computationally cheap, also naturally encodes the co-activations of multiple CNN features, thereby providing a richer characterization of actions than their first-order counterparts. We also propose higher-order extensions of this scheme by computing correlations after embedding the CNN features in a reproducing kernel Hilbert space. We provide experiments on benchmark datasets such as HMDB-51 and UCF-101, fine-grained datasets such as MPII Cooking activities and JHMDB, as well as the recent Kinetics-600. Our results demonstrate the advantages of higher-order pooling schemes that when combined with hand-crafted features (as is standard practice) achieves state-of-the-art accuracy.


Action recognition Deep Learning Kernel descriptors Second-order statistics Pooling Image Representations End-to-end learning Region covariance descriptors 



This research was supported by the Australian Research Council (ARC) through the Centre of Excellence for Robotic Vision (CE140100016) and was undertaken with the resources from the National Computational Infrastructure (NCI) at the Australian National University. The authors also thank Mr. Edison Guo (ANU) for helpful discussions.


  1. Arsigny, V., Fillard, P., Pennec, X., & Ayache, N. (2006). Log-euclidean metrics for fast and simple calculus on diffusion tensors. Magnetic Resonance in Medicine, 56(2), 411–421.CrossRefGoogle Scholar
  2. Baccouche, M., Mamalet, F., Wolf, C., Garcia, C., & Baskurt, A. (2011). Sequential deep learning for human action recognition. In Human Behavior Understanding, pp 29–39.Google Scholar
  3. Bilen, H., Fernando, B., Gavves, E., Vedaldi, A., & Gould, S. (2016). Dynamic image networks for action recognition. In CVPR.Google Scholar
  4. Blank, M., Gorelick, L., Shechtman, E., Irani, M., & Basri, R. (2005). Actions as space-time shapes. IEEE: In ICCV.Google Scholar
  5. Bojanowski, P., Lajugie, R., Bach, F., Laptev, I., Ponce, J., Schmid, C., & Sivic, J. (2014). Weakly supervised action labeling in videos under ordering constraints. In ECCV.Google Scholar
  6. Cai, Z., Wang, L., Peng, X., & Qiao, Y. (2014). Multi-view super vector for action recognition. In CVPR.Google Scholar
  7. Carreira, J., & Zisserman, A. (2017). Quo vadis, action recognition? a new model and the kinetics dataset. In CVPR, pp. 4724–4733. IEEE.Google Scholar
  8. Carreira, J., Caseiro, R., Batista, J., & Sminchisescu, C. (2012). Semantic segmentation with second-order pooling. In ECCV.Google Scholar
  9. Chaquet, J. M., Carmona, E. J., & Fernández-Caballero, A. (2013). A survey of video datasets for human action and activity recognition. Computer Vision and Image Understanding, 117(6), 633–659.CrossRefGoogle Scholar
  10. Chatfield, K., Simonyan, K., Vedaldi, A, & Zisserman, A. (2014). Return of the devil in the details: Delving deep into convolutional nets. arXiv preprint arXiv:1405.3531.
  11. Chen, X., & Yuille, A.L. (2014). Articulated pose estimation by a graphical model with image dependent pairwise relations. In NIPS.Google Scholar
  12. Cherian, A., Sra, S., Banerjee, A., & Papanikolopoulos, N. (2013). Jensen-bregman logdet divergence with application to efficient similarity search for covariance matrices. PAMI, 35(9), 2161–2174.CrossRefGoogle Scholar
  13. Cherian, A., Fernando, B., Harandi, M., & Gould, S. (2017a). Generalized rank pooling for action recognition. In CVPR.Google Scholar
  14. Cherian, A., Koniusz, P., & Gould, S. (2017b). Higher-order pooling of CNN features via kernel linerization for action recognition. In WACV.Google Scholar
  15. Cherian, A., Sra, S., Gould, S., & Hartley, R. (2018). Non-linear temporal subspace representations for activity recognition. In CVPR, pp 2197–2206.Google Scholar
  16. Chéron, G., Laptev, I., & Schmid, C.. (2015). P-CNN: Pose-based CNN features for action recognition. arXiv preprint arXiv:1506.03607.
  17. Davis, J. W., & Bobick, A. F. (1997). The representation and recognition of human movement using temporal templates. IEEE: In CVPR.Google Scholar
  18. Donahue, J., Anne Hendricks, L., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko, K., & Darrell, T. (2014). Long-term recurrent convolutional networks for visual recognition and description. arXiv preprint arXiv:1411.4389.
  19. Duchenne, O., Laptev, I., Sivic, J., Bach, F., & Ponce, J. (2009). Automatic annotation of human actions in video. In ICCV.Google Scholar
  20. Feichtenhofer, C., Pinz, A., & Wildes, R. (2016a). Spatiotemporal residual networks for video action recognition. In NIPS.Google Scholar
  21. Feichtenhofer, C., Pinz, A., & Zisserman, A. (2016b). Convolutional two-stream network fusion for video action recognition. In CVPR.Google Scholar
  22. Feichtenhofer, C,, Pinz, A., & Wildes, R. P. (2017). Spatiotemporal multiplier networks for video action recognition. IEEE: In CVPR.Google Scholar
  23. Fernando, B., Gavves, E., Oramas, J. M., Ghodrati, A., & Tuytelaars, T. (2015a). Modeling video evolution for action recognition. In CVPR.Google Scholar
  24. Fernando, B., Gavves, E., Oramas, J. M., Ghodrati, A., & Tuytelaars, T. (2015b). Modeling video evolution for action recognition. In CVPR.Google Scholar
  25. Gall, J., Yao, A., Razavi, N., Van Gool, L., & Lempitsky, V. (2011). Hough forests for object detection, tracking, and action recognition. PAMI, 33(11), 2188–2202.CrossRefGoogle Scholar
  26. Girdhar, R., Ramanan, D., Gupta, A., Sivic, J., & Russell, B. (2017). Actionvlad: Learning spatio-temporal aggregation for action classification. In CVPR, volume 2, p. 3.Google Scholar
  27. Gkioxari, G., & Malik, J. (2015). Finding action tubes. In CVPR.Google Scholar
  28. Gu, C., Sun, C., Ross, D. A., Vondrick, C., Pantofaru, C., Li, Y., Vijayanarasimhan, S., Toderici, G., Ricco, S., Sukthankar, R., et al. (2017). AVA: A video dataset of spatio-temporally localized atomic visual actions. CoRR, abs/1705.08421, 4.Google Scholar
  29. Guo, K., Ishwar, P., & Konrad, J. (2013). Action recognition from video using feature covariance matrices. In TIP.Google Scholar
  30. He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In CVPR.Google Scholar
  31. Herath, S., Harandi, M., & Porikli, F. (2017). Going deeper into action recognition: A survey. Image and Vision Computing, 60, 4–21. ISSN 0262-8856. Regularization Techniques for High-Dimensional Data Analysis.CrossRefGoogle Scholar
  32. Hoeting, J. A., Madigan, D., Raftery, A. E., & Volinsky, C. T. (1999). Bayesian model averaging: A tutorial. Statistical Science, 14(4), 382–401.MathSciNetCrossRefzbMATHGoogle Scholar
  33. Huang, Z., & Van Gool, L. (2017). A riemannian network for spd matrix learning. In AAAI.Google Scholar
  34. Ionescu, C., Vantzos, O. & Sminchisescu, C. (2015). Matrix backpropagation for deep networks with structured layers. In ICCV.Google Scholar
  35. Jebara, T., & Kondor, R. (2003). Bhattacharyya and expected likelihood kernels. In Learning theory and kernel machines, pp. 57–71. Springer.Google Scholar
  36. Jégou, H., Douze, M., & Schmid, C. (2009). On the burstiness of visual elements. In CVPR.Google Scholar
  37. Jegou, H., Douze, M., & Schmid, C. (2011). Product quantization for nearest neighbor search. PAMI, 33(1), 117–128.CrossRefGoogle Scholar
  38. Jhuang, H., Gall, J., Zuffi, S., Schmid, C., & Black, Michael J. (2013). Towards understanding action recognition. In ICCV.Google Scholar
  39. Ji, S., Wei, X., Yang, M., & Kai, Y. (2013). 3d convolutional neural networks for human action recognition. PAMI, 35(1), 221–231.CrossRefGoogle Scholar
  40. Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, Li (2014). Large-scale video classification with convolutional neural networks. In CVPR.Google Scholar
  41. Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C., Vijayanarasimhan, S., et al. (2017). The kinetics human action video dataset. arXiv preprint arXiv:1705.06950.
  42. Klaser, A., Marszałek, M., & Schmid, C. (2008). A spatio-temporal descriptor based on 3d-gradients. In BMVC.Google Scholar
  43. Koniusz, P., Cherian, A., & Porikli, F. (2016). Tensor representations via kernel linearization for action recognition from 3D skeletons. In ECCV. Google Scholar
  44. Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In NIPS.Google Scholar
  45. Kuehne, H., Jhuang, H., Garrote, E., Poggio, T., & Serre, T. (2011). Hmdb: A large video database for human motion recognition. IEEE: In ICCV.Google Scholar
  46. Lan, T., Chen, T.-C., & Savarese, S. (2014). A hierarchical representation for future action prediction. In ECCV.Google Scholar
  47. Lan, T., Zhu, Y., Zamir Roshan, A., & Savarese, S. (2015). Action recognition by hierarchical mid-level action elements. In ICCV.Google Scholar
  48. Laptev, I. (2005). On space-time interest points. International Journal of Computer Vision, 64(2–3), 107–123.CrossRefGoogle Scholar
  49. Le, Q. V., Zou, W. Y., Yeung, S. Y., & Ng, A. Y. (2011). Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis. In CVPR.Google Scholar
  50. Lei, J., Ren, X, & Fox, D. (2012). Fine-grained kitchen activity recognition using RGB-D. In ACM Conference on Ubiquitous Computing.Google Scholar
  51. Li, P., Wang, Q., Zuo, W., & Zhang, L. (2013). Log-euclidean kernels for sparse representation and dictionary learning. In ICCV.Google Scholar
  52. Monfort, M., Zhou, B., Bargal, S. A., Andonian, A., Yan, T., Ramakrishnan, K., Brown, L., Fan, Q., Gutfruend, D., Vondrick, C. et al. (2018). Moments in time dataset: One million videos for event understanding. arXiv preprint arXiv:1801.03150.
  53. Newell, A., Yang, K., & Deng, J. (2016). Stacked hourglass networks for human pose estimation. In ECCV, Springer.Google Scholar
  54. Ni, B., Paramathayalan, V. R., & Moulin, P. (2014). Multiple granularity analysis for fine-grained action detection. In CVPR.Google Scholar
  55. Oneata, D., Verbeek, J., & Schmid, C. (2013). Action and event recognition with fisher vectors on a compact feature set. In ICCV.Google Scholar
  56. Pascanu, R., Mikolov, T., & Bengio, Y. (2013). On the difficulty of training recurrent neural networks. In ICML.Google Scholar
  57. Peng, X., Zou, C., Qiao, Y., & Qiang, P. (2014). Action recognition with stacked fisher vectors. In ECCV, Springer.Google Scholar
  58. Peng, X., Wang, L., Wang, X., & Qiao, Y. (2016). Bag of visual words and fusion methods for action recognition: Comprehensive study and good practice. In CVIU.Google Scholar
  59. Pennec, X., Fillard, P., & Ayache, N. (2006). A riemannian framework for tensor computing. International Journal of Computer Vision, 66(1), 41–66.CrossRefzbMATHGoogle Scholar
  60. Pirsiavash, H., & Ramanan, D. (2014). Parsing videos of actions with segmental grammars. In CVPR.Google Scholar
  61. Pishchulin, L., Andriluka, M., & Schiele, B. (2014). Fine-grained activity recognition with holistic and pose based features. In Pattern Recognition, (pp. 678–689). Springer.Google Scholar
  62. Prest, A., Schmid, C., & Ferrari, V. (2012). Weakly supervised learning of interactions between humans and objects. PAMI, 34(3), 601–614.CrossRefGoogle Scholar
  63. Ren, S,, He, K., Girshick, R., & Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, (pp. 91–99).Google Scholar
  64. Rohrbach, M., Amin, S., Andriluka, M., & Schiele, B. (2012). A database for fine grained activity detection of cooking activities. In CVPR.Google Scholar
  65. Rohrbach, M., Rohrbach, A., Regneri, M., Amin, S., Andriluka, M., Pinkal, M., & Schiele, B. (2015). Recognizing fine-grained and composite activities using hand-centric features and script data. arXiv preprint arXiv:1502.06648.
  66. Ryoo, M. S., & Aggarwal, J. K. (2006). Recognition of composite human activities through context-free grammar based representation. In CVPR.Google Scholar
  67. Sadanand, S., & Corso, J. J. (2012). Action bank: A high-level representation of activity in video. In CVPR.Google Scholar
  68. Simonyan, K., & Zisserman, A. (2014). Two-stream convolutional networks for action recognition in videos. In NIPS.Google Scholar
  69. Soomro, K., Zamir, A. R, & Shah, M. (2012). UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402.
  70. Sra, S. (2011). Positive definite matrices and the symmetric stein divergence. Technical report.Google Scholar
  71. Srivastava, N., Mansimov, E., & Salakhutdinov, R. (2015). Unsupervised learning of video representations using LSTMs. In ICML.Google Scholar
  72. Sun, C., & Nevatia, R. (2014). Discover: Discovering important segments for classification of video events and recounting. In CVPR.Google Scholar
  73. Tang, K., Fei-Fei, L., & Koller, D. (2012). Learning latent temporal structure for complex event detection. In CVPR.Google Scholar
  74. Tompson, J., Goroshin, R., Jain, A., LeCun, Y., & Bregler, C. (2015). Efficient object localization using convolutional networks. In CVPR.Google Scholar
  75. Tompson, J. J., Jain, A., LeCun, Y., & Bregler, C. (2014). Joint training of a convolutional network and a graphical model for human pose estimation. In NIPS.Google Scholar
  76. Tran, D., Bourdev, L., D., Fergus, R., Torresani, L., & Paluri, M.. (2015). Learning spatiotemporal features with 3D convolutional networks. In ICCV.Google Scholar
  77. Vedaldi, A., & Zisserman, A. (2012). Efficient additive kernels via explicit feature maps. PAMI, 34(3), 480–492.CrossRefGoogle Scholar
  78. Wang, C., Wang, Y., & Yuille, A. L. (2013a). An approach to pose-based action recognition. In CVPR.Google Scholar
  79. Wang, H, & Schmid, C. (2013). Action recognition with improved trajectories. In ICCV.Google Scholar
  80. Wang, H., Kläser, A., Schmid, C., & Liu, C.-L. (2013b). Dense trajectories and motion boundary descriptors for action recognition. IJCV, 103(1), 60–79.MathSciNetCrossRefGoogle Scholar
  81. Wang, J., Cherian, A., & Porikli, F. (2017). Ordered pooling of optical flow sequences for action recognition. In WACV.Google Scholar
  82. Wang, J., Cherian, A., Porikli, F., & Gould, S. (2018). Video representation learning using discriminative pooling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (pp. 1149–1158).Google Scholar
  83. Wang, L., Qiao, Y., & Tang, X. (2015). Action recognition with trajectory-pooled deep-convolutional descriptors. In CVPR.Google Scholar
  84. Wang, L., Xiong, Y., Wang, Z., Qiao, Y., Lin, D., Tang, X., & Van Gool, L. (2016). Temporal segment networks: Towards good practices for deep action recognition. In ECCV.Google Scholar
  85. Wei, S.-E., Ramakrishna, V., Kanade, T., & Sheikh, Y. (2016). Convolutional pose machines. In CVPR.Google Scholar
  86. Wu, C., Zhang, J., Savarese, S., & Saxena, A. (2015). Watch-n-patch: Unsupervised understanding of actions and relations. In CVPR.Google Scholar
  87. Yao, A., Gall, J., Fanelli, G., & Van Gool, L. J. (2011a). Does human action recognition benefit from pose estimation?. In BMVC.Google Scholar
  88. Yao, B., & Fei-Fei, L. (2012). Action recognition with exemplar based 2.5 d graph matching. In ECCV.Google Scholar
  89. Yao, B., Jiang, X., Khosla, A., Lin, A. L., Guibas, L., & Fei-Fei, L. (2011b). Human action recognition by learning bases of action attributes and parts. In ICCV.Google Scholar
  90. Yu, K., & Salzmann, M. (2017). Second-order convolutional neural networks. arXiv preprint arXiv:1703.06817.
  91. Yuan, C., Hu, W., Li, X., Maybank, S., & Luo, G. (2009). Human action recognition under log-euclidean riemannian metric. In ACCV.Google Scholar
  92. Yue-Hei Ng, J., Hausknecht, M., Vijayanarasimhan, S., Vinyals, O., Monga, R., & Toderici, G. (2015). Beyond short snippets: Deep networks for video classification. In CVPR.Google Scholar
  93. Zhou, Y., Ni, B., Yan, S., Moulin, P., & Tian, Q. (2014). Pipelining localized semantic features for fine-grained action recognition. In ECCV.Google Scholar
  94. Zhou, Y., Ni, B., Hong, R., Wang, M., & Tian, Q. (2015). Interaction part mining: A mid-level approach for fine-grained action recognition. In CVPR.Google Scholar
  95. Zisserman, A., Carreira, J., Simonyan, K., Kay, W., Zhang, B., Hillier, C., Vijayanarasimhan, S., Viola, F., Green, T., Back, T. et al. (2017). The kinetics human action video dataset.Google Scholar
  96. Zuffi, S., & Black, M. J. (2013). Puppet flow. IJCV, 101(3), 437–458.CrossRefGoogle Scholar

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. 1.Australian Centre for Robotic VisionThe Australian National UniversityCanberraAustralia

Personalised recommendations