Second-order Temporal Pooling for Action Recognition

Abstract

Deep learning models for video-based action recognition usually generate features for short clips (consisting of a few frames); such clip-level features are aggregated to video-level representations by computing statistics on these features. Typically, zeroth-order (max) or first-order (average) statistics are used. In this paper, we explore the benefits of using second-order statistics. Specifically, we propose a novel end-to-end learnable feature aggregation scheme, dubbed temporal correlation pooling, that generates an action descriptor for a video sequence by capturing the similarities between the temporal evolution of clip-level CNN features computed across the video. Such a descriptor, while being computationally cheap, also naturally encodes the co-activations of multiple CNN features, thereby providing a richer characterization of actions than their first-order counterparts. We also propose higher-order extensions of this scheme by computing correlations after embedding the CNN features in a reproducing kernel Hilbert space. We provide experiments on benchmark datasets such as HMDB-51 and UCF-101, fine-grained datasets such as MPII Cooking activities and JHMDB, as well as the recent Kinetics-600. Our results demonstrate the advantages of higher-order pooling schemes, which, when combined with hand-crafted features (as is standard practice), achieve state-of-the-art accuracy.
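The core idea of the abstract can be illustrated with a minimal sketch. This is not the paper's exact formulation; the function name, the trajectory normalization, and the upper-triangle vectorization are illustrative assumptions. Given one CNN feature vector per clip, each feature dimension traces a trajectory over time, and the descriptor captures pairwise correlations between these trajectories:

```python
import numpy as np

def temporal_correlation_pooling(clip_features, eps=1e-8):
    """Hedged sketch of second-order temporal pooling.

    clip_features: (T, D) array -- one D-dimensional CNN feature per clip.
    Returns the upper triangle of the D x D matrix whose (i, j) entry is
    the correlation between the temporal trajectories of features i and j.
    """
    F = np.asarray(clip_features, dtype=np.float64)
    F = F - F.mean(axis=0, keepdims=True)                     # center each trajectory in time
    F = F / (np.linalg.norm(F, axis=0, keepdims=True) + eps)  # unit-norm trajectories
    C = F.T @ F                                               # D x D correlation matrix
    # Vectorize the upper triangle (including the diagonal) as the descriptor.
    iu = np.triu_indices(C.shape[0])
    return C[iu]

# Toy usage: 10 clips, 64-dimensional features.
desc = temporal_correlation_pooling(np.random.randn(10, 64))
print(desc.shape)  # (2080,) since 64 * 65 / 2 = 2080
```

Under this reading, the higher-order RKHS extension mentioned in the abstract would replace the raw trajectories `F` with an (approximate) kernel feature map before computing `C`.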


Figs. 1–10 (figure images and captions not included in this extract).

Notes

  1. As we fine-tune the VGG network from a pre-trained ImageNet model, we use \(\beta = 3\) for SMAID in our implementation.

  2. With a slight abuse of previously introduced notation, we assume T to be raw feature trajectories without any scaling or normalization.

  3. Available from https://github.com/feichtenhofer/twostreamfusion.

  4. http://caffe.berkeleyvision.org/.

  5. The VGG-16 and ResNet-152 pre-trained models are publicly available at http://ftp.tugraz.at/pub/feichtenhofer/tsfusion/models/twostream_base/vgg16/ and http://ftp.tugraz.at/pub/feichtenhofer/tsfusion/models/twostream_base/resnet152/, respectively.

  6. http://www.vlfeat.org.

  7. https://deepmind.com/research/open-source/open-source-datasets/kinetics/.


Acknowledgements

This research was supported by the Australian Research Council (ARC) through the Centre of Excellence for Robotic Vision (CE140100016) and was undertaken with the resources from the National Computational Infrastructure (NCI) at the Australian National University. The authors also thank Mr. Edison Guo (ANU) for helpful discussions.

Author information

Corresponding author

Correspondence to Anoop Cherian.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Communicated by Ivan Laptev.


About this article


Cite this article

Cherian, A., Gould, S. Second-order Temporal Pooling for Action Recognition. Int J Comput Vis 127, 340–362 (2019). https://doi.org/10.1007/s11263-018-1111-5


Keywords

  • Action recognition
  • Deep learning
  • Kernel descriptors
  • Second-order statistics
  • Pooling
  • Image representations
  • End-to-end learning
  • Region covariance descriptors