
Second-order Temporal Pooling for Action Recognition

International Journal of Computer Vision

Abstract

Deep learning models for video-based action recognition usually generate features for short clips (consisting of a few frames); such clip-level features are aggregated into video-level representations by computing statistics on them. Typically, zeroth-order (max) or first-order (average) statistics are used. In this paper, we explore the benefits of using second-order statistics. Specifically, we propose a novel end-to-end learnable feature aggregation scheme, dubbed temporal correlation pooling, that generates an action descriptor for a video sequence by capturing the similarities between the temporal evolution of clip-level CNN features computed across the video. Such a descriptor, while being computationally cheap, also naturally encodes the co-activations of multiple CNN features, thereby providing a richer characterization of actions than first-order counterparts. We also propose higher-order extensions of this scheme by computing correlations after embedding the CNN features in a reproducing kernel Hilbert space. We provide experiments on benchmark datasets such as HMDB-51 and UCF-101, fine-grained datasets such as MPII Cooking Activities and JHMDB, as well as the recent Kinetics-600. Our results demonstrate the advantages of higher-order pooling schemes, which, when combined with hand-crafted features (as is standard practice), achieve state-of-the-art accuracy.
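To make the core idea concrete, the following is a minimal NumPy sketch of second-order temporal pooling over clip-level features. It is an illustrative approximation only, not the paper's exact formulation: the centering, per-trajectory normalization, and upper-triangular vectorization are assumptions made for the example.

```python
import numpy as np

def temporal_correlation_pooling(clip_features, eps=1e-8):
    """Illustrative sketch of second-order temporal pooling.

    clip_features: array of shape (T, d) holding T clip-level CNN
    feature vectors of dimension d for one video.

    Returns a vectorized d x d descriptor whose (i, j) entry measures the
    similarity between the temporal evolution of features i and j.
    """
    X = np.asarray(clip_features, dtype=np.float64)          # (T, d)
    X = X - X.mean(axis=0, keepdims=True)                    # center each feature trajectory over time
    norms = np.linalg.norm(X, axis=0, keepdims=True) + eps   # per-feature trajectory norm
    X = X / norms                                            # unit-norm trajectories
    C = X.T @ X                                              # (d, d) correlations between feature trajectories
    iu = np.triu_indices(C.shape[0])                         # C is symmetric, so keep the upper triangle
    return C[iu]                                             # flat descriptor for a linear classifier

# Example usage: 10 clips with hypothetical 128-dimensional clip features.
rng = np.random.default_rng(0)
video = rng.standard_normal((10, 128))
descriptor = temporal_correlation_pooling(video)
print(descriptor.shape)  # (8256,) = 128 * 129 / 2
```

The kernelized extension described in the abstract would replace the linear inner products between trajectories with kernel evaluations (i.e., compute the correlations after an RKHS embedding of the clip-level features).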




Notes

  1. As we fine-tune the VGG network from a pre-trained ImageNet model, we use \(\beta = 3\) for SMAID in our implementation.

  2. With a slight abuse of previously introduced notations, we assume T to be raw feature trajectories without any scaling or normalization.

  3. Available from https://github.com/feichtenhofer/twostreamfusion.

  4. http://caffe.berkeleyvision.org/.

  5. The VGG-16 and ResNet-152 pre-trained models are publicly available at http://ftp.tugraz.at/pub/feichtenhofer/tsfusion/models/twostream_base/vgg16/ and http://ftp.tugraz.at/pub/feichtenhofer/tsfusion/models/twostream_base/resnet152/, respectively.

  6. http://www.vlfeat.org.

  7. https://deepmind.com/research/open-source/open-source-datasets/kinetics/.


Acknowledgements

This research was supported by the Australian Research Council (ARC) through the Centre of Excellence for Robotic Vision (CE140100016) and was undertaken with the resources from the National Computational Infrastructure (NCI) at the Australian National University. The authors also thank Mr. Edison Guo (ANU) for helpful discussions.

Author information


Corresponding author

Correspondence to Anoop Cherian.

Additional information

Communicated by Ivan Laptev.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


Cite this article

Cherian, A., Gould, S. Second-order Temporal Pooling for Action Recognition. Int J Comput Vis 127, 340–362 (2019). https://doi.org/10.1007/s11263-018-1111-5

