
Spatiotemporal Fusion Networks for Video Action Recognition

  • Zheng Liu
  • Haifeng Hu
  • Junxuan Zhang

Abstract

Learning spatiotemporal information is fundamental to action recognition. In this work, we attempt to extract efficient spatiotemporal information for video representation through a novel architecture, termed SpatioTemporal Fusion Networks (STFN). STFN extracts spatiotemporal information by introducing connections between the spatial and temporal streams of a two-stream network via fusion blocks, called Compactly Fuse Spatial and Temporal information (CFST) blocks, whose goal is to integrate spatial and temporal information at little computational cost. CFST is built upon Compact Bilinear Pooling, which captures multiplicative interactions at corresponding locations. For better integration of the two streams, we explore the fusion configuration, namely where to insert the fusion blocks and how to combine the CFST block with additive interactions. We evaluate the proposed architecture on UCF-101 and HMDB-51 and obtain comparable performance.
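The CFST block builds on Compact Bilinear Pooling (Gao et al. 2016), which approximates the outer product of two feature vectors through Count Sketch projections combined by FFT-based circular convolution. The following is a minimal sketch of that fusion applied to one pair of per-location spatial and temporal feature vectors; the 512-dimensional inputs, the projection dimension d = 8192, and the fixed random hashes are illustrative assumptions rather than the exact configuration used in STFN.

    import numpy as np

    def count_sketch(x, h, s, d):
        # Project vector x (length c) into d dimensions using
        # fixed hash indices h and random signs s.
        y = np.zeros(d)
        np.add.at(y, h, s * x)
        return y

    def compact_bilinear_fusion(x_spatial, x_temporal, d=8192, seed=0):
        # Approximate the outer product of the two per-location feature
        # vectors: sketch each stream, then combine the sketches by
        # circular convolution computed in the frequency domain.
        rng = np.random.RandomState(seed)
        c1, c2 = x_spatial.shape[0], x_temporal.shape[0]
        h1, s1 = rng.randint(d, size=c1), rng.choice([-1, 1], size=c1)
        h2, s2 = rng.randint(d, size=c2), rng.choice([-1, 1], size=c2)
        p1 = np.fft.rfft(count_sketch(x_spatial, h1, s1, d))
        p2 = np.fft.rfft(count_sketch(x_temporal, h2, s2, d))
        return np.fft.irfft(p1 * p2, n=d)

    # Toy usage: fuse 512-dim spatial and temporal features at one location.
    fused = compact_bilinear_fusion(np.random.randn(512), np.random.randn(512))
    print(fused.shape)  # (8192,)

In practice such a fusion is applied at every spatial location of the two streams' feature maps, which is what allows the block to capture multiplicative interactions between corresponding locations without forming the full (and very large) outer product.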

Keywords

Action recognition · Convolutional neural networks · Compact bilinear pooling

Acknowledgements

This work was supported in part by the National Natural Science Foundation of China under Grant 61673402, Grant 61273270, and Grant 60802069, in part by the Natural Science Foundation of Guangdong under Grant 2017A030311029 and Grant 2016B010109002, in part by the Science and Technology Program of Guangzhou under Grant 201704020180 and Grant 201604020024, and in part by the Fundamental Research Funds for the Central Universities of China.

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  1. School of Electronics and Information Engineering, Sun Yat-sen University, Guangzhou, China
