
F2D-SIFPNet: a frequency 2D Slow-I-Fast-P network for faster compressed video action recognition

Published in Applied Intelligence

Abstract

Recent video action recognition methods work directly on compressed videos, avoiding the cumbersome decoding process of traditional pipelines and enabling efficient recognition. However, these methods still convert discrete cosine transform (DCT) frequency coefficients into an expanded RGB pixel representation, which is heavily time-consuming. To alleviate this drawback, a novel frequency 2D Slow-I-Fast-P network (F2D-SIFPNet) is proposed that significantly increases the speed of action recognition. First, a new Frequency-Domain Partial Decompression (FPDec) method extracts the frequency-domain DCT coefficients directly from the compressed video, eliminating the final, time-consuming decoding step in FFmpeg. Second, a Frequency-Domain Channel Selection (FCS) strategy down-samples the frequency-domain data, increasing the saliency of the input. Finally, the Frequency Slow-I-Fast-P path (FSIFP) and the Adaptive Motion Excitation (AME) module emphasize the significant frequency components: FSIFP efficiently models slow spatial features and fast temporal changes simultaneously, while AME generates an adaptive convolution kernel that captures both long-term and short-term motion cues. Extensive experiments on four public datasets, Kinetics-700, Kinetics-400, UCF-101, and HMDB-51, yield superior accuracies of 55.6%, 74.0%, 96.3%, and 74.6%, respectively, with preprocessing 6.31 times faster.
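
To make the channel-selection idea concrete, here is a minimal, hypothetical sketch of FCS-style down-sampling that keeps only the most energetic DCT coefficient channels. The tensor layout (frames × channels × height × width), the 192-channel Y/Cb/Cr convention, and the magnitude-based saliency criterion are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of frequency-domain channel selection (an FCS-like idea):
# keep only the k most energetic DCT coefficient channels.
import torch

def select_frequency_channels(dct: torch.Tensor, k: int) -> torch.Tensor:
    """dct: (T, C, H, W) DCT coefficients, e.g. C = 192 for Y/Cb/Cr x 64 bands.
    Returns the k channels with the largest mean absolute magnitude."""
    energy = dct.abs().mean(dim=(0, 2, 3))              # per-channel saliency proxy
    topk = torch.topk(energy, k).indices.sort().values  # keep original channel order
    return dct[:, topk]                                 # (T, k, H, W)

frames = torch.randn(8, 192, 28, 28)   # toy DCT input: 8 frames
compact = select_frequency_channels(frames, k=64)
print(compact.shape)                   # torch.Size([8, 64, 28, 28])
```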

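Likewise, the AME module's idea of gating features by motion can be illustrated with a generic motion-excitation block in the spirit of TEA-style temporal excitation. The PyTorch sketch below assumes 2D backbone features shaped (batch × frames, C, H, W); every class and parameter name is hypothetical, not the authors' implementation.

```python
# Illustrative only: a motion-excitation style block in the spirit of AME.
# The squeeze -> temporal difference -> gate design follows generic
# motion-excitation practice, not the paper's exact architecture.
import torch
import torch.nn as nn

class MotionExcitationSketch(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        mid = max(channels // reduction, 1)
        self.squeeze = nn.Conv2d(channels, mid, 1, bias=False)   # reduce channels
        self.transform = nn.Conv2d(mid, mid, 3, padding=1, bias=False)
        self.expand = nn.Conv2d(mid, channels, 1, bias=False)    # restore channels
        self.pool = nn.AdaptiveAvgPool2d(1)

    def forward(self, x: torch.Tensor, num_frames: int) -> torch.Tensor:
        # x: (batch * num_frames, C, H, W) features from a 2D backbone
        nt, c, h, w = x.shape
        n = nt // num_frames
        feat = self.squeeze(x).view(n, num_frames, -1, h, w)
        # Short-term motion cue: difference between adjacent frames,
        # zero-padded at the last time step to keep the sequence length.
        diff = feat[:, 1:] - feat[:, :-1]
        diff = torch.cat([diff, diff.new_zeros(n, 1, *feat.shape[2:])], dim=1)
        diff = diff.reshape(nt, -1, h, w)
        # Channel-wise gate from pooled motion features.
        gate = torch.sigmoid(self.expand(self.pool(self.transform(diff))))
        return x + x * gate                                      # residual excitation

# Toy usage: 2 clips of 8 frames with 256-channel feature maps.
block = MotionExcitationSketch(channels=256)
y = block(torch.randn(2 * 8, 256, 14, 14), num_frames=8)
assert y.shape == (16, 256, 14, 14)
```

The residual form (x + x * gate) lets motion-salient channels be amplified without suppressing the appearance pathway, a common design choice in excitation blocks.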

Data Availability

The data used in this paper are all from public datasets.

Acknowledgements

The work presented in this paper was partly supported by the Natural Science Foundation of China (Grant No. 62076030), the basic research funds of Beijing University of Posts and Telecommunications (2023ZCJH08), and ZTE Corporation.

Author information

Contributions

Yue Ming and Jiangwan Zhou: Conceptualization, Methodology, Writing - original draft preparation; Yue Ming, Jiangwan Zhou and Lu Xiong: Data curation, Validation; Fan Feng: Writing - review and editing; Xia Jia and Qingfang Zheng: Funding support; Nannan Hu: Supervision, Writing - review and editing.

Corresponding author

Correspondence to Xia Jia.

Ethics declarations

Competing interests

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Ming, Y., Zhou, J., Jia, X. et al. F2D-SIFPNet: a frequency 2D Slow-I-Fast-P network for faster compressed video action recognition. Appl Intell (2024). https://doi.org/10.1007/s10489-024-05408-y
