
F2D-SIFPNet: a frequency 2D Slow-I-Fast-P network for faster compressed video action recognition

Published in Applied Intelligence

Abstract

Recent video action recognition methods work directly on compressed videos, avoiding the cumbersome decoding process of traditional pipelines and enabling efficient recognition. However, these methods still convert discrete cosine transform (DCT) frequency coefficients into an expanded RGB pixel representation, which is heavily time-consuming. To alleviate this drawback, a novel frequency 2D Slow-I-Fast-P network (F2D-SIFPNet) is proposed that significantly increases the speed of action recognition. First, a new Frequency-Domain Partial Decompression (FPDec) method extracts the frequency-domain DCT coefficients directly from the compressed video, eliminating the final, time-consuming decoding step in FFmpeg. Second, a Frequency-Domain Channel Selection (FCS) strategy down-samples the frequency-domain data, increasing the saliency of the input. Finally, the Frequency Slow-I-Fast-P path (FSIFP) and the Adaptive Motion Excitation (AME) module emphasize the significant frequency components: FSIFP efficiently models slow spatial features and fast temporal changes simultaneously, while AME generates an adaptive convolution kernel that captures both long-term and short-term motion cues. Extensive experiments on four public datasets, Kinetics-700, Kinetics-400, UCF-101, and HMDB-51, yield superior accuracies of 55.6%, 74.0%, 96.3%, and 74.6%, respectively, with preprocessing 6.31 times faster.
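
To make the channel-selection idea concrete, here is a minimal, hypothetical sketch of FCS-style down-sampling that keeps only the most energetic DCT coefficient channels. The tensor layout (frames × channels × height × width), the 192-channel Y/Cb/Cr convention, and the magnitude-based saliency criterion are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of frequency-domain channel selection (an FCS-like idea):
# keep only the k most energetic DCT coefficient channels.
import torch

def select_frequency_channels(dct: torch.Tensor, k: int) -> torch.Tensor:
    """dct: (T, C, H, W) DCT coefficients, e.g. C = 192 for Y/Cb/Cr x 64 bands.
    Returns the k channels with the largest mean absolute magnitude."""
    energy = dct.abs().mean(dim=(0, 2, 3))              # per-channel saliency proxy
    topk = torch.topk(energy, k).indices.sort().values  # keep original channel order
    return dct[:, topk]                                 # (T, k, H, W)

frames = torch.randn(8, 192, 28, 28)   # toy DCT input: 8 frames
compact = select_frequency_channels(frames, k=64)
print(compact.shape)                   # torch.Size([8, 64, 28, 28])
```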

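Likewise, the AME module's idea of gating features by motion can be illustrated with a generic motion-excitation block in the spirit of TEA-style temporal excitation. The PyTorch sketch below assumes 2D backbone features shaped (batch × frames, C, H, W); every class and parameter name is hypothetical, not the authors' implementation.

```python
# Illustrative only: a motion-excitation style block in the spirit of AME.
# The squeeze -> temporal difference -> gate design follows generic
# motion-excitation practice, not the paper's exact architecture.
import torch
import torch.nn as nn

class MotionExcitationSketch(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        mid = max(channels // reduction, 1)
        self.squeeze = nn.Conv2d(channels, mid, 1, bias=False)   # reduce channels
        self.transform = nn.Conv2d(mid, mid, 3, padding=1, bias=False)
        self.expand = nn.Conv2d(mid, channels, 1, bias=False)    # restore channels
        self.pool = nn.AdaptiveAvgPool2d(1)

    def forward(self, x: torch.Tensor, num_frames: int) -> torch.Tensor:
        # x: (batch * num_frames, C, H, W) features from a 2D backbone
        nt, c, h, w = x.shape
        n = nt // num_frames
        feat = self.squeeze(x).view(n, num_frames, -1, h, w)
        # Short-term motion cue: difference between adjacent frames,
        # zero-padded at the last time step to keep the sequence length.
        diff = feat[:, 1:] - feat[:, :-1]
        diff = torch.cat([diff, diff.new_zeros(n, 1, *feat.shape[2:])], dim=1)
        diff = diff.reshape(nt, -1, h, w)
        # Channel-wise gate from pooled motion features.
        gate = torch.sigmoid(self.expand(self.pool(self.transform(diff))))
        return x + x * gate                                      # residual excitation

# Toy usage: 2 clips of 8 frames with 256-channel feature maps.
block = MotionExcitationSketch(channels=256)
y = block(torch.randn(2 * 8, 256, 14, 14), num_frames=8)
assert y.shape == (16, 256, 14, 14)
```

The residual form (x + x * gate) lets motion-salient channels be amplified without suppressing the appearance pathway, a common design choice in excitation blocks.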

Data Availability

The data used in this paper are all from public datasets.

Acknowledgements

The work presented in this paper was partly supported by the Natural Science Foundation of China (Grant No. 62076030), the basic research funds of Beijing University of Posts and Telecommunications (2023ZCJH08), and ZTE Corporation.

Author information

Contributions

Yue Ming and Jiangwan Zhou: Conceptualization, Methodology, Writing - original draft preparation; Yue Ming, Jiangwan Zhou and Lu Xiong: Data curation, Validation; Fan Feng: Writing - review and editing; Xia Jia and Qingfang Zheng: Funding support; Nannan Hu: Supervision, Writing - review and editing.

Corresponding author

Correspondence to Xia Jia.

Ethics declarations

Competing interests

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Ming, Y., Zhou, J., Jia, X. et al. F2D-SIFPNet: a frequency 2D Slow-I-Fast-P network for faster compressed video action recognition. Appl Intell (2024). https://doi.org/10.1007/s10489-024-05408-y
