Fragrant: frequency-auxiliary guided relational attention network for low-light action recognition

Liu, Wenxuan; Jia, Xuemei; Ju, Yihao; Ju, Yakun; Jiang, Kui; Wu, Shifeng; Zhong, Luo; Zhong, Xian

doi:10.1007/s00371-024-03427-x

Review
Published: 14 May 2024

The Visual Computer

Wenxuan Liu¹,
Xuemei Jia²,
Yihao Ju³,
Yakun Ju⁴,
Kui Jiang⁵,
Shifeng Wu²,
Luo Zhong¹ &
Xian Zhong¹,⁴

Abstract

Video action recognition aims to classify actions in sequences of video frames and has important applications across computer vision. Existing methods perform well in well-lit environments, but their performance degrades under low-light conditions because relevant information is hard to extract from dark, noisy images. Simply adding an enhancement network as a preprocessing step increases both the parameter count and the computational cost of processing video. To address this dilemma, this paper presents a novel frequency-based method, the FRequency-Auxiliary Guided Relational Attention NeTwork (FRAGRANT), designed specifically for low-light action recognition. Its distinctive features are: (1) a novel Frequency-Auxiliary Module that focuses on informative object regions, characterizing action and motion while effectively suppressing noise; and (2) a Relational Attention Module that enhances motion representation by modeling the local relations between neighboring positions, thereby resolving issues such as fuzzy boundaries more efficiently. Comprehensive experiments show that FRAGRANT outperforms existing methods, achieving state-of-the-art results on several standard low-light action recognition benchmarks.
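The intuition behind working in the frequency domain can be illustrated with a generic low-pass filter. The sketch below is not the authors' Frequency-Auxiliary Module, whose details are not given in this abstract; it only shows, under simplified assumptions, how smooth scene structure and pixel-level noise occupy different frequency bands in a dark frame. The function names `dft2`, `idft2`, and `low_pass`, the 8x8 synthetic frame, and the cutoff radius are all hypothetical choices made for illustration.

```python
import cmath
import math


def dft2(image):
    """Naive 2D discrete Fourier transform of a small grayscale image (list of rows)."""
    h, w = len(image), len(image[0])
    out = [[0j] * w for _ in range(h)]
    for u in range(h):
        for v in range(w):
            total = 0j
            for y in range(h):
                for x in range(w):
                    angle = -2 * math.pi * (u * y / h + v * x / w)
                    total += image[y][x] * cmath.exp(1j * angle)
            out[u][v] = total
    return out


def idft2(spectrum):
    """Inverse of dft2; keeps only the real part of each reconstructed pixel."""
    h, w = len(spectrum), len(spectrum[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total = 0j
            for u in range(h):
                for v in range(w):
                    angle = 2 * math.pi * (u * y / h + v * x / w)
                    total += spectrum[u][v] * cmath.exp(1j * angle)
            out[y][x] = (total / (h * w)).real
    return out


def low_pass(image, radius):
    """Zero every frequency whose wrapped distance from the DC term exceeds `radius`."""
    h, w = len(image), len(image[0])
    spectrum = dft2(image)
    for u in range(h):
        for v in range(w):
            du = min(u, h - u)  # wrapped vertical frequency index
            dv = min(v, w - v)  # wrapped horizontal frequency index
            if math.hypot(du, dv) > radius:
                spectrum[u][v] = 0j
    return idft2(spectrum)


# Synthetic dark frame (assumed data): a dim, smooth horizontal intensity wave.
size = 8
clean = [[0.1 + 0.05 * math.cos(2 * math.pi * x / size) for x in range(size)]
         for y in range(size)]

# Checkerboard perturbation standing in for high-frequency sensor noise.
noisy = [[clean[y][x] + 0.03 * (-1) ** (x + y) for x in range(size)]
         for y in range(size)]

# The smooth wave lives at frequency index 1; the checkerboard lives at (4, 4).
restored = low_pass(noisy, radius=2)
```

In this toy case the signal and the noise sit in disjoint frequency bands, so the filter separates them cleanly. Real low-light footage has overlapping spectra, which is one reason a learned module would be preferred over a fixed cutoff.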


Data availability

The code is available at https://github.com/lwxfight/TEST.

Notes

https://github.com/Sense-X/UniFormer.

Acknowledgements

This work was supported in part by the National Natural Science Foundation of China under Grants 62271361 and 52271366, and the Fundamental Research Funds for the Central Universities under Grant WHUTIOT2023-002. The numerical calculations in this paper have been done on the supercomputing system in the Supercomputing Center of Wuhan University.

Author information

Authors and Affiliations

Hubei Key Lab of Transportation Internet of Things, School of Computer Science and Artificial Intelligence, Wuhan University of Technology, Wuhan, 430070, China
Wenxuan Liu, Luo Zhong & Xian Zhong
School of Computer Science, Wuhan University, Wuhan, 430072, China
Xuemei Jia & Shifeng Wu
Wuhan Traffic Management Bureau, Wuhan, 430024, China
Yihao Ju
Rapid-Rich Object Search Lab, School of Electrical and Electronic Engineering, Nanyang Technological University, 639798, Singapore, Singapore
Yakun Ju & Xian Zhong
School of Computer Science and Technology, Harbin Institute of Technology, Harbin, 150001, China
Kui Jiang

Contributions

Wenxuan Liu: Conceptualization, Methodology, Visualization, Writing - Original Draft. Xuemei Jia: Code, Validation of the model, Writing - Review & Editing. Yihao Ju: Resources, Sorting out references, Writing - Review & Editing. Yakun Ju: Software, Writing - Review & Editing. Kui Jiang: Data management, Interpretation, Funding acquisition. Shifeng Wu: Data Curation, Preparation of the Real Scene Test. Luo Zhong: Revision of Intellectual Content. Xian Zhong: Highlighting Contributions, Funding acquisition, Writing - Review & Editing.

Corresponding authors

Correspondence to Xuemei Jia or Xian Zhong.

Ethics declarations

Conflict of interest

The authors declare that there is no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Liu, W., Jia, X., Ju, Y. et al. Fragrant: frequency-auxiliary guided relational attention network for low-light action recognition. Vis Comput (2024). https://doi.org/10.1007/s00371-024-03427-x

Accepted: 22 April 2024
Published: 14 May 2024
DOI: https://doi.org/10.1007/s00371-024-03427-x
