
Fragrant: frequency-auxiliary guided relational attention network for low-light action recognition

The Visual Computer (2024)

Abstract

Video action recognition aims to classify actions within sequences of video frames and has important applications across computer vision. Existing methods perform well in well-lit environments but suffer a marked performance drop under low-light conditions, because extracting relevant information from dark, noisy frames is difficult. Furthermore, simply prepending an enhancement network as a preprocessing step increases both the parameter count and the computational burden of video processing. To address this dilemma, this paper presents a novel frequency-based method, the FRequency-Auxiliary Guided Relational Attention NeTwork (FRAGRANT), designed specifically for low-light action recognition. Its distinctive features can be summarized as: (1) a novel Frequency-Auxiliary Module that focuses on informative object regions, characterizing action and motion while effectively suppressing noise; (2) a sophisticated Relational Attention Module that enhances motion representation by modeling the local relations between neighboring positions, thereby resolving issues such as fuzzy boundaries more effectively. Comprehensive experiments demonstrate that FRAGRANT outperforms existing methods, achieving state-of-the-art results on various standard low-light action recognition benchmarks.
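To make the frequency-auxiliary idea concrete, the sketch below shows one plausible way such a branch could act on per-frame features: a 2D FFT separates low-frequency content (illumination and coarse structure) from high-frequency content (edges and fine motion cues), the low-frequency part gates the spatial features, and the high-frequency residual is added back as an auxiliary signal. This is a minimal illustration under stated assumptions; the class name `FrequencyAuxiliarySketch`, the `cutoff` parameter, and the gating design are hypothetical and are not the authors' implementation (the actual code is linked under Data availability).

```python
# Illustrative sketch (not the authors' implementation): frequency-domain
# decomposition of a feature map, with the low-frequency part used as a gate
# and the high-frequency part kept as an auxiliary cue.
import torch
import torch.nn as nn
import torch.fft


class FrequencyAuxiliarySketch(nn.Module):
    def __init__(self, channels: int, cutoff: float = 0.25):
        super().__init__()
        self.cutoff = cutoff  # assumed fraction of the centered spectrum treated as "low frequency"
        self.gate = nn.Sequential(  # 1x1 conv producing a per-pixel gate from the low-frequency part
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def _split(self, x: torch.Tensor):
        # x: (B, C, H, W) -> low- and high-frequency reconstructions of the same shape
        freq = torch.fft.fftshift(torch.fft.fft2(x, norm="ortho"), dim=(-2, -1))
        _, _, h, w = x.shape
        yy, xx = torch.meshgrid(
            torch.linspace(-1, 1, h, device=x.device),
            torch.linspace(-1, 1, w, device=x.device),
            indexing="ij",
        )
        low_mask = ((yy.abs() <= self.cutoff) & (xx.abs() <= self.cutoff)).float()
        low = torch.fft.ifft2(torch.fft.ifftshift(freq * low_mask, dim=(-2, -1)), norm="ortho").real
        high = x - low
        return low, high

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        low, high = self._split(x)
        # Gate the original features with the smooth low-frequency content and
        # reinject the high-frequency residual as an auxiliary edge/motion cue.
        return x * self.gate(low) + high


if __name__ == "__main__":
    feats = torch.randn(2, 64, 56, 56)  # toy per-frame feature map
    out = FrequencyAuxiliarySketch(64)(feats)
    print(out.shape)  # torch.Size([2, 64, 56, 56])
```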


Data availability

The code is available at https://github.com/lwxfight/TEST.

Notes

  1. https://github.com/Sense-X/UniFormer.


Acknowledgements

This work was supported in part by the National Natural Science Foundation of China under Grants 62271361 and 52271366, and the Fundamental Research Funds for the Central Universities under Grant WHUTIOT2023-002. The numerical calculations in this paper were performed on the supercomputing system at the Supercomputing Center of Wuhan University.

Author information


Contributions

Wenxuan Liu: Conceptualization, Methodology, Visualization, Writing - Original Draft. Xuemei Jia: Code, Validation of the model, Writing - Review & Editing. Yihao Ju: Resources, Sorting out references, Writing - Review & Editing. Yakun Ju: Software, Writing - Review & Editing. Kui Jiang: Data management, Interpretation, Funding acquisition. Shifeng Wu: Data Curation, Preparation of the real-scene test. Luo Zhong: Revision of intellectual content. Xian Zhong: Highlighting contributions, Funding acquisition, Writing - Review & Editing.

Corresponding authors

Correspondence to Xuemei Jia or Xian Zhong.

Ethics declarations

Conflict of interest

The authors declare that there is no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Liu, W., Jia, X., Ju, Y. et al. Fragrant: frequency-auxiliary guided relational attention network for low-light action recognition. Vis Comput (2024). https://doi.org/10.1007/s00371-024-03427-x


  • DOI: https://doi.org/10.1007/s00371-024-03427-x
