
Leveraging cross-resolution attention for effective extreme low-resolution video action recognition

  • Original Paper
  • Published in Signal, Image and Video Processing

Abstract

Recognizing human actions in extremely low-resolution (eLR) videos poses a formidable challenge in the action recognition domain due to the limited spatial and temporal information available in eLR frames. In this work, we propose a novel architecture that recognizes human actions in an eLR setting. The proposed approach and its variants utilize an expanded knowledge distillation scheme that provides the essential flow of information from high-resolution (HR) frames to eLR frames. To further improve generalization, we integrate cross-resolution attention modules that operate without HR information at inference time. Additionally, we investigate the impact of an eLR data preprocessing pipeline that leverages a super-resolution algorithm and experimentally show the efficacy of the proposed models in the eLR space. Our experiments highlight the importance of studying eLR human action recognition and demonstrate that the proposed methods can surpass or compete with current state-of-the-art methods, achieving effective generalization on both the UCF-101 and HMDB-51 datasets.
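The abstract outlines three ingredients: HR-to-eLR knowledge distillation, attention modules on the eLR branch that need no HR input at inference, and a super-resolution preprocessing step. The following is a minimal PyTorch-style sketch of the first two ideas only, written under stated assumptions; the module names, tensor shapes, and loss weights (temperature, alpha, beta) are hypothetical illustrations, not the authors' implementation.

```python
# Sketch (not the paper's code): an HR teacher guides an eLR student through
# soft-label distillation and feature matching, while a spatial self-attention
# block refines eLR features and requires no HR frames at inference time.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialSelfAttention(nn.Module):
    """Self-attention over eLR feature maps (assumes channels >= 8)."""
    def __init__(self, channels):
        super().__init__()
        self.query = nn.Conv2d(channels, channels // 8, 1)
        self.key = nn.Conv2d(channels, channels // 8, 1)
        self.value = nn.Conv2d(channels, channels, 1)
        self.gamma = nn.Parameter(torch.zeros(1))  # residual gate, starts at 0

    def forward(self, x):                                # x: (B, C, H, W)
        b, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)     # (B, HW, C//8)
        k = self.key(x).flatten(2)                       # (B, C//8, HW)
        attn = torch.softmax(q @ k / (k.shape[1] ** 0.5), dim=-1)  # (B, HW, HW)
        v = self.value(x).flatten(2).transpose(1, 2)     # (B, HW, C)
        out = (attn @ v).transpose(1, 2).reshape(b, c, h, w)
        return x + self.gamma * out                      # residual connection

def distillation_loss(student_feat, teacher_feat, student_logits,
                      teacher_logits, labels, temperature=4.0,
                      alpha=0.5, beta=0.5):
    """Cross-entropy + soft-label KD (Hinton et al.) + feature matching."""
    ce = F.cross_entropy(student_logits, labels)
    kd = F.kl_div(F.log_softmax(student_logits / temperature, dim=1),
                  F.softmax(teacher_logits / temperature, dim=1),
                  reduction="batchmean") * temperature ** 2
    feat = F.mse_loss(student_feat, teacher_feat)
    return ce + alpha * kd + beta * feat
```

In this sketch, the teacher network would see HR frames only during training, while the student sees eLR frames (e.g., 12x16 crops); at test time only the student and its attention block run, which is what allows inference without HR information.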



Data availability

The data used to support the findings of this study are available from the corresponding author upon request.


Funding

This declaration is not applicable.

Author information

Authors and Affiliations

Authors

Contributions

All of the authors contributed equally to this work and reviewed the manuscript.

Corresponding author

Correspondence to Oguzhan Oguz.

Ethics declarations

Conflict of interest

The authors declare no conflict of interest.

Ethical approval

This declaration is not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Oguz, O., Ikizler-Cinbis, N. Leveraging cross-resolution attention for effective extreme low-resolution video action recognition. SIViP 18, 399–406 (2024). https://doi.org/10.1007/s11760-023-02766-x

