
AP-TransNet: a polarized transformer based aerial human action recognition framework

  • RESEARCH
  • Published:
Machine Vision and Applications

Abstract

Drones are widespread and actively employed in a variety of applications due to their low cost and quick mobility, and they enable new forms of action surveillance. However, human action recognition in aerial videos is especially challenging owing to the limited number of aerial-view samples and because aerial footage suffers from camera motion, illumination changes, small actor size, occlusion, complex backgrounds, and varying view angles. To address these challenges, we propose the Aerial Polarized-Transformer Network (AP-TransNet), which recognizes human actions in aerial views using both the spatial and temporal details of the video feed. In this paper, we present the Polarized Encoding Block, which performs (i) selection with rejection, to select the significant features and reject the least informative ones, similar to the light photometry phenomenon, and (ii) a boosting operation that increases the dynamic range of the encodings using non-linear softmax normalization at the bottleneck tensors in both the channel and spatial sequential branches. The performance of the proposed AP-TransNet is evaluated through extensive experiments on three publicly available benchmark datasets: the Drone Action dataset, the UCF-ARG dataset, and the Multi-View Outdoor Dataset (MOD20), supported by an ablation study. The proposed work outperforms the state of the art.
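To make the Polarized Encoding Block more concrete, the sketch below shows how a polarized attention block of this kind can be assembled, following the Polarized Self-Attention design of Liu et al. (arXiv:2107.00782) that the paper builds on: a channel branch and a spatial branch each compress the input at a bottleneck (rejection), apply a softmax there to boost the dynamic range of the resulting weights, and re-weight the features (selection). This is a minimal illustrative sketch, not the authors' implementation; the module name, the C/2 bottleneck ratio, and the channel-then-spatial ordering are assumptions.

```python
# Illustrative sketch (assumed details, not the authors' released code) of a
# polarized attention block in the spirit of AP-TransNet's Polarized
# Encoding Block, after Liu et al.'s Polarized Self-Attention.
import torch
import torch.nn as nn

class PolarizedEncodingBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        mid = channels // 2  # bottleneck keeps half the channels ("rejection")
        # Channel-only branch projections
        self.ch_wv = nn.Conv2d(channels, mid, kernel_size=1)
        self.ch_wq = nn.Conv2d(channels, 1, kernel_size=1)
        self.ch_wz = nn.Conv2d(mid, channels, kernel_size=1)
        self.ln = nn.LayerNorm(channels)
        # Spatial-only branch projections
        self.sp_wv = nn.Conv2d(channels, mid, kernel_size=1)
        self.sp_wq = nn.Conv2d(channels, mid, kernel_size=1)
        self.softmax = nn.Softmax(dim=-1)
        self.sigmoid = nn.Sigmoid()

    def channel_branch(self, x):
        b, c, h, w = x.size()
        v = self.ch_wv(x).reshape(b, c // 2, h * w)           # (B, C/2, HW)
        # Softmax over all spatial positions boosts the dynamic range of the
        # bottleneck query (the "boosting" operation in the paper's terms).
        q = self.softmax(self.ch_wq(x).reshape(b, 1, h * w))  # (B, 1, HW)
        z = torch.matmul(v, q.transpose(1, 2)).unsqueeze(-1)  # (B, C/2, 1, 1)
        z = self.ch_wz(z).reshape(b, c)                       # back to C channels
        attn = self.sigmoid(self.ln(z)).reshape(b, c, 1, 1)   # channel weights
        return x * attn                                       # select/suppress channels

    def spatial_branch(self, x):
        b, c, h, w = x.size()
        v = self.sp_wv(x).reshape(b, c // 2, h * w)           # (B, C/2, HW)
        # Global-pool the query, then softmax over the channel bottleneck.
        q = self.sp_wq(x).mean(dim=(2, 3))                    # (B, C/2)
        q = self.softmax(q).unsqueeze(1)                      # (B, 1, C/2)
        attn = self.sigmoid(torch.matmul(q, v))               # (B, 1, HW)
        return x * attn.reshape(b, 1, h, w)                   # select/suppress positions

    def forward(self, x):
        # Sequential composition: channel attention first, then spatial.
        return self.spatial_branch(self.channel_branch(x))

# Example: refine a per-frame feature map from a video backbone.
feats = torch.randn(2, 64, 28, 28)
print(PolarizedEncodingBlock(64)(feats).shape)  # torch.Size([2, 64, 28, 28])
```

In an aerial-recognition pipeline like the one the abstract describes, such a block would typically refine per-frame backbone features before the transformer's temporal modelling, suppressing background clutter around the small actor regions common in aerial footage.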



Author information

Authors and Affiliations

Authors

Contributions

Chhavi Dhiman: study conception and design, investigation, review and editing, proofreading, revision. Anunay Varshney and Ved Vyapak: software, data collection, analysis, draft manuscript writing, proofreading.

Corresponding author

Correspondence to Chhavi Dhiman.

Ethics declarations

Conflict of interest

The authors declare they have no financial interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Dhiman, C., Varshney, A. & Vyapak, V. AP-TransNet: a polarized transformer based aerial human action recognition framework. Machine Vision and Applications 35, 52 (2024). https://doi.org/10.1007/s00138-024-01535-1

