Abstract
Drones are widely deployed across a variety of applications owing to their low cost and high mobility, enabling new forms of aerial surveillance. However, human action recognition in aerial videos remains especially challenging due to the limited number of aerial-view samples, camera motion, illumination changes, small actor size, occlusion, complex backgrounds, and varying view angles. To address these challenges, we propose the Aerial Polarized-Transformer Network (AP-TransNet), which recognizes human actions in aerial views using both the spatial and temporal details of the video feed. In this paper, we present the Polarized Encoding Block, which performs (i) selection with rejection, retaining the most significant features and discarding the least informative ones, analogous to the polarization of light in photometry, and (ii) a boosting operation that increases the dynamic range of the encodings through non-linear softmax normalization at the bottleneck tensors of both the channel and spatial sequential branches. The performance of the proposed AP-TransNet is evaluated through extensive experiments on three publicly available benchmark datasets: the Drone Action dataset, the UCF-ARG dataset, and the Multi-View Outdoor Dataset (MOD20), supported by an ablation study. The proposed work outperforms the state of the art.
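The channel and spatial branches with softmax boosting at the bottleneck can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the paper's block uses learned 1x1 convolutions and normalization layers, whereas here the projections `wq`/`wv` are plain matrices and the channel re-expansion is a simple repeat, all illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Non-linear softmax normalization: boosts the dynamic range of the
    # bottleneck tensor by sharpening large responses (selection) and
    # suppressing small ones (rejection).
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_branch(x, wq, wv):
    # x: (C, H, W). Query is squeezed to one channel, softmax-normalized
    # over all H*W positions, then used to pool the value bottleneck.
    C, H, W = x.shape
    flat = x.reshape(C, H * W)
    q = softmax((wq @ flat).ravel())          # (HW,)  spatial softmax
    v = wv @ flat                             # (C/2, HW) value bottleneck
    z = v @ q                                 # (C/2,) pooled descriptor
    attn = sigmoid(np.repeat(z, 2)[:C])       # (C,)  per-channel gate
    return x * attn[:, None, None]

def spatial_branch(x, wq, wv):
    # Softmax over the channel bottleneck, then attention over positions.
    C, H, W = x.shape
    flat = x.reshape(C, H * W)
    q = softmax((wq @ flat).mean(axis=1))     # (C/2,) channel softmax
    v = wv @ flat                             # (C/2, HW)
    attn = sigmoid(q @ v).reshape(H, W)       # (H, W) per-position gate
    return x * attn[None, :, :]
```

Applied sequentially, the two branches re-weight a feature map without changing its shape, which is what lets the block drop into an encoder as an attention stage.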
Author information
Contributions
Chhavi Dhiman: study conception and design, investigation, review and editing, proofreading, revision. Anunay Varshney and Ved Vyapak: software, data collection, analysis, draft manuscript writing, proofreading.
Ethics declarations
Conflict of interest
The authors declare they have no financial interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Dhiman, C., Varshney, A. & Vyapak, V. AP-TransNet: a polarized transformer based aerial human action recognition framework. Machine Vision and Applications 35, 52 (2024). https://doi.org/10.1007/s00138-024-01535-1