Abstract
Drones are widely deployed across a variety of applications owing to their low cost and high mobility, enabling new forms of aerial surveillance. However, human action recognition in aerial videos remains especially challenging due to the limited number of aerial-view samples, camera motion, illumination changes, small actor size, occlusion, complex backgrounds, and varying view angles. To address these challenges, we propose the Aerial Polarized-Transformer Network (AP-TransNet), which recognizes human actions in aerial views using both the spatial and temporal details of the video feed. In this paper, we present the Polarized Encoding Block, which performs (i) selection with rejection, retaining the most significant features and discarding the least informative ones, analogous to the polarization of light in photometry, and (ii) a boosting operation that increases the dynamic range of the encodings through non-linear softmax normalization at the bottleneck tensors of both the channel and spatial sequential branches. The performance of the proposed AP-TransNet is evaluated through extensive experiments on three publicly available benchmark datasets: the Drone Action dataset, the UCF-ARG dataset, and the Multi-View Outdoor Dataset (MOD20), supported by an ablation study. The proposed work outperforms the state of the art.
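The channel and spatial branches with softmax boosting at the bottleneck can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the paper's block uses learned 1x1 convolutions and normalization layers, whereas here the projections `wq`/`wv` are plain matrices and the channel re-expansion is a simple repeat, all illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Non-linear softmax normalization: boosts the dynamic range of the
    # bottleneck tensor by sharpening large responses (selection) and
    # suppressing small ones (rejection).
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_branch(x, wq, wv):
    # x: (C, H, W). Query is squeezed to one channel, softmax-normalized
    # over all H*W positions, then used to pool the value bottleneck.
    C, H, W = x.shape
    flat = x.reshape(C, H * W)
    q = softmax((wq @ flat).ravel())          # (HW,)  spatial softmax
    v = wv @ flat                             # (C/2, HW) value bottleneck
    z = v @ q                                 # (C/2,) pooled descriptor
    attn = sigmoid(np.repeat(z, 2)[:C])       # (C,)  per-channel gate
    return x * attn[:, None, None]

def spatial_branch(x, wq, wv):
    # Softmax over the channel bottleneck, then attention over positions.
    C, H, W = x.shape
    flat = x.reshape(C, H * W)
    q = softmax((wq @ flat).mean(axis=1))     # (C/2,) channel softmax
    v = wv @ flat                             # (C/2, HW)
    attn = sigmoid(q @ v).reshape(H, W)       # (H, W) per-position gate
    return x * attn[None, :, :]
```

Applied sequentially, the two branches re-weight a feature map without changing its shape, which is what lets the block drop into an encoder as an attention stage.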
Author information
Contributions
Chhavi Dhiman: study conception and design, investigation, review and editing, proofreading, revision. Anunay Varshney and Ved Vyapak: software, data collection, analysis, draft manuscript writing, proofreading.
Ethics declarations
Conflict of interest
The authors declare they have no financial interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Dhiman, C., Varshney, A. & Vyapak, V. AP-TransNet: a polarized transformer based aerial human action recognition framework. Machine Vision and Applications 35, 52 (2024). https://doi.org/10.1007/s00138-024-01535-1