Abstract
Detecting objects in Wide Area Motion Imagery (WAMI), an essential task for many practical applications, is particularly challenging in crowded scenes such as areas with heavy traffic, because the pixel resolution of objects and the ground sampling distance are severely compromised and several factors disrupt the visual signal. To address this challenge, we design a framework that combines preprocessing operations with deep detectors. To train deep networks for detection in WAMI with improved performance, especially in crowded areas, we propose a novel crowd-aware thresholded loss (CATLoss) function. Moreover, we introduce a hard sample mining method to strengthen the discriminative ability of the proposed solution. Additionally, we extend networks used in prior literature with novel spatio-temporal cascaded architectures that incorporate more contextual information without introducing additional parameters. Overall, our approach is causal, generalizes better, and is more robust even at reduced spatial sizes. On the WPAFB-2009 dataset, we show that our solution performs better than or on par with the state of the art without introducing any additional computational complexity during inference. The code and trained models will be released at https://github.com/poyrazhatipoglu/CATLoss.
Availability of Data and Material
The data sources utilized in this study are already publicly available through their respective providers.
Code Availability
The code and trained models will be released at https://github.com/poyrazhatipoglu/CATLoss.
References
Aeschliman C, Park J, Kak AC (2014) Tracking vehicles through shadows and occlusions in wide-area aerial video. IEEE Trans Aerosp Electron Syst 50(1):429–444
AFRL U (2009) Wright-Patterson Air Force Base (WPAFB) dataset
Agarwal S, Furukawa Y, Snavely N, Simon I, Curless B, Seitz SM, Szeliski R (2011) Building Rome in a day. Commun ACM 54(10):105–112
Alcantarilla PF, Solutions T (2011) Fast explicit diffusion for accelerated features in nonlinear scale spaces. IEEE Trans Patt Anal Mach Intell 34(7):1281–1298
Andrew AM (2001) Multiple view geometry in computer vision. Kybernetes
Ao W, Fu Y, Hou X, Xu F (2019) Needles in a haystack: tracking city-scale moving vehicles from continuously moving satellite. IEEE Trans Image Process 29:1944–1957
Basharat A, Turek M, Xu Y, Atkins C, Stoup D, Fieldhouse K, Tunison P, Hoogs A (2014) Real-time multi-target tracking at 210 megapixels/second in wide area motion imagery. In: IEEE winter conference on applications of computer vision, IEEE, pp 839–846
Biewald L (2020) Experiment tracking with Weights and Biases. Software available from wandb.com
Brown LG (1992) A survey of image registration techniques. ACM Comput Surv (CSUR) 24(4):325–376
Brutzer S, Höferlin B, Heidemann G (2011) Evaluation of background subtraction techniques for video surveillance. In: CVPR 2011. IEEE, pp 1937–1944
Bürkle A, Essendorfer B (2010) Maritime surveillance with integrated systems. In: 2010 International WaterSide Security Conference. IEEE, pp 1–8
Canepa A, Ragusa E, Zunino R, Gastaldo P (2021) T-RexNet: a hardware-aware neural network for real-time detection of small moving objects. Sensors 21(4):1252
Carion N, Massa F, Synnaeve G, Usunier N, Kirillov A, Zagoruyko S (2020) End-to-end object detection with transformers. In: European conference on computer vision. Springer, pp 213–229
Chang HC, Lai SH, Lu KR (2006) A robust real-time video stabilization algorithm. J Vis Commun Image Represent 17(3):659–673
Chen H, Zhang L, Ma J, Zhang J (2019) Target heat-map network: an end-to-end deep network for target detection in remote sensing images. Neurocomputing 331:375–387
Chu X, Yang W, Ouyang W, Ma C, Yuille AL, Wang X (2017) Multi-context attention for human pose estimation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1831–1840
Dawn S, Saxena V, Sharma B (2010) Remote sensing image registration techniques: a survey. In: International conference on image and signal processing. Springer, pp 103–112
Dewancker I, McCourt M, Clark S (2016) Bayesian optimization for machine learning: a practical guidebook. arXiv preprint arXiv:1612.04858
Doherty P, Rudol P (2007) A UAV search and rescue scenario with human body detection and geolocalization. In: Australasian joint conference on Artificial Intelligence. Springer, pp 1–13
Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S et al (2020) An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929
Du D, Qi Y, Yu H, Yang Y, Duan K, Li G, Zhang W, Huang Q, Tian Q (2018) The unmanned aerial vehicle benchmark: object detection and tracking. In: Proceedings of the European conference on computer vision (ECCV), pp 370–386
Feather R, Davis JW (2011) Activity analysis in wide-area aerial surveillance video. Tech. rep., Ohio State University Columbus United States
Fehlmann S, Pontecorvo C, Booth DM, Janney P, Christie R, Redding NJ, Royce M, Fiebig M (2014) Fusion of multiple sensor data to recognise moving objects in wide area motion imagery. In: 2014 international conference on digital image computing: techniques and applications (DICTA), pp 1–8. https://doi.org/10.1109/DICTA.2014.7008110
Force UA (2007) Columbus large image format dataset 2007
Goodfellow I, Bengio Y, Courville A (2016) Deep learning. MIT Press, Cambridge
Hartung C, Spraul R, Krüger W (2018) Improvement of persistent tracking in wide area motion imagery by CNN-based motion detections. In: Image and signal processing for remote sensing XXIV. SPIE, vol 10789, pp 249–258
Hatipoğlu P, Albayrak R, Alatan AA (2020) Object detection under moving cloud shadows in WAMI. ISPRS Ann Photogramm Remote Sens Spatial Inf Sci 2:837–844
He K, Zhang X, Ren S, Sun J (2015) Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In: Proceedings of the IEEE international conference on computer vision, pp 1026–1034
Ioffe S, Szegedy C (2015) Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: International conference on machine learning. PMLR, pp 448–456
Keck M, Galup L, Stauffer C (2013) Real-time tracking of low-resolution vehicles for wide-area persistent surveillance. In: 2013 IEEE workshop on applications of computer vision (WACV). IEEE, pp 441–448
Kent P, Maskell S, Payne O, Richardson S, Scarff L (2012) Robust background subtraction for automated detection and tracking of targets in wide area motion imagery. In: Optics and photonics for counterterrorism, crime fighting, and defence VIII, SPIE vol 8546, pp 208–219
Krausman JA, Miller DA (2015) The 12M™ tethered aerostat system: rapid tactical deployment for surveillance missions. In: 22nd AIAA lighter-than-air systems technology conference, p 3351
Krawczyk B (2016) Learning from imbalanced data: open challenges and future directions. Progr Artif Intell 5(4):221–232
LaLonde R, Zhang D, Shah M (2018) ClusterNet: Detecting small objects in large scenes by exploiting spatio-temporal information. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4003–4012
Lei X, Pan H, Huang X (2019) A dilated CNN model for image classification. IEEE Access 7:124087–124095
Li L, Jamieson K, DeSalvo G, Rostamizadeh A, Talwalkar A (2017) Hyperband: a novel bandit-based approach to hyperparameter optimization. J Mach Learn Res 18(1):6765–6816
Liang P, Ling H, Blasch E, Seetharaman G, Shen D, Chen G (2013) Vehicle detection in wide area aerial surveillance using temporal context. In: Proceedings of the 16th international conference on information fusion. IEEE, pp 181–188
Lin Y, Medioni G (2007) Map-enhanced UAV image sequence registration and synchronization of multiple image sequences. In: 2007 IEEE conference on computer vision and pattern recognition. IEEE, pp 1–7
Lin TY, Goyal P, Girshick R, He K, Dollár P (2017) Focal loss for dense object detection. In: Proceedings of the IEEE international conference on computer vision, pp 2980–2988
Loshchilov I, Hutter F (2018) Fixing weight decay regularization in Adam. https://openreview.net/forum?id=rk6qdGgCZ
Luo W, Li Y, Urtasun R, Zemel R (2016) Understanding the effective receptive field in deep convolutional neural networks. Adv Neural Inf Process Syst 29
Motorcu H, Ates HF, Ugurdag HF, Gunturk BK (2021) HM-Net: a regression network for object center detection and tracking on wide area motion imagery. IEEE Access 10:1346–1359
Nair V, Hinton GE (2010) Rectified linear units improve restricted Boltzmann machines. In: ICML
Negin F, Tabejamaat M, Fraisse R, Bremond F (2022) Transforming temporal embeddings to keypoint heatmaps for detection of tiny vehicles in wide area motion imagery (WAMI) sequences. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 1432–1441
Ozyurt EO, Gunsel B (2018) WAMI object tracking using L1 tracker integrated with a deep detector. In: 2018 25th IEEE international conference on image processing (ICIP). IEEE, pp 2690–2694
Palaniappan K, Rao RM, Seetharaman G (2011) Wide-area persistent airborne video: architecture and challenges. In: Distributed video sensor networks. Springer, pp 349–371
Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, Killeen T, Lin Z, Gimelshein N, Antiga L et al (2019) PyTorch: an imperative style, high-performance deep learning library. Adv Neural Inf Process Syst 32
Perera AA, Srinivas C, Hoogs A, Brooksby G, Hu W (2006) Multi-object tracking through simultaneous long occlusions and split-merge conditions. In: 2006 IEEE computer society conference on computer vision and pattern recognition (CVPR’06). IEEE, vol 1, pp 666–673
Pfister T, Charles J, Zisserman A (2015) Flowing ConvNets for human pose estimation in videos. In: Proceedings of the IEEE international conference on computer vision, pp 1913–1921
Pflugfelder R, Weissenfeld A, Wagner J (2020) On learning vehicle detection in satellite video. arXiv preprint arXiv:2001.10900
Pi Y, Nath ND, Behzadan AH (2020) Convolutional neural networks for object detection in aerial imagery for disaster response and recovery. Adv Eng Inform 43:101009
Pollard T, Antone M (2012) Detecting and tracking all moving objects in wide-area aerial video. In: 2012 IEEE computer society conference on computer vision and pattern recognition workshops. IEEE, pp 15–22
Reilly V, Idrees H, Shah M (2010) Detection and tracking of large number of targets in wide area surveillance. In: European conference on computer vision. Springer, pp 186–199
Saleemi I, Shah M (2013) Multiframe many-many point correspondence for vehicle tracking in high density wide area aerial videos. Int J Comput Vis 104(2):198–219
Shi X, Ling H, Blasch E, Hu W (2012) Context-driven moving vehicle detection in wide area motion imagery. In: Proceedings of the 21st international conference on pattern recognition (ICPR2012). IEEE, pp 2512–2515
Smith LN, Topin N (2019) Super-convergence: very fast training of neural networks using large learning rates. In: Artificial intelligence and machine learning for multi-domain operations applications. SPIE, vol 11006, pp 369–386
Sodemann AA, Ross MP, Borghetti BJ (2012) A review of anomaly detection in automated surveillance. IEEE Trans Syst Man Cybern Part C (Applications and Reviews) 42(6):1257–1272
Sommer LW, Teutsch M, Schuchert T, Beyerer J (2016) A survey on moving object detection for wide area motion imagery. In: 2016 IEEE winter conference on applications of computer vision (WACV). IEEE, pp 1–9
Spruyt V, Ledda A, Philips W (2013) Sparse optical flow regularization for real-time visual tracking. In: 2013 IEEE international conference on multimedia and expo (ICME). IEEE, pp 1–6
Teutsch M, Grinberg M (2016) Robust detection of moving vehicles in wide area motion imagery. In: Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp 27–35
Van Brummelen G (2012) Heavenly mathematics: the forgotten art of spherical trigonometry. Princeton University Press, Princeton
Vella E, Azim A, Gaetjens HX, Repasky B, Payne T (2019) Improved detection for wami using background contextual information. In: 2019 digital image computing: techniques and applications (DICTA). IEEE, pp 1–9
Xiao J, Cheng H, Sawhney H, Han F (2010) Vehicle detection and tracking in wide field-of-view aerial video. In: 2010 IEEE computer society conference on computer vision and pattern recognition. IEEE, pp 679–684
Yong H, Huang J, Meng D, Hua X, Zhang L (2020) Momentum batch normalization for deep learning with small batch size. In: European conference on computer vision. Springer, pp 224–240
Zheng E, Wu C (2015) Structure from motion using structure-less resection. In: Proceedings of the IEEE international conference on computer vision, pp 2075–2083
Zhou Y, Maskell S (2019) Detecting and tracking small moving objects in wide area motion imagery (WAMI) using convolutional neural networks (CNNs). In: 2019 22nd international conference on information fusion (FUSION). IEEE, pp 1–8
Zivkovic Z (2004) Improved adaptive gaussian mixture model for background subtraction. In: Proceedings of the 17th international conference on pattern recognition, 2004. ICPR 2004. IEEE, vol 2, pp 28–31
Funding
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Author information
Contributions
Conceptualization: [PUH]; Methodology: [PUH]; Software: [PUH]; Validation: [PUH]; Investigation: [PUH]; Writing – Original Draft: [PUH]; Visualization: [PUH]; Supervision: [CI, SK]; Validation: [CI, SK]; Writing – Review and Editing: [CI, SK]
Ethics declarations
Conflict of Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Appendix
A.1 Visual Outputs
See Fig. 11.
A.2 Haversine Distance
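The distance traveled between two consecutive frames and the speeds of targets can be calculated using the latitude (\(\varphi\)) and longitude (\(\lambda\)) coordinates of the targets and the time elapsed between the two frames. To approximate the distance between two points (\(\varphi _1\), \(\lambda _1\)) and (\(\varphi _2\), \(\lambda _2\)) on the Earth’s surface, the Haversine distance (Van Brummelen 2012), formulated in Eq. (9), is used:

\[
d = 2r \arcsin \left( \sqrt{ \sin^2\left(\frac{\varphi_2 - \varphi_1}{2}\right) + \cos\varphi_1 \cos\varphi_2 \sin^2\left(\frac{\lambda_2 - \lambda_1}{2}\right) } \right) \tag{9}
\]

where \(r\) is the approximate radius of the Earth. The speed of a target then follows as \(d / \Delta t\), where \(\Delta t\) is the time elapsed between the two frames.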
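A minimal Python sketch of the Haversine distance and the per-frame speed computation described above follows. It assumes coordinates in decimal degrees, a mean Earth radius of 6,371 km (the text only says "approximate Earth's radius"), and speed computed as distance divided by the time elapsed between frames. The function names `haversine_distance` and `target_speed` are hypothetical and not taken from the authors' released code.

```python
import math

# Assumed mean Earth radius in metres; the paper only specifies an "approximate" radius.
EARTH_RADIUS_M = 6_371_000.0


def haversine_distance(lat1, lon1, lat2, lon2, r=EARTH_RADIUS_M):
    """Great-circle distance in metres between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    # Term under the square root in the Haversine formula
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    # min() guards against a slightly exceeding 1 due to floating-point round-off
    return 2 * r * math.asin(math.sqrt(min(1.0, a)))


def target_speed(p1, p2, dt_seconds, r=EARTH_RADIUS_M):
    """Average speed in m/s of a target moving from p1 to p2 (each (lat, lon) in degrees)
    over dt_seconds, the time elapsed between two consecutive frames."""
    if dt_seconds <= 0:
        raise ValueError("dt_seconds must be positive")
    return haversine_distance(p1[0], p1[1], p2[0], p2[1], r=r) / dt_seconds
```

For example, two detections 0.0001° of latitude apart are about 11.1 m apart, so with an illustrative 1 s between frames (not the dataset's documented frame interval) `target_speed` returns about 11.1 m/s. Converting to radians before applying the trigonometric functions is required because Python's `math` functions operate on radians.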
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Hatipoglu, P.U., Iyigun, C. & Kalkan, S. Crowd-aware Thresholded Loss for Object Detection in Wide Area Motion Imagery. PFG 91, 339–364 (2023). https://doi.org/10.1007/s41064-023-00253-z
DOI: https://doi.org/10.1007/s41064-023-00253-z