Efficient CNN-based low-resolution facial detection from UAVs

Face detection in UAV imagery requires high accuracy and low execution time for real-time, mission-critical operations in public safety, emergency management, disaster relief and other applications. This study presents UWS-YOLO, a new convolutional neural network (CNN)-based machine learning algorithm designed to address these demanding requirements. UWS-YOLO's key strengths lie in its exceptional speed, remarkable accuracy and ability to handle complex UAV operations, making it a balanced and portable solution for real-time face detection in UAV applications. Evaluation and comparison with state-of-the-art algorithms on standard and UAV-specific datasets demonstrate UWS-YOLO's superiority. It achieves an accuracy of 59.29%, compared with 27.43% for the state-of-the-art RetinaFace and 46.59% for YOLOv7. Additionally, UWS-YOLO operates at 11 milliseconds per frame, which is 345% faster than RetinaFace and 373% faster than YOLOv7.


Introduction
Detecting and tracking people on the ground from a drone or unmanned aerial vehicle (UAV) is critical for a myriad of applications such as remote monitoring and surveillance, search and rescue to find missing people [1,2], people counting in dense crowds, emergency and vigilance (e.g., the Drone Guard Angel in the EU H2020 project ARCADIAN-IoT) and public safety (e.g., the surveillance application in the EU H2020 project 5G-INDUCE) [3]. Face detection is an active research area in the field of computer vision aimed at identifying specific individuals. It is the first step in developing a robust facial recognition system, detecting and locating the human face in the captured image. Automatic face detection plays a vital role in face identification, facial expression recognition, head-pose estimation and human-computer interaction, among others [4]. Although much research has been conducted on facial detection, several issues remain to be addressed owing to the challenging operational conditions involved in facial detection from drones, including a high degree of variability, distance from the camera, face orientation, illumination, face occlusion, complex backgrounds and low resolution, to name a few. These challenges have a detrimental effect on the detection rate and the accuracy of the detection [4].
In addition, most research work focuses solely on detecting faces in the scene with high accuracy. Nevertheless, the complete use case, in which follow-up tasks are executed, is generally not considered by the research community. Figure 1 presents a typical flowchart for performing some of these follow-up tasks, such as face identification, expression recognition and head-pose estimation. These tasks are computationally expensive and thus have a high inference time. Therefore, the face detection task should be fast and lightweight, so that the other tasks can provide results close to real time. This research focuses on improving the inference time of the face detection stage.
Although a low execution time is mandatory for the success of the different use cases, developing a computationally inexpensive algorithm is also key to reducing the amount of GPU (Graphics Processing Unit) memory required. Current research should focus on lowering the number of operations per second so that other algorithms can run in parallel on the same GPU.
The aim of this study is to create a convolutional neural network (CNN) specifically designed for deployment with unmanned aerial systems (UAS). The CNN targets a trade-off among accuracy, speed and portability. The solution should achieve high accuracy, detecting small faces within large images even when a face spans only a few pixels. In addition, to overcome the highlighted challenges, the CNN is optimized for speed to achieve faster-than-real-time execution (more than 30 frames per second). Finally, the algorithm is power-efficient with low computational demands. As a result, our contributions are summarized as follows:

• An up-to-date literature review on face detection techniques and their limitations.

The organization of the paper is as follows: Sect. 2 reviews the related work. Section 3 describes the design of the proposed solution to perform facial detection, followed by the implementation setup in Sect. 4. Section 5 presents the results of the proposed solution. Finally, Sect. 6 concludes the paper.

Related work
This section explains the techniques used in this paper and reviews state-of-the-art work related to facial detection.

Machine learning techniques
The methods commonly used to detect faces in images or videos fall into two main categories: feature-based and image-based approaches. Feature-based approaches such as Viola-Jones [5] and Gabor features-based methods [6] focus on extracting the main facial features, which results in sub-optimal facial detection [4]. These methods are fast but fail when detections occur from different angles or under varying lighting conditions [7]. Image-based approaches, including deep convolutional neural networks, have driven face detection in recent years. CNN-based face detectors fall into two categories: region-based (two-stage) and single-stage methods. Fast R-CNN [8], Faster R-CNN [9] and R-FCN [10] are common region-based methods; they achieve high accuracy at the cost of speed. Single-stage methods, including the YOLO series [11][12][13] and SSD (single-shot detector) [14], achieve low inference times but lower accuracy compared to two-stage approaches. RetinaFace [15] using ResNet-152 as backbone achieves great accuracy but has a high inference time when processing HD (1920 × 1080) or 4K (4096 × 2160) images. The aforementioned single-stage CNNs may be configured to achieve higher accuracy for low-pixel-size object detection. Spatial Pyramid Pooling (SPP) and Path Aggregation Network (PAN) are two advanced techniques that preserve the small features extracted in the preliminary CNN layers to increase accuracy at almost no cost in execution speed.

Spatial pyramid pooling (SPP)
SPP [16] uses large max poolings to enlarge the receptive field and consequently increases the accuracy of the architecture. It is robust to object deformation and accumulates information at a deeper stage of the neural network. To increase the receptive field and obtain better accuracy, the max-pooling kernel sizes in the SPP module were set to 7 × 7, 10 × 10 and 13 × 13 in the backbone of the proposed architecture.
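As an illustration, the following is a minimal PyTorch sketch of such an SPP block, assuming stride-1 max poolings with kernel sizes of 7, 10 and 13 whose outputs are concatenated with the input; the class and parameter names are illustrative rather than the authors' implementation:

```python
import torch
import torch.nn as nn

class SPP(nn.Module):
    """SPP-style block: concatenate the input with stride-1 max poolings
    of several kernel sizes, preserving the spatial resolution."""
    def __init__(self, kernel_sizes=(7, 10, 13)):
        super().__init__()
        self.pools = nn.ModuleList()
        for k in kernel_sizes:
            if k % 2 == 1:
                # Odd kernels pad symmetrically to keep H x W unchanged.
                self.pools.append(nn.MaxPool2d(k, stride=1, padding=k // 2))
            else:
                # Even kernels (here 10) need asymmetric zero padding.
                self.pools.append(nn.Sequential(
                    nn.ZeroPad2d((k // 2 - 1, k // 2, k // 2 - 1, k // 2)),
                    nn.MaxPool2d(k, stride=1),
                ))

    def forward(self, x):
        # Output has (len(kernel_sizes) + 1) times the input channels.
        return torch.cat([x] + [pool(x) for pool in self.pools], dim=1)
```

Because each pooled map keeps the input resolution, the concatenation only multiplies the channel count, so the block enlarges the receptive field at almost no extra cost.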

Path aggregation network (PAN)
PAN is a method for improving the detection of small objects. To achieve this, features from the top layers, which carry more semantic information, are combined with the shallower features of earlier layers [17]. In this study, the PAN architecture was modified with an additional upsampling to retain more meaningful information, which is essential for detecting small objects.

Facial detection
This subsection compares state-of-the-art facial detection results with the proposed approach of this article. The results have been obtained from the papers of each algorithm. Table 1 shows the comparative analysis.
In one study, [18] proposed a face detection technique implemented on a Raspberry Pi. A Haar cascade classifier was implemented using OpenCV [19], with the DroneFace dataset [20] used as the facial image dataset. Although high accuracy was obtained (98%, 93%, 86% and 80% for distances of 1.5, 3, 4 and 5 m, respectively), the detections were not tested from altitudes above 5 m, and the speed of the model was not reported. In another study, [21] proposed an enhanced YOLOv3 algorithm for face detection, comparable with YOLOv3 and YOLOv4. A mAP of 51.9% was achieved on the WIDER FACE dataset [22], which is lower than our model's mAP of 72.26%. The speed of that model was not reported either.
In [7], a fast customized CNN suitable for UAV use cases was implemented. In [23] and [24], although high performance was achieved for face detection and identification, the models are not fast enough for our use case and were tested only up to a distance of 10 m. In [25], a MobileNet-SSD was implemented in TensorFlow for facial detection. Although very high accuracy was obtained (91.92%), the algorithm was not tested on a drone. In another study [26], YOLOv3 was implemented on a Raspberry Pi for facial detection; it is too slow for our use case (6.7 frames per second) and was tested at a distance of 3.8 m. In [27], the Local Binary Pattern Histogram (LBPH) was used as a face recognizer for anti-theft and surveillance applications. Although it obtained a high accuracy of 89.1%, the speed and the distance were not reported.
In [28], high accuracy is achieved on the WIDER FACE dataset, but the input size used to obtain it is not specified. Moreover, it achieves almost real-time inference speed, close to 30 frames per second (FPS), which is slower than our model (91 FPS). In [29] and [30], high accuracy is achieved on the WIDER FACE dataset, but the inference time is not reported. In [31], a modification of RetinaNet to improve accuracy and speed is presented; however, our model achieves better results in terms of both accuracy and inference time on the WIDER FACE dataset.
In summary, the current literature does not deeply consider the challenges involved in facial detection from UAVs, including altitude, illumination and face orientation among others, and the models used in these studies are computationally expensive. In contrast, our proposed solution reduces the required computational power, making it more suitable for portable, resource-constrained devices.

Design of the proposed algorithm
The proposed algorithm is designed to strike a balance between accuracy, speed and portability. UWS-YOLO integrates three key components to create an effective face detection solution. First, a robust backbone architecture achieves high accuracy, followed by an SPP module with modified kernel sizes of 7, 10 and 13 in the max poolings. Finally, an enhanced PAN increases the accuracy for small object detection. Figure 2 presents the architecture of the proposed solution, combining all the implemented techniques.

The backbone
The baseline of the proposed solution is the Tiny-YOLOv4 [32] backbone, whose main strength is its low execution time (4 ms). This backbone serves as the foundation of the UWS-YOLO algorithm; nevertheless, its main flaw is its low accuracy. Our solution includes Cross-Stage Partial Network (CSPBlock) modules, which replace the ResBlock modules in the residual network (marked as a). CSPBlocks [33] were chosen to reduce duplicated gradient information and enhance the learning ability of the convolutional network. The feature maps are divided into two branches to apply different transformations and are then concatenated back together. UWS-YOLO includes three CSPBlock modules with 64, 128 and 256 filters. The parameters of the CSPBlocks are adopted from [34]. Experiments with various filter numbers were carried out to choose the best configuration.
To implement this architecture, firstly, similar to YOLOv4, CSPBlock modules were used in the backbone of the proposed architecture. A cross-stage residual edge is used in the CSPBlock module to combine the two divided feature maps (groupid=1/2 in the figure). This enhances the learning ability of the convolutional network compared with the ResBlock module.
The first CSPBlock module, with a filter number of 64, is divided into two convolutions of 32 filters each and then combined with the 3 × 3 convolution with 32 filters at the beginning of the backbone. The result is passed to a 3 × 3 convolution with 64 filters and combined with the shortcut from the first convolution with 64 filters.
The second CSPBlock module, with a filter number of 128, is divided into two convolutions of 64 filters each and then combined with a 3 × 3 convolution of the backbone with 64 filters. The result is passed to a 3 × 3 convolution with 128 filters and combined with the shortcut from the first convolution with 128 filters.
The last module starts with a convolution of 256 filters and is divided into two convolutions of 128 filters each, which are then combined with a shortcut from the convolution with 128 filters. The result is passed to a 3 × 3 convolution with 64 filters and combined with the shortcut from the first convolution with 256 filters.
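As a schematic of this split-transform-merge pattern, the sketch below follows the Tiny-YOLOv4-style CSP block described above; the exact layer layout is an assumption based on that description and on [33], not the authors' code:

```python
import torch
import torch.nn as nn

def conv_bn_act(c_in, c_out, k=3):
    """Convolution + batch norm + leaky ReLU, the usual Darknet building block."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, padding=k // 2, bias=False),
        nn.BatchNorm2d(c_out),
        nn.LeakyReLU(0.1, inplace=True),
    )

class CSPBlock(nn.Module):
    """CSP-style block: half of the channels are transformed while the other
    half is carried over a cross-stage shortcut (groupid=1/2 in Fig. 2)."""
    def __init__(self, channels):
        super().__init__()
        half = channels // 2
        self.conv0 = conv_bn_act(channels, channels)   # 3x3 over all channels
        self.conv1 = conv_bn_act(half, half)           # branch convolutions
        self.conv2 = conv_bn_act(half, half)
        self.fuse = conv_bn_act(channels, channels, k=1)

    def forward(self, x):
        full = self.conv0(x)
        _, branch = full.chunk(2, dim=1)        # keep group 2 of 2 channel groups
        y1 = self.conv1(branch)
        y2 = self.conv2(y1)
        merged = self.fuse(torch.cat([y2, y1], dim=1))  # 1x1 merge of the branch
        return torch.cat([full, merged], dim=1)  # cross-stage concat: 2x channels
```

In this sketch, a block instantiated with channels=64 outputs 128 channels, which is consistent with the 64, 128 and 256 filter progression of the three modules.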
Secondly, to improve the receptive field of the backbone, an SPP module was modified by concatenating max-pooling outputs with kernel sizes of 7, 10 and 13, respectively. The concatenation of these max-pooling outputs improved the accuracy of the architecture.
Thirdly, to detect small faces from long distances, a modified PAN module was added to the architecture, with an additional bottom-up path augmentation on top of the FPN structure to aggregate features from low-level layers, which carry more detailed information, with higher-level layers, which carry more semantic information. An extra upsampling was added to the PAN architecture in comparison with YOLOv4 to keep shallower features, with three downsamplings (factors of 16, 8 and 4) in the corresponding bottom-up path. The concatenation of the features from the bottom-up path goes through a 1 × 1 convolution.

Enhanced spatial pyramid pooling (SPP)
The second component, an enhanced spatial pyramid pooling (SPP) [16], is included at the end of the backbone (marked as b). This technique provides the capability to handle different image input sizes and thus captures multi-scale information from different spatial resolutions, a key factor when detecting objects at low resolution in large images. By deploying this technique, the CNN becomes more robust when the UAV is flying at different altitudes, because the algorithm is less affected by variations in the size of the faces.

Enhanced path aggregation network (PAN)
A modified path aggregation network (PAN) [17] module was also added to the algorithm (marked as c), with three upsamplings (compared to two in YOLOv4). This technique enhances the capability of preserving the fine-grained features from the first layers throughout the computation of the following ones. This matters because the details extracted from low-pixel-size faces may otherwise be lost along the CNN.
The number of filters selected was 64, 128, 256, 128 and 64, respectively. The extra bottom-up path augmentation was down-sampled (three downsamplings, compared to two in YOLOv4) with factors of 16, 8 and 4, respectively, using 64 filters. The features from the bottom-up path were then concatenated, and a 1 × 1 convolution was run on the result. YOLOv3 headers were used to output the coordinates, probability and confidence level of each detection, operating on feature maps of 19 × 19, 38 × 38 and 76 × 76.
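For illustration, the sketch below shows the general PAN-style fusion pattern in PyTorch: a top-down path that upsamples deep features into shallower maps, followed by the bottom-up re-aggregation path that PAN adds on top of FPN. The number of levels and the channel counts are simplified placeholders, not the exact UWS-YOLO configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PANNeck(nn.Module):
    """Schematic PAN-style neck over three backbone levels (shallow -> deep)."""
    def __init__(self, channels=(64, 128, 256)):
        super().__init__()
        c1, c2, c3 = channels
        self.lat2 = nn.Conv2d(c3, c2, 1)                        # lateral 1x1 convs
        self.lat1 = nn.Conv2d(c2, c1, 1)
        self.up_merge2 = nn.Conv2d(2 * c2, c2, 1)               # top-down merges
        self.up_merge1 = nn.Conv2d(2 * c1, c1, 1)
        self.down2 = nn.Conv2d(c1, c2, 3, stride=2, padding=1)  # bottom-up path
        self.down_merge2 = nn.Conv2d(2 * c2, c2, 1)
        self.down3 = nn.Conv2d(c2, c3, 3, stride=2, padding=1)

    def forward(self, p1, p2, p3):
        # Top-down: upsample deep features and fuse them with shallower maps.
        t2 = self.up_merge2(torch.cat(
            [p2, F.interpolate(self.lat2(p3), scale_factor=2)], dim=1))
        t1 = self.up_merge1(torch.cat(
            [p1, F.interpolate(self.lat1(t2), scale_factor=2)], dim=1))
        # Bottom-up: re-aggregate so deeper outputs keep fine-grained detail.
        b2 = self.down_merge2(torch.cat([t2, self.down2(t1)], dim=1))
        b3 = self.down3(b2)
        return t1, b2, b3  # detection heads attach to these three maps
```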

Implementation and deployment
This section presents the implementation and deployment of UWS-YOLO for detecting faces in UAV images and videos. It describes how the proposed solution is trained and tested, in addition to the datasets used to evaluate the results. Further explanation is also given regarding the standard algorithms commonly employed for face detection against which our algorithm is compared.

Face detection datasets
Several algorithms are compared in terms of accuracy and speed. For a fair comparison, every algorithm must be trained and evaluated on the same dataset. In this manuscript, two datasets are used. First, a publicly available dataset named WIDER FACE, mainly composed of images taken at ground level. Secondly, a face detection dataset recorded from a UAV and created by the authors: the UAV-UWS dataset.

WIDER FACE
One of the largest datasets for face detection is the WIDER FACE dataset [22]. It contains 32,203 images and 393,703 faces, an average of 12 faces per image. The varied illumination, scales, poses and occlusions included in this dataset make achieving very high accuracy a challenging task. WIDER FACE is divided into three levels of difficulty, namely easy, medium and difficult. The accuracies reported in this research have been obtained by joining all three subsets together.
To allow other researchers to compare their algorithms against UWS-YOLO, this manuscript uses this dataset as the benchmark to compare our proposed solution with the state-of-the-art architectures; note, however, that UWS-YOLO is optimized for low-pixel face detection rather than close-range images.

UAV-UWS dataset
Within the scope of this research work, the authors also created a dataset to test the algorithms on videos captured from a drone. This dataset was recorded using a DJI Mini 2 UAV. The frames have a 4K resolution (3840 × 2160 pixels) and a frame rate of 30 fps.
In total, 20 people of different ethnicities were recorded for the dataset. For each person, a 30-second video was recorded at 8 different distances from the drone to the face (2, 5, 7, 10, 15, 20, 25 and 30 m). Therefore, the dataset comprises 144,000 frames. At each distance, the UAV was positioned 30° above the face. Furthermore, the people recorded were asked to perform different head movements to obtain a complete view of the whole face. Figure 3 shows an example of the head movements requested from every volunteer. Initially, participants were asked to execute a lateral head movement, looking first to the left and then to the right. Following this, a full circular head movement was made to cover the remaining positions, including upward and downward. Finally, the volunteers were asked to stare directly at the drone, looking forward.
Due to the similarity of the faces in consecutive frames, the test dataset was created by extracting only one frame per second. Therefore, the dataset used for testing comprises 4,800 images with a ratio of one face per image.
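A minimal OpenCV sketch of this subsampling step is shown below; the function name and the fixed 30 fps assumption are illustrative:

```python
import cv2

def extract_test_frames(video_path, fps=30):
    """Keep one frame per second from an fps-rate recording, as done to
    build the 4,800-image test split (function name is illustrative)."""
    cap = cv2.VideoCapture(video_path)
    frames, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % fps == 0:   # first frame of every second
            frames.append(frame)
        index += 1
    cap.release()
    return frames
```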
It is worth noting that GDPR (General Data Protection Regulation) compliance is a critical consideration in our data-gathering process, as we are committed to respecting individuals' privacy and adhering to the regulations.While the dataset may be relatively small in comparison with some other studies, it has been meticulously curated to cater specifically to the objectives of our research.We emphasize that the dataset's size is appropriate for the testing and validation of our proposed model and offers a representative sample for the intended use case.
Table 2 shows the average face size at each distance in our dataset for two image sizes: the original 4K resolution and after resizing to 608 × 608. As can be seen, at distances greater than 15 m, the size of the faces in the resized image is less than 10 pixels. The smallest face occurs at 30 m, where the average size is only 2 × 5 pixels in the resized image.
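For reference, resizing the 3840-pixel-wide 4K frame to 608 pixels scales each face by a factor of 608/3840 ≈ 0.16, so the 2 × 5 px face observed at 30 m in the resized image corresponds to roughly 13 × 32 px in the original frame.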

Face detection algorithms
UWS-YOLO is compared against RetinaFace [15] and YOLOv7 [35] in terms of accuracy, inference time, model loading time and model size. To compare all algorithms fairly, they were trained under the same conditions on the same dataset. First, every algorithm was trained on the general WIDER FACE dataset and then evaluated on its testing split. Finally, to compare the accuracy of the algorithms when detecting faces from UAV images at high altitudes, the UAV-UWS dataset was used as a testing dataset only; that is, it is used solely for verification of the algorithms. This process is useful for comparing how each algorithm, given the same training, behaves at different distances.

Hyperparameters
Table 3 shows the execution hyperparameters used to train UWS-YOLO, RetinaFace and YOLOv7.

RetinaFace-ResNet50
A momentum coefficient of 0.9 was used for training. The input size of the training images was 640 × 640 px, and the batch size was 8. The machine learning execution platform for RetinaFace was TensorFlow 2.5.3 [36].

UWS-YOLO
A momentum coefficient of 0.9 was used for training. An image input size of 608 × 608 px was used with a batch size of 64. The machine learning execution platform was Darknet [37].

YOLOv7
YOLOv7 was trained with a momentum coefficient of 0.9. Images are fed to YOLOv7 at 608 × 608 px, with a batch size of 8. The machine learning execution platform is Darknet [37].

Intersection over Union (IoU)
The Intersection over Union (IoU) is a fundamental metric in object detection. It measures the overlap between the predicted bounding box and the ground truth bounding box, providing a quantitative assessment of the accuracy of a detection. As shown in Fig. 4, the IoU is defined as the area of overlap (intersection) between the bounding boxes divided by the area of their union.
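A straightforward implementation of this metric in Python is sketched below, assuming boxes given as (x, y, w, h) with (x, y) the top-left corner, the same format the pipeline outputs:

```python
def iou(box_a, box_b):
    """Intersection over Union for two (x, y, w, h) boxes with (x, y)
    the top-left corner."""
    ax1, ay1, ax2, ay2 = box_a[0], box_a[1], box_a[0] + box_a[2], box_a[1] + box_a[3]
    bx1, by1, bx2, by2 = box_b[0], box_b[1], box_b[0] + box_b[2], box_b[1] + box_b[3]
    # Overlap rectangle; width/height clamp to 0 when the boxes are disjoint.
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0
```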
Face detection is only one stage in the pipeline of our use case; therefore, subsequent stages rely heavily on obtaining accurate face images to achieve high accuracy. For instance, in face verification tasks, precise face crops are crucial for successful results. Hence, an accurate bounding box around the face is of utmost importance, and this is where the IoU threshold becomes crucial.
Figure 5 illustrates three face detections at different IoU thresholds. The green bounding box represents the ground truth, while the red one corresponds to the face detected by the model. With an IoU of 90% (Fig. 5a), the detected face closely matches the ground truth. With an IoU of 75% (Fig. 5b), a small portion of the face is lost, but the results are still satisfactory for subsequent stages. However, with an IoU of 50% (Fig. 5c), a significant part of the face is lost, possibly even excluding an eye from the bounding box. This makes it more challenging for the next stages to achieve excellent results, although the impact is not severe. With an IoU below 50%, it becomes impossible to achieve good results in the next stages, as more than half of the face may be lost.
Therefore, in this study, the accuracy of the models will be compared using these three IoU values: 90%, 75% and 50%.
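For instance, using the iou helper sketched above, a prediction can be checked against the three thresholds as follows (the coordinate values are hypothetical):

```python
pred = (100, 120, 40, 48)   # predicted box: x, y, w, h (hypothetical values)
gt = (104, 118, 38, 50)     # ground truth box (hypothetical values)

# The IoU here is about 0.83, so the detection counts as a match
# at the 50% and 75% thresholds but not at the strict 90% one.
for threshold in (0.5, 0.75, 0.9):
    match = iou(pred, gt) >= threshold
    print(f"IoU threshold {threshold}: {'match' if match else 'miss'}")
```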

Pipeline
The pipeline is divided into three stages, as depicted in Fig. 6. First, the image is preprocessed; then it is fed into the neural network; finally, the results are post-processed to obtain the bounding boxes with confidence scores for each detection. The input to the pipeline is a single frame.

Preprocessing
The preprocessing stage consists of three steps aimed at preparing the captured frame for the neural network:

1. Resize: The image is resized to match the input size required by the neural network, using the OpenCV resize function.
2. Scaling: The pixel values in each channel are scaled to fit within the range expected by the neural network.
3. Channel conversion: The image is converted from the BGR (Blue-Green-Red) color space to RGB (Red-Green-Blue). Moreover, the channel dimensions are transposed from (N, H, W, C) to (N, C, H, W), where N is the number of images in a batch, C the number of channels, H the height and W the width of the image.
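These three steps can be sketched compactly in Python with OpenCV; the [0, 1] scaling range is an assumption, as the exact range depends on each network's training setup:

```python
import cv2
import numpy as np

def preprocess(frame_bgr, input_size=608):
    """Sketch of the three preprocessing steps for a single BGR frame."""
    # 1. Resize to the network input size using OpenCV.
    resized = cv2.resize(frame_bgr, (input_size, input_size))
    # 2. Scale pixel values into the range expected by the network
    #    (here assumed to be [0, 1]).
    scaled = resized.astype(np.float32) / 255.0
    # 3. Convert BGR -> RGB, transpose (H, W, C) -> (C, H, W),
    #    and add the batch dimension to obtain (N, C, H, W).
    rgb = cv2.cvtColor(scaled, cv2.COLOR_BGR2RGB)
    return np.expand_dims(rgb.transpose(2, 0, 1), axis=0)
```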

Convolutional neural network (CNN)
The preprocessed image is then passed through a CNN. In this research, three different CNN models were employed: RetinaFace, UWS-YOLO and YOLOv7. Each model produces detection candidates along with their corresponding confidence scores.

Detections
A detection threshold determines the acceptance of a detection. If the confidence score of a detection is higher than the threshold, it is considered a valid detection; otherwise, it is discarded. Lowering the threshold increases the number of true detections but also increases false detections and may therefore reduce accuracy. Choosing an appropriate threshold is crucial to achieving high accuracy with each model. Once the detections are confirmed, the bounding boxes are resized to match the dimensions of the original frame.
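The sketch below illustrates this step, assuming detections arrive as (x, y, w, h, score) tuples in network-input coordinates; this format is a placeholder, since each framework reports detections differently:

```python
def postprocess(detections, conf_threshold, orig_w, orig_h, input_size=608):
    """Keep detections above the confidence threshold and rescale their
    boxes from network-input coordinates back to the original frame."""
    sx, sy = orig_w / input_size, orig_h / input_size
    results = []
    for x, y, w, h, score in detections:
        if score >= conf_threshold:        # discard low-confidence candidates
            results.append((x * sx, y * sy, w * sx, h * sy, score))
    return results
```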
The output of the pipeline consists of the bounding boxes for each detection, defined by the coordinates of the top-left corner of the bounding box, as well as its width and height.Furthermore, the pipeline provides the confidence score for each detection.

Empirical results and discussion
This section presents both quantitative and qualitative results from various experiments conducted to facilitate a comprehensive comparison of the algorithms. The quantitative results subsection compares RetinaFace (with three different input sizes), YOLOv7 and UWS-YOLO on metrics such as inference time, build time, weights size and accuracy, with accuracy analyzed on two datasets: WIDER FACE and UAV-UWS. The qualitative results show what our algorithm (UWS-YOLO) can achieve on images at different distances.

Experimentation environment
The experiments were carried out on a computer with an Intel(R) Xeon(R) E5-2630 v4 at 2.20 GHz with 20 cores and 32 GB of RAM. In addition, an NVIDIA GeForce GTX TITAN X with 12 GB of onboard memory and CUDA support [38] was used. The operating system (OS) is Ubuntu 20.04.3 (Focal) with kernel version 5.11.0.

Accuracy
The accuracy of the models has been evaluated using three different approaches. Firstly, the WIDER FACE dataset was utilized, employing three IoU (Intersection over Union) thresholds: 0.5, 0.75 and 0.9. Subsequently, the models were tested on the UAV-UWS dataset employing the same IoUs. Finally, they were tested on the UAV-UWS dataset at every distance with a fixed IoU of 0.5. All the results were obtained from experiments conducted specifically for this study.
Table 4 shows the results for the WIDER FACE dataset. Among the models, YOLOv7 exhibits the lowest accuracy. UWS-YOLO demonstrates modest performance and matches the accuracy of RetinaFace only when RetinaFace uses an input size of 416 px and the most permissive IoU (50%). RetinaFace achieves the highest accuracy across all IoUs, particularly with an input size of 1600 px.
Table 5 presents the accuracy of the models on the UAV-UWS dataset. Our model (UWS-YOLO) demonstrates exceptional accuracy, second only to RetinaFace with an input size of 1600 px. Compared with models of the same input size, UWS-YOLO achieves +12.7% better accuracy than YOLOv7 and a +31.86% improvement over RetinaFace with an IoU of 50%. Even with a stricter IoU of 90%, UWS-YOLO outperforms YOLOv7 by +3.96% and RetinaFace by +2.33%.
Table 6 shows the accuracy of the models on the UAV-UWS dataset for each of the eight distances recorded: 2, 5, 7, 10, 15, 20, 25 and 30 m. At a distance of 2 m, all models (RetinaFace with an input size of 1600 px, YOLOv7 and UWS-YOLO) perform well, but UWS-YOLO is the only model that achieves good results at longer distances (more than 10 m), as indicated in the table. At 25 and 30 m, YOLOv7 struggles to detect any face, resulting in significantly low accuracy (0.15% and 0.01%, respectively). In contrast, UWS-YOLO achieves an accuracy of more than 8% at 25 m and 2.22% at 30 m, surpassing even RetinaFace with an input size of 1600 px by +1.83%.
These results demonstrate that our model UWS-YOLO performs well at both short and long distances, surpassing, at some distances, models with nearly three times larger input sizes. Furthermore, compared to models with the same input size, UWS-YOLO consistently accomplishes superior results across all distances.

Build time, weights size and inference time
Table 7 compares the build time, weights size and inference time for the three models, all with the same input size of 608 px. YOLOv7 exhibits the longest build time, approximately 4 s, and also has the largest weights size at 140 MB. RetinaFace has a build time of around 2.5 s with a weights size close to 130 MB. Notably, our proposed model, UWS-YOLO, demonstrates the fastest build time of only 1.5 s and stands out as the most lightweight model, with a weights size of only 49 MB.
Regarding inference time, a correlation with the weights size can be observed. Our proposed model achieves the fastest inference time, merely 11 ms, equivalent to 91 fps. In contrast, RetinaFace has an inference time of 38 ms (26 fps), while YOLOv7 is the slowest with 41 ms (24 fps). Therefore, our model is the only one capable of real-time processing, defined as a frame rate of 30 fps.
Our proposed model is +345% faster than RetinaFace while achieving better accuracy when evaluated on the UAV-UWS dataset. Furthermore, compared with YOLOv7, our model outperforms it significantly, being +373% faster and showing better accuracy across all tested datasets, including WIDER FACE. The YOLOv7 and RetinaFace models only achieve real-time processing with their smallest input size (416), while our model achieves it even with an input size of 608. Moreover, not only is our model faster than the others at these input sizes (416 and 608), but it also achieves higher accuracy. Using an input size of 416, our model is +371% faster than RetinaFace with an accuracy improvement of +18.31%, and +314% faster than YOLOv7 with +3.54% better accuracy.
Figure 7 plots accuracy against inference time on the UAV-UWS dataset for the three models at three input sizes (416, 608 and 1600 px); the leftmost area indicates the fastest models, the top area the most accurate, and any model to the left of the 33 ms (30 fps) vertical line executes real-time detections. Our model was trained with an input size of 608 px; therefore, the best results are obtained with this input size, as Fig. 7 reflects: our model is the upper leftmost of all. As mentioned in previous sections, our model with this input size is +345% faster than RetinaFace and +373% faster than YOLOv7, with +31.86% and +12.7% higher accuracy, respectively.
On the other side, with an input size of 1600 px, our model is still faster than the other models (+330% than RetinaFace and +365% than YOLOv7), but it has lower accuracy (−3.68% than RetinaFace and −8.07% than YOLOv7).
Although RetinaFace-1600px achieves the highest accuracy (69.16%), its inference time is also very high (208 ms). Moreover, UWS-YOLO does not show a great accuracy gain when the input size is increased to 1600 px (1.8% more), while the inference time increases significantly (by 46 ms), becoming slower than real time. This can be seen in Fig. 7. Therefore, given the trade-off between speed and accuracy and the need for real-time processing, our algorithm UWS-YOLO-608px achieves the best results.

Qualitative results
Figure 8 shows frames from our UAV-UWS dataset at four different distances and head positions. Each frame contains only one person and therefore one face. Figure 8a shows a volunteer at 2 m with the head facing down and to the right. The face was detected by our model with a confidence score of 58%. The low confidence score can be attributed to the downward-facing orientation, which obscures the facial appearance. This demonstrates the capability of our algorithm to detect faces when not all facial features are readily visible. Figure 8b shows the same volunteer in the same position but at a distance of 7 m. Our model detects the face with a confidence score of 61%; again, this score can be attributed to the downward-facing orientation of the face.
In Fig. 8c, the person is at 20 m and is looking directly at the drone. The face is detected with a confidence score of 81%. The model achieves a high confidence score even at far distances because the person is facing the drone and all the facial features can be appreciated. Finally, Fig. 8d shows the same person staring at the drone but at 30 m, the maximum distance in our dataset. Our model is capable of detecting the face, but only with a confidence score of 13%. It is worth remembering that at 30 m, the size of the face in the resized image is only about 2 × 5 pixels (see Table 2).

Conclusions
Face detection is a widely studied task, primarily focusing on achieving high accuracy. However, most existing solutions neglect the crucial aspect of low execution times, which are essential for real-time performance and subsequent face identification processes. Moreover, face detection from UAV imagery poses additional challenges due to factors such as face pose, angle variations, camera inclination and drone vibrations.
This manuscript introduces a novel CNN-based algorithm, UWS-YOLO, specifically designed to address these limitations. Our algorithm demonstrates high-accuracy face detection with minimal inference times on UAV videos. To evaluate the performance of UWS-YOLO, we conducted comprehensive training and evaluation experiments, comparing it against state-of-the-art algorithms, including RetinaFace. The evaluations were performed on both a standard dataset (WIDER FACE) and a dataset collected by the authors using a UAV. In this empirical evaluation, UWS-YOLO outperforms RetinaFace by a significant margin, achieving an accuracy of 59.29% compared to RetinaFace's 27.43%. Notably, UWS-YOLO surpasses RetinaFace's capabilities by successfully detecting faces even beyond a distance of 10 m. Furthermore, in terms of speed, UWS-YOLO exhibits exceptional efficiency, performing 345% faster than RetinaFace.
The key advantages of UWS-YOLO can be summarized as follows: superior speed, remarkable accuracy and the ability to handle the complexities of face detection in UAV operations.These findings position UWS-YOLO as a balanced and highly suitable algorithm for face detection in UAV applications.In summary, the presented research has created a cutting-edge solution that surpasses previous algorithms in terms of speed and accuracy, while also addressing the unique challenges associated with face detection from UAV imagery.

Fig. 1 Flowchart with examples of the use of face detection with follow-up tasks and their average inference times

Fig. 3 Examples of the different face positions recorded in the UAV-UWS dataset



Fig. 7 mAP accuracy with IoU 0.5 on the UAV-UWS dataset versus inference time in milliseconds, using three different input sizes: 416, 608 and 1600 px

Table 2 Distance from the camera to the face versus the size of the detected face in pixels (px) for the original resolution and the resized image

Table 4 Accuracy of UWS-YOLO and RetinaFace for the WIDER FACE dataset, varying the IoU percentage

Table 5 Accuracy of UWS-YOLO and RetinaFace for the UAV-UWS dataset, varying the IoU percentage

Table 7 Build time, weights size and inference time for the three compared models with an input size of 608 px