1 Introduction

A drone is an unmanned aerial vehicle that has been widely developed and utilized in many beneficial applications. Unfortunately, drones can also be used illegally and threaten global security. Several strategies have been considered to address the concerns raised by the illegal use of drones. Radars, electro-optical cameras, acoustic sensors, and Radio Frequency (RF) analysis have been introduced as drone detection modalities [1]. Visual band cameras are recommended for drone detection due to several advantages. Compared to other detection systems, they are more affordable, which makes them attractive for widespread deployment in a variety of settings. Moreover, they offer high-resolution images that make in-depth analysis possible, allowing minute features such as motion patterns, markings, and drone silhouettes to be distinguished. Deep learning techniques are particularly suitable for such visual data sources. This study therefore tackles the visual band camera approach with Computer Vision (CV) methods.

Drone detection using a visual camera is a complicated, constrained case of object detection. Drones are often seen flying over populated areas. In addition to large-scale illumination changes, mist, waving trees, dust, and haze create diverse backgrounds against which it is challenging to recognize a drone (see Fig. 1a). As shown in Fig. 1b, drones and birds commonly coexist in the same area and share similar physical characteristics, so distinguishing between them is not easy. Furthermore, as Fig. 1c illustrates, the tiny size of drones makes it challenging to deduce their semantic features from the scene.

Fig. 1

Drone detection challenges. a Difficulties arising from an intense environment in drone detection. b Difficulties associated with confusing birds for drone identification. c Challenges associated with tiny drones

Increasing the network depth or the input image size is the most popular fix for these issues. In our case, neither technique is ideal because both increase computational complexity and execution time. Instead, this study expands the Effective Receptive Field (ERF) of the feature maps of the YOLOv6 backbone network in order to prioritize the shape bias over the texture bias (see Fig. 2). Shape bias helps to distinguish objects from their environment, while the proposed Linear spatial-channel Attention Module (LAM) aids the extraction of the semantic features of tiny objects. YOLOv6 is the main structure adopted in this article for training and detecting drones of varying sizes in various settings. The article prioritizes low computational cost with an acceptable level of accuracy; therefore, the paper focuses on the tiny version of YOLOv6 and of the other models adopted for comparison in drone detection. The paper’s contributions can be summarized as follows:

  1. YOLOv6 is custom trained on a dataset to detect drones in different scene complexities.

  2. The YOLOv6 backbone is replaced with a rescaled RepLKNet [2], which adopts Depth-Wise (DW) convolution with Large Kernels (LKs) to enlarge the ERF of the output feature maps.

  3. Finally, the paper introduces a Large Effective Receptive Field Network (LERFNet) as a backbone network. LERFNet adopts channel separation and two large Receptive-Field (RF) blocks followed by a Linear spatial-channel Attention Module (LAM). LERFNet provides high detection speed with a high feature extraction capability, while presenting a comparable number of parameters and GFLOPs to other detection methods.

The results of tiny-YOLOv6 with the proposed backbone are compared to YOLOv5s, tiny-YOLOv7, the original tiny-YOLOv6, and tiny-YOLOv6 with the RepLKNet backbone. The results show the effect of enlarging the ERF on the detection of variable-sized drones in complex scenes.

Fig. 2

The ERF produced from the first ERBlock of EfficientRep (the main backbone of YOLOv6), the first stage of RepLKNet, and the first stage of the proposed LERFNet, respectively. A better ERF is indicated by a dark region that is more widely scattered

The remainder of the article is organized as follows. Section 2 reviews the work related to our study. The methodology, including the tiny-YOLOv6 and tiny-YOLOv6 with RepLKNet structures, is described in Sect. 3. The proposed LERFNet backbone is introduced in Sect. 4. The experimental procedures, the results, and the discussion of the results are covered in Sects. 5 and 6, respectively. The conclusion of the article is presented in Sect. 7.

2 Related work

Drone detection follows the same methods adopted for the general object detection challenge. Classical CV and Deep Learning (DL) are the most common approaches for drone detection in the visual band [3]. DL approaches show results with high accuracy and adaptivity, so this article adopts DL approaches for the drone detection problem.

DL progress has led to improved networks that are capable of localizing and classifying objects in images. DL-based detection is divided into two-stage and one-stage methods. The term “two-stage” refers to a first stage that defines the regions of interest (ROIs) and a second stage that performs classification. Girshick et al. [4] introduce the Region Convolutional Neural Network (RCNN) as the base for two-stage object detection. RCNN relies on a set of candidate regions with proposal feature vectors, which are then classified using Support Vector Machines (SVMs). He et al. [5] introduce the Spatial Pyramid Pooling Network (SPPnet) to speed up RCNN. SPPnet extracts features for the whole image instead of for each region separately, and the objects are classified based on these features. Girshick [6] introduces Fast-RCNN as a potential solution that overcomes the issues of RCNN and SPPnet, but region proposal computation remains a bottleneck. Ren et al. [7] propose a Region Proposal Network (RPN) and combine it with Fast-RCNN into one network named Faster-RCNN. Lin et al. [8] introduce the Feature Pyramid Network (FPN) with Faster-RCNN to provide multi-scale feature maps. The two-stage algorithms’ primary flaws are their inaccuracy in localizing tiny objects and their slow inference time, and the computational cost does not improve much even if the base model is greatly reduced.

Compared to two-stage models, one-stage models provide more favorable performance. Researchers subsequently switched to one-stage detectors because of their flexibility in addressing problems and their high performance with minimal memory requirements. One-stage models are end-to-end networks that handle the whole image at once rather than in patches. The most widely used one-stage detector models are RetinaNet [9], the Single Shot multi-box Detector (SSD) [10], and the You Only Look Once (YOLO) series [11,12,13,14,15,16,17]. RetinaNet was introduced by Lin et al. [9], who additionally developed the focal loss function to down-weight the loss associated with well-classified examples. RetinaNet detects objects of different sizes and aspect ratios through an anchor-based architecture along with an FPN to gather multi-scale characteristics. However, RetinaNet’s inference computational needs can make deployment difficult in settings with limited resources. SSD, developed by Liu et al. [10], estimates bounding boxes and class probabilities in a straightforward manner using a single deep neural network. Tiny objects remain a challenge for SSD because their features may be scarce in the lower-level feature maps. SSD also depends on preset fixed-aspect-ratio anchor boxes, which may not always align with the aspect ratios of the objects in the dataset.

Redmon et al. [11] introduce the first version of YOLO, which treats detection as a regression problem. It predicts class probabilities and bounding box coordinates from an entire image using a single neural network. Redmon and Farhadi [12] introduce YOLOv2 and YOLO9000 with the same architecture but different training techniques. Pascal VOC [18] and MS COCO [19] are used as training datasets for YOLOv2, while the MS COCO and ImageNet [20] datasets are used to jointly train YOLO9000, which is created to detect over 9000 distinct object categories. Redmon et al. [13] release YOLOv3 with the Darknet-53 network design; YOLOv3 has been trained using several image resolutions. Bochkovskiy et al. [14] introduce YOLOv4 with a “Bag of Freebies” for training enhancements and a “Bag of Specials” for detection performance. YOLOv4 outperforms YOLOv3 and EfficientDet [21] in terms of speed and accuracy. Glenn Jocher releases YOLOv5 as an open-source project on GitHub [15] two months after YOLOv4, without an accompanying research paper. Following YOLOv4, YOLOv5 implements cross-stage partial connections in the CSP Darknet-53 backbone and a Path Aggregation Network (PAN) neck. YOLOv5 enhances the auto-learning of detection anchors, the backbone, and the neck, and introduces mosaic dataset augmentation during the training stage. Li et al. [16] release YOLOv6 with an Efficient-Rep backbone, a Rep-PAN neck, and a decoupled head. YOLOv6 is released in five network scales to support a variety of application cases. Wang et al. [17] publish YOLOv7, which adopts a new “Bag of Freebies” in training to improve the inference time. YOLOv7 provides a novel E-ELAN backbone network for feature map generation. The previously mentioned models can be custom trained to detect drones in a sequence of images, and with a few adjustments, researchers use these SOTA DL techniques to identify drones in various contexts.

Fig. 3

Rep-block structure in the training phase and after reparameterization in the inference phase

Zeng et al. [22] adapt YOLOv5 with three steps to increase the accuracy of small object detection. First, a hybrid attention module is created to improve the feature collection of tiny objects. Then, an enhanced Simple and Efficient Bottleneck (SEB) module is built to further separate foreground and background characteristics. Lastly, a multilayer feature fusion is constructed to enhance the semantic information of shallow features. For real-time identification of small drones, Liu et al. [23] suggest using a trimmed YOLOv4; compared to the original YOLOv4 model, speed is improved at the expense of accuracy. A Pan-Tilt-Zoom (PTZ) camera is employed by Liu et al. [24] to identify potential targets, and a DL classifier is then used to categorize these targets as drones or not. Li et al. [25] attempt to optimize the feature preservation of YOLOv5’s backbone. The BAM attention module is added to the network’s head to reduce interference from difficult background data, and the YOLOv5 neck is swapped out for a Bi-directional Feature Pyramid Network (BiFPN). An enhanced YOLOv6 object detection algorithm is suggested by Li et al. [26] to identify tiny, high-density objects of interest in UAV aerial photos. The enhancements include a hybrid data augmentation technique, feature pyramid networks (FPN) strengthened with a feature alignment module (FAM) and a feature selection module (FSM), and a transformer prediction head (TPH) formed with Transformer Encoder Blocks. The TPH block and the FSM module significantly improve the accuracy metrics, but they also significantly increase the network complexity.

Following a study of the literature, we found some shortcomings, such as the use of training datasets that overlook difficult weather conditions or intricate surroundings. Prior work also tends to disregard model inference time, which largely determines the amount of computing power needed. These constraints and difficulties motivate us to offer an approach that may enhance accuracy, effectively target multiscale UAVs, and yield balanced outcomes in difficult weather situations.

3 Methodology

YOLOv6 is adopted as the main structure of the proposed drone detection method. This section briefly describes tiny-YOLOv6 for drone detection and then describes how we utilize RepLKNet as the backbone of tiny-YOLOv6.

3.1 Tiny-YOLOv6 network structure

Tiny-YOLOv6 is divided into three main parts: the backbone, the neck, and the head. Three-channel RGB images are taken as the input to the backbone. The backbone structure has a vital impact on inference efficiency, since it accounts for a large percentage of the computational cost, and it is responsible for creating the feature maps for the input drone images. The neck is responsible for fusing semantic features from the dense layers with texture features from the shallow layers; these pyramid feature maps are passed to the head. The head of YOLOv6 is an efficient decoupled head with its associated loss functions.

3.1.1 Backbone

The YOLOv6 backbone depends on the structural reparameterization principle: the architecture decouples the training-time multi-branch topology from a plain inference-time design, so the trade-off between inference time and accuracy is accomplished. The Rep-block is constructed using RepVGG blocks [27] and the ReLU activation function [28], as shown in Fig. 3. The tiny version adopts Rep-blocks to construct a backbone identified as Efficient-Rep.
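
The reparameterization step can be illustrated with a short sketch. The following is a minimal, hedged PyTorch example, not the YOLOv6 implementation, of fusing a training-time \(3\times 3\) conv+BN branch and a \(1\times 1\) conv+BN branch into one \(3\times 3\) convolution for inference; an identity branch, when present, can be folded in the same way.

```python
# Minimal sketch of RepVGG-style reparameterization, assuming a Rep-block with
# a 3x3 conv + BN branch and a 1x1 conv + BN branch (both with the same stride).
# This is illustrative, not the YOLOv6 source code.
import torch
import torch.nn as nn
import torch.nn.functional as F

def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d):
    """Return (weight, bias) of a single conv equivalent to conv followed by bn."""
    std = torch.sqrt(bn.running_var + bn.eps)
    scale = bn.weight / std                      # per output channel
    w = conv.weight * scale.reshape(-1, 1, 1, 1)
    b = bn.bias - bn.running_mean * scale
    if conv.bias is not None:
        b = b + conv.bias * scale
    return w, b

def reparameterize(conv3x3, bn3, conv1x1, bn1) -> nn.Conv2d:
    """Merge the two training-time branches into one inference-time 3x3 conv."""
    w3, b3 = fuse_conv_bn(conv3x3, bn3)
    w1, b1 = fuse_conv_bn(conv1x1, bn1)
    w1 = F.pad(w1, [1, 1, 1, 1])                 # zero-pad the 1x1 kernel to 3x3
    fused = nn.Conv2d(conv3x3.in_channels, conv3x3.out_channels, kernel_size=3,
                      stride=conv3x3.stride, padding=1, bias=True)
    with torch.no_grad():
        fused.weight.copy_(w3 + w1)
        fused.bias.copy_(b3 + b1)
    return fused
```

At inference, the multi-branch Rep-block is replaced by the single fused convolution followed by ReLU, which is what keeps the Efficient-Rep backbone fast.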

3.1.2 Neck

YOLOv6 continues to use Spatial Pyramid Pooling (SPP) [5] and a modified PAN [29] as the network neck, as in YOLOv4 and YOLOv5. SPP frees the network input from the limitation of a fixed image size: the feature maps produced by the previous layer are integrated by the SPP layer to generate fixed-length outputs for the modified PAN. PAN adds bottom-up path aggregation, which shortens the information path between the bottom layers and the top layers. Adaptive feature pooling connects the feature grid to all feature levels, so each feature level holds insightful information delivered to the head network.
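
As a rough illustration of the SPP idea, not the exact YOLOv6 neck, the following sketch pools the same feature map with several kernel sizes and concatenates the results, so the neck receives multi-scale context regardless of the input image size; the pool sizes and channel counts are assumptions.

```python
# A hedged SPP-style block: parallel stride-1 max pools with different kernel
# sizes are concatenated and projected back to the desired channel count.
import torch
import torch.nn as nn

class SPPBlock(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, pool_sizes=(5, 9, 13)):
        super().__init__()
        hidden = in_ch // 2
        self.reduce = nn.Conv2d(in_ch, hidden, kernel_size=1)
        self.pools = nn.ModuleList(
            [nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2) for k in pool_sizes]
        )
        self.project = nn.Conv2d(hidden * (len(pool_sizes) + 1), out_ch, kernel_size=1)

    def forward(self, x):
        x = self.reduce(x)
        feats = [x] + [pool(x) for pool in self.pools]   # multi-scale context
        return self.project(torch.cat(feats, dim=1))

# A 512-channel 20x20 map keeps its spatial size: output is (1, 512, 20, 20).
out = SPPBlock(512, 512)(torch.randn(1, 512, 20, 20))
```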

Fig. 4

The complete YOLOv6 structure with Efficient-Rep backbone, Rep-PAN neck, and efficient decoupled head

3.1.3 Head

By employing an efficient decoupled head, YOLOv6 follows FCOS [30] and YOLOX [31]. The width multiplier used to scale the backbone and the neck also controls the head scale. YOLOv6 relies on an anchor-free detector to manage the candidate predictions. The head assigns ground-truth labels to the predefined anchor points during training. YOLOv6 deploys Task Alignment Learning (TAL) [32] for label assignment instead of Simplified Optimal Transport Assignment (SimOTA) [33] because of the latter’s unstable training. Two loss functions are included in the head, for classification and localization. YOLOv6 omits the objectness loss since it does not make a noticeable impact on the results. Furthermore, VariFocal Loss (VFL) [34] is deployed as the classification loss to distinguish between positive and negative training samples. The box regression loss is responsible for localizing the object’s region effectively. For simplicity, the tiny and nano versions deploy SIoU [35] without probability loss, while the large and medium versions of YOLOv6 adopt GIoU [36] with Distribution Focal Loss (DFL) [37]. To supervise the two losses, a self-distillation technique is employed: knowledge is transferred from a teacher model, which labels the data for the student model during the training phase. The overall loss can be expressed as follows:

$$\begin{aligned} \hbox {Total}_{\text {loss}}=\hbox {class}_{\text {loss}}+\alpha ~\hbox {Reg}_{\text {loss}}+\beta ~\hbox {KD}_{\text {loss}} \end{aligned}$$
(1)

where \(\hbox {class}_{\text {loss}}\) represents the classification loss, \(\hbox {Reg}_{\text {loss}}\) is the regression loss, and the knowledge distillation loss is represented by \(\hbox {KD}_{\text {loss}}\). The two hyperparameters \(\alpha \) and \(\beta \) are used to balance the losses. YOLOv6 adopts additional strategies to increase the model’s performance. Two main tricks address the gray border added to resized input images: first, the images with the adjusted gray border are resized directly to the target size; second, mosaic data augmentation is switched off in the last epochs. A RepOptimizer is adopted to solve the quantization problems of reparameterized models. Figure 4 depicts the tiny-YOLOv6 structure.
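
A short sketch of how the three terms in Eq. (1) are combined is given below; the individual losses are assumed to be computed elsewhere, and the values shown for \(\alpha \) and \(\beta \) are placeholders rather than the authors' settings.

```python
# Weighted combination of the training losses as in Eq. (1). class_loss,
# reg_loss and kd_loss are assumed to be scalar tensors produced by the VFL
# classification loss, the box-regression loss, and self-distillation.
import torch

def total_loss(class_loss: torch.Tensor, reg_loss: torch.Tensor,
               kd_loss: torch.Tensor, alpha: float = 1.0, beta: float = 1.0):
    """Total = class + alpha * regression + beta * knowledge distillation."""
    return class_loss + alpha * reg_loss + beta * kd_loss

# Toy scalars just to show the arithmetic.
print(total_loss(torch.tensor(0.8), torch.tensor(0.5), torch.tensor(0.1)))  # 1.4
```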

3.2 YOLOv6 with RepLKNet

Complex scenes, low-altitude drones, and a non-clear sky are explicit problems for drone detection. Our research tackles this problem by enlarging the ERF of the backbone network. The RF is analogous to the field of view in the human visual system; in a CNN, it is the portion of the input image used to produce a feature. A large RF is commonly required for image segmentation problems, where the RF of each pixel in the output label map needs to be enlarged. For detection tasks, enlarging the RF strengthens the shape bias over the texture bias, and the shape bias helps to discriminate objects from the surrounding regions. The RF size may be increased by chaining more layers to increase the network’s depth.
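
To make the depth argument concrete, the recurrence for the theoretical RF of stacked convolutions can be written down directly; the toy layer stacks below are for illustration only, not the YOLOv6 backbone.

```python
# Theoretical receptive field of a stack of convolutions, using the standard
# recurrence rf_l = rf_{l-1} + (k_l - 1) * jump_{l-1}, jump_l = jump_{l-1} * s_l.

def theoretical_rf(layers):
    """layers: iterable of (kernel_size, stride) pairs, in input-to-output order."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

print(theoretical_rf([(3, 1)] * 3))   # 7: three stacked 3x3 convs
print(theoretical_rf([(31, 1)]))      # 31: a single large 31x31 kernel
```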

Fig. 5

RepLK block structure in the training phase and in the inference phase after reparameterization

As network depth increases, so do network computations and complexity. Moreover, it has been demonstrated experimentally that, while the theoretical RF expands with depth, the ERF effectively shrinks relative to it in deeper layers [2]. The ERF is the patch of pixels in the input image that actually influences an output feature. Clearly, not every pixel within an output feature’s RF has an equal impact: pixels at the center of an RF have a much greater influence on the output, since they have more paths through which to contribute to it. Dilated Convolution (DC) [38] is another way to enlarge the RF; it enlarges the RF exponentially based on the gaps inserted between kernel weights, but it faces problems in regions with high-frequency content. Another way is to use CNNs with LKs. In recent CNN designs it is not common to use Conv layers with LKs: although they produce a large ERF and high accuracy, they suffer from large parameter counts and high execution times. Ding et al. [2] adopt LKs within a DW convolution network named RepLKNet. RepLKNet is presented to compete with Vision Transformers (ViTs) [39] in classification and other downstream tasks such as image segmentation and object detection. ViTs obtain outstanding results compared with the most advanced convolutional networks: while CNNs are well suited for handling large-scale datasets and succeed in a variety of CV applications, ViTs are more advantageous in situations where contextual comprehension and global dependencies are critical. The power of ViTs is believed to come from their Multi-Head Self-Attention (MHSA) modules; however, the design complexity of MHSA makes ViTs hard to deploy in low-computing-power environments. In general, RepLKNet adopts the macro design of ViTs with some adjustments, substituting DW convolution with LKs for the MHSA modules. CNNs lose accuracy and run more slowly when LKs are applied naively. Separable DW convolution [40] with LKs helps to overcome the quadratic increase in parameter and FLOP counts, and it has a greater effect on LKs than on \(3\times 3\) kernels. Structural reparameterization is also applied in RepLKNet: during training, small kernels run in parallel with the LKs, each followed by a BN layer, and after training the model is restructured so that the small kernel and BN parameters are fused into the LK. At inference time, only the LKs with the fused parameters remain (see Fig. 5). RepLKNet follows the design of Swin Transformers [39] with alterations in which LKs replace MHSA.
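
The benefit of DW convolution for LKs is easy to quantify with a back-of-the-envelope parameter count; the 256-channel setting below is chosen only for illustration.

```python
# Parameter counts (bias ignored) of a dense vs. a depth-wise 31x31 convolution
# on a 256-channel feature map; the channel count is an illustrative assumption.

def conv_params(in_ch: int, out_ch: int, k: int, depthwise: bool = False) -> int:
    if depthwise:                       # one k x k filter per channel
        return in_ch * k * k
    return in_ch * out_ch * k * k       # dense: one filter per in/out channel pair

dense = conv_params(256, 256, 31)                # 62,980,096 weights
dw = conv_params(256, 256, 31, depthwise=True)   # 246,016 weights
print(dense // dw)                               # ~256x fewer parameters
```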

Fig. 6

RepLKNet network structure used as backbone for YOLOv6 to produce feature maps for the neck network

RepLKNet starts with a stem block, which uses several Conv layers to obtain high-resolution features and downsample the input channels. It is followed by four main blocks, named stages, that produce the feature maps. Each stage contains several blocks, which provide network scalability. The RepLK-block is the main block in each stage; it deploys LKs with DW Conv, and during training each LK is paired with a parallel small \(3\times 3\) kernel that is reparameterized and fused into the LK for inference. Shortcuts are added to overcome accuracy loss with increasing depth, and \(1\times 1\) Conv layers are added to provide depth and nonlinearity. Every stage ends with a Conv Feed-Forward Network (CFFN) followed by a transition block; this work merges the transition block with the CFFN block for simplicity. The transition block is responsible for downsampling and for producing the feature maps with the desired number of channels. The network scale can be controlled by three design parameters: the number of RepLK blocks, the number of output channels for each feature map, and the size of the LKs. The proposed design adopts LK sizes of (31, 29, 27, 13), output channels of (128, 256, 512, 1024), and (1, 1, 2, 1) RepLK blocks for the four stages, respectively. These parameters are chosen to match the complexity of tiny-YOLOv6. The structure of RepLKNet as the backbone of YOLOv6 is illustrated in Fig. 6. Despite the large ERF of the produced feature maps and the small growth in FLOPs, the inference time of RepLKNet increases significantly. Therefore, we propose a backbone network, i.e., LERFNet, with a large ERF that detects drones in complex scenes with an acceptable inference time.
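
A hedged PyTorch sketch of a RepLK-style block and of the stage settings quoted above is shown below. The training-time parallel \(3\times 3\) branch is the one that would be fused into the LK by reparameterization (Sect. 3.1.1); everything beyond the quoted kernel/channel/block numbers is an assumption rather than the exact RepLKNet code.

```python
# A RepLK-style depth-wise large-kernel block (training-time form): a DW conv
# with a large kernel plus a parallel small 3x3 DW conv, each followed by BN,
# added to an identity shortcut. The small branch is meant to be fused into the
# large kernel at inference time. Details beyond the text are assumptions.
import torch
import torch.nn as nn

class RepLKBlock(nn.Module):
    def __init__(self, channels: int, large_k: int = 31, small_k: int = 3):
        super().__init__()
        self.large = nn.Conv2d(channels, channels, large_k,
                               padding=large_k // 2, groups=channels)
        self.small = nn.Conv2d(channels, channels, small_k,
                               padding=small_k // 2, groups=channels)
        self.bn_large = nn.BatchNorm2d(channels)
        self.bn_small = nn.BatchNorm2d(channels)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        # Identity shortcut counters the accuracy loss of very deep stacks.
        return self.act(x + self.bn_large(self.large(x)) + self.bn_small(self.small(x)))

# Stage settings quoted in the text:
# (large kernel size, output channels, number of RepLK blocks) per stage.
STAGES = [(31, 128, 1), (29, 256, 1), (27, 512, 2), (13, 1024, 1)]
```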

4 Proposed LERFNet structure

In the previous section, we proposed using RepLKNet as the backbone of YOLOv6 to enlarge the ERF of the feature maps produced by the network backbone. This method has a strong effect on detection accuracy and supports the underlying concept: enlarging the ERF of the feature maps enhances drone detection in different scenarios. Based on this concept, we propose the backbone network LERFNet to produce feature maps for the neck of YOLOv6. LERFNet is designed to solve the inference-time problem of RepLKNet. LERFNet follows the structure of RepLKNet and ViTs but differs in the content of the four stages. Each stage contains an LRF-block followed by a LAM and a transition block at the stage end. The network starts by accepting the three-channel input image and driving it through the stem layer for downsampling and depth production. The output of the stem layer is fed forward to the four stages for feature map creation. Each stage starts with an LRF-block, which divides the input channels into two paths, each with its own method of enlarging the ERF of its input channels. The LRF-block starts with a PointWise (PW) separable convolution, which controls the input dimensionality. The first channel path goes through a \(3\times 3\) Conv layer with stride 2 and ReLU followed by a BN layer; every Conv layer followed by BN is reparameterized and merged at inference time. RepLK-blocks are then used in this path to produce output with a large ERF, but from half of the channels only. The RepLK-blocks are designed with LK sizes of (31, 29, 27, 13). This path ends with a \(3\times 3\) Conv with a BN layer and ReLU. The second path adopts DW dilated convolution (DC) with a BN layer and ReLU. DC is another way to produce output feature maps with a large RF: it widens the effective kernel size by inserting gaps between its successive elements. In plain words, it is like standard convolution but skips pixels so as to cover a broader region of the input. An extra argument added to normal convolution, the Dilation Factor (DF), determines how much the kernel is expanded; depending on the value of this argument, \((DF-1)\) pixels are skipped between kernel elements. The main objective is to extract more information from each convolution operation: with the same number of parameters and computational cost, DC offers a wider RF. However, DC loses locational coherence between nearby pixels, and increasing the DF leads to a bigger non-overlapping zone. Therefore, the RepLK-block path provides consistency between neighboring pixels while the DC path provides a low computation cost. The two paths’ output channels are concatenated and fed forward to the LAM.
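
A hedged sketch of the two-path LRF-block described above is given next. The split into a large-kernel DW path and a dilated DW path, the PW convolution, and the stride-2 \(3\times 3\) layer follow the description; the stride and dilation of the second path are assumptions made so that the two paths keep matching spatial sizes, and the exact layer ordering may differ from the authors' implementation.

```python
# Two-path LRF block sketch: a PW conv controls dimensionality, the channels are
# split in half, one half passes through a stride-2 3x3 conv and a DW large-kernel
# conv, the other half through a DW dilated conv, and the outputs are concatenated.
# Stride/dilation of the second path are assumptions chosen to match shapes.
import torch
import torch.nn as nn

class LRFBlock(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, large_k: int = 31, dilation: int = 3):
        super().__init__()
        half = out_ch // 2
        self.pw = nn.Conv2d(in_ch, out_ch, kernel_size=1)   # PW dimensionality control
        self.path1 = nn.Sequential(                          # large-ERF path
            nn.Conv2d(half, half, 3, stride=2, padding=1),
            nn.BatchNorm2d(half), nn.ReLU(inplace=True),
            nn.Conv2d(half, half, large_k, padding=large_k // 2, groups=half),
            nn.BatchNorm2d(half), nn.ReLU(inplace=True),
            nn.Conv2d(half, half, 3, padding=1),
            nn.BatchNorm2d(half), nn.ReLU(inplace=True),
        )
        self.path2 = nn.Sequential(                          # cheap dilated-DW path
            nn.Conv2d(half, half, 3, stride=2, padding=dilation,
                      dilation=dilation, groups=half),
            nn.BatchNorm2d(half), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        x = self.pw(x)
        a, b = torch.chunk(x, 2, dim=1)                       # channel separation
        return torch.cat([self.path1(a), self.path2(b)], dim=1)

# Example: 64 -> 128 channels with 2x spatial downsampling.
y = LRFBlock(64, 128)(torch.randn(1, 64, 80, 80))             # -> (1, 128, 40, 40)
```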

Fig. 7

The output feature maps produced from LERFNet a before the LAM and b after the LAM

Fig. 8

a The LAM structure using channel and spatial attention modules, where b describes the channel attention structure and c describes the spatial attention structure

The LAM is adopted to provide channel and spatial adaptation. Nowadays, the self-attention approach dominates several fields of computer vision. Even so, there are difficulties with using self-attention in computer vision: its quadratic complexity, ineffective spatial adaptation, and misrepresentation of the 2D structure make self-attention impractical, especially in cases of limited computation power. Therefore, the LAM is deployed instead of MHSA modules. The LAM selects the semantic information and disregards disruptive responses based on the input characteristics. The effect of using the LAM on the spatial and channel feature maps output by the LRF block is shown in Fig. 7. The LAM is made up of channel and spatial attention modules that are stacked in series (see Fig. 8a). The LRF block feature results are used as the input \( F_{\text {in}}\in R^{\hbox {Ch}\times (W\times H)} \) for the channel-spatial attention module. First, the 1-D channel attention \( \hbox {ATT}_{\text {ch}}\in R^{\hbox {Ch}\times 1 \times 1} \) is applied to the input features \(F_{\text {in}}\). Channel attention takes advantage of the connections between features across the channels.

The channel attention focuses on “what” is significant in an image, since each channel of a feature map is considered a feature detector. The input spatial dimension is reduced to provide attention to the channels: the spatial data are combined through max pooling in one branch and through a \(3\times 3\) convolution with a stride of 2 followed by a \(1\times 1\) conv layer in the other branch. The output of the two branches can be presented as follows:

$$\begin{aligned} F_{\text {med}}=P_{\max }{(F_{\text {in}})\ \oplus \ \textrm{Conv}^{1\times 1}(\textrm{Conv}_2^{3\times 3}(F_{\text {in}}))} \end{aligned}$$
(2)

where \(P_{\max }\) represents the max pooling, \(\textrm{Conv}_2^{3\times 3}\) performs a DW convolution with a stride of 2, and \(F_{\text {med}}\) represents the spatial descriptors combined from the two branches. Then, the channel correlations are recovered using a point-wise \(1\times 1\) conv layer. The attention channel maps are normalized by the softmax function \(\sigma \). The linear channel attention structure is illustrated in Fig. 8b and can be computed as follows:

$$\begin{aligned} \textrm{ATT}_{\text {ch}}=\sigma \ (\textrm{Conv}^{1\times 1}(F_{\text {med}})) \end{aligned}$$
(3)
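
The equations leave some shapes implicit, so the following sketch fixes one plausible reading: a global max pool for the first branch of Eq. (2), a global average to bring the strided DW branch to \(\hbox {Ch}\times 1\times 1\), addition for the \(\oplus \) operator, and a softmax over the channel axis for \(\sigma \) in Eq. (3). These choices are assumptions, not the authors' exact implementation.

```python
# Hedged sketch of the linear channel attention of Eqs. (2)-(3).
import torch
import torch.nn as nn

class LinearChannelAttention(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # Second branch of Eq. (2): DW 3x3 conv with stride 2, then a 1x1 conv.
        self.dw = nn.Conv2d(channels, channels, 3, stride=2, padding=1, groups=channels)
        self.pw = nn.Conv2d(channels, channels, kernel_size=1)
        # Eq. (3): point-wise conv that recovers the channel correlations.
        self.mix = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, f_in):                                   # f_in: (B, Ch, H, W)
        branch1 = torch.amax(f_in, dim=(2, 3), keepdim=True)   # global max pooling
        branch2 = self.pw(self.dw(f_in)).mean(dim=(2, 3), keepdim=True)
        f_med = branch1 + branch2                              # Eq. (2)
        return torch.softmax(self.mix(f_med), dim=1)           # Eq. (3): ATT_ch

# ATT_ch has shape (B, Ch, 1, 1); the channel-refined feature is f_in * ATT_ch.
att_ch = LinearChannelAttention(128)(torch.randn(1, 128, 40, 40))
```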

Distinct from the channel attention, the spatial attention is used to find “where” the significant regions of interest are in an image. This work follows the structure of the spatial attention used in CBAM [41]. First, the input features of the spatial attention block \( F_{\text {in}}\in R^{\hbox {Ch}\times (W\times H)} \) are downsampled using max and average pooling layers separately. The 2-D maps \(P_{\max }\) and \(P_{\text {avg}}\) \(\in R^{1\times H \times W}\), which hold the channel data from the max and average pooling, are concatenated to form a feature descriptor along the channel axis. Then, the 2-D spatial attention map \(\textrm{ATT}_{\text {spt}}\) is produced by applying a \(5\times 5\) Conv layer followed by the softmax function. The linear spatial attention structure is illustrated in Fig. 8c and can be computed as follows:

$$\begin{aligned} \textrm{ATT}_{\text {spt}}=\sigma \ (\textrm{Conv}^{5\times 5}(P_{\max }\ \textrm{concat}\ P_{\text {avg}})) \end{aligned}$$
(4)

The result of each attention module is element-wise multiplied \( \otimes \) by a copy of its input feature map. The channel and spatial attention maps are applied sequentially to form the LAM, which can be calculated as follows:

$$\begin{aligned} F_{\text {out}}=\textrm{ATT}_{\text {spt}}\otimes (F_{\text {in}}\otimes \textrm{ATT}_{\text {ch}}) \end{aligned}$$
(5)
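
The sketch below completes the LAM of Eqs. (4)-(5): channel-wise max and average pooling, a \(5\times 5\) convolution, a softmax read over the spatial positions for \(\sigma \), and the sequential application of the two attention maps. Applying the spatial attention to the channel-refined feature is an assumption consistent with the serial stacking in Fig. 8a; the channel-attention map can come from the sketch after Eq. (3).

```python
# Hedged sketch of the linear spatial attention of Eq. (4) and the sequential
# combination of Eq. (5). att_ch is a (B, Ch, 1, 1) channel-attention map, e.g.
# produced by the LinearChannelAttention sketch above.
import torch
import torch.nn as nn

class LinearSpatialAttention(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=5, padding=2)   # Eq. (4): 5x5 conv

    def forward(self, f):                                        # f: (B, Ch, H, W)
        p_max = torch.amax(f, dim=1, keepdim=True)               # (B, 1, H, W)
        p_avg = torch.mean(f, dim=1, keepdim=True)               # (B, 1, H, W)
        att = self.conv(torch.cat([p_max, p_avg], dim=1))
        b, _, h, w = att.shape                                   # softmax over H*W
        return torch.softmax(att.flatten(2), dim=-1).view(b, 1, h, w)

def lam(f_in, att_ch, spatial):
    """Eq. (5): F_out = ATT_spt (x) (F_in (x) ATT_ch), applied sequentially."""
    f_ch = f_in * att_ch                # channel re-weighting
    return spatial(f_ch) * f_ch         # spatial re-weighting

# Toy usage with a stand-in channel-attention map.
f = torch.randn(1, 128, 40, 40)
att_ch = torch.softmax(f.mean(dim=(2, 3), keepdim=True), dim=1)
out = lam(f, att_ch, LinearSpatialAttention())                   # -> (1, 128, 40, 40)
```
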
Fig. 9

The structure of one LERFNet stage, containing an LRF block followed by the linear attention module and a transition block

The LRF-block produces feature maps with a large ERF and a high shape bias. These feature maps are fed to the attention module, which attends to the vital parts and objects in the input features. Then, the output of the LAM is fed forward to the transition layer to produce the desired number of channels. Each LERFNet stage therefore contains these three modules with different scales and output channel numbers. Figure 9 depicts the entire structure of an LERFNet stage. The feature maps output by LERFNet are fed forward to the neck and head of tiny-YOLOv6 to complete the drone detection cycle.

5 Experimental results

Although the YOLO series is well known for its object detection capabilities, the primary objective of this research is drone detection, which has different needs and faces different obstacles than standard object detection tasks; nevertheless, drone detection can be defined as a special case of object detection. To demonstrate the performance and novelty of the proposed method in the field of drone detection, the results are compared with the widely used YOLO benchmark series, notably YOLOv5 to YOLOv7, which are renowned for their efficiency and accuracy. The experiments are carried out using five different techniques: YOLOv5s, tiny-YOLOv6, tiny-YOLOv7, tiny-YOLOv6 with the RepLKNet backbone, and tiny-YOLOv6 with the proposed LERFNet backbone. All models are assessed using the same hardware and dataset.

Fig. 10

DUT Anti-UAV dataset. a Labeled drone image samples. b Drone statistics in the dataset images: the drone positions (x, y) in the images and the drone height and width ratios relative to the images

5.1 Dataset

Zhao et al. [42] provide the DUT Anti-UAV dataset, which is established for drone detection and tracking in different circumstances. The dataset contains 10,000 accurately labeled images for detection, divided into 5200 training images, 2600 validation images, and 2200 test images. Drones of different sizes are distributed across different regions of the images, and the dataset covers different scene scenarios to measure the robustness of the different models. Most of the drones are concentrated around the center of the image and occupy small regions of the whole image. The dataset statistics and samples are illustrated in Fig. 10.

Fig. 11

Training batch with different augmentation methods

Fig. 12

Different performance metrics, (a) precision with confidence, (b) recall with confidence, (c) mAP@0.5 with GFLOPs, (d) F1 score with the average inference time, (e) precision-recall curve, (f) F1 score with confidence

Fig. 13

Samples of the detection results from each of the five models in various scene settings. Each row displays the detection bounding box and confidence for the same image applied to the compared models. Zoomed-in viewing is preferable

Table 1 Ablation experiments

5.2 Training strategy

For a fair comparison between the comparative methods, the same training strategy is deployed in all assessments. The models are trained using a stochastic gradient descent (SGD) optimizer with an initial learning rate of 0.01. Momentum, cosine decay, and warmup are adopted to accelerate convergence to the global minimum. The dilation factors of the proposed model are set to match the LK sizes. Different augmentation methods are adopted, such as HSV color augmentation, position augmentation (translation, scaling, and rotation), and the two strong augmentations mix-up [43] and mosaic [14]. The models are trained for 200 epochs with early stopping after 30 epochs without improvement and validation every 10 epochs. The models are trained without pre-trained weights for a fair comparison. The training images are resized to \(640\times 640\). The networks are trained to determine whether an object is a drone and to localize the drone’s region in the image. The same hardware environment is adopted for the training process. An example of a training batch is shown in Fig. 11.
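
The settings above can be summarized in one place; the dictionary below is only a readable restatement of the listed hyperparameters, with key names chosen for illustration.

```python
# Training configuration as stated in the text; key names are illustrative and
# values not mentioned in the text (e.g. the exact momentum) are left unspecified.
TRAIN_CFG = {
    "optimizer": "SGD",
    "initial_lr": 0.01,
    "momentum": "enabled",            # exact value not given in the text
    "lr_schedule": "cosine decay with warmup",
    "epochs": 200,
    "early_stopping_patience": 30,    # epochs
    "validate_every": 10,             # epochs
    "image_size": (640, 640),
    "pretrained_weights": None,       # all models trained from scratch
    "augmentations": ["hsv_color", "translation", "scaling", "rotation",
                      "mixup", "mosaic"],
}
```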

5.3 Assessment metrics

The models’ assessments are performed using the metrics used in SOTA methods. Precision, Recall, the F1 score, and the mean Average Precision (mAP) are adopted as the accuracy metrics. The AP is calculated at \(IOU=0.5\) to get AP@0.5, and over a range of IoU thresholds, averaging the results, to get AP@[0.5:0.95]. Another important metric for the comparison between models is the drone detection speed: the average inference time over the test images is calculated to measure the speed of each model. Floating Point Operations (FLOPs) are a frequently utilized indicator of a model’s computational expense. Additionally, the number of trainable parameters is employed as an approximate indicator of both memory utilization and computational complexity [16].
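
For reference, the sketch below shows how the basic quantities behind these metrics are computed: the IoU that decides whether a detection counts as a true positive at a given threshold, and precision, recall, and F1 from the resulting counts. The box format and the small epsilon are assumptions.

```python
# IoU between two (x1, y1, x2, y2) boxes and precision/recall/F1 from TP/FP/FN
# counts; AP@0.5 and AP@[0.5:0.95] are obtained by thresholding the IoU at 0.5
# or averaging over thresholds from 0.5 to 0.95.

def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / (union + 1e-9)

def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp + 1e-9)
    recall = tp / (tp + fn + 1e-9)
    f1 = 2 * precision * recall / (precision + recall + 1e-9)
    return precision, recall, f1

# Example: 90 correct detections, 10 false alarms, 20 missed drones.
print(precision_recall_f1(90, 10, 20))   # (0.90, 0.818..., 0.857...)
```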

5.4 Comparisons

The evaluation of LERFNet within the tiny-YOLOv6 structure and of the other comparative models is performed based on the assessment metrics discussed in the previous section. The models are trained and tested on a single GPU (NVIDIA Quadro M2000M) with 4 GB of internal memory, using the same dataset for a fair comparison. The results show the superior performance of the proposed LERFNet over the other compared methods. The comparison results are summarized in Table 1 and are discussed in the next section.

6 Discussion

This section analyzes the comparison between the comparative models. The results show that YOLOv7 outperforms the other models in the complexity-related metrics: it has the lowest number of GFLOPs, the lowest number of parameters, and the lowest average execution time. However, LERFNet with YOLOv6 is very close to YOLOv7 in the complexity metrics, while YOLOv7 scores low in the accuracy-related metrics. YOLOv5 shows the highest Recall but low precision; it therefore has the highest capability of detecting drones in images, but it also produces repeated false alarms. The low mAP@0.5:0.95 of YOLOv5 indicates small overlap regions between its predictions and the ground truth. The effect of adopting RepLKNet as the backbone of YOLOv6 is evident in the results: increasing the ERF through LKs yields a large improvement in Recall, and mAP@0.5 and mAP@0.5:0.95 also increase compared to standard YOLOv6. Despite the strong accuracy results of YOLOv6 with RepLKNet, the execution time of DW convolution with LKs harms the average inference time. LERFNet shows balanced results among the compared models: it brings a clear improvement in the accuracy metrics compared to YOLOv5, YOLOv6, and YOLOv7, with only a small increase in inference time compared to YOLOv7 and YOLOv6, and it demonstrates the effect of using LRF blocks instead of plain LKs on the inference time. The Precision, Recall, F1 score, and Precision-Recall curves of the models are shown in Fig. 12. The results also show that YOLOv7 produces detections with low confidence scores. The detection samples shown in Fig. 13 illustrate these numbers. The first row shows the capability of all models to detect drones in a clear sky, while the second row shows that YOLOv5 and YOLOv6 are affected by the small drone size and the cloudy environment. The complex environment shown in the third row leads to false alarms except for YOLOv6 with the RepLKNet and LERFNet backbones. The shadowing phenomenon in the fourth row leads to a false alarm in YOLOv7, and drones flying close to each other prevent their accurate detection. The fifth-row samples show the capability of a backbone with a large ERF to detect small drones in complex scenes. LERFNet shows a considerable improvement over the compared models in its ability to recognize drone targets. The proposed model detects drones in images of the different cases and shows a good balance between accuracy, complexity, and execution time.

7 Conclusion

In this paper, we propose enlarging the ERF of the feature maps produced by the YOLOv6 backbone to detect drones in complex environments. First, we deploy RepLKNet as the YOLOv6 backbone for feature extraction, which achieves high accuracy but has a slow inference time. Therefore, RepLKNet is replaced with the proposed LERFNet, which produces feature maps with a large ERF and a high shape bias. Comparisons are presented between tiny-YOLOv6 with the proposed LERFNet and tiny-YOLOv7, tiny-YOLOv6, tiny-YOLOv6 with RepLKNet, and YOLOv5s. For a fair assessment, the models are trained from scratch and tested with the same dataset of drone images. The proposed LERFNet achieves the top F1-score with approximately no increase in inference time, and it demonstrates a superb balance of accuracy and speed compared to the aforementioned models. Only the tiny-YOLOv6 scale is used for LERFNet testing; no large-scale testing is done. The model is only applied to the drone detection task; it has not been evaluated for other detection or classification tasks.