1 Introduction

The application of drones in various domains is increasing day by day, especially in military and surveillance operations. Detecting drones in real time is critical for providing security, and real-time drone detection remains a significant challenge in environments such as rain, bright sunlight, and night. Deep learning plays a major role in detecting objects under such conditions [1]. Recently, computer vision and deep learning approaches such as R-CNN, Faster R-CNN, and Mask R-CNN have provided solutions for object detection [2].

Object detection and tracking systems have been widely applied in areas such as the military, health sectors, and security monitoring with autonomous robots [3]. Traditional object identification focused primarily on edges, contours, and templates, where accuracy tends to be lower and loss higher [4]. In addition, various feature extraction methods have been used with CBIR images, and various filtering techniques have been applied to detected objects to support identification [5]. Gradient-based histogram image classifiers have been used, and local binary patterns have also been applied by scanning the image for objects with a sliding window [6]. Machine learning techniques with handcrafted features have been used to improve accuracy on the PASCAL VOC object detection benchmark. However, all of these mechanisms struggle with object tracking in surveillance systems on embedded hardware [7]. To overcome these challenges, various deep learning models have been proposed to enhance accuracy [8].

Deep learning models such as R-CNN, Faster R-CNN, and Mask R-CNN are not well suited to detecting small objects at speed. In this paper, a novel deep YOLOv3 algorithm is proposed to detect small objects [9]. The proposed work employs a confidence score, a backbone classifier, and multi-scale predictions to detect small drone objects.

The contributions of the paper are as follows:

  • This paper proposes a deep YOLOv3 model to solve the small object detection problem at speed.

  • The confidence score is calculated from a conditional probability to select the bounding boxes for a target object.

  • Deep YOLOv3 also uses a backbone classifier and multi-scale prediction to classify objects with high accuracy, achieving more accurate surveillance.

  • The proposed deep YOLOv3 model achieves 99.99% accuracy in detecting small drone objects with low loss.

The rest of the paper contains five sections. Section 2 summarizes the literature and identifies open problems. Section 3 presents the proposed deep YOLOv3 model for detecting small drone objects efficiently. Section 4 presents the simulation and performance of the proposed deep YOLOv3 model, analyzing loss and accuracy. Section 5 presents the conclusions and future directions.

1.1 Highlights of the proposed YOLOv3 model

  • The proposed YOLOv3 uses logistic regression to predict a confidence score for each bounding box, and the 7 × 7 grid cells are evaluated simultaneously. The results show that this is a very fast model.

  • The proposed YOLOv3 uses three anchor boxes per grid cell, so three boxes are predicted and allocated to each grid cell during detection.

  • The proposed YOLOv3 architecture, with 106 convolution layers followed by 2 fully connected layers and an 812 × 812 × 3 input size, detects small objects with a low false-alarm rate while using as few filters as possible.

  • The proposed YOLOv3 model uses independent logistic classifiers and binary cross-entropy loss during training for small object prediction. It also supports multi-label classification.

2 Related works

Unlu et al. noted that commercial unmanned aerial vehicles, known as drones, are increasingly used for video and audio communication, making security standards an essential factor for such wireless devices [10]. They investigated a novel approach to autonomous drone detection and tracking with a multi-camera view [11]. The camera support and frames were analyzed efficiently in terms of memory and time [12]. Small aerial intruders are located in the image plane, and the compressed images are analyzed with a resource-aware detection algorithm using deep learning classification [13, 14].

Deep learning-based object detection has been proposed at various accuracy levels. Song Han et al. investigated low-cost aerial photography for capturing pictures and videos with advanced drones, which is assumed to be error-prone. A deep drone framework was proposed for embedded systems, in which onboard drone vision is investigated for automatic tracking [15]. Tracking and detection were evaluated on multiple hardware platforms, a desktop GPU (NVIDIA GTX980) and embedded GPUs (NVIDIA Tegra K1 and NVIDIA Tegra X1); on the embedded setup, the frame rate, accuracy, and power consumption were analyzed, with about 1.6 fps for tracking [16]. Redmon et al. proposed the YOLO detector, which frames object detection and classification as a regression problem over bounding boxes and their associated class probabilities [17]. A single neural network predicts class probabilities end to end directly; Faster R-CNN has also been used intensively for object detection [18].

Krizhevsky et al. investigated deep neural networks, which are widely used in computer vision for binary and multi-class image classification. AlexNet, a classic network, comprises 8 layers and about 60 million parameters; it was later followed by VGGNet [8]. Szegedy et al. proposed GoogLeNet, which scales differently with the support of CNNs: its modules combine convolutional layers with 1 × 1, 3 × 3, and 5 × 5 kernels. The vanishing gradient problem is mitigated by these multiple cross layers [19]. He et al. proposed ResNet, which increases image recognition accuracy using skip connections that bypass layers, while SqueezeNet applies a compact CNN design to reach comparable image recognition accuracy with about 50× fewer parameters [20].

Henriques et al. highlighted that kernelized correlation filters can be used for detection and image classification based on the DFT; exploiting the fact that translations diagonalize in the Fourier domain yields fast algorithms in both storage and computation, allowing the tracker to run at 70 frames per second on the NVIDIA TK1 kit [21].

Sabir Hossain et al. highlighted target detection and tracking from aerial images using smart sensors and drones. A deep learning-based framework was proposed on embedded modules such as the Jetson TX or AGX Xavier with an Intel Neural Compute Stick [22]. Since flying drones operate within a certain coverage limit, the accuracy of the multi-object detection algorithm was estimated on GPU-based embedded boards with the available computational power [23]. Deep SORT uses hypothesis tracking supported by Kalman filtering with an association metric, deployed on a multi-rotor drone [24].

Roberto Opromolla highlighted that UAVs are used in various civil and military applications; onboard visual cameras enable the detection and tracking of cooperative targets across frame sequences using deep learning [25]. You Only Look Once (YOLO), an object detection system, runs on a processing architecture in which machine vision algorithms with cooperative hints were brought to a flight test campaign on two multirotor UAVs. The method combines accuracy and robustness in challenging environments across the target range [26].

Christos Kyrkou et al. proposed a trade-off mechanism in developing a single-shot object detector using a deep CNN. The UAVs detect vehicles in a dedicated UAV environment, and the CNN takes a holistic approach to optimizing UAV deployment. On aerial images it operates at 6–19 frames per second with an accuracy of 95% for UAV applications on low-power embedded processors [27]. Tsung-Yi Lin et al. highlighted that RetinaNet performs object detection using a backbone network with classification and regression subnetworks. The backbone network computes convolutional features over the input image; like Faster R-CNN, it uses a Feature Pyramid Network (FPN) [28]. The probability of object presence is predicted from C-channel input feature maps at each pyramid level, with A anchors and N object classes, using ReLU activations [29].

Yi Liu et al. highlighted that UAVs have been applied to tasks involving power transmission devices, with deep learning algorithms used for UAV-based transmission inspection [30]. Mask R-CNN has been applied to components of the transmission devices using edge detection, hole filling, and the Hough transform over wireless communication [31]. The proposed model reported 100% accuracy using the UAV transmission parameters [32].

Li Y et al. proposed a multiblock single shot multibox detector (SSD) for small object detection in the surveillance of railway tracks with UAVs. The input images are segmented into patches, and truncated objects are assigned to sub-layer detection in two stages, with sub-layer suppression and filtering applied to the training samples. The recovery of boxes not detected in the main layer is substantially increased over the standard SSD; the deep learning model was also used to label landslides and report important communication during rainy days [33].

Jun-Ichiro Watanabe et al. applied YOLO to the conservation of marine environments, where micro- and macro-plastics from land reach the ocean and many species are suspected to suffer the consequences. Satellite remote sensing techniques for global environmental monitoring were combined with an object tracker on the ocean surface [34]. Autonomous robots were used to observe and manage objects in marine environments. Underwater ecosystems were studied with a deep learning object detector, and YOLOv3 achieved an estimated accuracy of 77.2% [35]. Kaliappan et al. [36,37,38] proposed machine learning techniques such as clustering and genetic algorithms to achieve load balancing. Vimal et al. [39] proposed a machine learning-based Markov model for energy optimization in cognitive radio networks. Aybora et al. [40] simulated types of annotation errors for object detection using YOLOv3 and examined the effect of erroneous annotations in the training and testing phases. Sabir et al. [24] designed a GPU-based embedded flying robot that used a deep learning algorithm to detect and track multiple objects in real time from aerial imagery.

3 Methods

The aim of the proposed model is to perform object detection in a real-time environment, with movement decisions, using a novel YOLOv3 model that predicts boxes within bounded coordinates. YOLOv3 performs better feature extraction than YOLOv2 because it uses a hybrid approach combining Darknet and a residual network. The image is captured and segmented within the bounding box coordinates, which are mapped to boxes at an interval of frames per second in the novel YOLOv3 model. A deep convolutional neural network (DCNN) is applied to predict with high accuracy. Filter counts of 32, 64, 128, 256, and 1024 are applied with striding and padding to process the frame pixel by pixel [2]. The proposed scheme uses kernelized correlation filters (KCF) of various sizes in different convolution layers; in general, KCF runs very fast on video. The CNN layers split the image into regions and predict an accurate bounding box based on the confidence score for each region [41]. The proposed YOLOv3 was trained on a Dell EMC workstation with two Intel Xeon Gold 5118 12-core processors, 256 GB of six-channel 2666 MHz DDR4 ECC memory, 2× NVIDIA Quadro GV100 GPUs, 4× 1 TB NVMe class 40 SSDs, and a 1 TB SATA HDD.

3.1 Proposed work

The model is proposed as a novel YOLOv3 deep learning embedded model to detect small objects in a real-time system. YOLO looks at the entire image in a single pass to predict the bounding box coordinates [42]. The class probabilities are calculated for all bounding boxes, and YOLO can process 45 frames per second. We used a deep convolutional neural network (DCNN) to predict with high accuracy [43]. Figure 1 shows our proposed deep YOLOv3 prediction model for detecting drones. Labeled input images are trained for 45,000 epochs. The image is divided into a 7 × 7 grid, and each grid cell can predict five bounding boxes. The proposed model detects the drone object accordingly.

Fig. 1

Proposed deep YOLO V3 prediction model
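As an illustration of this grid bookkeeping, the following minimal sketch (plain Python; the helper names and the default class count are ours, not from the paper) computes the output size for an S × S grid with B boxes per cell and assigns an object centre to its responsible cell:

```python
# Illustrative sketch of YOLO-style grid bookkeeping.
# S: grid size (7 x 7 per the text), B: boxes per cell (5), C: class count.

def output_tensor_size(S=7, B=5, C=1):
    """Each cell predicts B boxes (x, y, w, h, confidence) plus C class scores."""
    return S * S * (B * 5 + C)

def cell_for_point(x, y, img_w, img_h, S=7):
    """Return the (row, col) grid cell responsible for an object centred at (x, y)."""
    col = min(int(x / img_w * S), S - 1)
    row = min(int(y / img_h * S), S - 1)
    return row, col
```

With C = 1 (a single "drone" class), a 7 × 7 grid with five boxes per cell yields 7 × 7 × 26 = 1274 output values.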

The proposed YOLO algorithm with the embedded model uses a regression mechanism to predict the classes and bounding boxes for the entire image in a single run at each object location. Equation (1) describes the prediction with a confidence pc, the center of a bounding box (bx, by), its width (bw) and height (bh), and the class of an object (c).

$$ y=\left({p}_c,{b}_x,{b}_y,{b}_h,{b}_w,c\right) $$
(1)

The CNN predicts four coordinates for each bounding box: tx, ty, tw, and th. The bounding box is derived by the following four equations, where (cx, cy) is the offset of the grid cell from the top left corner of the image and pw and ph are the prior (anchor) box width and height, respectively. During training, the ground-truth values of t are computed and the network is trained to regress toward them.

$$ {b}_x=\sigma \left({t}_x\right)+{c}_x $$
(2)
$$ {b}_y=\sigma \left({t}_y\right)+{c}_y $$
(3)
$$ {b}_w={p}_w{e}^{t_w} $$
(4)
$$ {b}_h={p}_h{e}^{t_h} $$
(5)
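Equations (2)–(5) translate directly into code. The sketch below (plain Python; the function name is ours) decodes the raw network outputs into a box, where the sigmoid keeps the centre inside its grid cell and the exponential scales the anchor prior:

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """Decode raw network outputs into a bounding box per Eqs. (2)-(5)."""
    bx = sigmoid(tx) + cx   # Eq. (2): centre x, offset into cell (cx, cy)
    by = sigmoid(ty) + cy   # Eq. (3): centre y
    bw = pw * math.exp(tw)  # Eq. (4): width scales the prior pw
    bh = ph * math.exp(th)  # Eq. (5): height scales the prior ph
    return bx, by, bw, bh
```

For example, raw outputs of zero place the box centre in the middle of its cell (sigmoid(0) = 0.5) at exactly the prior's size.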

The proposed algorithm applies a non-max suppression technique with independent logistic regression and predicts pc, since most grid cells and boxes do not contain a target object. pc is used to predict a confidence score for each bounding box, and the 7 × 7 grid cells are evaluated simultaneously; the results show that this is a very fast model. This strategy rejects bounding boxes with low probability and keeps the bounding box with the highest probability, so the predicted bounding box carries a good confidence score. Equation (6) expresses the confidence score.

$$ {p}_r(o)\times \mathrm{IOU} $$
(6)

IOU is the intersection over union of a region: the area of intersection divided by the area of union of two bounding boxes. IOU falls within 0 to 1, and for the ground truth box it approaches 1. IOU is used to find the confidence score for each bounding box, ensuring a box contains the predicted target object; it also prevents background detections. The confidence score is 0 if there is no object present in the grid cell; otherwise, the confidence score equals the IOU between the predicted bounding box and the ground truth box. An IOU greater than 0.5 ensures a better prediction with high accuracy for object detection [44]. To achieve good predictions, YOLO multiplies the individual box confidence predictions with the conditional class probabilities (pr(ci | o)), as expressed in Eq. (7).

$$ {p}_r\left({c}_i\ |\ o\right){p}_r(o){IOU}_{prediction}^{truth}={p}_r\left({c}_i\right){IOU}_{prediction}^{truth} $$
(7)

The final predicted bounding boxes are selected from the priors closest to the average IOU, which gives a good prediction. This is expressed in the following equation.

$$ {p}_r(o)\times IOU\left(b,o\right)=\sigma \left({t}_o\right) $$
(8)

The confidence score of a bounding box plays a vital role in making predictions at the testing stage. It is an output of the neural network.
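The IOU and confidence computation described above can be sketched as follows, assuming boxes in (x1, y1, x2, y2) corner format (the helper names are ours):

```python
def iou(a, b):
    """Intersection over union of two boxes in (x1, y1, x2, y2) format; in [0, 1]."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))  # overlap width
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))  # overlap height
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def confidence(p_object, pred_box, truth_box):
    """Eq. (6): confidence = Pr(object) x IOU; 0 when no object is present."""
    return 0.0 if p_object == 0 else p_object * iou(pred_box, truth_box)
```

A perfect prediction (identical boxes, pr(o) = 1) scores 1.0, while any box in an empty cell scores 0, matching the rule stated above.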

The paper designs a DCNN with 106 convolution layers, comprising convolution layers, pooling layers, and fully connected layers with a classification function. Feature maps are calculated by sliding each convolution filter along the input image; the result is a two-dimensional matrix [45]. Figure 2 shows a sample feature map calculated in the convolution layer.

Fig. 2

Feature map
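The sliding-filter computation behind such a feature map can be sketched in plain Python as a single-channel valid cross-correlation (no striding or padding, for simplicity; the function name is ours):

```python
def feature_map(image, kernel):
    """Slide `kernel` over `image` (both 2-D lists) and return the 2-D feature
    map of dot products -- a valid cross-correlation, as used in CNN layers."""
    kh, kw = len(kernel), len(kernel[0])
    oh = len(image) - kh + 1     # output height for a 'valid' window
    ow = len(image[0]) - kw + 1  # output width
    out = [[0.0] * ow for _ in range(oh)]
    for i in range(oh):
        for j in range(ow):
            out[i][j] = sum(image[i + u][j + v] * kernel[u][v]
                            for u in range(kh) for v in range(kw))
    return out
```

A k × k filter over an n × n image yields an (n − k + 1) × (n − k + 1) map; striding and padding, as used in the proposed layers, change only this output size.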

4 Results and discussion

The work uses 3000 drone images (around 2 GB) downloaded from the Kaggle dataset and Google. In the proposed YOLOv3, 45,000 epochs were run on the drone dataset, providing high accuracy and sensitivity. Figure 3 shows sample drone images used for training and testing [46]. The proposed work uses a pre-trained YOLOv3 model for training, and we implemented YOLOv3 on a GPU-based workstation.

Fig. 3

Drone images

The input images are trained with a pre-trained YOLOv3 model with 106 convolution layers. The training stage takes more than 8 h to build the trained model, which can accept either image or video input. The proposed model achieved a 99.99% detection result on drone images and videos [47]. Figure 4 shows the detection of a drone video in the testing stage. Table 1 compares the accuracy of three models: YOLO, YOLOv2, and YOLOv3. YOLO and YOLOv2 are suitable for large object detection at very high speed, while the proposed YOLOv3 architecture is suitable for small object detection because it uses a hybrid network.

Fig. 4

Detection of drones video

Table 1 Comparison of three YOLO models

4.1 Loss analysis

This section evaluates various losses for object detection: total loss, classification loss, localization loss, clone loss, and the objectness and localization losses in the region proposal network (RPN). Figure 5 shows the total loss over 45,000 epochs; it reached 0 from epoch 200, which indicates very good accuracy. Figure 6 shows the classification loss, which reached 0 from the beginning of training, indicating perfect classification of the drones based on conditional probabilities. It is the final layer that produces the object detection.

Fig. 5

Epoch vs total loss

Fig. 6

Epoch vs classification loss

Figure 7 shows the localization loss over the epochs. It reflects the region proposals returned from the feature map based on the bounding box offsets. The proposed YOLOv3 scheme uses a sum-squared error (SSE) loss function for optimization, which penalizes the classification error for an object in each grid cell; the box with the highest IOU is chosen.
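A minimal sketch of this sum-squared error penalty, omitting the λ weighting terms of the full YOLO loss (the function name is ours):

```python
def sse_loss(pred, truth):
    """Sum-squared error over matched prediction/target values, as used to
    penalise coordinate and classification errors in each grid cell."""
    return sum((p - t) ** 2 for p, t in zip(pred, truth))
```

In the full loss, only the predictor with the highest IOU against the ground truth is penalised for its coordinates; the sketch shows just the squared-error term itself.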

Fig. 7

Epoch vs localization loss

Figure 8 shows that the proposed approach extracts region proposals for a targeted object. The region proposal network (RPN) is integrated with the convolutional neural network (CNN) for classification, and the accuracy of the proposed scheme depends on the performance of the RPN module.

Fig. 8

Steps vs objectness loss in RPN

Figure 9 shows the localization loss in the RPN, in which the bounding box regression loss is predicted to yield good accuracy because the regressor strategy is applied for classification. Figure 10 shows the clone loss over the epochs. It reflects the training and validation losses for accurate prediction; the CNN layer weights are also recalculated by the neurons for good predictions.

Fig. 9

Epoch vs localization loss in RPN

Fig. 10

Epoch vs clone loss

4.2 Accuracy

The proposed deep YOLOv3 model provides 99.99% accuracy in the training and testing stages because the model is designed with 106 convolution layers and different-size feature maps. The YOLOv2 model was also used for training and testing, achieving 98.27%, because it uses only a residual network to detect objects. The proposed model also applies a confidence score based on conditional probability to predict a target object effectively. Figure 11 shows that the proposed model achieved very good accuracy by the end of training. The proposed approach also uses a backbone classifier to classify the objects accurately.

Fig. 11

Epoch vs accuracy

5 Conclusion

In this work, a novel deep YOLOv3 model is proposed to detect small objects. The project trains the model using a pre-trained YOLOv3 with drone images. The simulation results show that the proposed deep YOLOv3 model is suitable for the computer vision process. Here, 106 convolution layers were designed with various feature maps to learn the small drone objects. YOLOv3 extracts better features by using both Darknet and residual networks. The training stage runs 45,000 epochs to provide high accuracy. The proposed scheme uses IOU to predict a confidence score for each bounding box and grid cell simultaneously, and the model uses logistic classifiers with binary cross-entropy loss for optimization to detect small objects. The proposed deep YOLOv3 achieved 99.99% accuracy because it uses multi-scale predictions and backbone classifiers to classify objects better. The different kinds of losses were analyzed and indicate very good prediction of drone images, because the model achieves a good confidence score based on conditional probability, which is used to predict accurate bounding boxes for an object. Compared to previous versions such as YOLO and YOLOv2, the proposed YOLOv3 model is less suitable for detecting larger objects. In future work, the algorithm can be extended to train on a large volume of small drone data under complex visibility conditions and in far-flung remote areas.