
1 Introduction

It is well known in the machine intelligence research community that general multi-class object detection systems have developed rapidly in recent years, primarily due to the increased use of deep learning based on CNNs, as demonstrated by experiments conducted on the popular public datasets COCO [1] and VOC [2]. While these datasets include high-resolution images of commonly observed objects viewed at ground level (i.e. front, side or elevated views), the drone footage used in the research presented in this paper was captured looking vertically downwards from a drone flying at a fixed altitude. One direct challenge this dataset therefore poses is that none of the state-of-the-art CNNs trained on the existing benchmark datasets can be used directly to detect objects in it, predominantly because of the significant change of viewing angle.

In the literature, there have been some successful attempts at effectively using trained CNNs to conduct single-class object detection in drone imagery [4, 5]. However, there has been no attempt to develop a single CNN capable of detecting objects of many types, i.e. a multi-class object detector. Developing CNNs that can perform multi-class object detection on drone images requires deeper networks and significantly more training data, as such networks must utilise more features to discriminate multiple object types from each other and from the background. A further challenge in preparing training data is the need for data balancing, i.e. having roughly the same number of objects of each class during training. This is difficult in most cases, owing to the differing numbers of objects of each type physically present in the captured drone footage. In the research presented in this paper, a data balancing strategy is adopted to ensure that sufficient samples of each class are used in each training iteration and the associated update of the prediction loss.

For clarity of presentation, this paper is divided into a number of sections. Section 2 provides a comprehensive literature review of the subject area. Section 3 describes the dataset configuration. Section 4 presents the proposed methodology for network configuration, training, testing and evaluation, and Sect. 5 provides the experimental results and discussion. Finally, Sect. 6 concludes the research findings.

2 Related Works

Deep learning (DL) approaches outperform traditional machine learning techniques in several fields, including computer vision [6]. The history of object detection and classification using deep CNNs has been driven by their well-known success in general image classification tasks, including, but not limited to, LeNet [7], AlexNet [8], ZF-Net [9], VGG-16 [10], ResNet [11], Inception [12] and MobileNet [13].

The first attempt to apply a CNN to object recognition was presented in [7]. The authors investigated the possibility of using Stochastic Gradient Descent (SGD) via backpropagation to train a CNN named ‘LeNet’ for optical character recognition (OCR) in documents. This simple network comprises two convolutional layers, a max pooling layer and a fully connected layer. Even though this architecture may not be suitable for complicated tasks, it can be considered the backbone from which more recent, state-of-the-art CNN architectures have been derived. The lack of fast computing resources practically limited the ability to train CNNs until about 2012. The period since then can be considered the golden age of the application of CNNs in computer vision, as modern GPUs have made it feasible to optimise the millions of parameters present in such networks.

The authors in [8] presented AlexNet, a CNN designed to improve classification on the ImageNet dataset [14]. This CNN comprises five convolutional layers followed by max pooling, with dropout used to reduce overfitting. The authors investigated how increasing the number of convolutional layers improves feature extraction in comparison with the LeNet architecture. A modified version of AlexNet, named ZF-Net, was presented in [9]. Compared to AlexNet, ZF-Net reduces the filter size from 11 to 7 and the stride from 4 to 2, which has a significant impact on extracting more reliable features in the early layers. In addition, a few convolutional layers were added to improve feature extraction.

With the idea of further deepening the CNN architecture, the VGG-16 net was proposed in [10], having a total of 16 weight layers, of which 13 are convolutional, all using 3 × 3 filters. VGG-16 delivered excellent performance compared to previous networks, and the detailed study presented in [10] has had a significant impact on the CNN community: it confirmed that increasing the depth of the model is crucial to improving performance. The authors also compared VGG-16 with VGG-19, in which three further convolutional layers are added. A slight improvement in the top-5 error rate was obtained with VGG-19, at 8.0 compared to 8.7 for VGG-16. Nevertheless, VGG-16 remains more popular with the CNN research community, as practitioners have noted its accuracy is comparable to that of VGG-19.

The Inception-V1 architecture, also called GoogLeNet, was presented in [15]. It aims to reduce the computational cost of very deep CNNs by applying 1 × 1 convolutions and concatenating the channels. The network is 22 layers deep (27 when pooling layers are counted). Rather than using a fully connected layer before the softmax layer, this architecture uses global average pooling towards the end of the network. Together, these two techniques reduce the number of parameters substantially compared to VGG-16, which significantly speeds up processing; an illustration of the difference between fully connected layers and global average pooling can be found in [5].

However, increasing the depth of CNNs by merely stacking layers causes gradients to vanish or explode, consequently increasing the training error; more details can be found in [11]. To address the need for deep networks without vanishing gradients, the Microsoft team presented ResNet in [11]. This architecture has 152 convolutional layers, substantially more than before. The idea is that applying so-called shortcut connections and building residual blocks allows an ensemble of smaller networks to benefit from a very deep net without overfitting the model. Since its introduction, this architecture has become very popular within the CNN research community and has significantly improved the object detection and segmentation accuracy achievable with CNNs.

Further versions of the original Inception network (i.e. V1) were subsequently released, namely V2, V3 and V4. The former two were introduced in [16], while the latter was introduced in [12]. Inception V2 uses a smart factorisation method aimed at reducing computational complexity. Inception V3 adds an RMSProp optimiser, batch normalisation and label smoothing in the loss function, without drastically changing the architecture. Inception V4 [12] combines Inception with ResNet, benefiting from a very deep network architecture while keeping the number of parameters down.

A trade-off between speed and accuracy was proposed in the design of MobileNet [13], a CNN architecture intended specifically for mobile and embedded systems. The main idea is the adoption of depth-wise separable convolutions for all convolutional layers except the first, which is fully convolutional. This architecture demonstrated competitive accuracy and clear benefits when working with limited resources.

Moreover, deep CNNs have been proven to outperform conventional machine learning methods in object detection tasks [6]. Several applications have attracted the attention of both practitioners and academics, who have effectively deployed CNNs for video surveillance, autonomous driving, rescue and relief operations, industrial robots, face and pedestrian detection, understanding UAV images, brand recognition and text digitisation, among others.

Object detection using CNNs is essentially an extension of the meta-architectures used in general feature extraction and classification tasks. For example, LeNet [7], VGG-16 [10], ResNet [11] and Inception [12] are popular feature extraction architectures; object detection is performed by extending them with additional layers responsible for detection. Two approaches are used in the object detection domain: two-stage object detectors, as in [17,18,19], and single-shot object detectors, as in [20] and [21].

In 2014, the R-CNN object detection architecture was shown to improve the detection of objects in the VOC2012 dataset [2] over conventional machine learning techniques by combining Selective Search [22] with a CNN used for classification [17]. The authors proposed using Selective Search to generate 2000 region proposals and running each proposal through a CNN for classification. However, this design increased algorithmic complexity and time consumption, even though detection accuracy improved. In Fast R-CNN [18], a single CNN pass is used, which significantly reduces the time consumed. The last version of this series, called Faster R-CNN [19], replaces Selective Search with a Region Proposal Network (RPN) for proposing regions. On the other hand, sharing computation over every convolutional layer was adopted in R-FCN, where [23] proposed handling object parts and location variance using a position-sensitive score map strategy. This method delivers higher speeds than Faster R-CNN, with comparable accuracy.

The second approach to object detection uses a single CNN to detect objects directly from the raw pixels (i.e. without region proposals), as in the Single-Shot Detector (SSD) proposed in [20] and You Only Look Once (YOLO) proposed in [21]. The latter outperforms the former in detecting objects in many benchmark datasets, including COCO [1] and VOC2012 [2], besides its capability to reduce complexity and time consumption. While SSD uses VGG-16 [10] for feature extraction, YOLO uses a custom architecture called Darknet-19 [24]. Neither approach uses filtering steps that ensure each location has a minimum probability of containing an object.

YOLO has been proven to be one of the most efficient object detection architectures, specifically suitable for real-time applications [21]. It uses a custom meta-architecture based on Darknet [24] for feature extraction. The first version of YOLO, i.e. V1, is simple compared to the subsequent two versions; it consists of 24 convolutional layers followed by fully connected layers [21]. In YOLO-V2, batch normalisation and anchor boxes for bounding box prediction are utilised with the aim of improving localisation accuracy. A significant improvement in accuracy is achieved in YOLO-V3: the architecture is extended to 106 convolutional layers, with residual blocks and skip connections to improve detection at different scales. The squared-error terms in the loss function are also replaced with cross-entropy terms, and the softmax layer is replaced with logistic regression, which predicts each label given a threshold value. [20] showed that SSD performs better than YOLO-V1 and V2 because more boxes are predicted per location than in the first two versions of YOLO; however, YOLO-V3 [25] outperforms SSD on several datasets, including the benchmark COCO dataset.
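To make the class prediction change in YOLO-V3 concrete, the following minimal numpy sketch shows independent per-class logistic scores, each thresholded separately, in contrast to a mutually exclusive softmax; the logits and the 0.5 threshold are illustrative values, not outputs of the actual network.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# YOLO-V3-style multi-label prediction: one independent logistic score per
# class, each thresholded on its own (illustrative logits and threshold).
logits = np.array([2.1, -0.7, 0.4])
scores = sigmoid(logits)
print(scores.round(2))   # [0.89 0.33 0.6 ]
print(scores > 0.5)      # [ True False  True] -> classes 0 and 2 predicted
```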

On the other hand, there have been a considerable number of attempts to apply CNN object detection architectures to drone-based imagery. Generally, the approaches proposed in the literature fall into three application areas based on their purpose: obstacle detection to ensure safe flying of the drone, creating DNN models for embedding within a drone’s hardware, and object detection/recognition/localisation for aerial monitoring and surveillance of large areas [26]. The latter category, with its wide application areas and considerable open research problems, is the focus of the research conducted here. A few attempts have been published on detecting different objects in drone footage using CNN-based learning, as in [27,28,29,30] and [31]. Binary classifiers for the detection of palm trees, one of the objects in the designed multi-class detector, can be found in [32,33,34,35,36,37], [38] and [39]; for more details we refer the reader to [5]. Furthermore, a few studies have discussed animal detection, as in [31, 40,41,42,43] and [4].

Most previous work applying CNNs to detect objects of multiple classes, as in [44] and [45], relates to the detection of obstacles in the flight path of a drone, thereby addressing safe flying. In contrast, analysing drone footage to detect and classify multiple object types has rarely been studied. To the best of our knowledge, the few published attempts uncovered by the literature review conducted are detailed below.

For infrastructure assessment and monitoring in the electric power distribution industry, the authors in [29] proposed detecting three classes of objects from drone images using a single CNN, namely power lines/cables, pylons and insulators, for automatic maintenance and insurance purposes. They investigated the use of the pre-trained GoogLeNet model, fine-tuned it on their dataset and then applied Spectral Clustering [46] to further improve the results.

Moreover, automatic railway corridor monitoring and assessment using DNNs on drone-captured images was proposed in [27]. The pre-trained GoogLeNet architecture and an architecture proposed by the authors were re-trained and trained, respectively, to detect and classify five classes of objects, namely lines, ballast, anchors, sleepers and fasteners. It was shown that the proposed architecture reduced the F-score from 89% to 81% compared to GoogLeNet, at a ten-fold reduction in network parameters owing to its simpler structure. With a similar focus in mind, the authors of [47] minimised the number of convolutional layers of their proposed architecture and obtained significantly good multi-class object classification results for robots detecting threats in crisis and emergency situations. Further, the authors in [48] discussed how CNN technology can be integrated into small drones by applying transfer learning and training only the last few layers of the CNN, enabling the system to be embedded in a drone’s cameras for autonomous flight.

To conclude, there have been no previous attempts to investigate the applicability of CNNs to detecting objects in drone-based imagery with the specifications of the dataset used in this research, described in Sect. 3. This research reveals the significance of the number of convolutional layers, the pooling type, the learning rate and the optimisation method in improving a multi-class detector for drone-based images, which is the main contribution of this work.

3 Dataset Configuration

The research dataset comprises 221 large aerial-view images of size 5472 × 3648 pixels, captured by a drone; an example is illustrated in Fig. 1. For the purpose of the research proposed in this paper, three object classes have been labelled: ‘palm trees’, ‘sheds’ and ‘group-of-animals’. The labelled data for these three classes is used for training the multi-class object detector. The combined dataset consists of 900 images of size 416 × 416, containing 1753, 3300 and 3420 bounding boxes for palm trees, groups of animals and sheds, respectively.

Fig. 1. A sample of the drone-based desert image dataset [3]; image dimensions: 5472 × 3648 pixels.

3.1 Data Balancing Strategy

An important preparation step in training a multi-class model using the mini-batch gradient descent approach to minimise its loss function is to ensure a balanced number of labelled objects for each class per iteration. If a particular class has too few samples, the gradient steps toward the minimum (the weight updates) become noisier and the model develops a high bias toward the better-represented classes, complicating the training process. This reflects the need to increase the number of training samples for under-represented classes.

The multi-class dataset contains 3300 group-of-animals and 3420 shed bounding boxes, but a significantly lower number of palm-tree objects (1753). The majority of raw images captured by the drones used for data collection in our experiments were of animal or crop farming areas, where palm trees were sparse, so collecting sufficient palm-tree samples was difficult. Further, given the significant within-class variation of group-of-animals (and sheds), developing a CNN for detecting such objects requires a large number of samples for training, testing and validation. Given the above, 150 further images of size 500 × 500 (a different magnification) were cropped from the raw, large images captured by the drones. These additional images boost the number of samples of the under-represented class in the process of balancing the data; as a result, the total number of palm trees available for training increased from the original 1753 to 2271. The combined dataset used in training YOLO-V3 for multi-class object detection therefore has 1050 cropped images in total, divided into 85% for training and 15% for testing. Practically, all images are saved in a single folder and named with n prefixes, where n is the number of classes; this ensures the training/testing division respects the determined percentage for each class (see the sketch after Table 1). With this strategy, the palm-tree training samples are balanced, ensuring a sufficient number of palm-tree bounding boxes per training iteration. The final research dataset is shown in Table 1.

Table 1. The multi-class object detection dataset after data balancing strategy (# refers to ‘Number of’).
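A minimal sketch of the per-class 85/15 split described above, assuming (hypothetically) that each image file name encodes its class as a prefix; the actual naming scheme used in the experiments may differ.

```python
import random
from collections import defaultdict
from pathlib import Path

def stratified_split(image_dir, train_frac=0.85, seed=42):
    """Per-class 85/15 split. Assumes (hypothetically) that each file name
    carries its class as a prefix, e.g. 'palm_0001.jpg'."""
    groups = defaultdict(list)
    for path in Path(image_dir).glob("*.jpg"):
        groups[path.stem.split("_")[0]].append(path)  # class prefix before '_'
    rng = random.Random(seed)
    train, test = [], []
    for cls, paths in groups.items():
        rng.shuffle(paths)           # shuffle within each class
        cut = int(len(paths) * train_frac)
        train += paths[:cut]
        test += paths[cut:]
    return train, test
```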

4 Research Methodology

The following sub-sections describe the proposed CNN architecture and coarse-to-fine framework (Sect. 4.1) and the specific training strategy used (Sect. 4.2). The final sub-section, Sect. 4.3, presents the evaluation methodology.

4.1 Proposed CNN Architecture

A single-shot learning approach is utilised, whereby feature extraction and object detection are performed by a single CNN. Effective object detection with CNNs depends heavily on the meta-architecture used for feature extraction. Therefore, the structures of different state-of-the-art CNNs were investigated, within the wider context of this research, for potential use in the multi-class object detection proposed in this paper. The study covered how the state-of-the-art architectures differ in the number of convolutional layers, the activation function and the type of pooling.

Following the practical evaluation of different state-of-the-art architectures, YOLO-V3 was adopted for the given task. This architecture uses Darknet-53 for feature extraction, which has 53 convolutional layers, and a further 53 convolutional layers for object detection from the feature map; in total, YOLO-V3 has 106 convolutional layers with residual blocks. The residual block is an idea inherited from ResNet, which differs significantly from other architectures in that there are no pooling layers in between the convolutional layers; instead, skip connections are used to keep the number of parameters down. However, while YOLO-V3 uses average pooling towards the end of the network, in our investigations max pooling proved better at suppressing outliers, and it has therefore been included in the tuning and evaluation.
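For illustration, a minimal PyTorch sketch of a Darknet-53-style residual block follows; the layer shapes and the 0.1 Leaky ReLU slope follow the public Darknet-53 design, and this is not the authors’ training code.

```python
import torch.nn as nn

class DarknetResidual(nn.Module):
    """A Darknet-53-style residual block: a 1x1 bottleneck halves the
    channels, a 3x3 convolution restores them, and the input is added
    back through the skip connection."""
    def __init__(self, channels):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(channels, channels // 2, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels // 2),
            nn.LeakyReLU(0.1),
            nn.Conv2d(channels // 2, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.LeakyReLU(0.1),
        )

    def forward(self, x):
        return x + self.block(x)  # skip connection keeps gradients flowing
```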

4.2 Training Strategies

A training strategy refers to the set of parameters that control the training process of a given architecture. The complexity of training deep neural networks stems largely from the sheer number of tunable parameters and the difficulty of predicting performance in a given application without practically configuring and testing each setting. These parameters include the gradient descent algorithm, batch size, learning rate, optimisation method and number of iterations. The parameters with the most significant impact, however, are the learning rate, batch size and optimisation method, and this research therefore evaluates their impact on the research dataset. A large batch size, such as 32 or 64, usually improves performance compared to a batch size of 2, 4 or 6, though this is not the case for every dataset. As the batch size was restricted by the hardware specification, it was fixed at 12 in our investigations, and the data balancing strategy ensures a sufficient number of samples of each class per iteration. The learning rate is the most important hyperparameter for improving accuracy and training speed; a comprehensive explanation of the learning rate and how it affects the descent toward the minimum loss can be found in [49]. The learning rate of YOLO-V3 was tuned from 0.001 to 0.0001, with the learning rate decay omitted, meaning the weight updates occur more slowly but consistently across all iterations.
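The resulting training configuration can be summarised as follows; this is an illustrative Python mapping, not the actual Darknet configuration file, and the key names are our own.

```python
# Hedged summary of the training settings described above (illustrative keys).
train_config = {
    "batch_size": 12,        # fixed by the available GPU memory
    "learning_rate": 1e-4,   # reduced from the default 1e-3
    "lr_decay": None,        # decay omitted: constant rate across iterations
    "optimizer": "sgd_momentum",  # default, before the comparison in Sect. 5.4
}
```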

4.3 Evaluation Methodology

Learning algorithms are typically evaluated by calculating precision, recall and F1-score, as shown in Eqs. 1, 2 and 3. In this paper, these metrics are calculated for each class and then averaged to reflect overall performance. The interpretations of the terms True Positive (TP), True Negative (TN), False Positive (FP), False Negative (FN), precision and recall are shown in Table 2.

$$\mathrm{Precision} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}$$
(1)
$$\mathrm{Recall} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}$$
(2)
$$\mathrm{F1\ score} = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
(3)
Table 2. The interpretation of performance evaluation metrics.
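A minimal sketch of Eqs. 1–3 with macro-averaging over the three classes follows; the counts used are illustrative placeholders, not the results reported in Sect. 5.

```python
def prf1(tp, fp, fn):
    """Precision, recall and F1 from raw detection counts (Eqs. 1-3)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Macro-average over the three classes; (TP, FP, FN) counts are illustrative.
counts = {"palm_tree": (160, 0, 13), "group_of_animals": (400, 0, 42), "shed": (330, 0, 44)}
scores = {cls: prf1(*c) for cls, c in counts.items()}
macro_f1 = sum(f1 for _, _, f1 in scores.values()) / len(scores)
```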

5 Results and Discussion

The experiments begin by configuring the dataset as described in Sect. 3. The research dataset comprises 1050 images cropped from the large-scale drone images, divided into 85% for training and 15% for testing. As the test set is randomly selected from the 1050 images, the number of bounding boxes belonging to each of the three classes differs from image to image; the test set of 157 images comprised 173 palm trees, 442 group-of-animals and 374 sheds/animal-shelters. The state-of-the-art architectures, SSD-500 with the VGG-16 and ResNet meta-architectures and YOLO-V3, were configured, trained, tested and evaluated without any changes to their default parameters, except the batch size, which was set to 12 for YOLO-V3 and 4 for SSD. The details of these architectures are shown in Table 3. Based on the initial performance results, YOLO-V3 registered the highest F1-score and was selected for further optimisation.

Table 3. The default parameters of the YOLO-V3, SSD-500/VGG-Net and SSD-500/ResNet architectures (# refers to ‘Number of’).

The SSD-500 with VGG-16 and ResNet meta-architectures and YOLO-V3 were configured, trained and tested on their ability to detect multi-class objects in drone images. While the former uses the 16-layer VGG-16 for feature extraction, SSD with ResNet uses 101 convolutional layers, and YOLO-V3 uses 53 layers based on Darknet-53 for feature extraction plus a further 53 convolutional layers for detecting objects from the generated feature map; the networks therefore differ in the number of convolutional layers in both the feature extraction and object detection phases. The multi-class detection results show an F1-score of 0.91 for YOLO-V3, compared to 0.77 for SSD-500/VGG-Net and 0.83 for SSD-500/ResNet. The influence of the number of convolutional layers is clearly visible: SSD-500 with VGG-16 registered the lowest F1-score, significantly below that of SSD-500 with ResNet, while YOLO-V3 outperformed both by a considerable margin, as shown in Table 4. This is because, compared to the five convolutional layers in the detection phase of the SSD architecture, YOLO-V3 has 53, and its feature extraction process is more comprehensive than that of the two SSD-based approaches.

Table 4. The result of training three CNN architectures for drone-based multi-class object detection without any hyper-parameter optimisation (#BX: Number of bounding boxes, TP: True Positive, FN: False Negative, FP: False Positive).

The precision of the learned model is 1 for all object types, as there were no False Positive (FP) detections, i.e. objects classified as being of a particular type when they are not. The challenge is the missed detections of each object type, represented by the False Negatives (FN): in total, 151 of the 989 annotated bounding boxes were not detected. YOLO-V3 clearly outperforms both SSD-500-based architectures on recall, F1-score and average confidence; the use of YOLO-V3 is therefore recommended for multi-class object detection.

With the aim of improving on the results of the unmodified YOLO-V3, the impact of the activation function, pooling method, learning rate and optimisation method is practically evaluated. This helps identify the best combination of settings, as presented in Sects. 5.1–5.4. The optimal selection in each case is used in the final coarse-to-fine model, as in Sect. 5.5.

5.1 The Impact of Different Activation Functions in the Hidden Units

Most CNNs use either ReLU or Leaky ReLU in the hidden units, activating certain units to pass their signal through the net. While ReLU zeroes all negative activations, Leaky ReLU allows a small negative value to pass, which reduces the number of non-activated neurons. YOLO-V3 uses Leaky ReLU in its default configuration; the impact of changing it to ReLU is practically tested here. The rationale is that the number of classes is still limited (three, compared to general object detection tasks with 80 or more classes), so reducing the number of activated neurons can simplify the model and reduce overfitting.
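The two activation functions can be contrasted in a few lines of numpy; the 0.1 negative slope matches the Leaky ReLU used in Darknet.

```python
import numpy as np

def relu(x):
    """ReLU: negative activations are zeroed entirely."""
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.1):
    """Leaky ReLU: negative activations keep a small slope (0.1 in Darknet)."""
    return np.where(x > 0, x, alpha * x)

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(x))        # [ 0.    0.    0.    1.5 ]
print(leaky_relu(x))  # [-0.2  -0.05  0.    1.5 ]
```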

In Table 5, the result of changing the activation function is presented. The number of true positives improves slightly, from 838 out of 989 bounding boxes in the baseline model to 849 with the ReLU activation function. Although the F1-score improves only slightly, from 0.91 to 0.92, combining this with the tuning of other hyper-parameters, such as the learning rate or pooling layer, can give a noticeable improvement.

Table 5. Overall multiclass object detector performance using YOLO-V3 with different activation functions (#BX: Number of bounding boxes).

5.2 The Impact of the Pooling Method

A pooling layer can optionally be used in between convolutional layers to reduce the number of parameters by taking the average, the maximum or some other statistic over a determined receptive field. ResNet, using residual blocks, has no pooling layers in between its convolutional layers, but it applies average pooling towards the end of the network. Max pooling is commonly used in modern CNN architectures, including VGG-Net and AlexNet, and practitioners report that it suppresses outliers better than average pooling. Average pooling and max pooling towards the end of the feature extraction layers were compared by training the model twice and comparing the performance. The results, shown in Table 6, reflect a slight improvement when using max pooling; the improvement is small because the model precision is already 1, indicating few outliers in the model. Tuning the pooling type is suggested when the precision of the model is low or when the number of classes is higher than in this research.

Table 6. Overall multiclass object detection performance based on using YOLO-V3 with different pooling types (#BX: Number of bounding boxes).
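The difference between the two pooling methods is visible on a toy feature map, as in the following illustrative numpy sketch: max pooling keeps only the strongest response per receptive field, while average pooling blends outliers into the output.

```python
import numpy as np

def pool2x2(x, mode="max"):
    """2x2 pooling with stride 2 over an (H, W) map; H and W assumed even."""
    h, w = x.shape
    blocks = x.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3)) if mode == "max" else blocks.mean(axis=(1, 3))

fmap = np.array([[1., 9., 2., 1.],
                 [0., 3., 4., 2.],
                 [5., 1., 0., 7.],
                 [2., 2., 3., 1.]])
print(pool2x2(fmap, "max"))  # [[9. 4.] [5. 7.]]  keeps the strongest response
print(pool2x2(fmap, "avg"))  # [[3.25 2.25] [2.5 2.75]]  smooths the responses
```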

5.3 The Impact of Tuning Learning Rate and the Choice of the Learning Rate Decay Method

The learning rate is the most crucial hyperparameter shaping the training process and convergence time of a given DNN. As it scales each weight update while the network is being trained, a higher value means the network loss needs less time to converge, but training is noisier than with smaller values. Practically, the learning rate is tuned to values below 1, and researchers usually use lower values when training on complex datasets. A slightly higher learning rate is used if the dataset is easier to train, particularly in conjunction with a large volume of data, to reduce training time, at the cost of the training process becoming less stable. Since the learning rate substantially affects training speed, learning rate decay, a schedule that reduces the learning rate as training progresses, can be used to balance speeding up the process against letting the network converge as it approaches a local minimum.
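For illustration, a Darknet-style step-decay policy can be sketched as follows; the step boundaries and scale are illustrative values, since decay was in fact omitted in our runs.

```python
def step_decay(base_lr, iteration, steps=(40_000, 45_000), scale=0.1):
    """Step policy sketch: multiply the rate by `scale` each time a step
    boundary is passed (boundary values here are illustrative)."""
    lr = base_lr
    for s in steps:
        if iteration >= s:
            lr *= scale
    return lr

print(step_decay(1e-3, 10_000))  # 0.001   before the first boundary
print(step_decay(1e-3, 44_000))  # 0.0001  after the first boundary
print(step_decay(1e-3, 50_000))  # 1e-05   after both boundaries
```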

Therefore, given the significant impact of the learning rate on stabilising training, we conducted experiments with the learning rate set to 0.001 and 0.0001. As the research dataset is complex (high intra-class variation in sheds and group-of-animals) and the data available per class is limited compared to typical popular object detection datasets, we disregard the time required to train the network and treat the stability and convergence of training as crucial. The model was therefore evaluated with learning rates of 0.001 and 0.0001, with the learning rate decay omitted; the results are tabulated in Table 7. They show a significant performance improvement when the learning rate is 0.0001 compared to setting it ten times larger, at 0.001. However, the results with a learning rate of 0.001 were obtained after 50,000 iterations, whereas the results with 0.0001 required 180,000 iterations. The selection of the learning rate for a training task is hence a decision that should weigh the complexity of the task, the time available for training and the relative importance of performance metrics such as precision, recall, F1-score and confidence, which depend on the application needs.

Table 7. Overall multiclass object detection performance using YOLO-V3 with different learning rates (#BX: Number of bounding boxes).

5.4 An Evaluation of the Use of Optimization Methods in Minimizing Loss

As Gradient Descent [7] is the method used to minimise the loss function, different optimisation methods have been used alongside it to make the model learn quickly and accurately, of which Momentum [50] is the most popular within the computer vision research community. However, deep learning practitioners report that the RMSProp [51] and Adam [52] optimisers (the latter combining Momentum with RMSProp) often work better in practice. To evaluate the effectiveness of these optimisers on the proposed multi-class object detector, we conducted an investigation using the Momentum, RMSProp and Adam optimisers. The results obtained are shown in Table 8.

Table 8. Overall multiclass object detection performance based on YOLO-V3 when different optimizers are used (#BX: Number of bounding boxes).

The results in Table 8 demonstrate the slight improvement in object detection enabled by the use of Adam, the hybrid of the Momentum and RMSProp optimisation approaches.
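For reference, the three update rules compared in Table 8 can be sketched in a few lines of numpy; these are minimal illustrations of the standard formulations in [50,51,52], not the training code used.

```python
import numpy as np

def momentum_step(w, g, v, lr=1e-4, beta=0.9):
    """SGD with Momentum: accumulate a velocity over past gradients."""
    v = beta * v + g
    return w - lr * v, v

def rmsprop_step(w, g, s, lr=1e-4, beta=0.999, eps=1e-8):
    """RMSProp: scale the step by a running average of squared gradients."""
    s = beta * s + (1 - beta) * g**2
    return w - lr * g / (np.sqrt(s) + eps), s

def adam_step(w, g, m, s, t, lr=1e-4, b1=0.9, b2=0.999, eps=1e-8):
    """Adam: Momentum-style first moment plus RMSProp-style second moment,
    both bias-corrected by the iteration count t (starting at 1)."""
    m = b1 * m + (1 - b1) * g
    s = b2 * s + (1 - b2) * g**2
    m_hat, s_hat = m / (1 - b1**t), s / (1 - b2**t)
    return w - lr * m_hat / (np.sqrt(s_hat) + eps), m, s
```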

5.5 Overall Performance of the Optimised Multi-class Object Detector

The experiments in Sects. 5.1–5.4 were conducted using default settings for all parameters other than the one under investigation, and revealed the gains achievable when the right parameter values are selected for the dataset at hand. Based on the results above, which highlighted the optimal setting of each parameter, the YOLO-V3-based multi-class object detector was reconfigured, trained, tested and evaluated. This final model uses the ReLU activation function in the hidden units, max pooling towards the end of the network, a learning rate of 0.0001 and the Adam optimiser. The results of this customisation are presented in Table 9.

Table 9. The performance of the final optimised YOLO-V3-based multi-class object detector.

The results tabulated in Table 9 show the performance improvement achieved by the optimised network for multi-class object detection. The True Positive (TP) count improved from 838 to 932 accurately detected objects out of the 989 annotated within bounding boxes. However, as the learning rate was reduced from the default, the total number of iterations needed to achieve this result is higher, at 180,000. It is noted that the precision for palm trees decreased slightly from 1 to 0.99, due to a single FP, shown in Fig. 2: a tree that is not a palm tree but has some perceptual similarity to one at low resolution was detected as a palm tree.

Fig. 2. The single case of false detection (False Positive, FP) of palm trees in the test set.

Examples of multi-class object detection with the optimised YOLO-V3 CNN are illustrated in Figs. 3a and 3b. They reflect the ability of the learned model to detect, in the drone-based footage, sheds at different orientations, groups of animals with different spatial densities and occlusions, and palm trees of different sizes. The missed palm trees are those that are either very small or of very low resolution; the missed sheds are those oriented differently from the majority of buildings used in training; and the missed groups of animals are those sparsely spread within the farm.

Fig. 3a. The final results of the object detection performance of the proposed methodology for drone-based multi-class object detection.

Fig. 3b. The final results of the object detection performance of the proposed methodology for drone-based multi-class object detection (continued).

Figure 4 illustrates further examples where some objects are missed. Despite these missed and false detections, the multi-class object detector developed in this paper has an improved missed-detection rate (6%) compared to the 16% missed detections of the model with non-optimal parameters.

Fig. 4. Examples of missed detections that result from the YOLO-V3 CNN, trained for multi-class object detection.

6 Summary and Conclusion

In this paper, multi-class object detection in drone images was investigated, making the best use of the state-of-the-art CNN architectures: SSD-500 backed by the VGG-16 and ResNet meta-architectures, and YOLO-V3. The key focus was to develop a single CNN model capable of detecting palm trees, groups of animals and sheds/animal-shelters. Initially, the performance of the three CNN models was compared on the multi-class object detection task, analysing in detail how accurately each detected all three object types under default hyper-parameter values. This experiment concluded that YOLO-V3 outperforms the two SSD-500-based CNN models in recall, F1-score and average confidence, while all three models achieved a precision of 1.

Further detailed investigations were subsequently conducted to determine the optimal hyper-parameter settings for YOLO-V3 in the given multi-class object detection task. Specifically, the impacts of different activation functions, pooling methods, learning rates and optimisation methods for minimising loss were investigated, and the optimal setting of each was obtained. The original YOLO-V3 network was then reconfigured with these optimal parameters, and the model was re-trained, tested and evaluated. The experiment demonstrated the ability of the optimised YOLO-V3 model to perform significantly better in multi-class object detection in drone images, with all performance metrics substantially improved. Missed detections were carefully studied, leading to the conclusion that, owing to the high intra-class variation present in all three object types, particularly animal shelters/sheds, a significant number of balanced examples of such objects must be used in training to further improve the accuracy of the proposed model.