Abstract
Multi-class object detection has evolved rapidly in the last few years, driven in particular by learning approaches based on deep Convolutional Neural Networks (CNNs). However, the successful approaches rely on high-resolution, ground-level images and extremely large volumes of data, as in the COCO and VOC datasets. Meanwhile, the availability of drones has increased in the last few years, and several new applications have been established as a result. One such application is understanding drone footage by analysing, detecting and recognising different objects in the covered area. In this study, a collection of large images captured by a drone flying at a fixed altitude over a desert area within the United Arab Emirates (UAE) is used for training and evaluating the CNNs under investigation. Three state-of-the-art CNN architectures, namely SSD-500 with the VGGNet-16 meta-architecture, SSD-500 with the ResNet meta-architecture and YOLO-V3 with Darknet-53, are optimally configured, re-trained, tested and evaluated for the detection of three classes of objects in the captured footage, namely palm trees, groups of animals/cattle and animal sheds in farms. Our preliminary experiments revealed that YOLO-V3 outperformed SSD-500 with VGGNet-16 by a large margin and improved considerably on SSD-500 with ResNet. It has therefore been selected for further investigation, with the aim of proposing an efficient coarse-to-fine model for multi-class object detection in drone images. To this end, the impact of changing the activation function of the hidden units and the pooling type in the pooling layer is investigated in detail. In addition, the impact of tuning the learning rate and of selecting the most effective optimization method for general hyper-parameter tuning is investigated. The results demonstrate that the multi-class object detector developed achieves a precision of 0.99, a recall of 0.94 and an F-score of 0.96, demonstrating the effectiveness of the network.
Keywords
1 Introduction
It is well known in the machine intelligence research community that general multi-class object detection systems have developed rapidly in the last few years, primarily due to the increased use of deep learning based on CNNs, as demonstrated by experiments on the popular public datasets COCO [1] and VOC [2]. While these datasets include high-resolution images of commonly observed objects viewed from ground level (i.e. front, side or elevated views), the drone footage used in the research presented in this paper was captured looking vertically downwards from a drone flying at a fixed altitude. Hence, one direct challenge this dataset poses is that none of the state-of-the-art CNNs trained on the existing benchmark datasets can be used directly for detecting objects in our dataset, predominantly because of the significant change in viewing angle.
In the literature, there have been some successful attempts to use trained CNNs for single-class object detection in drone imagery [4, 5]. However, there has not been any attempt to develop a single CNN capable of detecting objects of many types, i.e. a multi-class object detector. Developing CNNs that can perform multi-class object detection on drone images requires deeper networks and significantly more training data, as more features have to be utilised to discriminate multiple objects from each other and from the background. A further challenge in preparing data for training such networks is the need for data balancing, i.e. the need to have roughly the same number of objects for each class. In practice this is often difficult because of the relative differences in the number of objects of each type physically present in the captured drone footage. In the research presented in this paper, a data balancing strategy has been adopted to ensure that sufficient samples of each class are used in each training iteration and in the associated update of the prediction loss.
For clarity of presentation, this paper is divided into a number of sections. Section 2 provides a comprehensive literature review of the subject area. Section 3 describes the dataset configuration. Section 4 presents the methodology proposed for network configuration, training, testing and evaluation, and Sect. 5 provides the experimental results and discussion. Finally, Sect. 6 concludes the research findings.
2 Related Works
Deep learning (DL) approaches outperform traditional machine learning techniques in several fields, including computer vision [6]. Object detection and classification using deep CNNs has been motivated by the known success of CNNs in general image classification tasks, including but not limited to LeNet [7], AlexNet [8], ZF-Net [9], VGG-16 [10], ResNet [11], Inception [12] and MobileNet [13].
The first attempt to apply a CNN to object recognition was presented in [7]. The authors investigated the possibility of using Stochastic Gradient Descent (SGD) via backpropagation to train a CNN named ‘LeNet’ for optical character recognition (OCR) in documents. This simple network comprises two convolutional layers, a max pooling layer and a fully connected layer. Even though this architecture may not be suitable for complicated tasks, it can be considered the backbone from which more recent, state-of-the-art CNN architectures have been derived. Until about 2012, the lack of fast computing resources limited the practical ability to train CNNs. The period since 2012 can be considered the golden age of the application of CNNs in computer vision, as modern GPUs have sped up processing, especially given the millions of parameters that such networks contain.
The authors in [8] presented AlexNet, a CNN designed for improving classification on the ImageNet dataset [14]. This CNN comprises five convolutional layers followed by max pooling. A dropout technique is used to reduce overfitting. The authors investigated how increasing the number of convolutional layers improves feature extraction in comparison with the LeNet architecture. A modified version of AlexNet, named ZF-Net, was presented in [9]. Compared to AlexNet, in ZF-Net the filter size was reduced from 11 to 7 and the stride was reduced from 4 to 2. This has a significant impact on extracting more reliable features in the early layers. In addition, a few convolutional layers were added to improve the feature extraction.
With the idea of further deepening the CNN architecture, the VGG-16 network was proposed in [10], having a total of 16 weight layers. The network comprises 13 convolutional layers and three fully connected layers, and a 3 × 3 filter is used in all convolutional layers. VGG-16 delivered excellent performance compared to previous networks. The detailed research presented in [10] has had a significant impact on the CNN community. It confirmed that increasing the depth of the model has a crucial impact on improving performance. The authors also compared the performance of VGG-16 and VGG-19, where three convolutional layers were added to the latter network. A slight improvement in the top-5 error was obtained with VGG-19 at 8.0%, as compared to 8.7% obtained with VGG-16. Nevertheless, VGG-16 remains more popular with the CNN research community, as practitioners have noted that its accuracy is comparable with that of VGG-19.
The Inception-V1 architecture, also known as GoogLeNet, was presented in [15]. It aims to reduce the computational cost of very deep CNNs by applying 1 × 1 convolutions and concatenating the channels. There are 27 layers, of which 22 are convolutional. Rather than using a fully connected layer before the softmax layer, this architecture uses global average pooling towards the end of the network. Using these two techniques reduces the number of parameters from 12M in VGG-16 to 2.4M. This significantly speeds up processing; an illustration of the difference between fully connected layers and global average pooling can be found in [5].
However, increasing the depth of a CNN by simply stacking layers causes the gradient to vanish or explode, which in turn increases the training error; more details can be found in [11]. To enable very deep networks without vanishing gradients, ResNet was presented by Microsoft researchers in [11]. This architecture has 152 convolutional layers, substantially more than earlier networks. The idea is that applying so-called shortcut connections and building up residual blocks allows the network to behave like an ensemble of smaller networks that benefits from the very deep architecture without overfitting. Since its introduction, this architecture has become very popular within the CNN research community. It significantly improves the object detection and segmentation accuracy achievable using CNNs.
Further versions of the original Inception network (i.e. V1) have been released subsequently, namely V2, V3 and V4. The first two were introduced in [16], while the last was introduced in [12]. In Inception V2, a smart factorisation method is used to reduce the computational complexity. In Inception V3, an RMSProp optimiser, batch normalisation and label smoothing in the loss function are added without drastically changing the architecture. Furthermore, Inception V4 [12] combines Inception V1 with ResNet, benefiting from a very deep network architecture while minimising the number of parameters.
A trade-off between speed and accuracy was proposed in [13] with MobileNet, a CNN architecture designed specifically for mobile and embedded systems. The main idea is the adoption of depth-wise separable convolutions for all convolutional layers except the first, which is a full (standard) convolution. This architecture demonstrated competitive accuracy and is particularly beneficial when working with limited resources.
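To make the saving from depth-wise separable convolutions concrete, the following minimal sketch counts the weights of a single layer under both designs. The layer sizes (a 3 × 3 kernel with 256 input and 256 output channels) are illustrative assumptions and are not taken from [13]; bias terms are ignored.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution (bias terms ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(k, c_in, c_out):
    """Weights in a depth-wise k x k convolution followed by a 1 x 1 point-wise convolution."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes the channels
    return depthwise + pointwise


# Illustrative (assumed) layer sizes
k, c_in, c_out = 3, 256, 256
standard = standard_conv_params(k, c_in, c_out)
separable = depthwise_separable_params(k, c_in, c_out)
print(standard, separable, round(standard / separable, 2))
```

For these assumed sizes the separable layer needs roughly an order of magnitude fewer weights, which is the source of the speed/accuracy trade-off described above.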
Moreover, deep CNNs have been proven to outperform conventional machine learning methods in object detection tasks [6]. Several applications have attracted the attention of both practitioners and academics, who have effectively deployed CNNs for video surveillance, autonomous driving, rescue and relief operations, robots in industry, face and pedestrian detection, understanding UAV images, recognising brands and text digitalisation, among others.
Object detection using CNNs is essentially an extension of the meta-architectures used in general feature extraction and classification tasks. For example, LeNet [7], VGG-16 [10], ResNet [11] and Inception [12] are popular feature extraction architectures. Object detection is performed by expanding these architectures with additional layers responsible for detection. There are two approaches in the object detection domain: two-stage object detection, as in [17,18,19], and single-shot object detection, as in [20] and [21].
In 2014, the R-CNN object detection architecture was shown to improve the detection of objects in the VOC2012 dataset [2] in comparison to conventional machine learning techniques by combining Selective Search [22] with CNNs used for classification [17]. The authors proposed the use of Selective Search to generate 2000 region proposals and use 2000 CNNs for classification. However, although object detection accuracy improved, this design increased algorithmic complexity and time consumption. In Fast R-CNN [18], a single CNN is used, which significantly reduces the time consumption. The last version in this series, Faster R-CNN, replaces Selective Search with a Region Proposal Network (RPN) for proposing regions. In addition, R-FCN [23] shares the computation across the convolutional layers and recognises object parts and location variance using position-sensitive score maps. This method delivers higher speeds than Faster R-CNN, with comparable accuracy.
The second approach to object detection uses a single CNN to detect objects directly from the raw pixels (i.e. not from pixel regions), as in the Single-Shot Detector (SSD) proposed in [20] and You Only Look Once (YOLO) proposed in [21]. The single-shot approach outperforms the two-stage approach in detecting objects in many benchmark datasets, including the COCO [1] and VOC2012 [2] datasets, besides reducing complexity and time consumption. While SSD uses VGG-16 [10] for feature extraction, YOLO uses a custom architecture called Darknet-19 [24]. Neither SSD nor YOLO uses a filtering step that ensures each location has a minimum probability of containing an object.
YOLO has been proven to be one of the most efficient object detection architectures, particularly suitable for real-time applications [21]. It uses a custom meta-architecture based on Darknet [24] for feature extraction. The first version of YOLO, i.e. V1, is simple compared to the subsequent two versions. It consists of 24 convolutional layers followed by fully connected layers [21]. In YOLO-V2, batch normalization and anchor boxes for bounding box prediction are utilised with the aim of improving the localisation accuracy. A significant improvement in accuracy is achieved in YOLO-V3. The architecture is changed by increasing the number of convolutional layers to 106 and by building residual blocks with skip connections to improve detection at different scales. In addition, the squared errors in the loss function are replaced with cross-entropy terms, and the softmax layer is replaced with logistic regression that predicts the label using a threshold value. In [20], SSD was shown to perform better than YOLO-V1 and V2 because it predicts more boxes per location than the first two versions of YOLO. However, YOLO-V3 [25] outperforms SSD on several datasets, including the benchmark COCO dataset.
On the other hand, there have been a considerable number of attempts to apply CNN object detection architectures to drone-based imagery systems. Generally, the approaches proposed in the literature can be categorised into three application areas based on their purpose, namely obstacle detection for ensuring the safe flying of drones, creating DNN models for embedding within a drone’s hardware, and object detection/recognition/localization for aerial monitoring and surveillance of large areas [26]. The last category, with wide application areas and considerable open research problems, is the focus of this research. A few attempts to detect different objects in drone footage using CNN-based learning have been published, as in [27,28,29,30] and [31]. Binary classifiers for the detection of palm trees, one of the object classes in the designed multi-class detector, can be found in [32,33,34,35,36,37,38] and [39]. For more details we refer the reader to [5]. Furthermore, a few studies have discussed animal detection, as in [31, 40,41,42,43] and [4].
Most previous work applied CNNs to detect objects of multiple classes, as in [44] and [45], in applications related to the detection of obstacles in the flight path of a drone, thereby addressing the drone’s safe flying. In contrast, analysing drone footage to detect and classify multiple objects is rarely studied. To the best of our knowledge, the few published attempts identified in our literature review are detailed below.
For infrastructure assessment and monitoring in the electric power distribution industry, the authors in [29] proposed detecting three classes of objects, namely power lines/cables, pylons and insulators, in drone images using a single CNN, with the aim of supporting automatic maintenance and insurance purposes. The authors used the pre-trained CNN model GoogLeNet, fine-tuned it on their dataset and then applied Spectral Clustering [46] to further improve the results.
Moreover, automatic railway corridor monitoring and assessment using DNNs on images captured by drones was proposed in [27]. The pre-trained GoogLeNet architecture was re-trained, and an architecture proposed by the authors was trained, to detect and classify five classes of objects, namely lines, ballast, anchors, sleepers and fasteners. It was shown that, compared to GoogLeNet, the proposed architecture reduced the F-score from 89% to 81% while achieving a ten-fold reduction in network parameters owing to its simpler design. With a similar focus, the authors of [47] minimised the number of convolutional layers in their proposed architecture and obtained good results in multi-class object classification for robots detecting threats in crisis and emergency situations. Further, the authors in [48] discussed how CNN technology can be integrated into small drones by applying transfer learning and saving only the last few layers of the CNN, enabling the system to be embedded in a drone’s cameras for autonomous flight.
To conclude, there have been no previous attempts to investigate the applicability of CNNs to detecting objects in drone-based imagery with the specification of the research dataset described in Sect. 3. This research reveals the significance of the number of convolutional layers, the pooling type in the pooling layer, the learning rate and the optimization method in improving a multi-class detector for drone-based images, which is its main contribution.
3 Dataset Configuration
The research dataset comprises 221 large aerial-view images of size 5472 × 3648 pixels, captured by a drone. An example of such an image is illustrated in Fig. 1. For the purpose of the research presented in this paper, three object classes have been labelled: ‘palm trees’, ‘sheds’ and ‘group-of-animals’. The labelled data for these three classes is used for training the multi-class object detector. The combined dataset consists of 900 images of size 416 × 416 pixels. It has 1753, 3300 and 3420 bounding boxes for palm trees, groups of animals and sheds respectively.
Fig. 1. A sample of the drone-based desert image dataset [3]; image dimensions: 5472 × 3648 pixels.
3.1 Data Balancing Strategy
An important preparation step in training a multi-class model with mini-batch gradient descent is to ensure a balanced number of labelled objects for each class per iteration. Because each weight update is meant to move the loss function down the slope toward its minimum, an inadequate number of samples for a particular class complicates the training process, increases noise and in practice results in high bias. This reflects the need to increase the number of training samples for under-represented classes.
The multi-class dataset has a comparable number of group-of-animals (3300) and sheds (3420) bounding boxes but a significantly lower number of palm-tree bounding boxes (1753). The majority of raw images captured by the drones used for data collection in our experiments were of animal or crop farming areas. Palm trees were sparse in such farms, and collecting sufficient palm tree samples was therefore difficult. Further, because of the significant within-class variation of group-of-animals (and sheds), developing a CNN for detecting group-of-animals requires a large number of such objects for training, testing and validation. Given the above, 150 further images of size 500 × 500 (i.e. at a different scale) were cropped from the raw, large images captured by the drones. The idea is to use these additional images to boost the number of samples in a particular class as part of balancing the data. As a result, the total number of palm trees available for training increased from the original 1753 to 2271. The combined dataset used in training YOLO-V3 for multi-class object detection therefore has 1050 cropped images in total, divided into 85% for training and 15% for testing. In practice, all the images are saved in a single folder and named using n distinct names, where n is the number of classes. This allows the training and testing sets to be divided using the chosen percentage for each class. With this strategy, the palm tree training samples are balanced, which ensures that a sufficient number of palm tree bounding boxes are trained per iteration. The final dataset used in this research is shown in Table 1.
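The per-class division into training and testing sets described above can be sketched as follows. This is a minimal illustration only: the file-naming scheme `<class>_<index>.jpg`, the class labels and the helper name `split_by_class` are hypothetical and are not taken from the original pipeline.

```python
import random
from collections import defaultdict


def split_by_class(filenames, test_fraction=0.15, seed=0):
    """Split file names into train/test lists, applying the fraction per class."""
    by_class = defaultdict(list)
    for name in filenames:
        # Hypothetical naming scheme: "<class>_<index>.jpg"
        by_class[name.split("_")[0]].append(name)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for cls in sorted(by_class):
        names = sorted(by_class[cls])
        rng.shuffle(names)
        n_test = round(len(names) * test_fraction)
        test.extend(names[:n_test])
        train.extend(names[n_test:])
    return train, test


# Toy example: 100 hypothetical images per class
files = [f"{cls}_{i:04d}.jpg" for cls in ("palm", "animals", "sheds") for i in range(100)]
train_files, test_files = split_by_class(files)
print(len(train_files), len(test_files))
```

Splitting within each class, rather than over the pooled file list, keeps the 85%/15% ratio for every class, so a sparse class such as palm trees is not under-represented in either set by chance.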
4 Research Methodology
This section describes the proposed CNN architecture and coarse-to-fine framework (Sect. 4.1) and the specific training strategy used (Sect. 4.2). The last sub-section, Sect. 4.3, presents the evaluation methodology.
4.1 Proposed CNN Architecture
A single-shot learning approach is utilised, whereby feature extraction and object detection are performed using a single CNN. Effective object detection using CNNs depends heavily on the meta-architecture used for feature extraction. Therefore, the architectures of different state-of-the-art CNNs were investigated, within the wider context of this research, for potential use in the multi-class object detection proposed in this paper. The study includes how these architectures differ in terms of the number of convolutional layers, the activation function and the type of pooling.
Following the practical evaluation of different state-of-the-art architectures, YOLO-V3 was adopted for the given task. This architecture uses Darknet-53 for feature extraction, which has 53 convolutional layers, and a further 53 convolutional layers for object detection from the feature map. In total, YOLO-V3 has 106 convolutional layers, with residual blocks. The residual block is an idea inherited from ResNet, which differs significantly from other architectures in that there are no pooling layers in between the convolutional layers, although skip connections are used to reduce the number of parameters. However, while the last layer in YOLO-V3 uses ‘average-pooling’, in our investigations ‘max-pooling’ has proven to reduce outliers, and it has therefore been tuned for evaluation.
4.2 Training Strategies
Training strategy refers to defining a set of parameters that control the training process of a given architecture. The complexity of training deep neural networks results largely from the sheer number of parameters that can be tuned and the difficulty of predicting performance in a given application before practically configuring and testing a configuration. These parameters include the gradient descent algorithm, batch size, learning rate, optimization method and number of iterations. The parameters with the most significant impact are the learning rate, batch size and optimization method, and hence this research evaluates their impact on the research dataset. A large batch size, such as 32 or 64, mostly improves performance compared to a batch size of 2, 4 or 6, even though this is not the case for certain datasets. As it is restricted by the hardware specification, the batch size used in our investigations was fixed at 12, and the data-balancing strategy ensures a sufficient number of samples of each class per iteration. The learning rate, however, is the most important hyperparameter and can significantly improve accuracy and speed. A comprehensive explanation of the learning rate and how it affects the descent toward the minimum loss can be found in [49]. The learning rate in YOLO-V3 has been changed from 0.001 to 0.0001, and the learning rate decay has been omitted. This means that the weight updates occur more slowly but also consistently across all iterations.
4.3 Evaluation Methodology
The typical evaluation of learning algorithms is based on calculating the precision, recall and F1-score, as shown in Eqs. 1, 2, and 3. In this paper, these metrics are calculated for each class and then averaged to reflect the overall performance. The definitions of the terms True Positive (TP), True Negative (TN), False Positive (FP), False Negative (FN), precision and recall are shown in Table 2.
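Assuming the standard definitions of these metrics in terms of TP, FP and FN, Eqs. 1, 2 and 3 can be written as follows. The per-class averaging in the last line is an assumption about how the class scores are combined, with C = 3 classes and M_c denoting any of the three metrics computed for class c.

```latex
\begin{align}
\text{Precision} &= \frac{TP}{TP + FP} \tag{1}\\
\text{Recall}    &= \frac{TP}{TP + FN} \tag{2}\\
F_1              &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{3}
\end{align}
% Assumed averaging over the C classes:
\[
\overline{M} = \frac{1}{C} \sum_{c=1}^{C} M_c
\]
```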
5 Results and Discussion
The experiments begin by configuring the dataset as described in Sect. 3. The total number of cropped images (from the large-scale drone image dataset) used in the research dataset is 1050. These images have been divided into 85% for training and 15% for testing. As the test set is randomly selected from amongst the 1050 images, the number of bounding boxes belonging to each of the three classes differs from image to image. The test set, which contains 157 images, comprises 173 palm trees, 442 group-of-animals and 374 sheds/animal-shelters. The state-of-the-art CNN architectures, SSD-500 with the VGG-16 and ResNet meta-architectures and YOLO-V3, were configured, trained, tested and evaluated without any changes to their default parameters, except for the batch size, which was set to 12 for YOLO-V3 and to 4 for SSD. The details of these architectures are shown in Table 3. Based on the initial performance results, YOLO-V3 registered the highest F1-score and was selected for further optimization.
SSD-500 with the VGG-16 and ResNet meta-architectures and YOLO-V3 have been configured, trained and tested for their ability to detect multi-class objects in drone images. While SSD with VGG-16 uses 16 convolutional layers for feature extraction, SSD with ResNet uses 101 convolutional layers. YOLO-V3 uses 53 layers based on Darknet-53 for feature extraction and a further 53 convolutional layers for detecting objects from the generated feature map. Therefore, these networks have different numbers of convolutional layers in the feature extraction and object detection phases. The multi-class detection results show an F1-score of 0.91 for YOLO-V3, compared to 0.77 for SSD-500/VGG-Net and 0.83 for SSD-500/ResNet. The influence of the number of convolutional layers is clear: SSD-500 with VGG-16 registered the lowest F1-score, considerably lower than that of SSD-500 with ResNet. However, YOLO-V3 outperformed both SSD-500 with VGG-16 and SSD-500 with ResNet by a considerable margin, as shown in Table 4. This is because YOLO-V3 has 53 convolutional layers in the detection phase, compared to the five convolutional layers in the detection phase of the SSD architecture. Further, YOLO-V3’s feature extraction process is more comprehensive than that of the two SSD-based approaches.
The precision of the learned multi-class model is 1 for all object types, as there were no ‘False Positive (FP)’ detections, i.e. objects classified as being of a particular type that are not of that type. The challenge here is the missed detections of each object type, represented by the ‘False Negative (FN)’ count: in total, 151 of the 989 objects of all types were not detected. YOLO-V3 clearly outperforms both SSD-500 based architectures, as it shows better results with regard to the performance parameters recall, F1-score and Average Confidence. Therefore, the use of YOLO-V3 is recommended for multi-class object detection.
With the aim of improving the result obtained with the baseline YOLO-V3 model, the impact of the activation function, pooling method, learning rate and optimization method is practically evaluated. This helps in identifying the combination that most improves performance, as described in Sects. 5.1–5.4. The optimal selection in each case will be used in the final coarse-to-fine model, as in Sect. 5.5.
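As a rough consistency check of the figures above, the pooled counts (no FP detections, 151 FN out of 989 objects) can be plugged into the standard metric definitions. This is only a sketch: it pools all classes together, whereas the paper averages per-class scores, so the values may differ slightly from those in Table 4. The function name `detection_metrics` is introduced here for illustration.

```python
def detection_metrics(tp, fp, fn):
    """Return (precision, recall, F1) from raw detection counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


total_objects = 989        # objects of all types in the test set
fn = 151                   # missed detections
fp = 0                     # no false positives were reported
tp = total_objects - fn    # correctly detected objects

precision, recall, f1 = detection_metrics(tp, fp, fn)
print(f"precision={precision:.2f} recall={recall:.3f} f1={f1:.3f}")
```

With these pooled counts the recall is about 0.85 and the F1-score about 0.92, which is in line with the reported F1-score of roughly 0.91 for the baseline YOLO-V3 model.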
5.1 The Impact of Different Activation Functions in the Hidden Units
Most CNNs use either ReLU or Leaky ReLU in the hidden units to determine which units are activated and passed through the network. While ReLU zeroes all negative neuron outputs, Leaky ReLU lets a small value through, which reduces the number of non-activated neurons. While YOLO-V3 uses Leaky ReLU in its default configuration, the impact of changing it to ReLU is practically tested here. This is because the number of classes is limited to three, compared to general object detection tasks, which have 80 or more classes. Reducing the number of activated neurons can simplify the model and reduce overfitting.
In Table 5, the result of changing the activation function is presented. This shows a slight improvement in the number of correct detections (TP), from 838 out of 989 in the baseline model to 849 with the ReLU activation function. Even though the F1-score improves only slightly, from 0.91 to 0.92, combining this change with the tuning of other hyper-parameters, such as the learning rate or the pooling layer, can give a noticeable improvement.
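The difference between the two activation functions can be illustrated with a minimal sketch. The negative slope of 0.1 used for Leaky ReLU is an assumption made for illustration; the input values are arbitrary.

```python
def relu(x):
    """ReLU: negative inputs are set to zero."""
    return max(0.0, x)


def leaky_relu(x, slope=0.1):
    """Leaky ReLU: negative inputs are scaled by a small slope (assumed 0.1 here)."""
    return x if x > 0 else slope * x


pre_activations = [-2.0, -0.5, 0.0, 0.5, 2.0]  # arbitrary example inputs
relu_out = [relu(v) for v in pre_activations]
leaky_out = [leaky_relu(v) for v in pre_activations]

# Count the units that end up exactly at zero (non-activated) in each case
print(sum(1 for v in relu_out if v == 0.0), sum(1 for v in leaky_out if v == 0.0))
```

In this toy case ReLU leaves three of the five units at zero, while Leaky ReLU leaves only the unit whose input is exactly zero, which is the sparsity effect discussed above.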
5.2 The Impact of the Pooling Method
A pooling layer can optionally be used in between the convolutional layers to reduce the number of parameters by taking the average, the maximum or another pooling function over a determined receptive field. ResNet, which uses residual blocks, has no pooling layers in between the convolutional layers but uses average pooling toward the end of the network. Max pooling is commonly used in modern CNN architectures, including VGG-Net and AlexNet, and practitioners claim it reduces outliers better than average pooling. Average pooling and max pooling toward the end of the feature extraction layers have been compared by training the model twice and comparing the performance. The result is shown in Table 6, which reflects a slight improvement when using max pooling. The improvement is only slight because the model precision is initially 1, which reflects the low number of outliers in the model. It is suggested to tune the pooling type if the precision of the model is low or if the number of classes the model has to learn is higher than in this research.
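The two pooling types compared above can be sketched on a small feature map. This is a generic illustration of non-overlapping 2 × 2 pooling, not the YOLO-V3 implementation; the helper names and the feature map values are made up for the example.

```python
def pool2d(feature_map, size, reduce_fn):
    """Apply non-overlapping size x size pooling to a 2-D list of numbers."""
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for r in range(0, rows - size + 1, size):
        row = []
        for c in range(0, cols - size + 1, size):
            window = [feature_map[r + i][c + j] for i in range(size) for j in range(size)]
            row.append(reduce_fn(window))
        pooled.append(row)
    return pooled


def mean(values):
    """Arithmetic mean, used as the reduction for average pooling."""
    return sum(values) / len(values)


# Made-up 4 x 4 feature map; the 8.0 is a single strong response in its window
feature_map = [
    [1.0, 2.0, 0.0, 0.0],
    [3.0, 4.0, 0.0, 8.0],
    [1.0, 1.0, 2.0, 2.0],
    [1.0, 1.0, 2.0, 2.0],
]
max_pooled = pool2d(feature_map, 2, max)
avg_pooled = pool2d(feature_map, 2, mean)
print(max_pooled)
print(avg_pooled)
```

In the top-right window, max pooling keeps the single strong response (8.0), whereas average pooling dilutes it to 2.0; which behaviour is preferable depends on the dataset, which is why the two were compared empirically.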
5.3 The Impact of Tuning Learning Rate and the Choice of the Learning Rate Decay Method
The learning rate is the most crucial hyperparameter that formulates the training process and the converging time of a given DNN. As it determines the periodic update of the network loss whilst the network is being trained, the higher the value it is set at, less time would be needed to the network loss to converge but will be noisier compared to the use of smaller values. Practically, the learning rate can be tuned between 0.1 and 1. Usually, researchers use lower learning rate values in training on complex datasets. However, a slightly higher value of learning-rate is used if the dataset is easier to train, particularly when using in conjunction with a large volume of data, aiming to reduce the learning time, but with the consequence of the training process to become unstable. As the learning rate substantially affects the training speed of a network, the learning rate decay, a parameter that determines the reduction of the learning rate over each epoch (i.e. each iteration of the learning process), can be used to balance between the speeding up the process and converging the network, when the network tends to reach the local minima.
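The interplay between the learning rate and its decay can be illustrated on a toy problem. The sketch below minimises f(w) = w² with plain gradient descent; the time-based decay formula, the rates 0.1 and 0.01 and the decay factor are illustrative assumptions and are not the settings used for YOLO-V3. Because the toy problem is deterministic, it shows only the speed effect, not the noise effect.

```python
def decayed_learning_rate(initial_lr, decay, epoch):
    """Time-based decay (assumed form): the rate shrinks as the epoch index grows."""
    return initial_lr / (1.0 + decay * epoch)


def gradient_descent(lr_schedule, steps, w0=5.0):
    """Minimise f(w) = w**2 (gradient 2w) and return the final weight."""
    w = w0
    for step in range(steps):
        w -= lr_schedule(step) * 2.0 * w
    return w


w_large = gradient_descent(lambda s: 0.1, steps=50)    # constant, larger rate
w_small = gradient_descent(lambda s: 0.01, steps=50)   # constant, smaller rate
w_decay = gradient_descent(lambda s: decayed_learning_rate(0.1, 0.1, s), steps=50)
print(w_large, w_decay, w_small)
```

After the same number of steps, the larger constant rate is closest to the minimum, the smaller constant rate is furthest away, and the decayed schedule lies in between, starting fast and taking progressively smaller steps.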
Given the significant impact of the learning rate on the stability of the training process, we conducted experiments with the learning rate set to 0.001 and 0.0001. As the research dataset is complex in nature (high intra-class variations in sheds and group-of-animals) and the data available per class is limited compared with typical popular object detection datasets, we disregard the time required for the network to train and treat the stability and convergence of training as the crucial criteria. The model performance was therefore evaluated with learning rates of 0.001 and 0.0001, with the learning rate decay omitted. The results are tabulated in Table 7. They show a significant performance improvement when the learning rate is 0.0001 compared with setting it ten times larger, at 0.001. However, it is noted that the result with a learning rate of 0.001 was obtained at 50,000 iterations, whilst the result with a learning rate of 0.0001 was achieved at 180,000 iterations. The selection of the learning rate for a training task is hence a decision that should be made keeping in mind the complexity of the task, the time available for training the network and the relative importance of performance metrics such as precision, recall, F1-score and confidence, which will depend on the application needs.
5.4 An Evaluation of the Use of Optimization Methods in Minimizing Loss
Gradient Descent [7] is the method used for minimizing the loss function, and different optimization methods have been used alongside it to make the model learn quickly and accurately, of which Momentum [50] is the most popular approach in the computer vision research community. However, deep learning practitioners claim that the RMSProp [51] and Adam [52] (which combines Momentum with RMSProp) optimizers work better in practice. To evaluate the effectiveness of these optimizers on the proposed multi-class object detector, we conducted an investigation using the Momentum, RMSProp and Adam optimizers. The results obtained are shown in Table 8.
The results in Table 8 demonstrate the slight improvement in object detection that is enabled by the use of Adam, which is a hybrid between the Momentum and RMSProp optimization approaches.
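The update rules that distinguish the three optimizers can be sketched in a few lines of Python. This is a minimal single-parameter illustration on a toy quadratic loss, not the training code used in this study; the loss, the learning rate, the number of steps and the function names are assumptions, while the Adam decay constants (0.9, 0.999, 1e-8) follow the defaults suggested in [52].

```python
import math


def grad(w):
    """Gradient of the toy loss L(w) = (w - 3)^2 (hypothetical example)."""
    return 2.0 * (w - 3.0)


def momentum_step(w, state, lr=0.01, mu=0.9):
    # The velocity accumulates an exponentially decaying sum of past gradients.
    state["v"] = mu * state.get("v", 0.0) - lr * grad(w)
    return w + state["v"]


def rmsprop_step(w, state, lr=0.01, rho=0.9, eps=1e-8):
    # The step is scaled by a running average of squared gradients.
    g = grad(w)
    state["s"] = rho * state.get("s", 0.0) + (1.0 - rho) * g * g
    return w - lr * g / (math.sqrt(state["s"]) + eps)


def adam_step(w, state, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    # Combines a momentum-like first moment with an RMSProp-like second moment.
    g = grad(w)
    state["t"] = state.get("t", 0) + 1
    state["m"] = beta1 * state.get("m", 0.0) + (1.0 - beta1) * g
    state["v"] = beta2 * state.get("v", 0.0) + (1.0 - beta2) * g * g
    m_hat = state["m"] / (1.0 - beta1 ** state["t"])  # bias correction
    v_hat = state["v"] / (1.0 - beta2 ** state["t"])  # bias correction
    return w - lr * m_hat / (math.sqrt(v_hat) + eps)


def run(step_fn, steps=2000, w0=0.0):
    """Apply one optimizer repeatedly, starting from w0, and return the final w."""
    w, state = w0, {}
    for _ in range(steps):
        w = step_fn(w, state)
    return w


results = {name: run(fn) for name, fn in [
    ("momentum", momentum_step),
    ("rmsprop", rmsprop_step),
    ("adam", adam_step),
]}
for name, w in results.items():
    print(f"{name}: w = {w:.4f}")
```

All three rules move the parameter toward the minimum at w = 3 on this toy problem. Behaviour on a real detection network depends on the loss surface and the data, so the sketch only illustrates how the updates differ, not which optimizer performs best.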
5.5 Overall Performance of the Optimised Multi-class Object Detector
The experiments in Sects. 5.1–5.4 were conducted using default settings for all parameters other than the parameter under investigation. These investigations revealed the efficiencies achievable when the right parameter values are selected and used with the dataset under investigation. Based on the results presented above, which highlighted the optimal setting of each parameter, the YOLO-V3 based multi-class object detector was reconfigured, trained, tested and evaluated. This final model has a ReLU activation function for the hidden units, max pooling toward the end of the network, a learning rate of 0.0001 and an Adam optimizer. The results obtained from this customization are presented in Table 9.
The results tabulated in Table 9 show the performance improvement achieved by the optimised network for multi-class object detection. The number of True Positives (TP) has improved from 838 to 932 accurately detected objects, out of a total of 989 objects annotated with bounding boxes in the images. However, as the learning rate has been reduced compared with that of the network with default parameters, the total number of iterations needed to achieve this result is higher, at 180,000 iterations. It is noted that the precision for palm trees has slightly decreased from 1 to 0.99, due to a single case of FP, as shown in Fig. 2. A tree that is not a palm tree, but has some perceptual similarity to one when seen at a low resolution, has been detected as a palm tree.
Examples of multi-class object detection with the optimized YOLO-V3 CNN are illustrated in Fig. 3a and 3b. They reflect the ability of the learned model to detect different types of sheds oriented at different angles, different groups of animals (the group-of-animals class) with different spatial densities and occlusions, and different sizes of palm trees in the drone-based footage. It is noted that the missed palm trees are those that are either very small in size or of a very low resolution. The missed sheds are those that are oriented differently to the majority of buildings used in training, and the missed groups of animals are those that are sparsely spread within the farm.
Figure 4 illustrates further examples where some types of objects are missed. Despite the above missed and false detections, the multi-class object detector developed in this chapter reduces the rate of missed detections to 6%, compared with the 16% missed detections that resulted from the model with non-optimal parameters.
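The relation between the detection counts and the reported metrics can be checked with a short calculation. The sketch below uses the true-positive and annotation counts quoted above, together with the overall precision of 0.99 and recall of 0.94 reported in the abstract; the function names are illustrative, and deriving the false negatives as annotated objects minus true positives is an assumption about how the counts were tallied.

```python
def recall(tp, fn):
    """Fraction of annotated objects that were detected."""
    return tp / (tp + fn)


def f1_score(precision, recall_value):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall_value / (precision + recall_value)


total_annotated = 989                  # annotated objects in the test images
tp_default, tp_optimised = 838, 932    # true positives before and after tuning

# Assumption: every annotated object that was not detected counts as a false negative.
recall_default = recall(tp_default, total_annotated - tp_default)
recall_optimised = recall(tp_optimised, total_annotated - tp_optimised)

# F1 computed from the rounded precision and recall reported in the abstract.
f1_reported = f1_score(0.99, 0.94)

print(f"recall (default parameters):   {recall_default:.2f}")    # 0.85
print(f"recall (optimised parameters): {recall_optimised:.2f}")  # 0.94
print(f"F1 from reported P and R:      {f1_reported:.2f}")       # 0.96
```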
6 Summary and Conclusion
In this paper, multi-class object detection in drone images was investigated, making the best use of the state-of-the-art CNN architectures, SSD-500 supported by the meta-architectures VGG-16 and ResNet and the YOLO-V3 CNN architecture. The key focus of this paper was to develop a single CNN model that is capable of detecting palm-trees, group-of-animals, and sheds/animal-shelters. Initially, the performance of all three CNN models was compared in the multi-class object detection task, analysing in detail how accurately they detect all three types of objects under default hyper-parameter values. This experiment concluded that YOLO-V3 has superior performance to the two SSD-500 based CNN models in recall, F1-score and average confidence, while all three models provided a precision of 1.
Further detailed investigations were subsequently conducted to decide on the optimal hyper-parameter settings when using YOLO-V3 in the given multi-class object detection task. Specifically, the impact of using different activation functions, pooling methods, learning rates and optimisation methods to minimize loss was investigated and the relevant optimal parameters were obtained. The original YOLO-V3 network was then reconfigured with these optimal parameters and the model was re-trained, tested and evaluated. The experiment showed that the optimised YOLO-V3 CNN model performs significantly better in multi-class object detection in drone images. All performance metrics were substantially improved. Missed detections were carefully studied, leading to the conclusion that, due to the high intra-class variations present in all three types of objects, specifically in animal shelters/sheds, a significant number of balanced examples of such objects needs to be used in training to further improve the accuracy of the proposed model.
References
Lin, T.-Y., et al.: Microsoft coco: common objects in context. In: European Conference on Computer Vision, pp. 740–755 (2014)
Everingham, M., Van Gool, L., Williams, C.K., Winn, J., Zisserman, A.: The pascal visual object classes (voc) challenge. Int. J. Comput. Vis. 88(2), 303–338 (2010)
Drone-based dataset for desert area. Falcon Eye Drones Ltd, Dubai, UAE (2017)
Aburasain, R.Y., Edirisinghe, E.A., Albatay, A.: Drone-based cattle detection using deep neural networks. In: Arai, K., Kapoor, S., Bhatia, R. (eds.) IntelliSys 2020. AISC, vol. 1250, pp. 598–611. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-55180-3_44
Aburasain, R.Y., Edirisinghe, E.A., Albatay, A.: Palm tree detection in drone images using deep convolutional neural networks: investigating the effective use of YOLO V3. In: Conference on Multimedia, Interaction, Design and Innovation, pp. 21–36 (2020)
Voulodimos, A., Doulamis, N., Doulamis, A., Protopapadakis, E.: Deep learning for computer vision: a brief review. Comput. Intell. Neurosci. 2018 (2018)
LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998)
Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Pereira, F., Burges, C.J.C., Bottou, L., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems, vol. 25, pp. 1097–1105. Curran Associates, Inc. (2012). http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf. Accessed 22 Feb 2017
Zeiler, M.D., Fergus, R.: Visualizing and understanding convolutional networks. In: European Conference on Computer Vision, pp. 818–833 (2014)
Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. ArXiv Prepr. ArXiv14091556 (2014)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision And Pattern Recognition, pp. 770–778 (2016)
Szegedy, C., Ioffe, S., Vanhoucke, V., Alemi, A.: Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning (2016). https://arxiv.org/abs/1602.07261v2. Accessed 12 May 2019
Howard, A.G., et al.: Mobilenets: Efficient convolutional neural networks for mobile vision applications. ArXiv Prepr. ArXiv170404861 (2017)
Fei-Fei, L., Deng, J., Li, K.: ImageNet: Constructing a large-scale image database. J. Vis. 9(8), 1037 (2009)
Szegedy, C., et al.: Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9 (2015)
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE Conference on Computer vision and Pattern Recognition, pp. 2818–2826 (2016)
Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation, pp. 580–587 (2014). http://openaccess.thecvf.com/content_cvpr_2014/html/Girshick_Rich_Feature_Hierarchies_2014_CVPR_paper.html. Accessed 17 Jan 2020
Girshick, R.: Fast R-CNN, pp. 1440–1448 (2015). http://openaccess.thecvf.com/content_iccv_2015/html/Girshick_Fast_R-CNN_ICCV_2015_paper.html. Accessed 26 Apr 2019
Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: Cortes, C., Lawrence, N.D., Lee, D.D., Sugiyama, M., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 28, pp. 91–99. Curran Associates, Inc. (2015). http://papers.nips.cc/paper/5638-faster-r-cnn-towards-real-time-object-detection-with-region-proposal-networks.pdf
Liu, W., et al.: Ssd: Single shot multibox detector. In: European Conference on Computer Vision, pp. 21–37 (2016)
Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: unified, real-time object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 779–788 (2016)
Uijlings, J.R., Van De Sande, K.E., Gevers, T., Smeulders, A.W.: Selective search for object recognition. Int. J. Comput. Vis. 104(2), 154–171 (2013)
Dai, J., Li, Y., He, K., Sun, J.: R-FCN: Object Detection via Region-based Fully Convolutional Networks (2016). https://arxiv.org/abs/1605.06409v2. Accessed 5 May 2019
Redmon, J.: Darknet: Open source neural networks in c (2013)
Redmon, J., Farhadi, A.: Yolov3: An incremental improvement. ArXiv Prepr. ArXiv180402767 (2018)
Radovic, M., Adarkwa, O., Wang, Q.: Object recognition in aerial images using convolutional neural networks. J. Imaging 3(2), 21 (2017). https://doi.org/10.3390/jimaging3020021
Ikshwaku, S., Srinivasan, A., Varghese, A., Gubbi, J.: Railway corridor monitoring using deep drone vision. In: Computational Intelligence: Theories, Applications and Future Directions - Volume II, pp. 361–372 (2019)
Al-Sa’d, M.F., Al-Ali, A., Mohamed, A., Khattab, T., Erbad, A.: RF-based drone detection and identification using deep learning approaches: an initiative towards a large open source drone database. Future Gener. Comput. Syst. 100, 86–97 (2019). https://doi.org/10.1016/j.future.2019.05.007
Varghese, A., Gubbi, J., Sharma, H., Balamuralidhar, P.: Power infrastructure monitoring and damage detection using drone captured images. In: 2017 International Joint Conference on Neural Networks (IJCNN), pp. 1681–1687 (2017). https://doi.org/10.1109/IJCNN.2017.7966053
Shao, W., Kawakami, R., Yoshihashi, R., You, S., Kawase, H., Naemura, T.: Cattle detection and counting in UAV images based on convolutional neural networks. Int. J. Remote Sens. 41(1), 31–52 (2020). https://doi.org/10.1080/01431161.2019.1624858
Kellenberger, B., Volpi, M., Tuia, D.: Fast animal detection in UAV images using convolutional neural networks. In: 2017 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), pp. 866–869 (2017)
Malek, S., Bazi, Y., Alajlan, N., AlHichri, H., Melgani, F.: Efficient framework for palm tree detection in UAV images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 7(12), 4692–4703 (2014). https://doi.org/10.1109/JSTARS.2014.2331425
Moreira, A.: Estimating babassu palm density using automatic palm tree detection with very high spatial resolution satellite images (2017). https://www.sciencedirect.com/science/article/pii/S0301479717301081. Accessed 27 Jun 2019
Wang, Y., Zhu, X., Wu, B.: Automatic detection of individual oil palm trees from UAV images using HOG features and an SVM classifier. Int. J. Remote Sens. 40(19), 7356–7370 (2019). https://doi.org/10.1080/01431161.2018.1513669
Al Mansoori, S., Kunhu, A., Al Ahmad, H.: Automatic palm trees detection from multispectral UAV data using normalized difference vegetation index and circular Hough transform. In: High-Performance Computing in Geoscience and Remote Sensing VIII, vol. 10792, p. 1079203 (2018)
AlMaazmi, A.: Palm trees detecting and counting from high-resolution WorldView-3 satellite images in United Arab Emirates. In: Remote Sensing for Agriculture, Ecosystems, and Hydrology XX, vol. 10783, p. 107831M (2018). https://doi.org/10.1117/12.2325733
Freudenberg, M., Nölke, N., Agostini, A., Urban, K., Wörgötter, F., Kleinn, C.: Large scale palm tree detection in high resolution satellite images using U-Net. Remote Sens. 11(3), 312 (2019). https://doi.org/10.3390/rs11030312
Mubin, N.A., Nadarajoo, E., Shafri, H.Z.M., Hamedianfar, A.: Young and mature oil palm tree detection and counting using convolutional neural network deep learning method. Int. J. Remote Sens. 40(19), 7500–7515 (2019). https://doi.org/10.1080/01431161.2019.1569282
Zortea, M., Nery, M., Ruga, B., Carvalho, L.B., Bastos, A.C.: Oil-palm tree detection in aerial images combining deep learning classifiers. In: IGARSS 2018 - 2018 IEEE International Geoscience and Remote Sensing Symposium, pp. 657–660 (2018). https://doi.org/10.1109/IGARSS.2018.8519239
Yousif, H., Yuan, J., Kays, R., He, Z.: Fast human-animal detection from highly cluttered camera-trap images using joint background modeling and deep learning classification. In: 2017 IEEE International Symposium on Circuits and Systems (ISCAS), pp. 1–4 (2017). https://doi.org/10.1109/ISCAS.2017.8050762
Gomez Villa, A., Salazar, A., Vargas, F.: Towards automatic wild animal monitoring: Identification of animal species in camera-trap images using very deep convolutional neural networks. Ecol. Inform. 41, 24–32 (2017). https://doi.org/10.1016/j.ecoinf.2017.07.004
Norouzzadeh, M.S., et al.: Automatically identifying, counting, and describing wild animals in camera-trap images with deep learning. Proc. Natl. Acad. Sci. 115(25), E5716–E5725 (2018). https://doi.org/10.1073/pnas.1719367115
Rivas, A., Chamoso, P., González-Briones, A., Corchado, J.M.: Detection of cattle using drones and convolutional neural networks. Sensors 18(7), 2048 (2018). https://doi.org/10.3390/s18072048
Saqib, M., Khan, S.D., Sharma, N., Blumenstein, M.: A study on detecting drones using deep convolutional neural networks. In: 2017 14th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), pp. 1–5 (2017). https://doi.org/10.1109/AVSS.2017.8078541
Kim, B.K., Kang, H.-S., Park, S.-O.: Drone classification using convolutional neural networks with merged doppler images. IEEE Geosci. Remote Sens. Lett. 14(1), 38–42 (2016)
Von Luxburg, U.: A tutorial on spectral clustering. Stat. Comput. 17(4), 395–416 (2007)
Buettner, R., Baumgartl, H.: A Highly Effective Deep Learning Based Escape Route Recognition Module for Autonomous Robots in Crisis and Emergency Situations (2019). http://scholarspace.manoa.hawaii.edu/handle/10125/59506. Accessed 05 Jun 2019
Yoon, I., Anwar, A., Rakshit, T., Raychowdhury, A.: Transfer and online reinforcement learning in STT-MRAM based embedded systems for autonomous drones. In: 2019 Design, Automation Test in Europe Conference Exhibition (DATE), pp. 1489–1494 (2019). https://doi.org/10.23919/DATE.2019.8715066
Aburasain, R.Y.: Application of convolutional neural networks in object detection, re-identification and recognition. Loughborough University (2020)
Polyak, B.T.: Some methods of speeding up the convergence of iteration methods. USSR Comput. Math. Math. Phys. 4(5), 1–17 (1964)
Kurbiel, T., Khaleghian, S.: Training of Deep Neural Networks based on Distance Measures using RMSProp. ArXiv170801911 Cs Stat (2017). http://arxiv.org/abs/1708.01911. Accessed 15 Apr 2020
Kingma, D.P., Ba, J.: Adam: A Method for Stochastic Optimization. ArXiv14126980 Cs (2017). http://arxiv.org/abs/1412.6980. Accessed 15 Apr 2020
Rights and permissions
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Copyright information
© 2022 The Author(s)
About this paper
Cite this paper
Aburasain, R.Y., Edirisinghe, E.A., Zamim, M.Y. (2022). A Coarse-to-Fine Multi-class Object Detection in Drone Images Using Convolutional Neural Networks. In: Biele, C., Kacprzyk, J., Kopeć, W., Owsiński, J.W., Romanowski, A., Sikorski, M. (eds) Digital Interaction and Machine Intelligence. MIDI 2021. Lecture Notes in Networks and Systems, vol 440. Springer, Cham. https://doi.org/10.1007/978-3-031-11432-8_2
DOI: https://doi.org/10.1007/978-3-031-11432-8_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-11431-1
Online ISBN: 978-3-031-11432-8
eBook Packages: Intelligent Technologies and Robotics (R0)