1 Introduction

Convolutional Neural Networks (CNNs) [5] have led to tremendous accuracy increases in vision tasks like classification [2] and detection [8, 9], in part due to the availability of large-scale datasets like ImageNet [11]. Many vision benchmarks feature a controlled situation in which all classes occur with roughly similar frequencies. In practice, however, this is not always the case. For example, in animal censuses on images from Unmanned Aerial Vehicles (UAVs) [6], the vast majority of images are empty. As a consequence, training a deep model on such datasets as one would in a classical balanced setting may lead to unusable results.

In this paper, we present a collection of recommendations that allow training deep CNNs on heavily imbalanced datasets (Sect. 2), demonstrated with the application of big mammal detection in UAV imagery. We assess the contribution of each recommendation in a hold-one-out fashion and further compare a CNN trained with all of them to the current state-of-the-art (Sect. 4), where we manage to increase the precision from 9% to 40% for high target recalls. The paper is based on [3].

2 Proposed Training Practices

The following paragraphs briefly describe the five recommendations that make training on an imbalanced dataset possible:

Curriculum Learning. For the first five training epochs, we sample the training images so that they always contain at least one animal. This is inspired by Curriculum Learning [1] and makes the CNN learn initial representations of both animals and background. This provides it with a better starting point for the imbalance problem later on.
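The sampling rule can be illustrated with a minimal sketch. It assumes that every training image comes with a count of annotated animals; the function name `sample_epoch`, the `(image_id, num_animals)` tuples and the 0-based epoch index are hypothetical choices made for this illustration and are not taken from the original implementation.

```python
import random


def sample_epoch(images, epoch, warmup_epochs=5, rng=None):
    """Return a shuffled list of image ids to train on in `epoch` (0-based).

    images: list of (image_id, num_animals) tuples (assumed format).
    During the first `warmup_epochs` epochs only images that contain
    at least one animal are used; afterwards every image is used.
    """
    rng = rng or random.Random(0)
    if epoch < warmup_epochs:
        pool = [image_id for image_id, num_animals in images if num_animals > 0]
    else:
        pool = [image_id for image_id, _ in images]
    rng.shuffle(pool)
    return pool
```

With a list such as `[("a", 0), ("b", 2)]`, the first five epochs would draw only `"b"`, and both images would be drawn from the sixth epoch on.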

Rotational Augmentation. Due to the overhead perspective, we employ image rotations in \(90^{\circ }\) steps as augmentation. We empirically found this to be most effective at a late training stage (from epoch 300 on), when the CNN is starting to converge to a stable solution.
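A minimal sketch of such a late-stage augmentation is given below. It represents the image and its label grid as nested lists purely for illustration; the helper names `rot90` and `augment` are hypothetical, and applying the same random rotation to the labels is an assumption that follows from the grid-shaped output.

```python
import random


def rot90(grid, k):
    """Rotate a 2-D list counter-clockwise by k * 90 degrees."""
    for _ in range(k % 4):
        # Transpose, then reverse the row order: one counter-clockwise turn.
        grid = [list(row) for row in zip(*grid)][::-1]
    return grid


def augment(image, labels, epoch, start_epoch=300, rng=None):
    """Apply one random 90-degree-step rotation once `start_epoch` is reached."""
    rng = rng or random
    if epoch < start_epoch:
        return image, labels
    k = rng.randrange(4)  # 0, 90, 180 or 270 degrees
    return rot90(image, k), rot90(labels, k)
```

Rotating the image and the label grid with the same `k` keeps the annotated animal positions aligned with the pixels.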

Hard Negative Mining. After epoch 80 we expect the model to have roughly learned the animal and background appearances, and thus focus on reducing the number of false positives. To do so, we amplify the weights of the four most confidently predicted false alarms in every training image for the rest of the training schedule.
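One way to read this step is as a per-location loss weight map, sketched below. The amplification factor `boost`, the confidence `threshold` that defines a false alarm, and the function name are assumptions made for this illustration; the text above specifies only the starting epoch and the number of false alarms per image.

```python
def hard_negative_weights(animal_scores, labels, epoch, after_epoch=80,
                          top_k=4, boost=2.0, threshold=0.5, animal_class=1):
    """Return per-location loss weights for one training image.

    animal_scores: 2-D list of predicted animal confidences.
    labels:        2-D list of ground-truth class indices.
    All weights are 1.0 up to and including `after_epoch`. Later, the
    `top_k` most confident false alarms get their weight multiplied by
    `boost` (assumed value, not taken from the text).
    """
    height, width = len(labels), len(labels[0])
    weights = [[1.0] * width for _ in range(height)]
    if epoch <= after_epoch:
        return weights
    # A false alarm: confident animal prediction where no animal is labeled.
    false_alarms = [
        (animal_scores[r][c], r, c)
        for r in range(height)
        for c in range(width)
        if labels[r][c] != animal_class and animal_scores[r][c] >= threshold
    ]
    false_alarms.sort(reverse=True)
    for _, r, c in false_alarms[:top_k]:
        weights[r][c] *= boost
    return weights
```

The resulting map would be multiplied into the per-location loss before averaging, so that the selected false alarms contribute more to the gradient.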

Border Class. Due to the CNN’s receptive field capturing spatial context, we frequently observed activations in the vicinity of the animals, leading to false alarms. To remedy this effect, we label the 8-neighborhood around true animal locations with a third class (denoted as “border”). This way, the CNN learns to treat the surroundings of the animals separately, providing only high confidence for an animal in its true center. At test time, we simply discard the border class by merging it with the background.
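The labeling and the test-time merge can be sketched as follows. The class indices, the function names and the choice to merge by adding the border probability to the background probability are assumptions made for this illustration.

```python
BACKGROUND, ANIMAL, BORDER = 0, 1, 2  # assumed class indices


def add_border_class(labels):
    """Mark the 8-neighborhood of every animal cell as BORDER.

    labels: 2-D list containing BACKGROUND and ANIMAL entries.
    Animal cells are never overwritten.
    """
    height, width = len(labels), len(labels[0])
    out = [row[:] for row in labels]
    for r in range(height):
        for c in range(width):
            if labels[r][c] != ANIMAL:
                continue
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    inside = 0 <= rr < height and 0 <= cc < width
                    if inside and out[rr][cc] == BACKGROUND:
                        out[rr][cc] = BORDER
    return out


def merge_border(probs):
    """Fold the border probability into the background at test time.

    probs: (p_background, p_animal, p_border) for one grid location.
    Returns (p_background + p_border, p_animal).
    """
    p_background, p_animal, p_border = probs
    return (p_background + p_border, p_animal)
```

For a single animal in the interior of the grid, this yields one animal cell surrounded by eight border cells, while all remaining cells stay background.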

Class Weighting. We balance the gradients during training with constant weights corresponding to the inverse class frequencies observed in the training set.
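As an illustration, inverse-frequency weights could be computed from the training label grids as sketched below. The function name and the unnormalized form `total / count` are assumptions; the text above states only that the weights correspond to the inverse class frequencies.

```python
from collections import Counter


def inverse_frequency_weights(label_grids):
    """Return {class_index: weight} with weight = 1 / relative frequency.

    label_grids: iterable of 2-D lists of class indices, one per image.
    """
    counts = Counter(
        label for grid in label_grids for row in grid for label in row
    )
    total = sum(counts.values())
    return {cls: total / count for cls, count in counts.items()}
```

In a grid with eight background cells and one animal cell, the animal class would receive a weight eight times larger than the background class, which counteracts the dominance of background locations in the gradient.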

3 Experiments

3.1 The Kuzikus Dataset

We demonstrate our training recommendations on a dataset of UAV images over the Kuzikus game reserve, Namibia. Kuzikus contains an estimated 3000 large mammals such as the Black Rhino, Zebras, Kudus and more, distributed over \(\mathrm {103\,km^2}\) [10]. The dataset was acquired in May 2014 by the SAVMAP Consortium, using a SenseFly eBee with a Canon PowerShot S110 RGB camera as payload. The campaign yielded a total of 654 \(\mathrm {4000\,\times \,3000}\) images, covering \(\mathrm {13.38\,km^2}\) with around 4 cm resolution. 1183 animals could be identified in a crowdsourcing campaign [7]. The data were then divided image-wise into 70% training, 10% validation and 20% test sets.

3.2 Model Setup

We employ a CNN that accepts an input image of \(512\,\times \,512\) pixels and yields a \(32\,\times \,32\) grid of class probability scores. We base it on a pre-trained ResNet-18 [2] and replace the last layer with two new ones that map the 512 activations to 1024, then to the 3 classes, respectively. We add a ReLU and dropout [12] with probability 0.5 in between for further regularization. The model is trained using the Adam optimizer [4] with weight decay and a gradually decreasing learning rate for a total of 400 epochs.

We assess all recommendations in a hold-one-out fashion, and further compare them to a full model and the current state-of-the-art on the dataset, which employs a classifier on proposals and hand-crafted features (see [10] for details).

4 Results and Discussion

Figure 1 shows the precision-recall curves for all the models.

Fig. 1. Precision-recall curves based on the animal confidence scores for the hold-one-out CNNs (first six models), the full model and the baseline

All recommendations boost precision, but to varying degrees. For example, disabling curriculum learning (“CNN 3”) yields the worst precision at high recalls: too many background samples from the start seem to drown out any signal from the few animals. Unsurprisingly, a model trained only on images that contain at least one animal (“CNN 2”) performs similarly poorly: it sees only a portion of the background samples and therefore yields too many false alarms. The full model provides the highest precision scores of up to 40% at high recalls of 80% and more. At this recall level, the baseline reaches less than 10% precision, predicting false alarms virtually everywhere. In numbers, this means that at 80% recall our model predicts 447 false positives, while the baseline produces 2546.
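As a reading aid for these numbers, the standard definitions of precision and recall are restated here, with \(\mathrm{TP}\), \(\mathrm{FP}\) and \(\mathrm{FN}\) denoting true positives, false positives and false negatives (symbols introduced for this note only):

```latex
\[
  \mathrm{precision} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}},
  \qquad
  \mathrm{recall} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}
\]
\[
  \frac{\mathrm{FP}_{\mathrm{baseline}}}{\mathrm{FP}_{\mathrm{full}}}
  = \frac{2546}{447} \approx 5.7
\]
```

Assuming both models are evaluated against the same test annotations, fixing the recall fixes the number of true positives, so the precision gap at 80% recall stems from the false positive counts alone, which differ by a factor of roughly 5.7.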

5 Conclusion

Many real-world computer vision problems are characterized by significant class imbalances, which in the worst case make out-of-the-box applications of deep CNNs infeasible. An example is the detection of large mammals in UAV images, most of which are empty. In this paper, we presented a series of practices that enable training CNNs by limiting the risk of the background class drowning out the few positives. We analyzed the contribution of each individual practice (curriculum learning, hard negative mining, etc.) and showed how a CNN trained with all of them yields a substantially higher precision when tuned for high recalls.