
1 Introduction

Conserving and managing diverse mammalian communities so that their population status is secured and conflict with humans is minimized requires, first of all, comprehensive data-derived knowledge of that status. There is a growing awareness that standard wildlife monitoring methods (e.g. snow tracking or hunting-bag data) are neither effective nor easy to scale up. Therefore, numerous new initiatives to monitor mammals using camera traps are currently being developed across Europe, collectively generating enormous amounts of pictures and videos. However, a large part of the available data is not effectively exploited, mainly because of the human time required to mine it from the collected raw multimedia files.

Camera trapping has already proved to be one of the most important technologies in wildlife conservation and ecological research [1,2,3,4,5]. Rapid developments in the application of AI are accelerating the transformation of this area and make it likely that most of the recorded material will be classified automatically in the future [6-8, 11-13]. However, despite the increasing availability of deep learning models for object recognition [14, 15, 17, 18], the effective use of this technology to support wildlife monitoring remains limited, mainly because of the complexity of DL technology, the lack of end-to-end pipelines, and high computing requirements.

In this study, we present the preliminary results of applying a new standalone AI model for both object detection and species-level classification of camera trapping data from a European temperate lowland forest, the Białowieża Forest, Poland. Our model was built on the extremely fast, lightweight and flexible deep learning architecture of the recently published YOLOv5 (pre-trained YOLOv5l) [9] and trained on 2659 labeled images accessed via the API built into TRAPPER [10]. To the best of our knowledge, this is the first YOLOv5 implementation for automated mammal species recognition using camera trap images.

2 Materials and Methods

2.1 Dataset Preparation and Preprocessing

As the main data source in our research we used species-classified images originating from camera trapping projects in the Białowieża Primeval Forest stored in TRAPPER [10]. The dataset consisted of 2659 images. Bounding boxes were manually added to all images with animals present to determine the exact position of each individual; when multiple individuals were present in an image, a separate bounding box was created for each of them. Example annotations and images are shown in Fig. 3. We did not use empty images in our dataset.

The dataset contains 12 classes: 11 species of animals commonly occurring in the Białowieża Primeval Forest and one class, “Other”, which represents birds and small rodents. Images in the dataset were of various sizes; most of them are 12-Mpixel high-resolution images. The distribution of image sizes is shown in Fig. 1.

Fig. 1. The distribution of the image sizes (pixels) in the dataset.

Observations are not balanced across these species; some species have larger and others smaller sample support (Fig. 2).

Fig. 2. The distribution of species in the dataset.

Fig. 3. Examples of images and annotations in the TRAPPER platform.

In our experiments we decided to perform cross-validation, i.e. splitting the dataset into k equal-sized parts and performing k training runs. In each training run, one split of the data was used for validation and the rest was used for training the network. Images were assigned to splits randomly, with stratification on animal species, so that each split had the same proportions of the observed species.
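For illustration, such a stratified split can be produced with scikit-learn's StratifiedKFold; the sketch below assumes that each image has been assigned a single species label (e.g. the species of its first bounding box) and that the function and variable names are illustrative rather than taken from our pipeline.

    from sklearn.model_selection import StratifiedKFold

    def make_stratified_splits(image_paths, image_species, n_splits=5, seed=42):
        """Split images into n_splits folds while preserving species proportions.

        image_paths   -- list of image file paths
        image_species -- one species label per image (used for stratification)
        """
        skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
        folds = []
        for train_idx, val_idx in skf.split(image_paths, image_species):
            folds.append({
                "train": [image_paths[i] for i in train_idx],
                "val": [image_paths[i] for i in val_idx],
            })
        return folds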

2.2 Deep Learning Architecture

The following data pipeline was adopted:

  1. The dataset was downloaded from TRAPPER using the dedicated API.

  2. Species with fewer than 40 camera trap images were excluded.

  3. The image annotations were converted to the input format required by the YOLOv5 architecture (a sketch of this conversion is shown after the list).

  4. The model was trained and evaluated in 5-fold cross-validation.
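Step 3 corresponds to rewriting each bounding box into the YOLOv5 label format: one text file per image, with one line per box holding the class index and the box centre and size normalised by the image dimensions. The sketch below assumes boxes are given in absolute pixel coordinates as (class_id, x_min, y_min, width, height); the function and variable names are illustrative.

    from pathlib import Path

    def to_yolo_line(class_id, x_min, y_min, box_w, box_h, img_w, img_h):
        # YOLOv5 expects "<class> <x_center> <y_center> <width> <height>",
        # with all coordinates normalised to the [0, 1] range.
        x_center = (x_min + box_w / 2) / img_w
        y_center = (y_min + box_h / 2) / img_h
        return f"{class_id} {x_center:.6f} {y_center:.6f} {box_w / img_w:.6f} {box_h / img_h:.6f}"

    def write_labels(image_name, boxes, img_w, img_h, labels_dir="labels"):
        # boxes: list of (class_id, x_min, y_min, box_w, box_h) tuples for one image.
        lines = [to_yolo_line(*box, img_w, img_h) for box in boxes]
        out_file = Path(labels_dir) / (Path(image_name).stem + ".txt")
        out_file.write_text("\n".join(lines))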

YOLOv5 is the deep learning architecture we selected for this research. It achieves state-of-the-art results in object detection. In comparison to other deep learning architectures, YOLOv5 is simple and reliable: it needs much less computational power while achieving comparable results [14, 15] and runs much faster than other networks (Fig. 4). YOLOv5 builds strongly on the architecture of YOLOv4 [18]. The encoder used in YOLOv5 is CSPDarknet [18], which together with the Path Aggregation Network (PANet) [17] makes up the whole network. In comparison to YOLOv4, the activation functions were modified: the Leaky ReLU and Hardswish activations were replaced with the SiLU [19] activation function.

The selection of YOLOv5 architecture for this research was motivated by several reasons:

  1. The network is currently state-of-the-art in fast object detection.

  2. The architecture is light-weight, which allows us to train the model with small computational resources and keep it cost-effective.

  3. The small size of the model makes it potentially usable on mobile devices (e.g. camera traps).

Fig. 4. Comparison between YOLOv5 models and EfficientDet. Published with the author's permission.

2.3 Model Training Process

The first stage of model training was hyper-parameter tuning. For that purpose, we used the evolutionary hyper-parameter tuning method provided with YOLOv5 on the training and validation data, which yielded hyper-parameters better suited to our dataset. In the next step, we trained the model with these hyper-parameters, starting from an already trained YOLOv5l model checkpoint.

Using a pre-trained model is a common technique in computer vision called transfer learning [20]; it speeds up the training process and keeps generalization at a high level. During our experiments we observed that the optimal number of epochs is 60, after which there are only negligible changes in the model. Figure 5 shows how the YOLOv5 loss functions changed during training; the results are shown for one of the cross-validation training splits.
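For illustration, both stages can be run with the train.py script of the YOLOv5 repository; the commands below are a sketch of this setup invoked from Python (the dataset YAML name, image size and batch size are illustrative, and the location of the evolved hyper-parameter file depends on the YOLOv5 version).

    import subprocess

    # Stage 1: evolutionary hyper-parameter search on the training/validation data,
    # starting from the pre-trained YOLOv5l checkpoint (--evolve flag of YOLOv5).
    subprocess.run([
        "python", "train.py",
        "--img", "640", "--batch", "16", "--epochs", "60",
        "--data", "trapper_fold1.yaml",
        "--weights", "yolov5l.pt",
        "--evolve",
    ], check=True)

    # Stage 2: final training with the evolved hyper-parameters (transfer learning
    # from the same YOLOv5l checkpoint); adjust the --hyp path to wherever stage 1
    # wrote its hyp_evolved.yaml.
    subprocess.run([
        "python", "train.py",
        "--img", "640", "--batch", "16", "--epochs", "60",
        "--data", "trapper_fold1.yaml",
        "--weights", "yolov5l.pt",
        "--hyp", "hyp_evolved.yaml",
    ], check=True)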

The YOLOv5 loss function is a sum of three smaller loss functions:

  • Bounding Box Regression Loss - penalty for wrong anchor box detection, Mean Squared Error calculated based on predicted box location (x, y, h, w);

  • Classification Loss - Cross Entropy calculated for object classification;

  • Objectness Loss - Mean Squared Error calculated for Objectness-Confidence Score (estimation if the anchor box contains an object).
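Schematically, the total loss optimised during training can therefore be written as a weighted sum of these three terms, L = λ_box · L_box + λ_obj · L_obj + λ_cls · L_cls, where the λ weighting factors are YOLOv5 hyper-parameters.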

The plots below (Figs. 5 and 6) show that after 51 epochs there is minimal change in the loss functions as well as in the F1-score.

Fig. 5. Bounding box regression, classification and objectness loss changes during model training on the first cross-validation split.

Fig. 6. F1-score changes during model training on the first cross-validation split.

The main software used to train the models was Python 3.8, with PyTorch 1.7, CUDA 11.2 and Jupyter Notebook. All AI model iterations, including the final one, were trained using cloud computing on Microsoft Azure. We executed the experiment on a virtual machine with preinstalled Linux software and a single NVIDIA Tesla K80 GPU. A single training run (60 epochs) on the training data took 2 h (8.5 h for the whole cross-validation).

3 Results and Discussion

The preliminary results, obtained with a limited amount of training data, showed an average F1-score of 85% in the identification of the 12 most commonly occurring medium-sized and large mammal species in the Białowieża Forest (including the “Other” class). The results presented in this study are averaged over all 5 validation splits of the cross-validation process. Table 1 shows the results of the combined evaluation. From the data we can see that the detection precision decreases with decreasing sample size. For example, the “roe deer” class has the lowest F1-score (0.58) because of the low abundance of this species in our dataset. As a result, this species is often not detected by our model, as shown in Figs. 7 and 8. Future experiments should address this issue.

Another insight from Fig. 7 is that the “roe deer” class is often misclassified as “red deer” (15%). These species are similar in size and coloration, and the mistakes might be caused by the low number of targets: “red deer” occurs 872 times, whereas there are only 97 instances of “roe deer” in the dataset. To address this issue, we suggest re-running these analyses on larger samples of classified images with bounding boxes.
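For reference, per-class metrics and a normalised confusion matrix such as the one in Fig. 7 can be derived with scikit-learn once each matched bounding box has a ground-truth and a predicted species label; the sketch below is illustrative and not our exact evaluation code.

    from sklearn.metrics import confusion_matrix, classification_report

    def per_class_report(y_true, y_pred, species):
        # y_true / y_pred: one species name per matched bounding box.
        # Returns a row-normalised confusion matrix (rows sum to 1, as in Fig. 7)
        # and per-class precision, recall and F1-score.
        cm = confusion_matrix(y_true, y_pred, labels=species, normalize="true")
        report = classification_report(y_true, y_pred, labels=species,
                                       digits=2, zero_division=0)
        return cm, report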

Table 1. The combined results of the 5-fold cross-validation process.
Fig. 7. Confusion matrix of the predictions on the test data from the first cross-validation split.

Fig. 8. Precision-Recall curves for the first cross-validation split. Values in the plot legend show the AUC score for each species.

3.1 Incorrect Classification Examples

We are aware of the shortcomings of our classification model. The three most commonly registered issues are: misclassifying trees as animals (Fig. 9), assigning animals to incorrect classes, and not detecting animals at all (Fig. 10).

Fig. 9. Example of trees misclassified as animals.

Fig. 10. Incorrect species classification. A European bison is classified as “red deer” and a wild boar as “wolf”. The rightmost image is an example of an unrecognized red fox.

3.2 Correct Classification Examples

The tested solution works with the expected accuracy level on images with partially visible animals. Figure 11 shows the correct detection and classification of a red fox with only half of its body visible in the frame. Further examples (Fig. 12) show that the model is also able to detect animals that blend in with the background.

Fig. 11. The leftmost image shows correct detection of a partially visible animal; the center and right images show correct detection of barely visible animals.

Fig. 12. Correct detection of animals blended with the background.

4 Conclusions

The preliminary results presented in this study demonstrate the large potential of the YOLOv5 deep learning architecture for training AI models for automatic species recognition. Moreover, such models can be directly incorporated into camera trapping data processing workflows. An extended camera trap image dataset would allow for higher metrics and better detection of species from the Białowieża Forest.

As a future step, we expect that the inclusion of additional training datasets from multiple European forest areas would greatly improve the generalization of the deep learning model. This work also suggests that there is room for improvement in the selected YOLOv5 network architecture and the inference pipeline. The encoder network could be replaced with a deeper structure able to extract more specific features from the images. Another common practice to improve results is the use of Test-Time Augmentation (TTA) during inference.
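As an illustration, TTA is already exposed by the YOLOv5 inference API; the sketch below uses the public torch.hub entry point with a generic pre-trained YOLOv5l checkpoint (not our fine-tuned model) and an illustrative image path.

    import torch

    # Load YOLOv5l through the public torch.hub entry point; fine-tuned weights
    # could instead be loaded with the 'custom' variant and a local checkpoint.
    model = torch.hub.load("ultralytics/yolov5", "yolov5l", pretrained=True)

    # augment=True enables test-time augmentation: scaled and flipped copies of
    # the input are run through the network and the detections are merged.
    results = model("camera_trap_image.jpg", augment=True)
    results.print()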

The pre-trained YOLOv5l model is fast, accurate and light-weight enough to be deployed on a functional Edge AI computing platform. This advantage opens a new chapter for real-time species classification by camera traps in the field.

Data Availability

The Jupyter Notebook implementing this experimental pipeline, including the TRAPPER integration and the use of YOLOv5 for species recognition, has been released as open-source code and is available on GitLab: https://gitlab.com/oscf/trapper-species-classifier.

Authors’ Contributions

MCH, MR and PT from Bialystok University of Technology designed, developed and tested the ML infrastructure. JB, MCH and DK from the Polish Academy of Sciences delivered the training and testing data, provided domain support and verified the research results.

Conflict of Interest

The authors declare that they have no conflicts of interest.

Ethical Approval

All applicable, international, national, and/or institutional guidelines for the care and use of animals were followed.