1 Introduction

Motivation As the biodiversity crisis intensifies, the survival of many endangered species grows increasingly precarious, evidenced by species diversity continuing to fall at an unprecedented rate (Ceballos et al., 2020; Vié et al., 2009). The great ape family, whose survival is threatened by habitat degradation and fragmentation, climate change, hunting and disease, is a prime example (Carvalho et al., 2021). The International Union for Conservation of Nature (IUCN) considers all three member species, that is, orangutans, gorillas, and chimpanzees (including bonobos), to be either endangered or critically endangered.

The threat to great apes has far-reaching ecological implications. Great apes contribute to the balance of healthy ecosystems through seed dispersal, consumption of leaves and bark, and the shaping of habitats by creating canopy gaps and trails (Chappell & Thorpe, 2022; Haurez et al., 2015; Tarszisz et al., 2018). They also form part of complex forest food webs, and their removal would have cascading consequences for local food chains. In addition, great apes are our closest evolutionary relatives and a key target for anthropological research. We share 97% of our DNA with the phylogenetically most distant orangutans and 98.8% with the closer chimpanzees and bonobos. The study of great apes, including their physiology, genetics, and behaviour, is essential to addressing questions of human nature and evolution (Pollen et al., 2023). Urgent conservation action for the protection and preservation of these emblematic species is therefore imperative.

The timely and efficient assessment of great ape presence, abundance, distribution, and behaviour is becoming increasingly important in evaluating the effectiveness of conservation policies and intervention measures. The potential of exploiting camera trap imagery for conservation or biological modelling is well recognised (Kühl & Burghardt, 2013; Tuia et al., 2022). However, even small camera networks generate large volumes of data (Fegraus et al., 2011), and the number and complexity of downstream processing tasks required to perform ecological analysis is extensive. Typically, ecologists first need to identify the videos that contain footage of the target study species before performing further downstream analyses, such as estimating the distance of animals from the camera (i.e., camera trap distance sampling) to calculate species density, or identifying ecologically or anthropologically important behaviours, such as tool use or camera reactivity (Houa et al., 2022). Performing these tasks manually is time-consuming and limited by the availability of human resources and expertise, becoming infeasible at large scale. This underlines the need for rapid, scalable, and efficient deep learning methods for automating the detection and assessment of great ape populations and the analysis of their behaviours.

To facilitate the development of methods for automating the interpretation of camera trap data, large-scale, open-access video datasets must be available to the relevant scientific communities, whilst removing geographic details that could potentially threaten the safety of animals (Tuia et al., 2022). Unlike the field of human action recognition and behaviour understanding, where several large, widely acknowledged datasets exist for benchmarking (Kay et al., 2017; Kuehne et al., 2011; Soomro et al., 2012), the number of great ape datasets is limited and those that are currently available lack scale, diversity and rich annotations.

Contribution In this study, we present the PanAf20K dataset, the largest and most diverse open-access video dataset of great apes in the wild—ready for AI training. The dataset comprises footage collected from 18 study sites across 15 African countries, featuring apes in over 20 distinct habitats (e.g., forests, savannahs, and marshes). It captures great apes at over 100 individual locations (e.g., trails, termite mounds, and water sources) displaying an extensive range of 18 behaviour categories. A visual overview of the dataset is presented in Fig. 1. The footage is accompanied by a rich set of annotations suitable for a range of ecologically important tasks such as detection, action localisation, and fine-grained and multi-label behaviour recognition.

Fig. 1

PanAf20K visual overview. We present the largest and most diverse open-access video dataset of great apes in the wild. It comprises \(\sim \) 20,000 videos and more than 7 million frames extracted from camera traps at 18 study sites spanning 15 African countries. Shown are 25 representative still frames from the dataset highlighting its diversity with respect to many important aspects such as behavioural activities, species, number of apes, habitat, day/night recordings, scene lighting, and more

Paper Organisation Following this introduction, Sect. 2 reviews existing animal behaviour datasets and methodologies for great ape detection and behaviour recognition. Section 3 describes both parts of the dataset, the PanAf20K and the PanAf500, and details how the data was collected and annotated. Benchmark results for several computer vision tasks are presented in Sect. 4. Section 5 discusses the main findings and limitations alongside future research directions, while Sect. 6 summarises the dataset and highlights its potential applications.

2 Related Work

Great Ape Video Datasets for AI Development While there have been encouraging trends in the creation of new animal datasets (Beery et al., 2021; Cui et al., 2018; Swanson et al., 2015; Van Horn et al., 2018), there is still only a limited number specifically designed for great apes and even fewer suitable for behavioural analysis. In this section, the most relevant datasets are described.

Bain et al. (2021) curated a large camera trap video dataset (\(>40\) h) with fine-grained annotations for two behaviours: buttress drumming and nut cracking. However, the data and corresponding annotations are not yet publicly available and the range of annotations is limited to two audio-visually distinct behaviours. The Animal Kingdom dataset (Ng et al., 2022), created for advancing behavioural understanding, comprises footage sourced from YouTube (50 h, 30 K videos) along with annotations that cover a wide range of actions, from eating to fighting. The MammalNet dataset (Chen et al., 2023), which is larger and more diverse, is also composed of YouTube footage (18 K videos, 539 h) and focuses on behavioural understanding across species. It comprises taxonomy-guided annotations for 12 common behaviours, identified through previous animal behaviour studies, for 173 mammal categories. While both datasets are valuable resources for the study of animal behaviour, they contain relatively few great ape videos since these species make up only a small proportion of the overall dataset. Animal Kingdom spans \(\sim \) 100 videos while MammalNet includes \(\sim \) 1000 videos across the whole great ape family, representing \(\sim \) 0.5% and \(\sim \) 5% of all videos, respectively. Other work to curate great ape datasets has focused annotation efforts on age, sex, facial location, and individual identification (Brookes & Burghardt, 2020; Freytag et al., 2016; Schofield et al., 2019), rather than behaviour.

For the study of great ape behaviour, the currently available datasets have many limitations. First, they are too small to capture the full breadth of behavioural diversity. This is particularly relevant for great apes, which are deeply complex species displaying a range of individual, paired, and group behaviours that are still not well understood (Samuni et al., 2021; Tennie et al., 2016). Second, they are not composed of footage captured by sensors commonly used in ecological studies, such as camera traps and drones. This means that apes are not observed in their natural environment and the distribution of behaviours will not be representative of the wild (i.e., it is biased towards ‘interesting’ or ‘entertaining’ behaviours). Additionally, the footage may be biased towards captive or human-habituated animals, which display altered or unnatural behaviours and are unsuitable for studying their wild counterparts (Chappell & Thorpe, 2022; Clark, 2011). All these factors may limit the ability of trained models to generalise effectively to wild footage of great apes, where conservation efforts are most urgently needed. This, in turn, limits their practical and immediate utility. We aim to overcome these limitations by introducing a large-scale, open-access video dataset that enables researchers to develop models for analysing the behaviour of great apes in the wild and evaluate them against established methods.

Great Ape Detection and Individual Recognition Yang et al. (2019) developed a multi-frame system capable of accurately detecting the full-body location of apes in challenging camera-trap footage. In more recent work, Yang et al. developed a curriculum learning approach that enables the utilisation of large volumes of unlabelled data to improve detection performance (Yang et al., 2023). Several other works focus on facial detection and individual identification. In early research, Freytag et al. (2016) applied YOLOv2 (Redmon & Farhadi, 2017) to localise the faces of chimpanzees. They utilised a second deep CNN for feature extraction (AlexNet (Krizhevsky et al., 2012) and VGGFaces (Parkhi et al., 2015)) and a linear support vector machine for identification. Later, Brust et al. (2017) extended their work using a much larger and more diverse dataset. Schofield et al. (2019) presented a pipeline for the identification of 23 chimpanzees across a video archive spanning 14 years. Similar to Brust et al. (2017), they trained a single-shot object detector (SSD) to perform initial localisation and a secondary CNN model to perform individual classification. Brookes and Burghardt (2020) employed YOLOv3 (Redmon & Farhadi, 2018) to perform one-step simultaneous facial detection and individual identification on captive gorillas.

Great Ape Action and Behaviour Recognition To date, three systems have attempted automated great ape behavioural action recognition. The first (Sakib & Burghardt, 2020) was based on the two-stream convolutional architecture by Simonyan and Zisserman (2014) and used 3D ResNet-18s for feature extraction and LSTM-based fusion of RGB and optical flow features. They reported a strong top-1 accuracy of 73% across the nine behavioural actions alongside a relatively low average per-class accuracy of 42%. The second, proposed by Bain et al. (2021), utilises both audio and video inputs to detect two specific behaviours: buttress drumming and nut cracking. Their system utilises a 3D ResNet-18 and a 2D ResNet-18 for the extraction of visual and audio features, respectively, in different streams. They achieved an average precision of 87% for buttress drumming and 85% for nut cracking on their unpublished dataset. Lastly, Brookes et al. (2023) introduced a triple-stream model that utilises RGB, optical flow and DensePose within a metric learning framework, and achieved top-1 and average per-class accuracy of 85% and 65%, respectively.

3 Dataset Overview

Task-Focused Data Preparation The PanAf20K dataset consists of two distinct parts. The first is a large video dataset containing 19,973 videos annotated with multi-label behavioural labels. The second comprises 500 videos with fine-grained annotations across \(\sim \) 180,000 frames. Videos are recorded at 24 FPS and a resolution of \(720\times 404\) for 15 s (\(\sim \) 360 frames). In this section, we provide an overview of the dataset, including how the video data was originally collected (see Sect. 3.1) and annotated for both parts (see Sect. 3.2).

3.1 Data Acquisition

Camera Trapping in the Wild The PanAf Programme: The Cultured Chimpanzee has 39 research sites and data collection has been ongoing since January 2010. The data included in this paper samples 18 of these sites, and the available data were obtained from studies of varying duration (7–22 months). Grids comprising 20 to 96 \(1\times 1\) km cells were established for the distribution of sampling units (to cover a minimum of 20–50 \(\text {km}^2\) in rainforest and 50–100 \(\text {km}^2\) in woodland savannah). An average of 29 (range 5–41) movement-triggered Bushnell cameras were installed per site. One camera was installed per grid cell where possible; however, in larger grids cameras were placed in alternate cells. If certain grid cells did not contain suitable habitat, such as grassland in forest-savannah mosaic sites, two cameras were instead placed, as far away from each other as possible, in cells containing suitable habitat to maximise coverage. In areas where activities of interest (e.g., termite fishing sites) were likely to take place, a second camera was installed to capture the same scene from a different angle. Cameras were placed approximately 1 m above the ground, in locations frequently used by apes (e.g., trails, fruit trees). This strategy maximised the chance of capturing footage of terrestrial ape activity. The GPS coordinates and habitat type of each camera location were recorded. Footage was recorded for 60 s with a 1 s interval between triggers, and cameras were visited every 1–3 months throughout the study periods for maintenance and to download the recorded footage.

3.2 Data Annotation

Fine-grained Annotation of PanAf500 The PanAf500 was ground-truth labelled by users on the community science platform Chimp&See (Arandjelovic et al., 2016) and researchers at the University of Bristol (Sakib & Burghardt, 2020; Yang et al., 2019) (examples are shown in Fig. 2). We reformatted the metadata from these sources specifically for computer vision use, producing reproducible and comparable benchmarks ready for AI training. The dataset includes frame-by-frame annotations for full-body location, intra-video individual identification, and nine behavioural actions (Sakib & Burghardt, 2020) across 500 videos and \(\sim \) 180,000 frames.
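For orientation, the sketch below shows what a single per-frame record might look like once the PanAf500 metadata is loaded for model training. The field names and values are illustrative assumptions for exposition, not the dataset's actual schema.

```python
from dataclasses import dataclass

# Illustrative per-frame annotation record; field names are assumptions,
# not the PanAf500's actual on-disk format.
@dataclass
class FrameAnnotation:
    video_id: str            # source camera-trap video
    frame_index: int         # frame number within the video
    ape_id: int              # intra-video individual ID (not consistent across videos)
    bbox_xywh: tuple         # full-body bounding box: (x, y, width, height) in pixels
    behavioural_action: str  # one of the nine classes, e.g. "sitting", "walking"

# Example record for a single ape in a single frame
example = FrameAnnotation(
    video_id="video_0001", frame_index=42, ape_id=1,
    bbox_xywh=(120, 80, 210, 260), behavioural_action="walking",
)
print(example)
```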

Fig. 2

Manually annotated full-body location, species and behavioural action labels. Sample frames extracted from PanAf20K videos with species (row 1) and behavioural action annotations (row 2) displayed. Green bounding boxes indicate the full-body location of an ape. Species and behavioural action annotations are shown in the corresponding text

As shown in Fig. 3, the number of individual apes varies significantly, from one to nine per video, with up to eight individuals appearing together simultaneously. Individuals and pairs occur most frequently while groups occur less frequently, particularly those exceeding four apes. Bounding boxes are categorised according to the COCO standard (Lin et al., 2014) (i.e., area \(>96^2\) for large, \(32^2\)–\(96^2\) for medium, and \(<32^2\) for small), with small bounding boxes occurring relatively infrequently compared to large and medium boxes.
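The size grouping can be reproduced in a few lines; the helper below follows the COCO area convention referenced above (the function name is ours).

```python
def coco_size_category(width: float, height: float) -> str:
    """Categorise a bounding box by area following the COCO convention:
    small < 32^2, medium in [32^2, 96^2), large >= 96^2."""
    area = width * height
    if area < 32 ** 2:
        return "small"
    if area < 96 ** 2:
        return "medium"
    return "large"

print(coco_size_category(30, 30))    # small
print(coco_size_category(80, 80))    # medium
print(coco_size_category(200, 150))  # large
```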

Fig. 3

Number of apes & bounding box size distribution in the PanAf500 data. The top row shows the distribution of apes across frames and videos in (a) and (b), respectively, while the distribution of bounding box sizes is shown in (c). The middle row shows still frame examples of videos containing one, two, four and eight apes (viewed from left to right). The bottom row shows still frames with bounding boxes of various sizes; the colour of each bounding box and its associated number represent the intra-video individual IDs

Fig. 4

Behavioural actions in the PanAf500 data. Examples of each one of the nine behavioural action classes (right) and their distribution (left) across 500 videos. The total number of per-frame annotations for each behavioural action class is shown on top of the corresponding bar (Color figure online)

The behavioural action annotations cover nine basic behavioural actions: sitting, standing, walking, running, climbing up, climbing down, hanging, sitting on back, and camera interaction. We refer to these classes as behavioural actions in recognition of historical traditions in the biological and computer vision disciplines, which would consider them behaviours and actions, respectively. Figure 4 displays the behavioural action classes in focus together with their per-frame distribution. The class distribution is severely imbalanced, with the majority of samples (\(>85{\%}\)) belonging to three head classes (i.e., sitting, walking and standing). The remaining behavioural actions are referred to as tail classes. The same imbalance is observed at the clip level, as shown in Table 1, although the distribution of classes across clips does not match the per-frame distribution exactly. While behavioural actions with longer durations (i.e., sitting) have more labelled frames, this does not necessarily translate to more clips. For example, there are more clips of walking and standing than of sitting, and more clips of climbing up than of hanging, despite these classes having fewer labelled frames.
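As a minimal illustration of the head/tail split described above, the snippet below tallies per-frame labels and treats the three most frequent classes as head classes; the counts are placeholders, not the dataset's true per-frame figures.

```python
from collections import Counter

# Placeholder per-frame labels purely for illustration of the head/tail split.
frame_labels = (["sitting"] * 500 + ["walking"] * 300 + ["standing"] * 150
                + ["climbing up"] * 20 + ["hanging"] * 15 + ["running"] * 5)

counts = Counter(frame_labels)
head = {cls for cls, _ in counts.most_common(3)}   # three most frequent classes
tail = set(counts) - head                          # remaining (tail) classes
head_share = sum(counts[cls] for cls in head) / sum(counts.values())
print(sorted(head), sorted(tail), f"head share = {head_share:.1%}")
```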

Table 1 Behavioural action class statistics. The total number of clips for each behavioural action alongside the average duration in seconds and frames
Fig. 5

PanAf20K behaviour examples. Triplets of example frames for six categories (i.e., feeding, travel, camera reaction, social interaction, chimp carrying and tool use) in the PanAf20K dataset are shown. Note that camera reaction, social interaction and chimp carrying have been abbreviated to reaction, social and carrying, respectively

Fig. 6

Behavioural annotations of the PanAf20K dataset. The distribution of behaviour categories for the PanAf20K dataset is shown. The figures above each bar indicate the proportion (%) of the dataset belonging to each class

Fig. 7

Co-occurrence of behaviours in the PanAf20K dataset. A co-occurrence matrix for the PanAf20K behaviours, where each cell reflects the number of times two behaviours occurred together. Diagonal cells are reset to aid visibility

Fig. 8

Examples of fine-grained and multi-label annotations. For videos with fine-grained annotations, full-body locations and behavioural actions are associated with each ape on a frame-by-frame basis (left). In contrast, multi-label behaviour annotations are provided at the video level (right); behaviours are not localised or assigned specifically to each ape

Multi-label Behavioural Annotation of PanAf20K Community scientists on the Chimp&See platform provided multi-label behavioural annotations for \(\sim \) 20,000 videos. They were shown 15-second clips at a time and asked to annotate whether animals were present or whether the clip was blank. To balance specificity with keeping the task accessible and interesting to a broad group of people, annotators were presented with a choice of classification categories. These categories allowed focus to be given to ecologically important behaviours such as tool use, camera reaction and bipedalism. Hashtags for behaviours not listed in the classification categories were also permitted, allowing new and interesting behaviours to be added as they were discovered in the videos. The new behaviours were subcategories of the existing behaviours, many of them relating to tool use (e.g., algae scooping and termite fishing in arboreal nests).

To ensure annotation quality and consistency, a video was only deemed to be analysed when either three volunteers marked the video as blank, unanimous agreement between seven volunteers was observed, or 15 volunteers had annotated the video. These annotations were then extracted and expertly grouped into 18 co-occurring classes, which form the multi-label behavioural annotations presented here. The annotations follow a multi-hot binary format that indicates the presence of one or many behaviours. It should also be noted that behaviours are not assigned to individual apes or temporally localised within each video. Figure 5 presents examples for several of the most commonly occurring behaviours. Figure 6 shows the full distribution of behaviours across videos, which is highly imbalanced. The four most commonly occurring classes are observed in \(>\,60{\%}\) of videos, while the least commonly occurring classes are observed in \(<1{\%}\). The relationship between behaviours is shown in Fig. 7, which presents co-occurring classes. The behaviours differ from the behavioural actions included in the PanAf500 dataset, corresponding to higher-order behaviours that are commonly monitored in ecological studies. For example, instances of travel refer to videos that contain an individual or group of apes travelling, whereas associated behavioural actions such as walking or running may occur in many contexts (i.e., walking towards another ape during a social interaction or while searching for a tool).
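A minimal sketch of the multi-hot format is given below; the class list is a partial, illustrative subset of the 18 categories (only those named in the text) and the helper name is ours.

```python
import numpy as np

# Sketch of the multi-hot binary video-level label format described above.
BEHAVIOURS = ["feeding", "travel", "camera reaction", "social interaction",
              "chimp carrying", "tool use", "no behaviour", "display"]

def to_multi_hot(video_behaviours, classes=BEHAVIOURS):
    """Convert the set of behaviours annotated for one video into a
    multi-hot vector indicating the presence or absence of each class."""
    vec = np.zeros(len(classes), dtype=np.float32)
    for behaviour in video_behaviours:
        vec[classes.index(behaviour)] = 1.0
    return vec

# A video in which apes travel and also react to the camera
print(to_multi_hot({"travel", "camera reaction"}))
```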

Both parts of the dataset are suitable for different computer vision tasks. The PanAf500 supports great ape detection, tracking, action grounding, and multi-class action recognition, while the PanAf20K supports multi-label behaviour recognition. The difference between the two annotation types can be observed in Fig. 8.

Machine Labels for Animal Location and IDs We generated full-body bounding boxes for apes present in the remaining, unlabelled videos using state-of-the-art (SOTA) object detection models evaluated on the PanAf500 dataset (see Sect. 4). Additionally, we assigned intra-video IDs to detected apes using the multi-object tracker, OC-SORT (Cao et al., 2023). Note that these pseudo-labels do not yet associate behaviours with individual bounding boxes.
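The pseudo-labelling step can be summarised as the detector-then-tracker loop sketched below; `detect_apes` and the tracker interface are hypothetical stand-ins for the fine-tuned detector and OC-SORT, whose actual APIs may differ.

```python
import cv2

def pseudo_label_video(video_path, detect_apes, tracker):
    """Sketch of the machine-labelling pipeline: detect apes in each frame,
    then pass the boxes to a multi-object tracker to obtain intra-video IDs.
    `detect_apes(frame)` and `tracker.update(boxes, frame)` are hypothetical
    interfaces used here only to illustrate the data flow."""
    cap = cv2.VideoCapture(video_path)
    labels, frame_idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        boxes = detect_apes(frame)             # [(x1, y1, x2, y2, score), ...]
        tracks = tracker.update(boxes, frame)  # [(x1, y1, x2, y2, track_id), ...]
        for x1, y1, x2, y2, track_id in tracks:
            labels.append({"frame": frame_idx, "ape_id": int(track_id),
                           "bbox": (x1, y1, x2, y2)})
        frame_idx += 1
    cap.release()
    return labels
```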

Table 2 Ape detection benchmarks. Detection performance on the PanAf500 dataset. Results are reported for the MegaDetector (Beery et al., 2019), ResNet-101 (+SCM+TCM) (Yang et al., 2019), VarifocalNet (Zhang et al., 2021), Swin Transformer (Liu et al., 2021) and ConvNeXt (Liu et al., 2022). The highest scores for each metric are shown in bold

4 Experiments and Results

This section describes experiments relating to the PanAf500 and PanAf20K datasets. For the former, we present benchmark results for great ape detection and fine-grained action recognition. For the latter, we present benchmark results for multi-label behavioural classification. For both sets of experiments, several SOTA architectures are used.

Fig. 9

MegaDetector (Beery et al., 2019) achieves higher precision for the majority of cases, although ConvNeXt (Liu et al., 2022) and Swin Transformer (Liu et al., 2021) achieve better precision scores at high recall rates (\(R_{{\textit{det}}} > 0.84\))

Fig. 10

R101 (+SCM+TCM) (Yang et al., 2019) and VFNet (Zhang et al., 2021) achieve the highest true positive rates at low false positive rates (\({\textit{FPR}} < 0.15\)). At higher false positive rates, R101 (+SCM+TCM) (Yang et al., 2019) performs better

4.1 PanAf500 Dataset

Baseline Models We report benchmark results for ape detection and fine-grained behavioural action recognition on the PanAf500 dataset, with SOTA architectures trained and evaluated for each task. For ape detection, this entails the MegaDetector (Beery et al., 2019), ResNet-101 (+SCM+TCM) (Yang et al., 2019), VarifocalNet (VFNet) (Zhang et al., 2021), Swin Transformer (Liu et al., 2021) and ConvNeXt (Liu et al., 2022) architectures. For fine-grained action recognition, we considered the X3D (Feichtenhofer, 2020), I3D (Carreira & Zisserman, 2017), 3D ResNet-50 (Tran et al., 2018), TimeSformer (Bertasius et al., 2021) and MViTv2 (Li et al., 2022) architectures. Action recognition models were chosen based on SOTA performance on human action recognition datasets and for consistency with the best performing models on the Animal Kingdom (Ng et al., 2022) and MammalNet (Chen et al., 2023) datasets. In all cases, train-val-test (80:05:15) splits were generated at the video level to ensure generalisation across videos/habitats, and splits remained consistent across tasks.

Fig. 11

MegaDetector detection examples. A sequence of frames (along each row) extracted from 3 different videos. The ground truth bounding boxes (green) are shown alongside detections (red). The first sequence (row 1) shows successful detections. The second set of sequences (rows 2–4) provides examples of false positive detections. The third set of sequences (rows 5–6) provides examples of false negative detections

Great Ape Detection We initialised all models with pretrained feature extractors. For all models except the MegaDetector, we utilised MS COCO (Lin et al., 2014) pretrained weights. We used the out-of-the-box MegaDetector implementation since it is pretrained on millions of camera trap images and provides a strong initialisation for camera-trap-specific detection tasks. We then fine-tuned each model for 50 epochs using SGD with a batch size of 16. Training was carried out using an input image resolution of \(416^2\) and an Intersection over Union (IoU) threshold of 0.5 for non-maximum suppression, at an initial learning rate of \(1 \times 10^{-2}\) which was reduced by 10% at 80% and 90% of the total training epochs. All ape detection models were evaluated using commonly used object detection metrics: mean average precision (mAP), precision, recall and F1-score. All metrics follow the Open Images standard (Krasin et al., 2017) and are considered in combination during evaluation. Performance is provided separately for small (\(<32^2\)), medium (\(32^2\)–\(96^2\)) and large (\(>96^2\)) bounding boxes, as per the COCO object detection standard, in addition to overall performance.
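For reference, a PyTorch-style sketch of this fine-tuning schedule is shown below. The placeholder model and the step-decay factor of 0.1 at the 80% and 90% milestones are assumptions (the common step-decay convention; use 0.9 if "reduced by 10%" is read literally), not the exact training code behind the benchmarks.

```python
import torch

# Illustrative sketch of the detection fine-tuning schedule described above.
model = torch.nn.Conv2d(3, 8, 3)              # stand-in for a detector backbone
epochs, batch_size, image_size = 50, 16, 416  # values stated in the text
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[int(0.8 * epochs), int(0.9 * epochs)], gamma=0.1)

for epoch in range(epochs):
    # ... forward/backward passes over 416x416 inputs in batches of 16 ...
    optimizer.step()   # stands in for the per-batch updates of one epoch
    scheduler.step()   # learning rate drops after epochs 40 and 45
print(scheduler.get_last_lr())
```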

Performance Table 2 shows that the fine-tuned MegaDetector achieves the best mAP score overall and for large bounding boxes, although it is outperformed by the Swin Transformer and ResNet-101 (+Cascade R-CNN+SCM+TCM) on medium and small bounding boxes, respectively. This shows that in-domain pre-training of the feature extractor is valuable for fine-tuning, since the MegaDetector is the only model pretrained on a camera trap dataset rather than the COCO dataset (Lin et al., 2014). Performance across the remaining metrics, precision, recall and F1-score, is dominated by the Swin Transformer, which shows the importance of modelling spatial dependencies for good detection performance.

The precision-recall (PR) curve displayed in Fig. 9 shows that most models maintain precision of more than 90% (\(P_{{\textit{det}}} > 0.9\)) at lower recall rates (\(R_{{\textit{det}}} < 0.80\)), except ResNet-101 (+SCM+TCM), which falls below this at a recall of 78% (\(R_{{\textit{det}}}=0.78\)). The fine-tuned MegaDetector achieves consistently higher precision than the other models at recall rates up to 84% (\(R_{{\textit{det}}} \le 0.84\)), outperforming them by 5% (\(P_{{\textit{det}}}=0.05\)) on average. However, at higher recall rates (\(R_{{\textit{det}}}>0.84\)) ConvNeXt and Swin Transformer achieve higher precision, with the latter achieving marginally better performance. The ROC curve presented in Fig. 10 shows that VFNet and ResNet-101 (+SCM+TCM) achieve a higher true positive rate than all other models at false positive rates less than 5% (\({\textit{FPR}} < 0.05\)) and 40% (\({\textit{FPR}} < 0.40\)), respectively. At higher false positive rates, ConvNeXt and Swin Transformer are competitive with ResNet-101 (+SCM+TCM), with marginally better performance being established by ConvNeXt at very high false positive rates. Figure 11 presents qualitative examples of success and failure cases for the best performing model.

Fig. 12

Per-class distribution vs. behavioural thresholds. Distribution of each behavioural action class at various behavioural thresholds. Note that tail classes are affected more significantly by longer thresholds than head classes

Table 3 Behavioural action recognition benchmarks. Behavioural action recognition performance on the PanAf500 dataset. Results are reported for the X3D (Feichtenhofer, 2020), I3D (Carreira & Zisserman, 2017), 3D ResNet-50 (Hara et al., 2017), MViTv2 (Li et al., 2022), and TimeSformer (Bertasius et al., 2021) models. The highest scores for top-1 and average per-class accuracy are shown in bold
Fig. 13

Class-wise performance vs. proportion of data. The per-class accuracy for each behavioural action recognition model is plotted against the proportion of data for each class. All models consistently achieve strong performance on the head classes, whereas performance is variable across tail classes

Behavioural Action Recognition We trained all models using the protocol established by Sakib and Burghardt (2020). During training, we imposed a temporal behaviour threshold, ensuring that only frame sequences in which a behaviour is exhibited for at least \(t\) consecutive frames are used, in order to retain well-defined behaviour instances. We then sub-sampled 16-frame sequences from clips that satisfy the behaviour threshold. The test threshold is always kept consistent (\(t=16\)). Figure 12 shows the effect of different behaviour thresholds on the number of clips available for each class. Higher behaviour thresholds have a more significant effect on minority/tail classes since they occur more sporadically. For example, there are no training clips available for the climbing down class when \(t=128\). All models were initialised with feature extractors pre-trained on Kinetics-400 (Kay et al., 2017) and fine-tuned for 200 epochs using the Adam optimiser and a standard cross-entropy loss. We utilised a batch size of 32, momentum of 0.9 and performed linear warm-up followed by cosine annealing using an initial learning rate of \(1\times 10^{-5}\) that increases to \(1\times 10^{-4}\) over 20 epochs. All behavioural action recognition models were evaluated using average top-1 and average per-class accuracy (C-Avg).
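The behaviour-threshold filtering can be expressed as a simple run-length check over per-frame labels, as sketched below; the function and the toy label sequence are illustrative, not the training code used for the benchmarks.

```python
def runs_meeting_threshold(frame_actions, t):
    """Return (start, end, action) runs in which the same behavioural action
    is annotated for at least `t` consecutive frames; 16-frame training clips
    would then be sampled from these runs."""
    runs, start = [], 0
    for i in range(1, len(frame_actions) + 1):
        if i == len(frame_actions) or frame_actions[i] != frame_actions[start]:
            if i - start >= t:
                runs.append((start, i, frame_actions[start]))  # [start, end) run
            start = i
    return runs

# Toy per-frame label sequence for one ape track
frame_actions = ["sitting"] * 40 + ["standing"] * 10 + ["walking"] * 70
print(runs_meeting_threshold(frame_actions, t=16))
# [(0, 40, 'sitting'), (50, 120, 'walking')] -- the 10-frame 'standing' run is dropped
```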

Performance Table 3 shows that the X3D model attains the best top-1 accuracy at behaviour thresholds \(t=16\) and \(t=64\), although similar performance is achieved by MViTv2 and TimeSformer for the latter threshold. It also achieves the best average per-class performance at \(t=64\), while TimeSformer achieves the best performance at \(t=32\) and \(t=128\).

The MViTv2 models achieve the best top-1 accuracy at \(t=32\) and \(t=128\), although they do not achieve the best average per-class performance at any threshold. The 3D ResNet-50 achieves the best average per-class performance at \(t=16\). When considering top-1 accuracy, model performance is competitive. At lower behavioural thresholds, i.e., \(t=16\) and \(t=32\), the difference in top-1 performance between the best and worst performing models is 2.55% and 4.68%, respectively, although this increases to 5.38% and 11.74% at \(t=64\) and \(t=128\), respectively. There is greater variation in average per-class performance, and it is rare that a model achieves the best performance across both metrics.

Although we observe strong performance with respect to top-1 accuracy, our models exhibit relatively poor average per-class performance. Figure 13 plots per-class performance against class frequency and shows that the low average per-class performance is caused by poor performance on tail classes. The average per-class accuracy across all models for the head classes is 83.22%, while only 28.33% is achieved for tail classes. There is significant variation in the performance of models; I3D performs well on hanging and climbing up but fails to classify any of the other tail classes correctly. Similarly, X3D performs extremely well on sitting on back but achieves poor results on the other tail classes. None of the models except TimeSformer correctly classify any instances of running during testing. Figure 14 presents the confusion matrix calculated on validation data alongside examples of misclassified instances.

4.2 PanAf20K Dataset

Data Setup We generate train-val-test splits (70:10:20) using iterative stratification (Sechidis et al., 2011; Szymanski & Kajdanowicz, 2019). During training, we uniformly sub-sample \(t=16\) frames from each video, equating to \(\sim 1\) frame per second (i.e., a sample interval of 22.5 frames).
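A minimal sketch of this sub-sampling is given below, assuming a start-aligned fixed stride (one common convention); for a 360-frame video it reproduces the stated 22.5-frame sample interval.

```python
import numpy as np

def uniform_sample_indices(num_frames: int, t: int = 16) -> np.ndarray:
    """Return t frame indices taken at a fixed stride of num_frames / t,
    i.e. ~1 frame per second for a 15 s, 24 FPS PanAf20K video."""
    stride = num_frames / t
    return np.floor(np.arange(t) * stride).astype(int)

print(uniform_sample_indices(360))
# [  0  22  45  67  90 112 135 157 180 202 225 247 270 292 315 337]
```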

Baseline Models To establish benchmark performance for multi-label behaviour recognition, we trained the X3D, I3D, 3D ResNet-50, MViTv2, and TimeSformer models. All models were initialised with feature extractors pre-trained on Kinetics-400 (Kay et al., 2017) and fine-tuned for 200 epochs using the Adam optimiser. We utilised a batch size of 32, momentum of 0.9 and performed linear warm-up followed by cosine annealing using an initial learning rate of \(1\times 10^{-5}\) that increases to \(1\times 10^{-4}\) over 20 epochs. Models were evaluated using mAP, subset accuracy (i.e., exact match), precision and recall. Behaviour classes were grouped, based on class frequency, into head (\(>10{\%}\)), middle (1–10%) and tail (\(<1{\%}\)) segments, and mAP performance is reported for each segment. To address the long-tailed distribution, we replace the standard loss with losses based on long-tailed recognition techniques. Specifically, we implement (i) the focal loss \(L_{CB}\) (Cui et al., 2019); (ii) logit adjustment \(L_{LA}\) (Menon et al., 2020); and (iii) focal loss with weight balancing via a MaxNorm constraint (Alshammari et al., 2022).
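As an illustration of one of these techniques, the sketch below adapts logit adjustment to the multi-label setting by offsetting each class logit with the log of its empirical prior before a binary cross-entropy loss. This is an illustrative variant under stated assumptions; the exact formulation used in the benchmarks may differ.

```python
import torch
import torch.nn.functional as F

def logit_adjusted_bce(logits, targets, class_priors, tau=1.0):
    """Logit-adjustment-style multi-label loss (our illustrative adaptation):
    logits, targets have shape (batch, num_classes); class_priors has shape
    (num_classes,) and holds each class's empirical frequency."""
    adjusted = logits + tau * torch.log(class_priors)
    return F.binary_cross_entropy_with_logits(adjusted, targets)

# Toy usage: 3 videos, 4 behaviour classes with a long-tailed prior
logits = torch.randn(3, 4)
targets = torch.tensor([[1., 0., 0., 0.], [1., 1., 0., 0.], [0., 0., 0., 1.]])
priors = torch.tensor([0.60, 0.30, 0.08, 0.02])
print(logit_adjusted_bce(logits, targets, priors))
```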

Fig. 14

Confusion matrix & example errors. The confusion matrix (left) is shown alongside examples of mis-classified frames (right). For mis-classified examples, ground truth labels are shown on the y-axis (i.e., hanging, running, sitting) and examples of the classes most likely to be incorrectly predicted for the ground truth class are shown on the x-axis. Note that a high proportion of errors are due to predictions made in favour of majority classes

Multi-label Behaviour Recognition As shown in Table 4, performance is dominated primarily by the 3D ResNet-50 and TimeSformer models when coupled with the various long-tailed recognition techniques. The TimeSformer (+LogitAdjustment) attains the highest mAP scores for both overall and tail classes, while the MViTv2 (+FocalLoss) and 3D ResNet-50 (+FocalLoss) demonstrate superior performance in terms of head and middle class mAP, respectively. The 3D ResNet-50 (+FocalLoss) and 3D ResNet-50 (+WeightBalancing) models achieve the best subset accuracy and recall, respectively, while the highest precision is realised by the TimeSformer (+LogitAdjustment) model. Although the 3D ResNet-50 and TimeSformer models perform strongest, it should be noted that the difference in overall mAP across all models is small (i.e., 4.03% between the best and worst performing models).

As demonstrated by the head, middle and tail mAP scores, higher performance is achieved for more frequently occurring classes with performance deteriorating significantly for middle and tail classes. Across models, the average difference between head and middle, and middle and tail classes is 35.68 (\(\pm 1.88\))% and 40.55 (\(\pm 3.02\))%, respectively. The inclusion of long-tailed recognition techniques results in models that consistently attain higher tail class mAP performance than their standard counterparts (i.e., models that do not use long-tail recognition techniques). The logit adjustment technique consistently results in the best tail class mAP across models, whereas the focal loss results in the best performance on the middle classes for all models except the X3D model. None of the standard models achieve the best performance on any metric.

Figure 15 plots the per-class mAP performance of the 3D ResNet-50 and 3D ResNet-50 (+LogitAdjustment) models against the per-class proportion of data. The best performance is observed for the three most commonly occurring classes (i.e., feeding, travel, and no behaviour), whereas the worst performance is obtained for the most infrequently occurring classes (i.e., display, aggression, sex, bipedal, and cross species interaction), with the exception of piloerection. It can also be observed that the ResNet-50 (+LogitAdjustment) model outperforms its standard counterpart on the majority of middle and tail classes, although it is outperformed on several head classes. Examples of success and failure cases for the 3D ResNet-50 model are presented in Fig. 16.

5 Discussion and Future Work

Results The performance of current SOTA methods is not yet sufficient to facilitate the large-scale, automated behavioural monitoring required to support conservation efforts. The conclusions drawn in ecological studies rely on the highly accurate classification of all observed behaviours by expert primatologists. While the current methods achieve strong performance on head classes, relatively poor performance is observed for rare classes. Our results are consistent with recent work on similar datasets (i.e., Animal Kingdom (Ng et al., 2022) and MammalNet (Chen et al., 2023)), which demonstrates the significance of the long-tailed distribution that naturally recorded data exhibits (Liu et al., 2019). Similar to Ng et al. (2022), our experiments show that current long-tailed recognition techniques can help to improve performance on tail classes, although a large discrepancy between head and middle, and head and tail classes still exists. The extent of this performance gap (see Table 4) emphasises the difficulty of tackling long-tailed distributions and highlights an important direction for future work (Perrett et al., 2023). Additionally, the near-perfect performance at training time (i.e., \(>95{\%}\) mAP) highlights the need for methods that can learn effectively from a minimal number of examples.

Fig. 15

Class-wise accuracy vs. proportion of data. The per-class average precision for the 3D ResNet-50 (Hara et al., 2017) and 3D ResNet-50 (+LogitAdjustment) (Hara et al., 2017; Menon et al., 2020) models is plotted against the proportion of data for each class. In general, better model performance is achieved on classes with high data proportions and the ResNet-50 (+LogitAdjustment) model shows improved performance on middle and tail classes

Fig. 16

Multi-label errors. Frames extracted from three videos exhibit success and failure cases of the 3D ResNet-50 model. Behaviour predictions are shown in light boxes on the first frame of each sequence; true positives are green, false positives are blue, and false negatives are red. In the first video (row 1), the model fails to classify feeding by the chimp visible in frames 1 and 2, whereas in the second video (row 2), it fails to classify tool use by the infant chimp in the final frame. Climbing is predicted incorrectly in the final video (row 3) (Color figure online)

Table 4 Multi-label behaviour recognition benchmarks. Results are reported for the I3D (Carreira & Zisserman, 2017), 3D ResNet-50 (Hara et al., 2017), X3D (Feichtenhofer, 2020), MViTv2 (Li et al., 2022), and TimeSformer (Bertasius et al., 2021) models with focal loss (Cui et al., 2019), logit adjustment (Menon et al., 2020), and focal loss with weight balancing (Alshammari et al., 2022). The highest scores across all metrics are shown in bold

Community Science and Annotation Although behavioural annotations are provided by non-expert community scientists, several studies have shown the effectiveness of citizen scientists at performing complex data annotation tasks (Danielsen et al., 2014; McCarthy et al., 2021) typically carried out by researchers (i.e., species classification, individual identification, etc.). However, it should be noted that, as highlighted by Cox et al. (2012), community scientists are more prone to errors relating to rare species. In the case of our dataset, this may translate to simple behaviours being identified correctly (e.g., feeding and tool use) whereas more nuanced or subtle behaviours (e.g., display and aggression) are missed or incorrectly interpreted, amongst other problems. This may occur even though the behaviour categories were predetermined by experts as suitable for non-expert annotation.

The dataset’s rich annotations suit various computer vision tasks, despite key differences from other works. Unlike similar datasets (Chen et al., 2023; Ng et al., 2022), behaviours in the PanAf20K dataset are not temporally localised within the video. However, the videos in our dataset are relatively short (i.e., 15 s) in contrast to the long-form videos included in other datasets. Therefore, the time stamping of behaviour may be less significant, considering it is possible to utilise entire videos, with a suitably fine-grained sample interval (i.e., 0.5–1 s), as input to standard action recognition models. That being said, behaviours occur sporadically and chimpanzees are often only in frame for very short periods of time. Therefore, future work will consider augmenting the existing annotations with the temporal localisation of actions. Moreover, while our dataset comprises a wide range of behaviour categories, many of them exhibit significant intra-class variation. In the context of ecological/primatological studies, this variation often necessitates the creation of separate ethograms for individual behaviours (Nishida et al., 1999; Zamma & Matsusaka, 2015). For instance, within the tool use behaviour category, we find subcategories like nut cracking (utilising rock, stone, or wood), termite fishing, and algae fishing. Similarly, within the camera reaction category, distinct subcategories include attraction, avoidance, and fixation. In future, we plan to extend the existing annotations to include more granular subcategories.

Ethics Statement All data collection, including camera trapping, was done non-invasively, with no animal contact and no direct observation of the animals under study. Full research approval, data collection approval and research and sample permits of national ministries and protected area authorities were obtained in all countries of study. Sample and data export was also done with all necessary certificates, export and import permits. All work conformed to the relevant regulatory standards of the Max Planck Society, Germany. All community science work was undertaken according to the Zooniverse User Agreement and Privacy Policy. No experiments or data collection were undertaken with live animals.

6 Conclusion

We present by far the largest open-access video dataset of wild great apes with rich annotations and SOTA benchmarks. The dataset is directly suitable for visual AI training and model comparison. The size of the dataset and the extent of labelling across \(>7\) M frames and \(\sim 20\) K videos (lasting \(>80\) h) now offer the first comprehensive view of great ape populations and their behaviours to AI researchers. Task-specific annotations make the data suitable for a range of associated, challenging computer vision tasks (i.e., animal detection, tracking, and behaviour recognition) which can facilitate the ecological analysis urgently required to support conservation efforts. We believe that, given its immediate AI compatibility, scale, diversity, and accessibility, the PanAf20K dataset provides an unmatched opportunity for the many communities working in the ecological, biological, and computer vision domains to benchmark and expand great ape monitoring capabilities. We hope that this dataset can, ultimately, be a step towards better understanding and more effectively conserving these charismatic species.