Inter-observer variance and agreement of wildlife information extracted from camera trap images

Camera traps are a popular tool in terrestrial wildlife research due to their low costs, easy operability, and usefulness for studying a wide array of species and research questions. The vast numbers of images they generate often require multiple human data extractors, yet accuracy and inter-observer variance are rarely considered. We compared results from 10 observers who processed the same set of multi-species camera trap images (n = 11,560) from seven sites. We quantified inter-observer agreement and variance for (1) the number of mammals identified, (2) the number of images saved, (3) species identification accuracy and the types of mistakes made, and (4) counts of herbivore groups and individuals. We analysed the influence of observer experience, species distinctiveness and camera location. Observers varied significantly regarding image processing rates, the number of mammals found and images saved, and species misidentifications. Only one observer detected all 22 mammals (range: 18-22, n = 10). Experienced observers processed images up to 4.5 times faster and made fewer mistakes in species detection and identification. Missed species were mostly small mammals (56.5%) while misidentifications were most common among species with low phenotypic distinctiveness. Herbivore counts had high to very high variances with mainly moderate agreement across observers. Observers differed in how they processed images and what they recorded. Our results raise important questions about the reliability of data extracted by multiple observers. Inter-observer bias, observer-related variables, species distinctiveness and camera location are important considerations if camera trapping results are to be used for population estimates or biodiversity assessments.


Introduction
When remote wildlife cameras became commercially available in the early 1990s (Kucera and Barrett 1993) their favourable characteristics as a stand-alone research tool led to a revolution in the monitoring of terrestrial wildlife. Animals could now easily be recorded non-invasively anywhere in the world (Kucera and Barrett 2011; Meek et al. 2015). Cameras have since been widely deployed to answer management and scientific questions alike, applicable to an immense spectrum of inquiry that ranges from simple species detection to animal-environment interactions and ultimately global biodiversity changes (Swann and Perkins 2014; Trolliet et al. 2014; Steenweg et al. 2017).
Camera traps combine several useful characteristics and, hence, have important advantages over most other terrestrial monitoring tools. They are affordable, enabling more extensive and intensive sampling strategies. Their operation does not require specialised skills or training and they cause little disturbance to wildlife. Cameras can be left in remote areas for prolonged time periods operating almost independently. Camera trapping, thus, is a cost-effective way of improving both detection probability of cryptic species and the number of observations obtained, creating permanent records that yield irrefutable evidence of a wide array of species and their behaviours (De Bondi et al. 2010; Welbourne et al. 2015; Caravaggi et al. 2017; Wearn and Glover-Kapfer 2019; Thomas et al. 2020).
It is no surprise then that camera trap studies have increased substantially in many parts of the world (Meek et al. 2015; Agha et al. 2018), often focussing on more than a single species (Burton et al. 2015). Camera traps are now probably used more than any other research tool in terrestrial wildlife studies. Their popularity and functional versatility, however, create a problem of scale: large quantities of images are generated, and the relevant data contained in these images still need to be extracted in a reliable fashion. Since cameras can record quick image successions from a single trigger event, devices deployed in areas of high wildlife activity may easily generate up to 40 000 images per day. It is no longer unusual that research programmes operate camera grids with, sometimes, hundreds of devices across entire landscapes (Swanson et al. 2015; McShea et al. 2016). With vast numbers of images being recorded, the time required to review the available footage also increases, thus demanding more resources to extract data. It follows that most research programmes need to rely on multiple observers to process camera trap information efficiently.
Ongava Research Centre (ORC) in northern Namibia faces a similar challenge. Here, dozens of cameras are deployed regularly on a private reserve to study wildlife ecology and distributions (Edwards et al. 2018; Stratford et al. 2019), to assess camera utility and performance (Stratford and Naholo 2017), and to assist with wildlife management questions on the reserve. For example, over the last decade approximately 50 cameras have been set at permanent waterholes every year as part of the reserve's wildlife population monitoring programme, routinely generating upwards of 800 000 images within only a few weeks. In order to facilitate the process of tagging, storing, and analysing such volumes of image data, ORC developed a software system that provides an optimised user interface. Termed GDMS (Geo-Data Management System), the primary interface for processing images was designed to maximise user efficiency by providing methods for rapid image handling, in particular moving between groups of images. It has been particularly helpful at processing so-called "empty triggers", image sequences triggered by moving vegetation that contain no relevant wildlife, but can contribute considerably to the number of images needing to be managed. Users record keywords for images, typically species, with simple pre-configured on-screen buttons (Fig. 1). In case of uncertainty, the GDMS enables users to 'flag' images for independent verification by other observers. All images and their meta-data (camera types, locations, configurations, etc.) are then stored in a database and can be accessed using a range of filters and sort criteria. This way, ORC has processed > 21.7 million images since 2009, resulting in a database now housing more than 7.6 million keyworded wildlife images that support a variety of analyses. As is customary elsewhere, ORC has depended on multiple data extractors to process and classify image contents.
In principle, the mass-extraction of wildlife information from camera trap images is no different than recording wildlife with any other method. The measurement taken needs to be accurate and sources of recording bias must be minimised. The obvious advantage of camera trapping is that a permanent record exists, one that can be assessed without time constraints in the comfort of an office and be reviewed independently by multiple observers. Despite an objective, static wildlife record, the question remains whether different observers indeed extract the same information from images. Previous research suggests not, showing that inter-observer variance can greatly affect the reliability of results obtained when multiple human observers are expected to perform the same task. Studies found considerable inconsistencies in the information obtained, be it during the identification of mammalian species or unique individuals from camera trap footage (Gooliaff and Hodges 2018; Johansson et al. 2020), or the classification of population characteristics (Newbolt and Ditchkoff 2019). It is also clear that different factors such as camera type (Randler and Kalb 2018) and observer experience can influence the results (Burns et al. 2018; Katrak-Adefowora et al. 2020), and that accuracy may differ with species size and phenotypic distinctiveness (Potter et al. 2019). A variety of mistakes can occur, for instance observers missing species entirely or misidentifying them. Even among trained observers, variance can be high (e.g. Meek et al. 2013), especially if similar-looking wildlife occur in the same dataset (Gooliaff and Hodges 2018).

Fig. 1 Geo-Data Management System screen interface for efficient processing of camera trap images. The interface displays the image frame on the left side, the species selection menu on the right, and options for saving or deleting frames at the bottom. Species occurring in the image frame are keyworded for saving, while images without relevant content are deleted. In case of uncertainty, observers can also query species identification or use generic keywords such as "mammal". The software automatically stores the results for each processed image, alongside important meta-data such as date, location, and time of the record. Crucial to the rapid processing of images is the implementation of intuitive 'shortcut' keys for saving and deleting frames. Once these key positions and functions are learned by the user, processing becomes ergonomically more efficient.
Different statistical tools are available to measure agreement between observers and account for variance in observer accuracy (Watson and Petrie 2010; Koo and Li 2016; Ranganathan et al. 2017). A random sample of 70 mammal camera trap studies published over the last decade revealed that 43% of articles reported multiple observers extracting information from images for data analysis, yet without measuring reliability between them; most of the remaining studies did not explicitly acknowledge whether multiple observers were involved in data extraction, or how this might have influenced the results. Only four studies explained how the risk of misidentifications was mitigated (Supplementary Information 1). The potential value of camera trapping information for biodiversity conservation is beyond doubt (Steenweg et al. 2017; Wearn and Glover-Kapfer 2019). Quantifying inter-observer differences is paramount, however, if the resulting data are utilised for population estimates (Després-Einspenner et al. 2017), species mapping or biodiversity assessments that may lead to conservation policy decisions and prioritisation of limited conservation funds, for example.
In order to understand inter-observer variance in greater detail, we conducted an experiment, asking observers with varying degrees of camera trapping and wildlife identification experience to process the same set of multi-species camera trap images. We aimed to determine inter-observer variance and agreement in terms of: (1) the number of relevant images stored, (2) the number of mammal species identified, and (3) the types and frequencies of mistakes made. We further compared differences in how observers counted herbivore groups and individuals while exploring the influence of observer experience, species distinctiveness and camera location on the results. Finally, we assessed causes of inter-observer variance and how removal of outliers may influence agreement and the accuracy of the results obtained.

Experimental set-up and image processing
To assess inter-observer agreement and variance of mammal species detections extracted from camera trap images, we conducted a multi-rater experiment, asking 10 in-house observers (based at ORC) to process the same set of images following a set of guidelines and tasks (Supplementary Information 2). Observers included three senior researchers and two research technicians with camera trapping experience, as well as five student interns, of which only one had previously worked with camera traps on a regular basis. Although not permanently employed, we included student interns in this experiment based on the premise that data extraction tasks are often relegated to assistant staff when large amounts of information need to be processed, and these tasks are increasingly outsourced to lay audiences (e.g. McShea et al. 2016; Swanson et al. 2016; Willi et al. 2019; Katrak-Adefowora et al. 2020). Based on their experience (no. of years) with both camera trap studies and identification of Namibian wild mammals, we classified observers into three experience categories. Because years of experience with camera trapping and species identification were strongly correlated (Supplementary Fig. 2; rs = 0.969, n = 10, p < 0.001), we calculated the average between the two for each participant for final experience classification, being: novices (≤ 1 year), semi-experienced (1-5 years), and experienced observers (> 5 years).
We utilised a sub-set of camera trap data recorded as part of Ongava Game Reserve's 2019 annual wildlife monitoring programme. From dry season waterhole monitoring, which comprised a total of 863 591 camera trap images recorded with 51 cameras at 12 permanent waterholes over 2.5 weeks in September 2019, we randomly selected a sample of seven camera locations containing a total of 11 560 images for our experiment. Our sample was recorded with seven cameras (models: Bushnell HD × 2, Bushnell Core × 2, Reconyx HC500 × 2, and one Reconyx HF2X Hyperfire2) positioned at six of the waterholes. Regardless of the specific model, Bushnell cameras were set to record three images per burst with a minimum delay interval of 15 s between bursts, whereas Reconyx cameras recorded 10 images per burst with a minimum delay interval of 30 s between bursts. All cameras were set to 'high' detection sensitivity and highest possible image quality, which differed between models. There were no differences between day- and night-time recording settings. All cameras had a different field of view and, hence, we treated the two units positioned at the same waterhole independently, hereafter reported as stations 6a and 6b. Sample data spanned a total of 271.5 deployment hours, including both day and night images. The image data contained 22 different mammal species of varying body sizes and phenotypic characteristics (Supplementary Information 3). To ensure correct comparisons with observer responses, we carefully scanned the entire image set four times to determine all species present and their respective identifications.
Following the set of instructions, each observer processed camera trap images independently, extracting mammal information using the GDMS software. The screen interface (Fig. 1) automatically loads image frames from a chosen image folder and observers were asked to assess each frame independently, not inferring species identification from images they had seen before or afterwards. Prior to data processing, all participants were trained to use the GDMS screen interface and its functionalities. Observers saved image frames that contained mammal species, assigning the relevant species keywords to each frame, including multiple species occurring in the same frame. Observers deleted those frames that did not contain mammal species. We instructed observers to assign species names only if identification was certain. Focal species included all wild mammals occurring on Ongava Game Reserve, regardless of their body size and phenotypic characteristics, or their size and location in the image frame. If detected species were not listed for keywording in the GDMS's screen interface, observers recorded their species names separately and keyworded and saved frames as "mammal". If animals were only partially visible in a given frame, observers assigned the keyword "mammal" or the respective species name if unambiguous identification was possible from the visible body parts. Observers "queried" image frames in which they could detect mammalian wildlife but were uncertain about accurate species identification. The 'query' function enables images to be saved and re-evaluated by other observers at a later stage. Observers also recorded the time spent on processing each of the seven image folders. There was no time restriction on data extraction and observers independently decided when and for how long they processed imagery. Observers were not allowed to consult other researchers but could utilise any other resources to aid their identification of mammals found in the images.
Following the processing of the entire dataset, observers provided feedback on the experiment.
In addition to mammal species detection and identification, observers also counted the number of individuals they detected for five common herbivores and the number of distinct social groups observed. Focal herbivores included Plains zebra (Equus quagga), Hartmann's mountain zebra (Equus zebra hartmannae), Black-faced impala (Aepyceros melampus petersi), Greater kudu (Tragelaphus strepsiceros), and Oryx (Oryx gazella). For this task, observers were instructed to consider the entire image sequence instead of assessing individual image frames separately, i.e. counts were recorded once for any herd/individual detection from the time that animals entered the camera's field of view until leaving it again. If the same animals or social groups reoccurred at a camera station at a later stage, observers were asked to count and record them again.

Data analysis
The choice of method to assess inter-observer reliability depends on the number of observers (raters), the type of response variable assessed and its statistical distribution (Mitchell 1979; Watson and Petrie 2010; Koo and Li 2016; Ranganathan et al. 2017). We used Anderson-Darling tests (Stephens 1974) to assess whether response data were normally distributed and selected parametric and non-parametric statistics accordingly.
We compared the number of mammal images saved by observers using a Friedman test (Friedman 1937, 1939), the non-parametric alternative to the one-way ANOVA with repeated measures, with observers as treatments and camera stations as blocks. The results are reported adjusted for ties. Because some mammals occurred at several of the seven locations, we determined the ratios of correct, incorrect, and missed species for each camera station by calculating the proportion of identified species per observer and camera station to the number of possible correct identifications for that station. We then compared these ratios using one-way ANOVAs. We compared the total number of species correctly detected between observers with a Chi-Square Goodness-of-Fit test. Despite limited sample sizes, we considered the influence of observer experience and camera station on the results. Data between camera stations were compared with one-way ANOVAs, each for the ratio of correct vs. incorrect identifications and also missed species, i.e. the mean of all observer ratios for each station, as well as for the mean image processing rates (images/minute) between stations.
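The Friedman statistic can be computed directly from an observer-by-station table of counts. The sketch below is a minimal pure-Python illustration (the toy data are hypothetical, not the study's counts); it assigns average ranks to ties but, for brevity, omits the tie-correction factor applied to the reported results.

```python
def friedman_statistic(blocks):
    """Friedman chi-square for a table of blocks (camera stations, rows)
    by treatments (observers, columns); d.f. = treatments - 1."""
    n = len(blocks)        # number of blocks (stations)
    k = len(blocks[0])     # number of treatments (observers)
    rank_sums = [0.0] * k
    for row in blocks:
        order = sorted(range(k), key=lambda j: row[j])
        ranks = [0.0] * k
        i = 0
        while i < k:  # assign average ranks over runs of tied values
            j = i
            while j + 1 < k and row[order[j + 1]] == row[order[i]]:
                j += 1
            mid = (i + j) / 2.0 + 1.0
            for m in range(i, j + 1):
                ranks[order[m]] = mid
            i = j + 1
        for j in range(k):
            rank_sums[j] += ranks[j]
    # chi-square approximation (no tie correction applied here)
    return 12.0 * sum(r * r for r in rank_sums) / (n * k * (k + 1)) \
        - 3.0 * n * (k + 1)

# Toy example: 3 stations, 3 observers; observer 3 consistently
# saves the most images, observer 1 the fewest.
chi2 = friedman_statistic([[110, 130, 150], [90, 95, 120], [200, 210, 260]])
```

With 10 observers and seven stations, the resulting statistic is referred to a chi-square distribution with 9 degrees of freedom.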
We determined observer agreement for herbivore herd and individual counts, as well as mean herd size estimates, using Intra-Class Correlation Coefficients (hereafter ICC; Fisher 1958; Bartko 1966). ICCs were computed between observers as single scoring, multi-rater absolute agreement measures using a two-way random effects model, with camera stations as repeats and observers as factors (Koo and Li 2016). Camera stations where a particular focal herbivore was not present were removed from the analysis. For each herbivore estimate, we screened ICC results for outliers using Bland-Altman plots (Bland and Altman 1986). Our interpretations of ICC scores followed the scale proposed by Koo and Li (2016), with: values < 0.5 = poor agreement, 0.5-0.75 = moderate agreement, 0.75-0.9 = good agreement, and values > 0.9 = excellent agreement. Due to limited sample sizes, we deemed a conservative scale more appropriate than the more lenient ones proposed by Cicchetti and Sparrow (1981) or Shrout and Fleiss (1979).
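For a single-score, absolute-agreement, two-way random effects model (often written ICC(2,1)), the coefficient can be computed from the station-by-observer matrix via a two-way ANOVA decomposition. The following is a minimal pure-Python sketch under those assumptions; the matrices in the usage example are illustrative toy data.

```python
def icc2_1(data):
    """ICC(2,1): single-score, absolute-agreement, two-way random
    effects model. `data` is a list of rows, one per subject (camera
    station), with one column per rater (observer)."""
    n = len(data)      # subjects (stations)
    k = len(data[0])   # raters (observers)
    grand = sum(sum(row) for row in data) / (n * k)
    row_means = [sum(row) / k for row in data]
    col_means = [sum(data[i][j] for i in range(n)) / n for j in range(k)]
    # sums of squares for rows (subjects), columns (raters) and residual
    ssr = k * sum((m - grand) ** 2 for m in row_means)
    ssc = n * sum((m - grand) ** 2 for m in col_means)
    sst = sum((data[i][j] - grand) ** 2
              for i in range(n) for j in range(k))
    sse = sst - ssr - ssc
    msr = ssr / (n - 1)
    msc = ssc / (k - 1)
    mse = sse / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Two raters counting herds at three stations, one rater always one higher
icc_biased = icc2_1([[1, 2], [2, 3], [3, 4]])
```

Because this is an absolute-agreement form, a constant offset between raters is penalised: in the toy example the two raters track each other perfectly but differ by one animal, giving an ICC of about 0.67 (moderate on the Koo and Li scale) rather than 1.0.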
We used "queried" images, those for which observers could not ascertain species identification with absolute certainty, as a proxy for observer confidence. The total number of images queried was compared between all observers with a Chi-Square Goodness-of-Fit test with a single variable (observer).
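A goodness-of-fit test of this kind asks whether queried-image counts are spread evenly across observers. A minimal sketch with a uniform expectation (the counts below are hypothetical, not the study's data):

```python
def chi_square_gof(observed):
    """Chi-square goodness of fit against a uniform expectation,
    d.f. = len(observed) - 1."""
    expected = sum(observed) / len(observed)
    return sum((o - expected) ** 2 / expected for o in observed)

# Hypothetical queried-image counts for three observers (expected: 20 each)
stat = chi_square_gof([10, 20, 30])
```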
We quantified each observer's mistakes made during species detection and identification, categorising mistakes as either 'species missed': a species occurred but was not recorded, or 'misidentification': a mammal was recorded but identified wrongly, also including confusions between similar-looking species. Since their accurate identification is nearly impossible from camera trap images, we made no distinction between the Cape rock hyrax (Procavia capensis) and the Kaokoveld rock hyrax (Procavia capensis welwitschii), nor between the South African ground squirrel (Xerus inauris) and the Damara ground squirrel (Xerus princeps).
To assess the influence of phenotypic similarity on misidentifications, we computed a species distinctiveness index for all terrestrial mammals known to occur on Ongava Game Reserve. We based species distinctiveness on the combination of three criteria, being: (1) a distinctiveness score that compared 10 morphological features between species; (2) a uniqueness score that compared unique morphological traits; and (3) a taxonomic score that accounted for the number of species in a given taxonomic group (adjusted from Potter et al. 2019). Their detailed calculation is given in Supplementary Information 4. For each species, we aggregated the results obtained for each criterion into a cumulative index ranging from 0 to 29 and calculated distinctiveness as a percentage ratio to indicate high (1.0) or low (0.0) distinctiveness from other species (Supplementary Information 5).
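The aggregation into a 0-1 ratio can be expressed as a simple sum over the three criteria. The sketch below is purely illustrative: the component scores passed in are placeholders, since the study's exact scoring rules are given in its Supplementary Information 4; only the cumulative range (0-29) is taken from the text.

```python
def distinctiveness_ratio(morph_score, uniqueness_score, taxonomic_score,
                          max_total=29):
    """Aggregate the three criteria into a 0-1 ratio, where values near
    1.0 indicate a species highly distinct from others in the assemblage
    and values near 0.0 indicate low distinctiveness."""
    total = morph_score + uniqueness_score + taxonomic_score
    if not 0 <= total <= max_total:
        raise ValueError("scores exceed the cumulative index range")
    return total / max_total

# Hypothetical scores: a highly distinct species vs. a plain-looking one
high = distinctiveness_ratio(20, 6, 3)
low = distinctiveness_ratio(4, 1, 1)
```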

Data processing
Ten observers with varying degrees of experience independently processed the total of 11 560 images recorded at seven camera stations. Participants spent a mean of 10.2 h ± 1.5 h S.E. processing imagery, with a high degree of variance between individual observers (range: 4.23-18.82 h, S² = 19.02). The most experienced observer processed imagery nearly 4.5 times faster than the least experienced one. Faster processing resulted in fewer images saved (Fig. 2a) and processing rates significantly increased with observer experience (Fig. 2b). Image processing rates differed significantly among observers (F(9,60) = 2.75, p = 0.009) and mean image processing rates varied significantly between different camera stations (F(6,63) = 3.11, n = 7, p = 0.010; Table 1).
The total number of images that observers saved as containing mammalian wildlife differed significantly (Table 2). Additionally, observers appeared to assign species keywords with different degrees of confidence. The number of images queried (i.e. species identification could not be assigned with absolute certainty) ranged from 0 to 1 863 (median = 355, n = 10), revealing a high degree of variance (S² = 445 663.57). Querying differed significantly between observers (χ² = 7 243.94, d.f. = 9, p < 0.001) and decreased significantly with increasing observer experience (rs = -0.791, n = 10, p = 0.006). While the four most experienced observers did not query species identification in any of the images, novices and semi-experienced observers queried species identification regularly (median = 797 images, range: 61-1 863, n = 6).

* Table footnote: Frequency represents the cumulative number of correct detections by all 10 observers and the percentage value shows the proportion of these in relation to all 570 possible correct detections (no. of mammals occurring × 10 observers)

Mammal species identification
Only one observer (a novice) detected all 22 mammalian wildlife species present in the entire dataset. The number of species found by observers varied between 18 and 22 (Table 2), with a mean overall detection rate of 92.7% ± 1.5% S.E. (range: 81.8-100%, S² = 25.1). Of the 57 possible correct species identifications (incl. repeat occurrences of the same species at different camera stations), observers detected a median of 55 species with a range of 51-56 (S² = 2.16), or 96.0% of all possible correct detections (n = 570; Table 1). Camera station had a significant effect on the accuracy of correct species detections as well as missed species (both F(6,63) = 12.48, p < 0.001). Camera station did not significantly influence species misidentifications (F(6,63) = 1.13, p = 0.353).
None of the observers correctly identified all mammalian wildlife across the seven sub-datasets. Observers missed between 0 and 4 species (median = 2.0, S² = 1.04, n = 10), amounting to 23 instances of missed species across the seven stations (Table 1). Missed species were mostly small mammals (56.5%) with body masses < 5 kg. Five observers also misidentified mammals on a total of 14 instances (median = 0.5, range: 0-4, S² = 2.21, n = 10), an error occurring mainly among novice and semi-experienced observers (n = 13, 92.9%). In some instances (see Table 1, camera station 3), observers correctly detected and identified mammals in an image sequence while also misidentifying the same species during a separate detection.
Using species distinctiveness scores as a guideline, confusions between least phenotypically distinct species, notably between the two rhino species, the two zebra species as well as two species of dwarf antelopes with similar phenotypic appearance, represented the majority of misidentification mistakes (n = 12, 52.2%) (Supplementary Information 6). The total number of mistakes occurring during species detection and identification declined with increasing observer experience (Fig. 3; rs = -0.874, n = 10, p = 0.001). Experienced observers that processed data faster mainly overlooked small species, whereas less experienced observers frequently missed and misidentified mammalian wildlife (Fig. 3). The proportion of misidentifications in relation to missed species mistakes increased with declining experience (Fig. 3).

Counting herbivore individuals and herds
The individual, herd and resulting mean herd size estimates of 10 observers were characterised by high to very high sample variances (Table 3). Depending on the species, individual counts differed by a magnitude of 140-210%, herd counts by 180-320%, and mean herd size estimates by 140-170% (Table 3). Novices recorded the most variable estimates for 14 of the 15 assessments (93.3%), followed by semi-experienced observers, whereas experienced observers generally produced the most accurate estimates (Fig. 4). Considering the full dataset for analysis, inter-observer agreement for herbivore estimates was mostly moderate (n = 6), followed by good (n = 5) and excellent estimate agreement (n = 4) (Fig. 5). In terms of estimate precision, the highest exact agreement for any estimate was Oryx herd count, recorded as the same value by four observers. Herbivore estimates and their corresponding ICC agreement scores significantly differed for all individual counts, herd counts and mean herd size comparisons (all F-tests p < 0.001; see Supplementary Information 7).
Screening of count data revealed that high estimate variances, and hence observer disagreement, were mainly attributable to statistical outliers representing the estimates recorded by a novice observer, who contributed 13 outliers (Fig. 4; Supplementary Fig. 3). We, therefore, also compared inter-observer agreement using both the entire dataset (10 observers) and a reduced dataset (9 observers). Removal of this observer from the analysis increased ICC agreement scores for 14 of the 15 assessments and resulted in higher ICC accuracy. ICC agreement quality changed for eight estimates (53.3%) following outlier removal. Inter-observer agreement improved by one category in six instances (moderate-good: 3; good-excellent: 3) and by two categories for another estimate (moderate-excellent), while the quality rating of one estimate decreased from good to moderate (Fig. 5).

Fig. 5 (caption, beginning truncated): … ICC results for all 10 observers and dark grey bars represent ICC results after removal of data from one novice observer that contributed 13 outliers to the 15 assessments. Error bars show 95% Confidence Intervals of ICC scores, with a thin line for the complete dataset and a thick line indicating data excluding one novice observer. ICC agreement interpretations follow the quality criteria of Koo and Li (2016).
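The Bland-Altman screening used here compares one observer's counts against a reference series (for instance, the remaining observers' consensus) and flags differences outside the 95% limits of agreement. A minimal sketch, with hypothetical per-station counts rather than the study's data:

```python
from statistics import mean, stdev

def bland_altman_outliers(a, b):
    """Return index positions whose pairwise difference falls outside
    the 95% limits of agreement (bias +/- 1.96 * SD of differences)."""
    diffs = [x - y for x, y in zip(a, b)]
    bias = mean(diffs)
    spread = 1.96 * stdev(diffs)
    lower, upper = bias - spread, bias + spread
    return [i for i, d in enumerate(diffs) if not lower <= d <= upper]

# Hypothetical herd counts at 10 stations: one observer vs. the group
# consensus; the final station's count of 40 vs. 20 is flagged.
flagged = bland_altman_outliers(
    [11, 12, 13, 14, 15, 16, 17, 18, 19, 40],
    [10, 11, 12, 13, 14, 15, 16, 17, 18, 20])
```

Note that with only seven stations per estimate, as in this study, the limits of agreement are wide, so visual inspection of the Bland-Altman plot remains important alongside the numeric flags.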

Discussion
Our small-scale experiment raises important questions about the reliability of data extracted by multiple human observers from the same set of camera trap imagery. Observers considerably differed in how they processed imagery, how much data they saved, how confidently they assigned species identifications, how many species they detected overall, and how they interpreted processing rules. Observers also differed in their ability to identify mammalian wildlife accurately, in the number and types of mistakes that occurred and, ultimately, in their estimates of common herbivore individuals and herds seen. Most inter-observer comparisons yielded significant differences and only moderate to good agreement. With few exceptions, the extracted data were characterised by high degrees of variance, with estimate differences on the order of several hundred per cent. These differences are particularly worrying when considering that observers extracted data under near-optimal recording conditions; 20 of the 22 focal species had medium to large body sizes (Supplementary Information 4), and the animals were positioned in open areas close to cameras, remaining stationary at waterholes for prolonged periods. Also, we imposed no time limits on data processing and observers had access to a range of resources aiding species identification. Our findings, thus, stress the importance of comparing agreement in multi-observer camera trap studies and, where necessary, the need to exclude unreliable responses.
Observer experience appeared to be one key variable influencing the results. Experienced observers assigned species labels with a very high degree of confidence and accuracy and their herbivore estimates had the highest accuracy overall. They processed images faster and mainly overlooked small-bodied mammals, such as the Rock hyrax, if these were located away from the images' centre and focal point of animal activity. Less experienced observers, on the other hand, processed data more slowly, yet they assigned species identifications with less confidence, regularly missing and confusing similar-looking mammals, resulting in lower accuracy in species detection and correct identification. Their estimates of herbivore numbers and herds also revealed greater variance. Only using experienced observers for image classification may, therefore, improve accuracy and confidence in the results. However, many research projects rely on (untrained) volunteers for these time-intensive tasks in particular, and the number of images obtained during large-scale multi-camera surveys makes it difficult to assign these tasks solely to experienced data processors. It is noteworthy, however, that only one observer, a novice data extractor, detected all 22 mammals present in the entire dataset. During post-experiment feedback, novice and semi-experienced observers clearly pointed out that fatigue was a challenge, especially after multiple hours of continuous data extraction (cf. Oliveira-Santos et al. 2010). This may partly explain the higher variance in results and the number of mistakes that occurred, especially misidentifications. While detecting and counting animals from static images may seem easy, experiments show that human concentration decreases significantly within 30 min of performing intensive cognitive tasks, leading to mental fatigue and increased error rates (Slimani et al. 2018). In our study, observers dealt differently with fatigue.
While some stopped image processing and resumed tasks at a later stage (sometimes days later), others continued to finish a particular image sequence. Given the deleterious effects of mental fatigue on human attention (Boksem et al. 2005; Faber et al. 2012), maximum time thresholds for data extraction may be needed to ensure greater reliability. Additional studies are needed to examine the impact of fatigue on camera trapping results.
Most identification mistakes occurred between species with least phenotypic distinctiveness, for instance when observers confused Plains and Hartmann's mountain zebras. This finding is consistent with other camera trap studies showing that accurate mammal classifications are difficult to obtain where a range of similar-looking species occur sympatrically (Gooliaff and Hodges 2018), or individuals of a particular species exhibit little distinctiveness (Güthlin et al. 2014; Newbolt and Ditchkoff 2019). Following our study, observers commented that animal orientation in the image frame (broad side vs. frontal vs. rear views) as well as 'crowdedness', i.e. the occurrence of large numbers of animals in the same image frame, also made accurate identification very difficult. Estimates of Black-faced impala, the species that occurred in the largest social groups at various recording stations, consistently had the highest variance. Group size estimates appeared challenging for any herbivore occurring in gregarious herds, confirming that camera trap-based abundance estimates of non-distinct social species need to be treated with great caution as the cameras' limited field of view and the complexity of animal movements obstruct an accurate assessment (Stratford and Naholo 2017).
Although based on a small sample of only seven recording stations, our results provide evidence that camera location influenced data extraction reliability. This is probably associated with the specific contents and image characteristics recorded at each locality, as background habitat, species composition and levels of animal activity differed between waterholes. Habitat structure, especially its "openness", has been shown to influence detection and identification accuracy and might have contributed to the differing results between camera stations in our study (Wearn and Glover-Kapfer 2019;Egna et al. 2020). Observer image processing rates, mammal detection and identification accuracies, as well as the types and frequency of mistakes, indeed varied significantly across the different camera stations. In addition to this diverse range of observer-related influences on classification accuracy, further complications may arise from technical factors affecting image quality, such as image complexity, motion blur and light conditions due to flash settings or varying natural light and contrast between day- and night-time (Rovero et al. 2013;Zheng et al. 2016) (Table 4). Finally, observers also interpreted the data extraction rules differently, resulting in greater variance. The novice observer whose estimates frequently produced extreme outliers adjusted the definitions of social group and independent detection event throughout the experiment, indicating that any ambiguity in the data extraction tasks can contribute to higher variance. Inter-observer agreement improved considerably when this participant was omitted from the analyses.
While surveyors can choose from a variety of image management software options to optimise data handling procedures (Young et al. 2018;Greenberg et al. 2019), the final classification of image content still mostly depends on human assessment. Clearly, individual observers can profoundly influence the results. Low agreement between data extractors can lead to skewed population demographics (Newbolt and Ditchkoff 2019) and biased survivorship and population estimates (Morrison et al. 2011;Van Horn et al. 2014;Johansson et al. 2020), thus potentially misguiding conservation measures and policy.
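Chance-corrected statistics make the notion of 'low agreement' concrete. The sketch below computes Cohen's kappa for two observers labelling the same images; the species labels are hypothetical, and this is an illustrative measure rather than necessarily the agreement statistic used in our analyses.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two observers who labelled
    the same images (Cohen's kappa). Illustrative sketch only."""
    n = len(labels_a)
    # Proportion of images on which the two observers agree:
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each observer's label frequencies:
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[s] * counts_b[s] for s in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Two observers classifying the same four images (hypothetical labels):
obs1 = ["zebra", "zebra", "impala", "impala"]
obs2 = ["zebra", "impala", "impala", "impala"]
kappa = cohens_kappa(obs1, obs2)  # 0.5: moderate agreement
```

Kappa ranges from 1 (perfect agreement) down through 0 (chance-level agreement), which is why raw percent agreement alone can overstate reliability when a few species dominate the image set.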
A rapidly evolving suite of automated image processing software aims to mitigate the concerns associated with manual data extraction. These algorithms can overcome certain human biases and fatigue-related issues, improving the precision of results by offering predictable, standardized error rates. However, most, if not all, models still require the manual preparation of a training dataset, undertaken by experts (Tabak et al. 2019;Whytock et al. 2021) and/or untrained volunteers (Norouzzadeh et al. 2018;Willi et al. 2019). Human error is, therefore, not wholly eliminated from the process, and at least some of the risks stemming from low classification agreement and false classifications remain. Lower accuracies for out-of-sample data and for species with fewer images in the training dataset (Tabak et al. 2019;Willi et al. 2019;Schneider et al. 2020) suggest that unexpected and rare species may either go undetected or be classified incorrectly (Whytock et al. 2021).
Despite the fast processing speeds achieved by some algorithms (e.g. 2,000 images/minute, Tabak et al. 2019), efficiency considerations also need to take into account the time and resources required for software development and for the human classification of training datasets, especially if individual models are not transferable between species and sites. Algorithm classification accuracies and image processing rates in multi-species studies still vary greatly (Supplementary Information 8), and many were similar to those obtained by expert observers using the GDMS. Using models to detect empty image frames and classify common species, however, could significantly reduce the workload of human observers and standardize error rates to predictable levels. Optimizing the processing workflow of large-scale studies with hundreds of thousands or millions of images could involve a hybrid solution, in which trained software models detect animals in image frames but only classify the most common ones, while less common and uncertain records are queried for independent assessment by experienced human observers.
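Such a hybrid triage step could be sketched as follows. The species list, confidence threshold and record format are illustrative assumptions, not part of our workflow or of any particular software package.

```python
# Hypothetical triage for a hybrid machine/human workflow: a model's top
# prediction is auto-accepted only for common species classified with high
# confidence; everything else is queued for expert human review.
COMMON_SPECIES = {"plains_zebra", "black-faced_impala", "gemsbok"}  # assumed
CONFIDENCE_THRESHOLD = 0.95  # assumed acceptance cut-off

def triage(records):
    """Split model outputs into auto-accepted labels and a review queue.

    Each record is a dict with keys 'image', 'label' (the model's top
    class) and 'confidence' (a score in [0, 1]).
    """
    accepted, review_queue = [], []
    for rec in records:
        if rec["label"] in COMMON_SPECIES and rec["confidence"] >= CONFIDENCE_THRESHOLD:
            accepted.append(rec)
        else:
            review_queue.append(rec)  # rare species or uncertain prediction
    return accepted, review_queue

records = [
    {"image": "img_001.jpg", "label": "plains_zebra", "confidence": 0.99},
    {"image": "img_002.jpg", "label": "rock_hyrax", "confidence": 0.97},  # rare: review
    {"image": "img_003.jpg", "label": "gemsbok", "confidence": 0.62},     # uncertain: review
]
accepted, queue = triage(records)  # 1 auto-accepted, 2 sent to humans
```

The design choice is that both conditions must hold: a confident prediction for a rare species is still reviewed, which addresses the lower out-of-sample accuracies noted above.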
Until machine learning methods can provide reliable automated detection and identification tools for a large variety of common and uncommon species in different habitats (see Willi et al. 2019;Miao et al. 2019;Schneider et al. 2020), much of the time-consuming data extraction work will continue to be the responsibility of human observers. In such studies, the issue of inter-observer reliability can be addressed in different ways. For instance, online on-demand training provides resources to inexperienced data extractors, measurably improving the accuracy of image classifications (e.g. Katrak-Adefowora et al. 2020). Other approaches to reducing the impact of misclassifications include the omission of records (Rode et al. 2021), multi-observer validation of image content (Swanson et al. 2016), or the selective flagging of doubtful records for independent review by more experienced assessors. Here, we demonstrate that simple outlier screening and truncation can improve reliability. Based on our results, we emphasise the importance of assessing classification agreement in multi-observer camera trap studies, especially if a range of similar wildlife species is monitored simultaneously.
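One simple form of outlier screening uses the median and median absolute deviation (MAD) of per-observer counts; the 3.5 cut-off and the herd counts below are illustrative assumptions rather than our exact procedure.

```python
import statistics

def screen_outliers(counts, z_thresh=3.5):
    """Retain observer counts whose modified z-score, based on the median
    and the median absolute deviation (MAD), stays within the threshold.
    The 3.5 cut-off is a common rule of thumb, assumed here."""
    med = statistics.median(counts)
    mad = statistics.median([abs(c - med) for c in counts])
    if mad == 0:
        return list(counts)  # no spread among observers: nothing to flag
    # 0.6745 scales the MAD to be comparable to a standard deviation:
    return [c for c in counts if abs(0.6745 * (c - med)) / mad <= z_thresh]

# Six observers' counts of the same herd, with one extreme value (hypothetical):
herd_counts = [12, 13, 12, 14, 13, 40]
retained = screen_outliers(herd_counts)  # the count of 40 is dropped
```

Median-based screening is preferred over mean-based screening here because a single extreme count, like the novice observer's outliers described above, would otherwise inflate the very statistics used to detect it.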
The issue of inter-observer reliability is not unique to camera trap studies. It pertains to any field of research where multiple human observers are expected to assess the attributes of the same 'object' and classify their observations. The problems that arise when using multiple observers for data extraction can be highly context-specific and, in the case of camera traps, a variety of factors influence what humans detect and manage to identify. Our results revealed great inconsistencies in some of the most common measurements obtained with camera trapping: the detection and accurate identification of animals where they occurred (occurrence and biodiversity studies), resulting in both false negatives and false positives, as well as high variances in the population size parameters of common social species. Simply put, we do not all see or count the same things in a given set of camera trap images, even under favourable conditions. While sample sizes prevented us from testing how the different factors that influence content classification accuracy might interact, our experiment reflected realistic research circumstances in terms of observer selection, the images being processed, the tasks assigned, and the software utilised. Observer performance and associated biases deserve greater attention in multi-species studies, particularly those founded on mass data extraction by different observers (Cruickshank et al. 2019).

Consent for publication
Observers have consented to the submission of this manuscript for publication.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.