Crowd surveillance: estimating citizen science reporting probabilities for insects of biosecurity concern

Data streams arising from citizen reporting activities continue to grow, yet the information content within these streams remains unclear, and methods for addressing the inherent reporting biases are little developed. Here, we quantify the major influence of physical insect features (colour, size, morphology, pattern) on the propensity of citizens to upload photographic sightings to online portals, and hence to contribute to biosecurity surveillance. After correcting for species availability, we show that physical features and pestiness are major predictors of reporting probability. The more distinctive the visual features, the higher the reporting probability, potentially providing useful surveillance should the species be an unwanted exotic. Conversely, the reporting probability for many small, nondescript high priority pest species is unlikely to be sufficient to contribute meaningfully to biosecurity surveillance, unless they are causing major harm. The lack of citizen reporting of recent incursions of small, nondescript exotic pests supports the model. By examining the types of insects of concern, industries or environmental managers can assess to what extent they can rely on citizen reporting for their surveillance needs. The citrus industry, for example, probably cannot rely on passive unstructured citizen data streams for surveillance of the Asian citrus psyllid (Diaphorina citri). In contrast, the forestry industry may consider that citizen detection and reporting of large and colourful insects such as pine sawyers (Monochamus spp.) is sufficient for its needs. Incorporating citizen surveillance into the general surveillance framework is an area for further research.


Introduction
Biosecurity surveillance aims to protect the natural environment, plants and animals, as well as agriculture and horticulture, from harm caused by pests and diseases (Froud et al. 2008). The biosecurity threat arising from the invasion of exotic insect pests is highly diffuse in that the number of target species is very large and the potential points of entry are numerous. This presents particular logistical challenges for implementing effective surveillance: it is impossible to deploy targeted traditional surveillance (e.g. species-specific traps, trained inspectors, etc.) for all threats in all locations. Proposed alternatives to such traditional surveillance include increased use of sensors, robots and citizen science. Here, we focus on the latter option. Citizens can potentially contribute to biosecurity surveillance in many ways, ranging from inadvertent references to invasive organisms on social media platforms (e.g. Twitter), to deliberate though unstructured (spatially and temporally opportunistic) reporting of species via dedicated online portals (e.g. iNaturalist, https://www.inaturalist.org/), to deliberate structured (designed) surveys (Welvaert and Caley 2016). The potential surveillance power of the general public is evident from a New Zealand study, where nearly half of all new exotic species detections over a 3-year period were from members of the general public (Froud et al. 2008). In a similar vein, Thomas et al. (2017) recorded that 95% of non-indigenous invertebrate species new to Barrow Island were detected by members of the local community. Such surveillance contributes to what is termed "general surveillance" (Hammond et al. 2016a).
Detecting environmental biosecurity events from human social media communications in a timely manner faces some particular challenges, some technical (Daume 2016) and others largely arising from uncertainty and bias relating to the observation process (Welvaert and Caley 2016). In comparison with self-reported syndromic human health surveillance, the spatial scale and number of events to be detected is small initially (at the time when detection is most critical), and the direct impact on individuals is typically minimal. For example, the combined effects of citrus greening disease (Huanglongbing, currently causing massive economic loss to the citrus industry in the Americas), vectored by the Asian citrus psyllid (Diaphorina citri) (Grafton-Cardwell et al. 2013), are neither immediate nor direct on human individuals per se, until the pathogen has spread significantly and affected trees are showing visible symptoms. Furthermore, the impacts and/or symptoms of exotic pests and diseases may be unknown, hard to detect or difficult to distinguish from those of endemic pests, resulting in varying levels of detectability (Jarrad et al. 2011). The detection of small-scale biosecurity events through social media also requires that the taxonomy of the organism is not only widely known but also unique; otherwise, the signal-to-noise ratio is too low for reliable signal retrieval (Welvaert et al. 2017). Hence, reliably detecting the arrival of exotic insects within the social media data stream is likely to be highly problematic.
Insect collecting is a worldwide contemporary and historical hobby/passion of many members of the public, with the major recent change being the move to photography in place of physical specimen collecting, and the ability to share these images online. In comparison with social media, the uploading of photographs onto citizen science data portals is much more deliberate and has a taxonomic underpinning. Dedicated online platforms now exist to store such observations, and to crowd-source their species identification. The number of citizen-sourced record uploads runs into the tens of millions. Note, however, that these data sources generally do not contain biosecurity-related species information. By definition they would not contain records for invasive alien species that have not yet entered a country. These growing datasets, however, can be used to inform us about the type of species that are typically reported by citizen scientists, and whether they are likely to include exotic pests and/or pathogen species should they arrive. In Australia, for example, a wide variety of sightings of insect species are uploaded to the Atlas of Living Australia (ALA, https://www.ala.org.au/), which acts as a repository for most citizen science platforms in Australia along with professionally collected museum specimens, etc. The number of citizen-sourced record uploads of insect species to the ALA already runs into the 100,000s, involving 1000s of species. Although these numbers may seem impressive, at face value they provide little information on whether an emergency plant pest (e.g. D. citri) would be detected and reported in a timely manner.
There is clearly overlap between the types of insects that are recorded, and exotic insect species of biosecurity concern, raising the possibility of using an analogue approach to estimate surveillance sensitivity. For example, the black spittlebug (Amarusa australis), a harmless native species in Australia, is from the same Cicadellidae family as the glassy-winged sharpshooter (Homalodisca vitripennis), the key vector for Xylella fastidiosa, the causative pathogen of Pierce's Disease in grapevines. As of 30-06-2016, there had been two citizen sightings of A. australis uploaded to ALA, and notably, both were identified on the same day as they were uploaded. However, the two species differ substantially in size and colour (H. vitripennis is larger and more colourful) (Fig. 1a, b), calling into question the accuracy of the analogue approach. Clearly some form of model is required to infer what this sighting rate may mean for the detection and reporting of an incursion of H. vitripennis. Answering this question requires knowing the factors that motivate people to report the insects they discover, and applying these factors to emergency plant pests of concern to estimate the likely reporting probabilities.
This study introduces a quantitative, statistical approach to estimating the citizen reporting probabilities of insects based on their physical features. In doing so it quantifies the contribution of citizen science activities to biosecurity surveillance, and enables identification of invasive insects for which citizen science would not provide effective surveillance.

Experimental design
We used a case-control experimental design to assess factors that influence the probability of an insect species being uploaded to the Atlas of Living Australia through citizen science channels (ALA 2016a, b). The Atlas of Living Australia is Australia's national biodiversity database. It is an online biodiversity data management system which links Australia's biological knowledge with its scientific and agricultural reference collections and other custodians of biological information. The initial focus of the ALA was on assembling a comprehensive database of collections and records generated by professional taxonomists and scientists. Subsequently, it has developed (and actively encouraged) the direct recording of sightings by non-professionals ("citizen scientists") including the ability to upload datasets, and to receive sighting data streams from stand-alone citizen science reporting platforms. The predominant citizen science sources for the insect orders of interest (see below) were Bowerbird (http://www.bowerbird.org.au/), iNaturalist (https://www.inaturalist.org/), QuestaGame (https://questagame.com/) and direct citizen uploads. The predominant source of uploads from professionals was from museums within Australia's seven states and territories participating in the Online Zoological Collections of Australian Museums (https://www.ozcam.org.au/) and scientific collection expeditions. Uploads from professionals for the insect orders of interest out-numbered those by citizens by a factor of c. 50, but note that this figure is highly dynamic.
Cases (n = 278) were species for which at least one record by a citizen source was uploaded through the Atlas of Living Australia (ALA) portal in the two years up until 30 June 2016. Controls (n = 196) were a weighted (by number of observations) sample of all species within the ALA for which there were zero records by citizens over the same period. Only the orders Coleoptera and Hemiptera were considered, as these orders encompass the vast majority of emergency plant pests (EPPs). The Hemiptera in particular appear difficult to prevent from invading and are typically not detected on incursion pathways (Caley et al. 2015).
For each species, we assessed the following:
• Order (Coleopteran or Hemipteran)
• Body length (mm)
• Colour: rated on a scale from 0 (no colour) to 4 (vividly coloured or very highly coloured)
• Pattern: rated on a scale from 0 (no pattern) to 4 (very highly patterned or ornate)
• Morphology: rated on a scale from 0 (no morphology of interest) to 4 (unique or spectacular morphology)
• Range size (km²): minimum convex polygon of all ALA records
• Observation intensity (km⁻²): density (intensity) of all citizen science reports for all insect species within the range over the 2-year period (

including the terms "Genus species" AND "Pest", a second search within Google Scholar using the same search terms and finally a search within the Pests and Diseases Information Library (PaDIL, http://www.padil.gov.au/) using the taxon name only. Hits were checked for relevance, with searching stopped either as soon as an article was found that clearly identified the taxon as being a pest (in any environment), or when hits stopped containing both required search terms. We did not attempt to assess impact: because the thrust of the work relates to citizens' motivation to report a taxon, this admittedly subjective definition of pest status suffices (i.e. the taxon has been recorded behaving in a way that is considered a pest).

Statistical analysis
We use two methods of analysis for classifying whether a species will be detected and reported. The first, logistic regression, produces easily interpreted coefficients (e.g. the effect of factor X is to increase the odds of reporting by Y).
The second, random forests (Breiman 2001), is essentially a form of data mining whose performance (discriminatory ability) we would a priori expect to be close to the maximum obtainable. The downside is that interpreting the influence of the covariates from the many individual classification and regression trees within the "forest" so generated is not straightforward, although the relative contribution and importance of the covariates can be assessed. Logistic regression models the reporting probability onto the ALA via citizen science platforms as a linear function of the covariates (the "linear predictor") as:

$$\operatorname{logit}(p) = \beta^{*}_{0} + \beta_{1}\,\text{Order} + \beta_{2}\,\text{Size} + \beta_{3}\,\text{Colour} + \beta_{4}\,\text{Pattern} + \beta_{5}\,\text{Morphology} + \beta_{6}\log_{10}(\text{Range}) + \beta_{7}\log_{10}(\text{ObsIntens}) + \beta_{8}\,\text{Pest} \tag{1}$$

where $p = \Pr(\text{Reported} \mid \text{Covariates} \cap \text{Sampled})$ and $\boldsymbol{\beta}^{*\prime} = (\beta^{*}_{0}, \beta_{1}, \ldots, \beta_{k})$ are the coefficients for the k covariates.
Note that the asterisk (*) for $\beta^{*}_{0}$ in Equation 1 signifies that this is a biased estimate of the intercept as a result of the case-control sampling process (see below). The logit transformation of a probability (p) is defined as the log of the odds. That is:

$$\operatorname{logit}(p) = \log\left(\frac{p}{1 - p}\right) \tag{2}$$

Treating the scoring variables as continuous could be criticized; however, the purpose of the model is primarily for classification, and the approach facilitates better communication of the covariate effects on reporting probabilities for less quantitative readers. The random forest model was fitted to the same set of covariates using the default parameter settings and a forest size of 1000 trees.
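As an illustration of how the linear predictor of Equation 1 maps to a probability, the following minimal Python sketch may help. The coefficient values, the coding of Order as 1 for Coleoptera, and the example species scores are all hypothetical placeholders chosen for illustration; they are not the fitted values reported in Table 1, and the resulting probability is still conditional on the case-control sampling.

```python
import math

def logit(p):
    """Log of the odds for a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))

def inv_logit(eta):
    """Inverse logit: map a linear predictor back to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

# HYPOTHETICAL coefficients, for illustration only (not the Table 1 estimates).
coef = {
    "intercept": -6.0, "order": 0.8, "size": 0.05, "colour": 0.5,
    "pattern": 0.4, "morphology": 0.6, "log10_range": 0.7,
    "log10_obs_intens": 0.9, "pest": 2.7,
}

def linear_predictor(x, coef):
    """Intercept plus the sum of coefficient * covariate value."""
    eta = coef["intercept"]
    for name, value in x.items():
        eta += coef[name] * value
    return eta

# HYPOTHETICAL species: a 10 mm beetle (order coded 1, an assumed coding),
# scored 3/2/1 for colour/pattern/morphology, and recorded as a pest.
species = {
    "order": 1, "size": 10.0, "colour": 3, "pattern": 2, "morphology": 1,
    "log10_range": math.log10(100.0),       # 100 km^2 range
    "log10_obs_intens": math.log10(0.26),   # 0.26 reports per km^2
    "pest": 1,
}

eta = linear_predictor(species, coef)
p_sampled = inv_logit(eta)  # probability conditional on case-control sampling
print(round(eta, 3), round(p_sampled, 3))
```

Because the covariates enter additively on the logit scale, a one-unit change in any score multiplies the odds of reporting by the exponential of its coefficient.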
Reporting probabilities over the 2-year period were converted to yearly reporting probabilities assuming citizen reporting effort could be approximated as constant across years.
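One way to carry out such a conversion is sketched below in Python. It assumes that the two years are independent and have equal reporting probability, so that the probability of going unreported over two years is the square of the yearly probability of going unreported; the text does not give the exact formula used, so this is an assumed form rather than the authors' stated calculation.

```python
def yearly_from_two_year(p_two_year):
    """Convert a 2-year reporting probability to a yearly one.

    Assumes independent years with equal reporting probability, so that
    (1 - p_two_year) = (1 - p_year) ** 2.
    """
    if not 0.0 <= p_two_year <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    return 1.0 - (1.0 - p_two_year) ** 0.5

# Example: a species with a 75% chance of being reported at least once
# in two years has a 50% chance of being reported in any one year.
print(yearly_from_two_year(0.75))
```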

Model evaluation
We evaluated the classification performance of the logistic regression model using 10-fold cross-validation, whereby the data were randomly divided into 10 folds, which were held out in turn and classification errors assessed. The ten values were then averaged to provide an overall estimate of the classification errors expected during prediction. For the random forest model, the in-built out-of-the-bag (OOB) error rate was used as an estimate of the classification error when predicting.
The 10-fold cross-validation performance for the logistic regression model (Sensitivity = 89%, Specificity = 83%, Overall error rate = 13.5%) slightly outperformed the out-of-the-bag error rates of the random forest (Sensitivity = 89%, Specificity = 77%, Overall error rate = 16%). Armed with this knowledge that the logistic regression model was at least as good as the data-mining alternative, we used it for prediction and interpretation.
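The cross-validation procedure described above can be sketched in Python as follows. This is a generic illustration, not the analysis code used for the study (which was run in R): the threshold classifier and the synthetic data are placeholders standing in for the logistic regression model and the ALA species data, and all function names are hypothetical.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def rates(y_true, y_pred):
    """Return (sensitivity, specificity, error rate) for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    # A fold with no cases (or no controls) gives an undefined rate (nan).
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    spec = tn / (tn + fp) if (tn + fp) else float("nan")
    return sens, spec, (fp + fn) / len(pairs)

def cross_validate(X, y, fit, predict, k=10, seed=0):
    """Hold each fold out in turn, fit on the rest, average the k results."""
    per_fold = []
    for held_out in k_fold_indices(len(y), k, seed):
        held = set(held_out)
        train = [i for i in range(len(y)) if i not in held]
        model = fit([X[i] for i in train], [y[i] for i in train])
        preds = [predict(model, X[i]) for i in held_out]
        per_fold.append(rates([y[i] for i in held_out], preds))
    return tuple(sum(r[j] for r in per_fold) / k for j in range(3))

# PLACEHOLDER classifier: threshold midway between the two class means.
def fit_threshold(X, y):
    mean1 = sum(x for x, t in zip(X, y) if t == 1) / sum(y)
    mean0 = sum(x for x, t in zip(X, y) if t == 0) / (len(y) - sum(y))
    return (mean0 + mean1) / 2.0

def predict_threshold(threshold, x):
    return 1 if x > threshold else 0

# SYNTHETIC data: one covariate, larger values are always "reported".
X = [float(v) for v in range(100)]
y = [1 if v >= 50 else 0 for v in X]
sens, spec, err = cross_validate(X, y, fit_threshold, predict_threshold)
print(sens, spec, err)
```

Swapping `fit_threshold` and `predict_threshold` for a logistic regression fit and a 0.5 probability cut-off would give the kind of sensitivity, specificity and overall error rate figures quoted above.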

Model prediction
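Predicting the probability of reporting given only the covariates requires an explicit formulation that accounts for the proportion of cases sampled ($P_1$) and the proportion of controls sampled ($P_0$). The appropriate equation (Keating and Cherry 2004) is:

$$\Pr(\text{Reported} \mid \text{Covariates}) = \frac{\exp\left(\boldsymbol{\beta}^{*\prime}\mathbf{x} - \ln(P_1/P_0)\right)}{1 + \exp\left(\boldsymbol{\beta}^{*\prime}\mathbf{x} - \ln(P_1/P_0)\right)} \tag{3}$$

where $\boldsymbol{\beta}^{*\prime}\mathbf{x}$ is either the linear predictor described by Equation 1, or the logit-transformed probability arising from the random forest model.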
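The adjustment for case-control sampling can be sketched in Python as below. The sketch assumes the standard form of the correction, in which the log of the ratio of sampling fractions, ln(P1/P0), is subtracted from the case-control linear predictor before back-transforming. The sampling fractions and linear predictor value used here are hypothetical, since the source does not report them.

```python
import math

def corrected_probability(eta_star, p1, p0):
    """Reporting probability adjusted for case-control sampling.

    eta_star: linear predictor (or logit-transformed random forest
              probability) from the case-control fit.
    p1, p0:   proportions of all cases and of all controls that were sampled.
    """
    eta = eta_star - math.log(p1 / p0)
    return 1.0 / (1.0 + math.exp(-eta))

# HYPOTHETICAL values: every case sampled, 1% of controls sampled.
eta_star = 1.0
uncorrected = 1.0 / (1.0 + math.exp(-eta_star))
corrected = corrected_probability(eta_star, p1=1.0, p0=0.01)
print(round(uncorrected, 3), round(corrected, 3))
```

When cases are sampled at a higher rate than controls, the uncorrected model overstates the reporting probability, so the corrected value is lower; with equal sampling fractions the two coincide.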

Model prediction with application to exotic insects
We used Eq. 3 to estimate the reporting probability for Plant Health Australia's high priority pests of cross-sectoral concern. To do this, these species were scored for size, colour, pattern and morphology using the same criteria as those applied to the ALA records. The incursion size was arbitrarily set at 100 km² (10 km × 10 km), the observation intensity was set to the median (0.26 km⁻²), and the species was considered to be present as a pest. The model estimates can be rerun for different desired combinations of observation intensity and outbreak size, depending on what size outbreak authorities consider they are capable of eradicating. The 100 km² figure was chosen as it is commonly cited by management agencies as the largest insect invasion for which they have sufficient resources for a reasonable chance of eradication.
Analyses were undertaken using the R software environment (R Development Core Team 2017), including use of the "randomForest" package (Liaw and Wiener 2002).

Effect of features on reporting
The features we recorded had a very large impact on the estimated reporting probabilities via citizen science platforms into the ALA (Table 1). The odds of a beetle being reported were considerably higher than those of a bug (odds ratio = 2.2, Table 1), possibly reflecting the popularity of beetle collecting. Species considered pests had a much higher reporting probability (odds ratio = 15.4, Table 1), possibly arising from the increased visibility that their plant damage brings, but also probably arising from their higher abundance and range. The estimated range of the species and the estimated activity of citizen reports also had a significant positive effect on the probability of reporting (Table 1). In terms of the features of the beetles and bugs considered, those species not reported through the ALA citizen science channels are typically smaller, less colourful, less patterned and morphologically less interesting than those that are reported (Table 1). An example is the commonly found black larder beetle (Fig. 2b), which is unreported despite being a household pest. Indeed, despite the large number of sightings uploaded, some widespread common pest species had not been uploaded as of 30 June 2016. A further example of a widespread though unrecorded species is the green peach aphid (Myzus persicae) (Fig. 3a), despite it causing considerable economic loss during the period of the study by vectoring beet western yellows virus during the spring of 2014.

Table 1 Features, associated coefficients, odds ratios (coefficients exponentiated to base e) and associated confidence intervals (C.I.) influencing the citizen reporting probability of insect species to the Atlas of Living Australia. Values in parentheses indicate either the level of the factor associated with the coefficient, or for continuous measures the scaling of the parameter for a unit change associated with the coefficient. See Eq. (1) for model detail.
1 Only species from the Hemipteran (Bugs) and Coleopteran (Beetles) orders were considered
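To see what an odds ratio of this size means on the probability scale, the short Python sketch below applies the two odds ratios quoted in the text (2.2 for beetles versus bugs, 15.4 for pests) to a baseline reporting probability. The 5% baseline is a hypothetical value chosen for illustration, not an estimate from the model.

```python
def apply_odds_ratio(p_baseline, odds_ratio):
    """Multiply the baseline odds by an odds ratio and return a probability."""
    odds = p_baseline / (1.0 - p_baseline)
    new_odds = odds * odds_ratio
    return new_odds / (1.0 + new_odds)

baseline = 0.05  # HYPOTHETICAL baseline reporting probability

# Odds ratios as quoted in the text (Table 1).
p_beetle = apply_odds_ratio(baseline, 2.2)   # beetle rather than bug
p_pest = apply_odds_ratio(baseline, 15.4)    # pest rather than non-pest
print(round(p_beetle, 3), round(p_pest, 3))
```

Under this assumed baseline, being a beetle roughly doubles the reporting probability (to about 10%), while pest status raises it to about 45%. The effect of an odds ratio on the probability scale depends on the baseline, so these figures should not be read as general results.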

Predicted reporting probabilities
When the model was applied to a subset of the Plant Health Australia cross-sectoral high priority pest species (HPPs) for a given set range size (100 km²) and median citizen science observation intensity, the estimated yearly reporting probability ranged from a low of 2% (e.g. sugarcane sidewinder) to near 97% (Lychee longicorn beetle) (Tables S1 & S2 in Supplementary Material). Generally speaking, HPPs with very low estimated probabilities of reporting are dominated by the Hemiptera (Table S2 in Supplementary Material), whilst those with high probabilities of reporting are dominated by the Coleoptera (Table S1 in Supplementary Material). Insects that have high estimated reporting rates include the Colorado potato beetle, whose size, colour and distinctive pattern (Fig. 4a) result in an estimated yearly reporting probability of 78%. In contrast, the Russian wheat aphid (Fig. 4b) has a low predicted citizen reporting probability of 3%.

Discussion
Our model has quantitatively inferred the extent to which size, colour, pattern and morphology all influence the citizen reporting probability of insects. Although this finding is unsurprising, this is the first time that the reporting probability and the factors that influence it have been quantified. This enables a more objective evaluation of the contribution of unstructured online reporting platforms to plant biosecurity surveillance. Importantly, when applied to exotic pests of biosecurity concern, we have inferred that the passive citizen reporting probabilities for many (particularly small, nondescript bugs) would be considered insufficient for biosecurity surveillance needs. Recent incursions of exotic pests into Australia support this estimated low reporting probability. For example, the incursion of the Russian wheat aphid went unreported by citizen scientists for possibly two years whilst spreading over a considerable area in southern Australia. It was first detected by strategic surveillance. Likewise, the tomato potato psyllid (Bactericera cockerelli), another small, unremarkable species, was widespread, occurring on hundreds of premises in Western Australia before being detected and reported.

Fig. 3 (a, b) Contrasts in colour and pattern. Despite being a widespread pest, the green peach aphid has a low citizen science reporting probability on account of its small size (1-2 mm) and "plain" looks. In contrast, Calomela parilis from the leaf beetle genus is a citizen observer's treasure.
The citizen surveillance we have described here contributes to what is termed "general surveillance" within plant health (Hammond et al. 2016a), which is a catch-all phrase for describing surveillance that is not targeted. General surveillance activities are an important part of early detection and demonstrating area freedom (Hammond et al. 2016b). The citizen reporting probabilities we have estimated here can be used to infer the likely sensitivity of general surveillance for exotic species from the "passive" citizen component. The predictions we have made here are simplistic in their choice of citizen science observation intensity (simply the median). In reality, the citizen observation intensity will vary greatly depending on the incursion location in relation to citizen science activity. The implications of this could be explored in more detail (see Pocock et al. 2017, for an example) and used as a means of directing where targeted surveillance could augment citizen surveillance. This is an area of further work.
The implication of these results for citizen reporting as a form of surveillance will vary depending on the features of the pest species of concern. Industries and environmental managers whose assets are potentially impacted by species with low reporting probabilities will clearly need to implement more structured/active surveillance if they require higher surveillance sensitivity. The citrus industry, for example, probably cannot rely on passive unstructured citizen science data streams for surveillance of D. citri; some form of structured surveillance will be required. In contrast, the forestry industry may consider that citizen detection and reporting of pine sawyers is sufficient for its needs. Incorporating such inferred citizen surveillance reporting probabilities into the general surveillance framework is an area for further research. Targeted use of social media shows promise.
It is well known that citizen reporting rates are heavily biased in space and time (Isaac and Pocock 2015), and by the visibility of the organism in question. Here, we have demonstrated quantitatively further inter-species reporting biases relating to perceptions (pattern, colour, morphology). This finding is generalizable to most unstructured citizen science reporting platforms relating to animals and plants.
Although we have demonstrated the importance of physical features and availability for citizen reporting probability, motivations for reporting insects using online reporting portals are likely diverse and may change with time. This will be an ongoing challenge for the use of citizen surveillance.