1 Introduction

A fundamental challenge for the protection of agricultural productivity and trade in many countries is to confirm the absence of exotic pests and pathogens in arriving or departing cargo consignments. The volume and movement of agricultural goods are increasing rapidly (Chapman et al. 2017), and although biosecurity regulations require inspections, costs and practicalities limit the extent of sampling and testing.

Consignments are assumed to contain a set of units such as seeds (Constable et al. 2018), prawns, grain, flowers (Hepworth 1996), or other perishable goods. Units are selected and tested collectively in groups, where group size may be determined by the efficacy of the testing process (e.g. prawns), or by the cost of testing and other practical considerations (e.g. seeds). The test may be non-destructive (e.g. cut flowers) or destructive (e.g. seeds), and the test result is either positive or negative for each group. Typically, the number of contaminated units in a group that tests positive remains unknown; the detected contamination must then either be actioned or the consignment rejected for importation, both at considerable cost. We focus on biosecurity, but forms of group testing are also used in other contexts including COVID-19 management (Lokuge et al. 2021) and environmental DNA surveys (Furlan et al. 2016).

Effective group-sampling schemes must take account of the potential heterogeneity of contamination within consignments, whether this be due to the clustering of contaminants within consignments of goods, or clustering related to country of origin or other spatial variation. However, because the extent of clustering within groups may be unidentifiable, it is common to assume that groups are random samples of units. The number of groups tested is often small, and zero contamination is detected in the great majority of cases, adding to the challenge of analysing test results and assessing risk. Testing errors have sometimes been allowed for (Cowling et al. 1999; Liu et al. 2012), although only with randomly formed groups. We assume throughout that testing is perfect, to simplify the exposition given the other innovations made, and because this is realistic in the case study in Sect. 5.

A key concept is the leakage (or slippage), which is the number of contaminated units that are imported. If any contamination is detected in the sampled groups, the leakage is zero, because importation is then blocked. If there is contamination in the consignment but none is detected in the inspected sample, then leakage is positive.

Group testing regimes are often designed by specifying a maximum tolerable or plausible prevalence of contaminated units (\(\tilde{p}\)). The number of groups to be sampled is calculated such that the probability that contamination occurs in the sample reaches some specified high level (e.g. 0.95) if the prevalence equals \(\tilde{p}\), so that leakage is then unlikely. These calculations can be made by noting that if groups are random samples and the number of contaminated units in the consignment is treated as fixed, then the number of contaminated units in the sample is hypergeometrically distributed (e.g. IPPC 2008). The binomial distribution is sometimes used as a simpler approximation, provided the sampling fraction is small. Overdispersed extensions of binomial distributions have also been used. Trouvé and Robinson (2021) considered a two-level model with normal random effects to allow for between-consignment heterogeneity. They conducted sensitivity analysis on the consignment-level variance component rather than estimating this parameter; analytical formulae are not available because the required probabilities involve integration over the latent variable. Beta-binomial distributions have also been suggested in non-group-testing contexts (Venette et al. 2002; IPPC 2008; Trouvé et al. 2022).
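As a minimal sketch of this design calculation (our own illustration, with assumed values of the consignment size and design prevalence, not taken from any cited scheme), the required sample size under the hypergeometric model and its binomial approximation can be computed in R:

N <- 80000                        # consignment size (illustrative assumption)
p_tilde <- 0.005                  # design prevalence (illustrative assumption)
T_X <- round(p_tilde * N)         # postulated number of contaminated units

## P(zero contaminated units among n sampled) under the hypergeometric
p_miss <- function(n) phyper(0, T_X, N - T_X, n)

## smallest n giving detection probability >= 0.95
n_hyper <- which(sapply(1:2000, p_miss) <= 0.05)[1]
## binomial approximation, valid when n << N
n_binom <- ceiling(log(0.05) / log(1 - p_tilde))
c(hypergeometric = n_hyper, binomial = n_binom)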

This paper was motivated by a specific biosecurity problem faced by an operational division of the Australian government. Australia imports numerous consignments of seeds for a variety of agricultural commodities (e.g. Constable et al. 2018; Dall et al. 2019). Biosecurity regulations require that consignments are tested and are rejected for importation if contamination is detected. Because import volumes are increasing and testing is expensive and destructive, there is an imperative to reduce costs for importers, but also to understand the likely consequences for risk with any changes to the regulations.

Motivated by these challenges, we develop models and results allowing quantities relating to leakage to be inferred from group test results, allowing for between-consignment heterogeneity and possible clustering of contamination within groups. This contrasts with existing literature focussed on the probability of detecting contamination assuming \(\tilde{p}\) rather than inferring leakage. Section 2 defines our notation and summarises selected models in the literature. Section 3 describes our proposed model when groups are assumed to be randomly formed. We use beta distributions to reflect heterogeneity or uncertainty about the prevalence in a particular consignment. Our model is similar to that of Trouvé et al. (2022) but allows for group testing rather than full observation of the number of contaminated units. The distribution of the leakage and the worst case for expected leakage are derived.

Section 4 generalises this model to allow for clustering of contamination within groups. A heuristic approximation reveals that the parameter that controls clustering within groups is not identifiable, and so must be assumed a priori. We show, however, that if the model is estimated from group testing outcome data, then estimates of the probability of leakage from a given consignment are approximately correct even if the clustering parameter is mis-specified. This is a novel and important finding, because it means that the possibility of clustering can be ignored when the primary concern is with the number of consignments with undetected contamination. The estimated number of contaminated units entering the country, however, is sensitive to clustering. Section 5 is a case study, showing how our new framework would lead to more informed biosecurity decisions in a particular example.

2 Problem Definition and Review

2.1 Problem Definition

We consider consignments (e.g. shipments, seed lots, or subdivisions of shipments), containing units (e.g. seeds) which are grouped into equally sized groups. Typically, there are many consignments of a particular kind imported into a country annually, and we refer to ongoing importation of a particular type of product in a particular form as a pathway. Each consignment may or may not contain some contaminated units, and the focus is on inferring the probability of importation of contaminated units, using models which may be assumed a priori, or estimated from historical data on multiple consignments’ inspection results.

Each consignment is assumed to contain N units, which are grouped into B equally sized groups, each containing \(\bar{N}=N/B\) units. We use the bar notation as \(\bar{N}\) may be considered the mean group size, although we assume throughout this paper that every group is of size \(\bar{N}\). We would expect results to be very similar if group sizes vary, provided their relative variation is small. When group sizes vary more substantially, our results will be less applicable, although we expect that the same general trends will be seen. We also assume that N is divisible by \(\bar{N}\), which involves at most a negligible approximation for large consignments.

A simple random sample without replacement of b groups is selected. Let \(X_{ij}\) be the number of contaminated units in group j of consignment i, \(j=1, \ldots , B\), so that \(X_{ij}\) may take on values \(0, 1, \ldots , \bar{N}\). We denote the indexes of the sampled groups as \(1, \ldots , b\) and the non-sampled groups are \(b+1, \ldots , B\). The total number of units in selected groups is \(n=b\bar{N}\).

The testing procedure is such that we do not directly observe \(X_{ij}\) even for sampled groups. Instead, all that is available for each sampled group j is \(Y_{ij}\), which equals 1 if there is any contamination in the group and 0 otherwise; i.e. \(Y_{ij} = I \left( X_{ij}>0 \right) \) where I(.) is the indicator function. We write \(T_{Xi}=\sum _{j=1}^B X_{ij}\) and \(t_{xi}=\sum _{j=1}^{b} X_{ij}\) for the population and sample numbers of contaminated units, respectively, in consignment i. We let \(T_{Yi}=\sum _{j=1}^{B} Y_{ij}\) and \(t_{yi}=\sum _{j=1}^{b} Y_{ij}\) be the population and sample numbers of contaminated groups, respectively.

We define the leakage (sometimes referred to as slippage) to equal \(L_i=T_{Xi} I\left( t_{yi}=0 \right) \), which is the number of contaminated units imported in consignment i, reflecting the fact that consignments are blocked if any contamination is detected.
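As a concrete illustration of this notation (a minimal sketch with arbitrary values, not data from the paper), the following R code simulates one consignment with randomly placed contamination, forms groups, and computes \(t_{yi}\) and \(L_i\):

set.seed(1)
N_bar <- 100; B <- 800; b <- 94          # illustrative sizes
N <- N_bar * B
T_X <- 10                                # contaminated units in the consignment
pos <- sample.int(N, T_X)                # random unit positions (random group formation)
X <- tabulate((pos - 1) %/% N_bar + 1, nbins = B)   # X_j: contaminated units per group
Y <- as.integer(X > 0)                   # group test outcomes Y_j = I(X_j > 0)
t_y <- sum(Y[1:b])                       # positives among the b sampled groups
L <- sum(X) * (t_y == 0)                 # leakage: T_X if nothing detected, else 0
c(t_y = t_y, leakage = L)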

2.2 Review of Existing Models

2.2.1 Review of Models assuming Random Group Formation

Suppose that groups are simple random samples without replacement of units, constrained to be non-overlapping. The sampling distribution of \(t_{yi}\), treating \(T_{Xi}\) as a fixed parameter, can then be obtained without any further modelling assumptions. In this case, \(t_{xi}\) follows a hypergeometric distribution with parameters N, \(T_{Xi}\) and n (e.g. McSorley and Littel 1993). The distribution of \(t_{yi}\) is less obvious but has been derived by Theobald and Davie (2014). Interest has often focussed on the probability that \(t_{yi}=0\). The sample size n can be obtained such that \(\mathbb {P}\left( t_{yi}=0 \right) \) is sufficiently low given a postulated value \(T_{Xi}=\tilde{p} N\) (e.g. Lane et al. 2018; Constable et al. 2018). When \(n \ll N\), the distribution of \(t_{xi}\) given \(T_{Xi}\) is approximately \(\text{ Bin }\left( n , T_{Xi}/N \right) \), leading to a simpler formula for n. More generally, the symmetries of the hypergeometric distribution allow for four distinct binomial approximations, with performance dependent on conditions (Brunk et al. 1968).

Hepworth (1996, 2019) also considered binomial models for group sampling with random group formation, focusing on exact inference and bias. Cowling et al. (1999) considered inference about the prevalence from unclustered group sampling. Neither considered leakage.

2.2.2 Review of Models assuming Non-Random Group Formation

It is possible that contamination tends to co-occur in the same group or groups. This will result in \(X_{ij}\) and \(t_{xi}\) being overdispersed relative to the hypergeometric or binomial distributions discussed under the previous heading. McArdle et al. (1990) discussed the possibility of clustering in sampling of plots for the presence of rare species. IPPC (2008) commented that clustering of contamination within consignments is possible and suggested the beta-binomial distribution to model it. Venette et al. (2002) also suggested the beta-binomial distribution to model clustering of \(t_{xi}\), noting that this distribution is approximated by the negative binomial when the contamination rate is low (citing Madden et al. 1996). None of these papers discussing beta-binomial models considered group testing where only \(t_{yi}\) is available.

2.2.3 Review of Models allowing for Consignment-Level Heterogeneity

Trouvé and Robinson (2021) considered the situation where the only data available are whether any detections are made from a whole consignment. They suggested an overdispersed model for \(t_{xi}\) relative to the binomial distribution, with overdispersion controlled by a parameter \(\alpha \). This results in higher probability that \(t_{yi}=0\) when there is overdispersion. They used data from multiple consignments with different sample sizes per consignment, assuming a common value of \(\alpha \) and \(T_{Xi}/N\), enabling estimation of both \(T_{Xi}/N\) and \(\alpha \). Simulation showed that the bias from ignoring overdispersion can be substantial.

Trouvé et al. (2022) modelled consignment-level heterogeneity by assuming that a latent propensity for each consignment follows a beta distribution, equivalent to the model we will use in the next section. However, they assumed that the counts \(t_{xi}\) of contaminated units are observed and did not consider group testing.

3 Models for Randomly Formed Groups

3.1 Model Definition

We propose the following model:

$$\begin{aligned} X_{ij} \vert p_i&\overset{ \tiny \text{ i.i.d. }}{\sim } \text{ Bin }\left( \bar{N} , p_i \right) ; \text{ and } \end{aligned}$$
(1)
$$\begin{aligned} p_i&\overset{\tiny \text{ i.i.d. }}{\sim } \text{ Beta } \left( \alpha ,\beta \right) \end{aligned}$$
(2)

for consignments i and groups \(j=1, \ldots , B\), where \(\alpha \) and \(\beta \) are shape parameters. Note that different consignments are assumed to be independent. When we are considering a single consignment of interest, as is often the focus in biosecurity processes, and there are no suitable data to estimate \(\alpha \) and \(\beta \), then (2) is a prior distribution representing the state of knowledge or uncertainty about the prevalence of contamination. If historical data from a set of consignments are available, then \(\alpha \) and \(\beta \) can be estimated by maximum likelihood.

Note that (1) and (2) imply that \(T_{Xi} \sim \text{ BB }\left( N,\alpha ,\beta \right) \), where BB denotes the beta-binomial distribution. We write \(\mu =\alpha /\left( \alpha +\beta \right) \) so that \(\mathbb {E}\left( T_{Xi}/N\right) =\mathbb {E}\left( p_i\right) =\mu \).
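For instance (a sketch with illustrative parameter values, chosen only for demonstration), the beta-binomial marginal of \(T_{Xi}\) and the identity \(\mathbb {E}\left( T_{Xi}/N\right) =\mu \) can be checked by simulation:

set.seed(2)
alpha <- 0.5; beta_ <- 200; N <- 80000   # illustrative values
p <- rbeta(1e5, alpha, beta_)            # consignment propensities, per (2)
T_X <- rbinom(1e5, N, p)                 # sum of B binomials in (1) is Bin(N, p) given p
c(simulated = mean(T_X) / N, mu = alpha / (alpha + beta_))   # should agree closely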

Under group testing, we do not know \(t_{xi}\), but instead only observe \(t_{yi}\) for one or more consignments i. Under the above model,

$$\begin{aligned} Y_{ij} \vert p_i&\overset{\tiny \text{ i.i.d. }}{\sim } \text{ Ber }\left( \phi _i \right) ; \text{ and } \end{aligned}$$
(3)
$$\begin{aligned} t_{yi} \vert p_i&\overset{\tiny \text{ indep }}{\sim } \text{ Bin } \left( b , \phi _i \right) ; \text{ where } \end{aligned}$$
(4)
$$\begin{aligned} \phi _i&= 1-\left( 1-p_i\right) ^{\bar{N}}. \end{aligned}$$
(5)

As \(\phi _i\) is a function of \(p_i\), it is also a random variable. To emphasise this functional relationship, we will write \(\phi _i = \phi \left( p_i \right) \) when convenient. The marginal probability of \(t_{yi}\) is:

$$\begin{aligned} \mathbb {P}\left( t_{yi} \right) &= \int \binom{b}{t_{yi}} \phi (p)^{t_{yi}} \left( 1 - \phi (p) \right) ^{b-t_{yi}} p^{\alpha -1} \left( 1 - p \right) ^{\beta -1} \, dp \, \big / \, B\left( \alpha , \beta \right) \\ &= \int \binom{b}{t_{yi}} \left\{ 1 - \left( 1-p\right) ^{\bar{N}} \right\} ^{t_{yi}} p^{\alpha -1} \left( 1 - p \right) ^{\bar{N}\left( b-t_{yi}\right) +\beta -1} \, dp \, \big / \, B\left( \alpha , \beta \right) . \end{aligned}$$
(6)

If observations of \(t_{yi}\) for a sample of m consignments are available, then the likelihood is given by the product of (6) over \(i=1, \ldots , m\). Numerical integration is required.
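A direct implementation is straightforward. The sketch below is our own illustration (the data vector and starting values are hypothetical assumptions) of evaluating the log-likelihood based on (6) with R's integrate and maximising it with optim:

## log-likelihood of group-test counts t_y over m consignments, via (6)
loglik <- function(log_ab, t_y, b, N_bar) {
  a <- exp(log_ab[1]); be <- exp(log_ab[2])   # enforce positive shape parameters
  one <- function(ty) {
    integrand <- function(p)
      choose(b, ty) * (1 - (1 - p)^N_bar)^ty *
        p^(a - 1) * (1 - p)^(N_bar * (b - ty) + be - 1) / beta(a, be)
    integrate(integrand, 0, 1)$value
  }
  sum(log(sapply(t_y, one)))
}

## hypothetical data: 20 clean consignments and one with 2 positive groups
t_y <- c(rep(0, 20), 2)
fit <- optim(c(log(0.5), log(100)), loglik, t_y = t_y, b = 94, N_bar = 100,
             control = list(fnscale = -1))   # fnscale = -1 maximises
exp(fit$par)                                  # (alpha_hat, beta_hat)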

3.2 Distribution of Contamination Levels Conditional on Group Test Results

Given \(t_{yi}\) for a particular consignment i, the conditional distribution of \(p_i\) is:

$$\begin{aligned} \mathbb {P}\left( p_i \vert t_{yi} \right) &\propto \mathbb {P}\left( p_i\right) \mathbb {P}\left( t_{yi} \vert p_i \right) = p_i^{\alpha -1} \left( 1 - p_i \right) ^{\beta -1} \phi (p_i)^{t_{yi}} \left( 1 - \phi (p_i) \right) ^{b-t_{yi}} / C \\ &= \left\{ 1 - \left( 1-p_i\right) ^{\bar{N}} \right\} ^{t_{yi}} p_i^{\alpha -1} \left( 1 - p_i \right) ^{\bar{N}\left( b-t_{yi}\right) +\beta -1} / C \end{aligned}$$
(7)

where \(C = \int _0^1 \left\{ 1 - \left( 1-p_i\right) ^{\bar{N}} \right\} ^{t_{yi}} p_i^{\alpha -1} \left( 1 - p_i \right) ^{\bar{N}\left( b-t_{yi}\right) +\beta -1} dp_i\). For the special case \(t_{yi}=0\), the right-hand side of (7) is proportional to a beta density, with

$$\begin{aligned} \left( p_i \vert t_{yi}=0 \right) \; \sim \; \text{ Beta } \left( \alpha , \beta + \bar{N} b \right) . \end{aligned}$$
(8)

This is a standard result in binomial models with beta priors.
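For example (a one-line sketch with illustrative prior parameters), the updated distribution after a clean test result is immediate in R:

alpha <- 0.5; beta_ <- 100; N_bar <- 100; b <- 94     # illustrative values
## posterior of p_i given t_y = 0 is Beta(alpha, beta + N_bar * b), per (8)
qbeta(c(0.50, 0.95), alpha, beta_ + N_bar * b)        # posterior median and 95th percentile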

The distribution of \(T_{Xi}\) conditional on an observed value of \(t_{yi}\) would often be of interest. We present in Supplementary Materials a new exact result for this conditional distribution (Section S1), an algorithm for sampling from it (Section S2) and a simpler approximate algebraic solution when \(p_i\) is close to zero with high probability (Section S3).

3.3 Properties of the Leakage

We now consider the distribution of the leakage \(L_i\). We write \(f_{BB}\left( k;n,\alpha ,\beta \right) \) to denote the probability function of a beta-binomial random variable with parameters n, \(\alpha \) and \(\beta \), evaluated at k. Assuming (1) and (2), the probability function of \(L_i\) is defined by:

$$\begin{aligned} \mathbb {P}\left[ L_i=k\right] = \left\{ \begin{array}{ll} f_{BB} \left( k ; N-n , \alpha , \beta + n \right) \dfrac{B \left( \alpha ,\beta +n\right) }{B \left( \alpha ,\beta \right) } &\quad \text{ if } k=1, \ldots , N-n \\[2ex] \dfrac{B\left( \alpha ,\beta +N\right) }{B\left( \alpha ,\beta \right) } + 1 - \dfrac{B\left( \alpha ,\beta +n\right) }{B\left( \alpha ,\beta \right) } &\quad \text{ if } k=0 \end{array} \right. \end{aligned}$$
(9)

where B(., .) is the beta function (see Supplementary Section S4 for details). So \(L_i\) follows a zero-inflated beta-binomial \(\left( N-n, \alpha , \beta + n \right) \) distribution. We also consider \(\mathbb {E}\left( L_i\right) \) and derive an upper bound for it. Note that

$$\begin{aligned} \mathbb {E}\left( L_i|p_i\right) = \mathbb {E}\left[ T_{Xi} I \left( t_{yi}=0 \right) | p_i \right] = (N-n)\, p_i\, \mathbb {P}\left( t_{yi}=0|p_i\right) = (N-n)\, p_i \left( 1-p_i\right) ^n . \end{aligned}$$
(10)

Using well-known properties of the beta distribution, we immediately obtain

$$\begin{aligned} \mathbb {E}\left( L_i\right) &= \mathbb {E}\left[ \mathbb {E}\left( L_i \vert p_i \right) \right] \\ &= (N-n)\, \mathbb {E}\left[ p_i \left( 1-p_i\right) ^n \right] \\ &= \left( N-n\right) B\left( \alpha +1 , \beta +n \right) / B \left( \alpha , \beta \right) . \end{aligned}$$
(11)

To maximise \(\mathbb {E}\left( L_i\right) \) with respect to \(\alpha \) and \(\beta \), note first that \(\mathbb {E}\left( L_i|p_i \right) \) in (10) is maximised by \(p_i=p_{\max }\) where \(p_{\max }=\left( n+1\right) ^{-1}\) by elementary calculus. It follows that

$$\begin{aligned} \mathbb {E}\left( L_i\right) &= \mathbb {E}\left[ \mathbb {E}\left( L_i|p_i\right) \right] \le \mathbb {E}\left[ L_i|p_i=p_{\max }\right] \\ &= (N-n)\, p_{\max }\left( 1-p_{\max }\right) ^n = (N-n)\, n^{n} \left( n+1 \right) ^{-(n+1)} \end{aligned}$$

with equality obtaining if and only if \(p_i=p_{\max }\) with probability 1. This is the limiting case of the beta prior in (2) when \(\alpha \) and \(\beta \) both tend to infinity with \(\beta =n\alpha \). Therefore, the worst case prior for \(p_i\) (i.e. the case which maximises the expected leakage) is the distribution where \(p_i=\left( n+1\right) ^{-1}\) with probability 1.

Lane et al. (2018, p. 41) also found that \(p_{\max }=1/(n+1)\) in a similar setup, but using a hypergeometric model. However, their leakage rate stated in (E.1) appears to be higher than the one derived here by a factor of \(\left( (N-n)/N \right) ^2\), possibly because an upper bound was derived rather than \(\mathbb {E}(L_i)\) itself, with a further approximation to a combinatorial term between their equations (D.5) and (E.1).

It is notable that the worst case is not simply when \(p_i\) is high (or has high prior expected value). This is because when \(p_i\) is high, it is likely that at least one sampled unit will be contaminated, in which case the consignment would be rejected and no leakage would occur. The worst case is when \(p_i\) is small enough that there may well be no contamination in the sample, but large enough that the consignment may contain a substantial number of contaminated units. Figure 1 illustrates how, as \(p_i\) increases, the expected number of contaminated units (\(\mathbb {E}\left( T_{Xi}\right) \)) increases, the probability of non-detection (\(\mathbb {P}\left( t_{yi}=0 \right) \)) decreases, and the expected leakage (\(\mathbb {E}\left\{ T_{Xi} I\left( t_{yi}=0\right) \right\} \)) increases to a peak at \(p_i=1/(n+1)\) and then declines.
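The shape in Fig. 1 is easy to reproduce from (10); a sketch using the same values \(N=1000\) and \(n=20\):

N <- 1000; n <- 20
EL <- function(p) (N - n) * p * (1 - p)^n     # conditional expected leakage, per (10)
p_max <- 1 / (n + 1)                          # maximiser from elementary calculus
curve(EL, from = 0, to = 0.3, xlab = "p_i", ylab = "E(L_i | p_i)")
abline(v = p_max, lty = 2)
c(p_max = p_max, peak = EL(p_max))            # peak equals (N-n) n^n (n+1)^(-(n+1))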

Fig. 1

Illustration of the conditional expected leakage \(\mathbb {E}\left\{ T_{Xi} I\left( t_{yi}=0\right) \vert p_i \right\} \) as a function of \(p_i\) when \(N=1000\) and \(n=20\), assuming model (1)–(2) (randomly formed groups)

4 Models for Clustered Groups

4.1 Model Definition

If contamination tends to strike multiple units in the same group, then the binomial model (1) is not appropriate, and we would expect to see overdispersion relative to this model. This section develops nested beta models to capture variation both between consignments and between groups within consignments. We assume the following model and prior structure:

$$\begin{aligned} X_{ij} \vert \left( p_i,p_{ij} \right)&\overset{\tiny \text{ indep }}{\sim }\text{ Bin }\left( \bar{N} , p_{ij} \right) . \end{aligned}$$
(12)
$$\begin{aligned} p_{ij} \vert p_i&\overset{\tiny \text{ indep }}{\sim } \text{ Beta } \left( \theta p_i , \theta \left( 1 - p_i \right) \right) \end{aligned}$$
(13)
$$\begin{aligned} p_i&\sim \text{ Beta } \left( \alpha , \beta \right) \end{aligned}$$
(14)

The hyper-parameter \(\theta >0\) controls the level of clustering; when \(\theta =\infty \), \(p_{ij}=p_i\) and (12) is the same as (1). At the other extreme, as \(\theta \downarrow 0\), the beta distribution in (13) becomes Bernoulli with parameter \(p_i\) (see Supplementary Materials Section S5). The parameter \(\theta \) is not straightforward to interpret. Applying elementary properties of the beta distribution, the intracluster correlation between the contamination indicators (1 or 0) for two different units in the same group, conditional on the consignment propensity \(p_i\), is equal to \(1/\left( \theta +1\right) \). However, the intracluster correlation of a rare binary indicator is also difficult to visualise. We will introduce a more interpretable parameter in place of \(\theta \) in Sect. 4.4, but some further discussion in terms of \(\theta \) is needed first.

We treat \(\theta \) as a known hyper-parameter and propose conducting sensitivity analysis over a range of plausible values of \(\theta \). We take this approach because the model appears to be practically unidentifiable from observations of \(t_{yi}\) if \(\theta \) is unknown. This is not surprising given that all we observe is whether group totals \(X_{ij}\) are zero or not, so identifying clustering within groups is unlikely to be feasible. The heuristic in Sect. 4.4 will confirm this, but will also show that the estimated probability of leakage occurring in a consignment is insensitive to the assumed value of \(\theta \).

Conditional on \(p_i\), the \(Y_{ij}\) are independent with

$$\begin{aligned} \mathbb {P}\left[ Y_{ij}=1 \vert p_i \right] = \phi _i \end{aligned}$$
(15)

where

$$\begin{aligned} \phi _i = 1 - \mathbb {E}\left[ \left( 1-p_{ij}\right) ^{\bar{N}} \vert p_i \right] = 1 - \frac{ B\left( \theta p_i , \theta \left( 1 - p_i \right) + \bar{N} \right) }{ B\left( \theta p_i , \theta \left( 1 - p_i \right) \right) } . \end{aligned}$$
(16)

Note that \(\phi _i\) is a function of \(p_i\), which we will emphasise by writing \(\phi _i=\phi \left( p_i\right) \) when convenient. By elementary manipulations of beta functions, (16) can be re-expressed as:

$$\begin{aligned} \phi _i = 1 - \left( 1 - p_i \right) \prod _{k=1}^{\bar{N}-1} \left[ 1 - p_i \theta \left( \theta + k \right) ^{-1} \right] . \end{aligned}$$
(17)

Since \(\theta \left( \theta + k \right) ^{-1} \le 1\), it is clear that the right-hand side of (17) has maximum value of \(\phi _i=1-\left( 1-p_i\right) ^{\bar{N}}\), which is achieved as \(\theta \rightarrow \infty \). This corresponds to the model in Sect. 3 and is the best case for \(\phi _i\) (i.e. greatest chance of detecting contamination). It is also clear that the worst possibility for \(\phi _i\) is \(\theta \downarrow 0\), in which case \(\phi _i=p_i\).
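The following sketch evaluates (17) and confirms the two extremes numerically, with very large and very small values of \(\theta \) standing in for the limits (parameter values are illustrative):

## detection probability phi(p) for one group, per (17)
phi <- function(p, theta, N_bar) {
  k <- seq_len(N_bar - 1)
  1 - (1 - p) * prod(1 - p * theta / (theta + k))
}
p <- 0.01; N_bar <- 100
c(theta_small = phi(p, 1e-9, N_bar),          # ~ p (worst case)
  theta_large = phi(p, 1e9, N_bar),           # ~ 1 - (1-p)^N_bar (best case)
  unclustered = 1 - (1 - p)^N_bar)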

4.2 Likelihood and Posterior

Note that

$$\begin{aligned} t_{yi} \vert p_i \sim \text{ Bin } \left( b , \phi _i \right) \end{aligned}$$
(18)

and therefore

$$\begin{aligned} \mathbb {P}\left( t_{yi} \right) = \int {b \atopwithdelims ()t_{yi}} \phi (p_i)^{t_{yi}} \left( 1 - \phi (p_i) \right) ^{b-t_{yi}} p_i^{\alpha -1} \left( 1 - p_i \right) ^{\beta -1} dp_i / B\left( \alpha ,\beta \right) \end{aligned}$$
(19)

where \(\phi (p_i)\) is defined by (16). If observations of \(t_{yi}\) for a sample of consignments are available, then the likelihood is given by the product of (19) over consignments i, with numerical integration required in practice. The posterior of \(p_i\) for a given consignment conditional on its value of \(t_{yi}\) is proportional to:

$$\begin{aligned} \mathbb {P}\left( p_i \vert t_{yi} \right) \propto \mathbb {P}\left( p_i \right) \mathbb {P}\left( t_{yi} \vert p_i \right) \propto p_i^{\alpha -1} \left( 1-p_i \right) ^{\beta -1} \phi (p_i)^{t_{yi}} \left( 1 - \phi (p_i) \right) ^{b-t_{yi}} . \end{aligned}$$
(20)

The posterior of \(T_{Xi}\) could also be obtained, using the fact that \(X_{ij} \vert p_i \; \sim \text{ BB } \left( \bar{N} , \theta p_i, \theta \left( 1 - p_i \right) \right) \) are independent.

4.3 Leakage for the Nested Beta Model

The distribution of the leakage is much more complex than in Sect. 3.3, where groups were unclustered. We focus on the probability that the leakage is zero and on the expected leakage:

$$\begin{aligned} \mathbb {P}\left( L_i=0 \right) &= \mathbb {P}\left( t_{yi}>0 \text{ or } T_{Yi}=0 \right) = \mathbb {P}\left( t_{yi}>0 \right) + \mathbb {P}\left( T_{Yi}=0 \right) \\ &= \mathbb {E}\left[ \mathbb {P}\left( t_{yi}>0 \vert p_i \right) \right] + \mathbb {E}\left[ \mathbb {P}\left( T_{Yi}=0 \vert p_i \right) \right] \\ &= \mathbb {E}\left[ 1 - \left( 1 - \phi _i \right) ^b + \left( 1 - \phi _i \right) ^B \right] \end{aligned}$$
(21)
$$\begin{aligned} \mathbb {E}\left( L_i | p_i \right) &= \mathbb {E}\left\{ \mathbb {E}\left[ I\left( t_{yi}=0\right) T_{Xi} \vert p_i, p_{i1}, \ldots , p_{iB} \right] \vert p_i \right\} \\ &= \mathbb {E}\left\{ \left[ \prod _{j=1}^b \left( 1 - p_{ij} \right) ^{\bar{N}} \right] \left[ \sum _{j=b+1}^B \bar{N} p_{ij} \right] \Big \vert \, p_i \right\} \\ &= \left( 1 - \phi _i \right) ^b \left( B-b\right) \bar{N} p_i \end{aligned}$$
(22)
$$\begin{aligned} \mathbb {E}\left( L_i \right) = \mathbb {E}\left[ \mathbb {E}\left( L_i | p_i \right) \right] = \mathbb {E}\left[ \left( 1 - \phi _i \right) ^b \left( B-b\right) \bar{N} p_i \right] . \end{aligned}$$
(23)

These quantities can be obtained by numerical integration using the distribution of \(p_i\) in (14) and the definition of \(\phi _i\) as a function of \(p_i\) in (16).
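A sketch of these calculations in R follows (illustrative parameter values; the two integrals correspond to (21) and (23), and phi implements (16) via the product form (17)):

phi <- function(p, theta, N_bar) {             # detection probability, per (17)
  k <- seq_len(N_bar - 1)
  1 - (1 - p) * prod(1 - p * theta / (theta + k))
}
leakage <- function(alpha, beta_, theta, N_bar, B, b) {
  phi_v <- Vectorize(function(p) phi(p, theta, N_bar))
  f0 <- function(p) (1 - (1 - phi_v(p))^b + (1 - phi_v(p))^B) * dbeta(p, alpha, beta_)
  fE <- function(p) (1 - phi_v(p))^b * (B - b) * N_bar * p * dbeta(p, alpha, beta_)
  c(P_L0 = integrate(f0, 0, 1)$value,          # P(L = 0), per (21)
    E_L  = integrate(fE, 0, 1)$value)          # E(L), per (23)
}
leakage(alpha = 0.5, beta_ = 100, theta = 5, N_bar = 100, B = 800, b = 94)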

Given that \(\theta \) is unlikely to be identifiable in practice, we consider the worst case value of \(\theta \) for the expected leakage. Substituting (17) into (22), we obtain

$$\begin{aligned} \mathbb {E}\left( L_i \vert p_i \right) &= \left\{ \left( 1 - p_i \right) \prod _{k=1}^{\bar{N}-1} \left[ 1 - p_i \theta \left( \theta + k \right) ^{-1} \right] \right\} ^b \left( B-b\right) \bar{N} p_i \end{aligned}$$
(24)
$$\begin{aligned} &= \left\{ \left( 1 - p_i \right) \prod _{k=1}^{\bar{N}-1} \left[ 1 - p_i + p_i k \left( \theta + k \right) ^{-1} \right] \right\} ^b \left( B-b\right) \bar{N} p_i . \end{aligned}$$
(25)

By inspection, it is obvious that the right-hand side of (25) is maximised by \(\theta =0\), in which case it becomes

$$\begin{aligned} \mathbb {E}\left[ L_i \vert p_i ; \theta =0 \right] =\left( N - n\right) p_i \left( 1 - p_i \right) ^{b} . \end{aligned}$$
(26)

Because \(\theta =0\) maximises \(\mathbb {E}\left( L_i \vert p_i \right) \) for every \(p_i\), it must also maximise \(\mathbb {E}\left( L_i \right) \). By elementary maximisation, the right-hand side of (26) is maximised with respect to \(p_i\) by \(1/(b+1)\). Using arguments similar to those made in Sect. 3.3, it follows that the worst case for \(\mathbb {E}\left( L_i\right) \) with respect to \(\theta \), \(\alpha \) and \(\beta \) is \(\theta =0\) and \(\alpha , \beta \rightarrow \infty \) with \(\beta =b \alpha \).

4.4 Heuristic Approximation of the Model for Rare Events

We now consider approximations for the distribution of \(t_{yi}\), to show that \(\theta \) is indeed not identifiable in practice, and to determine the effect of assuming an incorrect value of \(\theta \). We consider the case where \(\beta \gg \alpha \), in which \(p_i\) is small with high probability, as is often so in practice (including in the case study in Sect. 5). We start with the following approximation for \(\phi _i\) as defined in (17):

$$\begin{aligned} \phi _i \approx 1 - \left( 1 - p_i \right) ^{\bar{N}\lambda ^{-1}} , \end{aligned}$$
(27)

where we define

$$\begin{aligned} \lambda = \bar{N} \; \big / \; \left\{ 1 + \sum _{k=1}^{\bar{N}-1} \theta \left( \theta + k \right) ^{-1} \right\} . \end{aligned}$$
(28)

For details of this and the other approximations made in this subsection, see Supplementary Materials Section S6. The approximation relies on \(p_i \ll 1\); it is exact when \(\theta =0\).

We show in Supplementary Section S6.1 that functions of beta random variables of the form seen in (27) can be approximated by beta distributions provided the second shape parameter is large. Using this result, we obtain

$$\begin{aligned} \phi _i \; \overset{a}{\sim }\ \; \text{ Beta } \left( \alpha , \beta \bar{N}^{-1} \lambda \right) \end{aligned}$$
(29)

Together with (18), this implies that

$$\begin{aligned} t_{yi} \overset{a}{\sim }\ \text{ BB } \left( b , \alpha , \beta \bar{N}^{-1} \lambda \right) \end{aligned}$$
(30)

Approximations (29) and (30) are based on \(\beta \gg \alpha \), but are also exact when \(\theta =0\).

The parameter \(\lambda \) has a straightforward interpretation as an approximate mean (within-group) outbreak size, provided \(\beta \gg \alpha \), because \(\mathbb {E}\left( X_{ij} \vert X_{ij} \ge 1 \right) \approx \lambda \). (See Supplementary Section S6.4 for details.) Note that \(\lambda \) is a function of \(\theta \). When \(\theta =\infty \), then \(\lambda =1\), reflecting that when there is no clustering, it will be rare for there to be more than one contamination in a group. When \(\theta =0\), clustering is perfect and \(\lambda =\bar{N}\), reflecting that if any units in a group are contaminated, then they are all contaminated, in this extreme case. We will discuss clustering in terms of the mean outbreak size \(\lambda \), rather than the harder to interpret measure \(\theta \), for the remainder of the paper.
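The mapping (28) from \(\theta \) to \(\lambda \) is simple to compute; the sketch below checks the two limiting cases, with \(\bar{N}=100\) for illustration and extreme values of \(\theta \) standing in for the limits:

lambda_of_theta <- function(theta, N_bar) {    # approximate mean outbreak size, per (28)
  k <- seq_len(N_bar - 1)
  N_bar / (1 + sum(theta / (theta + k)))
}
c(no_clustering = lambda_of_theta(1e9, 100),   # -> 1 as theta -> Inf
  perfect       = lambda_of_theta(1e-9, 100))  # -> N_bar as theta -> 0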

We conclude from (30) that it is possible to identify \(\alpha \) from a dataset of observations of \(t_{yi}\), but it is not possible to identify both \(\lambda \) and \(\beta \). The best that can be done is to estimate the product \(\beta \lambda \). Therefore, it is not possible to identify the marginal prevalence \(\mu =\mathbb {E}(p_i) \approx \alpha / \beta \) if \(\lambda \) is unknown. We are forced to assume a value of \(\lambda \), possibly supplemented by sensitivity analysis over a range of values of \(\lambda \).

Simple expressions for the expected leakage and the probability of leakage can be derived from (27) and (29) using properties of beta random variables and the beta function (for details see Supplementary Section S6.5):

$$\begin{aligned} \mathbb {P}\left( L_i>0\right) \approx \frac{B(\alpha ,\beta \bar{N}^{-1} \lambda +b)}{B(\alpha ,\beta \bar{N}^{-1} \lambda )} - \frac{B(\alpha ,\beta \bar{N}^{-1} \lambda +B)}{B(\alpha ,\beta \bar{N}^{-1} \lambda )} \end{aligned}$$
(31)
$$\begin{aligned} \mathbb {E}\left( L_i \right) \approx (N-n)\, \alpha \, \big / \left( \alpha + \beta + b \bar{N} \lambda ^{-1} \right) . \end{aligned}$$
(32)
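These closed forms are cheap to evaluate; a minimal sketch with illustrative parameter values:

approx_leakage <- function(alpha, beta_, lambda, N_bar, B, b) {
  s <- beta_ * lambda / N_bar                  # second shape parameter in (29)-(30)
  N <- B * N_bar; n <- b * N_bar
  P_leak <- (beta(alpha, s + b) - beta(alpha, s + B)) / beta(alpha, s)   # (31)
  E_L    <- (N - n) * alpha / (alpha + beta_ + b * N_bar / lambda)       # (32)
  c(P_leak = P_leak, E_L = E_L)
}
approx_leakage(alpha = 0.5, beta_ = 100, lambda = 1, N_bar = 100, B = 800, b = 94)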

What happens if we assume a value of \(\lambda \) but the real value is different? The approximations derived above allow us to answer this question. Suppose that a dataset \(\left\{ t_{yi}: i=1, \ldots , m \right\} \) is generated by model (12)–(14), where \(\beta \gg \alpha \). Hence, \(t_{yi}\) approximately follow (30), i.e.:

$$\begin{aligned} t_{yi} \overset{a}{\sim } \text{ BB } \left( b , \alpha _0 , \beta _0 \bar{N}^{-1} \lambda _0 \right) \end{aligned}$$
(33)

where \(\alpha _0\), \(\beta _0\) and \(\lambda _0\) are the true values of these parameters. Suppose that we correctly assume model (12)–(14), except that we assume a particular value \(\lambda _1\) which may differ from the true value \(\lambda _0\). We estimate \(\alpha \) and \(\beta \) by maximum likelihood for this mis-specified working model, which is approximated by

$$\begin{aligned} t_{yi} \overset{a}{\sim } \text{ BB } \left( b , \alpha , \beta \bar{N}^{-1} \lambda _1 \right) \end{aligned}$$
(34)

For large m and subject to regularity conditions, the maximum likelihood estimators of \(\alpha \) and \(\beta \) assuming \(\lambda _1\) will tend in probability to the values that minimise the Kullback–Leibler distance between the working model (34) and the true model (33) (Van der Vaart 2000, Example 5.2.5). A Kullback–Leibler distance of zero is achieved when \(\alpha =\alpha _0\) and \(\beta = \beta _0 \lambda _0 / \lambda _1\), since (34) and (33) coincide in this case. Therefore, \(\hat{\alpha }\) will tend to the true value \(\alpha _0\), while \(\hat{\beta }\) will tend to \(\beta _0 \lambda _0 / \lambda _1\), which may be far from the true value \(\beta _0\) when the assumed \(\lambda _1\) is incorrect.

Intuitively, this is because we have data on \(t_{yi}\), the distribution of which is closely related to that of \(\phi _i\). So even if \(\lambda \) is mis-specified, the estimated model will still correctly capture the distribution of \(\phi _i\) (at least approximately), and so estimates of \(\alpha \) and the product \(\beta \lambda \) will be about right, even though the estimator of \(\beta \) may have substantial bias.
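This convergence can be checked by simulation (a sketch of our own, with illustrative true parameter values). Data are generated with perfect clustering (\(\lambda _0=\bar{N}\), i.e. \(\theta =0\), so that \(\phi _i=p_i\) exactly), and the approximate working model (34) is fitted assuming \(\lambda _1=1\):

set.seed(3)
b <- 94; N_bar <- 100; m <- 20000
alpha0 <- 0.5; beta0 <- 5                     # illustrative true values
p <- rbeta(m, alpha0, beta0)
t_y <- rbinom(m, b, p)                        # theta = 0 implies phi_i = p_i, per (17)
## negative log-likelihood of working model (34) with lambda_1 = 1
nll <- function(log_ab) {
  a <- exp(log_ab[1]); s <- exp(log_ab[2]) * 1 / N_bar
  -sum(lchoose(b, t_y) + lbeta(a + t_y, s + b - t_y) - lbeta(a, s))
}
fit <- optim(c(0, log(100)), nll)
exp(fit$par)   # approx (alpha_0, beta_0 * lambda_0 / lambda_1) = (0.5, 500)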

The consequent estimators of \(\mu \), \(\mathbb {E}\left( L_i\right) \) and \(\mathbb {P}\left( L_i>0\right) \) will tend to:

$$\begin{aligned} \hat{\mu } &= \frac{ \hat{\alpha } }{\hat{\alpha } +\hat{\beta } } \approx \frac{ \hat{\alpha } }{\hat{\beta } } \rightarrow \frac{\alpha _0 }{\beta _0\lambda _0/\lambda _1} \approx \mu _0 \lambda _1 / \lambda _0 \\ \widehat{\mathbb {P}} \left( L_i>0\right) &\rightarrow \frac{ B \left( \alpha _0 , \beta _0 \bar{N}^{-1} \lambda _0 + b \right) - B \left( \alpha _0 , \beta _0 \bar{N}^{-1} \lambda _0 + B \right) }{ B \left( \alpha _0 , \beta _0 \bar{N}^{-1} \lambda _0 \right) } = \mathbb {P}\left( L_i > 0 \right) \\ \widehat{\mathbb {E}} \left( L_i\right) &\rightarrow (N-n) \frac{\alpha _0}{\beta _0 \lambda _0 / \lambda _1 + b \bar{N} \lambda _1^{-1}} = \mathbb {E}\left( L_i \right) \lambda _1 / \lambda _0 \end{aligned}$$

For details, see Supplementary Section S6.6. These results mean that the estimated leakage rate of consignments (\(\mathbb {P}\left( L_i>0 \right) \)) remains trustworthy even if an incorrect value of \(\lambda \) is assumed. However, the estimators of the unit-level prevalence (\(\mu \)) and of the leakage rate of units (\(\mathbb {E}\left( L_i \right) \)) are biased, tending approximately to the true values multiplied by \(\lambda _1/\lambda _0\). In practice, we suggest \(\lambda =1\) as the default assumption in most cases, supplemented by sensitivity analysis over plausible alternative values of the mean outbreak size \(\lambda \).

5 Case Study

5.1 Background and Data

Cucurbit fruits such as melons, cucumbers and zucchinis are grown and traded in large quantities. Agricultural production of cucurbit fruits derives from cucurbit seeds that are themselves commonly produced, processed and transported in bulk across multiple national boundaries, through highly structured supply chains (Bonny 2017).

Cucurbit plants are susceptible to contamination by many viruses, some of which can be transmitted in seeds from contaminated plants to emerging seedlings (DAWR 2017). In order to manage biosecurity risks associated with imported seeds, Australia sets broad phytosanitary entry conditions and, in some cases, also requires testing for one or more seed-borne pests, such as the tobamovirus cucumber green mottle mosaic virus (CGMMV) (DAWR 2017). In such circumstances, tested seed lots must be certified to be free from the detectable presence of the specified pest(s) before they are permitted entry. CGMMV has been repeatedly detected in seed lots intended for entry at the Australian border (Constable et al. 2018).

We focus on large seed lots (defined as more than 2kg for cucumber and melon seeds, which is approximately 80,000 seeds; DAWR 2017). A sample of \(b=94\) groups is taken from each large lot, with each group containing \(\bar{N}=100\) seeds. We assume a total of \(B=800\) groups in each lot. Each sampled group is tested using the highly specific serology-based enzyme-linked immunosorbent assay (ELISA) protocol, resulting in a presence or absence finding. It is typically assumed that groups are random samples of seeds (or equivalently, that contaminations are spread at random across groups), although this may not always be perfectly satisfied, for example, because units from a common location or supplier could tend to end up in the same groups. It is therefore important to understand the sensitivity to an assumption of random groups.

5.2 Estimating Parameters from Historical Data

Test results from 2016 seed imports are used for this case study, including cucumber, melon, pumpkin, squash and watermelon seeds. No contamination was recorded in 68 of the 72 large seed lots imported. From the other 4 large seed lots, the numbers of groups testing positive (i.e. \(t_{yi}\)) were 4, 4, 7 and 21. These results are a subset of the data discussed by Constable et al. (2018). In the terminology of Sects. 1–4, seed lots are consignments.

We fitted the nested beta model defined by (12)–(14) using maximum likelihood estimation. We assumed two specific values of \(\lambda \) corresponding to the extreme possibilities of no clustering (\(\lambda =1\) which implies \(\theta =\infty \) and is equivalent to the unclustered model (1)–(2) discussed in Sect. 3.1) and perfect clustering (\(\lambda =\bar{N}\) which implies \(\theta =0\), where groups have either 100% or zero per cent of their seeds contaminated). The likelihood is maximised with respect to \(\log (\alpha )\) and \(\log (\beta )\) using the optim function in R with default settings including the use of the Nelder–Mead algorithm. Confidence intervals are constructed by profile likelihood as the simulation study summarised in Sect. 5.3 finds that Wald confidence intervals perform poorly here. In order to construct the profile intervals, the likelihood was calculated for many values of \(\alpha \) and \(\beta \), defined by a grid of approximately \(2.1 \times 10^5\) pairs of values. The profile intervals are then the range of values of each parameter of interest in the grid where the deviance is less than the relevant quantile of the \(\chi ^2_1\) distribution.

Maximum likelihood estimators and profile confidence intervals are calculated for \(\alpha \), \(\beta \), \(\mu =\mathbb {E}(p_i)=\alpha /(\alpha +\beta )\), the expected leakage \(\mathbb {E}(L_i)\) and the probability of leakage \(\mathbb {P}(L_i>0)\). The latter two quantities are obtained from (23) and (21), respectively, with numerical integration required in general. Estimates are also calculated from the approximate model (30) when \(\lambda =1\) using the betabin function in the aod package (Lesnoff and Lancelot 2012). (The approximate model is equivalent to the non-approximate version if \(\lambda =\bar{N}\).)
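For transparency, a base-R sketch of such a fit is given below; the counts are reconstructed from the totals reported in Sect. 5.2, the model is the approximate form (30) with \(\lambda =1\), and this is an illustration rather than the exact code used for Tables 1 and 2 (which used the non-approximate likelihood and the aod package):

t_y <- c(rep(0, 68), 4, 4, 7, 21)             # 72 lots; b = 94 groups of 100 seeds each
b <- 94; N_bar <- 100; lambda <- 1            # lambda = 1: no clustering assumed
nll <- function(log_ab) {                     # negative log-likelihood under (30)
  a <- exp(log_ab[1]); s <- exp(log_ab[2]) * lambda / N_bar
  -sum(lchoose(b, t_y) + lbeta(a + t_y, s + b - t_y) - lbeta(a, s))
}
fit <- optim(c(log(0.02), log(100)), nll)     # Nelder-Mead, as in the text
exp(fit$par)                                  # (alpha_hat, beta_hat), cf. Table 1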

Table 1 shows the estimates of \(\alpha \), \(\beta \) and \(\mu \) for the two assumed values of \(\lambda \). The estimates from the approximate model are also shown when \(\lambda =1\). Table 2 shows the maximum likelihood estimators of \(\mathbb {E}(L_i)\) and \(\mathbb {P}(L_i>0)\). We first discuss the results when it is assumed that there is no clustering of contamination within groups (\(\lambda =1\)). The confidence intervals are wide, unsurprisingly given that contamination was detected in only four out of 72 lots. We see that \(\hat{\mathbb {P}}(L_i>0)=0.032\), implying (on average) 2.2 seed lots with undetected contamination in a year’s worth of importing (about 70 lots). Even the lower end of the confidence interval (0.009) for \(\mathbb {P}(L_i>0)\) suggests leakage from 0.6 seed lots annually. This gives a useful new perspective on the risk from this pathway. The expected number of contaminated seeds admitted into the country is estimated at \(\hat{\mathbb {E}}(L_i)=0.11\) per lot, or about 8 per year. The estimated first shape parameter is very small at \(\hat{\alpha }=0.016\), representing a sharp peak in the distribution of \(p_i\) at zero, with high probability of a value close to zero. This reflects the data, where the great majority of lots have zero detections from 9400 tested seeds.

Table 1 Maximum likelihood estimators of parameters of model (12)–(14) based on cucurbit surveillance results
Table 2 Maximum likelihood estimators of the expected leakage \(\mathbb {E}\left( L_i\right) \) and probability of leakage \(\mathbb {P}\left( L_i>0\right) \) in any given consignment i, under (12)–(14) based on cucurbit surveillance results

Tables 1 and 2 also show results for perfect clustering of contamination within groups (\(\lambda =\bar{N}\)). The estimates of \(\alpha \) and \(\mathbb {P}(L_i>0)\) are virtually unchanged (compared to the unclustered case where \(\lambda =1\)), while the estimates of \(\mathbb {E}(L_i)\) and \(\beta \) have increased by a factor of \(\bar{N}=100\). This shows that, provided data are available to estimate the model parameters, estimates of \(\alpha \) and \(\mathbb {P}(L_i>0)\) are insensitive to the assumption of random mixing, consistent with the theoretical approximation discussed in Sect. 4.4.

The second rows of both Tables 1 and 2 show the results of fitting the approximate model (30) when \(\lambda =1\). Profile confidence intervals were calculated based on the same approximation; this was substantially faster (several seconds on a Mac desktop with an i7 processor, compared to several hours for non-approximate profiling), because the calculation could be vectorised in R and numerical integration was not required. The results are similar to the first row for \(\hat{\alpha }\), but the approximate \(\hat{\beta }\) is about 15% higher and the approximate \(\hat{\mu }\) is about 15% lower. The confidence intervals for the approximate \(\hat{\alpha }\) are virtually identical to their non-approximate counterparts; however, the confidence intervals for \(\hat{\beta }\) and \(\hat{\mu }\) are quite different (particularly for \(\hat{\mu }\)). This is because the approximation is based on \(p_i\) being close to zero with high probability, and this is no longer justified at the upper limit of the interval for \(\mu \). Table 2, however, shows that the approximation is excellent for quantities associated with leakage. This is because the greatest expected leakage given \(p_i\) occurs when \(p_i=1/9401\), and when this is the case, the approximations made in deriving (30) are precise.

Figure 2 further explores the role of the assumed value of \(\lambda \). The panels of this figure show how the estimates of the various parameters, leakage rates and the deviance depend on the assumed \(\lambda \). The figure confirms that \(\hat{\alpha }\) and \(\hat{\mathbb {P}} \left( L_i>0\right) \) are practically invariant to \(\lambda \), whereas the other parameters depend substantially on \(\lambda \). Panel (f) shows that the deviance is virtually flat with respect to \(\lambda \), as the maximum deviance is only approximately 0.05, which represents a negligible increase compared to zero. This confirms that, as expected, \(\lambda \) is not estimable from these data. All of these results are consistent with the approximations discussed in Sect. 4.4, including the fact that \(\hat{\mu }\) and \(\hat{\mathbb {E}}(L_i)\) are nearly perfectly proportional to the assumed value of \(\lambda \).

Fig. 2

Estimates of various parameters, and deviance, vs assumed values of \(\lambda \). Maximum likelihood estimators of parameters of model (12)–(14) are calculated using cucurbit surveillance results assuming different values of the approximate mean outbreak size \(\lambda \), which is defined by (28). When \(\lambda =1\), contamination is unclustered within groups, which corresponds to model (1)–(2). When \(\lambda =\bar{N}=100\), contamination is assumed to be perfectly clustered within groups. The expected leakage \(\mathbb {E}(L_i)\) and the probability of leakage \(\mathbb {P}(L_i>0)\) are functions of the model parameters \(\alpha \) and \(\beta \), given by (23) and (21), respectively. The deviance is twice the negative log-likelihood minus its minimum value

Seed lots are blocked from importation if any contamination is detected, so the distribution of \(T_{Xi}\) conditional on \(t_{yi}=0\) is of interest. We show this estimated conditional distribution in Fig. 3 for the unclustered case \(\lambda =1\), because in many cases clustering is believed to be negligible (although this belief may not always be justified). The unconditional distribution of \(T_{Xi}\) is also shown. We only tabulate up to \(T_{Xi}=14\) and then group the values of 15 and above together for readability. (The probabilities continue to decline beyond \(T_{Xi}=15\).) The figure shows that the conditional probability that \(T_{Xi}=0\), at 0.97, is much higher than the unconditional probability of 0.91. Moreover, the conditional probability of positive values of \(T_{Xi}\) decreases rapidly with \(T_{Xi}\), particularly for \(T_{Xi} \ge 5\), whereas the corresponding decrease in the unconditional probabilities is much more gentle. This demonstrates that the testing process greatly reduces the risk of any leakage at all and virtually rules out the risk of more than about 5 contaminated seeds being imported from any given consignment.

Fig. 3

Estimated distribution of the number of contaminated seeds (\(T_{Xi}\)) in a consignment, based on maximum likelihood estimates of the parameters of model (12)–(14) calculated using cucurbit surveillance results and assuming \(\lambda =1\), corresponding to no clustering of contamination within groups; the model is equivalent to (1)–(2) in this case. Both the unconditional distribution and the conditional distribution given no positive test results (\(t_{yi}=0\)) are shown

5.3 Simulation Study for the Estimation of the Beta Distribution Parameters

The dataset in the case study is sparse, with positive detections in only 4 of 72 consignments. A simulation study was conducted to evaluate estimation and inference of model parameters in this situation. We simulated from a range of values of \(\alpha \) spread around the estimate of \(\alpha \) from the case study. Clustering was either non-existent (\(\lambda =1\), so that the generating model matched the fitted model), moderate (\(\lambda =\bar{N}/2=50\)) or perfect (\(\lambda =\bar{N}=100\)). The values of \(\mu \) (and hence \(\beta \)) were chosen such that the proportion of groups testing positive roughly matched the proportion in the case study. For each case, 1000 datasets were simulated.

The simulation found that estimates were highly variable, reflecting the sparsity of detections. Nevertheless, it confirmed that \(\hat{\alpha }\) had consistently low bias relative to its dispersion, regardless of \(\lambda \). As expected, \(\hat{\mu }\) had large negative bias if \(\lambda \) was 50 or 100. Profile confidence intervals had coverage much closer to the nominal 95% than Wald confidence intervals. Full details and results are contained in Supplementary Section S7.

6 Discussion

The design of sampling plans for group testing of import consignments often relies on achieving a high probability of detecting contamination based on a single assumed consignment prevalence. While this is a useful starting point, it fails to account for the heterogeneity that exists across consignments and across groups within consignments.

To address this limitation, we propose supplementing the existing approach with multi-level models that incorporate beta-distributed random effects for consignments and groups. These models can be based on expert judgement or estimated from available data, even when positive detections are sparse. By employing these models, we can estimate leakage rates of undetected contamination, including the probability of a consignment containing undetected contamination and the distribution of contaminated units entering a country. This new approach complements existing design strategies by providing a data-driven assessment of the effectiveness of a testing regime for an ongoing importation pathway. Biosecurity agencies can better manage risk by systematically collecting testing results, periodically modelling them, and revising testing or mitigation strategies accordingly.

One of the models we introduce allows for clustering of contaminated units within groups. The parameter controlling this clustering can be expressed as a mean outbreak size, denoted \(\lambda \), which is the approximate expected number of contaminated units in a group with contamination. Theoretical derivations and empirical results demonstrate that this parameter cannot be identified from data on group testing results. In practice, when firm information is lacking, it would often make sense to assume \(\lambda =1\), indicating no clustering. We show that the estimator of the probability of leakage in a consignment is almost perfectly robust to mis-specification of \(\lambda \), provided contamination is rare. However, further research would be worthwhile to estimate \(\lambda \) through experiments, as it impacts estimators of the leakage rate of units and the unit-level prevalence. When clustering occurs, these quantities would be underestimated by a factor of approximately \(\lambda \).

The new models also allow us to assess the effectiveness of group testing. In a case study, we estimated the unconditional probability of a consignment containing contamination to be approximately 9%. This probability reduces to only 3% when conditioned on zero sample detections.

The framework developed here offers a novel perspective by enabling inference on ongoing leakage risk from a pathway. This provides substantial value in conjunction with the standard practice of designing sampling plans based on an arbitrary constant design prevalence. Group testing strategies, widely used in current biosecurity processes, environmental sampling, and public health, can benefit from the new models and results presented here, which prompt a reconsideration of design criteria and data analyses for effective risk management.