1 Introduction

Mathematical models are being used increasingly to inform public health policy concerning control of emerging infections, see, e.g. Ferguson et al. (2005) and Fraser et al. (2009) for applications to avian influenza A(H5N1) and swine influenza A(H1N1), respectively. A key role for such models is to evaluate the effectiveness of possible strategies for containment of an emerging infection. In order to accomplish this, estimates are required of parameters used to define the model in question. This paper considers such estimation from data collected in the early phase of an emerging epidemic, using the model of Ball et al. (1997) for the spread of an SIR (susceptible \(\rightarrow \) infective \(\rightarrow \) recovered) epidemic among a population partitioned into households.

The model of Ball et al. (1997) assumes that an infectious individual makes two types of contacts, local contacts, i.e. with individuals chosen uniformly at random from the individual’s household, and global contacts, i.e. with individuals chosen uniformly at random from the entire population. Although an oversimplification, this structure, which includes a departure from homogeneous mixing that is clearly present in human populations, yields a model that (i) is amenable to considerable mathematical analysis and (ii) leads to important insights into disease dynamics and control, such as the impact of household structure on the performance of vaccination strategies (Becker and Dietz 1995; Becker and Starczak 1997; Ball and Lyne 2002). A household component is present in many complex simulation models (see, e.g. Ferguson et al. 2005). Moreover, data at a household level are often collected during emerging infections; see Cauchemez et al. (2009) and House et al. (2012) for analyses of such data for influenza A(H1N1) transmission in the United States and England, respectively.

For many stochastic models of epidemics with few initial infectives, if the disease does not die out quickly then, during the early stages of an epidemic, the number of infectives grows exponentially until saturation effects take over. Early exponential growth is also seen in many real-life epidemics and there has been a growing interest in quick inference methods during this stage of an epidemic. Assuming a homogeneously mixing population, Wallinga and Lipsitch (2007) provided a simple estimate of the basic reproduction number \(R_0\) (see, e.g. Heesterbeek and Dietz 1996) from an observed exponential growth rate \(r\) and knowledge of the generation interval for the disease. Fraser (2007) extended this methodology to a community of households, using a closed-form approximate method for determining the exponential growth rate of the households epidemic model. Fraser gives two illustrative applications of his methodology, to pandemic influenza and measles, using historical data to obtain estimates of within-household transmission parameters. As Fraser notes, these transmission parameters could be quite different for future pandemics, so methods are required for estimating such parameters from data on an emerging infection.

The following scenario is considered in this paper. It is assumed that the household size distribution for the population is known (this is usually available from census data), an estimate of the exponential growth rate \(r\) is available from general data on an emerging epidemic and more-detailed, household-level data are available in a sample of households. The primary goal is to estimate the local (within-household) infection rate \(\lambda _L\) from this information, whilst the epidemic is still in its emerging phase. For most of the paper it is assumed, primarily for ease of presentation, that there is no latent period and that the infectious period distribution is known, though both of these assumptions may be relaxed. For inference based on final outcome data (e.g. Knock and O’Neill 2014; Ball and Lyne 2015), estimates of infection rates are (i) invariant to very general assumptions concerning a latent period and (ii) confounded with the scale of the infectious period distribution. Neither is true for inference in an emerging epidemic. The partial nature of the assumed available data renders full maximum likelihood estimation difficult, if indeed feasible; the amount of unobserved data is such that computationally intensive methods for incomplete data, such as the EM algorithm and data augmentation MCMC, may well be problematic. Thus an alternative estimation procedure is developed and shown to give an estimator of \(\lambda _L\) which converges to the unknown true value as the population and sampled households both tend to infinity in an appropriate fashion.

It is well known that the early stages of an SIR epidemic among a community of households may be approximated by a branching process in which individuals correspond to single-household epidemics. Thus if, for example, the available data consist of the total number of cases in completed sub-epidemics within households, it is tempting to estimate \(\lambda _L\) by fitting the usual final size distribution for a single-household epidemic (see, e.g. Ball 1986) to such data. However, as illustrated in Sect. 3.2, this leads to \(\lambda _L\) being underestimated because in an emerging epidemic the completed single-household epidemics are likely to be the smaller ones. An improved estimate may be obtained by including single-household epidemics that are still ongoing at the time when estimation is performed, using right-censoring for their size (i.e. that their size is at least their current size), but, as also demonstrated in Sect. 3.2, the resulting estimate is still biased. (An estimator of a parameter is unbiased if its expectation, i.e. the average of infinitely many independent repetitions of the same experiment, equals the true value of that parameter; otherwise the estimator is biased.) In order to obtain an estimator that is unbiased when the population and sampled households both tend to infinity, one needs to account correctly for the emerging nature of the epidemic which produced these data. (Similar issues arise in estimating the generation time of an infectious disease early in an epidemic; see Scalia Tomba et al. 2010.) The main purpose of this paper is to show that this can be achieved by using the theory of Nerman (1981) concerning the asymptotic behaviour of counts of characteristics associated with supercritical general (i.e. Crump-Mode-Jagers) branching processes applied to the above-mentioned branching process which approximates the early stages of an epidemic in a community of households.

The paper is structured as follows. The households epidemic model of Ball et al. (1997) is described in Sect. 2 and the early stages of epidemics in a large population are considered in Sect. 3. The threshold behaviour of the model is outlined in Sect. 3.1. Estimation of \(\lambda _L\) by fitting the usual final size distribution to single-household epidemics, both without and with censoring, is considered and shown to be inadequate in Sect. 3.2. The new method, which incorporates correctly the emerging nature of the epidemic, is described in Sect. 4. The theory for the method is developed in Sect. 4.1 for the situations when, at the time the inference is performed, (i) complete knowledge of the numbers of infective and recovered individuals in each household is available, and (ii) (sometimes the more realistic scenario) only the numbers of recovered individuals in each household are available. Some extensions of the theory and implementation issues are considered in Sect. 4.2. The theory as developed does not make any assumptions concerning the infectious period distribution, other than that it possesses a moment-generating function, but it does need to be specified. However, the method is easy to implement only if single-household epidemic dynamics are Markovian, i.e. if the infectious period follows an exponential distribution, though phase-type distributions can also be accommodated. Extensions to incorporate a latent period and allow for the rate of the exponential distribution used to model the infectious period to be unknown are discussed briefly, as is allowing \(\lambda _L\) to depend on household size. Similar theory is developed in Sect. 5 for a households Reed-Frost type model, in which the latent period is constant and the infectious period is reduced to a single point in time, using a multitype Galton-Watson branching process. Simulations depicting how the estimation methodologies developed in Sects. 4 and 5 perform in practice are shown in Sect. 6, while other plots in this section illustrate how changes to the parameters governing the spread of an epidemic affect the bias of the estimates based on single-household final size distributions. Proofs that the estimators derived in Sect. 4 are strongly consistent under suitable conditions are given in Sect. 7. Finally, some concluding comments are given in Sect. 8.

2 Model

The model used is based on that of Ball et al. (1997) for describing the spread of an SIR epidemic in a population that is partitioned into households. For a population in which \(n_{max}\) is the size of the largest household, let \(m_n\) be the number of households of size \(n\), for \(n = 1,2,\ldots ,n_{max}\), so that \(m = \sum _{n=1}^{n_{max}} m_n\) and \(N = \sum _{n=1}^{n_{max}} nm_n\) are, respectively, the total numbers of households and individuals in the population. Also, for \(n = 1,2,\ldots ,n_{max}\), let \(\alpha _n = m_n/m\) be the proportion of households of size \(n\) and \(\tilde{\alpha }_n = nm_n/N\) be the proportion of individuals who reside in households of size \(n\).

The epidemic is initiated by a small number of individuals becoming infected at time \(t=0\). Once infected, an individual remains in that state for the duration of its infectious period, which for each individual is independently and identically distributed according to a random variable \(T_I\), having an arbitrary but specified distribution. Once its infectious period is over, an individual is recovered and it plays no further part in the epidemic. During its infectious period, a given infective makes global contacts with any other given individual in the population at points of a homogeneous Poisson process having rate \(\lambda _G/N\) and it makes additional local contacts with any other given individual in the same household at points of a homogeneous Poisson process having rate \(\lambda _L\). All the Poisson processes describing infectious contacts (whether or not either or both of the individuals involved are the same) and the random variables describing the infectious periods are mutually independent. Whenever an infective makes contact with a susceptible individual, the susceptible becomes infected and is immediately able to transmit infection. Thus there is no latent period, though this can be relaxed; see Sect. 4.2. The process continues until there is no infective remaining in the population, at which point the epidemic is deemed to have ceased.

3 Early stages of an epidemic

3.1 Threshold parameter

When the number of households \(m\) is large, the probability of a global infectious contact in the early stages of an epidemic being with a susceptible in a previously infected household is small. Thus, the initial behaviour of an epidemic in a community of households can be approximated by a branching process of infected households, in which each global contact is assumed to be with an individual in a fully susceptible household. Let \(R_*\) be the mean number of global contacts that emanate from a typical household in this branching process. Then \(R_*\) is a threshold parameter for the households epidemic model, in that in the limit as \(m \rightarrow \infty \), the epidemic takes off with non-zero probability if and only if \(R_* > 1\); see Ball et al. (1997), where calculation of \(R_*\) is described.

The remainder of this paper focuses exclusively on epidemics having \(R_* > 1\) and is concerned with epidemics that do take off. It is assumed that \(\mathbb {E}[T_I] = 1\), as this can be done without loss of generality by rescaling the time axis.

3.2 Basic approach to estimating \(\lambda _L\)

Suppose one wishes to estimate \(\lambda _L\) for an epidemic that is observed whilst it is still in its initial stages and is therefore still mimicking the infected households branching process outlined above. For \(n = 1,2,\ldots \) and \(x = 0,1,\ldots ,n-1\), let \(p_{basic}^{(n)}(x|\lambda _L)\) be the probability that a single-household epidemic (without global infection) in a household of size \(n\), started by one initial infective, finishes with \(x\) susceptibles remaining. By using Eq. (2.5) of Ball (1986), \(p_{basic}^{(n)}(x|\lambda _L)\) (\(x = 0,1,\ldots ,n-1\)) can be determined using the following triangular system of linear equations:

$$\begin{aligned} \sum \limits _{i=1}^{j}\left( {\begin{array}{c}n-i\\ j-i\end{array}}\right) \frac{p_{basic}^{(n)}(n-i|\lambda _L)}{\phi ((n-j)\lambda _L)^{i}} = \left( {\begin{array}{c}n-1\\ j-1\end{array}}\right) ,\quad j = 1,2,\ldots ,n, \end{aligned}$$

where \(\phi (\theta ) = \mathbb {E}[\exp (-\theta T_I)] (\theta \ge 0)\) is the moment-generating function of \(T_I\).
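The triangular system can be solved by forward substitution, one value of \(j\) at a time. The following is a minimal sketch, assuming the system is read in the form of Ball (1986), i.e. with each term \(p_{basic}^{(n)}(n-i|\lambda _L)\) divided by \(\phi ((n-j)\lambda _L)^{i}\), and assuming an exponentially distributed infectious period with mean one for the illustrative transform `phi_exp`. The function names are hypothetical and are not taken from the paper.

```python
from math import comb


def phi_exp(theta, gamma=1.0):
    """E[exp(-theta * T_I)] when T_I is exponential with rate gamma (assumption)."""
    return gamma / (gamma + theta)


def final_size_probs(n, lam_L, phi=phi_exp):
    """Return {x: p_basic^{(n)}(x | lam_L)} for x = 0, ..., n-1.

    Row j of the triangular system involves p(n-1), ..., p(n-j); every value
    except p(n-j) is already known from the earlier rows.
    """
    p = {}
    for j in range(1, n + 1):
        w = phi(lam_L * (n - j))
        known = sum(comb(n - i, j - i) * p[n - i] / w**i for i in range(1, j))
        p[n - j] = (comb(n - 1, j - 1) - known) * w**j
    return p


if __name__ == "__main__":
    for x, prob in sorted(final_size_probs(3, 1.0).items()):
        print(x, round(prob, 6))
```

For \(n = 2\) the first row gives \(p_{basic}^{(2)}(1|\lambda _L) = \phi (\lambda _L)\), the probability that the initial infective fails to infect its housemate, which is a quick check on the reading of the system used here.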

Let \(a_{x,y}^{(n)}\) be the number of households of size \(n\) containing \(x\) susceptibles and \(y\) infectives at the time when the epidemic is observed. By considering only those households in which the single-household epidemic has ceased (i.e. where \(x < n\) and \(y = 0\)), one can attempt to estimate \(\lambda _L\) by maximising the pseudolikelihood function

$$\begin{aligned} L_{basic}(\lambda _L|\varvec{a}) = \prod \limits _{n=2}^{n_{max}}\prod \limits _{x=0}^{n-1}{p_{basic}^{(n)}(x|\lambda _L)}^{a_{x,0}^{(n)}}. \end{aligned}$$
(3.1)

Note that households of size \(1\) provide no information about \(\lambda _L\) and that (3.1) is not a true likelihood function as it assumes independence between households. This method of estimation, which we call basic MPLE, is simple but does not use all of the information available since households in which infectives are still present are ignored. A similar approach using more of the information available is to use maximum pseudolikelihood estimation but with censoring on households in which there are still infectives remaining. For \(n = 2,3,\ldots ,n_{max}\) and \(x = 0,1,\ldots ,n-1\), let \(q_{basic}^{(n)}(x|\lambda _L) = \sum _{i=0}^{x}p_{basic}^{(n)}(i|\lambda _L)\) be the probability that a household of size \(n\) has at most \(x\) survivors from a single household epidemic and let \(b_{x}^{(n)} = \sum _{y=1}^{n-x}a_{x,y}^{(n)}\) be the number of observed households of size \(n\) containing at least one infective and exactly \(x\) susceptibles. Such households will have at most \(x\) survivors once the single-household epidemic is completed. We can now use what is referred to as the censored MPLE approach for estimating \(\lambda _L\), with left-censoring for the number of survivors (i.e. right-censoring for the total size), by maximising

$$\begin{aligned} L_{censor}(\lambda _L|\varvec{a},\varvec{b}) = \prod \limits _{n=2}^{n_{max}}\prod \limits _{x=0}^{n-1}{p_{basic}^{(n)}(x|\lambda _L)}^{a_{x,0}^{(n)}}{q_{basic}^{(n)}(x|\lambda _L)}^{b_{x}^{(n)}}. \end{aligned}$$
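The two pseudolikelihoods are straightforward to evaluate on the log scale. The sketch below is illustrative only: it assumes an exponential infectious period with mean one, repeats a forward-substitution solver for the final size probabilities (with the terms divided by powers of \(\phi \), as in Ball 1986), and uses hypothetical dictionaries `a` and `b` keyed by `(n, x)` for the counts \(a_{x,0}^{(n)}\) and \(b_{x}^{(n)}\).

```python
from math import comb, log


def phi_exp(theta):
    """Transform of an exponential infectious period with mean 1 (assumption)."""
    return 1.0 / (1.0 + theta)


def final_size_probs(n, lam_L, phi=phi_exp):
    """{x: p_basic^{(n)}(x | lam_L)} by forward substitution."""
    p = {}
    for j in range(1, n + 1):
        w = phi(lam_L * (n - j))
        known = sum(comb(n - i, j - i) * p[n - i] / w**i for i in range(1, j))
        p[n - j] = (comb(n - 1, j - 1) - known) * w**j
    return p


def log_L_basic(lam_L, a, phi=phi_exp):
    """log of (3.1); a[(n, x)] counts completed households with x susceptibles."""
    return sum(c * log(final_size_probs(n, lam_L, phi)[x])
               for (n, x), c in a.items())


def log_L_censor(lam_L, a, b, phi=phi_exp):
    """log of L_censor; b[(n, x)] counts households with infectives remaining."""
    total = log_L_basic(lam_L, a, phi)
    for (n, x), c in b.items():
        p = final_size_probs(n, lam_L, phi)
        total += c * log(sum(p[i] for i in range(x + 1)))  # at most x survivors
    return total


if __name__ == "__main__":
    a = {(2, 1): 3, (2, 0): 1}  # hypothetical counts
    b = {(2, 0): 2}             # hypothetical counts
    for lam in (0.2, 1.0 / 3.0, 0.5, 1.0):
        print(lam, log_L_basic(lam, a), log_L_censor(lam, a, b))
```

With only size-2 households and the hypothetical counts above, \(L_{basic}\) is proportional to \(\phi (\lambda _L)^3(1-\phi (\lambda _L))\), which is maximised at \(\phi (\lambda _L) = 3/4\), i.e. \(\lambda _L = 1/3\) under the exponential assumption. Either function can be passed to any one-dimensional optimiser.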

Figure 1 shows how well the basic and censored MPLE methods perform in practice. For these histograms, epidemics were simulated in a population containing 1,000,000 households, with estimates of \(\lambda _L\) taking place after the 1000th recovery has occurred. Any epidemic not reaching 1000 recoveries was considered not to have taken off and was ignored. Estimates of \(\lambda _L\) were made for the first 1000 epidemics to reach the 1000 recovery milestone. A large population was used to ensure that the simulated epidemics were still approximately mimicking a branching process at the time of estimation. The household distribution \(\varvec{\alpha }\) that was used was \([0.29,0.34,0.16,0.14,0.05,0.02]\), i.e. \(n_{max} = 6\) and \(\alpha _1 = 0.29,\alpha _2 = 0.34,\ldots ,\alpha _6 = 0.02\), as suggested by Fraser (2007), and is based on UK census data from 2001 (http://www.statistics.gov.uk/census/). The infectious period was chosen to be exponentially distributed, the infectious parameters were \(\lambda _G = 1\) and \(\lambda _L = 1\), and all epidemics were initiated by a single individual, chosen uniformly at random from the population, becoming infected.

Fig. 1 Estimates of \(\lambda _L\), with a true value of \(1\), from 1000 epidemic simulations using the basic and censored MPLE methods

It is clear from Figure 1 that the basic MPLE method severely underestimates \(\lambda _L\). Households in which the epidemic spreads locally are more likely to still be infective at the observation time than households infected at the same time but in which the initial infective does not infect any other individual locally. Consequently, households that contain less severe local epidemics are more likely to be included in the basic MPLE estimate, causing the observed underestimate of \(\lambda _L\). The censored MPLE approach appears to offer an improvement but repeated simulations with different parameters showed that this method generally overestimates \(\lambda _L\), as is observed in Fig. 1.

In order to obtain a more accurate estimate of \(\lambda _L\) one must understand the infected households branching process in more detail. The basic idea is the following. If the approximating branching process does not go extinct, then it grows exponentially at a rate \(r\), which depends on the parameters of the households epidemic model, and as time \(t \rightarrow \infty \) the fraction of completed single household epidemics (in the branching process), in households of size \(n\), that leave \(x\) members susceptible, converges to a limit \(\tilde{p}_{x,0}^{(n)}(r|\lambda _L)~(x = 0,1,\ldots ,n-1)\). Thus we assume that each observed household in the data has final size that comes from that distribution and estimate \(\lambda _L\) by maximising the pseudolikelihood obtained by replacing \({p_{basic}^{(n)}(x|\lambda _L)}\) by \(\tilde{p}_{x,0}^{(n)}(\hat{r}|\lambda _L)\) in (3.1), where \(\hat{r}\) is an estimate of the growth rate \(r\); see (4.5) in the next section, where calculation of \(\tilde{p}_{x,0}^{(n)}(r|\lambda _L)\) is explained.

4 A new method

4.1 A more accurate estimator

Consider the approximating branching process introduced in Sect. 3.1, in which individuals correspond to infected households and an individual has one offspring whenever a global contact emanates from the corresponding single-household epidemic. For \(n = 1,2,\ldots ,n_{max}\), let \(E_H^{(n)}\) denote a typical size-\(n\) single-household epidemic, started by one member of the household being infected at time \(t=0\). For \(t \ge 0\), let \(X_H^{(n)}(t)\) and \(Y_H^{(n)}(t)\) be respectively the numbers of susceptibles and infectives in \(E_H^{(n)}\) at time \(t\). Let \(\fancyscript{T}^{(n)} = \{(x,y):~x = 0,1,\ldots ,n-1;~y = 0,1,\ldots ,n-x\}\) and, for \((x,y)\in \fancyscript{T}^{(n)}\), let \({p}_{x,y}^{(n)}(t|\lambda _L) = \mathbb {P}(X_H^{(n)}(t) = x,~Y_H^{(n)}(t) = y)~(t \ge 0)\) and \(\tilde{p}_{x,y}^{(n)}(r|\lambda _L) = \int _0^\infty e^{-rt}{p}_{x,y}^{(n)}(t|\lambda _L)~\mathop {}\!\mathrm {d}t~(r \ge 0)\). Further, let \(\xi _H^{(n)}\) be the point process describing times that global contacts emanate from \(E_H^{(n)}\), so, for \(t \ge 0,~\xi _H^{(n)}([0,t])\) is the number of global contacts that emanate from \(E_H^{(n)}\) during \([0,t]\). For \(t \ge 0\) let \(\mu ^{(n)}(t) = \mathbb {E}[\xi _H^{(n)}([0,t])]\) and note that

$$\begin{aligned} \mu ^{(n)}(\mathop {}\!\mathrm {d}t) = \lambda _G\sum \limits _{(x,y)\in \fancyscript{T}^{(n)}}yp_{x,y}^{(n)}(t|\lambda _L)~ \mathop {}\!\mathrm {d}t. \end{aligned}$$
(4.1)

Let \(\xi _H\) be a mixture of \(\xi _H^{(1)},\xi _H^{(2)},\ldots ,\xi _H^{(n_{max})}\) with mixing probabilities \(\tilde{\alpha }_1,\tilde{\alpha }_2,\ldots , \tilde{\alpha }_{n_{max}}\). Then \(\xi _H\) is a point process which describes the ages at which a typical individual reproduces in the approximating branching process. Note that this branching process is a general (i.e. Crump-Mode-Jagers) branching process; see, e.g. Haccou et al. (2005, Section 3.3). For \(t \ge 0\), let

$$\begin{aligned} \mu (t) = \mathbb {E}[\xi _H([0,t])] = \sum \limits _{n=1}^{n_{max}}\tilde{\alpha }_n\mu ^{(n)}(t). \end{aligned}$$
(4.2)

The branching process has a Malthusian parameter \(r \in (0,\infty )\), given by the unique solution of the equation

$$\begin{aligned} \int _0^\infty e^{-rt}\mu (\mathop {}\!\mathrm {d}t)=1. \end{aligned}$$

Note that, from (4.1) and (4.2), \(r\) satisfies

$$\begin{aligned} \lambda _G\sum \limits _{n=1}^{n_{max}}\tilde{\alpha }_n\sum \limits _{(x,y)\in \fancyscript{T}^{(n)}}y\tilde{p}_{x,y}^{(n)}(r|\lambda _L) = 1. \end{aligned}$$
(4.3)
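Since the left-hand side of (4.3) is decreasing in \(r\), the growth rate can be found numerically by bisection once the transforms \(\tilde{p}_{x,y}^{(n)}(r|\lambda _L)\) can be evaluated. The sketch below is illustrative and rests on assumptions not made at this point in the text: the infectious period is exponential with rate `gamma` (default 1), so that a single-household epidemic is a Markov chain with infection rate \(\lambda _L xy\) and recovery rate \(\gamma y\) in state \((x,y)\), and the transforms are obtained by forward substitution over that (acyclic) chain. All function names are hypothetical.

```python
def laplace_state_probs(n, lam_L, r, gamma=1.0):
    """{(x, y): integral of exp(-r t) p_{x,y}^{(n)}(t) dt} for r > 0.

    States are visited so that every state feeding into (x, y) has already
    been computed: x descending, then y descending.
    """
    tp = {}
    for x in range(n - 1, -1, -1):
        for y in range(n - x, -1, -1):
            inflow = 1.0 if (x, y) == (n - 1, 1) else 0.0  # initial state
            if x + 1 <= n - 1 and y >= 2:                  # infection into (x, y)
                inflow += lam_L * (x + 1) * (y - 1) * tp[(x + 1, y - 1)]
            if y + 1 <= n - x:                             # recovery into (x, y)
                inflow += gamma * (y + 1) * tp[(x, y + 1)]
            tp[(x, y)] = inflow / (r + lam_L * x * y + gamma * y)
    return tp


def growth_rate(lam_G, lam_L, alpha, gamma=1.0):
    """Solve (4.3) for r by bisection; alpha[n-1] is the proportion of size-n households."""
    mean_size = sum(n * a for n, a in enumerate(alpha, start=1))
    alpha_tilde = [n * a / mean_size for n, a in enumerate(alpha, start=1)]

    def f(r):
        total = 0.0
        for n, at in enumerate(alpha_tilde, start=1):
            tp = laplace_state_probs(n, lam_L, r, gamma)
            total += at * sum(y * v for (x, y), v in tp.items())
        return lam_G * total - 1.0

    lo, hi = 1e-9, 1.0
    if f(lo) <= 0.0:
        raise ValueError("no positive growth rate (threshold parameter not above 1)")
    while f(hi) > 0.0:
        hi *= 2.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if f(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    alpha = [0.29, 0.34, 0.16, 0.14, 0.05, 0.02]  # household distribution of Sect. 3.2
    print(growth_rate(1.0, 1.0, alpha))
```

When every household has size one the model is homogeneously mixing and (4.3) reduces to \(\lambda _G/(r+\gamma ) = 1\), so \(r = \lambda _G - \gamma \); this gives a simple check of the sketch.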

It is convenient to assume that individuals live forever in the branching process, though of course an individual ceases to reproduce as soon as there is no infective in the corresponding single-household epidemic. For \(n = 1,2,\ldots ,n_{max}\) and \((x,y) \in \fancyscript{T}^{(n)}\), an individual in the branching process is said to be in state \((n,x,y)\) if it corresponds to a single size-\(n\) household epidemic and there are \(x\) susceptibles and \(y\) infectives in that epidemic. Let \(\fancyscript{T} = \{(n,x,y) :~n=1,2,\ldots ,n_{max}~\text {and}~(x,y)\in \fancyscript{T}^{(n)}\}\). For \(t \ge 0\) and \((n,x,y) \in \fancyscript{T}\), let \(Y_{n,x,y}(t)\) be the number of individuals in state \((n,x,y)\) at time \(t\) in the branching process. Suppose that the Malthusian parameter \(r\) is strictly positive. Then it is easily verified that the conditions of Theorem 5.4 of Nerman (1981) are satisfied and it follows from that theorem that there exists a random variable \(W \ge 0\), where \(W = 0\) if and only if the branching process goes extinct, such that for all \((n,x,y) \in \fancyscript{T}\),

$$\begin{aligned} e^{-rt}Y_{n,x,y}(t) \xrightarrow {\text {a.s.}} \tilde{\alpha }_n\tilde{p}_{x,y}^{(n)}(r|\lambda _L)W \quad \text {as } t \rightarrow \infty , \end{aligned}$$
(4.4)

where \(\xrightarrow {\text {a.s.}}\) denotes almost sure convergence (i.e. convergence with probability \(1\)).

Note that \(\sum _{(x,y)\in \fancyscript{T}^{(n)}}p_{x,y}^{(n)}(t|\lambda _L) = 1\), so \(\sum _{(x,y)\in \fancyscript{T}^{(n)}}\tilde{p}_{x,y}^{(n)}(r|\lambda _L) = 1/r \quad (n = 1,2,\ldots , n_{max})\). Thus, if the branching process does not go extinct, as \(t \rightarrow \infty \) the proportion of individuals that are in state \((n,x,y)\) converges almost surely to \(\tilde{\alpha }_nr\tilde{p}_{x,y}^{(n)}(r|\lambda _L)\).
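For completeness, the two steps behind this statement can be written out, using only (4.4) and the fact that \(\sum _{n=1}^{n_{max}}\tilde{\alpha }_n = 1\). First,

$$\begin{aligned} \sum \limits _{(x,y)\in \fancyscript{T}^{(n)}}\tilde{p}_{x,y}^{(n)}(r|\lambda _L) = \int _0^\infty e^{-rt}\sum \limits _{(x,y)\in \fancyscript{T}^{(n)}}p_{x,y}^{(n)}(t|\lambda _L)~\mathop {}\!\mathrm {d}t = \int _0^\infty e^{-rt}~\mathop {}\!\mathrm {d}t = \frac{1}{r}. \end{aligned}$$

Second, on the event \(\{W > 0\}\), dividing (4.4) by its sum over all states gives

$$\begin{aligned} \frac{Y_{n,x,y}(t)}{\sum _{(n',x',y')\in \fancyscript{T}}Y_{n',x',y'}(t)} \xrightarrow {\text {a.s.}} \frac{\tilde{\alpha }_n\tilde{p}_{x,y}^{(n)}(r|\lambda _L)W}{\sum _{n'=1}^{n_{max}}\tilde{\alpha }_{n'}r^{-1}W} = \tilde{\alpha }_nr\tilde{p}_{x,y}^{(n)}(r|\lambda _L) \quad \text {as } t \rightarrow \infty . \end{aligned}$$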

Return to the households epidemic model. Recall that for \((n,x,y) \in \fancyscript{T}\), the number of households of size \(n\) that contain \(x\) susceptibles and \(y\) infectives when the epidemic is observed is denoted by \(a_{x,y}^{(n)}\). Suppose that an estimate, \(\hat{r}\) say, of the growth rate \(r\) is available. Then, provided the epidemic has taken off and it has been running for a sufficiently short period of time so that the branching process provides a good approximation but a sufficiently long time so that the above asymptotic composition of the branching process is applicable, \(\lambda _L\) can be estimated by maximising the normalised pseudolikelihood function

$$\begin{aligned} L_{full}(\lambda _L|\varvec{a},\hat{r}) = \prod \limits _{n=2}^{n_{max}}\prod \limits _{(x,y)\in \fancyscript{T}^{(n)}}\tilde{p}_{x,y}^{(n)}(\hat{r}|\lambda _L)^{a_{x,y}^{(n)}}. \end{aligned}$$
(4.5)

In Sect. 7 we prove that, under suitable conditions, the estimator \(\hat{\lambda }_L = \text {argmax } L_{full}(\lambda _L|\varvec{a},\hat{r})\) is strongly consistent as the number of households \(m \rightarrow \infty \), i.e. that \(\hat{\lambda }_L\) converges almost surely to the true value \(\lambda _L\) as \(m \rightarrow \infty \).
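As an illustration of how (4.5) might be maximised in practice, the sketch below evaluates \(\log L_{full}\) and maximises it over \(\lambda _L\) with a coarse grid followed by a golden-section refinement. It is a sketch under stated assumptions rather than the authors' implementation: the infectious period is taken to be exponential with rate `gamma` (default 1), the transforms are computed by forward substitution over the resulting Markov chain, and the count dictionary `a`, keyed by `(n, x, y)`, as well as all function names and search bounds, are hypothetical.

```python
import math


def laplace_state_probs(n, lam_L, r, gamma=1.0):
    """{(x, y): integral of exp(-r t) p_{x,y}^{(n)}(t) dt}, exponential T_I (assumption)."""
    tp = {}
    for x in range(n - 1, -1, -1):
        for y in range(n - x, -1, -1):
            inflow = 1.0 if (x, y) == (n - 1, 1) else 0.0
            if x + 1 <= n - 1 and y >= 2:
                inflow += lam_L * (x + 1) * (y - 1) * tp[(x + 1, y - 1)]
            if y + 1 <= n - x:
                inflow += gamma * (y + 1) * tp[(x, y + 1)]
            tp[(x, y)] = inflow / (r + lam_L * x * y + gamma * y)
    return tp


def log_L_full(lam_L, a, r_hat, gamma=1.0):
    """log of (4.5); a[(n, x, y)] is the number of observed households in that state."""
    cache, total = {}, 0.0
    for (n, x, y), count in a.items():
        if n not in cache:
            cache[n] = laplace_state_probs(n, lam_L, r_hat, gamma)
        total += count * math.log(cache[n][(x, y)])
    return total


def maximise(f, lo=0.01, hi=10.0, grid=200, tol=1e-7):
    """Coarse grid search, then golden-section search around the best grid point."""
    step = (hi - lo) / grid
    best = max((lo + k * step for k in range(grid + 1)), key=f)
    left, right = max(lo, best - step), min(hi, best + step)
    g = (math.sqrt(5.0) - 1.0) / 2.0
    while right - left > tol:
        c, d = right - g * (right - left), left + g * (right - left)
        if f(c) > f(d):
            right = d
        else:
            left = c
    return 0.5 * (left + right)


if __name__ == "__main__":
    r_hat = 0.5
    # Hypothetical data: counts set to their limiting expected proportions
    # for lam_L = 1, so the maximiser should return a value close to 1.
    a = {(n, x, y): 1000.0 * r_hat * v
         for n in (2, 3)
         for (x, y), v in laplace_state_probs(n, 1.0, r_hat).items()}
    print(maximise(lambda lam: log_L_full(lam, a, r_hat)))
```

The grid step guards against the refinement converging to a local maximum; the bounds `lo` and `hi` are arbitrary and would need to be chosen to suit the application.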

Suppose that estimation is based only on completed single-household epidemics, as in the basic MPLE method. Then \(\lambda _L\) may be estimated by maximising

$$\begin{aligned} L_{final}(\lambda _L|\varvec{a},\hat{r}) = \prod \limits _{n=2}^{n_{max}}\prod \limits _{x=0}^{n-1}\tilde{p}_{x,0}^{(n)}(\hat{r}|\lambda _L)^{a_{x,0}^{(n)}}. \end{aligned}$$

Observe that subject to mild conditions,

$$\begin{aligned} p_{basic}^{(n)}(x|\lambda _L) = \lim _{t \rightarrow \infty } p_{x,0}^{(n)}(t|\lambda _L) = \lim _{r \rightarrow 0^{+}} r\tilde{p}_{x,0}^{(n)}(r|\lambda _L). \end{aligned}$$

It follows that, under appropriate conditions, the basic MPLE method becomes asymptotically unbiased as the growth rate tends down to zero.
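A small worked example may help to make this concrete. Assume, for illustration only, that all households have size \(2\) and that \(T_I\) is exponentially distributed with mean \(1\), so that \(\phi (\theta ) = 1/(1+\theta )\). The initial state \((1,1)\) is left at rate \(1+\lambda _L\), and following the possible paths of the single-household epidemic gives

$$\begin{aligned} \tilde{p}_{1,0}^{(2)}(r|\lambda _L) = \frac{1}{r(r+1+\lambda _L)},\qquad \tilde{p}_{0,0}^{(2)}(r|\lambda _L) = \frac{2\lambda _L}{r(r+1)(r+2)(r+1+\lambda _L)}. \end{aligned}$$

Hence \(r\tilde{p}_{1,0}^{(2)}(r|\lambda _L) \rightarrow 1/(1+\lambda _L) = \phi (\lambda _L) = p_{basic}^{(2)}(1|\lambda _L)\) as \(r \rightarrow 0^{+}\), in agreement with the limit above. For \(r > 0\), the limiting ratio of completed households with no survivor to those with one survivor is

$$\begin{aligned} \frac{\tilde{p}_{0,0}^{(2)}(r|\lambda _L)}{\tilde{p}_{1,0}^{(2)}(r|\lambda _L)} = \frac{2\lambda _L}{(r+1)(r+2)}, \end{aligned}$$

whereas the basic final size distribution gives the ratio \((1-\phi (\lambda _L))/\phi (\lambda _L) = \lambda _L\). In this special case, fitting \(p_{basic}^{(2)}\) to the limiting proportions therefore yields \(2\lambda _L/((r+1)(r+2))\), which is smaller than \(\lambda _L\) whenever \(r > 0\) and approaches \(\lambda _L\) as \(r \rightarrow 0^{+}\).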

A key assumption of the estimator based on \(L_{full}\) is that the exact state of a household is observable but this is unlikely to be realised in practice. Suppose that only recoveries are observed. For \(n = 2,3,\ldots ,n_{max}\) and \(j = 1,2,\ldots ,n\), let \(c_j^{(n)}\) be the observed number of households of size \(n\) with \(j\) recoveries, let \(\fancyscript{A}_j^{(n)} = \{(x,y) \in \fancyscript{T}^{(n)}:~x+y=n-j\}\) and let

$$\begin{aligned} \tilde{q}_j^{(n)}(r|\lambda _L) = \sum \limits _{(x,y) \in \fancyscript{A}_j^{(n)}}\tilde{p}_{x,y}^{(n)}(r|\lambda _L)/\left( \frac{1}{r}-\tilde{q}_0^{(n)}(r|\lambda _L)\right) , \end{aligned}$$

where \(\tilde{q}_0^{(n)}(r|\lambda _L) = \sum \nolimits _{y=1}^n\tilde{p}_{n-y,y}^{(n)}(r|\lambda _L)\). Then \(\lambda _L\) may be estimated by maximising

$$\begin{aligned} L_{rec}(\lambda _L|\varvec{c},\hat{r}) = \prod \limits _{n=2}^{n_{max}}\prod \limits _{j=1}^{n}\tilde{q}_{j}^{(n)}(\hat{r}|\lambda _L)^{c_{j}^{(n)}}. \end{aligned}$$
(4.6)
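The quantities \(\tilde{q}_j^{(n)}(r|\lambda _L)\) are obtained from the transforms \(\tilde{p}_{x,y}^{(n)}(r|\lambda _L)\) by summing over states with \(j\) recoveries and renormalising over households with at least one recovery. A minimal sketch follows, again assuming (beyond what is required at this point in the text) an exponentially distributed infectious period with rate `gamma`, with hypothetical function names and a hypothetical count dictionary `c` keyed by `(n, j)`.

```python
import math


def laplace_state_probs(n, lam_L, r, gamma=1.0):
    """{(x, y): integral of exp(-r t) p_{x,y}^{(n)}(t) dt}, exponential T_I (assumption)."""
    tp = {}
    for x in range(n - 1, -1, -1):
        for y in range(n - x, -1, -1):
            inflow = 1.0 if (x, y) == (n - 1, 1) else 0.0
            if x + 1 <= n - 1 and y >= 2:
                inflow += lam_L * (x + 1) * (y - 1) * tp[(x + 1, y - 1)]
            if y + 1 <= n - x:
                inflow += gamma * (y + 1) * tp[(x, y + 1)]
            tp[(x, y)] = inflow / (r + lam_L * x * y + gamma * y)
    return tp


def recovery_probs(n, lam_L, r, gamma=1.0):
    """{j: q~_j^{(n)}(r | lam_L)} for j = 1, ..., n recoveries."""
    tp = laplace_state_probs(n, lam_L, r, gamma)
    q0 = sum(tp[(n - y, y)] for y in range(1, n + 1))  # states with no recovery yet
    denom = 1.0 / r - q0
    return {j: sum(v for (x, y), v in tp.items() if x + y == n - j) / denom
            for j in range(1, n + 1)}


def log_L_rec(lam_L, c, r_hat, gamma=1.0):
    """log of (4.6); c[(n, j)] is the number of size-n households with j recoveries."""
    cache, total = {}, 0.0
    for (n, j), count in c.items():
        if n not in cache:
            cache[n] = recovery_probs(n, lam_L, r_hat, gamma)
        total += count * math.log(cache[n][j])
    return total


if __name__ == "__main__":
    print(recovery_probs(2, 1.0, 1.0))
    print(log_L_rec(1.0, {(2, 1): 8, (2, 2): 2}, 1.0))  # hypothetical counts
```

By construction \(\sum _{j=1}^{n}\tilde{q}_j^{(n)}(r|\lambda _L) = 1\), which provides a simple numerical check.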

4.2 Practicalities and extensions

Estimates of \(\lambda _L\) based upon \(L_{full}\) and \(L_{rec}\) are both dependent on knowing \(\tilde{p}_{x,y}^{(n)}(r|\lambda _L)\) for \((n,x,y) \in \fancyscript{T}\), which is not practical in many circumstances. It is, however, possible if we restrict ourselves to the Markovian case, in which the infectious period \(T_I\) is exponentially distributed, by following a similar argument to that used in Section 4 of Pellis et al. (2011) to calculate real-time growth rates. Under these circumstances, the single-household epidemic \(E_H^{(n)} = \{(X_H^{(n)}(t),Y_H^{(n)}(t)):t \ge 0\}\) is a continuous-time Markov chain (CTMC). Figure 2 shows the transition rates of \(E_H^{(3)}\) as a CTMC and also assigns labels to each state \((x,y) \in \fancyscript{T}^{(3)}\). The exact assignment of these state labels is unimportant; however, it is convenient for the initial state \((n-1,1)\) to be assigned as state \(1\) for a size-\(n\) household.

Fig. 2 Graphical representation of a single-household epidemic for households of size 3 as a CTMC, where \((x,y)\) denotes the household state and state labels (shown as superfixes) for the CTMC are assigned as described. The values on the arrows represent transition rates between states in the single-household epidemic

Note that the state space \(\fancyscript{T}^{(n)}\) of \(E_H^{(n)}\) has size \(s^{(n)} = |\fancyscript{T}^{(n)}| = n(n+3)/2\). Let \(Q^{(n)}(\lambda _L) = [q_{ij}^{(n)}(\lambda _L)]\) be the \(s^{(n)} \times s^{(n)}\) transition-rate matrix of \(E_H^{(n)}\), using the assigned labelling. Thus, if \(i \ne j\) then \(q_{ij}^{(n)}(\lambda _L)\) is the transition rate of \(E_H^{(n)}\) from the state having label \(i\) to the state having label \(j\), and \(q_{ii}^{(n)}(\lambda _L) = -\sum _{j \ne i}q_{ij}^{(n)}(\lambda _L)\). Note that if a label \(i\) corresponds to a household state \((x,0)\), then \(q_{ij}^{(n)}(\lambda _L) = 0\) for all \(j\). If \(k\) is the label assigned to state \((x,y)\in \fancyscript{T}^{(n)}\) then \(p_{x,y}^{(n)}(t|\lambda _L) = (e^{tQ^{(n)}(\lambda _L)})_{1k}\), where \(e^{tQ^{(n)}(\lambda _L)} = \sum _{l = 0}^\infty (tQ^{(n)}(\lambda _L))^l/l!\) denotes the usual matrix exponential. Hence,

$$\begin{aligned} \tilde{p}_{x,y}^{(n)}(r|\lambda _L) =\int \limits _0^\infty e^{-rt}\left( e^{tQ^{(n)}(\lambda _L)}\right) _{1k} \mathop {}\!\mathrm {d}t = \left( [rI_{s^{(n)}} - Q^{(n)}(\lambda _L)]^{-1}\right) _{1k}, \end{aligned}$$

where \(I_{s^{(n)}}\) is the \(s^{(n)} \times s^{(n)}\) identity matrix.
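The resolvent formula can be evaluated directly by building \(Q^{(n)}(\lambda _L)\) and solving one linear system, since the first row \(v\) of \([rI_{s^{(n)}} - Q^{(n)}(\lambda _L)]^{-1}\) satisfies \(v[rI_{s^{(n)}} - Q^{(n)}(\lambda _L)] = e_1\), where \(e_1\) is the first unit row vector. The sketch below uses only the Python standard library, with a small Gaussian elimination routine in place of a linear algebra package. The transition rates (\(\lambda _L xy\) for an infection and \(\gamma y\) for a recovery in state \((x,y)\), with \(\gamma = 1\) by default), the particular labelling and the function names are assumptions made for illustration.

```python
def rate_matrix(n, lam_L, gamma=1.0):
    """Return (states, Q); states[0] is the initial state (n-1, 1)."""
    states = [(n - 1, 1)] + [(x, y)
                             for x in range(n - 1, -1, -1)
                             for y in range(n - x, -1, -1)
                             if (x, y) != (n - 1, 1)]
    idx = {s: k for k, s in enumerate(states)}
    size = len(states)
    Q = [[0.0] * size for _ in range(size)]
    for (x, y), i in idx.items():
        if y >= 1:
            Q[i][idx[(x, y - 1)]] += gamma * y                 # recovery
            if x >= 1:
                Q[i][idx[(x - 1, y + 1)]] += lam_L * x * y     # local infection
        Q[i][i] = -sum(Q[i])                                   # rows sum to zero
    return states, Q


def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    m = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(m):
        piv = max(range(c, m), key=lambda i: abs(M[i][c]))
        M[c], M[piv] = M[piv], M[c]
        for i in range(c + 1, m):
            f = M[i][c] / M[c][c]
            for k in range(c, m + 1):
                M[i][k] -= f * M[c][k]
    z = [0.0] * m
    for i in range(m - 1, -1, -1):
        z[i] = (M[i][m] - sum(M[i][k] * z[k] for k in range(i + 1, m))) / M[i][i]
    return z


def laplace_probs(n, lam_L, r, gamma=1.0):
    """{(x, y): ([r I - Q]^{-1})_{1k}}, where k is the label of state (x, y)."""
    states, Q = rate_matrix(n, lam_L, gamma)
    size = len(states)
    # Transpose of (r I - Q), so that solving gives the first ROW of the inverse.
    At = [[(r if i == j else 0.0) - Q[j][i] for j in range(size)]
          for i in range(size)]
    e1 = [1.0] + [0.0] * (size - 1)
    return dict(zip(states, solve(At, e1)))


if __name__ == "__main__":
    for state, value in laplace_probs(3, 1.0, 0.5).items():
        print(state, round(value, 6))
```

Two properties are convenient for testing such code: the number of states should equal \(n(n+3)/2\), and the returned values should sum to \(1/r\).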

The estimating procedure described in Sect. 4.1 assumes that the distribution of the infectious period is known. The theory may be extended easily to the setting where a parametric form is assumed for the infectious period distribution, with unknown parameters that need to be estimated from the data. For example, if the infectious period is assumed to follow an exponential distribution with rate \(\gamma \), then the preceding theory goes through with \({p}_{x,y}^{(n)}(t|\lambda _L)\) replaced in an obvious fashion by \({p}_{x,y}^{(n)}(t|\lambda _L,\gamma )\) and \((\lambda _L,\gamma )\) being estimated by maximising the appropriate normalised pseudolikelihood function. Note that for final outcome data it is impossible to estimate both \(\lambda _L\) and \(\gamma \), since the final outcome distribution is invariant to rescaling of time. However, that is not the case in an emerging epidemic setting, as the exponential growth rate is clearly time-scale dependent.

The assumption of exponentially distributed infectious periods can be relaxed by using the phase method (see, e.g. Asmussen 1987, pp. 71–78). For example, a \(J\)-stage Erlang distribution for the infectious period can be accommodated by splitting the infectious period into \(J\) stages having independent exponentially distributed durations. The Markov property is maintained by expanding the state space of a single-household epidemic to include the number of infectives in each of the \(J\) stages. This can lead to an appreciable increase in the size of \(\fancyscript{T}^{(n)}\). One can also extend the model to an SEIR (susceptible \(\rightarrow \) exposed \(\rightarrow \) infectious \(\rightarrow \) recovered) model by introducing a latent period. In the simplest case, both infectious and latent periods follow exponential distributions, in which case the state space of a single-household epidemic is extended to include the number of exposed (i.e. latent) individuals, but again the phase method can be used to accommodate more general distributions.

The methodology can be extended to allow the local contact rate to depend on household size. For \(n = 1,2,\ldots ,n_{max},\) let \(\lambda _L^{(n)}\) denote the local contact rate in a household of size \(n\). Then, provided there are enough households of each size in the sample, \((\lambda _L^{(2)}, \lambda _L^{(3)}, \dots , \lambda _L^{(n_{max})})\) can be estimated jointly, e.g. by replacing \(\lambda _L\) by \(\lambda _L^{(n)}\) in (4.5). Alternatively, one can assume a specific form for \(\lambda _L^{(n)}\) (e.g. Cauchemez et al. (2004) use \(\lambda _L^{(n)} = \lambda _L/n\) for influenza) and estimate its unknown parameter (here \(\lambda _L\)) in the obvious fashion.

5 Application to the Reed-Frost model

5.1 The Reed-Frost model

Under the Reed-Frost model, the latent period is assumed to have a constant duration, which without loss of generality can be taken to be one unit of time, and the infectious period is reduced to a single point in time. Consider an epidemic initiated by a small number of individuals being infected at time \(t=0\) among a population having the same structure as that outlined in Sect. 2. For \(t=0,1,\dots \), individuals infected at time \(t\) become infectious at time \(t+1\). Different infectives behave independently of each other. Consider an individual that is infected at time \(t\). At time \(t+1\) it makes global infectious contact with any given susceptible in the population with probability \(p_G=1-\exp (-\mu _G/N)\) and, additionally and independently, local infectious contact with any given susceptible in its household with probability \(p_L\). Moreover, contacts between this infectious individual and distinct susceptible individuals are mutually independent. Any susceptible individual that is contacted by at least one infective at time \(t\) is infected and becomes infectious at time \(t+1\). The process continues until there is no infective left in the population.
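The local part of one generation of this model is a chain-binomial step: with \(y\) infectives in a household, each of the \(x\) susceptibles independently escapes local infection with probability \((1-p_L)^y\). A minimal simulation sketch of this step, together with the global contact probability \(p_G\), is given below. Global infection of the household from outside is deliberately ignored, and the function names are hypothetical.

```python
import math
import random


def global_contact_prob(mu_G, N):
    """p_G = 1 - exp(-mu_G / N), computed with expm1 for accuracy when N is large."""
    return -math.expm1(-mu_G / N)


def household_step(x, y, p_L, rng):
    """One generation of local spread in a household with x susceptibles, y infectives.

    Returns (susceptibles remaining, new infectives in the next generation).
    The current y infectives are recovered after this step.
    """
    escape = (1.0 - p_L) ** y            # probability of avoiding all y infectives
    z = sum(1 for _ in range(x) if rng.random() >= escape)
    return x - z, z


def household_outbreak(n, p_L, rng):
    """Total number ever infected in a size-n household started by one infective."""
    x, y, total = n - 1, 1, 1
    while y > 0:
        x, y = household_step(x, y, p_L, rng)
        total += y
    return total


if __name__ == "__main__":
    rng = random.Random(2015)  # arbitrary seed, for reproducibility only
    print([household_outbreak(4, 0.3, rng) for _ in range(10)])
    print(global_contact_prob(2.0, 10**6))
```

The two extreme cases \(p_L = 0\) (no local spread) and \(p_L = 1\) (every susceptible infected in one generation) are useful sanity checks.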

Again, we consider the case of an emerging epidemic, so it is assumed that, when the epidemic is observed, the proliferation of infected households still mimics a discrete-time branching process. Note that in the limit as the population size \(N \rightarrow \infty \), the mean number of global contacts made by a typical infective is \(\mu _G\). Note also that upon infection a household of size \(n\) is in state \((n,n-1,1)\) and that in subsequent generations that household contains at least one recovered individual. We assume that it is possible to observe the geometric growth rate \(\rho (p_L,\mu _G)\) of the approximating branching process. The parameter \(\mu _G\) increases with \(\rho (p_L,\mu _G)\) for fixed \(p_L\), so for any estimate of \(p_L\), an estimate for \(\mu _G\) is pre-determined since it is assumed that \(\rho (p_L,\mu _G)\) can be observed directly.

5.2 Estimating \(p_L\)

The local contact probability \(p_L\) can be estimated by approximating the early stages of a Reed-Frost epidemic with a discrete-time multitype branching process \(S\). Define the type space of \(S\) as \(\fancyscript{T}_{RF} = \{(n,n-1,1) : 1 \le n \le n_{max}\}\cup \bigcup _{n=1}^{n_{max}}\{(n,x,y) : x \ge 0, y \ge 1, x + y < n\}\) and label the elements of \(\fancyscript{T}_{RF}\) as \(1,2,\ldots ,k\), where \(k = |\fancyscript{T}_{RF}| = n_{max} + \sum _{n=2}^{n_{max}}\frac{n(n-1)}{2} = n_{max}(n_{max}^2 + 5)/6\). The type space includes all possible household states where infection is still present.
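As a quick check of the counting formula for \(k\), the type space can be enumerated directly. The following is a minimal Python sketch; the function name `reed_frost_type_space` is purely illustrative and does not come from the paper.

```python
def reed_frost_type_space(n_max):
    """Enumerate T_RF: household states (n, x, y) in which infection is still present."""
    types = []
    for n in range(1, n_max + 1):
        types.append((n, n - 1, 1))        # state of a household upon infection
        for x in range(0, n):
            for y in range(1, n - x):      # y >= 1 and x + y < n
                types.append((n, x, y))
    return types


n_max = 6
types = reed_frost_type_space(n_max)
k = len(types)
# Compare with the closed form n_max * (n_max^2 + 5) / 6 given in the text.
print(k, n_max * (n_max**2 + 5) // 6)
```

For \(n_{max} = 6\), as used in the numerical illustrations, both counts equal \(41\).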

Let \(\varvec{M}\) be the mean matrix of \(S\) on \(\fancyscript{T}_{RF}\), so the element \(m_{ij}\) is the expected number of type-\(j\) individuals that a typical type-\(i\) individual gives birth to upon death. Under the Reed-Frost model, a household in state \((n,x,y)\) gives birth to an expected number of \(y\tilde{\alpha }_{n^{\prime }}\mu _G\) households in state \((n^{\prime },n^{\prime }-1,1)\), for \(n^{\prime } = 1,2,\ldots ,n_{max}\), as a result of global infectious contacts (each of its \(y\) infectives makes on average \(\mu _G\) global contacts), and to an expected number of \(\left( {\begin{array}{c}x\\ z\end{array}}\right) (1-(1-p_L)^y)^z(1-p_L)^{y(x-z)}\) households in state \((n,x-z,z)\), for \(z = 1,2,\ldots ,x\), from local contacts (the case \(z = 0\) corresponds to the single-household epidemic terminating). Let \(\varvec{Y}_t = (Y_{t1},Y_{t2},\ldots ,Y_{tk})\) denote the number of individuals of each type from \(\fancyscript{T}_{RF}\) alive after \(t\) generations of \(S\) and let \(\rho (p_L,\mu _G)\) be the maximal eigenvalue of \(\varvec{M}\). Assume that \(\rho (p_L,\mu _G) > 1\), so the branching process is supercritical. Kesten and Stigum (1966) show that if \(\varvec{u}(p_L,\mu _G)\) is the left-eigenvector associated with \(\rho (p_L,\mu _G)\), normalised so that its components sum to one, then

$$\begin{aligned} \rho (p_L,\mu _G)^{-t}\varvec{Y}_t \xrightarrow {\text {a.s.}} W\varvec{u}(p_L,\mu _G) \quad \text {as } t \rightarrow \infty , \end{aligned}$$
(5.1)

where \(W\) is a non-negative random variable such that \(W = 0\) if and only if \(S\) becomes extinct. The eigenvector \(\varvec{u}(p_L,\mu _G)\) therefore gives the proportions of individuals of each type in \(S\) as \(t\rightarrow \infty \), conditional upon \(S\) not going extinct. It follows from (5.1) that

$$\begin{aligned} \rho (p_L,\mu _G)^{-t}\sum \limits _{t^{\prime }=1}^t\varvec{Y}_{t^{\prime }} \xrightarrow {\text {a.s.}} \frac{\rho (p_L,\mu _G)}{\rho (p_L,\mu _G)-1}W\varvec{u}(p_L,\mu _G) \quad \text {as } t \rightarrow \infty . \end{aligned}$$
(5.2)
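One way to see how (5.2) follows from (5.1) is the following heuristic sketch; the interchange of limit and sum is plausible because \(\rho (p_L,\mu _G) > 1\) and the convergent sequence \(\rho (p_L,\mu _G)^{-t^{\prime }}\varvec{Y}_{t^{\prime }}\) is almost surely bounded. Writing \(\rho = \rho (p_L,\mu _G)\) and \(\varvec{u} = \varvec{u}(p_L,\mu _G)\), and substituting \(s = t - t^{\prime }\),

$$\begin{aligned} \rho ^{-t}\sum \limits _{t^{\prime }=1}^{t}\varvec{Y}_{t^{\prime }} = \sum \limits _{s=0}^{t-1}\rho ^{-s}\left( \rho ^{-(t-s)}\varvec{Y}_{t-s}\right) \xrightarrow {\text {a.s.}} W\varvec{u}\sum \limits _{s=0}^{\infty }\rho ^{-s} = \frac{\rho }{\rho -1}W\varvec{u} \quad \text {as } t \rightarrow \infty . \end{aligned}$$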

Let \(\varvec{Z}_t = (Z_{t1},Z_{t2},\ldots ,Z_{tk})\), where \(Z_{ti}\) denotes the number of single-household epidemics that terminate before \(t\) generations of the epidemic, for which the last active household state was \(i \in \fancyscript{T}_{RF}\). A household in state \((n,x,y)\) at time \(t^{\prime }\) has probability \((1 - p_L)^{xy}\) of containing no infectives at time \(t^{\prime } + 1\). Hence, if \((n,x,y)\) is the household state associated with a type-\(i\) individual in \(S\), it follows from (5.2) and the strong law of large numbers that, for \(i = 1,2,\ldots ,k,\)

$$\begin{aligned} \rho (p_L,\mu _G)^{-t}Z_{ti} \xrightarrow {\text {a.s.}} W\frac{(1-p_L)^{xy}}{\rho (p_L,\mu _G)-1}u_i(p_L,\mu _G) \quad \text {as } t \rightarrow \infty . \end{aligned}$$

Let \(u_{(n,x,y)} = u_i\) where \(i\) is the label of a type-\((n,x,y)\) individual in \(S\). By noting that any single-household epidemic finishing the generation after it was in state \((n,x,y)\) finishes with \(x\) susceptibles remaining, define the function \(p_{RFfull}(n,x,y|p_L,\mu _G)\) as follows:

$$\begin{aligned} p_{RFfull}(n,x,y|p_L,\mu _G) = \left\{ \begin{array}{l l} Ku_{(n,x,y)}(p_L,\mu _G) &{} \quad \text {if }\,y \ge 1,\\ K\sum \nolimits _{y^{\prime }:\,(n,x,y^{\prime }) \in \fancyscript{T}_{RF}}(1-p_L)^{xy^{\prime }}\frac{u_{(n,x,y^{\prime })}(p_L,\mu _G)}{\rho (p_L,\mu _G)-1} &{} \quad \text {if }\,y = 0, \end{array} \right. \end{aligned}$$

where \(K\) is chosen such that

$$\begin{aligned} \sum _{n=1}^{n_{max}}\left[ \left( \sum _{x=0}^{n-1}\sum _{y=0}^{n-x-1}p_{RFfull}(n,x,y|p_L,\mu _G)\right) \!+\! p_{RFfull}(n,n-1,1|p_L,\mu _G)\right] \!=\! 1. \end{aligned}$$

One can then estimate \(p_L\) by performing maximum pseudolikelihood estimation in exactly the same manner as described using \(L_{full}\) in Sect. 4.1. Note that this estimation procedure can be adapted to the case where susceptibles and infectives are indistinguishable, using the same method as described for \(L_{rec}\) in Sect. 4.1.
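The quantities in this section can be assembled numerically. The Python sketch below builds \(\varvec{M}\), extracts \(\rho (p_L,\mu _G)\) and \(\varvec{u}(p_L,\mu _G)\), and forms the distribution \(p_{RFfull}\). It rests on three assumptions that should be checked against Sect. 2 and the display above: \(\tilde{\alpha }_n\) is taken to be the size-biased household-size distribution \(n\alpha _n/\sum _j j\alpha _j\); each of the \(y\) infectives in a household is taken to make on average \(\mu _G\) global contacts; and the \(y = 0\) case is read as a sum over every active state \((n,x,y^{\prime }) \in \fancyscript{T}_{RF}\), weighted by \((1-p_L)^{xy^{\prime }}u_{(n,x,y^{\prime })}/(\rho -1)\), as suggested by the limit for \(Z_{ti}\). All function names are illustrative.

```python
from math import comb

import numpy as np


def type_space(n_max):
    """States (n, x, y) of T_RF: initial states plus all x >= 0, y >= 1, x + y < n."""
    types = []
    for n in range(1, n_max + 1):
        types.append((n, n - 1, 1))
        types.extend((n, x, y) for x in range(n) for y in range(1, n - x))
    return types


def mean_matrix(p_L, mu_G, alpha):
    """Mean matrix M of the approximating multitype branching process."""
    n_max = len(alpha)
    types = type_space(n_max)
    index = {state: i for i, state in enumerate(types)}
    # Assumption: alpha_tilde is the size-biased household-size distribution.
    weights = np.array([n * a for n, a in enumerate(alpha, start=1)])
    alpha_tilde = weights / weights.sum()
    q = 1.0 - p_L
    M = np.zeros((len(types), len(types)))
    for (n, x, y), i in index.items():
        # Global contacts: assumed mean y * mu_G, split by size-biased household size.
        for n2 in range(1, n_max + 1):
            M[i, index[(n2, n2 - 1, 1)]] += y * mu_G * alpha_tilde[n2 - 1]
        # Local contacts: z >= 1 new infectives keep the household active.
        for z in range(1, x + 1):
            M[i, index[(n, x - z, z)]] += comb(x, z) * (1 - q**y) ** z * q ** (y * (x - z))
    return M, types


def perron_pair(M):
    """Maximal eigenvalue rho and left eigenvector u, normalised to sum to one."""
    eigvals, eigvecs = np.linalg.eig(M.T)  # right eigenvectors of M^T are left eigenvectors of M
    j = int(np.argmax(eigvals.real))
    u = eigvecs[:, j].real
    return float(eigvals[j].real), u / u.sum()


def p_rf_full(p_L, mu_G, alpha):
    """Return ({(n, x, y): probability}, rho), including terminated states with y = 0."""
    M, types = mean_matrix(p_L, mu_G, alpha)
    rho, u = perron_pair(M)
    q = 1.0 - p_L
    weight = {}
    for (n, x, y), u_i in zip(types, u):
        weight[(n, x, y)] = weight.get((n, x, y), 0.0) + u_i  # household still active
        # Households that terminated straight after being in state (n, x, y).
        weight[(n, x, 0)] = weight.get((n, x, 0), 0.0) + q ** (x * y) * u_i / (rho - 1.0)
    total = sum(weight.values())  # equals 1 / K
    return {state: w / total for state, w in weight.items()}, rho


alpha = [0.13, 0.30, 0.23, 0.18, 0.09, 0.07]
dist, rho = p_rf_full(0.61, 1.21, alpha)
print(round(rho, 3), round(sum(dist.values()), 6))
```

The parameter values in the usage lines are those of Sect. 6.1, for which the text reports \(\rho (p_L,\mu _G) = 2.248\). A pseudolikelihood for \(p_L\) could then be maximised over a grid of \(p_L\) values, re-solving for \(\mu _G\) at each grid point so that \(\rho (p_L,\mu _G)\) matches the observed geometric growth rate.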

6 Numerical illustrations

6.1 Methods of estimation

We illustrate applications of the preceding theory using simulation studies with parameter choices loosely based on Fraser’s (2007) analysis of varicella data. Simulations are performed on a population of \(m =\) 10,000 households with distribution \(\varvec{\alpha } = [0.13,0.30,0.23,0.18,0.09,0.07]\). This distribution is based on the 1961 UK census data (http://www.statistics.gov.uk/census/) and contains a higher proportion of larger households than the 2001 distribution used previously, implying that local infectious contacts should have a greater effect on the simulated epidemics. The population size is chosen so that it is small enough to represent a realistic population cluster (e.g. a town) but large enough so that there are sufficient data to estimate \(\lambda _L\) whilst the epidemic is still in its emerging phase. For the sake of simplicity, an exponentially distributed infectious period with rate \(1\) is used. Fraser suggests having a within-household susceptible-infectious escape probability of \(0.39\), as reported by Hope-Simpson (1952), and that infected individuals be expected to infect \(1.21\) susceptibles outside of their household. This implies parameter values of \(\lambda _G = 1.21,~\lambda _L = 1.565\) [since \(\phi (1.565) = 0.39\), where \(\phi (\theta ) = \mathbb {E}[\exp (-\theta T_I)] = (1 + \theta )^{-1}\)] and \(r = 1.762\) [recall (4.3)] in the continuous-time case and \(\mu _G = 1.21,~p_L = 0.61~(= 1 - 0.39)~\hbox {and}~\rho (p_L,\mu _G) = 2.248\) under the Reed-Frost model. Unless stated otherwise, growth rates are estimated by fitting a straight line to the logarithm of the number of recoveries, as a function of time, using the polyfit function in MATLAB. The first \(20\) recoveries are ignored when estimating \(r\), to allow the exponentially growing phase of the epidemic to become established. Note that, while this is the most common method of estimating \(r\), other methods are also considered in the literature; see, e.g. Ma et al. (2014).
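The fitting step can be sketched in Python, with NumPy's `polyfit` standing in for the MATLAB function of the same name. The function name `estimate_growth_rate` is illustrative, and the sketch assumes that the fit is made to the cumulative number of recoveries at each recovery time, which the text does not state explicitly.

```python
import numpy as np


def estimate_growth_rate(recovery_times, skip=20):
    """Slope of log(cumulative recoveries) against time, ignoring the first `skip` recoveries."""
    times = np.sort(np.asarray(recovery_times, dtype=float))
    counts = np.arange(1, len(times) + 1)  # cumulative number of recoveries (assumed)
    slope, _intercept = np.polyfit(times[skip:], np.log(counts[skip:]), 1)
    return float(slope)


# Synthetic check: recovery times chosen so that the count grows exactly like exp(r * t).
r_true = 1.762
synthetic_times = np.log(np.arange(1, 501)) / r_true
print(estimate_growth_rate(synthetic_times))
```

On real or simulated epidemic data the fit is of course noisy, and the choice of how many early recoveries to discard affects the estimate.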

Fig. 3
figure 3

Estimates of the SAR (true value 61 %) through time for a single SIR households epidemic. The four estimation methods outlined earlier in the paper are shown along with estimates of the SAR using the final-size method

For illustrative purposes, estimates of \(\lambda _L\) are given in terms of the secondary attack rate (SAR), as defined by Longini and Koopman (1982). The SAR is the probability that an infective infects locally a given household member, expressed as a percentage, and is given by \(100(1-\phi (\lambda _L))\). (Note that, with the continuous-time and discrete-time models, matching the SAR and \(\lambda _G\) results in different growth rates.) The SAR is used since the variance of estimates of \(\lambda _L\), under any of the methods outlined in this paper, increases greatly as the true value of \(\lambda _L\) increases, whereas the variance of the SAR estimates is closer to being constant whatever its true value. Note that for a given distribution of \(T_I\), SAR strictly increases with \(\lambda _L\).
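For an exponentially distributed infectious period, the conversion between \(\lambda _L\) and the SAR is available in closed form. The short Python sketch below assumes an exponential infectious period with rate \(\gamma \), so that \(\phi (\theta ) = \gamma /(\gamma + \theta )\), which reduces to \((1+\theta )^{-1}\) when \(\gamma = 1\); the function names are illustrative.

```python
def sar_percent(lambda_L, gamma=1.0):
    """SAR (%) when the infectious period is exponential with rate gamma."""
    phi = gamma / (gamma + lambda_L)  # Laplace transform E[exp(-lambda_L * T_I)]
    return 100.0 * (1.0 - phi)


def lambda_from_sar(sar, gamma=1.0):
    """Inverse map: the local contact rate giving a specified SAR (%)."""
    p = sar / 100.0
    return gamma * p / (1.0 - p)


print(sar_percent(1.565), lambda_from_sar(61.0))
```

With \(\lambda _L = 1.565\) and \(\gamma = 1\) this gives an SAR of approximately 61 %, matching the value quoted in Sect. 6.1.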

It is shown in Sects. 3 and 4 that an emerging households epidemic can be approximated by a Crump-Mode-Jagers branching process (CMJBP); however, there is no indication as to when an epidemic can still be considered to be in its emerging phase. Figure 3 shows estimates of the SAR throughout the lifetime of a single simulated SIR epidemic using the parameters outlined above. Estimates of \(\lambda _L\) (and hence of the SAR using the above formula) were made at regular intervals throughout the epidemic using basic MPLE, censored MPLE, full- and recovery-pseudolikelihood estimation methods [using (4.5) and (4.6), respectively] and the pseudolikelihood method of Ball and Lyne (2015) (cf. Section 5.1 of Ball et al. 1997), which uses the distribution of susceptible individuals in households of all sizes at the end of an epidemic and is referred to as the final-size method of estimation. Note that for the basic MPLE method, it takes some time before the SAR is estimated to be non-zero. This can be explained by the reliance of this method on household epidemics being completed, since the basic MPLE method will only pick up any trace of local infectivity when a completed single-household epidemic with more than one recovered individual is observed. As would be expected, the final-size method appears to tend to the true SAR value as \(t \rightarrow \infty \). The initially large estimates from the final-size data can be explained by noting that few households are infected at this time but that recoveries are clustered within households. The former point suggests a very low value of \(\lambda _G\) (considering that the estimator assumes that the epidemic is complete), so the estimate of the SAR is large to account for the clustering of recovered individuals. Note that the recovery-pseudolikelihood method estimates the SAR to be 100 % as the epidemic approaches completion.
In the epidemic outlined above, with growth rate \(r = 1.762\) but with an SAR of 100 %, appreciably fewer than half of all infected households of size \(3\) and above are expected to contain only recovered individuals during the emerging phase. Once the true epidemic (with an SAR of 61 %) is completed, appreciably more than 80 % of households of size \(3\) and above in the entire population are expected to contain only recovered individuals. This suggests that there is a threshold, after the epidemic has stopped approximating a CMJBP, when the number of recovered individuals in infected households exceeds the expectations of even the maximum possible SAR in the recovery-pseudolikelihood estimation method, hence this method will continue to give an MPLE for the SAR as 100 % for the remainder of the epidemic.

Fig. 4
figure 4

Kernel density estimates of the distribution of the estimator of the SAR (true value 61 %) based on \(1000\) simulations of the outlined epidemic using the full and recovery CMJBP estimation methods, both with and without the recovery rate \(\gamma \) (true value \(1.00\)) being also estimated. Inset scatter plot of estimates of \((SAR,\gamma )\) for the full-pseudolikelihood (\(\gamma \) unknown) method

Figure 3 shows that once an epidemic has had sufficient time to establish itself, there is a window when both the full and recovery CMJBP methods appear to give a good estimate of the SAR. Moreover, the length of this window is roughly the same for both CMJBP methods, although the recovery method gives a less reliable estimate owing to it using less information. This is confirmed in Fig. 4 which shows kernel density estimates of the distribution of the estimator of SAR for both CMJBP methods from \(1000\) simulations of the epidemic outlined above. The plots marked ‘\(\gamma \) known’ use the methodology described in Sect. 4.1 and those marked ‘\(\gamma \) unknown’ assume that \(\gamma \) is also estimated, as described in Sect. 4.2. Estimates of the SAR were made from each simulation after \(500\) recoveries were observed for reasons outlined below. Irrespective of whether or not \(\gamma \) is also estimated, both the full and recovery methods yield estimates of the SAR that are centred broadly around the true value of 61 % but the recovery method yields estimates having a far greater variance. The variance of the estimates is greater when \(\gamma \) is assumed unknown than when it is assumed known but the difference is appreciably smaller than that between the full and recovery methods.

The inset of Fig. 4 shows a scatter plot of the estimates of \((SAR,\gamma )\) using the full-pseudolikelihood method, which indicates that the estimates of the SAR and \(\gamma \) are positively correlated.

Repeated simulations using different population sizes yielded very similar results to those seen in Fig. 3, in that there appears to be a window once the epidemic has established itself when a households SIR epidemic can still be considered to be in its emerging phase and the full-pseudolikelihood estimate is relatively accurate. The start of this window corresponds to when the asymptotic behaviour of the approximating CMJBP takes hold, the timing of which is independent of the total population size \(N\), provided \(N\) is sufficiently large. Further simulations suggested that this window ends when approximately \(N^{2/3}\) recoveries have occurred, after which the CMJBP approximation of the households epidemic breaks down. The time taken for \(N^{2/3}\) recoveries to take place depends on the severity of the epidemic and the population size. Note that Barbour and Utev (2004) prove that a homogeneously mixing Reed-Frost model can be closely approximated by a branching process up until order \(N^{2/3}\) individuals have been infected.

Fig. 5
figure 5

EMSE of estimates of the SAR using the full-pseudolikelihood method. See text for details

The above points are illustrated in Fig. 5, which shows estimates of the mean squared error (EMSE) of estimates of the SAR, using the full-pseudolikelihood method and assuming that \(\gamma \) (\(=\)1) is known, throughout the emerging stages of \(1000\) simulated epidemics among populations with differing numbers of households but with the same population structure \(\alpha \), growth-rate \(r\) and SAR as given above. If \(SAR_1,SAR_2,\ldots ,SAR_{1000}\) denote the estimates of the SAR (true value 61 %) obtained from \(1000\) simulated epidemics then EMSE \(= \frac{1}{1000}\sum _{i=1}^{1000}(SAR_i-61)^2\). It is assumed that the value \(r\) is known, in order that the figure illustrates only when the distribution of household states in an emerging epidemic conforms to its equivalent branching process. It can be seen that it takes approximately \(50\) recoveries to occur (regardless of population size) for the EMSE to settle to a reasonable value due to the high variance of SAR estimates when too few households have been infected and the epidemic is yet to establish itself in the population. The length of this window then clearly increases with population size as a consequence of a higher percentage of fully susceptible households still being available at this stage of the epidemic. For the population considered in most of the numerical illustrations, i.e. consisting of 10,000 households, it appears appropriate to estimate the SAR after approximately \(500\) recoveries have occurred.
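The EMSE defined above is simply the average squared deviation of the simulated estimates from the true SAR. A minimal Python sketch follows; the function name `emse` and the toy input are illustrative only and are not simulation output from the paper.

```python
import numpy as np


def emse(sar_estimates, true_sar=61.0):
    """Estimated mean squared error of SAR estimates (in percentage points squared)."""
    estimates = np.asarray(sar_estimates, dtype=float)
    return float(np.mean((estimates - true_sar) ** 2))


# Toy input: three hypothetical SAR estimates.
print(emse([60.0, 62.0, 61.0]))
```

Because the deviations are taken from the true value rather than from the sample mean, the EMSE captures both the variance and the squared bias of the estimator.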

Fig. 6
figure 6

Kernel density estimates of the distribution of the estimator of \(p_L\) (true value \(0.61\)) based on \(1000\) simulations of Reed-Frost type epidemics; see text for details

We now consider estimation of \(p_L\) in the Reed-Frost model. A single-household epidemic in a household of size \(n\) can last for at most \(n\) generations. Thus, under the assumption that all global contacts are with individuals in previously uninfected households, if the households epidemic is observed in the \(k\)th generation, one can estimate \(p_L\) by using an adaptation of the basic MPLE method from the continuous time case as follows. If one wishes to make the estimate in the \(k\)th generation then the single-household epidemics in all households with at least one recovery in the \((k-n_{max}+1)\)th generation are certain to have been completed. One can then estimate \(p_L\) by using only the latter households and considering the final-size distributions of single-household epidemics under the Reed-Frost model to perform the basic MPLE method of estimation in the same manner as before. This circumvents the problem of uncompleted epidemics in households but at the expense of ignoring the information about \(p_L\) contained in those single-household epidemics.
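To make the adapted basic MPLE concrete, the sketch below computes the final-size distribution of a single-household Reed-Frost epidemic started by one infective and maximises the resulting likelihood over a grid of \(p_L\) values. It is a simplified illustration under stated assumptions: the input is a dictionary of counts of completed single-household epidemics, keyed by (household size, number of recoveries), the selection of households by generation is assumed to have been done beforehand, and a grid search stands in for whatever optimiser was actually used. All names are illustrative.

```python
from math import comb, log


def final_size_pmf(n, p_L):
    """pmf[j] = P(j members ultimately infected) in a size-n household, one initial infective."""
    q = 1.0 - p_L
    pmf = [0.0] * (n + 1)

    def step(x, y, prob):
        # x susceptibles and y infectives in the current generation, reached with probability prob
        if y == 0 or x == 0:
            pmf[n - x] += prob
            return
        for z in range(x + 1):
            p_z = comb(x, z) * (1 - q**y) ** z * q ** (y * (x - z))
            if p_z > 0.0:
                step(x - z, z, prob * p_z)

    step(n - 1, 1, 1.0)
    return pmf


def basic_mple_reed_frost(counts, grid_size=999):
    """Grid-search maximiser of the final-size likelihood; counts[(n, j)] = number of households."""
    sizes = {n for n, _ in counts}
    best_p, best_ll = None, float("-inf")
    for i in range(1, grid_size + 1):
        p = i / (grid_size + 1)
        pmfs = {n: final_size_pmf(n, p) for n in sizes}
        ll = sum(c * log(pmfs[n][j]) for (n, j), c in counts.items())
        if ll > best_ll:
            best_p, best_ll = p, ll
    return best_p


# Toy data: 100 completed epidemics in households of size 2.
print(basic_mple_reed_frost({(2, 1): 39, (2, 2): 61}))
```

In households of size \(2\) the final size is \(2\) with probability \(p_L\), so the toy data above are maximised at the observed proportion \(0.61\). As the surrounding text explains, applying this to an emerging epidemic is biased because less severe single-household epidemics are more likely to have been completed.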

Figure 6 gives kernel density estimates of \(p_L\) (true value \(0.61\)) for \(1000\) simulations of Reed-Frost epidemics with parameters as outlined at the beginning of this section. Estimates were made in the first generation at which \(1000\) recoveries were observed using the full- and recovery-pseudolikelihood methods (i.e. both with and without the ability to distinguish between susceptibles and infectives) and by using the adapted basic MPLE method outlined above. Note that all three methods appear to give estimates that are centred roughly around the true value of \(p_L\); however, the adapted basic MPLE method estimates have a far larger variance than the other estimates, suggesting that the methods of estimation outlined in Sect. 4.1 are preferable, regardless of whether or not infectives are distinguishable. Estimates were made after \(1000\) recoveries had been observed rather than the \(500\) recoveries used in the continuous-time case, owing to the time it takes for \(500\) recoveries to occur potentially being \(n_{max}-1 = 5\) generations.

6.2 Relationship between parameters of the model and bias of the basic and censored MPLE methods

We examine the extent of the bias of the basic and censored MPLE methods, and how the bias is affected by various parameters of an epidemic, by considering “perfect” household data, \(\varvec{a}\), from an emerging epidemic, as determined by its CMJBP or multitype branching process approximation. Households data are considered to be perfect for an emerging epidemic in continuous time with parameters \(\lambda _L\) and \(r\) if the proportion of households in state \((n,x,y)\) is exactly \(\tilde{\alpha }_nr\tilde{p}_{x,y}^{(n)}(r|\lambda _L)\) for all \((n,x,y) \in \fancyscript{T}\). (Note that with perfect data, \(\hat{\lambda }_L = \hbox {argmax} \tilde{l}_{full}^{(\infty )}\); see Eq. (7.6) in Sect. 7.) Similarly, perfect data for an emerging Reed-Frost epidemic with parameters \(p_L\) and \(\mu _G\) are achieved when the proportion of households in state \((n,x,y)\) is exactly \(p_{RFfull}(n,x,y|p_L,\mu _G)\) for all \((n,x,y) \in \fancyscript{T}_{RF}\). Note that in both cases, the distribution of household states representing perfect data is also dependent on the population structure \(\varvec{\alpha } = (\alpha _1,\alpha _2,\ldots ,\alpha _{n_{max}})\). Note also that assuming perfect data is equivalent to assuming an infinite population, in which all households are observed, and that in this setting, estimates of the SAR have no illustrative advantage over those of \(\lambda _L\), since all estimates have zero variance.

Fig. 7
figure 7

Estimates of \(p_L\), for different true values of \(p_L\), assuming perfect data in emerging Reed-Frost type epidemics, \(\rho = 2.248\), using the basic and censored MPLE methods

Fig. 8
figure 8

Estimates of \(\lambda _L\) assuming perfect data for emerging epidemics, with \(r = 1.762\), among populations with equal household sizes using the basic and censored MPLE methods. The upper plot takes \(\lambda _L = 1.565\) for all household sizes. The lower plot adopts the model \(\lambda _L^{(n)} = \lambda _L/n\), where \(n\) is household size and \(\lambda _L = 6.75\)

6.2.1 Effect of local contact rate

Figure 7 illustrates the effect of the local contact rate on the bias of the basic and censored MPLE methods by considering estimates of \(p_L\) for emerging Reed-Frost epidemics with geometric growth rate \(\rho = 2.248\) and population distribution \(\varvec{\alpha } = [0.13, 0.30, 0.23, 0.18, 0.09, 0.07]\), as given in Sect. 6.1 but with different local contact probabilities. Note that given perfect data, both estimates converge to the true value of \(p_L\) as \(p_L\) tends to \(0\) or \(1\). This can be easily explained by noting that all completed single-household epidemics in households of size \(n\) will have exactly \(1\) recovery if \(p_L = 0\) and exactly \(n\) recoveries if \(p_L = 1,\) implying that the issue of less severe single-household epidemics being more likely to be included in the estimation data becomes irrelevant since all single-household epidemics are of the same severity. The basic and censored MPLE methods appear to be at their most biased in the region \(0.3 < p_L < 0.6\) when the proportion of recoveries from single-household epidemics in households of sizes \(3\) and \(4\) (which make up a significant portion of the population) are distributed in a relatively uniform manner.

6.2.2 Effect of household size

Figure 8 gives two plots showing estimates of \(\lambda _L\) in continuous-time epidemics with real-time growth rate \(r = 1.762\) assuming perfect data for populations of equal-sized households from \(2\) to \(20\). The upper plot considers the case where \(\lambda _L = 1.565\), independent of household size. In this plot the basic MPLE estimate considerably underestimates \(\lambda _L\) regardless of household size but the bias appears to get marginally worse as household size increases. This can be attributed to the most severe single-household epidemics taking longer in larger households and hence fewer of the more severe epidemics are completed by the time of estimation in larger households. The censored MPLE fares better, however, and appears to converge towards the true value of \(\lambda _L\) as household size increases. Since \(\lambda _L\) is a person-to-person contact rate, larger households are far more likely to have severe epidemics than smaller households with the same \(\lambda _L\), since the number of local contacts in a household increases quadratically with \(n\). Therefore, as household size increases, the proportion of recoveries from single-household epidemics with the same local contact rate becomes less uniform, leading to less bias in the censored MPLE estimate (as observed in Fig. 7).

The lower plot of Fig. 8 uses the same real-time growth rate and population distributions but assumes that the local infection rate depends on household size, specifically that \(\lambda _L^{(n)} = \lambda _L/n\) with \(\lambda _L = 6.75\) (see Sect. 4.2). This value was chosen as it gives a value of \(\lambda _G = 1.21\) when \(r = 1.762\) from the population distribution \(\varvec{\alpha }\) as used previously in this section. Here it can be seen that the basic and censored MPLE approaches both become more biased as household size increases. In the basic case this is for the same reasons as before, whereas in the censored case, the additional local contacts that come from an increased household size are compensated by the reduction of the local contact rate, leading to the relatively uniform distribution of recoveries in a single-household epidemic which causes bias.

6.2.3 Effect of growth rate

Figure 9 shows estimates of \(\lambda _L\) in emerging epidemics with \(\lambda _L\) and \(\varvec{\alpha }\) as defined in Sect. 6.1. It is clear from the plot that both the basic and censored MPLE estimates converge to the true value of \(\lambda _L\) as \(r \rightarrow 0\), as is proved in Sect. 4.1.

Fig. 9
figure 9

Estimates of \(\lambda _L\) assuming perfect data in emerging epidemics with different real-time growth rates \(r\) using the basic and censored MPLE methods

7 Strong consistency of estimators

In this section we consider the asymptotic behaviour of the estimators of \(\lambda _L\) described in Sect. 4 as the number of households in the population tends to infinity. Specifically we show that, under suitable conditions, the estimators are strongly consistent, conditional upon the epidemic taking off.

Consider a sequence of epidemics \(E^{(m)}~(m = 1,2,\ldots )\), indexed by the number of households in the population. For \(m = 1,2,\ldots \) and \(n = 1,2,\ldots ,n_{max}\), let \(\alpha _n^{(m)}\) be the proportion of households in \(E^{(m)}\) that have size \(n\). The epidemic \(E^{(m)}\) is as defined in Sect. 2 and has one initial infective, who is chosen uniformly at random from the population. The infection parameters \((\lambda _L,\lambda _G)\) and the infectious period distribution are all assumed to be independent of \(m\), as is the maximum household size \(n_{max}\). It is assumed that \(\alpha _n^{(m)} \rightarrow \alpha _n~\text {as } m \rightarrow \infty ~(n= 1,2,\ldots ,n_{max}).\)

Let \(E^{(\infty )}\) denote the general branching process, introduced in Sect. 3 and analysed further in Sect. 4, which approximates the epidemic \(E^{(m)}\) for suitably large \(m\). Recall that for \((n,x,y) \in \fancyscript{T}\), the number of individuals in \(E^{(\infty )}\) having state \((n,x,y)\) at time \(t\) is denoted by \(Y_{n,x,y}(t)\). For \(m = 1,2,\ldots , (n,x,y) \in \fancyscript{T}\) and \(t \ge 0\), let \(Y_{n,x,y}^{(m)}(t)\) denote the number of size-\(n\) households in \(E^{(m)}\) that have \(x\) susceptibles and \(y\) infectives at time \(t\). Let \(\fancyscript{T}_L = \{(n,x,y) \in \fancyscript{T}:~y \ge 1\}\). For \(t \ge 0\), let \(Y(t) = \sum _{(n,x,y)\in \fancyscript{T}_L}Y_{n,x,y}(t)\) denote the number of “live” individuals in \(E^{(\infty )}\) at time \(t\). Recall that \(r\) denotes the Malthusian parameter of \(E^{(\infty )}\).

Theorem 7.1

Suppose that \(r > 0\). Then there is a probability space \((\varOmega ,\fancyscript{F},\mathrm {P})\) on which are defined a sequence of epidemics \(E^{(m)}~(m \ge 1)\) and the approximating branching process \(E^{(\infty )}\) satisfying the following property. Let \(A =\{\omega \in \varOmega : \lim _{t \rightarrow \infty }~Y(t,\omega ) = 0\}\) denote the set on which the branching process \(E^{(\infty )}\) goes extinct. Then for \(\mathrm {P}\)-almost all \(\omega \in A^{c}\) and any \(c \in (0,\frac{1}{2}r^{-1})\),

$$\begin{aligned} \sup _{0\le t\le c\log m}~\max _{(n,x,y) \in \fancyscript{T}} |Y_{n,x,y}^{(m)}(t,\omega ) - Y_{n,x,y}(t,\omega )| = 0 \end{aligned}$$
(7.1)

for all sufficiently large \(m\).

Proof

For \(m = 1,2,\ldots \), let \(N^{(m)} = m\sum _{n = 1}^{n_{max}}n\alpha _{n}^{(m)}\) denote the total number of individuals in the population among which \(E^{(m)}\) is spreading. Let \((\varOmega ,\fancyscript{F},\mathrm {P})\) be a probability space on which are defined the following independent sets of random quantities: (i) a realisation of the branching process \(E^{(\infty )}\); (ii) \(\chi _k^{(m)}~(m = 1,2,\ldots ;~k = 1,2,\ldots )\), where for each \(m, \chi _1^{(m)},\chi _2^{(m)},\ldots \) are independent and uniformly distributed on \(\{1,2,\ldots ,N^{(m)}\}\).

For \(m = 1,2,\ldots \), a realisation of the early stages of the epidemic \(E^{(m)}\) can be defined on \((\varOmega ,\fancyscript{F},\mathrm {P})\) as follows. Label the individuals in the \(m\)th population \(1,2, \ldots ,N^{(m)}\). The initial infective in \(E^{(m)}\) has a label given by \(\chi _1^{(m)}\) and corresponds to the ancestor in the branching process \(E^{(\infty )}\). Births of individuals in \(E^{(\infty )}\) correspond to global contacts being made in \(E^{(m)}\). For \(k = 1,2,\ldots \), the individual contacted in \(E^{(m)}\) corresponding to the \(k\)th birth in \(E^{(\infty )}\) has a label given by \(\chi _{k+1}^{(m)}\). If the household in which \(\chi _{k+1}^{(m)}\) resides has not been infected previously, then \(\chi _{k+1}^{(m)}\) becomes infected in \(E^{(m)}\) and initiates a new single-household epidemic in \(E^{(m)}\) whose course and subsequent global contacts are given by the life-history of the \((k+1)\)th individual in \(E^{(\infty )}\). If the household in which \(\chi _{k+1}^{(m)}\) resides has been infected previously, then the construction of \(E^{(m)}\) needs modifying, but such detail is not required for the present proof.

For \(m = 1,2,\ldots \), let \(M^{(m)}\) be the smallest \(k \ge 2\) such that \(\chi _{k}^{(m)}\) belongs to the same household as \(\chi _{l}^{(m)}\) for some \(l = 1,2,\ldots ,k-1\), and let \(\hat{M}^{(m)}\) be a random variable, taking values in \(2,3,\ldots \), having survivor function

$$\begin{aligned} \mathbb {P}(\hat{M}^{(m)} > k) = \prod _{i=1}^{k-1}\left( 1 -\frac{in_{max}}{N^{(m)}}\right) \quad (k = 2,3,\ldots ). \end{aligned}$$

Note that \({M}^{(m)}\) is stochastically greater than \(\hat{M}^{(m)}\), since the maximum household size is \(n_{max}\), and (cf. Aldous (1985), p. 96) \(m^{-1/2}\hat{M}^{(m)} \xrightarrow {\text {D}} \hat{M} ~\text {as } m \rightarrow \infty \), where \(\xrightarrow {\text {D}}\) denotes convergence in distribution and \(\hat{M}\) has density \(f(x) = n_{max}x\mu _H^{-1}\exp {(-n_{max}\mu _H^{-1}x^2/2)}~(x > 0)\), with \(\mu _H = \sum _{n=1}^{n_{max}}n\alpha _n\) being the mean household size (note that \(m^{-1}N^{(m)} \rightarrow \mu _H\) as \(m \rightarrow \infty \)).
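The form of the limiting density can be motivated by the following heuristic calculation, offered as a sketch rather than a proof. For \(k\) small relative to \(N^{(m)}\), the approximation \(\log (1-z) \approx -z\) gives

$$\begin{aligned} \log \mathbb {P}(\hat{M}^{(m)} > k) = \sum _{i=1}^{k-1}\log \left( 1 -\frac{in_{max}}{N^{(m)}}\right) \approx -\frac{n_{max}k(k-1)}{2N^{(m)}}. \end{aligned}$$

Setting \(k = xm^{1/2}\) and using \(m^{-1}N^{(m)} \rightarrow \mu _H\) suggests that \(\mathbb {P}(m^{-1/2}\hat{M}^{(m)} > x) \rightarrow \exp (-n_{max}\mu _H^{-1}x^2/2)\) for \(x > 0\), and differentiating \(1 - \exp (-n_{max}\mu _H^{-1}x^2/2)\) with respect to \(x\) recovers the density \(f(x)\) stated above.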

By the Skorokhod representation theorem, the random variables \(\hat{M}, M^{(m)}\) and \(\hat{M}^{(m)} (m = 1,2,\ldots )\) may be defined on a common probability space so that \(\mathbb {P}({M}^{(m)} \ge \hat{M}^{(m)},~m = 1,2,\ldots ) = 1\) and \(m^{-1/2}\hat{M}^{(m)} \xrightarrow {\text {a.s.}} \hat{M}\) as \(m \rightarrow \infty \). Further, that probability space may be augmented to carry random variables \(\chi _{k}^{(m)} (m = 1,2,\ldots ;~k = 1,2,\ldots )\) distributed as above and consistent with \(M^{(m)} (m = 1,2,\ldots )\). Thus we may assume that the random variables \(\hat{M}^{(m)} (m = 1,2,\ldots )\) and \(\hat{M}\) are also defined on \((\varOmega ,\fancyscript{F},\mathrm {P})\) and that there exists \(B \in \fancyscript{F}\), with \(\mathbb {P}(B) = 1\), such that, for all \(\omega \in B\),

$$\begin{aligned} M^{(m)}(\omega ) \ge \hat{M}^{(m)}(\omega ) \quad \text {and}\quad m^{-1/2}\hat{M}^{(m)}(\omega ) \rightarrow \hat{M}(\omega ) \quad \text {as } m \rightarrow \infty . \end{aligned}$$
(7.2)

For \(t \ge 0\), let \(T(t)\) be the number of births in \(E^{(\infty )}\) during \([0,t]\), including the ancestor. Then \(T(t) = \sum _{(n,x,y) \in \fancyscript{T}} Y_{n,x,y}(t)\) and it follows from (4.4) that \(e^{-rt}T(t) \xrightarrow {\text {a.s.}} r^{-1}W\) as \(t \rightarrow \infty \). Recall that \(W = 0\) if and only if the branching process goes extinct. Thus there exists \(C \in \fancyscript{F}\), with \(C \subseteq A^c\) and \(\mathbb {P}(C) = \mathbb {P}(A^c)\), such that for all \(\omega \in C\),

$$\begin{aligned} e^{-rt}T(t,\omega ) \rightarrow r^{-1}W(\omega ) \quad \text {as } t \rightarrow \infty . \end{aligned}$$
(7.3)

Let \(\omega \in B \cap C\) and \(c \in (0,\frac{1}{2}r^{-1})\). Then it follows from (7.3) that \(T(c \log m, \omega ) < 2m^{rc}r^{-1}W(\omega )\) for all sufficiently large \(m\). Also, (7.2) implies that \(M^{(m)}(\omega ) > \frac{1}{2}m^{1/2}\hat{M}(\omega )\) for all sufficiently large \(m\). Hence, since \(rc < 1/2\), for all sufficiently large \(m\), every birth in \(E^{(\infty )}(\omega )\) during \((0, c \log m]\) corresponds to a global contact with an uninfected household in \(E^{(m)}(\omega )\) and (7.1) follows since \(\mathbb {P}(B \cap C) = \mathbb {P}(A^c)\).

\(\square \)

We turn now to estimation of \(\lambda _L\). Suppose that the epidemic \(E^{(m)}\) is observed at time \(t^{(m)}\), where the sequence \((t^{(m)})\) satisfies (i) \(t^{(m)} \rightarrow \infty \text { as } m \rightarrow \infty \), (ii) \(t^{(m)} \le c \log m\) for all sufficiently large \(m\), for some \(c < (2r)^{-1}\). Suppose also that an estimator \(\hat{r}^{(m)}\) of the growth rate \(r\) is available such that \(\hat{r}^{(m)} \xrightarrow [A^c]{\text {a.s.}} r\) as \(m \rightarrow \infty \), where \(\xrightarrow [A^c]{\text {a.s.}}\) means convergence for \(\mathrm {P}\)-almost all \(\omega \in A^c\). It is easily verified that one such estimator is \(\hat{r}^{(m)} = \log [T^{(m)}(t^{(m)})/T^{(m)}(t^{(m)}/2)]/(t^{(m)}/2)\), where \(T^{(m)}(t)\) is the total number of households that have been infected in \(E^{(m)}\) by time \(t\). Let \(\hat{\lambda }_{L,full}^{(m)}\) denote the estimator obtained by maximising the function \(L_{full}(\lambda _L|\varvec{a},\hat{r}^{(m)})\) defined at (4.5). For ease of exposition, we assume that all infected households are observed, so, in our present notation, \(a_{x,y}^{(m)} = Y_{n,x,y}^{(m)}(t^{(m)})\) for \((n,x,y) \in \fancyscript{T}\). The following theorems are easily extended to the situation when only some infected households are observed; of course, the number of observed households must tend to infinity as \(m \rightarrow \infty \) and the sampling mechanism must be independent of disease progression within households. In these theorems, it is convenient to denote the true value of \(\lambda _L\) by \(\bar{\lambda }_L\).
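The growth-rate estimator \(\log [T^{(m)}(t^{(m)})/T^{(m)}(t^{(m)}/2)]/(t^{(m)}/2)\) quoted above can be illustrated numerically. The following Python sketch is only an illustration: the function name `estimate_growth_rate`, the input format (a list of household infection times) and the synthetic data are assumptions made here, not part of the paper.

```python
import math
from bisect import bisect_right


def estimate_growth_rate(infection_times, t):
    """Return log[T(t) / T(t/2)] / (t/2), where T(s) counts households infected by time s.

    infection_times: times at which households were first infected (hypothetical input).
    """
    times = sorted(infection_times)
    count_t = bisect_right(times, t)         # T(t)
    count_half = bisect_right(times, t / 2)  # T(t/2)
    if t <= 0 or count_half == 0:
        raise ValueError("need t > 0 and at least one household infected by time t/2")
    return math.log(count_t / count_half) / (t / 2)


# Illustrative data only: the k-th household is infected at time log(k) / r_true,
# so that T(s) = floor(exp(r_true * s)) grows exponentially at rate r_true.
r_true = 0.5
times = [math.log(k) / r_true for k in range(1, 20001)]
r_hat = estimate_growth_rate(times, t=18.0)
```

With these synthetic times the estimate lies very close to `r_true`; with real data the choice of observation time matters, since the estimator is only justified while growth is still exponential.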

Theorem 7.2

Under the conditions of Theorem 7.1,

$$\begin{aligned} \hat{\lambda }_{L,full}^{(m)} \xrightarrow [A^c]{\text {a.s.}} \bar{\lambda }_L \quad \text {as } m \rightarrow \infty . \end{aligned}$$

Proof

First note that from (4.5),

$$\begin{aligned} \hat{\lambda }_{L,full}^{(m)} = \hbox {argmax} \tilde{l}_{full}^{(m)}\left( \lambda _L|\varvec{Y}^{(m)},\hat{r}^{(m)}\right) , \end{aligned}$$
(7.4)

where

$$\begin{aligned} \tilde{l}_{full}^{(m)}\left( \lambda _L|\varvec{Y}^{(m)},\hat{r}^{(m)}\right) = W^{-1}e^{-rt^{(m)}}\sum _{n=2}^{n_{max}}\sum _{(x,y) \in \fancyscript{T}^{(n)}}Y_{n,x,y}^{(m)}\left( t^{(m)}\right) \log \tilde{p}_{x,y}^{(n)}\left( \hat{r}^{(m)}|\lambda _L\right) . \end{aligned}$$

Observe that, under the conditions satisfied by \((t^{(m)})\), Theorem 7.1 and (4.4) imply that, for all \((n,x,y) \in \fancyscript{T}\),

$$\begin{aligned} W^{-1}e^{-rt^{(m)}}Y_{n,x,y}^{(m)}\left( t^{(m)}\right) \xrightarrow [A^c]{\text {a.s.}} \tilde{\alpha }_n\tilde{p}_{x,y}^{(n)}\left( r|\bar{\lambda }_L\right) \quad \text {as } m \rightarrow \infty . \end{aligned}$$
(7.5)

Hence, since \(\hat{r}^{(m)} \xrightarrow [A^c]{\text {a.s.}} r\) as \(m \rightarrow \infty \), we have that for any \(\lambda _L \in (0,\infty )\),

$$\begin{aligned} \tilde{l}_{full}^{(m)}\left( \lambda _L|\varvec{Y}^{(m)},\hat{r}^{(m)}\right) \xrightarrow [A^c]{\text {a.s.}} \tilde{l}_{full}^{(\infty )}(\lambda _L|r) \quad \text {as } m \rightarrow \infty , \end{aligned}$$

where

$$\begin{aligned} \tilde{l}_{full}^{(\infty )}(\lambda _L|r) = \sum _{n=2}^{n_{max}}\tilde{\alpha }_n\sum _{(x,y) \in \fancyscript{T}^{(n)}}\tilde{p}_{x,y}^{(n)}(r|\bar{\lambda }_L)\log \tilde{p}_{x,y}^{(n)}(r|\lambda _L). \end{aligned}$$
(7.6)

Standard arguments (e.g. Silvey (1975), p. 75) show that, for \(n = 2,3,\ldots ,n_{max}\), the function \(g_n(\lambda _L) = \sum _{(x,y) \in \fancyscript{T}^{(n)}}\tilde{p}_{x,y}^{(n)}(r|\bar{\lambda }_L)\log \tilde{p}_{x,y}^{(n)}(r|\lambda _L)\) has a unique global maximum at \(\bar{\lambda }_L\). Hence, as a function of \(\lambda _L \in (0,\infty )\), \(\tilde{l}_{full}^{(\infty )}(\lambda _L|r)\) has a unique global maximum at \(\bar{\lambda }_L\).
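The standard argument can be sketched as follows, under two assumptions that are not spelled out in this passage: that \(\sum _{(x,y) \in \fancyscript{T}^{(n)}}\tilde{p}_{x,y}^{(n)}(r|\lambda _L) = r^{-1}\) for every \(\lambda _L > 0\), and that the family is identifiable, i.e. \(\tilde{p}_{x,y}^{(n)}(r|\lambda _L) \ne \tilde{p}_{x,y}^{(n)}(r|\bar{\lambda }_L)\) for some \((x,y)\) whenever \(\lambda _L \ne \bar{\lambda }_L\). Since \(\log u \le u - 1\) for \(u > 0\), with equality only at \(u = 1\),

$$\begin{aligned} g_n(\lambda _L) - g_n(\bar{\lambda }_L)&= \sum _{(x,y) \in \fancyscript{T}^{(n)}}\tilde{p}_{x,y}^{(n)}(r|\bar{\lambda }_L)\log \frac{\tilde{p}_{x,y}^{(n)}(r|\lambda _L)}{\tilde{p}_{x,y}^{(n)}(r|\bar{\lambda }_L)} \\&\le \sum _{(x,y) \in \fancyscript{T}^{(n)}}\left\{ \tilde{p}_{x,y}^{(n)}(r|\lambda _L) - \tilde{p}_{x,y}^{(n)}(r|\bar{\lambda }_L)\right\} = r^{-1} - r^{-1} = 0, \end{aligned}$$

and, under the identifiability assumption, the inequality is strict unless \(\lambda _L = \bar{\lambda }_L\).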

Fix \(0 < a < \bar{\lambda }_L < b < \infty \). Then

$$\begin{aligned} \max _{a \le \lambda _L \le b}|\tilde{l}_{full}^{(m)}\left( \lambda _L|\varvec{Y}^{(m)},\hat{r}^{(m)}\right) -\tilde{l}_{full}^{(\infty )}(\lambda _L|r)| \le \sum _{n=2}^{n_{max}}\sum _{(x,y) \in \fancyscript{T}^{(n)}}\max _{a \le \lambda _L \le b}g_{n,x,y}^{(m)}(\lambda _L), \end{aligned}$$
(7.7)

where

$$\begin{aligned} g_{n,x,y}^{(m)}(\lambda _L)= & {} \left| W^{-1}e^{-rt^{(m)}}Y_{n,x,y}^{(m)}(t^{(m)})\log \tilde{p}_{x,y}^{(n)}(\hat{r}^{(m)}|\lambda _L)\right. \nonumber \\&\left. -\, \tilde{\alpha }_n\tilde{p}_{x,y}^{(n)}(r|\bar{\lambda }_L)\log \tilde{p}_{x,y}^{(n)}(r|\lambda _L)\right| . \end{aligned}$$

Now

$$\begin{aligned} g_{n,x,y}^{(m)}(\lambda _L) \le \hat{g}_{n,x,y}^{(m)}(\lambda _L) + \check{g}_{n,x,y}^{(m)}(\lambda _L), \end{aligned}$$
(7.8)

where

$$\begin{aligned} \hat{g}_{n,x,y}^{(m)}(\lambda _L) = W^{-1}e^{-rt^{(m)}}Y_{n,x,y}^{(m)}\left( t^{(m)}\right) \left| \log \tilde{p}_{x,y}^{(n)}\left( \hat{r}^{(m)}|\lambda _L\right) - \log \tilde{p}_{x,y}^{(n)}(r|\lambda _L)\right| \end{aligned}$$

and

$$\begin{aligned} \check{g}_{n,x,y}^{(m)}(\lambda _L) = \left| \{W^{-1}e^{-rt^{(m)}}Y_{n,x,y}^{(m)}(t^{(m)})-\tilde{\alpha }_n\tilde{p}_{x,y}^{(n)}(r|\bar{\lambda }_L)\}\log \tilde{p}_{x,y}^{(n)}(r|\lambda _L)\right| . \end{aligned}$$

Using (7.5), for all \((n,x,y) \in \fancyscript{T}\),

$$\begin{aligned} \max _{a \le \lambda _L \le b} \check{g}_{n,x,y}^{(m)}(\lambda _L) \xrightarrow [A^c]{\text {a.s.}} 0 \quad \text {as } m \rightarrow \infty . \end{aligned}$$
(7.9)

Further, for any \(\lambda _L > 0\) and \(r,r^{\prime } > 0\),

$$\begin{aligned} \left| \tilde{p}_{x,y}^{(n)}(r|\lambda _L)-\tilde{p}_{x,y}^{(n)}(r^{\prime }|\lambda _L)\right| \le \int _0^\infty |e^{-rt}- e^{-r^{\prime }t}| \mathop {}\!\mathrm {d}t = |r - r^{\prime }|/(rr^{\prime }), \end{aligned}$$
(7.10)

so, since \(\log x\) is uniformly continuous on any compact subinterval of \((0,\infty )\) and \(\hat{r}^{(m)}\xrightarrow [A^c]{\text {a.s.}} r\) as \(m \rightarrow \infty \), it follows using (7.5) that, for all \((n,x,y) \in \fancyscript{T}\),

$$\begin{aligned} \max _{a \le \lambda _L \le b} \hat{g}_{n,x,y}^{(m)}(\lambda _L) \xrightarrow [A^c]{\text {a.s.}} 0 \quad \text {as } m \rightarrow \infty . \end{aligned}$$
(7.11)

Combining (7.7)–(7.11) yields

$$\begin{aligned} \max _{a \le \lambda _L \le b}\left| \tilde{l}_{full}^{(m)}(\lambda _L|\varvec{Y}^{(m)},\hat{r}^{(m)})-\tilde{l}_{full}^{(\infty )}(\lambda _L|r)\right| \xrightarrow [A^c]{\text {a.s.}} 0 \text { as } m \rightarrow \infty , \end{aligned}$$
(7.12)

whence, since \(\tilde{l}_{full}^{(\infty )}(\lambda _L|r)\) has a unique global maximum at \(\bar{\lambda }_L\),

$$\begin{aligned} \hbox {argmax}_{a \le \lambda _L \le b} \tilde{l}_{full}^{(m)}\left( \lambda _L|\varvec{Y}^{(m)},\hat{r}^{(m)}\right) \xrightarrow [A^c]{\text {a.s.}} \bar{\lambda }_L \text { as } m \rightarrow \infty . \end{aligned}$$
(7.13)

To complete the proof we explore the behaviour of \(\tilde{l}_{full}^{(m)}(\lambda _L|\varvec{Y}^{(m)},\hat{r}^{(m)})\) as \(\lambda _L \downarrow 0\) and \( \lambda _L \uparrow \infty \). Let \(X\) denote the time of the first point in \((0,\infty )\) of a homogeneous Poisson process having rate \((n-1)\lambda _L\). Then \(p_{n-2,2}^{(n)}(t|\lambda _L) \le \mathbb {P}(X \le t) = 1 - e^{-(n-1)\lambda _Lt}\), so

$$\begin{aligned} \tilde{p}_{n-2,2}^{(n)}(r|\lambda _L) \le \int _{0}^{\infty }\left( 1 - e^{-(n-1)\lambda _Lt}\right) e^{-rt} \mathop {}\!\mathrm {d}t \le (n-1)\lambda _L/r^2. \end{aligned}$$
(7.14)
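For completeness, the final inequality in (7.14) can be checked by evaluating the integral explicitly; only \(r > 0\) and \(\lambda _L > 0\) are used:

$$\begin{aligned} \int _{0}^{\infty }\left( 1 - e^{-(n-1)\lambda _Lt}\right) e^{-rt} \mathop {}\!\mathrm {d}t = \frac{1}{r} - \frac{1}{(n-1)\lambda _L + r} = \frac{(n-1)\lambda _L}{r\{(n-1)\lambda _L + r\}} \le \frac{(n-1)\lambda _L}{r^2}. \end{aligned}$$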

For all \(n\), we have that \(\tilde{p}_{x,y}^{(n)}(\hat{r}^{(m)}|\lambda _L) \le 1/\hat{r}^{(m)}\) for all \((x,y) \in \fancyscript{T}^{(n)}\), so

$$\begin{aligned} \log \tilde{p}_{x,y}^{(n)}\left( \hat{r}^{(m)}|\lambda _L\right) + \log \hat{r}^{(m)} \le 0. \end{aligned}$$
(7.15)

Let

$$\begin{aligned}&l_{*}^{(m)}\left( \lambda _L|\varvec{Y}^{(m)},\hat{r}^{(m)}\right) \\&\quad = W^{-1}e^{-rt^{(m)}}\sum _{n=2}^{n_{max}}\sum _{(x,y) \in \fancyscript{T}^{(n)}}Y_{n,x,y}^{(m)}\left( t^{(m)}\right) \left( \log \tilde{p}_{x,y}^{(n)}(\hat{r}^{(m)}|\lambda _L) + \log \hat{r}^{(m)}\right) \\&\quad = \tilde{l}_{full}^{(m)}\left( \lambda _L|\varvec{Y}^{(m)},\hat{r}^{(m)}\right) + W^{-1}e^{-rt^{(m)}}\sum _{n=2}^{n_{max}}\sum _{(x,y) \in \fancyscript{T}^{(n)}}Y_{n,x,y}^{(m)}(t^{(m)}) \log \hat{r}^{(m)}, \end{aligned}$$

and, recalling (7.4), note that \(\hat{\lambda }_{L,full}^{(m)} = \hbox {argmax} l_{*}^{(m)}(\lambda _L|\varvec{Y}^{(m)},\hat{r}^{(m)})\).

Fix \(\lambda _0 > 0\). Then (7.14) and (7.15) imply that, for all \(\lambda _L \in (0,\lambda _0]\),

$$\begin{aligned} l_{*}^{(m)}\left( \lambda _L|\varvec{Y}^{(m)},\hat{r}^{(m)}\right)\le & {} W^{-1}e^{-rt^{(m)}}Y_{n,n-2,2}^{(m)}\left( t^{(m)}\right) \left( \log (n-1) \!+\! \log \lambda _0 - \log \hat{r}^{(m)}\right) \nonumber \\&\xrightarrow [A^c]{\text {a.s.}} \tilde{\alpha }_n\tilde{p}_{n-2,2}^{(n)}\left( r|\bar{\lambda }_L\right) \left[ \log (n-1) + \log \lambda _0 - \log r\right] \end{aligned}$$
(7.16)

as \(m \rightarrow \infty \). Also, using (7.5) and (7.12),

$$\begin{aligned} l_{*}^{(m)}\left( \bar{\lambda }_L|\varvec{Y}^{(m)},\hat{r}^{(m)}\right) \xrightarrow [A^c]{\text {a.s.}} \tilde{l}_{full}^{(\infty )}(\bar{\lambda }_L|r) + r^{-1} \log r \sum _{n=2}^{n_{max}} \tilde{\alpha }_n \quad \text {as } m \rightarrow \infty . \end{aligned}$$
(7.17)

Choose \(n\) and \(\lambda _0 > 0\) such that \(\alpha _n > 0\) and the right hand side of (7.16) is strictly less than the right hand side of (7.17). Then, since \(\hat{\lambda }_{L,full}^{(m)} = \hbox {argmax} l_{*}^{(m)}(\lambda _L|\varvec{Y}^{(m)},\hat{r}^{(m)})\), it follows that for \(\mathrm {P}\)-almost all \(\omega \in A^c\), there exists \(m_0(\omega )\) such that

$$\begin{aligned} \hat{\lambda }_{L,full}^{(m)}(\omega ) \not \in (0,\lambda _0) \quad \text {for all }\, m \ge m_0(\omega ). \end{aligned}$$
(7.18)

Let \(T_I\) denote the infectious period of the initial infective in a household of size \(n\). Then \(p_{n-1,1}^{(n)}(t|\lambda _L) = \mathbb {E}[e^{-(n-1)\lambda _Lt} \mathbb {1}_{\{T_I > t\}}] \le e^{-(n-1)\lambda _Lt}\), whence \(\tilde{p}_{n-1,1}^{(n)}(r|\lambda _L) \le 1/((n-1)\lambda _L + r)\). Arguing as before shows that there exists \(\lambda _1 > 0\) such that, for \(\mathrm {P}\)-almost all \(\omega \in A^c\), there exists \(m_1(\omega )\) such that

$$\begin{aligned} \hat{\lambda }_{L,full}^{(m)}(\omega ) \not \in (\lambda _1,\infty ) \quad \text {for all }\, m \ge m_1(\omega ). \end{aligned}$$
(7.19)

The theorem then follows from (7.13), (7.18) and (7.19).\(\square \)

We now consider estimation of \(\lambda _L\) based only on recoveries. For \(m = 1,2,\ldots , n = 1,2,\ldots ,n_{max}\) and \(t \ge 0\), let

$$\begin{aligned} Z_{n,j}^{(m)}(t) = \sum _{(x,y) \in \fancyscript{A}_j^{(n)}} Y_{n,x,y}^{(m)}(t)\quad (j = 1,2,\ldots ,n) \end{aligned}$$

be the total number of size-\(n\) households in which \(j\) recoveries have been observed by time \(t\) in the epidemic \(E^{(m)}\). Let \(\hat{\lambda }_{L,rec}^{(m)}\) denote the estimator of \(\lambda _L\) obtained by maximising the function \(L_{rec}(\lambda _L|\varvec{c},\hat{r}^{(m)})\) described at (4.6); in our present notation, \(c_j^{(n)} = Z_{n,j}^{(m)}(t^{(m)})\).

Theorem 7.3

Under the conditions of Theorem 7.1,

$$\begin{aligned} \hat{\lambda }_{L,rec}^{(m)} \xrightarrow [A^c]{\text {a.s.}} \bar{\lambda }_L \quad \text {as } m \rightarrow \infty . \end{aligned}$$

Proof

First note from (4.6) that \(\hat{\lambda }_{L,rec}^{(m)} = \hbox {argmax} \tilde{l}_{rec}^{(m)}(\lambda _L|\varvec{Z}^{(m)},\hat{r}^{(m)})\), where

$$\begin{aligned} \tilde{l}_{rec}^{(m)}(\lambda _L|\varvec{Z}^{(m)},\hat{r}^{(m)}) = W^{-1}e^{-rt^{(m)}}\sum _{n=2}^{n_{max}}\sum _{j=1}^{n} Z_{n,j}^{(m)}(t^{(m)}) \log \tilde{q}_j^{(n)}(\hat{r}^{(m)}|\lambda _L). \end{aligned}$$

Using (7.5), for \(n = 2,3,\ldots ,n_{max}\) and \(j = 1,2,\ldots ,n\),

$$\begin{aligned} W^{-1}e^{-rt^{(m)}}Z_{n,j}^{(m)}(t^{(m)}) \xrightarrow [A^c]{\text {a.s.}} \tilde{\alpha }_n\left( r^{-1} - \tilde{q}_0^{(n)}(r|\bar{\lambda }_L)\right) \tilde{q}_j^{(n)}(r|\bar{\lambda }_L) \quad \text {as } m \rightarrow \infty , \end{aligned}$$
(7.20)

so, for any \(\lambda _L \in (0,\infty ), \tilde{l}_{rec}^{(m)}(\lambda _L|\varvec{Z}^{(m)},\hat{r}^{(m)}) \xrightarrow [A^c]{\text {a.s.}} \tilde{l}_{rec}^{(\infty )}(\lambda _L|r)~\text {as } m \rightarrow \infty \), where

$$\begin{aligned} \tilde{l}_{rec}^{(\infty )}(\lambda _L|r) = \sum _{n=2}^{n_{max}}\tilde{\alpha }_n\left( r^{-1} - \tilde{q}_0^{(n)}(r|\bar{\lambda }_L)\right) \sum _{j=1}^{n} \tilde{q}_j^{(n)}(r|\bar{\lambda }_L) \log \tilde{q}_j^{(n)}(r|\lambda _L). \end{aligned}$$
(7.21)

Now

$$\begin{aligned} \left| \tilde{l}_{rec}^{(m)}\left( \lambda _L|\varvec{Z}^{(m)},\hat{r}^{(m)}\right) - \tilde{l}_{rec}^{(\infty )}(\lambda _L|r)\right| \le \sum _{n=2}^{n_{max}} \sum _{j=1}^{n} \left( \hat{h}_{n,j}^{(m)}(\lambda _L) + \check{h}_{n,j}^{(m)}(\lambda _L)\right) , \end{aligned}$$
(7.22)

where \(\hat{h}_{n,j}^{(m)}(\lambda _L) = W^{-1}e^{-rt^{(m)}}Z_{n,j}^{(m)}(t^{(m)})| \log \tilde{q}_j^{(n)}(\hat{r}^{(m)}|\lambda _L) - \log \tilde{q}_j^{(n)}(r|\lambda _L)|\) and \(\check{h}_{n,j}^{(m)}(\lambda _L) \!=\! |\{W^{-1}e^{-rt^{(m)}}Z_{n,j}^{(m)}(t^{(m)}) - \tilde{\alpha }_n(r^{-1} - \tilde{q}_0^{(n)}(r|\bar{\lambda }_L)) \tilde{q}_j^{(n)}(r|\bar{\lambda }_L)\}\log \tilde{q}_j^{(n)} (r|\lambda _L)|\). For \(n = 2,3,\ldots ,n_{max}\) and \(j = 1,2,\ldots ,n,~\tilde{q}_j^{(n)}(r|\lambda _L) = \tilde{a}_j^{(n)}(r|\lambda _L)/\tilde{a}_0^{(n)}(r|\lambda _L)\), where \(\tilde{a}_j^{(n)}(r|\lambda _L) = \sum _{(x,y) \in \fancyscript{A}_j^{(n)}} \tilde{p}_{x,y}^{(n)}(r|\lambda _L)\, (j \!=\! 1,2,\ldots ,n)\) and \(\tilde{a}_0^{(n)}(r|\lambda _L) \!=\! r^{-1} - \sum _{y=1}^n \tilde{p}_{n-y,y}^{(n)}(r|\lambda _L)\). Note that \(|\fancyscript{A}_j^{(n)}| = n+1-j \quad (j = 1,2,\ldots ,n)\). It follows from (7.10) that, for \(n = 2,3,\ldots ,n_{max}\) and \(j = 1,\ldots ,n\),

$$\begin{aligned} |\tilde{a}_j^{(n)}(r|\lambda _L) - \tilde{a}_j^{(n)}(r^{\prime }|\lambda _L)| \le (n+1-j)|r - r^{\prime }|/(rr^{\prime }), \end{aligned}$$
(7.23)

for all \(\lambda _L > 0\).

Consider a household of size \(n\). In the limit as \(\lambda _L \rightarrow \infty \), as soon as one individual in the household is infected, the whole household becomes infected, so the number of removals in that household \(t\) time units after it was infected follows a binomial distribution with success probability \({\mathbb {P}}(T_I \le t)\). It follows that, for \(j = 0,1,\ldots ,n\) and \(r > 0, \lim _{\lambda _L \rightarrow \infty } \tilde{a}_j^{(n)}(r|\lambda _L) \in (0,r^{-1}]\). Fix \(a \in (0,\bar{\lambda }_L)\). It then follows from (7.20) and the continuity of \(\tilde{a}_j^{(n)}(r|\lambda _L)\) that for \(n = 2,3,\ldots ,n_{max}\) and \(j = 1,2,\ldots ,n\),

$$\begin{aligned} \max _{a \le \lambda _L < \infty } \check{h}_{n,j}^{(m)}(\lambda _L) \xrightarrow [A^c]{\text {a.s.}} 0 \quad \text {as } m \rightarrow \infty . \end{aligned}$$
(7.24)

Further, (7.23) and the uniform continuity of \(\log x\) imply that, for \(n = 2,3,\ldots ,n_{max}\) and \(j = 1,2,\ldots ,n\),

$$\begin{aligned} \max _{a \le \lambda _L < \infty } \hat{h}_{n,j}^{(m)}(\lambda _L) \xrightarrow [A^c]{\text {a.s.}} 0 \quad \text {as } m \rightarrow \infty , \end{aligned}$$
(7.25)

since \(\hat{r}^{(m)} \xrightarrow [A^c]{\text {a.s.}} r\) as \(m \rightarrow \infty \). Similar to before, (7.21) implies that \(\tilde{l}_{rec}^{(\infty )}(\lambda _L|r)\) has a unique global maximum at \(\lambda _L = \bar{\lambda }_L\). It follows using (7.22), (7.24) and (7.25), that, for any \(a \in (0,\bar{\lambda }_L)\),

$$\begin{aligned} \hbox {argmax}_{a \le \lambda _L < \infty } \tilde{l}_{rec}^{(m)}\left( \lambda _L|\varvec{Z}^{(m)},\hat{r}^{(m)}\right) \xrightarrow [A^c]{\text {a.s.}} \bar{\lambda }_L \quad \text {as } m \rightarrow \infty . \end{aligned}$$
(7.26)

To complete the proof of the theorem, we obtain a uniform upper bound for \(\tilde{l}_{rec}^{(m)}(\lambda _L|\varvec{Z}^{(m)},\hat{r}^{(m)})\) for small \(\lambda _L\). Two recoveries can occur in a household only if the initial infective has made at least one local infection, so, as at (7.14),

$$\begin{aligned} \tilde{a}_2^{(n)}(r|\lambda _L) \le \lambda _L(n-1)/r^2. \end{aligned}$$

Also, there is at least one recovery in a household if the initial infective has recovered, so

$$\begin{aligned} \tilde{a}_0^{(n)}(r|\lambda _L) \ge \int _0^\infty \mathbb {P}(T_I \le t)e^{-rt} \mathop {}\!\mathrm {d}t = \phi (r)/r. \end{aligned}$$
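The equality in the last display can be verified as follows, assuming (as the identity itself requires) that \(\phi (r) = \mathbb {E}[e^{-rT_I}]\) is the Laplace transform of the infectious period. Interchanging expectation and integration, which is permitted because the integrand is non-negative,

$$\begin{aligned} \int _0^\infty \mathbb {P}(T_I \le t)e^{-rt} \mathop {}\!\mathrm {d}t = \mathbb {E}\left[ \int _{T_I}^\infty e^{-rt} \mathop {}\!\mathrm {d}t\right] = \mathbb {E}\left[ e^{-rT_I}\right] /r = \phi (r)/r. \end{aligned}$$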

Hence, for \(n = 2,3,\ldots ,n_{max}\) and \(\lambda _0 > 0\),

$$\begin{aligned} \tilde{q}_2^{(n)}(r|\lambda _L) \le \lambda _0(n-1)/(r\phi (r)) \quad \text {for all }\, \lambda _L \in (0,\lambda _0]. \end{aligned}$$

Note that \(\log \tilde{q}_j^{(n)}(r|\lambda _L) < 0\) for all \(n\) and \(j\). We can now argue as in the derivation of (7.18) to show that \(\lambda _0\) can be chosen so that, for \(\mathrm {P}\)-almost all \(\omega \in A^c\), there exists \(m_2(\omega )\) such that

$$\begin{aligned} \hat{\lambda }_{L,rec}^{(m)}(\omega ) \not \in (0,\lambda _0) \quad \text {for all }\, m \ge m_2(\omega ), \end{aligned}$$

which, together with (7.26), completes the proof.\(\square \)

We omit the proofs but similar results to Theorems 7.1–7.3 hold for SEIR and Reed-Frost based models. Theorems 7.2 and 7.3 may also be extended to the case when the infectious period distribution has a parametric form with unknown parameters that need to be estimated. For example, if the infectious period follows an exponential distribution with unknown rate \(\gamma \), it is straightforward to show that, for any compact subset \(K\) of \((0,\infty )^2\), if \((\lambda _L,\gamma )\) is estimated by maximising the relevant pseudolikelihood over \(K\) then the resulting estimator is strongly consistent. Extending this to \(K=(0,\infty )^2\) is more complicated than in the one-dimensional setting of Theorems 7.2 and 7.3 and is not considered here.

8 Concluding comments

In this paper we demonstrate that for an emerging SIR epidemic among a population partitioned into households, basing inference on the usual single-household final size distribution normally leads to a biased estimate of the within-household infection rate \(\lambda _L\) and use branching process theory to develop a new estimator which accounts correctly for the emerging nature of an epidemic. Although the model used is undoubtedly simpler than a real-life epidemic, the presence of households is a key departure from homogeneous mixing for human epidemics, and it seems likely that similar issues will arise in more complex settings when using data collected at a household level for inference during the exponentially growing phase of an outbreak. In particular, such data need to be modelled very carefully to ensure that the effects of a growing epidemic are incorporated correctly.

The new method is predicated upon the availability of an estimate of the exponential growth rate \(r\). How best to estimate \(r\) for an emerging epidemic is an open challenge (Ball et al. 2014) since, as illustrated by Fig. 3, the exponentially growing phase occupies only a narrow time window and consequently care is required in choosing start and end time points for fitting it. Of course, the method assumes also that, at the time when estimation is performed, the epidemic is still in its exponentially growing phase and it should be checked that this is a reasonable assumption.

The new method has been shown to be computationally feasible under the assumption of no latent period and exponentially distributed infectious period. Extending its implementation to models with more realistic disease dynamics is an important area for research. One approach is via the phase method, see Sect. 4.2, though the matrices involved soon become large. Thus it would be worthwhile developing numerically amenable approximations to the key Laplace transforms \(\tilde{p}_{x,y}^{(n)}(r|\lambda _L)\) (\((n,x,y) \in \fancyscript{T}\)). Fraser (2007) has developed a closed-form approximate method for calculating the growth rate \(r\) for quite general households models, which works well if both the maximum household size and the variance of the generation interval of the disease are not too large; it may be possible to apply related methods to approximate the aforementioned Laplace transforms.

It would be useful to attach standard errors to estimates obtained using the new method. One way of doing this is using a parametric bootstrap, along similar lines to Fig. 4. Another approach is to determine the asymptotic distributions of the estimators, which would require central limit (or related) analogues of the almost sure results in Nerman (1981).
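The parametric bootstrap idea can be outlined generically. The Python sketch below is an assumption-laden illustration rather than the procedure behind Fig. 4: the function `parametric_bootstrap_se` and its `simulate` and `fit` arguments are hypothetical, and the toy model (exponential infectious periods fitted by maximum likelihood) merely stands in for simulating an emerging households epidemic and re-estimating \(r\) and \(\lambda _L\) from each replicate.

```python
import random
import statistics


def parametric_bootstrap_se(estimate, simulate, fit, n_boot=500, seed=0):
    """Standard error of `fit` via parametric bootstrap.

    estimate: parameter value fitted to the observed data.
    simulate: function (parameter, rng) -> synthetic data set (hypothetical).
    fit:      function data -> parameter estimate (hypothetical).
    """
    rng = random.Random(seed)
    # Simulate from the fitted model, re-estimate, and measure the spread.
    replicates = [fit(simulate(estimate, rng)) for _ in range(n_boot)]
    return statistics.stdev(replicates)


# Toy stand-in model (NOT the households epidemic): 200 exponential
# infectious periods with unknown rate, fitted by maximum likelihood.
def simulate_periods(rate, rng, n=200):
    return [rng.expovariate(rate) for _ in range(n)]


def fit_rate(periods):
    return len(periods) / sum(periods)


se = parametric_bootstrap_se(1.0, simulate_periods, fit_rate)
```

In the toy model the bootstrap standard error should be close to the large-sample value \(\gamma /\sqrt{n}\); for the epidemic setting, each replicate would additionally have to reproduce the observation scheme used for the real data.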

The method can be extended to multitype SIR epidemics among a community of households, using the model of Ball and Lyne (2001) together with multitype generalisations of Nerman (1981). This would accommodate age-stratified populations (e.g. children and adults), with age-specific susceptibilities, and also asymptomatic infections with different transmission parameters for symptomatic and asymptomatic cases. Note that the setting where all infectious episodes are governed by the same transmission parameters but infections are unobserved independently with a common (unknown) probability may be handled within the single-type framework, since the distribution of the number of observed cases in a household is obtained easily by conditioning on the total number of cases in that household and using binomial sampling.
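The binomial-sampling step described above can be sketched in a few lines of Python. The function name `observed_case_pmf`, the parameter `p_obs` (the probability that a case is observed, i.e. one minus the probability that it is unobserved) and the numbers in the example are assumptions made for illustration only.

```python
from math import comb


def observed_case_pmf(total_pmf, p_obs):
    """Distribution of observed cases when each case is seen independently w.p. p_obs.

    total_pmf[k] is the probability of k cases in total in the household.
    """
    n = len(total_pmf) - 1
    observed = [0.0] * (n + 1)
    for k, prob_k in enumerate(total_pmf):
        for j in range(k + 1):
            # Binomial(k, p_obs) probability of observing j of the k cases.
            observed[j] += prob_k * comb(k, j) * p_obs**j * (1 - p_obs) ** (k - j)
    return observed


# Illustrative numbers only: a size-3 household with 1, 2 or 3 cases in total.
total_pmf = [0.0, 0.5, 0.3, 0.2]
pmf = observed_case_pmf(total_pmf, p_obs=0.8)
```

Because `p_obs` is unknown in practice, it would have to be estimated jointly with \(\lambda _L\), with the resulting distribution replacing the fully observed one in the pseudolikelihood.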

The method can in principle also be extended to situations where information on the temporal progression of disease within households is available. In the Reed-Frost setting of Sect. 5, estimation can be generalised to the case when chains of infection within households are observed (rather than total number of cases) by extending the type space of the approximating discrete-time multitype branching process to include such information. In the continuous-time setting of Sect. 4, suppose that inter-recovery times are observed. Consider the single-household epidemic \(E_H^{(n)}\) described in Sect. 4.1 and suppose that \(k\) recoveries occur in \((0,t]\), where \(k=1,2,\dots ,n\). Let \(t_1\) denote the time of the first recovery and let \(s_1,s_2,\dots ,s_k\) denote the \(k\) successive inter-recovery times, where \(s_k\) is the time elapsing between the \(k\)th recovery and \(t\). Let \(f_k^{(n)}(t_1,s_1, s_2, \dots , s_{k-1}|\lambda _L)\) denote the joint density of \(t_1,s_1,s_2,\dots ,s_{k-1}\), including the information that no recovery occurs between the \(k\)th recovery and time \(t\). Then Theorem 5.4 of Nerman (1981) shows that the contribution of such a household epidemic to the pseudolikelihood for \(\lambda _L\) is \(\tilde{f}_k^{(n)}(\hat{r}|\lambda _L)=\int _{t_A}^\infty e^{-\hat{r}t} f_k^{(n)}(t-t_A,s_1, s_2, \dots , s_{k-1}|\lambda _L)~\mathop {}\!\mathrm {d}t\), where \(t_A=s_1+s_2+\dots +s_k\), thus providing, at least in principle, a way of estimating \(\lambda _L\).