1 Introduction

A key aim of parametric Bayesian statistics is, given a model \(\mathcal{M}\) with parameters θ=(θ 1,θ 2,…,θ d ) and observed data x, to derive the posterior distribution of θ, π(θ|x). From π(θ|x), it is then possible to obtain key summary statistics, such as the marginal posterior mean, \(\mathbb{E}[ \theta_{1} | \mathbf{x}]\), and variance, var(θ 1|x), of the parameter θ 1, or problem-specific quantities, such as \(\mathbb{E}[ 1_{\{\boldsymbol{\theta}\in A \}} | \mathbf{x}]\), for some \(A \subset\mathbb{R}^{d}\). By Bayes’ Theorem,

$$\begin{aligned} \pi(\boldsymbol{\theta}| \mathbf{x}) =& \frac{\pi(\mathbf{x} | \boldsymbol{\theta}) \pi (\boldsymbol{\theta})}{\pi(\mathbf{x})} \\ \propto& \pi(\mathbf{x} | \boldsymbol{\theta}) \pi(\boldsymbol {\theta}). \end{aligned}$$
(1.1)

There are two potentially problematic components in (1.1). Firstly, computation of π(x) is rarely possible. This is often circumvented by recourse to Markov chain Monte Carlo methods. Secondly, it is common that the likelihood function, L(θ|x)=π(x|θ), is not in a convenient form for statistical analysis. For the analysis of many statistical problems, both Bayesian and frequentist, data augmentation (Dempster et al. 1977; Gelfand and Smith 1990) has been used to assist in evaluating π(θ|x) or computing the maximum likelihood estimate of θ. That is, additional, unobserved data, y, is imputed so that π(x,y|θ)=L(θ|x,y) is of a convenient form for analysis. In a Bayesian framework y is often chosen so that the (conditional) posterior distribution π(θ|x,y) is known.

There are a number of algorithms, such as the EM algorithm (Dempster et al. 1977) and the Metropolis-Hastings Markov chain Monte Carlo (MCMC) algorithm (Metropolis et al. 1953; Hastings 1970), which exploit π(θ|x,y) and iterate between the following two steps.

  1. Update θ given x and y, i.e. use π(θ|x,y).

  2. Update y given x and θ, i.e. use π(y|x,θ).

The above algorithmic structure often permits a Gibbs sampler algorithm (Geman and Geman 1984), an important special case of the Metropolis-Hastings algorithm. Examples of Gibbs sampler algorithms that use data augmentation include those for the genetic linkage data (Dempster et al. 1977; Tanner and Wong 1987), mixture distributions (Diebolt and Robert 1994; Frühwirth-Schnatter 2006), censored data (Smith and Roberts 1993) and Gaussian hierarchical models (Papaspiliopoulos et al. 2003), to name but a few. The Gibbs sampler exploits the conditional distributions of the components of θ=(θ 1,…,θ d ) and y=(y 1,…,y m ) by, at each iteration, successively drawing θ i from π(θ i |θ −i ,y,x) (i=1,…,d) and y j from π(y j |θ,y −j ,x) (j=1,…,m), where θ −i and y −j denote the vectors θ and y with the ith and jth components omitted, respectively. Typically the sampling distributions are well known probability distributions. The Gibbs sampler generates dependent samples from π(θ,y|x), where usually only the θ values are stored, with y being discarded as nuisance parameters. In many problems there is conditional independence between the elements of the parameter vector θ given y. That is, π(θ i |θ −i ,y,x)=π(θ i |y,x). This is seen, for example, in simple Normal mixture models with known variance (Diebolt and Robert 1994) or Poisson mixture models (Fearnhead 2005), and for the infection and recovery parameters of the general stochastic epidemic model with unknown infection times (y) and observed recovery times (x), see, for example, O’Neill and Roberts (1999), Neal and Roberts (2005) and Kypraios (2007). Thus, in this case,

$$\begin{aligned} \pi(\boldsymbol{\theta}, \mathbf{y} | \mathbf {x}) =& \pi( \mathbf{y} |\mathbf{x})\prod_{i=1}^d \pi( \theta_i |\mathbf{y}, \mathbf{x}). \end{aligned}$$
(1.2)
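
The two-step scheme above is easy to implement in practice. The following sketch (ours, not taken from the cited references) gives a data-augmentation Gibbs sampler for a two-component Poisson mixture, the model revisited in Sect. 2.2; the priors, variable names and simulated toy data are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(1)

def gibbs_poisson_mixture(x, iters=5000, C=(1.0, 1.0), D=(1.0, 1.0)):
    """Data-augmentation Gibbs sampler for a two-component Poisson mixture.

    Alternates between step 1 (theta | y, x) and step 2 (y | theta, x) above,
    assuming a beta(1,1) prior on p and gamma(C_k, D_k) priors on lambda_k.
    """
    x = np.asarray(x)
    n = len(x)
    y = rng.integers(1, 3, size=n)                     # initial allocations in {1, 2}
    samples = []
    for _ in range(iters):
        # Step 1: draw theta = (p, lambda_1, lambda_2) from its conditional posterior
        a = np.array([(y == 1).sum(), (y == 2).sum()])     # counts per component
        z = np.array([x[y == 1].sum(), x[y == 2].sum()])   # sums per component
        p = rng.beta(1 + a[0], 1 + a[1])
        lam = rng.gamma(np.array(C) + z, 1.0 / (np.array(D) + a))
        # Step 2: draw the allocations y independently given theta
        w1 = p * lam[0] ** x * np.exp(-lam[0])
        w2 = (1 - p) * lam[1] ** x * np.exp(-lam[1])
        y = np.where(rng.random(n) < w1 / (w1 + w2), 1, 2)
        samples.append((p, lam[0], lam[1]))
    return np.array(samples)

# Toy usage: posterior means of (p, lambda_1, lambda_2) after a short burn-in
x = np.concatenate([rng.poisson(1.0, 50), rng.poisson(5.0, 50)])
print(gibbs_poisson_mixture(x)[1000:].mean(axis=0))
```

Only the θ draws would typically be stored, with the allocations y treated as nuisance parameters.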

For joint densities which factorise in the form of (1.2), it is possible in certain circumstances to collapse the Gibbs sampler (Liu 1994) by integrating out θ. That is, compute π(y|x), up to a constant of proportionality, and construct a Metropolis-Hastings algorithm for sampling from π(y|x). Note that after collapsing it is rarely possible to construct a straightforward Gibbs sampler to sample from π(y|x). In Liu (1994), the benefits of collapsing the Gibbs sampler are highlighted. Collapsing can equally be applied to any Metropolis-Hastings algorithm, see, for example, Neal and Roberts (2005) and Kypraios (2007). In Fearnhead (2005) and Fearnhead (2006), the idea of collapsing is taken a stage further in the case where the number of possibilities for y is finite by computing π(y|x) exactly, and hence expressing the posterior distribution, π(θ|x), as a finite mixture distribution. Specifically, Fearnhead (2005) and Fearnhead (2006) consider mixture models and change-point models, respectively, with the main focus of both papers being perfect simulation from the posterior distribution as an alternative to MCMC.

In this paper we present a generic framework for obtaining the exact posterior distribution π(θ|x) using data augmentation. The generic framework covers mixture models (Fearnhead 2005) and change-point models (Fearnhead 2006) as important special cases, but is applicable to a wide range of problems including the two-level mixing Reed-Frost epidemic model (Sect. 3) and integer-valued autoregressive time series models (Sect. 4). The key observation, which is developed in Sect. 2, is that there are generic classes of models where augmented data, y, can be identified such that π(y|x) can be computed exactly and π(θ|y,x) is a well known probability density. In such circumstances we can express the exact posterior distribution, π(θ|x), as a finite mixture distribution satisfying

$$\begin{aligned} \pi(\boldsymbol{\theta}| \mathbf{x}) =& \sum _\mathbf{y} \pi(\boldsymbol{\theta} |\mathbf{y}, \mathbf{x}) \pi(\mathbf{y} |\mathbf{x}). \end{aligned}$$
(1.3)

Throughout this paper we use the term exact posterior (distribution) to refer to the use of (1.3). In Sect. 2, we show how in general y can be chosen and how sufficient statistics can be exploited to extend the usefulness of the methodology. Thus we extend the methodology introduced in Fearnhead (2005) beyond mixture models by identifying the key features which make (1.3) practical to use.

A natural alternative to computing the exact posterior distribution is to use MCMC, and in particular, the Gibbs sampler, to obtain a sample from the posterior distribution. There are a number of benefits from obtaining the exact posterior distribution. Firstly, any summary statistic \(\mathbb{E}[ h (\theta) | \mathbf{x}]\), for which \(\mathbb{E}[ h (\theta) | \mathbf{x}, \mathbf{y}]\) is known, can be computed without Monte Carlo error. Moreover, knowing the exact posterior distribution enables the use of importance sampling to efficiently estimate \(\mathbb{E}[ h (\theta) | \mathbf{x}]\), even when \(\mathbb {E}[ h (\theta) | \mathbf{x}, \mathbf{y}]\) is unknown. Secondly, there are none of the convergence issues of MCMC, such as determining the length of the burn-in period, and the total number of iterations required for a given effective sample size. Thirdly, the marginal likelihood (evidence) of the model, π(x)=∫π(x|θ)π(θ) dθ, can easily be computed. This can be used for model selection, see, for example, Fearnhead (2005), Sect. 3. Model selection using MCMC samples is non-trivial, see, for example, Han and Carlin (2001), and often requires the construction of a reversible jump MCMC algorithm (Green 1995), adding an extra level of complexity to the MCMC procedure. Fourthly, obtaining the exact posterior distribution enables straightforward perfect simulation to obtain independent and identically distributed samples from the posterior distribution (Fearnhead 2005). This provides an efficient alternative to ‘coupling from the past’ (Propp and Wilson 1996) perfect simulation MCMC algorithms, with the key computational cost, computing the exact posterior distribution, incurred only once (Fearnhead 2005). Finally, we can obtain the exact predictive distribution for forward prediction (see Liu 1994), although MCMC samples can easily be used to obtain good samples from the predictive distribution of future observations.

There are also important drawbacks of the proposed methodology compared with MCMC. Only a small number of relatively simple, albeit important, models can be analysed using (1.3), compared to the vast array of models that can increasingly be analysed using MCMC. MCMC algorithms are generally easier to program than algorithms for computing the exact posterior distribution. The computational cost of the exact posterior distribution grows exponentially in the total number of observations, whereas the computational cost of many MCMC algorithms will be linear or better in the total number of observations. However, for moderate sized data sets, computing the exact posterior distribution and the MCMC alternative of obtaining a sufficiently large MCMC sample often take comparable amounts of time. We discuss computational costs in more detail in Sects. 3.3 and 4 for the household Reed-Frost epidemic model and the integer autoregressive (INAR) model, respectively.

The remainder of the paper is structured as follows. A generic framework is presented in Sect. 2 for computing the exact posterior distribution for a wide class of models, including Poisson mixture models, models for multinomial-beta data (Sect. 3) and integer valued autoregressive (INAR) processes (Sect. 4). We outline how sufficient statistics can be exploited to extend the usefulness of the methodology from small to moderate sized data sets. This involves giving details of an efficient mechanism for identifying the different sufficient statistics, s, compatible with x and for computing π(s|x). In Sect. 3, we consider a general model for multinomial-beta data motivated by the genetic linkage data studied in Dempster et al. (1977) and Tanner and Wong (1987) and applicable to a two-level mixing Reed-Frost epidemic model (see Addy et al. 1991). The exact posterior distribution of the parameters is obtained with computation of key summary statistics of the posterior distribution, such as the posterior mean and standard deviation. In Sect. 4, we obtain the exact posterior distribution for the parameters of the integer valued autoregressive (INAR) processes (McKenzie 2003; Jung and Tremayne 2006; Neal and Subba Rao 2007). In addition to computing the key summary statistics of the posterior distribution, we use the marginal likelihood for selecting the most parsimonious model for the data and we derive the exact predictive distribution for forecasting purposes. All the models considered in Sects. 3 and 4 have previously only been analysed using MCMC and/or by numerical methods that are only applicable for very small data sets. Throughout we compare analysing the data sets using the exact posterior distribution with using MCMC. Finally, in Sect. 5 we conclude with a few final remarks upon other models that can be analysed using (1.3).

2 Generic setup

2.1 Introduction to the generic setup

In this section we consider the generic situation where π(θ|x) is intractable, but there exists augmented data, y, such that π(θ|y,x) belongs to a well known probability distribution. Specifically, we assume that y is such that the components of θ are conditionally independent given y. That is,

$$\begin{aligned} \pi(\boldsymbol{\theta}| \mathbf{y}, \mathbf {x}) = \prod _{i=1}^d \pi(\theta_i | \mathbf{y}, \mathbf{x}), \end{aligned}$$
(2.1)

with θ i |y,x belonging to a well known probability distribution. In particular, suppose that we have

$$\begin{aligned} \pi(\mathbf{x} | \boldsymbol{\theta}) = \sum _{j=1}^N \pi(\mathbf{x}, \mathbf{y}_j | \boldsymbol{\theta}), \end{aligned}$$
(2.2)

where for j=1,2,…,N, π(θ|y j ,x) satisfies (2.1). That is, there is a finite number N of possible augmented data values that are consistent with the data. In each case the conditional posterior distribution of the parameters given the augmented data is easily identifiable.

To exploit the data augmentation introduced in (2.2), we consider the joint posterior density π(θ,y j |x), j=1,2,…,N. The first step is to rewrite π(θ|x) as

$$\begin{aligned} \pi(\boldsymbol{\theta}| \mathbf{x}) =& \sum _{j=1}^N \pi(\boldsymbol{\theta}, \mathbf{y}_j | \mathbf{x}) \\ =& \sum_{j=1}^N \pi(\boldsymbol{ \theta}| \mathbf{y}_j, \mathbf{x}) \pi(\mathbf{y}_j | \mathbf{x}). \end{aligned}$$
(2.3)

Since y j is chosen such that π(θ|y j ,x) is known, we only need to compute {π(y j |x)} to be able to express π(θ|x) as a mixture density. The key step for obtaining π(y j |x) is to observe that

$$\begin{aligned} \pi(\boldsymbol{\theta}, \mathbf{y}_j | \mathbf{x}) =& \pi(\mathbf{y}_j |\mathbf{x}) \pi(\boldsymbol{\theta}| \mathbf{y}_j, \mathbf{x}) \\ \propto& K_{\mathbf{y}_j}\prod_{i=1}^d \pi(\theta_i | \mathbf{y}_j, \mathbf{x}), \end{aligned}$$

for some \(K_{\mathbf{y}_{j}}\) independent of θ. We can then integrate out θ, which, since ∫π(θ i |y,x) dθ i =1, leaves \(\pi(\mathbf{y}_{j}| \mathbf{x}) \propto K_{\mathbf{y}_{j}}\). Therefore, for j=1,2,…,N,

$$\begin{aligned} \pi( \mathbf{y}_j | \mathbf{x}) = \frac {K_{\mathbf{y}_j}}{\sum_{k=1}^N K_{\mathbf{y}_k}}. \end{aligned}$$
(2.4)

This is a ‘brute-force’ approach, enumerating all N possibilities and computing the normalising constant. The main complication is that computing the normalising constant in (2.4) is only practical if N is sufficiently small.

A partial solution to the above brute-force enumeration is to use sufficient statistics to significantly reduce the number of computations required. This is successfully employed in Fearnhead (2005) for Poisson mixture models (Sect. 2.2), where a simple, sequential algorithmic approach is given to compute sufficient statistics. We develop this approach detailing how the sequential computation of the sufficient statistics can be efficiently implemented in the generic setup.

In general, let S(y,x) denote the sufficient statistics of the model under consideration, such that π(θ|y,x)=π(θ|S(y,x)). Let \(\mathcal{S} = \{ S (\mathbf{y}_{1}, \mathbf{x}), S (\mathbf{y}_{2}, \mathbf{x}), \ldots, S (\mathbf{y}_{N}, \mathbf{x})\}\), the set of all possible sufficient statistics compatible with the data. Then we can write

$$\begin{aligned} \pi(\boldsymbol{\theta}| \mathbf{x}) =& \sum _{k=1}^S \pi(\boldsymbol{\theta}| \mathbf{s}_k) \biggl\{ \sum_{j: S (\mathbf{y}_j, \mathbf{x}) = \mathbf{s}_k} \pi( \mathbf{y}_j | \mathbf{x}) \biggr\} \\ =& \sum_{k=1}^S \pi(\boldsymbol{ \theta}| \mathbf{s}_k) \pi( \mathbf{s}_k | \mathbf{x}), \end{aligned}$$
(2.5)

where \(S = | \mathcal{S}|\) and \(\mathcal{S} = \{ \mathbf{s}_{1}, \mathbf{s}_{2}, \ldots, \mathbf{s}_{S} \}\). For (2.5) to be useful, a systematic approach to computing \(\mathcal{S}\) and \(\pi (\mathbf{s}_{k} | \mathbf{x}) = \sum_{j: S (\mathbf{y}_{j}, \mathbf{x}) = \mathbf{s}_{k}} \pi(\mathbf{y}_{j} | \mathbf{x})\) is needed that requires far fewer than N computations.

The remainder of this section is structured as follows. In Sect. 2.2, we introduce the Poisson mixture model (Fearnhead 2005) as an illustrative example of how the exact posterior distribution can be computed. Then in Sect. 2.3 we develop the general framework, referencing the Poisson mixture model as an example. This culminates in a mechanism for computing sufficient statistics and associated probability weights in order to fully exploit (2.5). Finally, in Sect. 2.4 we make some final remarks about the generic framework and its limitations.

2.2 Illustrative example

A prime example of a model where (2.2) can be exploited is the Poisson mixture model (Fearnhead 2005). For the Poisson mixture model the data x arises as independent and identically distributed observations from X, where, for \(x=0,1,\ldots\), \(\mathbb{P}(X =x) =\sum_{k=1}^{m} p_{k} \lambda_{k}^{x} \exp(-\lambda_{k})/x!\). That is, for k=1,2,…,m, the probability that X is drawn from \({\rm Po} (\lambda_{k})\) is p k . In this case, y i =1,2,…,m with y i =k denoting that x i is drawn from \(\operatorname{Po} (\lambda_{k})\), see, for example, Fearnhead (2005) for details. Thus \(N=m^{n}\). Although the exact posterior distribution for this model has been derived in Fearnhead (2005), it is a good illustrative example for the generic setup presented in this paper and it is a useful stepping stone before moving on to the more complex Reed-Frost epidemic model and integer valued autoregressive (INAR) model in Sects. 3 and 4, respectively.

As noted in Fearnhead (2005), π(θ|y,x) depends upon y and x through the 2m-dimensional sufficient statistic S(y,x)=(a,z) (=s), where \(a_{k} = \sum_{i=1}^{n} 1_{\{y_{i} = k \}}\) and \(z_{k}= \sum_{i=1}^{n} 1_{\{y_{i} = k \}} x_{i}\) denote the total number of observations and the sum of the observations from the kth Poisson component, respectively. Note that 2(m−1) sufficient statistics suffice as a m and z m can be recovered from x and (a 1,z 1,…,z m−1), see Fearnhead (2005). The sequential method for constructing \(\mathcal{S}\) given in Fearnhead (2005) ensures that the number of computations required is far fewer than N. In Fearnhead (2005), Sect. 3.1, for a Poisson mixture model comprising 2 components and 1096 observations, \(N=2^{1096}=8.49\times10^{329}\) with \(S = | \mathcal{S}|= 501501\). Whilst in this example S is still large, the computation of {π(s k |x)} can be done relatively quickly, see Fearnhead (2005), and an example with S approximately \(2.5\times10^{7}\) is considered in Sect. 4.2 of this paper.

2.3 Sufficient statistics

In this section we outline how (2.5) can be derived for a broad class of models that covers Poisson mixture models, the Reed-Frost household epidemic model (Sect. 3) and the INAR model (Sect. 4). First note that we can write,

$$\begin{aligned} \pi(\mathbf{x} | \boldsymbol{\theta}) =& \prod _{t=1}^n \pi(x_t | \mathbf{x}_{t-1}, \boldsymbol{\theta}), \end{aligned}$$
(2.6)

where x t =(x 1,x 2,…,x t ) (t=1,…,n), with π(x 1|x 0,θ)=π(x 1|θ). We consider models for which the augmented data, y, can be chosen such that

$$\begin{aligned} \pi(\mathbf{x} | \boldsymbol{\theta}) =& \prod _{t=1}^n \Biggl\{ \sum _{k=1}^{m_t} \pi(x_t, y_{t,k} | \mathbf{x}_{t-1}, \boldsymbol{\theta}) \Biggr\} \end{aligned}$$
(2.7)

with \(\prod_{t=1}^{n} m_{t} (=N)\) denoting the total number of possibilities for y. That is, we assume that there are m t possibilities, in terms of the choice of the augmented data, of how the observation x t could have arisen given x t−1, with each y t,k corresponding to a different possibility. For the Poisson mixture model, the observations are assumed to be independent and identically distributed (x t is independent of x t−1) and m t =m (t=1,2,…,n) with y t,k corresponding to x t arising from the kth Poisson component. The extension of (2.7) to the more general case \(\pi(\mathbf{x} | \boldsymbol{\theta}) = \prod _{t=1}^{n} \{ \sum_{k=1}^{m_{t}} \pi(x_{t}, y_{t,k} | \mathbf{x}_{t-1}, \mathbf{y}_{t-1}, \boldsymbol{\theta}) \}\), where the allocation y t,k depends on both the observations x t−1 and the allocations y t−1 is straightforward.

The key step in using (2.7) is to identify appropriate augmented data, y. The choice of y=(y 1,y 2,…,y n ) has to be made such that π(θ|y,x) is from a well known probability distribution. Thus y is chosen such that

$$\begin{aligned} \pi(\mathbf{x}, \mathbf{y} | \boldsymbol{\theta}) = \prod _{t=1}^n \Biggl\{ c_{t,y_t} \prod _{i=1}^d h_{ty_ti} ( \theta_i) \Biggr\} , \end{aligned}$$
(2.8)

where \(\pi(x_{t} | \mathbf{x}_{t-1}, \boldsymbol{\theta}) = \sum _{k=1}^{m_{t}} \pi (x_{t}, y_{t,k} | \mathbf{x}_{t-1}, \boldsymbol{\theta})\) and h tki (⋅) is an integrable function. Throughout this paper h tki (q) will be \(q^{A-1}(1-q)^{B-1}\) (0≤q≤1) or \(q^{A-1}\exp(-Bq)\) (q≥0), although other choices of h tki (q) are possible. Note that both \(q^{A-1}(1-q)^{B-1}\) (0≤q≤1) and \(q^{A-1}\exp(-Bq)\) (q≥0) are proportional to probability density functions from an exponential family of distributions, namely, \(\operatorname{beta} (A,B)\) and \(\operatorname{gamma} (A,B)\), respectively, which arise naturally as posterior distributions for the parameters of discrete parametric distributions (for example, the binomial, negative binomial and Poisson distributions). We require that for a fixed i, h sji (⋅) and h tki (⋅) \((j,k,s,t \in\mathbb{N})\) belong to the same family of functions (probability distributions) and that a conjugate prior, π i (⋅), for θ i exists. This allows for easy identification of the conditional posterior distribution of θ i , π(θ i |y,x). Therefore, we have that

$$\begin{aligned} \pi(\mathbf{x}, \mathbf{y} , \boldsymbol{\theta}) = \Biggl\{ \prod_{t=1}^n c_{t,y_t} \Biggr\} \prod_{i=1}^d \Biggl( \Biggl\{ \prod _{t=1}^n h_{ty_ti} ( \theta_i) \Biggr\} \pi(\theta_i) \Biggr). \end{aligned}$$
(2.9)

Then

$$\begin{aligned} \pi(\mathbf{y} | \mathbf{x}) \propto\prod _{t=1}^n c_{t,y_t} \prod _{i=1}^d B_i (\mathbf{y}), \end{aligned}$$

where \(B_{i} (\mathbf{y}) = \int\prod_{t=1}^{n} h_{t y_{t}i} (\theta_{i}) d \theta_{i}\) (i=1,2,…,d). This in turn gives

$$\begin{aligned} \pi(\mathbf{y}_j | \mathbf{x}) = & \prod _{t=1}^n c_{t,y_{t,j}} \prod _{i=1}^d B_i (\mathbf{y}_j) \\ &{} \bigg/ \Biggl\{ \sum_{k=1}^N \Biggl( \prod _{t=1}^n c_{t,y_{t,k}} \prod _{i=1}^d B_i (\mathbf{y}_k) \Biggr) \Biggr\} . \end{aligned}$$
(2.10)

Note that for (2.10) it suffices to know \(c_{t,y_{t,j}}\) up to a constant of proportionality. Hence we can replace \(c_{t,y_{t,j}}\) by \(\tilde{c}_{t,y_{t,j}} = k_{x_{t}} c_{t,y_{t,j}}\), where \(k_{x_{t}}\) is a constant independent of y t,j .

Suppose that for t=1,2,…,n, \(h_{ty_{t}i} (\theta_{i}) = \theta_{i}^{e_{ti}} (1-\theta_{i})^{f_{ti}}\) (0≤θ i ≤1) with a \({\rm beta} (C_{i}, D_{i})\) prior on θ i . Then setting \(E_{i} = C_{i} + \sum_{t=1}^{n} e_{ti}\) and \(F_{i} = D_{i} + \sum_{t=1}^{n} f_{ti}\), we have that the (conditional) posterior distribution for θ i is \({\rm beta} (E_{i}, F_{i})\) with

$$\begin{aligned} B_i (\mathbf{y}) = \frac{\Gamma (C_i + D_i)}{\Gamma (C_i) \Gamma(D_i)} \times \frac{\Gamma(E_i) \Gamma(F_i)}{\Gamma (E_i +F_i)}. \end{aligned}$$
(2.11)

Alternatively suppose that for t=1,2,…,n, \(h_{ty_{t}i} (\theta_{i}) = \theta_{i}^{e_{ti}} \exp(- f_{ti} \theta_{i})\) (θ i ≥0) with a \(\operatorname{gamma} (C_{i}, D_{i})\) prior on θ i . Then setting \(E_{i} = C_{i} + \sum_{t=1}^{n} e_{ti}\) and \(F_{i} = D_{i} + \sum_{t=1}^{n} f_{ti}\), we have that the (conditional) posterior distribution for θ i is \(\operatorname{gamma} (E_{i}, F_{i})\) with

$$\begin{aligned} B_i (\mathbf{y}) =\frac {D_i^{C_i}}{\Gamma(C_i)} \times \frac{\Gamma(E_i)}{ F_i^{E_i}}. \end{aligned}$$
(2.12)

The expressions for B i (y) in (2.11) and (2.12) are particularly straightforward if E i and F i are integers. This will be the case when the data arises from discrete parametric distributions with \(C_{i}, D_{i} \in\mathbb{N}\). In both cases E i and F i are sufficient statistics (the dependence upon y is suppressed) and are sums of n+1 terms, with each term depending upon a given observation or the prior. This is key for constructing the sufficient statistics in an efficient manner, and we shall discuss this shortly.
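
In practice the products and ratios of gamma functions in (2.11) and (2.12) are most safely evaluated on the log scale. A minimal sketch (helper functions of our own, written in the notation above):

```python
from math import lgamma, log

def log_B_beta(e_sum, f_sum, C, D):
    """log of (2.11): beta(C, D) prior with E = C + sum_t e_t, F = D + sum_t f_t."""
    E, F = C + e_sum, D + f_sum
    return (lgamma(C + D) - lgamma(C) - lgamma(D)
            + lgamma(E) + lgamma(F) - lgamma(E + F))

def log_B_gamma(e_sum, f_sum, C, D):
    """log of (2.12): gamma(C, D) prior with E = C + sum_t e_t, F = D + sum_t f_t."""
    E, F = C + e_sum, D + f_sum
    return C * log(D) - lgamma(C) + lgamma(E) - E * log(F)

# The unnormalised log-weight of an augmented configuration y is then the sum of the
# relevant log c_{t,y_t} terms plus sum_i log B_i(y), cf. the display before (2.10).
```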

We illustrate the above with the 2 component Poisson mixture model; the extension to the m component Poisson mixture model is trivial. Assign a beta(1,1) prior on p (p 1=p,p 2=1−p) and a \(\operatorname{gamma} (C_{k}, D_{k})\) prior on λ k (k=1,2), giving

$$\begin{aligned} \pi(\mathbf{x}, \mathbf{y}, \boldsymbol{\theta}) =& \prod _{t=1}^n \biggl( p^{1_{\{y_t =1 \}}} (1-p)^{1_{\{y_t =2 \}}} \times\frac{\lambda_{y_t}^{x_t}}{x_t!} \exp (- \lambda_{y_t}) \biggr) \\ &{} \times1 \times\prod_{k=1}^2 \frac{D_k^{C_k}}{\Gamma(C_k)} \lambda_k^{C_k -1} \exp(- \lambda_k D_k) \\ =& \Biggl\{ \prod_{t=1}^n p^{1_{\{y_t =1 \}}} (1-p)^{1_{\{y_t =2 \}}} \Biggr\} \\ &{}\times\prod_{k=1}^2 \biggl\{ \biggl( \prod_{t: y_t=k} \frac{\lambda_k^{x_t}}{x_t!} \exp(- \lambda_k) \biggr) \\ &{}\times\frac{D_k^{C_k}}{\Gamma(C_k)} \lambda_k^{C_k -1} \exp(-\lambda_k D_k) \biggr\} , \end{aligned}$$
(2.13)

where θ=(p,λ 1,λ 2). Thus,

$$\begin{aligned} \pi(\mathbf{y}, \boldsymbol{\theta}| \mathbf{x}) \propto& \Biggl( \prod_{t=1}^n \frac{1}{x_t!} \Biggr) p^{a_1} (1-p)^{a_2} \prod_{k=1}^2 \lambda_k^{z_k +C_k -1} \\ &{}\times\exp\bigl(-\lambda_k (a_k + D_k)\bigr) \\ \propto& p^{a_1} (1-p)^{a_2} \prod _{k=1}^2 \lambda_k^{z_k +C_k -1} \\ &{}\times \exp \bigl(-\lambda_k (a_k + D_k)\bigr), \end{aligned}$$
(2.14)

giving

$$\begin{aligned} \pi(\mathbf{y} | \mathbf{x}) \propto& \frac{a_1 ! a_2!}{(a_1 +a_2 +1)!} \times \prod_{k=1}^2 \frac{\Gamma(z_k + C_k)}{(a_k + D_k)^{z_k + C_k}} \\ \propto& \prod_{k=1}^2 \frac{a_k! \Gamma(z_k + C_k)}{(a_k + D_k)^{z_k + C_k}}, \end{aligned}$$
(2.15)

since a 1 +a 2 +1=n+1, independent of y. Finally, π(y|x) can be computed by finding the normalising constant by summing the right-hand side of (2.15) over the \(N=2^{n}\) possibilities. This is only practical if N is sufficiently small, hence the need to exploit sufficient statistics to reduce the number of calculations that are necessary.
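
For very small n this brute-force summation can be carried out directly. The sketch below (ours) enumerates the \(N=2^{n}\) allocations for a toy data set and normalises the weights in (2.15); the gamma(1,1) priors on λ 1 and λ 2 are an illustrative assumption.

```python
from itertools import product
from math import lgamma, log, exp

def poisson_mixture_brute_force(x, C=(1, 1), D=(1, 1)):
    """Enumerate all N = 2^n allocations y and weight each according to (2.15).

    Returns a dict mapping each allocation y to pi(y | x); feasible only for small n.
    """
    log_w = {}
    for y in product((1, 2), repeat=len(x)):
        a = [sum(1 for yt in y if yt == k) for k in (1, 2)]               # a_1, a_2
        z = [sum(xt for yt, xt in zip(y, x) if yt == k) for k in (1, 2)]  # z_1, z_2
        # log of  prod_k a_k! Gamma(z_k + C_k) / (a_k + D_k)^(z_k + C_k)
        log_w[y] = sum(lgamma(a[k] + 1) + lgamma(z[k] + C[k])
                       - (z[k] + C[k]) * log(a[k] + D[k]) for k in (0, 1))
    m = max(log_w.values())
    tot = sum(exp(v - m) for v in log_w.values())
    return {y: exp(v - m) / tot for y, v in log_w.items()}

weights = poisson_mixture_brute_force([1, 1, 2, 1])
print(len(weights), sum(weights.values()))   # 16 allocations, weights summing to 1
```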

Suppose that the vector of sufficient statistics, S(y,x)=s, is D-dimensional and that \(s_{l} = \sum_{t=1}^{n} g_{t,l} (y_{t}, x_{t})\) (l=1,2,…,D). This form of sufficient statistic is consistent with the beta and gamma conditional posterior distributions for the components of θ outlined above. For example, for the Poisson mixture model with s=(a,z), \(a_{l} = \sum_{t=1}^{n} g_{l} (y_{t}, x_{t})\) with g l (y,x)=1{y=l} and \(z_{l} = \sum_{t=1}^{n} g_{l} (y_{t}, x_{t})\) with g l (y,x)=1{y=l} x. Now for p=1,2,…,n, let S p(y,x)=s p denote the sufficient statistics for the first p observations, i.e. using y p and x p only, with \(s_{l}^{p} = \sum_{t=1}^{p} g_{t,l} (y_{t}, x_{t})\) denoting the corresponding partial sum. In the obvious notation, let \(\mathcal{S}^{p} = \{S^{p} (\mathbf{y}_{1}, \mathbf{x}), S^{p} (\mathbf{y}_{2}, \mathbf{x}), \ldots, S^{p} (\mathbf{y}_{N}, \mathbf{x}) \} = \{ \mathbf{s}_{1}^{p}, \mathbf{s}_{2}^{p}, \ldots, \mathbf{s}_{S^{p}}^{p} \}\) denote the set of possible sufficient statistics for the first p observations with \(S^{p} = | \mathcal{S}^{p}|\). Let \(C^{p}_{j} = \sum_{\mathbf{y} : S^{p} (\mathbf{y}, \mathbf{x}) = \mathbf{s}^{p}_{j}} \prod_{t=1}^{p} c_{t,y_{t}}\), the relative weight associated with \(\mathbf{s}^{p}_{j}\). It is trivial to obtain \(\mathcal{S}^{1}\) and \(\{ C^{1}_{j} \}\). For p=2,3,…,n, we can construct \(\mathcal{S}^{p}\) and corresponding weights \(\{C^{p}_{j} \}\) iteratively, using \(\mathcal{S}^{p-1}\), \(\{C^{p-1}_{j} \}\) and {(y p,k ,x p )} as follows. Let σ p(y p ,x p )=(g p,1(y p ,x p ),g p,2(y p ,x p ),…,g p,D (y p ,x p )), the vector of terms that need to be added to the sufficient statistics s p−1 to construct s p, given that the pth observation and augmented data are x p and y p , respectively. Then \(\mathcal{S}^{p} = \{\mathbf{s}^{p} = \mathbf{s}^{p-1} + \sigma^{p} (y_{p,k},x_{p}): \mathbf{s}^{p-1} \in\mathcal{S}^{p-1}, k \in\{1,2,\ldots,m_{p}\} \}\), with \(C^{p}_{j} = \sum_{\{\mathbf{s}^{p-1}_{l} + \sigma^{p} (y_{p,k}, x_{p}) = \mathbf{s}^{p}_{j} \}} C^{p-1}_{l} c_{p,y_{p,k}}\). That is, we consider all possible allocations for the pth observation combined with the sufficient statistics constructed from the first p−1 observations. Finally, as noted after (2.10), for the calculation of the weights, we can multiply the \(c_{t,y_{t,j}}\)’s by arbitrary constants that do not depend upon y.

The above construction of sufficient statistics is a generalisation of that given in Fearnhead (2005), Sect. 2, for mixture models, where the construction of sufficient statistics for data x=(1,1,2,1) arising from a 2 component Poisson mixture model is discussed. For p=1,2,3,4, we take \(c_{p,y_{p}} =1\) regardless of the allocation of the pth observation, which ignores the constant 1/x p ! present in the first line of (2.14). Then, in our notation, with \(\mathbf{s}^{p} = (a_{1}^{p}, z_{1}^{p})\) (since \(a_{2}^{p} =p - a_{1}^{p}\) and \(z_{2}^{p} = \sum_{i=1}^{p} x_{i} - z_{1}^{p}\)), we have that \(\mathcal{S}^{3} = \{ (3,4), (2,3), (2,2), (1,2), (1,1), (0,0) \}\) with \(\mathcal{C}^{3} = \{ 1,2,1,1,2,1\}\) denoting the corresponding set of weights. It is straightforward to add σ p(1,1)=(1,1) and σ p(2,1)=(0,0) to the elements of \(\mathcal{S}^{3}\) and update the weights in \(\mathcal{C}^{3}\) accordingly to get \(\mathcal{S}^{4} = \{(4,5), (3,4),(3,3), (2,3), (2,2), (1,2), (1,1), (0,0) \}\) with corresponding weights \(\mathcal{C}^{4} = \{ 1,3,1,3,3,1,3,1\}\). For example, s 4=(3,4) can arise from s 3=(3,4) and y 4=2 or s 3=(2,3) and y 4=1. See Fearnhead (2005), Sect. 2, for a diagram of the construction.
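
The recursion over p takes only a few lines of code. The sketch below (ours) reproduces the sets and weights just described for x=(1,1,2,1), taking all \(c_{t,y_{t}}=1\):

```python
from collections import defaultdict

def sequential_sufficient_statistics(x):
    """Build (S^p, C^p) sequentially for the 2-component Poisson mixture.

    The sufficient statistic is s^p = (a_1^p, z_1^p) and every c_{t,y_t} is 1.
    """
    weights = {(0, 0): 1}                          # start from the empty data set
    for xt in x:
        new_weights = defaultdict(int)
        for (a1, z1), w in weights.items():
            new_weights[(a1 + 1, z1 + xt)] += w    # x_t allocated to component 1
            new_weights[(a1, z1)] += w             # x_t allocated to component 2
        weights = dict(new_weights)
    return weights

print(sorted(sequential_sufficient_statistics([1, 1, 2, 1]).items(), reverse=True))
# [((4, 5), 1), ((3, 4), 3), ((3, 3), 1), ((2, 3), 3), ((2, 2), 3),
#  ((1, 2), 1), ((1, 1), 3), ((0, 0), 1)], i.e. S^4 and C^4 as above
```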

2.4 Remarks

We make a few concluding comments about the generic setup before implementing the methodology in Sects. 3 and 4. For ease of exposition, throughout this section we have focused upon \(\pi(\boldsymbol{\theta}| \mathbf{y}, \mathbf {x}) = \prod_{i=1}^{d} \pi(\theta_{i} | \mathbf{y}, \mathbf{x})\), (2.1), where θ i is a single parameter. It is trivial to extend to the case where θ i is a vector of parameters, provided that π(θ i |y,x) belongs to a well known probability distribution. A common example of this is where π(θ i |y,x) is the probability density function of a Dirichlet distribution, and we consider an example of this in Sect. 3.2. All the other examples considered in this paper satisfy (2.1).

The generic setup allows us to identify classes of models for which the methodology described in this paper is applicable and also models for which the methodology is not appropriate. The key requirement is that the augmented data y is discrete. Note that, in principle, we do not require x to be discrete, with the methodology readily applicable to a mixture of m Gaussian distributions with unknown means and known variances with y i denoting the allocation of an observation x i to a particular Gaussian distribution. The main limitation in applying the methodology to mixtures of Gaussian distributions, or more generally mixtures of continuous distributions, is that all \(m^{n}\) possible values for y need to be evaluated, as almost surely each combination of (y,x) yields a different sufficient statistic. Therefore for the methodology to be practically useful we require x and y to be discrete. For the examples considered in this paper and the Poisson mixture model, the probability of observing x t can be expressed as a sum over m t terms, where each term in the sum consists either of a single discrete random variable or the sum of independent discrete random variables. The well known discrete probability distributions are the Bernoulli-based distributions (binomial, negative binomial and geometric distributions), the Poisson distribution and the multinomial distribution, which give rise to beta, gamma and Dirichlet posterior distributions, respectively. Hence the emphasis on the discussion of these probability distributions above.

A final remark is that we have discussed the data x=(x 1,x 2,…,x n ) in terms of n observations. This is an applicable representation of the data for the models considered in Sects. 3 and 4. However, it will be convenient in Sect. 3 to give a slightly different representation of the data with x i denoting the total number of observations in the data that belong to a given category i. For example, for data from a Poisson mixture model category i would correspond to observations equal to i. The four observations 1, 1, 2, 1 would be recorded as three 1s and a 2, and we can then characterise the data augmentation by how many 1s and 2s are assigned to each Poisson distribution. This representation can greatly reduce N. For example, for the genetic linkage data in Sect. 3.2 it reduces the number of possibilities from \(2^{125}\) to 126. This alternative representation can then negate the need for computing sufficient statistics or can speed up the computation of the sufficient statistics.

3 Multinomial-beta data

3.1 Introduction to multinomial-beta data

In this section we study in detail a special case of the generic model introduced in Sect. 2. This case is motivated by the classic genetic linkage data from Rao (1965, pp. 368–369), popularized by Dempster et al. (1977), where the data are assumed to arise from a multinomial distribution with the components of θ having independent beta distributed posterior distributions, conditional upon x and y. Other models that give rise to multinomial-beta data are the Reed-Frost epidemic model (Longini and Koopman 1982; Addy et al. 1991; O’Neill and Roberts 1999; O’Neill et al. 2000) and, with minor modifications, the INAR(p) model with geometrically distributed innovations, see Sect. 4.3 and McCabe and Martin (2005). As noted at the end of Sect. 2, it is convenient to use an alternative representation of the data to that given for the generic model, and we give details of the generic form of the data below.

Suppose that there are n independent observations that are classified into t types with n h observations of type h=1,2,…,t. For the genetic linkage data t=1 and the subscript h can be dropped. The n h observations of type h are divided into k h categories with each observation independently having probability p hi (θ) of belonging to category (h,i) (i=1,2,…,k h ). Let \(\mathbf{p}_{h} (\boldsymbol{\theta}) = (p_{h1} (\boldsymbol{\theta}), p_{h2} (\boldsymbol{\theta}), \ldots, p_{hk_{h}} (\boldsymbol{\theta}))\) and let \(\mathbf{x}_{h} = (x_{h1}, x_{h2}, \ldots, x_{hk_{h}})\), where θ is a d-dimensional vector of parameters and x hi denotes the total number of observations in category (h,i). For i=1,2,…,k h , we assume that there exists \(m_{hi} \in\mathbb{N}\) such that \(p_{hi} (\boldsymbol{\theta}) = \sum_{l=1}^{m_{hi}} c_{hil} \prod_{j=1}^{d} \theta_{j}^{a_{hilj}} (1-\theta_{j})^{b_{hilj}}\), where a hilj ,b hilj and c hil are constants independent of θ. Typically, a hilj and b hilj will be non-negative integers. Thus we have that

$$\begin{aligned} \pi(\mathbf{x} | \boldsymbol{\theta}) =& \prod _{h=1}^t \frac{n_h!}{\prod_{i=1}^{k_h} x_{hi} !} \prod _{i=1}^{k_h} \Biggl( \sum_{l=1}^{m_{hi}} c_{hil} \prod_{j=1}^d \theta_j^{a_{hilj}} (1-\theta_j)^{b_{hilj}} \Biggr)^{x_{hi}}, \end{aligned}$$

which is of the same form as (2.7). If all of the m hi ’s are equal to 1, then assuming independent beta priors for the components of θ, the posterior distribution of θ consists of d independent beta distributions, one for each component. (Throughout the remainder of this section, for ease of exposition, we assume a uniform prior on θ, that is, π(θ)=1 for \(\boldsymbol{\theta}\in[0,1]^{d}\).) However, if at least one of the m hi ’s is greater than 1 then the posterior distribution of θ is not readily available from x=(x 1,x 2,…,x t ) alone. Therefore we augment the data by dividing the type h data into \(K_{h} = \sum_{i=1}^{k_{h}} m_{hi}\) categories, classified \((h,1,1), \ldots, (h,1,m_{h1}), (h,2,1), \ldots, (h,k_{h},m_{hk_{h}})\). For i=1,2,…,k h and l=1,2,…,m hi , let y hil and \(q_{hil} (\boldsymbol{\theta}) = c_{hil} \prod_{j=1}^{d} \theta_{j}^{a_{hilj}} (1-\theta_{j})^{b_{hilj}}\) denote the total number of observations in category (h,i,l) and the probability that an observation belongs to category (h,i,l), respectively, with \(x_{hi} = \sum_{l=1}^{m_{hi}} y_{hil}\) and \(p_{hi} (\boldsymbol{\theta}) = \sum_{l=1}^{m_{hi}} q_{hil} (\boldsymbol {\theta})\). Then

$$\begin{aligned} \pi(\mathbf{y} | \boldsymbol{\theta}) =& \prod _{h=1}^t \frac{n_h!}{\prod_{i=1}^{k_h} \prod_{l=1}^{m_{hi}} y_{hil} !} \prod _{i=1}^{k_h} \prod_{l=1}^{m_{hi}} q_{hil} (\boldsymbol{\theta})^{y_{hil}}, \end{aligned}$$

with \(\pi( \mathbf{x}| \mathbf{y}, \boldsymbol{\theta}) = \pi( \mathbf{x}| \mathbf{y}) = \prod_{h=1}^{t} \prod_{i=1}^{k_{h}} 1_{\{ x_{hi} = \sum_{l=1}^{m_{hi}} y_{hil} \}}\), i.e. the augmented data y is consistent with the observed data x. Therefore

$$\begin{aligned} \pi(\boldsymbol{\theta}, \mathbf{y} | \mathbf{x}) \propto& \prod _{h=1}^t \Biggl\{ \Biggl( \prod _{i=1}^{k_h} 1_{\{ x_{hi} = \sum_{l=1}^{m_{hi}} y_{hil} \}} \Biggr) \frac{n_h!}{\prod_{i=1}^{k_h} \prod_{l=1}^{m_{hi}} y_{hil} !} \\ &{}\times\prod_{i=1}^{k_h} \prod _{l=1}^{m_{hi}} q_{hil} ( \boldsymbol{\theta})^{y_{hil}} \Biggr\} \\ \propto& \Biggl\{ \prod_{h=1}^t \Biggl( \prod_{i=1}^{k_h} 1_{\{ x_{hi} = \sum_{l=1}^{m_{hi}} y_{hil} \}} \Biggr) \prod_{i=1}^{k_h} \prod_{l=1}^{m_{hi}} \biggl( \frac{c_{hil}^{y_{hil}}}{y_{hil}!} \biggr) \Biggr\} \\ &{}\times\prod_{j=1}^d \theta_j^{E_j (\mathbf{y})} (1-\theta_j)^{F_j (\mathbf{y})} , \end{aligned}$$
(3.1)

where \(E_{j} (\mathbf{y}) = \sum_{h=1}^{t} \sum_{i=1}^{k_{h}} \sum_{l=1}^{m_{hi}} a_{hilj} y_{hil}\) and \(F_{j} (\mathbf{y}) = \sum_{h=1}^{t} \sum_{i=1}^{k_{h}} \sum_{l=1}^{m_{hi}} b_{hilj} y_{hil}\). The sufficient statistics for the model are S(y,x)=(E 1(y),F 1(y),…,F d (y)), in agreement with the observations made in Sect. 2. Hence, integrating θ out of (3.1) yields

$$\begin{aligned} \pi(\mathbf{y} | \mathbf{x}) \propto& \Biggl\{ \prod _{h=1}^t \Biggl( \prod_{i=1}^{k_h} 1_{\{ x_{hi} = \sum_{l=1}^{m_{hi}} y_{hil} \}} \Biggr) \prod_{i=1}^{k_h} \prod_{l=1}^{m_{hi}} \biggl( \frac{c_{hil}^{y_{hil}}}{y_{hil}!} \biggr) \Biggr\} \\ &{}\times\prod_{j=1}^d \frac{E_j (\mathbf{y})! F_j (\mathbf{y})!}{(E_j (\mathbf{y}) + F_j (\mathbf{y}) +1)!}, \end{aligned}$$
(3.2)

with the components of θ, conditional upon y, having independent beta posterior distributions.

For h=1,2,…,t and i=1,2,…,k h , the total number of possible states, y hi , satisfying \(x_{hi} = \sum_{l=1}^{m_{hi}} y_{hil}\) is \(\binom{x_{hi} + m_{hi} -1}{x_{hi}}\) and thus the total number of possibilities for y is \(N = \prod_{h=1}^{t} \prod_{i=1}^{k_{h}} \binom{x_{hi} + m_{hi} -1}{x_{hi}}\). This is considerably fewer than the total number of data augmented states, \(\prod_{h=1}^{t} \prod_{i=1}^{k_{h}} m_{hi}^{x_{hi}}\), that would be needed if we considered each of the n observations one by one rather than by category. Given π(y|x), the posterior distribution π(θ|x) can then easily be obtained. If N is small, we can work directly with (3.2); otherwise we can construct \(\mathcal{S} = \{ S(\mathbf{y}_{1}, \mathbf{x}), \ldots, S (\mathbf{y}_{N}, \mathbf{x}) \}\), the set of sufficient statistics consistent with x, and the corresponding probability weights, as outlined in Sect. 2.

3.2 Genetic linkage data

The data consist of the genetic linkage of 197 animals, divided into four genetic categories, labelled 1 to 4, see Rao (1965), Dempster et al. (1977) and Tanner and Wong (1987). The probabilities that an animal belongs to each of the four categories are \(\frac{1}{2} + \frac{\theta}{4}, \frac{1-\theta}{4}, \frac{1-\theta}{4}, \frac{\theta}{4}\), respectively. Let x=(x 1,x 2,x 3,x 4)=(125,18,20,34) denote the total number of animals in each category. In the notation of Sect. 3.1, d=1, t=1, m 1=2, m 2=m 3=m 4=1, so data augmentation is required to gain insight into the posterior distribution of θ, see Tanner and Wong (1987). Let q 11(θ)=1/2, q 12(θ)=θ/4 and y 12=z with y 11=x 1−z. Then N=126 with the possible values of z being 0,1,…,x 1(=125) and

$$\begin{aligned} \pi(z | \mathbf{x}) \propto& \frac{1}{(x_1-z)! z!} 2^{-z} \frac{\Gamma(z + x_4 +1)}{\Gamma(z + x_2 + x_3 + x_4 + 2)}. \end{aligned}$$
(3.3)

Then (3.3) and π(θ|z,x)∼beta(z+x 4+1,x 2+x 3+1) together imply that π(θ|x) is a mixture of beta distributions with

$$\begin{aligned} \mathbb{E}[ \theta| \mathbf{x}] =& \sum_{z=0}^{125} \frac{z+x_4 +1}{z + x_2 + x_3 + x_4 +2} \pi(z | \mathbf{x}) = 0.6228 \end{aligned}$$

and var(θ|x)=(0.05094)^{2}=0.002595. The posterior distribution of θ is plotted in Fig. 1.

Fig. 1 Exact posterior density of π(θ|x) for the genetic linkage data
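
The 126 weights in (3.3) and the resulting posterior summaries are straightforward to compute; a minimal sketch (ours), which should reproduce the values reported above:

```python
from math import lgamma, log, exp

x1, x2, x3, x4 = 125, 18, 20, 34

# Unnormalised log-weights from (3.3) for z = 0, 1, ..., x1
log_w = [-lgamma(x1 - z + 1) - lgamma(z + 1) - z * log(2)
         + lgamma(z + x4 + 1) - lgamma(z + x2 + x3 + x4 + 2) for z in range(x1 + 1)]
m = max(log_w)
tot = sum(exp(v - m) for v in log_w)
pz = [exp(v - m) / tot for v in log_w]                    # pi(z | x)

# theta | z, x ~ beta(z + x4 + 1, x2 + x3 + 1); mix its first two moments over pi(z | x)
mean = sum(p * (z + x4 + 1) / (z + x2 + x3 + x4 + 2) for z, p in enumerate(pz))
second = sum(p * (z + x4 + 1) * (z + x4 + 2)
             / ((z + x2 + x3 + x4 + 2) * (z + x2 + x3 + x4 + 3)) for z, p in enumerate(pz))
print(mean, (second - mean ** 2) ** 0.5)                  # approximately 0.6228 and 0.0509
```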

In Gelfand and Smith (1990), Sect. 3.1, the genetic linkage example is extended to a model where the data x is assumed to arise from a \({\rm Multinomial} (n,(a_{1} \theta+ b_{1}, a_{2} \theta+ b_{2}, a_{3} \eta + b_{3}, a_{4} \eta+ b_{4}, c (1- \theta-\eta)))\), where (θ,η) are the unknown parameters of interest, a i ,b i ≥0 are known and \(0 < c = 1 - \sum_{j=1}^{4} b_{j} = a_{1} +a_{2} = a_{3} +a_{4} < 1\). In Gelfand and Smith (1990), a Gibbs sampler is described for obtaining samples from the joint distribution of (θ,η) and this is applied to x=(14,1,1,1,5) (n=22) with probabilities (θ/4+1/8,θ/4,η/4,η/4+3/8,(1−θη)/2). It is easy to see that π(x|θ,η) does not satisfy (3.1) but it is straightforward to extend the above arguments to obtain π(θ,η|x). We split categories 1 and 4 into two subcategories leaving the other three categories unaffected. For category 1 (4), observations have probabilities q 11=θ/4 (q 41=η/4) and q 12=1/8 (q 42=3/8) of belonging to subcategories 11 (41) and 12 (42), respectively. We augment x by (z,w), where z and w denote the total number of observations in subcategories 11 and 41, respectively. Note that there are (14+1)×(1+1)=30 possibilities for (z,w). With a \(\operatorname{Dirichlet} (1,1,1)\) prior on (θ,η,1−θη), we have that

$$\begin{aligned} \pi\bigl(\theta, \eta, (z,w) | \mathbf{x}\bigr) & \propto \frac{1}{(x_1-z)! z! (x_4 - w)! w!} \\ &\quad\times\theta^{z + x_2} 2^z \eta^{x_3 + w} \biggl( \frac{2}{3} \biggr)^w (1-\theta-\eta)^{x_5}, \end{aligned}$$

which upon integrating out (θ,η) yields

$$\begin{aligned} \pi\bigl((z,w) | \mathbf{x}\bigr) \propto& \frac{1}{(x_1-z)! z! (x_4 - w)! w!} 2^z \biggl( \frac{2}{3} \biggr)^w\\ &{}\times \frac{(z + x_2)! (x_3 + w)!}{(z + x_2 + x_3 + w + x_5 +2)!}. \end{aligned}$$

Given (z,w), the (conditional) posterior distribution of (θ,η,1−θη) is \(\operatorname{Dirichlet} (z + x_{2} +1, x_{3} + w+ 1, x_{5}+ 1)\). The marginal posterior means (standard deviations) of θ and η are 0.5200 (0.1333) and 0.1232 (0.0809), respectively, with the posterior correlation between θ and η equal to −0.1049.
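
The same enumeration applies to this two-parameter extension: the 30 values of (z,w) are weighted as above and the Dirichlet moments mixed accordingly. A minimal sketch (ours), which should recover the posterior means quoted above:

```python
from math import lgamma, log, exp

x1, x2, x3, x4, x5 = 14, 1, 1, 1, 5

states, log_w = [], []
for z in range(x1 + 1):
    for w in range(x4 + 1):
        states.append((z, w))
        log_w.append(-lgamma(x1 - z + 1) - lgamma(z + 1)
                     - lgamma(x4 - w + 1) - lgamma(w + 1)
                     + z * log(2) + w * log(2 / 3)
                     + lgamma(z + x2 + 1) + lgamma(x3 + w + 1)
                     - lgamma(z + x2 + x3 + w + x5 + 3))
m = max(log_w)
tot = sum(exp(v - m) for v in log_w)
p = [exp(v - m) / tot for v in log_w]

# (theta, eta, 1 - theta - eta) | (z, w), x ~ Dirichlet(z + x2 + 1, x3 + w + 1, x5 + 1)
mean_theta = sum(pi * (z + x2 + 1) / (z + x2 + x3 + w + x5 + 3)
                 for (z, w), pi in zip(states, p))
mean_eta = sum(pi * (x3 + w + 1) / (z + x2 + x3 + w + x5 + 3)
               for (z, w), pi in zip(states, p))
print(mean_theta, mean_eta)   # should be close to the reported 0.5200 and 0.1232
```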

3.3 Household Reed-Frost epidemic model

The model presented here is the specialization of the household epidemic model of Addy et al. (1991) to a constant infectious period, the Reed-Frost epidemic model. The data consists of the final outcome of an epidemic amongst a population of individuals partitioned into households. Let t denote the maximum household size. Then there are t types of observations corresponding to households with h=1,2,…,t members. For type h, there are k h =h+1 categories corresponding to i=0,1,…,h members of the household ultimately being infected with the disease. The data is assumed to arise as follows. Each individual in the population has independent probability 1−q G of being infected from outside their household, which we shall term a global infection. (Thus q G is the probability that an individual avoids a global infection.) If an individual is ever infected, it attempts to infect susceptible members of its household, whilst infectious, before recovering from the disease and entering the removed class. Each infectious individual has probability 1−q L of infecting each given member of their household, and therefore makes \({\rm Bin} (H-1, 1-q_{L})\) local infectious contacts if they belong to a household of size H. Infectious contacts with susceptibles result in infection, whereas infectious contacts with non-susceptibles have no impact on the recipient. Thus θ=(q G ,q L ).

It is trivial to show that, for h=0,1,…, \(p_{h0} = q_{G}^{h}\) and \(p_{h1} = h (1-q_{G}) q_{G}^{h-1} q_{L}^{h-1}\), and hence, m h0=m h1=1. For i≥2, p hi can be obtained recursively using Addy et al. (1991), Theorem 2, or more conveniently for the above setting using Ball et al. (1997), (3.12). For any i and h≥i, m hi =m ii with m 22=2, m 33=5, m 44=13 and m 55=33. To see how m hi is obtained it is useful to construct the within-household epidemic on a generation basis. Let generation 0 denote those individuals in the household infected globally, and for l≥1, let generation l denote those individuals infected by the infectives in generation l−1. Then {a 0,a 1,…,a b } denotes that there are a c infectives in generation c with a b+1=0, i.e. the epidemic in the household has ended after generation b. The final size of the epidemic in the household is \(\sum_{c=0}^{b} a_{c}\). For example, there are four ways that \(\sum_{c=0}^{b} a_{c} =3\): {1,1,1}, {1,2}, {2,1} and {3}. For {2,1}, this can arise from either one or both of the generation 0 individuals attempting to infect the generation 1 individual and we denote these two possibilities as {2,11} and {2,12}, respectively. Thus the category (h,3) is split into five sub-categories corresponding to infection chains {1,1,1}, {1,2}, {2,11}, {2,12} and {3}. For the category (h,4), the 13 sub-categories are {1,1,1,1}, {1,1,2}∪{1,2,11}, {1,2,12}, {1,3}, {2,22}, {2,23}, {2,24}, {2,11,1}, {2,12,1}, {3,11}, {3,12}, {3,13} and {4}, where, for example, {2,23} denotes that there are 3 attempted infections between the 2 infectives in generation 0 and the 2 infectives in generation 1. Note that the outcomes {1,1,2} and {1,2,11} are combined into a single sub-category. This is because they have probabilities \((h!/(2 (h-4)!)) (1-q_{G}) q_{G}^{h-1} (1-q_{L})^{3} q_{L}^{4h-14}\) and \((h!/(h-4)!) (1-q_{G}) q_{G}^{h-1} (1-q_{L})^{3} q_{L}^{4h-14}\), respectively, which involve the same powers of q G and q L and so can be summed into a single term of the required form. In all cases the sub-category probabilities are of the form \(c (1-q_{G})^{b_{G}} q_{G}^{a_{G}} (1-q_{L})^{b_{L}} q_{L}^{a_{L}}\) as required for Sect. 3.1. The full probabilities for i≤4 are given in Table 5 in the Appendix.

We applied the household Reed-Frost model to four influenza outbreaks, two outbreaks (Influenza A and Influenza B) from Seattle, Washington in the 1970s, reported in Fox and Hall (1980), and two outbreaks (1977-78, 1980-81) from Tecumseh, Michigan, reported in Monto et al. (1985). In addition, following Addy et al. (1991), Ball et al. (1997) and O’Neill et al. (2000), we also consider the two Tecumseh outbreaks as a combined data set. The data for the four influenza outbreaks are available in both Clancy and O’Neill (2007) and Neal (2012). The posterior means and standard deviations of q L and q G for each data set are recorded in Table 1 along with N, \(S =| \mathcal{S}|\) and the time taken to compute the exact posterior distribution using Fortran95 on a computer with a dual 1 GHz Pentium III processor. The differences between S and N are dramatic, showing a clear advantage to computing sufficient statistics. Note that for the Tecumseh data sets it is not feasible to compute the exact posterior distribution without computing \(\mathcal{S}\). For all the data sets except the combined Tecumseh data set computation of the posterior distribution is extremely fast. In Fig. 2, contour plots of the exact joint posterior distribution of (q G ,q L ) are given for the Seattle, Influenza A and Influenza B outbreaks.

Fig. 2 Exact posterior density contour plot of π(q L ,q G |x) for the Seattle, Influenza A outbreak (left) and the Seattle, Influenza B outbreak (right)

Table 1 Posterior means and standard deviations for q G and q L for the Seattle and Tecumseh influenza data sets

There are alternatives to computing the exact posterior distribution, for example, MCMC, either the Gibbs sampler or the random walk Metropolis algorithm (O’Neill et al. 2000), or rejection sampling (Clancy and O’Neill 2007). The rejection sampler has the advantage over MCMC of producing independent and identically distributed (perfect) samples from the posterior distribution, but incurs considerable overheads in finding an efficient proposal distribution and bounding constant. A Gibbs sampler algorithm can easily be constructed using the data augmentation scheme given above. However, the simplest approach is to construct a random walk Metropolis algorithm to obtain a sample from π(θ|x) without data augmentation using the recursive equations for {p hi } given by Ball et al. (1997), (3.12).

We assessed the performance of computing the exact posterior distribution against both the Gibbs sampler and the random walk Metropolis algorithms. The MCMC algorithms were run for 11000 iterations for the Gibbs sampler and 110000 iterations for the random walk Metropolis algorithm, with the first 1000 and 10000 iterations discarded as burn-in, respectively. This was sufficient for accurate estimates of all the summary statistics recorded in Table 1. For a fair comparison the MCMC algorithms were also run in Fortran95 on the same computer, with the run times taking between two and thirty seconds. Therefore, with the exception of the full Tecumseh data set, the computation time of the exact posterior distribution compares favourably with obtaining moderate sized MCMC samples. However, since the computation of the exact posterior distribution grows exponentially with the total number of observations (households), its applicability is limited by the size of the data. This is in contrast to the MCMC algorithms, for which the number of computations per iteration depends upon |x|, the total number of categories, and which are therefore unaffected by the number of observations so long as |x| remains constant. It should be noted, though, that the individual Tecumseh data sets are not small, consisting of nearly 300 households each, so the exact posterior distribution offers a good alternative to MCMC, for small-to-moderate data sets, for computing posterior summary statistics, in addition to the advantages of the exact distribution listed in Sect. 1.

4 Integer valued AR processes

4.1 Introduction

The second class of model is the integer-valued autoregressive (INAR) process, see McKenzie (2003), McCabe and Martin (2005) and Neal and Subba Rao (2007). An integer-valued time series {X t ;−∞<t<∞} is an INAR(p) process if it satisfies the difference equation:

$$\begin{aligned} X_t = \sum_{i=1}^p \alpha_i \circ X_{t-i} + Z_t, \quad t \in\mathbb{Z}, \end{aligned}$$

for some generalized, Steutel and van Harn operators α i ∘ (i=1,2,…,p), see Latour (1997), and Z t (−∞<t<∞) are independent and identically distributed according to an arbitrary, but specified, non-negative integer-valued random variable. Generally, see, for example, Neal and Subba Rao (2007), \(Z_{t} \sim{\rm Po} (\lambda)\), and the operators α i ∘ are taken to be binomial operators with

$$\begin{aligned} \alpha_i \circ w = \begin{cases}\operatorname{Bin} (w, \alpha_i)&w >0, \\ 0 & w=0. \end{cases} \end{aligned}$$
(4.1)

This is the situation we primarily focus on here, with θ=(α 1,α 2,…,α p ,λ). In Sect. 4.3, we also consider \(Z_{t} \sim{\rm Geom} (\beta)\), see McCabe and Martin (2005). In Neal and Subba Rao (2007) and Enciso-Mora et al. (2009a), MCMC is used to obtain samples from the posterior distribution of the parameters of INARMA(p,q) models, whereas in McCabe and Martin (2005) numerical integration is used to compute the posterior distribution of INAR(1) models with Poisson, binomial and negative binomial innovations. Although there are similarities between the current work and McCabe and Martin (2005) in computing the exact posterior distribution, it should be noted that the approach of McCabe and Martin (2005) is only practical for an INAR(1) model with very low counts (for the data considered in McCabe and Martin (2005), max t x t =2) and depends upon the gridpoints used for the numerical integration. Maximum likelihood estimation and model selection for INAR(p) models and their extensions are considered in Bu and McCabe (2008), Bu et al. (2008) and Enciso-Mora et al. (2009a), Sect. 4.

We follow Neal and Subba Rao (2007), Sect. 3.1, in the data augmentation step and construction of the likelihood. The observed data consists of x=(x 1,x 2,…,x n ) and x I =(x 1−p ,x 2−p ,…,x 0), and we compute π(θ|x I ,x). (The representation of the data is the same as in Sect. 2.) For \(t \in\mathbb{Z}\), we represent α i ∘X t−i by Y t,i . Note that given Y t =(Y t,1,Y t,2,…,Y t,p ), we have that \(Z_{t} = X_{t} - \sum_{i=1}^{p} Y_{t,i}\). For t≥1, let y t =(y t,1,y t,2,…,y t,p ) denote a realization of Y t . (If p=1, we simply write Y t and y t in place of Y t,1 and y t,1, respectively.) Then it is shown on p. 96 of Neal and Subba Rao (2007) that, for t≥1,

$$\begin{aligned} &\mathbb{P}(X_t = x_t, \mathbf{Y}_t = \mathbf{y}_t | \mathbf{x}_{t-1}, \mathbf{x}_I, \boldsymbol{\theta}) \\ &\quad \propto1_{\{ \sum_{i=1}^p y_{t,i} \leq x_t \}} \prod_{i=1}^p \biggl\{ \binom{x_{t-i}}{y_{t,i}} \alpha_i^{y_{t,i}} (1 - \alpha_i)^{x_{t-i} - y_{t,i}} \biggr\} \\ &\qquad\times \lambda^{x_t -\sum_{i=1}^p y_{t,i}} \exp(- \lambda) \\ &\quad\propto\Biggl\{ \Biggl\{ 1_{\{\sum_{i=1}^p y_{t,i} \leq x_t \}} \prod _{i=1}^p \frac{1}{y_{t,i}! (x_{t-i} - y_{t,i})!} \Biggr\} \\ &\qquad \times \frac{1}{ (x_t -\sum_{i=1}^p y_{t,i})!} \Biggr\} \\ &\qquad \times\prod_{i=1}^p \alpha_i^{y_{t,i}} (1-\alpha_i)^{x_{t-i} -y_{t,i}} \lambda^{x_t - \sum_{i=1}^p y_{t,i}} \exp(- \lambda) \\ &\quad= c_{\mathbf{y}_t, \mathbf{x}} \prod_{i=1}^p \alpha_i^{y_{t,i}} (1-\alpha_i)^{x_{t-i} - y_{t,i}} \lambda^{x_t - \sum_{i=1}^p y_{t,i}} \exp(- \lambda), \\ & \quad \mbox{say}, \end{aligned}$$
(4.2)

with

$$\begin{aligned} & \mathbb{P}(\mathbf{X} = \mathbf{x}, \mathbf {Y}=\mathbf{y} | \theta, \mathbf{x}_I ) \\ &\quad = \prod_{t=1}^n \mathbb{P}(X_t = x_t, \mathbf{Y}_t = \mathbf{y}_t | \mathbf{x}_{t-1}, \mathbf{x}_I, \theta). \end{aligned}$$
(4.3)

The form of (4.2) and (4.3) satisfies (2.8) with \(h_{ty_{t}i} (\alpha_{i}) = \alpha_{i}^{y_{t,i}} (1-\alpha_{i})^{x_{t-i} - y_{t,i}}\) (i=1,2,…,p) and \(h_{ty_{t}(p+1)} (\lambda) = \lambda^{x_{t} - \sum_{i=1}^{p} y_{t,i}} \exp(- \lambda)\). Therefore we assign beta distributed priors for the α i ’s, and a gamma distributed prior for λ. We differ slightly from Neal and Subba Rao (2007), in assuming independent uniform priors for the α i ’s, instead of including the stationarity condition that \(\sum_{i=1}^{p} \alpha_{i} < 1\). The inclusion of the constraint \(\sum_{i=1}^{p} \alpha_{i} < 1\) had no effect on the examples considered in Neal and Subba Rao (2007), which were all clearly stationary, and without the constraint we can easily obtain π(y|x,x I ). We take a \({\rm gamma} (a_{\lambda}, b_{\lambda})\) prior for λ and for ease of exposition set a λ =1.

We now depart from Neal and Subba Rao (2007), who used the above data augmentation within a Gibbs sampler, by integrating out θ and identifying the sufficient statistics. For i=0,1,…,p, let \(K_{i} = \sum_{t=1}^{n} x_{t-i} \) and for i=1,2,…,p, let \(G_{i} (\mathbf{y}) = \sum_{t=1}^{n} y_{t,i}\). Then

$$\begin{aligned} & \pi(\theta, \mathbf{y} | \mathbf{x}, \mathbf{x}_I) \\ &\quad= \prod_{t=1}^n \Biggl\{ 1_{\{\sum_{i=1}^p y_{t,i} \leq x_t \}} \prod_{i=1}^p \biggl\{ \binom{x_{t-i}}{y_{t,i}} \alpha_i^{y_{t,i}} (1 - \alpha_i)^{x_{t-i} - y_{t,i}} \biggr\} \\ &\qquad\times\frac{\lambda^{x_t -\sum_{i=1}^p y_{t,i}}}{ (x_t -\sum_{i=1}^p y_{t,i})!} \exp(- \lambda) \Biggr\} \times e^{-b_\lambda\lambda} \\ &\quad= \prod_{t=1}^n \Biggl\{ \Biggl\{ 1_{\{\sum_{i=1}^p y_{t,i} \leq x_t \}} \prod_{i=1}^p \frac{x_{t-i}!}{y_{t,i}! (x_{t-i} - y_{t,i})!} \Biggr\} \\ &\qquad \times\frac{1}{ (x_t -\sum_{i=1}^p y_{t,i})!} \Biggr\} \\ & \qquad\times\prod_{i=1}^p \alpha_i^{G_i (\mathbf{y})} (1-\alpha_i)^{K_i - G_i (\mathbf{y})} \lambda^{K_0 - \sum_{i=1}^p G_i (\mathbf{y})} \\ &\qquad\times\exp\bigl(- (n + b_\lambda) \lambda\bigr). \end{aligned}$$

Integrating out θ yields

$$\begin{aligned} \pi( \mathbf{y} | \mathbf{x}, \mathbf{x}_I) & \propto\prod_{t=1}^n \Biggl\{ \Biggl\{ 1_{\{\sum_{i=1}^p y_{t,i} \leq x_t \}} \prod_{i=1}^p \frac{x_{t-i}!}{y_{t,i}! (x_{t-i} - y_{t,i})!} \Biggr\} \\ &\quad \times\frac{1}{ (x_t -\sum_{i=1}^p y_{t,i})!} \Biggr\} \\ &\quad\times\prod_{i=1}^p \frac{G_i (\mathbf{y})! (K_i - G_i (\mathbf{y}))!}{(K_i+1)!} \\ &\quad\times\frac{(K_0 - \sum_{i=1}^p G_i (\mathbf{y}))!}{ (n + b_\lambda)^{K_0 - \sum_{i=1}^p G_i (\mathbf{y}) +1}}. \end{aligned}$$
(4.4)

We note that for i=1,2,…,p, \(\alpha_{i} | \mathbf{y}, \mathbf{x}, \mathbf{x}_{I} \sim{\rm beta} (G_{i} (\mathbf{y})+1, K_{i} - G_{i} (\mathbf{y})+1)\) and \(\lambda| \mathbf{y}, \mathbf{x}, \mathbf{x}_{I} \sim \operatorname {gamma} (K_{0} - \sum_{i=1}^{p} G_{i} (\mathbf{y}) + 1, n+ b_{\lambda})\). Thus G(y)=(G 1(y),G 2(y),…,G p (y)) are sufficient statistics for the INAR(p) model.

For \(\mathbf{g} \in\prod_{i=1}^{p} [0,K_{i}]\), let \(C_{\mathbf{g}} = \sum_{\mathbf{y} \in\mathcal{S}_{\mathbf{g}}} \{ \prod_{t=1}^{n} c_{t,\mathbf{y}_{t}, \mathbf{x}} \}\), where from (4.4),

$$\begin{aligned} c_{t,\mathbf{y}_t, \mathbf{x}} =& \Biggl\{ \Biggl\{ 1_{\{\sum_{i=1}^p y_{t,i} \leq x_t \}} \prod _{i=1}^p \frac{x_{t-i}!}{y_{t,i}! (x_{t-i} - y_{t,i})!} \Biggr\} \\ &{} \times\frac{1}{ (x_t -\sum_{i=1}^p y_{t,i})!} \Biggr\} \end{aligned}$$
(4.5)

and \(\mathcal{S}_{\mathbf{g}} = \{ \mathbf{y}: \mathbf{G} (\mathbf{y}) = \mathbf{g} \}\). The construction of \(\mathcal{S}_{\mathbf{g}}\) and \(\{C_{\mathbf{g}_{j}} \}\) is then straightforward following the sequential procedure outlined in Sect. 2. Thus, for j=1,2,…,S, with g j =(g j1,g j2,…,g jp ), we have that

$$\begin{aligned} \begin{aligned}[t] & \pi(S_{\mathbf{y}, \mathbf{x}} = \mathbf{g}_j | \mathbf{x}, \mathbf{x}_I) \propto C_{\mathbf{g}_j} \prod _{i=1}^p g_{ji}! (K_i - g_{ji})! \\ &\quad\times \frac{(K_0 - \sum_{i=1}^p g_{ji})!}{ (n + b_\lambda )^{K_0 - \sum_{i=1}^p g_{ji} +1}}. \end{aligned} \end{aligned}$$

For p=1, S=K 1+1 and π(S(y,x)=g j |x,x I ) (j=1,2,…,S) can be computed very quickly using the natural ordering of the outcomes g 1=0, g 2=1, …, g S =K 1. For p>1, computation of the \(S \leq\prod_{i=1}^{p} (K_{i}+1)\) probabilities π(S(y,x)=g j |x,x I ) (j=1,2,…,S) is more time consuming, but remains feasible for p=2,3 as demonstrated in Sect. 4.2.
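To make the computation concrete, the following is a minimal R sketch for the p=1 case. It assumes that the sequential procedure of Sect. 2 can be implemented as a forward recursion over t which convolves the step contributions \(c_{t,\mathbf{y}_t, \mathbf{x}}\) of (4.5) into the table \(\{C_{\mathbf{g}}\}\); the function name, argument names and the rescaling device are illustrative choices rather than part of any existing implementation.

# Exact posterior of the sufficient statistic G_1(y) for the Poisson INAR(1) model.
# x0 is the initial observation x_0, x = (x_1, ..., x_n), and b_lambda is the rate of
# the gamma(1, b_lambda) prior on lambda, as in the text.
exact_posterior_inar1 <- function(x0, x, b_lambda = 1) {
  n <- length(x)
  xlag <- c(x0, x[-n])                     # x_{t-1} for t = 1, ..., n
  K0 <- sum(x); K1 <- sum(xlag)
  C <- c(1, rep(0, K1))                    # C[g + 1] proportional to C_g
  for (t in 1:n) {
    ymax <- min(x[t], xlag[t])             # enforces y_t <= x_t and y_t <= x_{t-1}
    cvals <- choose(xlag[t], 0:ymax) / factorial(x[t] - (0:ymax))  # c_{t,y_t,x} in (4.5)
    new <- rep(0, K1 + 1)
    for (y in 0:ymax) {                    # convolve step-t contributions into the table
      idx <- (y + 1):(K1 + 1)
      new[idx] <- new[idx] + C[idx - y] * cvals[y + 1]
    }
    C <- new / sum(new)                    # rescale to guard against under/overflow
  }
  g <- 0:K1
  # unnormalised log posterior of G_1(y) = g, cf. (4.4); constant factors are immaterial
  logpost <- log(C) + lfactorial(g) + lfactorial(K1 - g) - lfactorial(K1 + 1) +
    lfactorial(K0 - g) - (K0 - g + 1) * log(n + b_lambda)
  post <- exp(logpost - max(logpost))
  list(post = post / sum(post), K0 = K0, K1 = K1, n = n)
}

Posterior summaries then follow by averaging the conditional posteriors over this distribution; for example, \(\mathbb{E}[\alpha | \mathbf{x}, \mathbf{x}_I] = \sum_{g} \pi(S(\mathbf{y},\mathbf{x})=g | \mathbf{x}, \mathbf{x}_I) (g+1)/(K_1+2)\) and \(\mathbb{E}[\lambda | \mathbf{x}, \mathbf{x}_I] = \sum_{g} \pi(S(\mathbf{y},\mathbf{x})=g | \mathbf{x}, \mathbf{x}_I) (K_0-g+1)/(n+b_\lambda)\).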

4.2 Westgren’s gold particle data set

This data set consists of 370 counts of gold particles at equidistant points in time. The data, originally published in Westgren (1916), has been analysed in Jung and Tremayne (2006) and Neal and Subba Rao (2007) using an INAR(2) model. In Enciso-Mora et al. (2009a), it was shown using reversible jump MCMC that the INAR(2) model is the most appropriate model of INARMA(p,q) type for the data.

We obtain the exact posterior distribution for the parameters of the INAR(p) model (p=0,1,2,3). (The INAR(0) model corresponds to independent and identically distributed Poisson data.) We also computed the marginal log-likelihood (evidence) for each model for comparison. The results are presented in Table 2. There is clear evidence from the marginal log-likelihood for p=2. Applying the BIC-based penalisation prior used in Enciso-Mora et al. (2009a), where for p=0,1,2,3 the prior on the INAR(p) model was set proportional to \(n^{p/2}\), gives posterior probabilities of 0.0030, 0.7494 and 0.2476 for the INAR(1), INAR(2) and INAR(3) models, respectively. The total number of categories grows rapidly with the order p of the model: computing {π(y|x)} in Fortran95 took less than a second, 8 seconds and 45 minutes for the INAR(1), INAR(2) and INAR(3) models, respectively. It should be noted that the INAR(3) model was at the limits of what is computationally feasible, requiring over 1500 MB of computer memory. The memory limitation is due to the total number of categories, and for smaller data sets, either in terms of n or the magnitude of the x t ’s, it would be possible to study INAR(p) models with p>3. However, most practical interest in INAR(p) models is for p≤3.
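As an illustration of how the quoted posterior model probabilities arise, the short R snippet below combines marginal log-likelihoods (for example, the values reported in Table 2) with prior weights proportional to \(n^{p/2}\) as stated above; the function and argument names are illustrative and not part of any existing implementation.

# Posterior model probabilities for INAR(p), p = 1, 2, 3, from their marginal
# log-likelihoods and prior weights proportional to n^(p/2); n is the series length.
model_probabilities <- function(log_evidence, n, p = 1:3) {
  logw <- log_evidence + (p / 2) * log(n)   # log marginal likelihood + log prior weight
  w <- exp(logw - max(logw))                # subtract the maximum before exponentiating
  w / sum(w)
}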

Table 2 Posterior means (standard deviations) of the parameters and marginal log-likelihood for INAR(p) model for Westgren’s data set

For comparison, the MCMC algorithms of Neal and Subba Rao (2007) and Enciso-Mora et al. (2009a) were run on the gold particle data set, with findings similar to those presented above and to those reported in Neal and Subba Rao (2007) and Enciso-Mora et al. (2009a). For fixed p=1,2,3, the MCMC algorithm of Neal and Subba Rao (2007) was run for 110000 iterations with the first 10000 iterations discarded as burn-in. The algorithm took about 30 seconds to run, regardless of the value of p, although the mixing of the MCMC algorithm gets worse as p increases. Therefore, for analysis of the INAR(1) and INAR(2) models, it is quicker to compute the exact posterior distribution than to obtain a sufficiently large sample from an MCMC algorithm. For model selection, the reversible jump MCMC algorithm of Enciso-Mora et al. (2009a) was used, restricted to the INAR(p) models with p=1,2 or 3. The algorithm took approximately twice as long, per iteration, as the standard algorithm, due to the model switching step. Also, for accurate estimation of the posterior model probabilities, longer runs (10^6 iterations) of the algorithm are required, due to the low acceptance rate (0.4 %) of the model switching move. This is particularly the case for the estimation of the posterior model probability of the INAR(1) model, with typically only 1 or 2 excursions to this model in 10^6 iterations. Hence, for model selection, using the exact posterior distribution is preferable to reversible jump MCMC, especially for computing the posterior probability of rarer models, as noted in Sect. 1.

The exact one-step ahead predictive distribution \(\mathbb{P}(X_{n+1} =x_{n+1} | \mathbf{x})\) can easily be obtained, and the r-step ahead (r=2,3,…) predictive distributions then follow recursively. In particular, for the INAR(2) model, integrating over the posterior distribution of (α,λ), we get that

$$\begin{aligned} & \mathbb{P}(X_{n+1} = x_{n+1} | \mathbf{x}, \mathbf{x}_I) \\ &\quad= \sum_{j=1}^S C_{\mathbf{g}_j} \sum _{y_{n+1,1}} \sum_{y_{n+1,2}} \Biggl( 1_{\{ y_{n+1,1} + y_{n+1,2} \leq x_{n+1} \}}\binom{x_n}{y_{n+1,1}} \\ &\qquad\times \binom{x_{n-1}}{y_{n+1,2}} \\ &\qquad \times\prod_{i=1}^2 \biggl\{ \frac{(K_i+1)!}{g_{ji}! (K_i - g_{ji})!} \\ &\qquad\times\frac{(g_{ji}+y_{n+1,i})! (K_i + x_{n+1-i} - y_{n+1,i} - g_{ji})!}{(K_i + x_{n+1-i}+1)!} \biggr\} \\ &\qquad \times\frac{(n + b_\lambda)^{K_0 - \sum_{i=1}^2 g_{ji} +1}}{(K_0 - \sum_{i=1}^2 g_{ji})!} \\ &\qquad \times\frac{(K_0 - \sum_{i=1}^2 g_{ji} +x_{n+1} - \sum_{l=1}^2 y_{n+1,l})!}{(n +1+ b_\lambda)^{K_0 - \sum_{i=1}^2 g_{ji} + x_{n+1} - \sum_{l=1}^2 y_{n+1,l} +1}} \Biggr)\\ &\qquad\bigg/\sum_{k=1}^S C_{\mathbf{g}_k}. \end{aligned}$$

We compute the predictive distributions of X 371 and X 372, with the results given in Table 3. For X 372, we have to sum over all possibilities for X 371, of which there are infinitely many. However, we restrict the summation to X 371=0,1,…,10, since \(\mathbb{P}(X_{371} > 10) \leq10^{-6}\), and restricting the sum does not affect the computation of the probabilities to the four decimal places presented in Table 3. Given the focus of this paper, we did not evaluate the predictions or the adequacy of the INAR(p) model.
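For concreteness, the R sketch below gives the analogous one-step ahead predictive distribution for the p=1 model, using the output of the exact_posterior_inar1 sketch in Sect. 4.1; it simply averages the binomial-thinning and Poisson-innovation probabilities over the beta and gamma conditional posteriors given G 1(y)=g. The function and argument names are again illustrative.

# One-step ahead predictive P(X_{n+1} = xnew | x, x_I) for the Poisson INAR(1) sketch.
# 'fit' is the list returned by exact_posterior_inar1() and xn is the final count x_n.
predictive_inar1 <- function(xnew, xn, fit, b_lambda = 1) {
  g <- seq_along(fit$post) - 1
  p_given_g <- vapply(g, function(gi) {
    a <- fit$K0 - gi + 1                   # gamma(a, b) conditional posterior for lambda
    b <- fit$n + b_lambda
    y <- 0:min(xnew, xn)                   # survivors of the binomial thinning of x_n
    # beta-binomial term: alpha | g ~ beta(g + 1, K_1 - g + 1)
    thin <- exp(lchoose(xn, y) + lbeta(gi + 1 + y, fit$K1 - gi + 1 + xn - y) -
                  lbeta(gi + 1, fit$K1 - gi + 1))
    # gamma-mixed Poisson (negative binomial) term for the innovations
    innov <- exp(lgamma(a + xnew - y) - lgamma(a) - lfactorial(xnew - y) +
                   a * log(b) - (a + xnew - y) * log(b + 1))
    sum(thin * innov)
  }, numeric(1))
  sum(fit$post * p_given_g)                # P(X_{n+1} = xnew | x, x_I)
}

For example, with fit the output of exact_posterior_inar1 and xn the final observed count, sapply(0:10, predictive_inar1, xn = xn, fit = fit) returns \(\mathbb{P}(X_{n+1} = 0 | \mathbf{x}, \mathbf{x}_I), \ldots, \mathbb{P}(X_{n+1} = 10 | \mathbf{x}, \mathbf{x}_I)\) for the INAR(1) model.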

Table 3 Marginal predictive distribution of X 371 and X 372 for the INAR(2) model for Westgren’s data set

4.3 US polio data set

The US polio data set, from Zeger (1988), consists of the monthly counts of polio (poliomyelitis) cases in the USA from January 1970 to December 1983. The data has been studied by Zeger (1988) and Davis et al. (2000) using Poisson regression models, and by Enciso-Mora et al. (2009b) using an INAR(1) model with covariates. We consider two simple INAR(1) models for the polio data. The first model has Poisson innovations \((Z_{t} \sim{\rm Po} (\lambda))\) as in Sect. 4.2 and the second model has geometric innovations \((Z_{t} \sim{\rm Geom} (\beta), \mathbb{P}(Z_{t} =k) = (1-\beta)^{k} \beta (k=0,1,\ldots))\). We denote the models by PINAR (Poisson) and GINAR (Geometric), respectively. Although the more complex models studied in Zeger (1988), Davis et al. (2000) and Enciso-Mora et al. (2009b), which take into account seasonality and trend, are probably more appropriate for this data, this example allows us to further demonstrate the scope of the approach taken in this paper.

A geometric, or more generally a negative binomial or binomial, innovation distribution falls within the class of models with multinomial-beta data studied in Sect. 3, with

$$\begin{aligned} & \mathbb{P}(X_t = x_t, Y_t =y_t | \alpha, \beta, X_{t-1} = x_{t-1}) \\ &\quad= 1_{\{ y_t \leq x_t \}} \binom{x_{t-1}}{y_t} \alpha^{y_t} (1- \alpha)^{x_{t-1} -y_t} (1-\beta)^{x_t -y_t} \beta, \end{aligned}$$

and

$$\begin{aligned} & \pi(\mathbf{y} | \mathbf{x}, \mathbf{x}_I) \\ &\quad\propto \prod_{t=1}^n \biggl\{ 1_{\{ y_t \leq x_t \}} \binom{x_{t-1}}{y_t} \biggr\} \frac{G_1 (\mathbf{y})! (K_1 - G_1 (\mathbf{y}))!}{(K_1 +1)!}\\ &\qquad\times\frac{n! (K_0 - G_1 (\mathbf{y}))!}{(n + K_0 +1 -G_1 (\mathbf{y}))!}. \end{aligned}$$

The total number of possibilities for G 1(y) for the INAR(1) models is 101, and results were obtained instantaneously using either Fortran95 or R. The results are summarised in Table 4 and show that there is far stronger evidence for the GINAR model than for the PINAR model. This is not surprising given the sudden spikes in the data, with extreme values far more likely under a Geometric distribution than under a Poisson distribution with the same mean. The posterior mean of α for the GINAR model is only 0.0977, which suggests that an appropriate model could be to assume that the data are independent and identically distributed according to a Geometric distribution. However, as shown in Table 4, such a model has a significantly lower marginal log-likelihood than the GINAR model, so the dependence in the data cannot be ignored.
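For completeness, the following R sketch adapts the same recursion to the geometric-innovation (GINAR) model, again assuming the forward-recursion reading of the sequential procedure of Sect. 2 and independent uniform priors on α and β as in the text; relative to the Poisson case, the step weights lose the \(1/(x_t - y_t)!\) factor and the factor depending on G 1(y) changes as in the display above. The function name is illustrative.

# Exact posterior of G_1(y) for the GINAR model (geometric innovations), mirroring the
# Poisson INAR(1) sketch above; x0 is x_0 and x = (x_1, ..., x_n).
exact_posterior_ginar1 <- function(x0, x) {
  n <- length(x)
  xlag <- c(x0, x[-n])
  K0 <- sum(x); K1 <- sum(xlag)
  C <- c(1, rep(0, K1))
  for (t in 1:n) {
    ymax <- min(x[t], xlag[t])
    cvals <- choose(xlag[t], 0:ymax)       # no 1/(x_t - y_t)! term for geometric innovations
    new <- rep(0, K1 + 1)
    for (y in 0:ymax) {
      idx <- (y + 1):(K1 + 1)
      new[idx] <- new[idx] + C[idx - y] * cvals[y + 1]
    }
    C <- new / sum(new)
  }
  g <- 0:K1
  # unnormalised log posterior of G_1(y) = g, following the display above
  logpost <- log(C) + lfactorial(g) + lfactorial(K1 - g) - lfactorial(K1 + 1) +
    lfactorial(n) + lfactorial(K0 - g) - lfactorial(n + K0 + 1 - g)
  post <- exp(logpost - max(logpost))
  post / sum(post)
}

The posterior means of α and β then follow from the conditional posteriors α|G 1(y)=g ~ beta(g+1, K 1−g+1) and β|G 1(y)=g ~ beta(n+1, K 0−g+1), averaged over this distribution.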

Table 4 Posterior means (standard deviations) of the parameters and marginal log-likelihood for the models for US polio data set

5 Conclusions

We have presented a generic framework for obtaining the exact posterior distribution π(θ|x) using data augmentation, y. The models which can be analysed in this way are generally fairly simple (Poisson mixtures (Fearnhead 2005), the household Reed-Frost epidemic model and INAR(p) models), but they have been widely applied and offer a useful benchmark against which to compare more complex models. The computation time for the exact posterior distribution compares favourably with MCMC, especially for smaller data sets, with all the computations taking less than 15 seconds in Fortran95 on a computer with a dual 1 GHz Pentium III processor, with the exception of the combined Tecumseh data set and the INAR(3) model. Moreover, for the genetics linkage and Seattle data sets and the INAR(1) models, the posterior distribution took less than 15 seconds to compute in R on a standard desktop computer. The key elements in the feasibility of the method are the identification of the sufficient statistics S(y,x) and the easy computation of π(S(y,x)|x).

This paper has focused upon the case where the data x correspond to discrete outcomes, for reasons discussed in Sect. 2.4. Thus, throughout the examples in Sects. 3 and 4, it has been assumed that the data have arisen from mixtures or sums of discrete distributions, such as the binomial, negative binomial and Poisson distributions. However, the approach taken can be extended to models based upon other discrete distributions, such as the discrete uniform distribution. The key requirement is that an appropriate choice of y can be found such that π(θ|x,y) belongs to a well-known family of probability distributions, typically a product of independent univariate densities, and that π(y|x) can easily be obtained by integrating out θ.

The methodology presented in this paper is not restricted to models satisfying (2.7), with change-point models (Fearnhead 2006) being a prime example. A similar approach to that of Sect. 3 can be used to obtain the exact (joint) posterior distribution of the parameters and the change-point for the Markov change-point model of Carlin et al. (1992, Sect. 5). Ongoing research involves extending the work in Sect. 4 to INARMA(p,q) processes.