1.1 Introduction

Statistics mainly aim at addressing two major things. First, we wish to learn or draw conclusions about an unknown quantity, \(\theta \in \Theta \) called ‘the parameter’, which cannot be directly measured or observed, by measuring or observing a sequence of other quantities called ‘observations (or data, or samples)’ \(x_{1:n}:=(x_{1},\ldots ,x_{n})\in \mathcal {X}^{m}\) whose generating mechanism is (or can be considered as) stochastically dependent on the quantity of interest \(\theta \) though a probabilistic model \(x_{1:n}\sim f(\cdot |\theta )\). This is an inverse problem since we wish to study the cause \(\theta \) by knowing its effect \(x_{1:n}\). We will refer to this as parametric inference. Second, we wish to learn the possible values of a future sequence of observations \(y_{1:m}\in \mathcal {X}^{m}\) given \(x_{1:n}\). This is a forward problem, and we will call it predictive inference. Here, we present how both inferences can be addressed in the Bayesian paradigm.Footnote 1

Consider a sequence of observables \(x_{1:n}:=(x_{1},\ldots ,x_{n})\) generated from a sampling distribution \(f(\cdot |\theta )\) labeled by the unknown parameter \(\theta \in \Theta \). The statistical model \(\mathfrak {m}\) consists of the observations \(x_{1:n}\), and their sampling distribution \(f(\cdot |\theta )\) ; \(\mathfrak {m}=(f(\cdot |\theta );\,\theta \in \Theta )\).

Unlike in Frequentist statistics, in Bayesian statistics unknown/uncertain parameters are treated as random quantities and hence follow probability distributions. This is justified by adopting the subjective interpretation of probability [4], as the degree of the researcher’s believe about the uncertain parameter \(\theta \). Central to the Bayesian paradigm is the specification of the so-called prior distributions \(\text {d}\pi (\theta )\) on the uncertain parameters \(\theta \) representing the degree of believe (or state of uncertainty) of the researcher about the parameter. Different researchers may specify different prior probabilities, as this is in accordance to the subjective nature of the probability. The specification of the prior is discussed in Sect. 1.2.

The Bayesian model consists of the statistical model \(f(x_{1:n}|\theta )\) containing the information about \(\theta \) available from the observed data \(x_{1:n}\), and the prior distribution \(\pi (\theta )\) reflecting the researcher’s believe about \(\theta \) before the data collection. It is denoted as

$$\begin{aligned} (f(x_{1:n}|\theta ),\pi (\theta ))\,\,\,\text {or as} \,\,\, {\left\{ \begin{array}{ll} x_{1:n}|\theta &{} \sim f(\cdot |\theta )\\ \theta &{} \sim \pi (\cdot ) \end{array}\right. }. \end{aligned}$$

Bayesian parametric inference relies on the posterior distribution \(\pi (\theta |x_{1:n})\) whose density or mass function (PDF or PMF) is calculated by using the Bayes theorem

$$\begin{aligned} \pi (\theta |x_{1:n})&=\frac{f(x_{1:n}|\theta )\pi (\theta )}{\int _{\Theta }f(x_{1:n}|\theta )\pi (\text {d}\theta )} \end{aligned}$$
(1.1)

as a tool to invert the conditioning from \(x_{1:n}|\theta \) to \(\theta |x_{1:n}\). Posterior distribution (1.1) quantifies the researcher’s degree of believe after taking into account the observations. By using subjective probability arguments, we can see interpret (1.1) as a mechanism that updates the researcher’s degree of believe from the prior \(\pi (\theta )\) to the posterior \(\pi (\theta |x_{1:n})\) in the light of the observations collected.

Bayesian predictive inference about a future observation \(y_{*}\) can be addressed based on the predictive distribution defined as

$$\begin{aligned} p(y|x_{1:n})=\int _{\Theta }f(y|\theta )\pi (\text {d}\theta |x_{1:n})=\text {E}_{\pi }(f(y|\theta )|x_{1:n}). \end{aligned}$$
(1.2)

Essentially, it is the expected value of the sampling distribution averaging out the uncertain parameter \(\theta \) with respect to its posterior distribution reflecting the researcher’s degree of believe.

Although the posterior and predictive distributions quantify the researcher’s knowledge, they are not enough to give a solid answer about the quantity to be learned. In what follows we discuss important concepts based on decision theory which are used for Bayesian inference.

1.2 Specification of the Prior

Prior distribution \(\pi (\theta )\) needs to reflect the researcher’s degree of believe about the uncertain parameter \(\theta \in \Theta \). Sophisticated prior distributions often lead to ineluctable posterior or predictive probabilities, and hence Bayesian analysis. Following, we present a computationally convenient class of priors applicable to several scenarios.

1.2.1 Conjugate Priors

Conjugate priors is a mathematically convenient way to specify the prior model in certain cases. They facilitate the tractable implementation of the Bayesian statistical analysis, by leading to computationally tractable posterior distributions.

Formally, if \(\mathcal {F}=\{f(\cdot |\theta );\forall \theta \in \Theta \}\) is a class of parametric models (sampling distributions), and \(\mathcal {P}=\{\pi (\theta |\tau );\forall \tau \}\) is a class of prior distributions for \(\theta \), then the class \(\mathcal {P}\) is conjugate for \(\mathcal {F}\) if

$$ \pi (\theta |x_{1:n})\in \mathcal {P},\,\,\,\forall f(\cdot |\theta )\in \mathcal {F}\,\text {and}\,{\pi (\cdot )\in \mathcal {P}}. $$

It is straightforward to specify a conjugate prior when the sampling distribution is member of the exponential family. Consider observation \(x_{i}\) generated from a sampling distribution in the exponential family

$$\begin{aligned} x_{i}|\theta&\overset{\text {IID}}{\sim }\text {Ef}_{k}(u,g,h,\phi ,\theta ,c);\quad i=1,\ldots ,n \end{aligned}$$

with density \(\text {Ef}_{k}(x|u,g,h,\phi ,\theta ,c)=u(x)g(\theta )\exp (\sum _{j=1}^{k}c_{j}\phi _{j}(\theta )(\sum _{i=1}^{n}h_{j}(x)))\) and \(g(\theta )=1/\int u(x)\exp (\sum _{j=1}^{k}c_{j}\phi _{j}(\theta )(\sum _{i=1}^{n}h_{j}(x)))\text {d}x\). The likelihood function is equal to

$$\begin{aligned} f(x_{1:n}|\theta )&=\prod _{i=1}^{n}u(x_{i})g(\theta )^{n}\exp (\sum _{j=1}^{k}c_{j}\phi _{j}(\theta )(\sum _{i=1}^{n}h_{j}(x_{i}))). \end{aligned}$$
(1.3)

The conjugate prior, corresponding to likelihood (1.3), admits density of the form

$$\begin{aligned} \pi (\theta |\tau )&=\frac{1}{K(\tau )}g(\theta )^{\tau _{0}}\exp (\sum _{j=1}^{k}c_{j}\phi _{j}(\theta )\tau _{j}) \end{aligned}$$
(1.4)

where \(\tau =(\tau _{0},\ldots ,\tau _{k})\) is such that \(K(\tau )=\int _{\Theta }g(\theta )^{\tau _{0}}\exp (\sum _{j=1}^{k}c_{j}\phi _{j}(\theta )\tau _{j})\text {d}\theta <\infty \). The resulting posterior of \(\theta \) has the form

$$ \pi (\theta |x_{1:n},\tau )=\frac{1}{K(\tau ^{*})}g(\theta )^{\tau _{0}^{*}}\exp (\sum _{j=1}^{k}c_{j}\phi _{j}(\theta )\tau _{j}^{*})) $$

with \(\tau ^{*}=(\tau _{0}^{*},\tau _{1}^{*},\ldots ,\tau _{k}^{*})\), \(\tau _{0}^{*}=\tau _{0}+n\), and \(\tau _{j}^{*}=\sum _{i=1}^{n}h_{j}(x_{i})+\tau _{j}\) for \(j=1,\ldots ,k\).

It is easy to see that (1.4) is conjugate to (1.3) as the posterior can be re-written as \(\pi (\theta |x_{1:n},\tau )=\pi (\theta |\tau ^{*})\) where \(\tau ^{*}=\tau +t_{n}(x_{1:n})\), and \(t_{n}(x_{1:n})=(n,\sum _{i=1}^{n}h_{1}(x_{i}),\ldots ,\sum _{i=1}^{n}h_{k}(x_{i}))\). You can check the demo in.Footnote 2

Example: Bernoulli model (Cont.)

Consider observations \(x_{1:n}=(x_{1},\ldots ,x_{n})\in \mathbb {R}^{n}\) generated from a Bernoulli distribution with success rate \(\theta \in [0,1]\); i.e., \(x_{i}|\theta \sim \text {Br}(\theta )\), \(i=1,\ldots ,n\). Interest lies in specifying a conjugate prior for \(\theta \).

The sampling distribution is member of the exponential family, with \(u(x)=1\), \(g(\theta )=(1-\theta )\), \(c_{1}=1\), \(\phi _{1}(\theta )=\log (\frac{\theta }{1-\theta })\), \(h_{1}(x)=x\), because

$$\begin{aligned} f(x|\theta )&=\text {Br}(x|\theta )=\theta ^{x}(1-\theta )^{1-x}=(1-\theta )\exp (\log (\frac{\theta }{1-\theta })x). \end{aligned}$$

The corresponding conjugate prior has PDF such as

$$\begin{aligned} \pi (\theta |\tau )&\propto (1-\theta )^{\tau _{0}}\exp (\log (\frac{\theta }{1-\theta })\tau _{1})\,=\theta ^{(\tau _{1}+1)-1}(1-\theta )^{(\tau _{0}-\tau _{1}+1)-1}, \end{aligned}$$

where we recognize Beta distribution \(\pi (\theta |\tau )=\text {Be}(\theta |a,b)\), with \(a=\tau _{1}+1\), \(b=\tau _{0}-\tau _{1}+1\). Therefore, the posterior distribution is

$$ \pi (\theta |x_{1:n},\tau )=\pi (\theta |\tau _{0}+n,\tau +\sum _{i=1}^{n}h(x_{i}))\propto \theta ^{(\tau _{1}+n\bar{x}+1)-1}(1-\theta )^{(\tau _{0}+n-\tau _{1}-n\bar{x}+1)-1} $$

which is \(\text {Be}(\theta |a^{*},b^{*})\), with \(a^{*}=a+n\bar{x}\), and \(b^{*}=b+n-n\bar{x}\).

1.3 Point Estimation

Often interest lies in learning the ‘true’ value of the unknown parameter \(\theta \in \Theta \), or the future values of a future sequence of observations \(y_{1:m}\in \mathcal {X}^{m}\); this is performed via the Bayesian point estimator. Here, we demonstrate the theory of the Bayesian point estimator in parametric inference, and leave the extension to the predictive inference to the reader.

Bayes (parametric) point estimator of \(\theta \in \Theta \) with respect to the loss function \(\ell (\theta ,\delta )\) and the posterior distribution \(\pi (\theta |x_{1:n})\) is an Bayes rule \(\delta ^{\pi }\) which minimizes \(\int _{\Theta }\ell (\theta ,\delta )\pi (\text {d}\theta |x_{1:n})\); i.e.,

$$\begin{aligned} \delta ^{\pi }(x_{1:n})&=\arg \min _{\forall \delta \in \Theta }\text {E}_{\pi }(\ell (\theta ,\delta )|x_{1:n})=\arg \min _{\forall \delta \in \Theta }\int _{\Theta }\ell (\theta ,\delta )\pi (\text {d}\theta |x_{1:n}). \end{aligned}$$
(1.5)

Often the accuracy of the Bayes point estimator is represented by its standard error. A commonly accepted metric for the standard error of the j-th dimension of the estimator \(\delta ^{\pi }\) is

$$ \text {se}_{\pi }(\delta _{j}|x_{1:n})=\sqrt{\text {MSE}_{\pi }(\delta _{j}|x_{1:n})} $$

where \(\text {MSE}_{\pi }(\delta _{j}|x_{1:n})=[\text {E}_{\pi }\left( (\theta -\delta )(\theta -\delta )^{\top }|x_{1:n}\right) ]_{j,j}\) is the mean squared error of \(\delta _{j}\).

A number of standard Bayesian point estimates, under different loss functions, are location summary statistics of the posterior distribution (mean, median, mode, quantiles, etc.) You can check the demo in.Footnote 3

The Bayesian estimate of \(\theta \) with respect to the linear loss \(\ell (\theta ,\delta )=c_{1}(\delta -\theta )\text {1}_{\theta \le \delta }(\delta )+c_{2}(\theta -\delta )\text {1}_{\{\theta \le \delta \}^{c}}(\delta )\) is the \(\frac{c_{2}}{c_{1}+c_{2}}\)-th posterior quantile; i.e., \(\pi (\theta \in (-\infty ,\delta (x_{1:n}))|x_{1:n})=\frac{c_{2}}{c_{1}+c_{2}}\). The linear loss function essentially allows the adjustment of the penalty between over-estimating and under-estimating \(\theta \), by adjusting \(c_{1}\)and \(c_{2}\). In particular, for \(c_{1}=c_{2}\), we get the absolute loss \(\ell (\theta ,\delta )=|\theta -\delta |\) and the posterior estimator is the posterior median

$$\begin{aligned} \delta (x_{1:n})=\text {median}_{\pi }(\theta |x_{1:n}). \end{aligned}$$
(1.6)

The absolute loss is more appropriate when over-estimation and under-estimation are of the same concern (as penalized the same).

The Bayes estimate \(\delta ^{\pi }(x_{1:n})\) of \(\theta \) with respect to the quadratic loss function \(\ell (\theta ,\delta )=(\theta -\delta )^{2}\) is

$$\begin{aligned} \delta ^{\pi }(x_{1:n})=\text {E}_{\pi }(\theta |x_{1:n}). \end{aligned}$$
(1.7)

The posterior mean of \(\theta \) as an estimator of \(\theta \) essentially minimizes the estimator error \(\text {se}_{\pi }(\delta |x_{1:n})\), which is equal to the posterior standard error. Obviously, the standard error of the estimator (1.7) is equal to the posterior standard error. Compared to the absolute loss, the quadratic loss aims at over-penalizing large but unlikely errors. In fact, quadratic loss aims at minimizing the standard error \(\text {se}_{\pi }(\delta |x_{1:n})\).

Finally, the Bayesian estimate of \(\theta \) with respect to the zero-one loss \(\ell (\theta ,\delta )=1-\text {1}_{B_{\epsilon }(\delta )}(\theta )\) is the posterior mode

$$\begin{aligned} \delta (x_{1:n})=\text {mode}_{\pi }(\theta |x_{1:n}) \end{aligned}$$
(1.8)

as \(\epsilon \rightarrow 0\).

Example: Bernoulli model (Cont.)

Interest lies in calculating the Bayesian point estimator under the absolute loss function. This is the Maximum A posteriori Estimator (the posterior mode). It is

$$ \log (\pi (\theta |x_{1:n}))\propto (n\bar{x}+a-1)\log (\theta )+(n-n\bar{x}+b-1)\log (1-\theta ). $$

For \(a>0\), \(b>0\), \(\frac{\text {d}}{\text {d}\theta }\log (\pi (\theta |x_{1:n}))|_{p=\delta (x)}=0\) implies \(\delta (x)=\frac{n\bar{x}+a-1}{n+a+b-2}\). Note that (a.) if \(a\rightarrow 1\), \(b\rightarrow 1\) (aka \(\pi (\theta |a,b)\propto 1\)), then \(\delta ^{\pi }(x)=\bar{x}\) similar to frequentists stats; (b.) if \(a\rightarrow 0\), \(b\rightarrow 0\) (aka \(\pi (\theta |a,b)\propto \theta ^{-1}(1-\theta )^{-1}\)), then \(\delta (x)=\frac{n\bar{x}-1}{n-2}\); if \(a\rightarrow 1/2\), \(b\rightarrow 1/2\) (aka \(\pi (\theta |a,b)\propto \theta ^{-1/2}(1-\theta )^{-1/2}\)), then \(\delta (x)=\frac{n\bar{x}-1/2}{n-1}\); if \(n\rightarrow \infty \), \(a>0\), \(b>0\), then \(\delta (x)=\bar{x}.\)

1.4 Credible Sets

Instead of just reporting parametric (or predictive) point estimates for \(\theta \) (or \(y_{1:m}\)), it is often desirable and more useful to report a subset of values \(C_{a}\subseteq \Theta \) (or \(C_{a}\subseteq \mathcal {X}^{m}\)) where the posterior (or predictive) probability that \(\theta \in C_{a}\) (or \(y_{1:m}\in C_{a}\)) is equal to a certain value a reflecting one’s degree of believe.

The definition below describes the credible set [1, 5].

Definition 1.1

(Posterior Credible Set) A set \(C_{a}\subseteq \Theta \) such that

$$ \pi (\theta \in C_{a}|x_{1:n})=\int _{C_{a}}\pi (\text {d}\theta |x_{1:n})\ge 1-a $$

is called ‘\(100(1-a)\%\)’ posterior credible set for \(\theta \), with respect to the posterior distribution \(\pi (\text {d}\theta |x_{1:n})\).

In contrast to the frequentist stats, in Bayesian stats we can speak meaningfully of the probability that \(\theta \) is in \(C_{a}\), because probability \(1-a\) reflects one’s degree of believe that \(\theta \in C_{a}\).

Among all the credible sets \(C_{a}\) in Definition 1.1, we are often interested in those that have the minimum volume. It can be proved [2] that the highest probability density (HPD) sets have this property. HPD consider those values of \(\theta \) corresponding to the highest posterior pdf/pmf (aka the most likely values of \(\theta \)).

Definition 1.2

(Posterior highest probability density (HPD) set) The \(100(1-a)\%\) highest probability density set for \(\theta \in \Theta \) with respect to the posterior distribution \(\pi (\theta |x_{1:n})\) is the subset \(C_{a}\) of \(\Theta \) of the form

$$\begin{aligned} C_{a}=\{\theta \in \Theta :\pi (\theta |x_{1:n})\ge k_{a}\} \end{aligned}$$
(1.9)

where \(k_{a}\) is the largest constant such that

$$\begin{aligned} \pi (\theta \in C_{a}|x_{1:n})\ge 1-a. \end{aligned}$$
(1.10)

From the decision theory perspective, HPD set \(C_{a}\) is the Bayes estimate of \(C_{a}\) the credible interval under the loss function \(\ell (C_{a},\theta )=k|C_{a}|-1_{C_{a}}(\theta )\), for \(k>0\) which penalizes sets with larger volumes. The proof is available in [2].

Example: Multivariate Normal model

Consider observations \(x_{1},\ldots ,x_{n}\) independently drawn from a q-dimensional normal \(\text {N}_{q}(\mu ,\Sigma )\) with unknown \(\mu \in \mathbb {R}^{q}\), \(q\ge 1\), and known \(\Sigma \), \(\mu _{0}\), \(\Sigma _{0}\). Assume prior \(\mu \sim \text {N}_{q}(\mu _{0},\Sigma _{0})\). Interest lies in calculating the \(C_{a}\) parametric HPD credible interval for \(\mu \).

The posterior PDF of \(\mu \) is

$$\begin{aligned} \pi (\mu |x_{1:n})&\propto f(x_{1:n}|\mu )\pi (\mu )=\prod _{i=1}^{n}\text {N}_{q}(x_{i}|\mu ,\Sigma ) {\text {N}_{q}}(\mu |\mu _{0},\Sigma _{0})\\&\propto \exp (-\frac{1}{2}(\mu -\hat{\mu }_{n})^{T}\hat{\Sigma }_{n}^{-1}(\mu -\hat{\mu }_{n}))\propto {\text {N}_{q}}(\mu |\hat{\mu }_{n},\hat{\Sigma }_{n}) \end{aligned}$$

where \(\Sigma _{n}=(n\Sigma ^{-1}+\Sigma _{0}^{-1})^{-1}\), and \(\hat{\mu }_{n}=\hat{\Sigma }_{n}(n\Sigma ^{-1}\bar{x}+\Sigma _{0}^{-1}\mu _{0})\). So \(\mu |x_{1:n}\sim {\text {N}_{q}}(\hat{\mu }_{n},\hat{\Sigma }_{n})\).

From Definition 1.2, the credible set has the form

$$\begin{aligned} C_{a}&=\{\mu \in \mathbb {R}^{q}:\pi (\mu |x_{1:n})\ge k_{a}\} \\&=\{\mu \in \mathbb {R}^{q}:(\mu -\hat{\mu }_{n})^{T}\hat{\Sigma }_{n}^{-1}(\mu -\hat{\mu }_{n})\le -\log (2\pi \det (\hat{\Sigma }_{n})))k_{a}=\tilde{k}_{a}\} \end{aligned}$$

where \(k_{a}\) is the greatest value satisfying

$$\begin{aligned} \pi _{{\text {N}_{q}}(\hat{\mu }_{n},\hat{\Sigma }_{n})}(\mu \in C_{a}|x_{1:n})&\ge 1-a\Longleftrightarrow \nonumber \\ \pi _{\chi _{q}^{2}}((\mu -\hat{\mu }_{n})^{T}\hat{\Sigma }_{n}^{-1}(\mu -\hat{\mu }_{n})\le \tilde{k}_{a})&\ge 1-a. \end{aligned}$$
(1.11)

Here, \((\mu -\hat{\mu }_{n})^{T}\hat{\Sigma }_{n}^{-1}(\mu -\hat{\mu }_{n})\sim \chi _{q}^{2}\) as a sum of squares of independent standard normal random variables, and hence \(\tilde{k}_{a}\) is the \(1-a\)-th quantile of the \(\chi _{q}^{2}\) distribution; i.e., \(\tilde{k}_{a}=\chi _{q,1-a}^{2}\). Therefore, \(C_{a}\) parametric HPD credible set for \(\mu \) is

$$ C_{a}=\{\mu \in \mathbb {R}^{q}:(\mu -\hat{\mu }_{n})^{T}\hat{\Sigma }_{n}^{-1}(\mu -\hat{\mu }_{n})\le \chi _{q,1-a}^{2}\} $$

In real applications, the calculation of the credible interval might be intractable, due to the inversion in (1.9) or integration in (1.10). Below, we present a Naive algorithm [1] that can be implemented in a computer.Footnote 4

  • Create a routine which computes all solutions \(\theta ^{*}\) to the equation \(\pi (\theta |x_{1:n})=k_{a}\), for a given \(k_{a}\). Typically, \(C_{a}=\{\theta \in \Theta :\pi (\theta |x_{1:n})\ge k_{a}\}\) can be constructed from those solutions.

  • Create a routine which computes \(\pi (\theta \in C_{a}|x_{1:n})=\int _{\theta \in C_{a}}\pi (\theta |x_{1:n})\text {d}\theta \)

  • Numerically solve the equation \(\pi (\theta \in C_{a}|x_{1:n})=1-a\) as \(k_{a}\) varies.

Figure 1.1 demonstrates the above procedure in 1D unimodal and tri-modal cases. Specifically, the red horizontal bar denotes \(k_{a}\) moves upwards, and intersects the density at locations which are the potential boundaries of \(C_{a}\). The bar stops to move when the total density above regions of the parametric space is equal to \(1-\alpha \). The HPD credible set results as the union of these sub-regions. You can check the demo in.Footnote 5

Fig. 1.1
figure 1

Schematic of 1D HPD set

Theorem 1.1 suggests a computationally convenient way to calculate HPD credible intervals in 1D, and unimodal cases. The proof is available in [3].

Theorem 1.1

Let \(\theta \) follows a distribution with unimodal density \(\pi (\theta |x_{1:n})\). If the interval \(C_{a}=[L,U]\) satisfies

  1. 1.

    \(\int _{L}^{U}\pi (\theta |x_{1:n})\text {d}\theta =1-a\),

  2. 2.

    \(\pi (U)=\pi (L)>0\), and

  3. 3.

    \(\theta _{\text {mode}}\in (L,U)\), where \(\theta _{\text {mode}}\) is the mode of \(\pi (\theta |x_{1:n})\),

then interval \(C_{a}=[L,U]\) is the HPD interval of \(\theta \) with respect to \(\pi (\theta |x_{1:n})\).

Example: Bernoulli model (Cont.)

Interest lies in calculating the 2-sides \(95\%\) HPD interval for \(\theta \), given a sample with \(n=30\), and \(\sum _{i=1}^{30}x_{i}=15\), and prior hyper-parameters \(a=b=2\).

The posterior distribution of \(\theta \) is \(\text {Be}(a+n\bar{x}=17,b+n-n\bar{x}=17)\), which is 1D and unimodal; hence we use Theorem 1.1. It is

$$\begin{aligned} 1-a&=\int _{L}^{U}\text {Be}(\theta |17,17)\text {d}\theta =\text {Be}(\theta<U|17,17)-\text {Be}(p<L|17,17). \end{aligned}$$

Note that Beta PDF is symmetric around 0.5 when \(a^{*}=b^{*}\), and so is here where \(\text {Be}(17,17)\). Then,

$$\begin{aligned} 1-a&=\text {Be}(\theta<U|17,17)-(1-\text {Be}(\theta<U|17,17))=2\text {Be}(\theta <U|17,17)-1 \end{aligned}$$

so \(\text {Be}(\theta <U|17,17)=1-a/2\) and \(L=1-U\). For \(a=0.95\), the \(95\%\) posterior credible interval for \(\theta \) is \([L,U]=[0.36,0.64].\)

Remark 1.1

Predictive credible sets for a future sequence of observations \(y_{1:m}\), are defined and constructed as parametric ones by replacing \(\theta \) with \(y_{1:m}\) and \(\pi (x_{1:n}|\theta )\) with \(p(y_{1:m}|x_{1:n})\) in Definitions 1.1 and 1.2, and their consequences in this section. It is left as an Exercise.

1.5 Hypothesis Test

Often there is interest in reducing the overall parametric space \(\Theta \) (aka the set of possible values of that the uncertain parameter \(\theta \) can take) to a smaller subset. For instance; whether the proportion of Brexiters is larger than 0.5 (\(p>0.5\)) or not (\(p\le 0.5\)).

Such a decision can be formulated as a hypothesis test [1], namely the decision procedure of choosing between two non-overlapping hypotheses

$$\begin{aligned} \text {H}_{0}:\,\theta \in \Theta _{0}\quad \text {vs}\quad \text {H}_{1}:\,\theta \in \Theta _{1} \end{aligned}$$
(1.12)

where \(\{\Theta _{0},\Theta _{1}\}\) partitions the space \(\Theta \). Typically, hypotheses, \(\{\text {H}_{k}\}\), are categorized in three categories. Single hypothesis for \(\theta \) is called the hypothesis where \(\Theta _{j}=\{\theta _{j}\}\) contains a single element. Composite hypothesis for \(\theta \) is called the hypothesis where \(\Theta _{j}\subseteq \Theta \) contains many elements. General alternative hypothesis for \(\theta \) is called the composite hypothesis where \(\Theta _{1}=\Theta -\{\theta _{0}\}\) when it is compared against a single hypothesis \(\text {H}_{0}:\theta =\theta _{0}\). It is denoted as \(\text {H}_{1}:\theta \ne \theta _{0}\).

Based on the partitioning implied by (1.12), the overall prior \(\pi \) can be expressed as \(\pi (\theta )=\pi _{0}\times \pi _{0}(\theta )+\pi _{1}\times \pi _{1}(\theta )\) where \(\pi _{k}=\int _{\Theta _{k}}\pi (\text {d}\theta )\), and \(\pi _{k}(\theta )=\frac{\pi (\theta )\text {1}_{\Theta _{k}}(\theta )}{\int _{\Theta _{k}}\pi (\text {d}\theta )}\). Here, \(\pi _{0},\) and \(\pi _{1}\) describe the prior probabilities on \(\text {H}_{0}\) and \(\text {H}_{1}\), respectively, while \(\pi _{0}(\theta )\) and \(\pi _{1}(\theta )\) describe how the prior mass is spread out over the hypotheses \(\text {H}_{0}\) and \(\text {H}_{1}\), respectively.

We could see the hypothesis testing (1.12) as parametric point inference about the indicator function

$$\begin{aligned} \text {1}_{\Theta _{1}}(\theta )={\left\{ \begin{array}{ll} 0 &{} ,\,\theta \in \Theta _{0}\\ 1 &{} ,\,\theta \in \Theta _{1} \end{array}\right. }. \end{aligned}$$
(1.13)

To estimate (1.13), a reasonable loss function \(\ell (\theta ,\delta )\) would be the \(c_{\text {I}}-c_{\text {II}}\) loss function

$$\begin{aligned} \ell (\theta ,\delta )={\left\{ \begin{array}{ll} 0 &{} ,\,\text {if }\theta \in \Theta _{0},\,\delta =0\\ 0 &{} ,\,\text {if }\theta \notin \Theta _{0},\,\delta =1\\ c_{\text {II}} &{} ,\,\text {if }\theta \notin \Theta _{0},\,\delta =0\\ c_{\text {I}} &{} ,\,\text {if }\theta \in \Theta _{0},\,\delta =1 \end{array}\right. } \end{aligned}$$
(1.14)

where \(c_{\text {I}}>0\) and \(c_{\text {II}}>0\) are specified by the researcher. Here, \(c_{\text {I}}>0\) (and \(c_{\text {II}}>0\)) denote the loss if we decide to accept \(\text {H}_{0}\) (and \(\text {H}_{1}\)) while the correct answer would be to choose \(\text {H}_{1}\) (\(\text {H}_{0}\)). According to (1.5), under (1.14), the Bayes estimator of (1.13) is

$$\begin{aligned} \delta (x_{1:n})={\left\{ \begin{array}{ll} 0 &{} ,\,\text {if }\pi (\theta \in \Theta _{0}|x_{1:n})>\frac{c_{\text {II}}}{c_{\text {II}}+c_{\text {I}}}\\ 1 &{} ,\,\text {otherwise} \end{array}\right. } \end{aligned}$$
(1.15)

where \(\pi (\theta \in \Theta _{0}|x_{1:n})=\int _{\Theta _{0}}\pi (\text {d}\theta |x_{1:n})\). In other words, hypothesis \(\text {H}_{1}\) is accepted if \(\frac{\pi (\theta \in \Theta _{0}|x_{1:n})}{\pi (\theta \in \Theta _{1}|x_{1:n})}<\frac{c_{\text {II}}}{c_{\text {I}}}\).

Hypothesis tests in Bayesian statistics can also be addressed with the aid of Bayes factors. Bayes factor \(\text {B}_{01}(x_{1:n})\) is the ratio of the posterior probabilities of \(\text {H}_{0}\) and \(\text {H}_{1}\) over the ratio of the prior probabilities of \(\text {H}_{0}\) and \(\text {H}_{1}\)

$$\begin{aligned} \text {B}_{01}(x_{1:n})&=\frac{\pi (\theta \in \Theta _{0}|x_{1:n})/\pi (\theta \in \Theta _{0})}{\pi (\theta \in \Theta _{1}|x_{1:n})/\pi (\theta \in \Theta _{1})}\end{aligned}$$
(1.16)
$$\begin{aligned}&={\left\{ \begin{array}{ll} \frac{f(x_{1:n}|\theta _{0})}{f(x_{1:n}|\theta _{1})} &{} ;\text {H}_{0}:\text {single}\,\text {vs}\,\text {H}_{1}:\text {single}\\ \frac{\int _{\Theta _{0}}f(x_{1:n}|\theta )\pi _{0}(\text {d}\theta )}{\int _{\Theta _{1}}f(x_{1:n}|\theta )\pi _{1}(\text {d}\theta )} &{} ;{\text {H}_{0}:\text {composite}\,\text {vs}\,\text {H}_{1}:\text {composite}}\\ \frac{f(x_{1:n}|\theta _{0})}{\int _{\Theta _{1}}f(x_{1:n}|\theta )\pi _{1}(\text {d}\theta )} &{} ; {\text {H}_{0}:\text {single}\,\text {vs}\,\text {H}_{1}:\text {composite}} \end{array}\right. }. \end{aligned}$$
(1.17)

Under the \(c_{\text {I}}-c_{\text {II}}\) loss function, (1.15) implies that one would accept \(\text {H}_{0}\) if \(B_{01}(x_{1:n})>\frac{c_{\text {II}}}{c_{\text {I}}}\frac{\pi _{1}}{\pi _{0}}\), and accept \(\text {H}_{1}\) if otherwise. Alternatively, Jeffreys [6] developed a scale rule (Table 1.1) to judge the strength of evidence in favor of \(\text {H}_{0}\) or against \(\text {H}_{0}\) brought by the data. Although Jeffreys’ rule avoids the need to specify \(c_{I}\) and \(c_{II}\), it is a heuristic rule-of-thumb guide, not based on decision theory concepts, and hence many researchers argue against its use.

Table 1.1 Jeffreys’ scale rule [6]

Example: Bernoulli model (Cont.)

We are interested in testing the hypotheses \(\text {H}_{0}:\theta =0.5\) and \(\text {H}_{1}:\theta \ne 0.5\), given that \(\pi _{0}=1/2\), and using the \(c_{\text {I}}-c_{\text {II}}\) loss function with \(c_{\text {I}}=c_{\text {II}}\). Here, \(\Theta _{0}=\{0.5\}\) and \(\Theta _{1}=[0,0.5)\cup (0.5,1]\). The overall prior is \(\pi (\theta )=\pi _{0}\text {1}_{\theta _{0}}(\theta )+(1-\pi _{0})\text {Be}(\theta |a,b)\). The Bayes factor is

$$\begin{aligned} \text {B}_{01}(x_{1:n})&=\frac{\prod _{i=1}^{n}\text {Br}(x_{i}|\theta _{0})}{\int _{(0,1)}\prod _{i=1}^{n}\text {Br}(x_{i}|\theta )\text {Be}(\theta |a,b)\text {d}\theta }=\frac{\theta _{0}^{x_{*}}(1-\theta _{0})^{n-x_{*}}}{\text {B}(n\bar{x}+a,n-n\bar{x}+b)/\text {B}(a,b)}. \end{aligned}$$

Given \(a=b=2\), \(n=30\), and \(\sum _{i=1}^{30}x_{i}=15\), it is \(\text {B}_{01}(x_{1:n})=18.47>c_{\text {II}}/c_{\text {I}}=1\). Hence, we accept \(\text {H}_{1}\).

1.5.1 Model Selection

Often the researcher is uncertain which statistical model (sampling distribution) can better represent the real data generating process. There is a set \(\mathcal {M}=\{\mathfrak {m}_{1},\mathfrak {m}_{2},\ldots \}\) of candidate statistical models \(\mathfrak {m}_{k}=\{f_{k}(\cdot |\varphi _{k});\,\varphi _{k}\in \varPhi _{k}\}\), where \(f_{k}(\cdot |\varphi _{k})\) denotes the sampling distribution, and \(\varphi _{k}\) denotes the unknown parameters for \(k=1,2,\ldots \) Let \(\pi _{k}=\pi (\mathfrak {m}_{k})\) denote the marginal model prior and \(\pi _{k}(\varphi _{k})=\pi (\varphi _{k}|\mathfrak {m}_{k})\) denote the prior of the unknown parameters \(\varphi _{k}\) of given model \(\mathfrak {m}_{k}\).

Selection of the ‘best’ model from a set of available candidate models can be addressed via hypothesis testing. For simplicity, we consider there are only two models \(\mathfrak {m}_{0}\) and \(\mathfrak {m}_{1}\) with unknown parameters \(\vartheta _{0}\in \varPhi _{0}\) and \(\vartheta _{1}\in \varPhi _{1}\). Then, model selection is performed as a hypothesis test

$$\begin{aligned} \text {H}_{0}:\,(\mathfrak {m},\varphi )\in \Theta _{0}\quad \text {vs}\quad \text {H}_{1}:\,(\mathfrak {m},\varphi )\in \Theta _{1} \end{aligned}$$
(1.18)

where \(\Theta _{k}=\{\mathfrak {m}_{k}\}\times \varPhi _{k}\), \(\Theta =\cup _{k}\Theta _{k}\). The overall joint prior is specified as \(\pi (\mathfrak {m},\varphi )=\pi _{0}\times \pi _{0}(\varphi _{0})+\pi _{1}\times \pi _{1}(\varphi _{1})\) on \((\mathfrak {m},\varphi )\in \Theta \) where \(\Theta =\cup _{k}\Theta _{k}\), where \(\pi _{k}(\varphi _{k})=\frac{\pi (\mathfrak {m},\varphi )\text {1}_{\mathfrak {m}_{k}}(\mathfrak {m})}{\int _{\varPhi _{k}}\pi (\mathfrak {m},\text {d}\varphi )}\) on \(\varphi _{k}\in \varPhi _{k}\), and \(\pi _{k}=\int _{\Theta _{k}}\pi (\mathfrak {m}_{k},\text {d}\varphi _{k})\). Now the model selection problem has been translated into a hypothesis test.

Example: Negative binomial vs. Poisson model [2

We are interested in testing the hypotheses

$$ \text {H}_{0}:x_{i}|\phi \sim \text {Nb}(\phi ,1),\quad \phi>0,\quad \text {vs.}\quad \text {H}_{1}:x_{i}|\lambda \sim \text {Pn}(\lambda ),\quad \lambda >0 $$

by using the \(c_{\text {I}}-c_{\text {II}}\) loss function with \(c_{\text {I}}=c_{\text {II}}\). Consider two observations \(x_{1}=x_{2}=2\) are available. Consider overall prior \(\pi (\theta )\) with density \(\pi (\theta )=\pi _{0}\text {Be}(\phi |a_{0},b_{0})+\pi _{1}\text {Ga}(\lambda |a_{1},b_{1})\) with \(\pi _{0}=\pi _{1}=0.5\).

This is a composite vs. composite hypothesis test. It is

$$\begin{aligned} \int _{\Theta _{0}}f(x_{1:n}|\varphi _{0})\pi _{0}(\text {d}\varphi _{0})&=\frac{\Gamma (a_{0}+b_{0})}{\Gamma (a_{0})\Gamma (b_{0})}\int _{0}^{1}\phi ^{n+a_{0}-1}(1-\phi )^{n\bar{x}+b_{0}-1}\text {d}\phi \\&=\frac{\Gamma (a_{0}+b_{0})}{\Gamma (a_{0})\Gamma (b_{0})}\frac{\Gamma (n+a_{0})\Gamma (n\bar{x}+b_{0})}{\Gamma (n+n\bar{x}+a_{0}+b_{0})} \end{aligned}$$
$$\begin{aligned} \int _{\Theta _{1}}f(x_{1:n}|\varphi _{1})\pi _{1}(\text {d}\varphi _{1})&=\frac{b_{1}^{a_{1}}}{\Gamma (a_{1})(n+b_{1})^{n\bar{x}+a_{1}}}\int _{0}^{\infty }\lambda ^{n\bar{x}+a_{1}-1}\exp (-(n+b_{1})\lambda )\text {d}\lambda \\&=\frac{\Gamma (n\bar{x}+a_{1})}{\Gamma (a_{1})(n+b_{1})^{n\bar{x}+a_{1}}}\frac{1}{\prod _{i=1}^{n}x_{i}!} \end{aligned}$$

and hence \(B_{01}(x_{1:n})=\frac{\Gamma (a_{0}+b_{0})}{\Gamma (a_{0})\Gamma (b_{0})}\frac{\Gamma (n+a_{0})\Gamma (n\bar{x}+b_{0})}{\Gamma (n+n\bar{x}+a_{0}+b_{0})}\frac{\Gamma (a_{1})(n+b_{1})^{n\bar{x}+a_{1}}}{\Gamma (n\bar{x}+a_{1})}\prod _{i=1}^{n}x_{i}!\) . It is \(B_{01}(x_{1:n})=0.29>1\), and hence I accept \(\text {H}_{1}\) and the Poisson model.