
1 Statistical Models

Evolutionary genomics can only be approached with the help of statistical modeling. Stochastic fluctuations are inherent to many biological systems. Specifically, the evolutionary process itself is stochastic, with random mutations and random mating being major sources of variation. In general, stochastic effects play an increasingly important role if the number of molecules, or cells, or individuals of a population is small. Stochastic variation also arises from measurement errors. Biological data is often noisy due to experimental limitations, especially for high-throughput technologies, such as microarrays or next-generation sequencing [1, 2].

Statistical modeling addresses the following questions: What can be generalized from a finite sample obtained from an experiment to the population? What can be learned about the underlying biological mechanisms? How certain can we be about our model predictions?

In the frequentist view of statistics, the observed variability in the data is the result of a fixed true value being perturbed by random variation, such as measurement noise. Probabilities are thus interpreted as long-run expected relative frequencies. By contrast, from a Bayesian point of view, probabilities represent our uncertainty about the state of nature. There is no true value; only the data are real. Our prior belief about an event is updated in light of the data.

Statistical models represent the observed variability or uncertainty by probability distributions [3, 4]. The observed data are regarded as realizations of random variables. The parameters of a statistical model are usually the quantities of interest because they describe the amount and nature of systematic variation in the data. Parameter estimation and model selection are discussed in more detail in the next section. In this section, we consider discrete before continuous random variables, and univariate (1-dimensional) before multivariate (n-dimensional) ones. We start by formulating the well-known Hardy–Weinberg principle [5, 6] as a statistical model.

Example 1 (Hardy–Weinberg Model)

The Hardy–Weinberg model is a statistical model for the genotypes in a diploid population of infinite size. Let us assume that there are two alleles, denoted A and a, and hence three genotypes, denoted AA, Aa = aA, and aa. Let X be the random variable with state space \( \mathcal{X}=\left\{\mathrm{AA},\mathrm{Aa},\mathrm{aa}\right\} \) describing the genotype. We parametrize the probability distribution of X by the allele frequency p of A and the allele frequency q = 1 − p of a. The Hardy–Weinberg model is defined by:

$$ P\left(X=\mathrm{AA}\right)={p}^2,\kern0.5em $$
(1)
$$ P\left(X=\mathrm{Aa}\right)=2p\left(1-p\right),\kern0.5em $$
(2)
$$ P\left(X=\mathrm{aa}\right)={\left(1-p\right)}^2.\kern0.5em $$
(3)

The parameter space of the model is \( \Theta =\left\{p\in \mathbb{R}\mid 0\le p\le 1\right\}=\left[0,1\right] \), the unit interval. We denote the Hardy–Weinberg model by \( \mathrm{HW}(p) \) and write \( X\sim \mathrm{HW}(p) \) if X follows the distribution (Eqs. 1–3). □
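The model is easy to explore numerically. The following is a minimal Python sketch (the function name hw_probabilities is our own) that evaluates Eqs. 1–3 for a given allele frequency:

```python
import numpy as np

def hw_probabilities(p):
    """Genotype distribution HW(p) for allele frequency p (Eqs. 1-3)."""
    return np.array([p ** 2, 2 * p * (1 - p), (1 - p) ** 2])  # P(AA), P(Aa), P(aa)

probs = hw_probabilities(0.9)
print(probs)                          # [0.81 0.18 0.01]
assert np.isclose(probs.sum(), 1.0)   # a point on the probability simplex
```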

The Hardy–Weinberg distribution P(X) is a discrete probability distribution (or probability mass function) with finite state space: We have 0 ≤ P(X = x) ≤ 1 for all \( x\in \mathcal{X} \) and \( {\sum}_{x\in \mathcal{X}}P\left(X=x\right)={p}^2+2p\left(1-p\right)+{\left(1-p\right)}^2={\left[p+\left(1-p\right)\right]}^2=1 \). In general, any statistical model for a discrete random variable with n states defines a subset of the (n − 1)-dimensional probability simplex:

$$ {\Delta}_{n-1}=\left\{\left({p}_1,\dots, {p}_n\right)\in {\left[0,1\right]}^n\mid {p}_1+\cdots +{p}_n=1\right\}.\kern1.00em $$
(4)

The probability simplex is the set of all possible probability distributions of X, and statistical models can be understood as specific subsets of the simplex [7].

The Hardy–Weinberg distribution is of interest because it arises under the assumption of random mating. A population with major allele frequency p has genotype probabilities given in Eqs. 1–3 after one round of random mating. We find that the new allele frequency:

$$ p^{\prime }=P\left(\mathrm{AA}\right)+P\left(\mathrm{Aa}\right)/ 2={p}^2+2p\left(1-p\right)/ 2=p,\kern1.00em $$
(5)

is equal to the one in the previous generation. Thus, genetic variation is preserved under this simple model of sexual reproduction, and the population is at equilibrium after one generation. In other words, Eqs. 1–3 describe the set of all populations at Hardy–Weinberg equilibrium. The parametric representation:

$$ \left\{\left({p}_{\mathrm{AA}},{p}_{\mathrm{Aa}},{p}_{\mathrm{aa}}\right)\in {\Delta}_2\mid {p}_{\mathrm{AA}}={p}^2,\kern0.3em {p}_{\mathrm{Aa}}=2p\left(1-p\right),\kern0.3em {p}_{\mathrm{aa}}={\left(1-p\right)}^2\right\},\kern1.00em $$
(6)

of this set of distributions is equivalent to the implicit representation as the intersection of the Hardy–Weinberg curve:

$$ 4\kern0.3em {p}_{\mathrm{AA}}\kern0.3em {p}_{\mathrm{aa}}-{p}_{\mathrm{Aa}}^2=0\kern1.00em $$
(7)

with the probability simplex Δ2 (Fig. 1).

Fig. 1

De Finetti diagram showing the Hardy–Weinberg curve \( 4\kern0.3em {p}_{\mathrm{AA}}\kern0.3em {p}_{\mathrm{aa}}-{p}_{\mathrm{Aa}}^2=0 \) inside the probability simplex Δ2 = {(p AA, p Aa, p aa)∣p AA + p Aa + p aa = 1}. Each point in this space represents a population as described by its genotype frequencies. Points on the curve correspond to populations in Hardy–Weinberg equilibrium

The simplest discrete random variable is a binary (or Bernoulli) random variable X. The textbook example of a Bernoulli trial is the flipping of a coin. The state space of this random experiment is the set that contains all possible outcomes, namely, whether the coin lands on heads (X = 0) or tails (X = 1). We write \( \mathcal{X}=\left\{0,1\right\} \) to denote this state space. The parameter space is the set that contains all possible values of the model parameters. In the coin tossing example, the only parameter is the probability of observing tails, p, and this parameter can take any value between 0 and 1, so we write Θ = {p ∣ 0 ≤ p ≤ 1} for the parameter space. In general, the event X = 1 is often called a “success,” and p = P(X = 1) the probability of success.

Example 2 (Binomial Distribution)

Consider n independent Bernoulli trials, each with success probability p. Let X be the random variable counting the number of successes k among the n trials. Then, X has state space \( \mathcal{X}=\left\{0,\dots, n\right\} \) and

$$ P\left(X=k\right)=\left(\genfrac{}{}{0.0pt}{}{n}{k}\right){p}^k{\left(1-p\right)}^{n-k}.\kern0.5em $$
(8)

This is the binomial distribution, denoted \( \mathrm{Binom}\left(n,p\right) \). Its parameter space is \( \Theta =\mathbb{N}\times \left[0,1\right] \). Examples of binomially distributed random variables are the number of “heads” in n successive coin tosses or the number of mutated genes in a group of species. □

Important characteristics of a probability distribution are its expectation (or expected value, or mean) and its variance. They are defined, respectively, as:

$$ \mathrm{E}(X)=\sum \limits_{x\in \mathcal{X}}x\kern0.3em P\left(X=x\right),\kern0.5em $$
(9)
$$ \mathrm{Var}(X)=\sum \limits_{x\in \mathcal{X}}{\left[x-\mathrm{E}(X)\right]}^2\kern0.3em P\left(X=x\right).\kern0.5em $$
(10)

The standard deviation is \( \sqrt{\mathrm{Var}(X)} \). For the binomial distribution, \( X\sim \mathrm{Binom}\left(n,p\right) \), we find E(X) = np and \( \mathrm{Var}(X)= np\left(1-p\right) \).

Example 3 (Poisson Distribution)

The Poisson distribution \( \mathrm{Pois}\left(\lambda \right) \) with parameter λ ≥ 0 is defined as:

$$ P\left(X=k\right)\kern0.5em =\kern0.5em \frac{\lambda^k\kern0.3em {e}^{-\lambda }}{k!},\kern1em k\in \mathbb{N}.\kern0.5em $$
(11)

It describes the number X of independent events occurring in a fixed period of time (or space) at average rate λ and independently of the time since (or distance to) the last event. The Poisson distribution has equal expectation and variance, \( \mathrm{E}(X)=\mathrm{Var}(X)=\lambda \). □

The Poisson distribution is used frequently as a model for the number of DNA mutations in a gene after a certain time period, where λ is the mutation rate. Both the binomial and the Poisson distribution describe counts of random events. For large n and fixed product np, the binomial distribution approaches the Poisson distribution, \( \mathrm{Binom}\left(n,p\right)\to \mathrm{Pois}(np) \), as n → ∞.

Example 4 (Shotgun Sequencing)

Let us consider a simplified model of the shotgun approach to DNA sequencing. Suppose that n reads of length L have been obtained from a genome of size G. We assume that all reads have the same probability of being sequenced. Then, the probability of hitting a specific base with one read is p = L∕G, and the average coverage of the sequencing run is c = np. Under this model, the number of times X a single base is sequenced is distributed as \( \mathrm{Binom}\left(n,p\right). \) For large n, we have

$$ P\left(X=k\right)=\left(\genfrac{}{}{0.0pt}{}{n}{k}\right){p}^k{\left(1-p\right)}^{n-k}\approx \frac{c^k\kern0.3em {e}^{-c}}{k!}.\kern1.00em $$
(12)

For example, using next-generation sequencing technology, one might obtain \( n={10}^8 \) reads of length L = 100 bases in a single run. For the human genome of length \( G=3\cdot {10}^9 \), we obtain a coverage of \( c= np\approx 3.3 \). The distribution of the number of reads per base pair is shown in Fig. 2. In particular, the fraction of unsequenced positions is \( P\left(X=0\right)={e}^{-c}\approx 3.57\% \). □
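The calculation is easily reproduced, for example with scipy (a sketch under the binomial model above; the variable names are ours):

```python
from scipy.stats import binom, poisson

n, L, G = 10 ** 8, 100, 3 * 10 ** 9   # reads, read length, genome size
p = L / G                             # probability that one read covers a fixed base
c = n * p                             # average coverage, c = np

# Fraction of unsequenced positions: exact binomial vs. Poisson approximation
print(binom.pmf(0, n, p))             # ~0.0357
print(poisson.pmf(0, c))              # e^(-c) ~ 0.0357
```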

Fig. 2

Coverage distribution of a shotgun sequencing experiment with \( n={10}^8 \) reads of length L = 100 of the human genome of length \( G=3\cdot {10}^9 \). The average coverage is \( c= np\approx 3.3 \), where p = L∕G. Dots show the binomial coverage distribution \( \mathrm{Binom}\left(n,p\right) \) and the solid line its approximation by the Poisson distribution \( \mathrm{Pois}(np) \). Note that the Poisson distribution is also discrete and just shown as a line to distinguish it from the binomial distribution

A continuous random variable X takes values in \( \mathcal{X}=\mathbb{R} \) and is defined by a nonnegative function f(x) such that:

$$ P\left(X\in B\right)={\int}_Bf(x) dx,\kern1em \mathrm{for}\ \mathrm{all}\ \mathrm{subsets}\kern0.3em B\subseteq \mathbb{R}.\kern1.00em $$
(13)

The function f is called the probability density function of X. For an interval:

$$ P\left(X\in \left[a,b\right]\right)=P\left(a\le X\le b\right)={\int}_a^bf(x) dx.\kern1.00em $$
(14)

The cumulative distribution function is

$$ F(b)=P\left(X\le b\right)={\int}_{-\infty}^bf(x) dx,\kern1em b\in \mathbb{R}.\kern1.00em $$
(15)

Thus, the density is the derivative of the cumulative distribution function, \( \frac{d}{dx}F(x)=f(x) \).

In analogy to the discrete case, expectation and variance of a continuous random variable are defined, respectively, as:

$$ \mathrm{E}(X)\kern0.5em =\kern0.5em {\int}_{-\infty}^{\infty }x\kern0.3em f(x)\kern0.3em dx,\kern0.5em $$
(16)
$$ \mathrm{Var}(X)\kern0.5em =\kern0.5em {\int}_{-\infty}^{\infty }{\left[x-\mathrm{E}(X)\right]}^2\kern0.3em f(x)\kern0.3em dx.\kern0.5em $$
(17)

Example 5 (Normal Distribution)

The normal (or Gaussian) distribution has the density function:

$$ f(x)={\left(2\pi {\sigma}^2\right)}^{-1/ 2}\exp \left[-\frac{{\left(x-\mu \right)}^2}{2{\sigma}^2}\right].\kern1.00em $$
(18)

The parameter space is \( \Theta =\left\{\left(\mu, {\sigma}^2\right)\mid \mu \in \mathbb{R},\kern0.3em {\sigma}^2\in {\mathbb{R}}_{+}\right\} \). A normal random variable \( X\sim \mathrm{Norm}\left(\mu, {\sigma}^2\right) \) has mean E(X) = μ and variance \( \mathrm{Var}(X)={\sigma}^2 \). \( \mathrm{Norm}\left(0,1\right) \) is called the standard normal distribution. □

The normal distribution is frequently used as a model for measurement noise. For example, \( X\sim \mathrm{Norm}\left(\mu, {\sigma}^2\right) \) might describe the hybridization intensity of a sample to a probe on a microarray. Then, μ is the level of expression of the corresponding gene and σ 2 summarizes the experimental noise associated with the microarray experiment. The parameters can be estimated from a finite sample {x (1), …, x (N)}, i.e., from N replicate experiments, as the empirical mean and variance, respectively:

$$ \overline{x}=\frac{1}{N}\sum \limits_{i=1}^N{x}^{(i)},\kern0.5em $$
(19)
$$ {s}^2=\frac{1}{N-1}\sum \limits_{i=1}^N{\left({x}^{(i)}-\overline{x}\right)}^2.\kern0.5em $$
(20)
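In Python, these estimates are one-liners (a sketch with simulated replicates; note that ddof=1 gives the 1∕(N − 1) factor of Eq. 20):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=50)  # N = 50 simulated replicate measurements

x_bar = x.mean()        # empirical mean (Eq. 19)
s2 = x.var(ddof=1)      # empirical variance (Eq. 20), 1/(N-1) normalization
print(x_bar, s2)        # close to mu = 5 and sigma^2 = 4
```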

The normal distribution plays a special role in statistics due to the central limit theorem. It asserts that the average \( {\overline{X}}_N=\left({X}^{(1)}+\cdots +{X}^{(N)}\right)/ N \) of N independent (see below) and identically distributed (i.i.d.) random variables X (i) with equal mean μ and variance σ 2 converges in distribution to the standard normal distribution:

$$ \sqrt{N}\kern0.3em \left(\frac{{\overline{X}}_N-\mu }{\sigma}\right)\overset{d}{\to}\mathrm{Norm}\left(0,1\right),\kern1.00em $$
(21)

irrespective of the shape of their distribution. As a consequence, many test statistics and estimators are asymptotically normally distributed. For example, the Poisson distribution \( \mathrm{Pois}\left(\lambda \right) \) is approximately normal \( \mathrm{Norm}\left(\lambda, \lambda \right) \) for large values of λ.
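The theorem is easy to verify by simulation; the following sketch (our own illustration) standardizes averages of Poisson variables and recovers an approximately standard normal sample:

```python
import numpy as np

rng = np.random.default_rng(1)
N, lam = 1000, 4.0        # sample size per average; Pois(lambda) has mean = var = lambda

# 10,000 standardized averages of N i.i.d. Poisson(lambda) variables
means = rng.poisson(lam, size=(10_000, N)).mean(axis=1)
z = np.sqrt(N) * (means - lam) / np.sqrt(lam)
print(z.mean(), z.std())  # close to 0 and 1, as Norm(0, 1) predicts
```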

We often measure multiple quantities at the same time, for example the expression of several genes, and are interested in correlations among the variables. Let X and Y be two random variables with expected values μ X and μ Y and variances \( {\sigma}_X^2 \) and \( {\sigma}_Y^2 \), respectively. The covariance between X and Y is

$$ \mathrm{Cov}\left(X,Y\right)=\mathrm{E}\left[\left(X-{\mu}_X\right)\left(Y-{\mu}_Y\right)\right]=\mathrm{E}\left[ XY\right]-\mathrm{E}\left[X\right]\mathrm{E}\left[Y\right]\kern1.00em $$
(22)

and the correlation between X and Y is \( {\rho}_{X,Y}=\mathrm{Cov}\left(X,Y\right)/ \left({\sigma}_X{\sigma}_Y\right) \). For observations (x (1), y (1)), …, (x (N), y (N)), the sample correlation coefficient is

$$ {r}_{x,y}=\frac{\sum_{i=1}^N\left({x}^{(i)}-\overline{x}\right)\left({y}^{(i)}-\overline{y}\right)}{\left(N-1\right){s}_X{s}_Y},\kern1.00em $$
(23)

where s X and s Y are the sample standard deviations of X and Y , respectively, defined in Eq. 20.
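Numerically, Eq. 23 is available, for example, as numpy's corrcoef (a sketch with simulated data of our own choosing):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=100)
y = 0.8 * x + rng.normal(scale=0.5, size=100)  # y depends linearly on x

r = np.corrcoef(x, y)[0, 1]                    # sample correlation (Eq. 23)
print(r)                                       # clearly positive
```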

So far, we have worked with univariate distributions and we now turn to multivariate distributions, i.e., we consider random vectors X = (X 1, …, X n) such that each X i is a random variable. For the case of discrete random variables X i, we first generalize the binomial distribution to random experiments with a finite number of outcomes.

Example 6 (Multinomial Distribution)

Let K be the number of possible outcomes of a random experiment and θ k the probability of outcome k. We consider the random vector X = (X 1, …, X K) with values in \( \mathcal{X}={\mathbb{N}}^K \), where X k counts the number of outcomes of type k. The multinomial distribution \( \mathrm{Mult}\left(n,{\theta}_1,\dots, {\theta}_K\right) \) is defined as:

$$ P\left(X=x\right)=\frac{n!}{x_1!\cdots {x}_K!}{\theta}_1^{x_1}\cdots {\theta}_K^{x_K}\kern0.5em $$
(24)

if \( {\sum}_{k=1}^K{x}_k=n \), and 0 otherwise. The parameter space of the model is \( \Theta =\mathbb{N}\times {\Delta}_{K-1} \). For K = 2, we recover the binomial distribution (Eq. 8). Each component X k of a multinomial vector has expected value \( \mathrm{E}\left({X}_k\right)=n{\theta}_k \) and \( \mathrm{Var}\left({X}_k\right)=n{\theta}_k\left(1-{\theta}_k\right) \). The covariance of two components is \( \mathrm{Cov}\left({X}_k,{X}_l\right)=-n{\theta}_k{\theta}_l \), for k ≠ l. □

In general, the covariance matrix Σ of a random vector X is defined by:

$$ {\Sigma}_{ij}=\mathrm{Cov}\left({X}_i,{X}_j\right)=\mathrm{E}\left[\left({X}_i-{\mu}_i\right)\left({X}_j-{\mu}_j\right)\right],\kern1.00em $$
(25)

where μ i is the expected value of X i. The matrix Σ is also called the variance–covariance matrix because the diagonal terms are the variances \( {\Sigma}_{ii}=\mathrm{Cov}\left({X}_i,{X}_i\right)=\mathrm{Var}\left({X}_i\right) \).

A continuous multivariate random variable X takes values in \( \mathcal{X}={\mathbb{R}}^n \). It is defined by its cumulative distribution function:

$$ F(x)=P\left(X\le x\right),\kern1em x\in {\mathbb{R}}^n\kern1.00em $$
(26)

or, equivalently, by the probability density function:

$$ f(x)=\frac{\partial^n}{\partial {x}_1\cdots \partial {x}_n}F\left({x}_1,\dots, {x}_n\right),\kern1em x\in {\mathbb{R}}^n.\kern1.00em $$
(27)

Example 7 (Multivariate Normal Distribution)

For n ≥ 1 and \( x\in {\mathbb{R}}^n \), the multivariate normal (or Gaussian) distribution has density:

$$ f(x)={\left(2\pi \right)}^{-n/ 2}\det {\left(\Sigma \right)}^{-1/ 2}\exp \left[-\frac{1}{2}{\left(x-\mu \right)}^t{\Sigma}^{-1}\left(x-\mu \right)\right],\kern1.00em $$
(28)

with parameter space \( \Theta =\Big\{\left(\mu, \Sigma \right)\mid \mu =\left({\mu}_1,\dots, {\mu}_n\right)\in {\mathbb{R}}^n \) and \( \Sigma =\left({\sigma}_{ij}^2\right)\in {\mathbb{R}}^{n\times n}\Big\} \), where Σ is the symmetric, positive-definite covariance matrix and μ the expectation. We write \( X=\left({X}_1,\dots, {X}_n\right)\sim \mathrm{Norm}\left(\mu, \Sigma \right) \) for a random vector with such a distribution. □

We say that two random variables X and Y are independent if P(X, Y ) = P(X)P(Y ) or, equivalently, if the conditional probability P(X ∣ Y ) = P(X, Y )∕P(Y ) is equal to the unconditional probability P(X). If X and Y are independent, denoted X ⊥ Y , then E[XY ] = E[X]E[Y ] and \( \mathrm{Var}\left(X+Y\right)=\mathrm{Var}(X)+\mathrm{Var}(Y) \). It follows that independent random variables have covariance zero. However, the converse is only true in specific situations, for example if (X, Y ) is multivariate normal, but not in general because correlation captures only linear dependencies.

This limitation can be addressed by using statistical models which allow for a richer dependency structure. Subheading 7 is devoted to Bayesian networks, a family of probabilistic graphical models based on conditional independences. Let X, Y , and Z be three random vectors. Generalizing the notion of statistical independence, we say that X is conditionally independent of Y given Z and write X ⊥ Y ∣ Z if P(X, Y ∣ Z) = P(X ∣ Z)P(Y ∣ Z). Bayes’ theorem states that

$$ P\left(Y\mid X\right)=\frac{P\left(X\mid Y\right)P(Y)}{P(X)},\kern0.5em $$
(29)

where P(Y ) is called the prior probability and P(Y ∣ X) the posterior probability. Intuitively, the prior P(Y ) encodes our a priori knowledge about Y (i.e., before observing X), and P(Y ∣ X) is our updated knowledge about Y a posteriori (i.e., after observing X).

We have P(X) =∑Y P(X, Y ) if Y is discrete, and similarly P(X) =∫Y P(X, Y )dY if Y is continuous. Here, P(X) is called the marginal and P(X, Y ) the joint probability. This summation or integration is known as marginalization (Fig. 3).

Fig. 3

Marginalization. Left: two-dimensional histogram of a discrete bivariate distribution with the two marginal histograms. Right: contour plot of a two-dimensional Gaussian density with the marginal distributions of each component

Since P(X) =∑Y P(X, Y ) =∑Y P(X ∣ Y )P(Y ), Bayes’ theorem can also be rewritten as:

$$ P\left(Y\mid X\right)=\frac{P\left(X\mid Y\right)P(Y)}{\sum \limits_{y^{\prime}\in \mathcal{Y}}P\left(X\mid y^{\prime}\right)P\left(y^{\prime}\right)},\kern0.5em $$
(30)

where P(y′) = P(Y = y′) and \( \mathcal{Y} \) is the state space of Y .

Example 8 (Diagnostic Test)

We want to evaluate a diagnostic test for a rare genetic disease. The binary random variables D and T indicate disease status (D = 1, diseased) and test result (T = 1, positive), respectively. Let us assume that the prevalence of the disease is 0.5%, i.e., 0.5% of all people in the population are known to be affected. The test has a false positive rate (probability that somebody is tested positive who does not have the disease) of P(T = 1 ∣ D = 0) = 5% and a true positive rate (probability that somebody is tested positive who has the disease) of P(T = 1 ∣ D = 1) = 90%. Then, the posterior probability of a person having the disease given that he or she tested positive is

$$ {\displaystyle \begin{array}{l}P\left(D=1\mid T=1\right)=\\ {}\frac{P\left(T=1\mid D=1\right)P\left(D=1\right)}{P\left(T=1\mid D=0\right)P\left(D=0\right)+P\left(T=1\mid D=1\right)P\left(D=1\right)}=0.083,\end{array}} $$
(31)

that is, only 8.3% of the positively tested individuals actually have the disease. Thus, our prior belief of the disease status, P(D), has been modified in light of the test result by multiplication with P(T ∣ D) to obtain the updated belief P(D ∣ T). □
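The posterior is a three-line computation (a sketch of Eq. 30 for this example; the dictionary layout is ours):

```python
prior = {0: 0.995, 1: 0.005}   # P(D): disease prevalence
lik = {0: 0.05, 1: 0.90}       # P(T=1 | D): false and true positive rates

marginal = sum(lik[d] * prior[d] for d in (0, 1))  # P(T=1), by marginalization
posterior = lik[1] * prior[1] / marginal           # Bayes' theorem (Eq. 30)
print(round(posterior, 3))                         # 0.083
```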

Exercise 9 (Conditional Independence)

Let X, Y , and Z be random variables. Using the laws of probability, show that X and Y are conditionally independent given Z (i.e., X ⊥ Y ∣ Z) if and only if P(X ∣ Y, Z) = P(X ∣ Z).

2 Statistical Inference

Statistical models have parameters and a common task is to estimate the model parameters from observed data. The goal is to find the set of parameters with the best model fit. There are two major approaches to parameter estimation: maximum likelihood (ML) and Bayes.

The maximum likelihood approach is based on the likelihood function. Let us consider a fixed statistical model \( \mathrm{M} \) with parameter space Θ and assume that we have observed realizations \( \mathcal{D}=\left\{{x}^{(1)},\dots, {x}^{(N)}\right\} \) of the discrete random variable \( X\sim \mathrm{M}\left({\theta}_0\right) \) for some unknown parameter θ 0 ∈ Θ. For the fixed data set \( \mathcal{D} \), the likelihood function of the model is

$$ L\left(\theta \right)=P\left(\mathcal{D}\mid \theta \right),\kern1.00em $$
(32)

where we write \( P\left(\mathcal{D}\mid \theta \right) \) to emphasize that, here, the probability of the data depends on the model parameter θ. For continuous random variables, the likelihood function is defined similarly in terms of the density function, \( L\left(\theta \right)=f\left(\mathcal{D}\mid \theta \right) \). Maximum likelihood estimation seeks the parameter θ ∈ Θ for which L(θ) is maximal. Rather than L(θ), it is often more convenient to maximize \( \ell \left(\theta \right)=\log L\left(\theta \right) \), the log-likelihood function. If the data are i.i.d., then:

$$ \ell \left(\theta \right)=\sum \limits_{i=1}^N\log P\left(X={x}^{(i)}\mid \theta \right).\kern1.00em $$
(33)

Example 10 (Likelihood Function of the Binomial Model)

Suppose we have observed k = 7 successes in a total of N = 10 Bernoulli trials. The likelihood function of the binomial model (Eq. 8) is

$$ L(p)={p}^k\kern0.3em {\left(1-p\right)}^{N-k},\kern1.00em $$
(34)

where p is the success probability (Fig. 4). To maximize L, we consider the log-likelihood function:

$$ \ell (p)=\log L(p)=k\log (p)+\left(N-k\right)\log \left(1-p\right)\kern1.00em $$
(35)

and the likelihood equation dℓ∕dp = 0. The ML estimate (MLE) is the solution \( {\hat{p}}_{\mathrm{ML}}=k/ N=7/ 10 \). Thus, the MLE of the success probability is just the relative frequency of successes. □
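The closed-form solution can be checked by numerical optimization (a sketch using scipy; maximizing ℓ is implemented as minimizing −ℓ):

```python
import numpy as np
from scipy.optimize import minimize_scalar

k, N = 7, 10

def neg_log_lik(p):
    return -(k * np.log(p) + (N - k) * np.log(1 - p))  # -l(p), Eq. 35

res = minimize_scalar(neg_log_lik, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x)  # ~0.7, the closed-form MLE k/N
```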

Fig. 4

Likelihood function of the binomial model. The underlying data set consists of k = 7 successes out of N = 10 Bernoulli trials. The likelihood \( L(p)={p}^k{\left(1-p\right)}^{N-k} \) is plotted as a function of the model parameter p, the probability of success (solid line). The MLE is the maximum of this function, \( {\hat{p}}_{\mathrm{ML}}=k/ N=7/ 10 \) (dashed line)

Example 11 (Likelihood Function of the Hardy–Weinberg Model)

If we genotype a finite random sample of a population of diploid individuals at a single locus, then the resulting data consists of the numbers of individuals n AA, n Aa, and n aa with the respective genotypes. Assuming Hardy–Weinberg equilibrium (Eqs. 13), we want to estimate the allele frequencies p and q = 1 − p of the population. The likelihood function of the Hardy–Weinberg model is \( L(p)=P{\left(\mathrm{AA}\right)}^{n_{\mathrm{AA}}}P{\left(\mathrm{Aa}\right)}^{n_{\mathrm{Aa}}}P{\left(\mathrm{aa}\right)}^{n_{\mathrm{aa}}} \) and the log-likelihood is

$$ {\displaystyle \begin{array}{r}\ell (p)={n}_{\mathrm{AA}}\log {p}^2+{n}_{\mathrm{Aa}}\log 2p\left(1-p\right)+{n}_{\mathrm{aa}}\log {\left(1-p\right)}^2\\ {}\propto \left(2{n}_{\mathrm{AA}}+{n}_{\mathrm{Aa}}\right)\log p+\left({n}_{\mathrm{Aa}}+2{n}_{\mathrm{aa}}\right)\log \left(1-p\right),\end{array}}\kern1.00em $$
(36)

where we have dropped the constant \( {n}_{\mathrm{Aa}}\log 2 \). The MLE of p ∈ [0, 1] can be found by maximizing ℓ. Solving the likelihood equation:

$$ \frac{\partial \ell }{\partial p}=\frac{2{n}_{\mathrm{AA}}+{n}_{\mathrm{Aa}}}{p}-\frac{n_{\mathrm{Aa}}+2{n}_{\mathrm{aa}}}{1-p}=0\kern1.00em $$
(37)

yields the MLE \( {\hat{p}}_{\mathrm{ML}}=\left(2{n}_{\mathrm{AA}}+{n}_{\mathrm{Aa}}\right)/ (2N) \), where N = n AA + n Aa + n aa is the total sample size. For example, if we sample N = 100 genotypes with n AA = 81, n Aa = 18, and n aa = 1, then we find \( {\hat{p}}_{\mathrm{ML}}=\left(2\cdot 81+18\right)/ \left(2\cdot 100\right)=0.9 \) for the frequency of the major allele. □

MLEs have many desirable properties. Asymptotically, as the sample size N →, they are normally distributed, unbiased, and have minimal variance. The uncertainty in parameter estimation associated with the sampling variance of the finite data set can be quantified in confidence intervals. There are several ways to construct confidence intervals and statistical tests for MLEs based on the asymptotic behavior of the log-likelihood function \( \ell \left(\theta \right)=\log L\left(\theta \right) \) and its derivatives. For example, the asymptotic normal distribution of the MLE is

$$ {\hat{\theta}}_{\mathrm{ML}}\kern0.60em \overset{a}{\sim}\kern0.60em \mathrm{Norm}\left(\theta, J{\left(\theta \right)}^{-1}\right),\kern1.00em $$
(38)

where \( I\left(\theta \right)=-{\partial}^2\ell / \partial {\theta}^2 \) is the Fisher information and J(θ) = E[I(θ)] the expected Fisher information. This result gives rise to the Wald confidence intervals:

$$ \left[{\hat{\theta}}_{\mathrm{ML}}\kern0.3em \pm \kern0.3em {z}_{1-\alpha / 2}\kern0.3em J{\left(\theta \right)}^{-1/ 2}\right],\kern1.00em $$
(39)

where \( {z}_{1-\alpha / 2}=\operatorname{inf}\left\{x\in \mathbb{R}\mid 1-\alpha / 2\le F(x)\right\} \) is the (1 − α∕2) quantile and F the cumulative distribution function of the standard normal distribution. Equation 39 still holds after replacing \( J{\left(\theta \right)}^{-1/ 2} \) with the standard error \( \mathrm{se}\left({\hat{\theta}}_{\mathrm{ML}}\right)={\left[I\left({\hat{\theta}}_{\mathrm{ML}}\right)\right]}^{-\frac{1}{2}} \) or \( {\left[J\left({\hat{\theta}}_{\mathrm{ML}}\right)\right]}^{-\frac{1}{2}} \), and it also generalizes to higher dimensions. Other common constructions of confidence intervals include those based on the asymptotic distribution of the score function \( S\left(\theta \right)=\partial \ell / \partial \theta \) and the log-likelihood ratio \( \log \left(L\left({\hat{\theta}}_{\mathrm{ML}}\right)/ L\left(\theta \right)\right) \) [8].

We now discuss another more generic approach to quantify parameter uncertainty, not restricted to ML estimation, which is applied frequently in practice due to its simple implementation. Bootstrapping [9] is a resampling method in which independent observations are resampled from the data with replacement. The resulting new data set consists of (some of) the original observations, and under i.i.d. assumptions, the bootstrap replicates have asymptotically the same distribution as the data. Intuitively, by sampling with replacement, one is pretending that the collection of replicates thus obtained is a good proxy for the distribution of data sets that one would have obtained, had we been able to actually replicate the experiment. In this way, the variability of an estimator (or more generally the distribution of any test statistic) can be approximated by evaluating the estimator (or the statistic) on a collection of bootstrap replicates. For example, the distribution of the ML estimator of a model parameter θ can be obtained from the bootstrap samples.

Example 12 (Bootstrap Confidence Interval for the ML Allele Frequency)

We use bootstrapping to estimate the distribution of the ML estimator \( {\hat{p}}_{\mathrm{ML}} \) of the Hardy–Weinberg model for the data set (n AA, n Aa, n aa) = (81, 18, 1) of Example 11. For each bootstrap sample, we draw N = 100 genotypes with replacement from the original data to obtain random integer vectors of length three summing to 100. The ML estimate is computed for each of a total of B bootstrap samples. The resulting distributions of \( {\hat{p}}_{\mathrm{ML}} \) are shown in Fig. 5, for B = 100, 1000, and 10,000. The means of these empirical distributions are 0.899, 0.9004, and 0.9001, respectively, and 95% bootstrap confidence intervals can be derived from the 2.5 and 97.5% quantiles of the distributions. For B = 100, 1000, and 10,000, we obtain, respectively, [0.8598, 0.9350], [0.860, 0.940], and [0.855, 0.940]. The basic bootstrap confidence intervals have several limitations, including bias of the bootstrap estimator and skewness of the bootstrap distribution. Other methods exist for constructing confidence intervals from the bootstrap distribution to address some of them [9]. □
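A minimal implementation of this procedure might look as follows (a sketch; the helper p_ml is ours and encodes the MLE of Example 11):

```python
import numpy as np

rng = np.random.default_rng(3)
genotypes = np.repeat([0, 1, 2], [81, 18, 1])  # AA=0, Aa=1, aa=2; N = 100

def p_ml(sample):
    n_AA, n_Aa = np.sum(sample == 0), np.sum(sample == 1)
    return (2 * n_AA + n_Aa) / (2 * len(sample))

B = 10_000
boot = np.array([p_ml(rng.choice(genotypes, size=100, replace=True))
                 for _ in range(B)])
print(boot.mean())                        # ~0.9
print(np.quantile(boot, [0.025, 0.975]))  # ~[0.855, 0.940]
```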

Fig. 5

Bootstrap analysis of the ML allele frequency. The bootstrap distribution of the maximum likelihood estimator \( {\hat{p}}_{\mathrm{ML}}=\left(2{n}_{\mathrm{AA}}+{n}_{\mathrm{Aa}}\right)/ (2N) \) of the major allele frequency in the Hardy–Weinberg model is plotted for B = 100 (left), B = 1000 (center), and B = 10,000 (right) bootstrap samples, for the data set (n AA, n Aa, n aa) = (81, 18, 1)

The Bayesian approach takes a different point of view and regards the model parameters as random variables [10]. Inference is then concerned with estimating the joint distribution of the parameters θ given the observed data \( \mathcal{D} \). By Bayes’ theorem (Eq. 30), we have

$$ P\left(\theta \mid \mathcal{D}\right)=\frac{P\left(\mathcal{D}\mid \theta \right)P\left(\theta \right)}{P\left(\mathcal{D}\right)}=\frac{P\left(\mathcal{D}\mid \theta \right)P\left(\theta \right)}{\int_{\theta \in \Theta}P\left(\mathcal{D}\mid \theta \right)P\left(\theta \right)\kern0.3em d\theta},\kern1.00em $$
(40)

that is, the posterior probability of the parameters is proportional to the likelihood of the data times the prior probability of the parameters. It follows that, for a uniform prior, the mode of the posterior is equal to the MLE.

From the posterior, credible intervals of parameter estimates can be derived such that the parameter lies in the interval with a certain probability, say 95%. This is in contrast to a 95% confidence interval in the frequentist approach because, there, the parameter is fixed and the interval boundaries are random variables. The meaning of a confidence interval is that 95% of similar intervals would contain the true parameter, if intervals were constructed independently from additional identically distributed data.

The prior P(θ) encodes our a priori belief in θ before observing the data. It can be used to incorporate domain-specific knowledge into the model, but it may also be uninformative or objective, in which case all parameter values are equally likely, or nearly so, a priori. However, it can sometimes be difficult to find noninformative priors. In practice, conjugate priors are most often used. A prior is conjugate to a likelihood if the posterior obtained by multiplying the two belongs to the same distribution family as the prior. Conjugate priors are mathematically convenient and computationally efficient because the posterior can be calculated analytically for a wide range of statistical models.

Example 13 (Dirichlet Prior)

Let T = (T 1, …, T K) be a continuous random variable with state space ΔK−1. The Dirichlet distribution \( \mathrm{Dir}\left(\alpha \right) \) with parameters \( \alpha \in {\mathbb{R}}_{+}^K \) has probability density function:

$$ f\left({\theta}_1,\dots, {\theta}_K\right)\kern0.5em =\kern0.5em \frac{\Gamma \left(\sum \limits_{i=1}^K{\alpha}_i\right)}{\prod \limits_{i=1}^K\Gamma \left({\alpha}_i\right)}\prod \limits_{i=1}^K{\theta}_i^{\alpha_i-1},\kern0.5em $$
(41)

where Γ is the gamma function. The Dirichlet prior is conjugate to the multinomial likelihood: If \( T\sim \mathrm{Dir}\left(\alpha \right) \) and \( \left(X\mid T=\theta \right)\sim \mathrm{Mult}\left(n,{\theta}_1,\dots, {\theta}_K\right) \), then \( \left(\theta \mid X=x\right)\sim \mathrm{Dir}\left(\alpha +x\right) \). For K = 2, this distribution is called the beta distribution. Hence, the beta distribution is the conjugate prior to the binomial likelihood. □

Example 14 (Posterior Probability of Genotype Frequencies)

Let us consider the simple genetic system of Example 1, with one locus and two alleles, but without assuming the Hardy–Weinberg model. We regard the observed genotype frequencies (n AA, n Aa, n aa) = (81, 18, 1) as the result of a draw from a multinomial distribution \( \mathrm{Mult}\left(n,{\theta}_{\mathrm{AA}},{\theta}_{\mathrm{Aa}},{\theta}_{\mathrm{aa}}\right) \). Assuming a Dirichlet prior \( \mathrm{Dir}\left({\alpha}_{\mathrm{AA}},{\alpha}_{\mathrm{Aa}},{\alpha}_{\mathrm{aa}}\right) \), the posterior genotype probabilities follow the Dirichlet distribution \( \mathrm{Dir}\left({\alpha}_{\mathrm{AA}}+{n}_{\mathrm{AA}},{\alpha}_{\mathrm{Aa}}+{n}_{\mathrm{Aa}},{\alpha}_{\mathrm{aa}}+{n}_{\mathrm{aa}}\right) \). In Fig. 6, the prior \( \mathrm{Dir}\left(10,\kern0.3em 10,\kern0.3em 10\right) \) is shown on the left, the multinomial likelihood P((n AA, n Aa, n aa) = (81, 18, 1) ∣ θ AA, θ Aa, θ aa) in the center, and the resulting posterior \( \mathrm{Dir}\left(10+81,\kern0.3em 10+18,\kern0.3em 10+1\right) \) on the right. Note that the MLE is different from the mode of the posterior. As compared to the likelihood, the nonuniform prior has shifted the maximum of the posterior toward the center of the probability simplex. □
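Because of conjugacy, the posterior update is a single vector addition (a sketch using scipy.stats.dirichlet):

```python
import numpy as np
from scipy.stats import dirichlet

alpha = np.array([10, 10, 10])      # prior Dir(10, 10, 10)
counts = np.array([81, 18, 1])      # observed genotype counts

posterior = alpha + counts          # conjugate update: Dir(91, 28, 11)
print(dirichlet.mean(posterior))    # posterior mean of (theta_AA, theta_Aa, theta_aa)
```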

Fig. 6

Dirichlet prior for multinomial likelihood. The Dirichlet prior is conjugate to the multinomial likelihood. Shown are contour lines of the prior \( \mathrm{Dir}\left(10,\kern0.3em 10,\kern0.3em 10\right) \) on the left, the multinomial likelihood P((n AA, n Aa, n aa) = (81, 18, 1)∣θ AA, θ Aa, θ aa) in the center, and the resulting posterior \( \mathrm{Dir}\left(91,\kern0.3em 28,\kern0.3em 11\right) \) on the right. The posterior is the product of prior and likelihood

We often have two or more competing models and would like to assess which one best describes the given data. For example, we may have observed genotypes from the set {AA, Aa, aa} and want to test whether the Hardy–Weinberg model (Example 1) is a more appropriate description of the genotype data than the multinomial model of the previous Example 14. Intuitively, we might want to select the model that fits the data best, for example, by comparing their likelihoods. However, the Hardy–Weinberg model has only one parameter, namely the allele frequency p, whereas the multinomial model has three parameters subject to the constraint θ AA + θ Aa + θ aa = 1. Hence, the number of free parameters is one and two, respectively, for the two models. This difference in model complexity makes a comparison based only on goodness of fit invalid, because models with more parameters, i.e., higher complexity, can generally provide a better fit. Estimating model complexity and scoring models based on both model complexity and goodness of fit is therefore essential for model comparison and model selection.

The goal of model selection is to find the model that best generalizes to unseen data, rather than the one that just fits the observed data, because we seek the model capable of the most accurate predictions. A model that fits well but generalizes poorly is said to overfit the data; models that are too complex tend to do so. Model selection can thus be regarded as finding the right level of model complexity for the given data, such that the predictive performance is optimized. This involves defining a criterion of optimality and a procedure for finding the optimal model.

A common frequentist approach to model selection is the likelihood ratio. For a data set \( \mathcal{D} \), we compare a null model, \( {\mathrm{M}}_0 \), to an alternative model, \( {\mathrm{M}}_1 \), at given point estimates using the ratio of their likelihoods:

$$ \Lambda \left(\mathcal{D}\right)=\frac{L\left({\hat{\theta}}_0\right)}{L\left({\hat{\theta}}_1\right)}\kern1.00em $$
(42)

If \( \Lambda \left(\mathcal{D}\right)<c \), for a defined threshold c, we reject the null model and favor the alternative model. The choice of c should be informed by the distribution of Λ under the null. If the two models are nested, i.e., if \( {\mathrm{M}}_0 \) can be obtained from \( {\mathrm{M}}_1 \) by specifying a subset of the parameters, then \( -2\kern0.3em \log \Lambda \) is approximately χ 2-distributed with degrees of freedom equal to the difference in the number of free parameters between \( {\mathrm{M}}_1 \) and \( {\mathrm{M}}_0 \).
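As an illustration, the following sketch tests the nested Hardy–Weinberg model (one free parameter) against the full multinomial model (two free parameters) for hypothetical genotype counts of our own choosing:

```python
import numpy as np
from scipy.stats import chi2

n = np.array([50, 20, 30])  # hypothetical counts (n_AA, n_Aa, n_aa)
N = n.sum()

# Null model M0: Hardy-Weinberg with MLE p = (2 n_AA + n_Aa) / (2N)
p = (2 * n[0] + n[1]) / (2 * N)
probs0 = np.array([p ** 2, 2 * p * (1 - p), (1 - p) ** 2])

# Alternative model M1: unrestricted multinomial, MLE = relative frequencies
probs1 = n / N

stat = 2 * np.sum(n * np.log(probs1 / probs0))  # -2 log Lambda
print(stat, chi2.sf(stat, df=2 - 1))            # test statistic and p-value
```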

In the Bayesian framework, it is natural to compare the posterior probabilities of the two models. By Bayes theorem, we have, for i = 0, 1:

$$ P\left({\mathrm{M}}_i\mid \mathcal{D}\right)=\frac{P\left(\mathcal{D}\mid {\mathrm{M}}_i\right)P\left({\mathrm{M}}_i\right)}{P\left(\mathcal{D}\right)}\kern1.00em $$
(43)

where:

$$ P\left(\mathcal{D}\mid {\mathrm{M}}_i\right)=\int P\left(\mathcal{D}\mid {\theta}_i,\kern0.3em {\mathrm{M}}_i\right)P\left({\theta}_i\mid {\mathrm{M}}_i\right)\kern0.3em d{\theta}_i\kern1.00em $$
(44)

is the marginal likelihood. The marginal likelihood accounts for model complexity and for uncertainty in parameter estimates, but is usually analytically intractable and costly to compute. Various approximations of the marginal likelihood exist that give rise to model selection scores, such as the Bayesian information criterion (BIC; see Subheading 7) and the Akaike information criterion (AIC) [11].

For Bayesian model comparison, we consider the posterior odds:

$$ \frac{P\left({\mathrm{M}}_0\mid \mathcal{D}\right)}{P\left({\mathrm{M}}_1\mid \mathcal{D}\right)}=\frac{P\left(\mathcal{D}\mid {\mathrm{M}}_0\right)}{P\left(\mathcal{D}\mid {\mathrm{M}}_1\right)}\kern0.3em \frac{P\left({\mathrm{M}}_0\right)}{P\left({\mathrm{M}}_1\right)}\kern1.00em $$
(45)

The ratio of the marginal likelihoods, i.e., the first factor on the right-hand side of Eq. 45, is called the Bayes factor. With equal priors, a Bayes factor larger than 20 is often considered strong support for \( {\mathrm{M}}_0 \) over \( {\mathrm{M}}_1 \) [12].

Exercise 15 (Poisson Distribution)

We wish to model the number of bacterial colonies in a Petri dish and assume that the count data of this experiment follows a Poisson distribution \( \mathrm{Pois}\left(\lambda \right) \) (Example 3). Derive the log-likelihood function of this model and calculate the MLE of the model parameter λ. Suppose now that the number of bacterial colonies on a Petri dish follows the Poisson distribution with mean λ = 5. What is the probability of finding exactly three colonies?

3 Hidden Data and the EM Algorithm

We often cannot observe all relevant random variables due to, for example, experimental limitations or study designs. In this case, a statistical model P(X, Z ∣ θ), with θ ∈ Θ, consists of the observed random variable X and the hidden (or latent) random variable Z, both of which can be multivariate. In this section, we write X = (X (1), …, X (N)) for the random variables describing the N observations and refer to X also as the observed data. The hidden data for this model is Z = (Z (1), …, Z (N)) and the complete data is (X, Z). For convenience, we assume the parameter space Θ to be continuous and the state spaces \( \mathcal{X} \) of X and \( \mathcal{Z} \) of Z to be discrete.

In the Bayesian framework, one does not distinguish between unknown parameters and hidden data, and it is natural to assess the joint posterior P(θ, Z ∣ X) ∝ P(X ∣ θ, Z)P(θ, Z), which is P(X, Z ∣ θ)P(θ) if priors are independent, i.e., if P(θ, Z) = P(θ)P(Z). Alternatively, if the distribution of the hidden data Z is not of interest, it can be marginalized out. Then, the posterior (Eq. 40) becomes

$$ P\left(\theta \mid X\right)=\frac{\sum_ZP\left(X,Z\mid \theta \right)P\left(\theta \right)}{\int_{\theta \in \Theta}{\sum}_ZP\left(X,Z\mid \theta \right)P\left(\theta \right)\kern0.3em d\theta}.\kern1.00em $$
(46)

In the likelihood framework, it can be more efficient to estimate the hidden data, rather than marginalizing over it. The hidden (or complete-data) log-likelihood is

$$ {\ell}_{\mathrm{hid}}\left(\theta \right)=\log P\left(X,Z\mid \theta \right)=\sum \limits_{i=1}^N\log P\left({X}^{(i)},{Z}^{(i)}\mid \theta \right).\kern1.00em $$
(47)

For ML parameter estimation, we need to consider the observed log-likelihood:

$$ {\displaystyle \begin{array}{rcll}& & {\ell}_{\mathrm{obs}}\left(\theta \right)=\log P\left(X\mid \theta \right)=\log \sum \limits_ZP\left(X,Z\mid \theta \right)& \\ {}& & \kern1em =\log \sum \limits_{Z^{(1)}\in \mathcal{Z}}\dots \sum \limits_{Z^{(N)}\in \mathcal{Z}}\prod \limits_{i=1}^NP\left({X}^{(i)},{Z}^{(i)}\mid \theta \right).& \end{array}} $$
(48)

This likelihood function is usually very difficult to maximize and one has to resort to numerical optimization techniques. Generic local methods, such as gradient descent or Newton’s method, can be used, but there is also a more specific local optimization procedure, which avoids computing any derivatives of the likelihood function, called the expectation maximization (EM) algorithm [13].

In order to maximize the likelihood function (Eq. 48), we consider any distribution q(Z) of the hidden data Z and write

$$ {\ell}_{\mathrm{obs}}\left(\theta \right)=\log \sum \limits_Zq(Z)\frac{P\left(X,Z\mid \theta \right)}{q(Z)}=\log \mathrm{E}\left[P\left(X,Z\mid \theta \right)/ q(Z)\right],\kern1.00em $$
(49)

where the expected value is with respect to q(Z). Jensen’s inequality applied to the concave \( \log \) function asserts that \( \log \kern0.15em \mathrm{E}\left[Y\right]\ge \mathrm{E}\left[\log Y\right] \). Hence, the observed log-likelihood is bounded from below by \( \mathrm{E}\left[\log \left(P\left(X,Z\mid \theta \right)/ q(Z)\right)\right] \), or

$$ {\ell}_{\mathrm{obs}}\left(\theta \right)\ge \mathrm{E}\left[{\ell}_{\mathrm{hid}}\left(\theta \right)\right]+H(q),\kern1.00em $$
(50)

where \( H(q)=-\mathrm{E}\left[\log q(Z)\right] \) is the entropy. The idea of the EM algorithm is to maximize this lower bound instead of obs(θ) itself. Intuitively, this task is easier because the big sum over the hidden data in Eq. 48 disappears on the right-hand side of Eq. 50 upon taking expectations.

The EM algorithm is an iterative procedure alternating between an E step and an M step. In the E step, the lower bound (Eq. 50) is maximized with respect to the distribution q by setting q(Z) = P(Z ∣ X, θ (t)), where θ (t) is the current estimate of θ, and computing the expected value of the hidden log-likelihood:

$$ Q\left(\theta \mid {\theta}^{(t)}\right)={\mathrm{E}}_{Z\mid X,{\theta}^{(t)}}\left[{\ell}_{\mathrm{hid}}\left(\theta \right)\right].\kern1.00em $$
(51)

In the M step, Q is maximized with respect to θ to obtain an improved estimate:

$$ {\theta}^{\left(t+1\right)}=\arg \underset{\theta }{\max }Q\left(\theta \mid {\theta}^{(t)}\right).\kern1.00em $$
(52)

The sequence θ (1), θ (2), θ (3), … converges to a local maximum of the likelihood surface (Eq. 48). The global maximum and, hence, the MLE is generally not guaranteed to be found with this local optimization method. In practice, the EM algorithm is often run repeatedly with many different starting solutions θ (1), or with few very reasonable starting solutions obtained from other heuristics or educated guesses.

Example 16 (Naive Bayes)

Let us assume that we observe realizations of a discrete random variable (X 1, …, X L) and we want to cluster observations into K distinct groups. For this purpose, we introduce a hidden random variable Z with state space \( \mathcal{Z}=\left[K\right]=\left\{1,\dots, K\right\} \) indicating class membership. The joint probability of (X 1, …, X L) and Z is

$$ {\displaystyle \begin{array}{rc}P\left({X}_1,\dots, {X}_L,Z\right)=P(Z)P\left({X}_1,\dots, {X}_L\mid Z\right)& \\ {}=P(Z)\prod \limits_{n=1}^LP\left({X}_n\mid Z\right).& \end{array}} $$
(53)

The marginalization of this model with respect to the hidden data Z is the unsupervised naive Bayes model. The observed variables X n are often called features and Z the latent class variable (Fig. 7).

Fig. 7

Graphical representation of the naive Bayes model. Observed features X n are conditionally independent given the latent class variable Z

The model parameters are the class prior P(Z), which we assume to be constant and will ignore, and the conditional probabilities θ n,kx = P(X n = x ∣ Z = k). The complete-data likelihood of observed data X = (X (1), …, X (N)) and hidden data Z = (Z (1), …, Z (N)) is

$$ P\left(X,Z\mid \theta \right)=\prod \limits_{i=1}^NP\left({X}^{(i)},{Z}^{(i)}\mid \theta \right)=\prod \limits_{i=1}^NP\left({Z}^{(i)}\right)\prod \limits_{n=1}^LP\left({X}_n^{(i)}\mid {Z}^{(i)}\right)\kern0.5em $$
(54)
$$ \kern0.5em \propto \kern0.5em \prod \limits_{i=1}^N\prod \limits_{n=1}^L{\theta}_{n,{Z}^{(i)}{X}_n^{(i)}}=\prod \limits_{i=1}^N\prod \limits_{n=1}^L\prod \limits_{k\in \left[K\right]}\prod \limits_{x\in \mathcal{X}}{\theta}_{n, kx}^{I_{n, kx}\left({Z}^{(i)}\right)},\kern0.5em $$
(55)

where I n,kx(Z (i)) is equal to one if and only if Z (i) = k and \( {X}_n^{(i)}=x \), and zero otherwise.

To apply the EM algorithm for estimating θ without observing Z, we consider the hidden log-likelihood:

$$ {\ell}_{\mathrm{hid}}\left(\theta \right)=\log P\left(X,Z\mid \theta \right)=\sum \limits_{i=1}^N\sum \limits_{n=1}^L\sum \limits_{k\in \left[K\right]}\sum \limits_{x\in \mathcal{X}}{I}_{n, kx}\left({Z}^{(i)}\right)\log {\theta}_{n, kx}.\kern1.00em $$
(56)

In the E step, we compute the expected values of the indicators I n,kx(Z (i)):

$$ {\gamma}_{n, kx}^{(i)}={\mathrm{E}}_{Z\mid X,{\theta}^{\prime }}\left[{I}_{n, kx}\left({Z}^{(i)}\right)\right]=\frac{P\left({X}_n^{(i)}=x\mid {Z}^{(i)}=k\right)}{\sum_{k^{\prime}\in \left[K\right]}P\left({X}_n^{(i)}=x\mid {Z}^{(i)}={k}^{\prime}\right)}=\frac{\theta_{n, kx}^{\prime }}{\sum_{k^{\prime}\in \left[K\right]}{\theta}_{n,{k}^{\prime }x}^{\prime }}, $$
(57)

where θ′ is the current estimate of θ. The expected value \( {\gamma}_{n, kx}^{(i)} \) is sometimes referred to as the responsibility of class k for observation \( {X}_n^{(i)}=x \). The expected hidden log-likelihood can be written in terms of the expected counts \( {N}_{n, kx}={\sum}_{i=1}^N{\gamma}_{n, kx}^{(i)} \) as:

$$ {\mathrm{E}}_{Z\mid X,\theta \prime}\left[{\ell}_{\mathrm{hid}}\left(\theta \right)\right]=\sum \limits_{n=1}^L\sum \limits_{k\in \left[K\right]}\sum \limits_{x\in \mathcal{X}}{N}_{n, kx}\log {\theta}_{n, kx}.\kern1.00em $$
(58)

In the M step, maximization of this sum yields \( {\hat{\theta}}_{n, kx}={N}_{n, kx}/ {\sum}_{x\prime }{N}_{n, kx\prime } \). □
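A compact implementation of this EM algorithm might look as follows (a sketch under the uniform-class-prior assumption made above; function and variable names are ours):

```python
import numpy as np

def naive_bayes_em(X, K, n_iter=100, seed=0):
    """EM for the unsupervised naive Bayes model (uniform class prior).
    X: (N, L) array of discrete symbols; K: number of latent classes."""
    rng = np.random.default_rng(seed)
    N, L = X.shape
    S = X.max() + 1                                 # symbol alphabet size
    theta = rng.dirichlet(np.ones(S), size=(L, K))  # theta[n, k, x] = P(X_n=x | Z=k)

    for _ in range(n_iter):
        # E step: responsibilities gamma[i, k] propto prod_n theta[n, k, X[i, n]]
        log_gamma = np.zeros((N, K))
        for n in range(L):
            log_gamma += np.log(theta[n][:, X[:, n]]).T
        gamma = np.exp(log_gamma - log_gamma.max(axis=1, keepdims=True))
        gamma /= gamma.sum(axis=1, keepdims=True)

        # M step: expected counts N_{n,kx}, then normalize over symbols x
        for n in range(L):
            for x in range(S):
                # small constant guards against classes with zero responsibility
                theta[n, :, x] = gamma[X[:, n] == x].sum(axis=0) + 1e-12
            theta[n] /= theta[n].sum(axis=1, keepdims=True)
    return theta, gamma
```

For simplicity, the sketch computes the responsibility of class k for a whole observation, multiplying over all L features, which corresponds to the E step for the full model of Eq. 53.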

4 Markov Chains

A stochastic process \( \left\{{X}_t,t\in \mathcal{T}\right\} \) is a collection of random variables with common state space \( \mathcal{X} \). The index set \( \mathcal{T} \) is usually interpreted as time and X t is the state of the process at time t. A discrete-time stochastic process X = (X 1, X 2, X 3, … ) is called a Markov chain [14] if X n+1 ⊥ (X 1, …, X n−1) ∣ X n for all n ≥ 2 or, equivalently, if each state depends only on its immediate predecessor:

$$ P\left({X}_n\mid {X}_{n-1},\dots, {X}_1\right)=P\left({X}_n\mid {X}_{n-1}\right),\kern1em \mathrm{for}\ \mathrm{all}\kern0.3em n\ge 2.\kern1.00em $$
(59)

We consider here Markov chains with finite state space \( \mathcal{X}=\left[K\right]=\left\{1,\dots, K\right\} \) that are homogeneous, i.e., with transition probabilities independent of time:

$$ {T}_{kl}=P\left({X}_{n+1}=l\mid {X}_n=k\right),\kern1em \mathrm{for}\ \mathrm{all}\kern0.3em k,l\in \left[K\right],\kern0.3em n\ge 1. $$
(60)

The finite-state homogeneous Markov chain is a statistical model denoted \( \mathrm{MC}\left(\Pi, T\right) \) and defined by the initial state distribution Π ∈ ΔK−1, where Πk = P(X 1 = k), and the stochastic K × K transition matrix T = (T kl).

We can generalize the one-step transition probabilities T kl to:

$$ {T}_{kl}^n=P\left({X}_{n+j}=l\mid {X}_j=k\right),\kern1.00em $$
(61)

the probability of jumping from state k to state l in n time steps. Any (n + m)-step transition can be regarded as an n-step transition followed by an m-step transition. Because the intermediate state i is unknown, summing over all possible values yields the decomposition:

$$ {T}_{kl}^{n+m}=\sum \limits_{i=1}^K{T}_{ki}^n{T}_{il}^m,\kern1em \mathrm{for}\ \mathrm{all}\kern0.3em n,m\ge 1,\kern0.3em k,l\in \left[K\right],\kern1.00em $$
(62)

known as the Chapman–Kolmogorov equations. In matrix notation, they can be written as T (n+m) = T (n) T (m). It follows that the n-step transition matrix is the n-th matrix power of the one-step transition matrix, T (n) = T n.
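In code, the Chapman–Kolmogorov equations amount to matrix multiplication (a small sketch with an arbitrary two-state chain of our own choosing):

```python
import numpy as np

T = np.array([[0.9, 0.1],
              [0.2, 0.8]])              # one-step transition matrix

T5 = np.linalg.matrix_power(T, 5)       # five-step transitions, T^(5) = T^5
T2, T3 = (np.linalg.matrix_power(T, m) for m in (2, 3))
assert np.allclose(T3 @ T2, T5)         # Chapman-Kolmogorov: T^(3+2) = T^(3) T^(2)
print(T5[0, 1])                         # P(X_{n+5} = 2 | X_n = 1)
```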

A state l of a Markov chain is accessible from state k if \( {T}_{kl}^n>0 \) for some n ≥ 0. We say that k and l communicate with each other and write k ∼ l if they are accessible from one another. State communication is reflexive (k ∼ k), symmetric (k ∼ l ⇒ l ∼ k), and, by the Chapman–Kolmogorov equations, transitive (j ∼ k ∼ l ⇒ j ∼ l). Hence, it defines an equivalence relation on the state space. The Markov chain is irreducible if it has a single communication class, i.e., if any state is accessible from any other state.

A state is recurrent if the Markov chain will reenter it with probability one. Otherwise, the state is transient. In finite-state Markov chains, recurrent states are also positive recurrent, i.e., the expected time to return to the state is finite. A state is aperiodic if the process can return to it after any time n ≥ 1. Recurrence, positive recurrence, and aperiodicity are class properties: if they hold for a state k, then they also hold for all states communicating with k.

A Markov chain is ergodic if it is irreducible, aperiodic, and positive recurrent. An ergodic Markov chain has a unique stationary distribution π given by:

$$ {\pi}_l=\underset{n\to \infty }{\lim }{T}_{kl}^n=\sum \limits_{k=1}^K{\pi}_k{T}_{kl},\kern2em l\in \left[K\right],\kern1em \sum \limits_{l=1}^K{\pi}_l=1\kern1.00em $$
(63)

independent of the initial distribution Π. In matrix notation, π is the solution of π t = π t T.

Example 17 (Two-State Markov Chain)

Consider the Markov chain with state space {1, 2} and transition probabilities T 12 = α > 0 and T 21 = β > 0. Clearly, the chain is ergodic and its stationary distribution π is given by:

$$ \left(\begin{array}{ll}\hfill {\pi}_1\hfill & \hfill {\pi}_2\hfill \end{array}\right)=\left(\begin{array}{ll}\hfill {\pi}_1\hfill & \hfill {\pi}_2\hfill \end{array}\right)\left(\begin{array}{ll}\hfill 1-\alpha \hfill & \hfill \alpha \hfill \\ {}\hfill \beta \hfill & \hfill 1-\beta \hfill \end{array}\right)\kern1.00em $$
(64)

or, equivalently, απ 1 = βπ 2. With π 1 + π 2 = 1, we obtain \( {\pi}^t={\left(\alpha +\beta \right)}^{-1}\left(\beta, \alpha \right) \). □
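Numerically, π can be found as the left eigenvector of T with eigenvalue 1 (a sketch verifying the closed form for α = 0.1, β = 0.2):

```python
import numpy as np

alpha, beta = 0.1, 0.2
T = np.array([[1 - alpha, alpha],
              [beta, 1 - beta]])

w, v = np.linalg.eig(T.T)                        # left eigenvectors of T
pi = np.real(v[:, np.argmin(np.abs(w - 1.0))])   # eigenvector for eigenvalue 1
pi /= pi.sum()
print(pi)                                        # (beta, alpha)/(alpha+beta) = [2/3, 1/3]
```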

In Example 17, if α = 0, then state 1 is called an absorbing state because once entered it is never left. In evolutionary biology and population genetics, Markov chains are often used to model evolving populations, and the fixation probability of an allele can be computed as the absorption probability in such models.

Example 18 (Wright–Fisher Process)

We consider two alleles, A and a, in a diploid population of size N. The total number of A alleles in generation n is described by a Markov chain X n with state space {0, 1, 2, …, 2N}. We assume that individuals mate randomly and that maternal and paternal alleles are chosen randomly such that \( \left({X}_{n+1}\mid {X}_n\right)\sim \mathrm{Binom}\left(2N,k/ (2N)\right) \), where k is the number of A alleles in generation n. The Markov chain has transition probabilities:

$$ {T}_{kl}=\left(\genfrac{}{}{0.0pt}{}{2N}{l}\right){\left(\frac{k}{2N}\right)}^l{\left(\frac{2N-k}{2N}\right)}^{2N-l}.\kern1.00em $$
(65)

If the initial number of A alleles is X 1 = k, then E(X 1) = k. After binomial sampling, E(X 2) = 2N(k∕(2N)) = k and hence E(X n) = k for all n ≥ 1. The Markov chain has the two absorbing states 0 and 2N, which correspond, respectively, to extinction and fixation of the A allele. To compute the fixation probability h k of A given k initial copies of it:

$$ {h}_k=\underset{n\to \infty }{\lim }P\left({X}_n=2N\mid {X}_1=k\right),\kern1.00em $$
(66)

we consider the expected value, which is equal to k, in the limit as n → ∞ to obtain

$$ k=\underset{n\to \infty }{\lim}\mathrm{E}\left({X}_n\right)=0\cdot \left(1-{h}_k\right)+2N\cdot {h}_k.\kern1.00em $$
(67)

Thus, the fixation probability is just h k = k∕(2N), the initial relative frequency of the allele. The Wright–Fisher process [15, 16] is a basic stochastic model for random genetic drift, i.e., for the variation in allele frequencies only due to random sampling. □
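The process is straightforward to simulate, and the absorption probabilities can be checked empirically (a sketch; the parameter values are our own choice):

```python
import numpy as np

rng = np.random.default_rng(4)
N, k = 50, 10                     # population size and initial copies of A

def fixed(k, N, rng):
    x = k
    while 0 < x < 2 * N:          # iterate binomial sampling until absorption
        x = rng.binomial(2 * N, x / (2 * N))
    return x == 2 * N             # True if allele A reached fixation

runs = 10_000
print(sum(fixed(k, N, rng) for _ in range(runs)) / runs)  # ~ k/(2N) = 0.1
```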

If we observe data X = (X (1), …, X (N)) from a finite Markov chain \( \mathrm{MC}\left(\Pi, T\right) \) of length L, then the likelihood is

$$ {\displaystyle \begin{array}{r}L\left(\Pi, T\right)=\prod \limits_{i=1}^NP\left({X}^{(i)}\right)=\prod \limits_{i=1}^NP\left({X}_1^{(i)}\right)\prod \limits_{n=1}^{L-1}P\left({X}_{n+1}^{(i)}\mid {X}_n^{(i)}\right)\\ {}=\prod \limits_{i=1}^N{\Pi}_{X_1^{(i)}}\prod \limits_{n=1}^{L-1}{T}_{X_n^{(i)},{X}_{n+1}^{(i)}},\end{array}} $$
(68)

which can be rewritten as:

$$ L\left(\Pi, T\right)=\prod \limits_{i=1}^N\prod \limits_{k\in \left[K\right]}{\Pi}_k^{N_k\left({X}^{(i)}\right)}\prod \limits_{k\in \left[K\right]}\prod \limits_{l\in \left[K\right]}{T}_{kl}^{N_{kl}\left({X}^{(i)}\right)}=\prod \limits_{k\in \left[K\right]}{\Pi}_k^{N_k}\prod \limits_{k\in \left[K\right]}\prod \limits_{l\in \left[K\right]}{T}_{kl}^{N_{kl}},\kern1.00em $$
(69)

with N kl(X (i)) the number of observed transitions from state k into state l in observation X (i), and \( {N}_{kl}={\sum}_{i=1}^N{N}_{kl}\left({X}^{(i)}\right) \) the total number of k-to-l transitions in the data, and similarly N k(X (i)) and N k the number of times the i-th chain, respectively all chains, started in state k.
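Maximizing Eq. 69 yields relative frequencies as MLEs, in analogy to Example 10; a sketch of the counting and estimation (with two short hypothetical chains of our own choosing) follows:

```python
import numpy as np

K = 2
chains = [np.array([0, 0, 1, 0, 1]),    # hypothetical observations X^(i)
          np.array([1, 1, 0, 0, 0])]

N_k = np.zeros(K)                       # initial-state counts
N_kl = np.zeros((K, K))                 # transition counts
for x in chains:
    N_k[x[0]] += 1
    for a, b in zip(x[:-1], x[1:]):
        N_kl[a, b] += 1

Pi_hat = N_k / N_k.sum()                          # MLE of Pi
T_hat = N_kl / N_kl.sum(axis=1, keepdims=True)    # MLE of T, row-normalized
print(Pi_hat, T_hat, sep="\n")
```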

Exercise 19 (Markov Chains)

Let us consider a simple infectious disease model, where each individual is either healthy (H) or diseased (D). We assume a two-state Markov chain to describe infection-related disease and recovery via clearance of the pathogen.

The probability of a healthy individual becoming sick due to infection is α = 0.6, and the probability of a diseased individual to clear the infection and recover is β = 0.9. The initial probabilities for health and disease are P(H) = 0.7 and P(D) = 0.3. Write down the transition matrix T of this Markov chain. What is the probability of observing the disease trajectories DDHHD and HDHDH? Calculate the stationary distribution of the Markov chain.

5 Continuous-Time Markov Chains

A continuous-time stochastic process {X(t), t ≥ 0} with finite state space [K] is a continuous-time Markov chain if

$$ {\displaystyle \begin{array}{r}P\left[X\left(t+s\right)=l\mid X(s)=k,\kern0.3em X(u)=x(u),\kern0.3em 0\le u<s\right]\\ {}=P\left[X\left(t+s\right)=l\mid X(s)=k\right]\end{array}} $$
(70)

for all s, t ≥ 0, k, l, x(u) ∈ [K], 0 ≤ u < s. The chain is homogeneous if Eq. 70 is independent of s. The transition probabilities are then denoted:

$$ {T}_{kl}(t)=P\left[X\left(t+s\right)=l\mid X(s)=k\right].\kern1.00em $$
(71)

It can be shown that the transition matrix T(t) is the matrix exponential of a constant rate matrix R times t:

$$ T(t)=\exp (Rt)=\sum \limits_{j=0}^{\infty}\frac{1}{j!}{(Rt)}^j.\kern1.00em $$
(72)

Example 20 (Jukes–Cantor Model)

Consider a fixed position in a DNA sequence, and let T kl(t) be the probability that, due to mutation, nucleotide k changes to nucleotide l after time t at this position (Fig. 8). The Jukes–Cantor model [17] is the simplest DNA substitution model. It assumes that the transition rates from any nucleotide to any other are equal:

$$ R=\left(\begin{array}{llll}\hfill -3\alpha \hfill & \hfill \alpha \hfill & \hfill \alpha \hfill & \hfill \alpha \hfill \\ {}\hfill \alpha \hfill & \hfill -3\alpha \hfill & \hfill \alpha \hfill & \hfill \alpha \hfill \\ {}\hfill \alpha \hfill & \hfill \alpha \hfill & \hfill -3\alpha \hfill & \hfill \alpha \hfill \\ {}\hfill \alpha \hfill & \hfill \alpha \hfill & \hfill \alpha \hfill & \hfill -3\alpha \hfill \end{array}\right).\kern1.00em $$
(73)

The resulting transition matrix \( T(t)=\exp (Rt) \) is

$$ T(t)=\frac{1}{4}\left(\begin{array}{llll}\hfill 1+3{e}^{-4\alpha t}\hfill & \hfill 1-{e}^{-4\alpha t}\hfill & \hfill 1-{e}^{-4\alpha t}\hfill & \hfill 1-{e}^{-4\alpha t}\hfill \\ {}\hfill 1-{e}^{-4\alpha t}\hfill & \hfill 1+3{e}^{-4\alpha t}\hfill & \hfill 1-{e}^{-4\alpha t}\hfill & \hfill 1-{e}^{-4\alpha t}\hfill \\ {}\hfill 1-{e}^{-4\alpha t}\hfill & \hfill 1-{e}^{-4\alpha t}\hfill & \hfill 1+3{e}^{-4\alpha t}\hfill & \hfill 1-{e}^{-4\alpha t}\hfill \\ {}\hfill 1-{e}^{-4\alpha t}\hfill & \hfill 1-{e}^{-4\alpha t}\hfill & \hfill 1-{e}^{-4\alpha t}\hfill & \hfill 1+3{e}^{-4\alpha t}\hfill \end{array}\right)\kern1.00em $$
(74)

and the stationary distribution as t →∞ is uniform, \( \pi ={\left(1/ 4,1/ 4,1/ 4,1/ 4\right)}^t \). □
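The closed form in Eq. 74 can be checked numerically against the matrix exponential of Eq. 72, for instance with SciPy; the values of α and t below are arbitrary:

```python
import numpy as np
from scipy.linalg import expm

alpha, t = 0.1, 2.0
R = alpha * (np.ones((4, 4)) - 4 * np.eye(4))    # rate matrix of Eq. 73
T = expm(R * t)                                   # Eq. 72

same = 0.25 * (1 + 3 * np.exp(-4 * alpha * t))    # diagonal of Eq. 74
diff = 0.25 * (1 - np.exp(-4 * alpha * t))        # off-diagonal of Eq. 74
T_closed = np.full((4, 4), diff) + (same - diff) * np.eye(4)
assert np.allclose(T, T_closed)
```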

Fig. 8

Nucleotide substitution model. The state space and transitions of a general nucleotide substitution model are shown. For the Jukes–Cantor model (Example 20), all transitions from any nucleotide to any other nucleotide have the same probability \( \frac{1}{4}\left(1-{e}^{-4\alpha t}\right) \)

Example 21 (The Poisson Process)

A continuous-time Markov chain X(t) is a counting process if X(t) represents the total number of events that occur by time t. It is a Poisson process if, in addition, X(0) = 0, the increments are independent, and in any interval of length t the number of events is Poisson distributed with rate λt:

$$ P\left[X\left(t+s\right)-X(s)=k\right]=P\left[X(t)=k\right]={e}^{-\lambda t}\frac{{\left(\lambda t\right)}^k}{k!}.\kern1.00em $$
(75)

The Poisson process is used, for example, to count mutations in a gene. □

Example 22 (Exponential Distribution)

The exponential distribution \( \mathrm{Exp}\left(\lambda \right) \) with parameter λ > 0 is a common distribution for waiting times. It is defined by the density function:

$$ f(x)=\lambda {e}^{-\lambda x},\kern1em \mathrm{for}\kern0.3em x\ge 0.\kern1.00em $$
(76)

If \( X\sim \mathrm{Exp}\left(\lambda \right) \), then X has expectation E(X) = λ −1 and variance \( \mathrm{Var}(X)={\lambda}^{-2} \). The exponential distribution is memoryless, which means that P(X > s + t ∣ X > t) = P(X > s), for all s, t > 0. An important consequence of the memoryless property is that the waiting times between successive events are i.i.d. For example, the waiting times τ n (n ≥ 1) between the events of a Poisson process, the sequence of interarrival times, are exponentially distributed, \( {\tau}_n\sim \mathrm{Exp}\left(\lambda \right) \), for all n ≥ 1. □
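The connection between exponential waiting times and Poisson counts (Examples 21 and 22) can be illustrated by simulation. This sketch, with arbitrary λ and t, accumulates \( \mathrm{Exp}\left(\lambda \right) \) interarrival times and compares the empirical mean count with λt:

```python
import numpy as np

rng = np.random.default_rng(0)
lam, t, reps = 2.0, 3.0, 10000

counts = np.empty(reps, dtype=int)
for i in range(reps):
    total, n = 0.0, 0
    while True:
        total += rng.exponential(1.0 / lam)  # Exp(lambda): scale = 1/lambda
        if total > t:
            break
        n += 1
    counts[i] = n

print(counts.mean(), lam * t)  # empirical vs. theoretical mean lambda * t
```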

Exercise 23 (Kimura Model)

The Kimura two-parameter model is a DNA substitution model that distinguishes transitions, i.e., purine-to-purine and pyrimidine-to-pyrimidine substitutions, from transversions, i.e., purine-to-pyrimidine and pyrimidine-to-purine substitutions [18]. It is defined by the rate matrix:

$$ R=\left(\begin{array}{cccc} -2\beta -\alpha & \beta & \alpha & \beta \\ {}\beta & -2\beta -\alpha & \beta & \alpha \\ {}\alpha & \beta & -2\beta -\alpha & \beta \\ {}\beta & \alpha & \beta & -2\beta -\alpha \end{array}\right),\kern1.00em $$

where \( \alpha, \kern0.3em \beta \in {\mathbb{R}}_{+} \) are the two substitution rates. Assuming that the Markov chain is ergodic, derive its stationary distribution.

6 Hidden Markov Models

A hidden Markov model (HMM) is a statistical model for hidden random variables Z = (Z 1, …, Z L), which form a homogeneous Markov chain, and observed random variables X = (X 1, …, X L). Each observed symbol X n depends on the hidden state Z n. The HMM is illustrated in Fig. 9. It encodes the following conditional independence statements:

$$ \kern0.5em {Z}_{n+1}\perp {Z}_{n-1}\mid {Z}_n,\kern2em 2\le n\le L-1\kern2em \left(\mathrm{Markov}\ \mathrm{property}\right)\kern0.5em $$
(77)
$$ \kern0.5em {X}_n\perp {X}_m\mid {Z}_n,\kern2em 1\le m,n\le L,\kern0.3em m\ne n\kern0.5em $$
(78)
Fig. 9

Hidden Markov model. Shaded nodes represent observed random variables (or symbols) X n, and clear nodes represent hidden states (or the annotation). Directed edges indicate statistical dependencies, given by transition probabilities among hidden states and by emission probabilities between hidden states and observed symbols

The parameters of the HMM consist of the initial state probabilities Π = P(Z 1), the transition probabilities T kl = P(Z n = l ∣ Z n−1 = k) of the Markov chain, and the emission probabilities E kx = P(X n = x ∣ Z n = k) of symbols \( x\in \mathcal{X} \). The HMM is denoted \( \mathrm{HMM}\left(\Pi, T,E\right) \). For simplicity, we restrict ourselves here to finite state spaces \( \mathcal{Z}=\left[K\right] \) of Z and \( \mathcal{X} \) of X. The joint probability of (Z, X) factorizes as:

$$ {\displaystyle \begin{array}{r}P\left(X,Z\right)=P\left({Z}_1\right)\prod \limits_{n=1}^{L}P\left({X}_n\mid {Z}_n\right)\prod \limits_{n=1}^{L-1}P\left({Z}_{n+1}\mid {Z}_n\right)\\ {}={\Pi}_{Z_1}\prod \limits_{n=1}^{L}{E}_{Z_n,{X}_n}\prod \limits_{n=1}^{L-1}{T}_{Z_n,{Z}_{n+1}}.\end{array}} $$
(79)

The HMM is typically used to model sequence data x = (x 1, x 2, …, x L) generated by different mechanisms z n which cannot be observed. Each observation x can be a time series or any other object with a linear dependency structure [19]. In computational biology, the HMM is frequently applied to DNA and protein sequence data, where it accounts for first-order spatial dependencies of nucleotides or amino acids [20].

Example 24 (CpG Islands)

CpG islands are CG-enriched regions in a DNA sequence. They are typically a few hundred to a few thousand base pairs long. We want to use a simple HMM to detect CpG islands in genomic DNA. The hidden states \( {Z}_n\in \mathcal{Z}=\left\{-,+\right\} \) indicate whether sequence position n belongs to a CpG island (+ ) or not (−). The observed sequence is given by the nucleotide at each position, \( {X}_n\in \mathcal{X}=\left\{\mathtt{A},\mathtt{C},\mathtt{G},\mathtt{T}\right\} \).

Suppose we observe the sequence x = (C, A, C, G). Then, we can calculate the joint probability of x and any state path z by Eq. 79. For example, if z = (+, −, −, +), then P(X = x, Z = z) = Π+ E +,C T +,− E −,A T −,− E −,C T −,+ E +,G. □
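A direct transcription of this calculation into Python might look as follows; the transition and emission values below are invented for illustration only and are not estimated from real CpG islands:

```python
import numpy as np

states = {'-': 0, '+': 1}
nts = {'A': 0, 'C': 1, 'G': 2, 'T': 3}

# Hypothetical parameters for illustration only.
Pi = np.array([0.9, 0.1])
T = np.array([[0.95, 0.05],
              [0.10, 0.90]])
E = np.array([[0.30, 0.20, 0.20, 0.30],   # emissions outside CpG islands
              [0.15, 0.35, 0.35, 0.15]])  # emissions inside CpG islands

def joint_prob(x, z):
    """P(X = x, Z = z) as in Eq. 79."""
    zi = [states[c] for c in z]
    xi = [nts[c] for c in x]
    p = Pi[zi[0]] * E[zi[0], xi[0]]
    for n in range(1, len(x)):
        p *= T[zi[n - 1], zi[n]] * E[zi[n], xi[n]]
    return p

print(joint_prob("CACG", "+--+"))
```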

Typically, one is interested in the hidden state path z = (z 1, z 2, …, z L) that gave rise to the observation x. For biological sequences, z is often called the annotation of x. In Example 24, the genomic sequence is annotated with CpG islands. For generic parameters, any state path can give rise to a given observed sequence, but with different probabilities. The decoding problem is to find the annotation z that maximizes the joint probability:

$$ {z}^{\ast }=\underset{z\in \mathcal{Z}}{\mathrm{argmax}}\kern0.3em P\left(X=x,Z=z\right).\kern1.00em $$
(80)

There are \( {K}^L \) possible state paths, so already for sequences of moderate length the optimization problem (Eq. 80) cannot be solved by enumerating all paths.

However, there is an efficient algorithm solving (Eq. 80) based on the following factorization along the Markov chain:

$$ {\displaystyle \begin{array}{rl}\underset{Z}{\max }\,P\left(X,Z\right)&=\underset{Z_1,\dots, {Z}_L}{\max }P\left({Z}_1\right)\prod \limits_{n=1}^{L}P\left({X}_n\mid {Z}_n\right)\prod \limits_{n=1}^{L-1}P\left({Z}_{n+1}\mid {Z}_n\right)\\ {}&=\underset{Z_L}{\max }\,P\left({X}_L\mid {Z}_L\right)\Big[\dots \Big[\underset{Z_2}{\max }\,P\left({Z}_3\mid {Z}_2\right)P\left({X}_2\mid {Z}_2\right)\\ {}&\kern2em \Big[\underset{Z_1}{\max }\,P\left({Z}_2\mid {Z}_1\right)P\left({X}_1\mid {Z}_1\right)\cdot P\left({Z}_1\right)\Big]\Big]\dots \Big].\end{array}} $$
(81)

Thus, the maximum over state paths (Z 1, …, Z L) can be obtained by recursively computing maxima over each Z n. Each of the L bracketed terms defines a function of the next hidden variable, obtained by maximizing over K values for each of its K states. Hence, the time complexity of the algorithm is \( O\left(L{K}^2\right) \), even though the maximum is over \( {K}^L \) paths. This procedure is known as dynamic programming, and it is the workhorse of biological sequence analysis. For HMMs, it is known as the Viterbi algorithm [21].
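A minimal Viterbi implementation, working in log space to avoid numerical underflow, could look like this; observations are assumed to be encoded as integers indexing the columns of E, and zero probabilities map to −∞:

```python
import numpy as np

def viterbi(x, Pi, T, E):
    """Most probable state path argmax_z P(x, z) for HMM(Pi, T, E),
    computed in log space; x is a sequence of symbol indices."""
    L, K = len(x), len(Pi)
    logT, logE = np.log(T), np.log(E)
    V = np.zeros((L, K))              # V[n, k]: best log joint prob ending in k
    ptr = np.zeros((L, K), dtype=int) # back-pointers
    V[0] = np.log(Pi) + logE[:, x[0]]
    for n in range(1, L):
        scores = V[n - 1][:, None] + logT   # scores[k, l]: come from k, go to l
        ptr[n] = scores.argmax(axis=0)
        V[n] = scores.max(axis=0) + logE[:, x[n]]
    z = np.empty(L, dtype=int)
    z[-1] = V[-1].argmax()
    for n in range(L - 1, 0, -1):     # trace back the optimal path
        z[n - 1] = ptr[n, z[n]]
    return z
```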

In order to compute the marginal likelihood P(X = x) of an observed sequence x, we need to sum the joint probability P(Z = z, X = x) over all hidden state paths \( z\in {\mathcal{Z}}^L \). This sum has \( {K}^L \) terms, but it can be computed efficiently by the same dynamic programming principle used for the Viterbi algorithm:

$$ {\displaystyle \begin{array}{rl}\sum \limits_ZP\left(X,Z\right)&=\sum \limits_{Z_1,\dots, {Z}_L}P\left({Z}_1\right)\prod \limits_{n=1}^{L}P\left({X}_n\mid {Z}_n\right)\prod \limits_{n=1}^{L-1}P\left({Z}_{n+1}\mid {Z}_n\right)\\ {}&=\sum \limits_{Z_L}P\left({X}_L\mid {Z}_L\right)\Big[\dots \Big[\sum \limits_{Z_2}P\left({Z}_3\mid {Z}_2\right)P\left({X}_2\mid {Z}_2\right)\\ {}&\kern2em \Big[\sum \limits_{Z_1}P\left({Z}_2\mid {Z}_1\right)P\left({X}_1\mid {Z}_1\right)\cdot P\left({Z}_1\right)\Big]\Big]\dots \Big].\end{array}} $$
(82)

Indeed, this factorization is the same as in Eq. 81 with maxima replaced by sums. The recursive algorithm implementing (Eq. 82) is known as the forward algorithm. In each step, it computes the partial solution f(n, Z n) = P(X 1, …, X n, Z n).
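A sketch of the forward algorithm with the standard scaling trick, normalizing f at each step and accumulating the logarithms of the normalizers, might read:

```python
import numpy as np

def forward_loglik(x, Pi, T, E):
    """log P(x) via the forward recursion with per-position scaling."""
    f = Pi * E[:, x[0]]            # f[k] = P(x_1, Z_1 = k)
    log_px = 0.0
    for n in range(1, len(x)):
        c = f.sum()                # scaling constant
        log_px += np.log(c)
        f = ((f / c) @ T) * E[:, x[n]]
    return log_px + np.log(f.sum())
```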

The factorization along the Markov chain can also be done in the other direction starting the recursion from Z L down to Z 1. The resulting backward algorithm generates the partial solutions b(n, Z n) = P(X n+1, …, X L ∣ Z n). From the forward and backward quantities, one can also compute the position-wise posterior state probabilities:

$$ {\displaystyle \begin{array}{r}P\left({Z}_n\mid X\right)=\frac{P\left(X,{Z}_n\right)}{P(X)}=\frac{P\left({X}_1,\dots, {X}_n,{Z}_n\right)P\left({X}_{n+1},\dots, {X}_L\mid {Z}_n\right)}{P(X)}\\ {}=\frac{f\left(n,{Z}_n\right)b\left(n,{Z}_n\right)}{P(X)}.\end{array}} $$
(83)

For example, in the CpG island HMM (Example 24), we can compute, for each nucleotide, the probability that it belongs to a CpG island given the entire observed DNA sequence. Selecting the state that maximizes this probability independently at each sequence position is known as posterior decoding. In general, the result will be different from Viterbi decoding.
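Combining the two recursions gives the posterior state probabilities of Eq. 83. The following unscaled sketch is adequate for short sequences only, since the forward and backward quantities underflow for long ones:

```python
import numpy as np

def posterior_states(x, Pi, T, E):
    """P(Z_n = k | x) via Eq. 83 (unscaled forward/backward)."""
    L, K = len(x), len(Pi)
    f = np.zeros((L, K))
    b = np.ones((L, K))            # b(L, k) = 1 by convention
    f[0] = Pi * E[:, x[0]]
    for n in range(1, L):          # forward recursion
        f[n] = (f[n - 1] @ T) * E[:, x[n]]
    for n in range(L - 2, -1, -1): # backward recursion
        b[n] = T @ (E[:, x[n + 1]] * b[n + 1])
    return f * b / f[-1].sum()     # rows: positions, columns: states
```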

Example 25 (Pairwise Sequence Alignment)

The pair HMM is a statistical model for the pairwise alignment of two observed sequences over a fixed alphabet \( \mathcal{A} \). For protein sequences, \( \mathcal{A} \) consists of the 20 natural amino acids, and for DNA sequences of the four nucleotides, in both cases extended by the gap symbol (“-”). At each position of the alignment, a hidden variable \( {Z}_n\in \mathcal{Z}=\left\{\mathrm{M},\mathrm{X},\mathrm{Y}\right\} \) indicates whether there is a (mis-)match (M), an insertion (X), or a deletion (Y) in sequence y relative to sequence x. For example:

$$ {\displaystyle \begin{array}{r}z=\mathtt{MMMMMMMMMMMMMXXMMMMMMMMMMMMYMMMMYMMMMM}\\ {}x=\mathtt{CTRPNNNTRKSIRPQIGPGQAFYATGD}-\mathtt{IGDI}-\mathtt{RQAHC}\\ {}y=\mathtt{CGRPNNHRIKGLR}--\mathtt{IGPGRAFFAMGAIRGGEIRQAHC}\end{array}} $$

The emitted symbols are pairs (X n, Y n) of aligned sequence characters with state space \( \left(\mathcal{A}\times \mathcal{A}\right)\setminus \left\{\left(-,-\right)\right\} \). Thus, a pairwise alignment is a probabilistically generated sequence of pairs of symbols.

The choice of transition and emission probabilities corresponds to fixing a scoring scheme in nonprobabilistic formulations of sequence alignment. For example, the emission probabilities P[(a, b) ∣ M] from a match state encode pairwise amino acid preferences and can be modeled by substitution matrices, such as PAM and BLOSUM [20].

In the pair HMM, computing an optimal alignment between x and y means finding the most probable state path \( {z}^{\ast }={\mathrm{argmax}}_z\,P\left(X=x,Y=y,Z=z\right) \), which can be solved using the Viterbi algorithm. Using the forward algorithm, we can also efficiently compute the marginal probability of two sequences being related independently of their alignment, P(X, Y ) =∑Z P(X, Y, Z). In general, this probability is more informative than the posterior \( P\left(Z={z}^{\ast}\mid X,Y\right) \) of an optimal alignment \( {z}^{\ast } \) because many alignments tend to have the same or nearly the same probability, such that \( P\left(Z={z}^{\ast}\mid X,Y\right) \) can be very small. Finally, we can also compute the probability of two characters x n and y m being aligned by means of posterior decoding. □

Example 26 (Profile HMM)

Profile hidden Markov models represent groups of related sequences, such as protein families. They are used for searching homologous sequences and for building multiple sequence alignments. They can be regarded as unrolled versions of the pair HMM. A profile HMM is a statistical model for observed sequences, which are regarded as i.i.d. realizations. It has site-specific emission probabilities E n(a) = P(X n = a). In its simplest form allowing only gap-free alignments, the probability of an observation x is just

$$ P\left(X=x\right)=\prod \limits_{n=1}^L{E}_n\left({x}_n\right).\kern1.00em $$
(84)

The matrix \( {\left({E}_n(a)\right)}_{1\le n\le L,\kern0.3em a\in \mathcal{A}} \) is called a position-specific scoring matrix (PSSM).
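Scoring a sequence against a PSSM is then a single product (Eq. 84); the emission table below is a hypothetical length-3 DNA motif, invented for illustration:

```python
import numpy as np

idx = {'A': 0, 'C': 1, 'G': 2, 'T': 3}
# Hypothetical PSSM of a length-3 DNA motif: E[n, a] = P(X_n = a).
E = np.array([[0.7, 0.1, 0.1, 0.1],
              [0.1, 0.1, 0.7, 0.1],
              [0.1, 0.1, 0.1, 0.7]])

def pssm_prob(x):
    """P(X = x) under the gap-free profile model (Eq. 84)."""
    return np.prod([E[n, idx[a]] for n, a in enumerate(x)])

print(pssm_prob("AGT"))   # = 0.7 ** 3
```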

Profile HMMs can also model indels. Figure 10 shows the hidden state space of such a model. It has match states M n, which can emit symbols according to the probability tables E n, insert states I n, which usually emit symbols in an unspecific manner, and delete states D n, which do not emit any symbols. The possible transitions between those states allow for modeling alignment gaps of any length.

Fig. 10

Profile hidden Markov model. The hidden state space and its transitions are shown for a profile HMM of length L = 3. Match states are denoted M n, insert states I n, and delete states D n. B and E denote silent begin and end states, respectively. Probability tables for the emission of symbols (amino acids or nucleotides, and gaps) are associated with the match and insert states

A given profile HMM for a protein family can be used to detect new sequences that belong to the same family. For a query sequence x, we can either consider the most probable alignment of the sequence to the HMM, \( P\left(X=x,Z={z}^{\ast}\right) \), or the marginal probability independent of the alignment, P(X = x) =∑Z P(X = x, Z), to decide about family membership. □

Parameter estimation in HMMs is complicated by the presence of hidden variables. In Subheading 2, the EM algorithm has been introduced for finding a local maximum of the likelihood surface. For HMMs, the EM algorithm is known as the Baum–Welch algorithm [22]. For simplicity, let us ignore the initial state probabilities Π and summarize the parameters of the HMM by θ = (T, E). For ML estimation, we need to maximize the observed log-likelihood:

$$ {\displaystyle \begin{array}{rc}{\ell}_{\mathrm{obs}}\left(\theta \right)=\log P\left(X\mid \theta \right)=\log \sum \limits_ZP\left(X,Z\mid \theta \right)& \\ {}=\log \sum \limits_{Z^{(1)},\dots, {Z}^{(N)}}\prod \limits_{i=1}^NP\left({X}^{(i)},{Z}^{(i)}\mid \theta \right),& \end{array}} $$
(85)

where X (1), …, X (N) are the i.i.d. observations. For each observation, we can rewrite the joint probability as:

$$ P\left({X}^{(i)},{Z}^{(i)}\mid \theta \right)=\prod_{k\in \left[K\right]}\prod \limits_{x\in \mathcal{X}}{E}_{kx}^{N_{kx}\left({Z}^{(i)}\right)}\cdot \prod \limits_{k\in \left[K\right]}\prod \limits_{l\in \left[K\right]}{T}_{kl}^{N_{kl}\left({Z}^{(i)}\right)},\kern1.00em $$
(86)

where N kx(Z (i)) is the number of x emissions when in state k and N kl(Z (i)) the number of k-to-l transitions in state path Z (i) (cf. Eq. 68).

In the E step, the expectation of the hidden log-likelihood \( {\ell}_{\mathrm{hid}}\left(\theta \right)=\log P\left(X,Z\mid \theta \right) \) is computed with respect to P(Z ∣ X, θ′), where θ′ is the current best estimate of θ. We use Eq. 86 and denote by N kx and N kl the expected values of ∑i N kx(Z (i)) and ∑i N kl(Z (i)), respectively, to obtain

$$ {\displaystyle \begin{array}{rl}\mathrm{E}\left[{\ell}_{\mathrm{hid}}\left(\theta \right)\right]&=\sum \limits_ZP\left(Z\mid X,{\theta}^{\prime}\right)\log P\left(X,Z\mid \theta \right)\\ {}&=\sum \limits_{Z^{(1)},\dots, {Z}^{(N)}}P\left(Z\mid X,{\theta}^{\prime}\right)\sum \limits_{i=1}^N\left[\sum \limits_{k,\; x}{N}_{kx}\left({Z}^{(i)}\right)\log {E}_{kx}+\sum \limits_{k,\; l}{N}_{kl}\left({Z}^{(i)}\right)\log {T}_{kl}\right]\\ {}&=\sum \limits_{k,\; x}{N}_{kx}\log {E}_{kx}+\sum \limits_{k,\; l}{N}_{kl}\log {T}_{kl}.\end{array}} $$
(87)

The expected counts N kx and N kl are the sufficient statistics [11] of the HMM, i.e., with respect to the model, they contain all information about the parameters available from the data. The expected counts can be computed using the forward and backward algorithms. In the M step, this expression is maximized with respect to θ = (T, E). We find the MLEs \( {\hat{T}}_{kl}={N}_{kl}/ {\sum}_m{N}_{km} \) and \( {\hat{E}}_{kx}={N}_{kx}/ {\sum}_y{N}_{ky} \).
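For a single observed sequence, one Baum–Welch iteration can be sketched as follows; the sketch is unscaled and therefore suitable for short sequences only, and Π is kept fixed as in the text:

```python
import numpy as np

def baum_welch_step(x, Pi, T, E):
    """One EM iteration for a single observed symbol sequence x."""
    L, K = len(x), len(Pi)
    f = np.zeros((L, K))
    b = np.ones((L, K))
    f[0] = Pi * E[:, x[0]]
    for n in range(1, L):                       # forward recursion
        f[n] = (f[n - 1] @ T) * E[:, x[n]]
    for n in range(L - 2, -1, -1):              # backward recursion
        b[n] = T @ (E[:, x[n + 1]] * b[n + 1])
    px = f[-1].sum()
    # E step: expected transition counts N_kl and emission counts N_kx
    N_kl = np.zeros((K, K))
    for n in range(L - 1):
        N_kl += np.outer(f[n], E[:, x[n + 1]] * b[n + 1]) * T / px
    gamma = f * b / px                          # P(Z_n = k | x)
    N_kx = np.zeros((K, E.shape[1]))
    for n in range(L):
        N_kx[:, x[n]] += gamma[n]
    # M step: normalize the expected counts
    return (N_kl / N_kl.sum(axis=1, keepdims=True),
            N_kx / N_kx.sum(axis=1, keepdims=True))
```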

7 Bayesian Networks

Bayesian networks are a class of probabilistic graphical models which generalize Markov chains and HMMs. The basic idea is to use a graph for encoding conditional independences among random variables (Fig. 11). The graph representation provides not only an intuitive and simple visualization of the model structure, but it is also the basis for designing efficient algorithms for inference and learning in graphical models [23,24,25].

Fig. 11

Example of a Bayesian network. Vertices correspond to random variables and edges represent conditional probabilities. The graph encodes conditional independence statements about the random variables U, V , W, X, and Y . Their joint probability factors according to the graph as P(U, V, W, X, Y ) = P(U)P(Y )P(V ∣ U, Y )P(W ∣ V )P(X ∣ U)

A Bayesian network (BN) for a set of random variables X = (X 1, …, X L) consists of a directed acyclic graph (DAG) and local probability distributions (LPDs). The DAG G = (V, E) has vertex set V = [L] and edge set E ⊆ V × V . Each vertex n ∈ V is identified with the random variable X n. If there is an edge X m → X n in G, then X m is a parent of X n and X n is a child of X m. For each vertex n ∈ V , there is an LPD \( P\left({X}_n\mid {X}_{\mathrm{pa}(n)}\right) \), where \( \mathrm{pa}(n) \) is the set of parents of X n in G. The Bayesian network model is defined as the family of distributions for which the joint probability of X factors into conditional probabilities as:

$$ P\left({X}_1,\dots, {X}_L\right)=\prod \limits_{n=1}^LP\left({X}_n\mid {X}_{\mathrm{pa}(n)}\right).\kern1.00em $$
(88)

In this case, we write \( X\sim \mathrm{BN}\left(G,\theta \right) \), where θ = (θ 1, …, θ L) denotes the parameters of the LPDs.

For the Bayesian network shown in Fig. 11, we find P(U, V, W, X, Y ) = P(U)P(Y )P(V ∣ U, Y )P(W ∣ V )P(X ∣ U). The graph encodes several conditional independence statements about (U, V, W, X, Y ), including, for example, W ⊥{U, X} ∣ V .

Example 27 (Markov Chain)

A finite Markov chain is a Bayesian network with the DAG X 1 → X 2 →⋯ → X L, denoted C, and joint distribution:

$$ P\left({X}_1,\dots, {X}_L\right)=P\left({X}_1\right)P\left({X}_2\mid {X}_1\right)P\left({X}_3\mid {X}_2\right)\cdots P\left({X}_L\mid {X}_{L-1}\right).\kern1.00em $$
(89)

If \( X\sim \mathrm{MC}\left(\Pi, T\right) \) is homogeneous, then the LPDs are θ 1 = P(X 1) =  Π and θ n+1 = P(X n+1 ∣ X n) = T for all n ∈ [L − 1] such that \( \mathrm{MC}\left(\Pi, T\right)=\mathrm{BN}\left(C,\theta \right) \). Similarly, HMMs are Bayesian networks with hidden variables Z and factorized joint distribution given in Eq. 79. □

The meaning of the parameters θ of a Bayesian network depends on the family of distributions that has been chosen for the LPDs. In the general case of a discrete random variable with finite state space, θ n is a conditional probability table. If each vertex X n has K possible states, then:

$$ {\theta}_n={\left(P\left({X}_n=a\mid {X}_{\mathrm{pa}(n)}=b\right)\right)}_{b\in {\left[K\right]}^{pa(n)},a\in \left[K\right]}\kern1.00em $$
(90)

has \( {K}^{\left|\mathrm{pa}(n)\right|}\times \left(K-1\right) \) free parameters. If X n depends on all other variables, then θ n has the maximal number of \( {K}^{L-1}\left(K-1\right) \) parameters, which is exponential in the number of vertices. If, on the other hand, X n is independent of all other variables, \( \mathrm{pa}(n)=\varnothing \), then θ n has only (K − 1) parameters, which is independent of L. For the chain (Example 27), where each vertex has at most one incoming and one outgoing edge, we find a total of (K − 1) + (L − 1)K(K − 1) free parameters, which is of order \( O\left(L{K}^2\right) \).

A popular model for continuous random variables X n is the linear Gaussian model. Here, the LPDs are Gaussian distributions with mean a linear function of the parents:

$$ P\left({X}_n\mid {X}_{\mathrm{pa}(n)}\right)=\mathrm{Norm}\left({v}_n+{w}_n^t\cdot {X}_{\mathrm{pa}(n)},\kern0.3em {\sigma}_n^2\right),\kern1.00em $$
(91)

with parameters \( {v}_n\in \mathbb{R} \) and \( {w}_n\in {\mathbb{R}}^{\left|\mathrm{pa}(n)\right|} \) specifying the mean, and variance \( {\sigma}_n^2 \). The number of parameters increases only linearly with the number of parents, but only linear relationships can be modeled. All marginal and conditional distributions of (X 1, …, X L) are then also Gaussian.
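Sampling from a linear Gaussian network is straightforward by ancestral sampling, i.e., drawing each variable after its parents. The chain X1 → X2 → X3 and all parameter values below are hypothetical; note that NumPy's normal takes a standard deviation, not a variance:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_chain():
    """Ancestral sampling from a hypothetical network X1 -> X2 -> X3."""
    x1 = rng.normal(0.0, 1.0)               # X1 ~ Norm(0, 1)
    x2 = rng.normal(1.0 + 2.0 * x1, 0.5)    # X2 | X1 ~ Norm(v2 + w2*x1, sd2)
    x3 = rng.normal(-0.5 + 1.5 * x2, 0.3)   # X3 | X2
    return x1, x2, x3

data = np.array([sample_chain() for _ in range(10000)])
print(data.mean(axis=0))   # all marginals are Gaussian as well
```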

Learning a Bayesian network \( \mathrm{BN}\left(G,\theta \right) \) from data \( \mathcal{D} \) can be done in different ways following either the Bayesian or the maximum likelihood approach as introduced in Subheading 2. In general, it involves first finding the optimal network structure:

$$ {G}^{\ast }=\underset{G}{\mathrm{argmax}}\kern0.3em P\left(G\mid \mathcal{D}\right),\kern1.00em $$
(92)

and then estimating the parameters:

$$ {\theta}^{\ast }=\underset{\theta }{\mathrm{argmax}}\kern0.3em P\left(\theta \mid {G}^{\ast },\kern0.3em \mathcal{D}\right)\kern1.00em $$
(93)

for the given optimal structure \( {G}^{\ast } \). The first step is a model selection problem as introduced in Subheading 2.

Model selection for Bayesian networks is a particularly hard problem because the number of DAGs increases super-exponentially with the number of vertices rendering exhaustive searches impractical, and the objective function in Eq. 92 is difficult to compute. Recall that the posterior \( P\left(G\mid \mathcal{D}\right) \) is proportional to the product \( P\left(\mathcal{D}\mid G\right)P(G) \) of marginal likelihood and network prior, and the marginal likelihood:

$$ P\left(\mathcal{D}\mid G\right)=\int P\left(\mathcal{D}\mid \theta, \kern0.3em G\right)P\left(\theta \mid G\right)\kern0.3em d\theta \kern1.00em $$
(94)

is usually analytically intractable. Here, P(θ ∣ G) is the prior distribution of parameters given the network topology.

To address this limitation, the marginal likelihood (Eq. 94) can be approximated by a function that is easier to evaluate. A popular choice is the Bayesian information criterion (BIC) [26]:

$$ \log P\left(\mathcal{D}\mid G\right)\approx \log P\left(\mathcal{D}\mid {\hat{\theta}}_{\mathrm{ML}},G\right)-\frac{1}{2}\nu \log N,\kern1.00em $$
(95)

where ν is the number of free parameters of the model and N the size of the data. The BIC approximation can be derived under certain assumptions, including a unimodal likelihood. It replaces computation of the integral (Eq. 94) by evaluating the integrand at the MLE and adding the correction term \( -\left(\nu \log N\right)/ 2 \), which penalizes models of high complexity.
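In code, the BIC score is a one-liner, and comparing candidate structures amounts to comparing their scores; the numbers below are invented for illustration:

```python
import numpy as np

def bic_score(loglik_ml, nu, N):
    """BIC approximation to the log marginal likelihood (Eq. 95).
    loglik_ml: log P(D | theta_ML, G); nu: free parameters; N: data size."""
    return loglik_ml - 0.5 * nu * np.log(N)

# Hypothetical scores for two candidate structures: prefer the larger value.
print(bic_score(-1234.5, nu=12, N=200) > bic_score(-1230.1, nu=40, N=200))
```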

The model selection problem remains hard even with a tractable scoring function, such as BIC, because of the enormous search space. Local search methods, such as greedy hill climbing or simulated annealing, are often used in practice. They return a local maximum as a point estimate for the best network structure. Results can be improved by running several local searches from different starting topologies.

Often, data are sparse and we will find diffuse posterior distributions of network structures, which might not be represented very well by a single point estimate. In the fully Bayesian approach, we aim at estimating the full posterior \( P\left(G\mid \mathcal{D}\right)\propto P\left(\mathcal{D}\mid G\right)P(G) \). One way to approximate this distribution is to draw a finite number of samples from it. Markov chain Monte Carlo (MCMC) methods generate such a sample by constructing a Markov chain that converges to the target distribution [27].

In the Metropolis–Hastings algorithm [28], we start with a random DAG G (0) and then iteratively generate a new DAG G (n) from the previous one G (n−1) by drawing it from a proposal distribution Q:

$$ {G}^{(n)}\sim Q\left({G}^{(n)}\mid {G}^{\left(n-1\right)}\right).\kern1.00em $$
(96)

The new DAG is accepted with acceptance probability:

$$ \min \left\{\frac{P\left(\mathcal{D}\mid {G}^{(n)}\right)P\left({G}^{(n)}\right)Q\left({G}^{\left(n-1\right)}\mid {G}^{(n)}\right)}{P\left(\mathcal{D}\mid {G}^{\left(n-1\right)}\right)P\left({G}^{\left(n-1\right)}\right)Q\left({G}^{(n)}\mid {G}^{\left(n-1\right)}\right)},\kern0.3em 1\right\}\kern1.00em $$
(97)

Otherwise, the model is left unchanged and the next sample is drawn. With this acceptance probability, the Markov chain satisfies detailed balance with respect to the posterior \( P\left(G\mid \mathcal{D}\right) \); if, in addition, the proposal distribution makes the chain ergodic, it converges to the desired distribution. After an initial burn-in phase, samples from the stationary phase of the chain are collected, say G (m), …, G (N). Any feature f of the network (e.g., the presence of an edge or a subgraph) can be estimated as the expected value:

$$ \mathrm{E}(f)=\sum \limits_Gf(G)P\left(G\mid \mathcal{D}\right)\approx \frac{1}{N}\sum \limits_{n=m}^Nf\left({G}^{(n)}\right).\kern1.00em $$
(98)

A critical point of the Metropolis–Hastings algorithm is the choice of the proposal distribution Q, which encodes the way the network space is explored. Because not all graphs, but only DAGs, are allowed, computing the transition probabilities Q(G (n) ∣ G (n−1)) is usually the main computational bottleneck.
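The acceptance step (Eq. 97) itself is simple; the complexity hides in the score and propose functions, which are assumed to be supplied by the user and must evaluate \( \log P\left(\mathcal{D}\mid G\right)+\log P(G) \) and enumerate valid DAG neighborhoods, respectively:

```python
import numpy as np

rng = np.random.default_rng(0)

def mh_step(G, score, propose):
    """One Metropolis-Hastings step (Eq. 97), in log space.
    score(G): log P(D | G) + log P(G); propose(G): returns a new DAG
    G_new together with log Q(G | G_new) - log Q(G_new | G)."""
    G_new, log_q_ratio = propose(G)
    log_accept = score(G_new) - score(G) + log_q_ratio
    if np.log(rng.uniform()) < log_accept:
        return G_new               # accept the proposed DAG
    return G                       # reject: the chain stays where it is
```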

Parameter estimation, i.e., solving (Eq. 93), can be done along the lines described in Subheading 2 following either the ML or the Bayesian approach. If the model contains hidden random variables, then the EM algorithm (Subheading 3) can be used. However, this approach is feasible only if efficient inference algorithms are available. For hidden Markov models (Subheading 6), the forward and backward algorithms provided an efficient way to compute marginal probabilities and the expected hidden log-likelihood. These algorithms can be generalized to the sum–product algorithm for tree-like graphs and the junction tree algorithm for general DAGs. The computational complexity of the junction tree algorithm is exponential in the size of the largest clique of the so-called moralized graph, which is obtained by dropping edge directions and adding edges between any two vertices that have a common child in the original DAG [11].

Alternatively, if exact inference is computationally too expensive, then approximate inference can be used. For example, Gibbs sampling [29] is an MCMC technique for generating a sample from the joint distribution P(X 1, …, X L). The idea is to iteratively sample from the conditional probabilities of P(X 1, …, X L), starting with \( {X}_1^{\left(n+1\right)}\sim P\left({X}_1\mid {X}_2^{(n)},\dots, {X}_L^{(n)}\right) \) and cycling through all variables in turns:

$$ {\displaystyle \begin{array}{r}{X}_j^{\left(n+1\right)}\sim P\left({X}_j\mid {X}_1^{\left(n+1\right)},\dots, {X}_{j-1}^{\left(n+1\right)},{X}_{j+1}^{(n)},\dots, {X}_L^{(n)}\right)\\ {}\mathrm{for}\ \mathrm{all}\kern0.3em j=2,\dots, L.\end{array}} $$
(99)

Gibbs sampling can be regarded as a special case of the Metropolis–Hastings algorithm. It is particularly useful if it is much easier to sample from the conditionals P(X k ∣ X −k) than from the joint distribution P(X 1, …, X L), where X −k denotes all variables X n except X k. For graphical models, the conditional probability of each vertex X k depends only on its Markov blanket X MB(k), defined as the set of its parents, children, and co-parents (vertices with the same children), P(X k ∣ X −k) = P(X k ∣ X MB(k)).
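As a self-contained illustration of Gibbs sampling outside the network setting, consider a standard bivariate Gaussian with correlation ρ, whose full conditionals are univariate Gaussians:

```python
import numpy as np

rng = np.random.default_rng(0)

rho = 0.8
sd = np.sqrt(1 - rho ** 2)
x, y, samples = 0.0, 0.0, []
for _ in range(5000):
    x = rng.normal(rho * y, sd)    # X | Y = y ~ Norm(rho * y, 1 - rho^2)
    y = rng.normal(rho * x, sd)    # Y | X = x ~ Norm(rho * x, 1 - rho^2)
    samples.append((x, y))
print(np.corrcoef(np.array(samples[500:]).T))  # approx. rho after burn-in
```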

Example 28 (Phylogenetic Tree Models)

A phylogenetic tree model [30] for a set of aligned DNA sequences from different species is a Bayesian network model, where the graph is a tree in which the leaves represent the observed contemporary species and the interior vertices correspond to common extinct ancestors (Fig. 12). The topology (graph structure) S defines the branching order and the branch lengths correspond to (phylogenetic) time. The LPDs are defined by a nucleotide substitution model (Subheading 5).

Fig. 12

Phylogenetic tree model. The observed random variables X i represent contemporary species and the hidden random variables Z i their unknown common ancestors

Let X (i) ∈ {A, C, G, T}L denote the i-th column of a multiple sequence alignment of L observed species. We regard the alignment columns as independent observations of the evolutionary process. The character states of the hidden (extinct) ancestors are denoted Z (i). The likelihood of the observed sequence data X = (X (1), …, X (N)) given the tree topology S and the branch lengths t is

$$ P\left(X\mid S,t\right)=\sum \limits_Z\prod \limits_{i=1}^NP\left({X}^{(i)},{Z}^{(i)}\mid S,t\right),\kern1.00em $$
(100)

where P(X (i), Z (i) ∣ S, t) factors into conditional probabilities according to the tree structure. This marginal probability can be computed efficiently with an instance of the sum–product algorithm known as the peeling algorithm (or Felsenstein algorithm) [31].

For example, in the tree displayed in Fig. 12, each observation X has probability:

$$ P(X)=\sum \limits_ZP\left(X,Z\right) $$
(101)
$$ {\displaystyle \begin{array}{r}=\sum \limits_ZP\left({X}_1\mid {Z}_4\right)P\left({X}_2\mid {Z}_1\right)P\left({X}_3\mid {Z}_1\right)P\left({X}_4\mid {Z}_2\right)\cdot \\ {}\kern1em \cdot P\left({X}_5\mid {Z}_2\right)P\left({Z}_1\mid {Z}_3\right)P\left({Z}_2\mid {Z}_3\right)P\left({Z}_3\mid {Z}_4\right)P\left({Z}_4\right)\end{array}} $$
(102)
$$ {\displaystyle \begin{array}{rlll}& =\sum \limits_{Z_4}P\left({Z}_4\right)P\left({X}_1\mid {Z}_4\right)\left[\sum \limits_{Z_3}P\left({Z}_3\mid {Z}_4\right)\right.& & \\ {}& \kern1em \left[\sum \limits_{Z_2}P\left({Z}_2\mid {Z}_3\right)P\left({X}_4\mid {Z}_2\right)P\left({X}_5\mid {Z}_2\right)\right]& & \\ {}& \kern1em \cdot \left.\left[\sum \limits_{Z_1}P\left({Z}_1\mid {Z}_3\right)P\left({X}_2\mid {Z}_1\right)P\left({X}_3\mid {Z}_1\right)\right]\right],& \end{array}} $$
(103)

where we have omitted the dependency on the branch length t. Several software packages implement ML or Bayesian learning of phylogenetic tree models. □
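The nested sums in Eq. 103 translate directly into vectorized code. In the following sketch, the branch lengths (all set to 1 here), the rate α, and the example alignment column are our assumptions; the edge names follow Fig. 12:

```python
import numpy as np
from scipy.linalg import expm

def jc(t, alpha=0.1):
    """Jukes-Cantor transition matrix for branch length t (Eqs. 72 and 73)."""
    R = alpha * (np.ones((4, 4)) - 4 * np.eye(4))
    return expm(R * t)

T = {e: jc(1.0) for e in ("Z4Z3", "Z3Z1", "Z3Z2",
                          "Z4X1", "Z1X2", "Z1X3", "Z2X4", "Z2X5")}
pi = np.full(4, 0.25)              # root distribution at Z4
x = [0, 1, 1, 2, 2]                # one alignment column, A=0, C=1, G=2, T=3

# Peeling (Eq. 103): eliminate hidden ancestors from the leaves upward.
m1 = T["Z3Z1"] @ (T["Z1X2"][:, x[1]] * T["Z1X3"][:, x[2]])   # sum over Z1
m2 = T["Z3Z2"] @ (T["Z2X4"][:, x[3]] * T["Z2X5"][:, x[4]])   # sum over Z2
m3 = T["Z4Z3"] @ (m1 * m2)                                   # sum over Z3
p_x = np.sum(pi * T["Z4X1"][:, x[0]] * m3)                   # sum over Z4
print(p_x)
```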

In the simplest case, we suppose that the observed alignment columns are independent. However, it is more realistic to assume that nucleotide substitution rates vary across sites because of varying selective pressures. For example, there could be differences between coding and noncoding regions, among different regions of a protein (e.g., loops and catalytic sites), or among the three bases of a triplet coding for an amino acid. More sophisticated models can account for this rate heterogeneity. Let us assume site-specific substitution rates r i such that the local probabilities become P(X (i) ∣ r i, t, S). A gamma distribution is often used to model the distribution of the rates.

Example 29 (Gamma Distribution)

The gamma distribution \( \mathrm{Gamma}\left(\alpha, \beta \right) \) is parametrized by a shape parameter α and a rate parameter β. It is defined by the density function:

$$ f(x)=\frac{\beta^{\alpha }}{\Gamma \left(\alpha \right)}{x}^{\alpha -1}{e}^{-\beta x},\kern1em \mathrm{for}\kern0.3em x\ge 0.\kern1.00em $$
(104)

Its expectation is E(X) = α∕β and its variance \( \mathrm{Var}(X)=\alpha / {\beta}^2 \). The gamma distribution generalizes several other distributions, for example, \( \mathrm{Gamma}\left(1,\lambda \right)=\mathrm{Exp}\left(\lambda \right) \) (Example 22). □
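Site-specific rates can then be drawn from such a gamma prior; the shape and rate values below are arbitrary, and note that NumPy parametrizes the gamma by shape and scale = 1∕β:

```python
import numpy as np

rng = np.random.default_rng(0)
shape, rate = 2.0, 2.0                        # alpha, beta; mean alpha/beta = 1
r = rng.gamma(shape, 1.0 / rate, size=1000)   # NumPy uses scale = 1/rate
print(r.mean(), r.var())                      # near alpha/beta, alpha/beta**2
```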

Phylogenetic hidden Markov models (phylo-HMMs) provide another approach to account for varying mutation rates.

Example 30 (Phylo-HMM)

Phylo-HMMs [32] combine HMMs and phylogenetic trees into a single Bayesian network model. The idea is to use an HMM along the linear chain of the genomic sequence and, at each position, to condition a phylogenetic tree model on the hidden state (Fig. 13). This architecture allows for modeling different evolutionary histories at different sites of the genome. In particular, the model can account for heterogeneity in the rate of evolution, for example, due to functionally conserved elements, but it also allows for a change in tree topology along the sequence, a situation that can result from recombination [23]. Phylo-HMMs are also used for gene finding. □

Fig. 13

Phylo-HMM. Shown are the first four positions of a phylo-HMM. The hidden Markov chain has random variables Z. In the trees, the Y denote the hidden common ancestors and the X the observed species. Note that the tree topology changes between positions 2 and 3

Exercise 31 (Inference in Bayesian Networks)

Consider the gene network on five genes denoted A, B, C, D, E, with the graph structure displayed below. Gene expression profiles under different conditions C1–C9 have been observed and are summarized in the table below, where a zero indicates that the gene is not expressed and a one that it is expressed.

(a) Specify the adjacency matrix of the directed graph.

(b) Determine the local probability distributions for each vertex of the graph. Use conditional counting to determine the conditional probabilities as:

$$ P\left({X}_i\mid {X}_{\mathrm{pa}(i)}\right)\approx \frac{N\left({X}_i,{X}_{\mathrm{pa}(i)}\right)}{\sum_kN\left({X}_i=k,\kern0.3em {X}_{\mathrm{pa}(i)}\right)},\kern1.00em $$

where \( N\left({X}_i,\kern0.3em {X}_{\mathrm{pa}(i)}\right) \) is the number of joint observations of X i and its parents.

(c) What is the joint probability of (X A, X B, X C, X D, X E) for this network?

(d) We now want to determine the most probable explanation for observing gene C to be active as a result of the influences of its upstream genes A and E. For this, one has to infer the posterior probabilities P(A ∣ C = 1) and P(E ∣ C = 1) using Bayes' theorem. Here, assume that the probabilities P(A) and P(E) derived from the expression data are suitable prior probabilities. Which constellation is most likely to trigger the expression of C?