Evolutionary Genomics, pp. 33–70

# Probability, Statistics, and Computational Science

## Abstract

In this chapter, we review basic concepts from probability theory and computational statistics that are fundamental to evolutionary genomics. We provide a very basic introduction to statistical modeling and discuss general principles, including maximum likelihood and Bayesian inference. Markov chains, hidden Markov models, and Bayesian network models are introduced in more detail as they occur frequently and in many variations in genomics applications. In particular, we discuss efficient inference algorithms and methods for learning these models from partially observed data. Several simple examples are given throughout the text, some of which provide the basis for models that are discussed in more detail in subsequent chapters.

## Key words

Bayesian inference, Bayesian networks, Dynamic programming, EM algorithm, Hidden Markov models, Markov chains, Maximum likelihood, Statistical models

## 1 Statistical Models

Evolutionary genomics can only be approached with the help of statistical modeling. Stochastic fluctuations are inherent to many biological systems. Specifically, the evolutionary process itself is stochastic, with random mutations and random mating being major sources of variation. In general, stochastic effects play an increasingly important role if the number of molecules, or cells, or individuals of a population is small. Stochastic variation also arises from measurement errors. Biological data is often noisy due to experimental limitations, especially for high-throughput technologies, such as microarrays or next-generation sequencing [1, 2].

Statistical modeling addresses the following questions: What can be generalized from a finite sample obtained from an experiment to the population? What can be learned about the underlying biological mechanisms? How certain can we be about our model predictions?

In the frequentist view of statistics, the observed variability in the data is the result of a fixed true value being perturbed by random variation, such as, for example, measurement noise. Probabilities are thus interpreted as long-run expected relative frequencies. By contrast, from a Bayesian point of view, probabilities represent our uncertainty about the state of nature. There is no true value, but only the data is real. Our prior belief about an event is updated in light of the data.

Statistical models represent the observed variability or uncertainty by probability distributions [3, 4]. The observed data are regarded as realizations of random variables. The parameters of a statistical model are usually the quantities of interest because they describe the amount and nature of systematic variation in the data. Parameter estimation and model selection are discussed in more detail in the next section. In this section, we first consider discrete, and then continuous random variables and univariate (1-dimensional) before multivariate (*n*-dimensional) ones. We start by formulating the well-known Hardy–Weinberg principle [5, 6] as a statistical model.

### Example 1 (Hardy–Weinberg Model)

Let *X* be the random variable with state space \( \mathcal{X}=\left\{\mathrm{AA},\mathrm{Aa},\mathrm{aa}\right\} \) describing the genotype. We parametrize the probability distribution of *X* by the allele frequency *p* of A and the allele frequency *q* = 1 − *p* of a. The Hardy–Weinberg model is defined by:

\( P\left(X=\mathrm{AA}\right)={p}^2 \)    (1)

\( P\left(X=\mathrm{Aa}\right)=2p\left(1-p\right) \)    (2)

\( P\left(X=\mathrm{aa}\right)={\left(1-p\right)}^2 \)    (3)

A population is in Hardy–Weinberg equilibrium if *X* follows the distribution given in Eqs. 1–3. □

*P*(*X*) is a discrete probability distribution (or probability mass function) with finite state space: we have 0 ≤ *P*(*X* = *x*) ≤ 1 for all \( x\in \mathcal{X} \) and \( {\sum}_{x\in \mathcal{X}}P\left(X=x\right)={p}^2+2p\left(1-p\right)+{\left(1-p\right)}^2={\left[p+\left(1-p\right)\right]}^2=1 \). In general, any statistical model for a discrete random variable with *n* states defines a subset of the (*n* − 1)-dimensional probability simplex:

\( {\Delta}_{n-1}=\left\{\left({t}_1,\dots, {t}_n\right)\in {\mathbb{R}}^n\mid {t}_i\ge 0\ \mathrm{for}\ \mathrm{all}\ i,\ {\sum}_{i=1}^n{t}_i=1\right\} \)

Each point of the simplex corresponds to a probability distribution of *X*, and statistical models can be understood as specific subsets of the simplex [7].

The Hardy–Weinberg principle states that a randomly mating population with allele frequency *p* has genotype probabilities given in Eqs. 1–3 after one round of random mating. The resulting allele frequency is again *p*, so the genotype distribution remains constant in all subsequent generations. In the probability simplex, the Hardy–Weinberg model corresponds to a curve in Δ_{2} (Fig. 1).

The simplest discrete random variable is a binary (or Bernoulli) random variable *X*. The textbook example of a Bernoulli trial is the flipping of a coin. The state space of this random experiment is the set that contains all possible outcomes, namely, whether the coin lands on heads (*X* = 0) or tails (*X* = 1). We write \( \mathcal{X}=\left\{0,1\right\} \) to denote this state space. The parameter space is the set that contains all possible values of the model parameters. In the coin tossing example, the only parameter is the probability of observing tails, *p*, and this parameter can take any value between 0 and 1, so we write Θ = {*p* ∣ 0 ≤ *p* ≤ 1} for the parameter space. In general, the event *X* = 1 is often called a “success,” and *p* = *P*(*X* = 1) the probability of success.

### Example 2 (Binomial Distribution)

Consider *n* independent Bernoulli trials, each with success probability *p*. Let *X* be the random variable counting the number of successes *k* among the *n* trials. Then, *X* has state space \( \mathcal{X}=\left\{0,\dots, n\right\} \) and follows the binomial distribution \( \mathrm{Binom}\left(n,p\right) \):

\( P\left(X=k\right)=\binom{n}{k}{p}^k{\left(1-p\right)}^{n-k} \)    (8)

For example, *X* could count the number of tails among *n* successive coin tosses or the number of mutated genes in a group of species. □

The binomial distribution has expectation E(*X*) = *np* and variance \( \mathrm{Var}(X)= np\left(1-p\right) \).

### Example 3 (Poisson Distribution)

The Poisson distribution \( \mathrm{Pois}\left(\lambda \right) \) with parameter *λ* ≥ 0 is defined as:

\( P\left(X=k\right)=\frac{\lambda^k{e}^{-\lambda }}{k!},\kern1em k\in \mathbb{N} \)

It describes the number *X* of independent events occurring in a fixed period of time (or space) at average rate *λ* and independently of the time since (or distance to) the last event. The Poisson distribution has equal expectation and variance, \( \mathrm{E}(X)=\mathrm{Var}(X)=\lambda \). □

The Poisson distribution is used frequently as a model for the number of DNA mutations in a gene after a certain time period, where *λ* is the mutation rate. Both the binomial and the Poisson distribution describe counts of random events. In the limit of large *n* and fixed product *np*, the two distributions coincide, \( \mathrm{Binom}\left(n,p\right)\to \mathrm{Pois}(np) \), for *n* →*∞*.
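The convergence \( \mathrm{Binom}\left(n,p\right)\to \mathrm{Pois}(np) \) can be checked numerically. The sketch below (standard library only; the specific values of *n* and *p* are ours) compares the two probability mass functions for a large *n* and a small *p*:

```python
from math import comb, exp, factorial

def binom_pmf(k, n, p):
    """Binomial probability of k successes in n trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def pois_pmf(k, lam):
    """Poisson probability of k events at rate lam."""
    return lam**k * exp(-lam) / factorial(k)

n, p = 10_000, 0.0003          # large n, small p, fixed product np = 3
lam = n * p

# The two pmfs agree closely for all small counts k.
max_diff = max(abs(binom_pmf(k, n, p) - pois_pmf(k, lam)) for k in range(20))
assert max_diff < 1e-3
```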

### Example 4 (Shotgun Sequencing)

Suppose *n* reads of length *L* have been obtained from a genome of size *G*. We assume that all reads have the same probability of being sequenced. Then, the probability of hitting a specific base with one read is *p* = *L*∕*G*, and the average coverage of the sequencing run is *c* = *np*. Under this model, the number of times *X* a single base is sequenced is distributed as \( \mathrm{Binom}\left(n,p\right) \). For large *n*, *X* is approximately \( \mathrm{Pois}(c) \)-distributed. For example, suppose we obtain *n* = 10^{8} reads of length *L* = 100 bases in a single run. For the human genome of length *G* = 3 ⋅ 10^{9}, we obtain a coverage of *c* ≈ 3.3. The distribution of the number of reads per base pair is shown in Fig. 2. In particular, the fraction of unsequenced positions is *P*(*X* = 0) = *e*^{−c} = 3.57%. □
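The numbers in this example are easy to reproduce. A minimal computation of the coverage and of the Poisson approximation for the fraction of unsequenced bases:

```python
from math import exp

n = 10**8        # number of reads
L = 100          # read length in bases
G = 3 * 10**9    # genome size in bases

p = L / G        # probability that one read covers a fixed base
c = n * p        # average coverage, here 10/3 ~ 3.33

# Poisson approximation: the fraction of bases never sequenced is
# P(X = 0) = e^{-c}, about 3.57% at this coverage.
frac_unsequenced = exp(-c)
assert abs(c - 10 / 3) < 1e-12
assert 0.035 < frac_unsequenced < 0.036
```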

A continuous random variable *X* takes values in \( \mathcal{X}=\mathbb{R} \) and is defined by a nonnegative function *f*(*x*) such that:

\( P\left(X\in A\right)={\int}_A f(x)\kern0.3em dx,\kern1em \mathrm{for}\ \mathrm{measurable}\ A\subseteq \mathbb{R} \)

The function *f* is called the probability density function of *X*. For an interval [*a*, *b*]:

\( P\left(a\le X\le b\right)={\int}_a^b f(x)\kern0.3em dx \)

### Example 5 (Normal Distribution)

The normal (or Gaussian) distribution \( \mathrm{Norm}\left(\mu, {\sigma}^2\right) \) has probability density function:

\( f(x)=\frac{1}{\sqrt{2\pi {\sigma}^2}}\exp \left(-\frac{{\left(x-\mu \right)}^2}{2{\sigma}^2}\right) \)

with expectation E(*X*) = *μ* and variance \( \mathrm{Var}(X)={\sigma}^2 \). \( \mathrm{Norm}\left(0,1\right) \) is called the standard normal distribution. □

The normal distribution is a popular model for gene expression data, where *μ* is the level of expression of the corresponding gene and *σ*^{2} summarizes the experimental noise associated with the microarray experiment. The parameters can be estimated from a finite sample {*x*^{(1)}, …, *x*^{(N)}}, i.e., from *N* replicate experiments, as the empirical mean and variance, respectively:

\( \overline{x}=\frac{1}{N}{\sum}_{i=1}^N{x}^{(i)},\kern1em {s}^2=\frac{1}{N-1}{\sum}_{i=1}^N{\left({x}^{(i)}-\overline{x}\right)}^2 \)    (20)

The central role of the normal distribution derives from the central limit theorem: the standardized sum of *N* independent (see below) and identically distributed (i.i.d.) random variables *X*^{(i)} with equal mean *μ* and variance *σ*^{2} converges in distribution to the standard normal distribution:

\( \frac{{\sum}_{i=1}^N{X}^{(i)}-N\mu}{\sigma \sqrt{N}}\overset{d}{\to }\mathrm{Norm}\left(0,1\right),\kern1em \mathrm{as}\ N\to \infty \)

For example, the normal distribution approximates the Poisson distribution for large values of the rate parameter *λ*.

Let *X* and *Y* be two random variables with expected values *μ*_{X} and *μ*_{Y} and variances \( {\sigma}_X^2 \) and \( {\sigma}_Y^2 \), respectively. The covariance between *X* and *Y* is

\( \mathrm{Cov}\left(X,Y\right)=\mathrm{E}\left[\left(X-{\mu}_X\right)\left(Y-{\mu}_Y\right)\right] \)

and the correlation between *X* and *Y* is \( {\rho}_{X,Y}=\mathrm{Cov}\left(X,Y\right)/ \left({\sigma}_X{\sigma}_Y\right) \). For observations (*x*^{(1)}, *y*^{(1)}), …, (*x*^{(N)}, *y*^{(N)}), the sample correlation coefficient is

\( {r}_{xy}=\frac{{\sum}_{i=1}^N\left({x}^{(i)}-\overline{x}\right)\left({y}^{(i)}-\overline{y}\right)}{\left(N-1\right){s}_X{s}_Y} \)

where *s*_{X} and *s*_{Y} are the sample standard deviations of *X* and *Y*, respectively, defined in Eq. 20.
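A minimal implementation of the sample correlation coefficient (standard library only; function and variable names are ours):

```python
from math import sqrt

def sample_correlation(xs, ys):
    """Pearson sample correlation coefficient r_xy."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # Sample covariance and sample standard deviations (cf. Eq. 20).
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    sx = sqrt(sum((x - mx) ** 2 for x in xs) / (n - 1))
    sy = sqrt(sum((y - my) ** 2 for y in ys) / (n - 1))
    return cov / (sx * sy)

# A perfectly linear relationship has correlation 1 (or -1 for negative slope).
assert abs(sample_correlation([1, 2, 3, 4], [2, 4, 6, 8]) - 1.0) < 1e-12
assert abs(sample_correlation([0, 1, 2], [5, 3, 1]) + 1.0) < 1e-12
```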

So far, we have worked with univariate distributions and we now turn to multivariate distributions, i.e., we consider random vectors *X* = (*X*_{1}, …, *X*_{n}) such that each *X*_{i} is a random variable. For the case of discrete random variables *X*_{i}, we first generalize the binomial distribution to random experiments with a finite number of outcomes.

### Example 6 (Multinomial Distribution)

Let *K* be the number of possible outcomes of a random experiment and *θ*_{k} the probability of outcome *k*. We consider the random vector *X* = (*X*_{1}, …, *X*_{K}) with values in \( \mathcal{X}={\mathbb{N}}^K \), where *X*_{k} counts the number of outcomes of type *k* among a total of *n* trials. The multinomial distribution \( \mathrm{Mult}\left(n,{\theta}_1,\dots, {\theta}_K\right) \) is defined as:

\( P\left({X}_1={n}_1,\dots, {X}_K={n}_K\right)=\frac{n!}{n_1!\cdots {n}_K!}{\theta}_1^{n_1}\cdots {\theta}_K^{n_K},\kern1em {\sum}_{k=1}^K{n}_k=n \)

For *K* = 2, we recover the binomial distribution (Eq. 8). Each component *X*_{k} of a multinomial vector has expected value E(*X*_{k}) = *nθ*_{k} and variance \( \mathrm{Var}\left({X}_k\right)=n{\theta}_k\left(1-{\theta}_k\right) \). The covariance of two components is \( \mathrm{Cov}\left({X}_k,{X}_l\right)=-n{\theta}_k{\theta}_l \), for *k* ≠ *l*. □
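As a sanity check, the multinomial probabilities over all count vectors sum to one, and the *K* = 2 case reduces to the binomial pmf. A small sketch (standard library only; the example parameters are ours) enumerates all outcomes for a tiny *n*:

```python
from itertools import product
from math import factorial, prod

def mult_pmf(counts, theta):
    """Multinomial probability of the count vector `counts`."""
    n = sum(counts)
    coef = factorial(n) // prod(factorial(c) for c in counts)
    return coef * prod(t**c for t, c in zip(theta, counts))

n, theta = 4, (0.5, 0.3, 0.2)
# Enumerate all count vectors (n1, n2, n3) with n1 + n2 + n3 = n.
total = sum(
    mult_pmf(c, theta)
    for c in product(range(n + 1), repeat=3)
    if sum(c) == n
)
assert abs(total - 1.0) < 1e-12

# K = 2 recovers the binomial pmf: C(4,3) * 0.7^3 * 0.3 = 0.4116.
assert abs(mult_pmf((3, 1), (0.7, 0.3)) - 0.4116) < 1e-12
```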

The covariance matrix of a random vector *X* = (*X*_{1}, …, *X*_{n}) is defined by:

\( {\Sigma}_{ij}=\mathrm{Cov}\left({X}_i,{X}_j\right)=\mathrm{E}\left[\left({X}_i-{\mu}_i\right)\left({X}_j-{\mu}_j\right)\right] \)

where *μ*_{i} is the expected value of *X*_{i}. The matrix Σ is also called the variance–covariance matrix because the diagonal terms are the variances \( {\Sigma}_{ii}=\mathrm{Cov}\left({X}_i,{X}_i\right)=\mathrm{Var}\left({X}_i\right) \).

A continuous random vector *X* takes values in \( \mathcal{X}={\mathbb{R}}^n \). It is defined by its cumulative distribution function:

\( F(x)=P\left({X}_1\le {x}_1,\dots, {X}_n\le {x}_n\right),\kern1em x\in {\mathbb{R}}^n \)

### Example 7 (Multivariate Normal Distribution)

For *n* ≥ 1 and \( x\in {\mathbb{R}}^n \), the multivariate normal (or Gaussian) distribution has density:

\( f(x)=\frac{1}{{\left(2\pi \right)}^{n/ 2}{\left|\Sigma \right|}^{1/ 2}}\exp \left(-\frac{1}{2}{\left(x-\mu \right)}^t{\Sigma}^{-1}\left(x-\mu \right)\right) \)

where Σ is the covariance matrix and *μ* the expectation. We write \( X=\left({X}_1,\dots, {X}_n\right)\sim \mathrm{Norm}\left(\mu, \Sigma \right) \) for a random vector with such a distribution. □

We say that two random variables *X* and *Y* are independent if *P*(*X*, *Y* ) = *P*(*X*)*P*(*Y* ) or, equivalently, if the conditional probability *P*(*X* ∣ *Y* ) = *P*(*X*, *Y* )∕*P*(*Y* ) is equal to the unconditional probability *P*(*X*). If *X* and *Y* are independent, denoted *X* ⊥ *Y* , then E[*XY* ] = E[*X*]E[*Y* ] and \( \mathrm{Var}\left(X+Y\right)=\mathrm{Var}(X)+\mathrm{Var}(Y) \). It follows that independent random variables have covariance zero. However, the converse is only true in specific situations, for example if (*X*, *Y* ) is multivariate normal, but not in general because correlation captures only linear dependencies.
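The last point can be made concrete with a classic small example of our own: a random variable *X* uniform on {−1, 0, 1} and *Y* = *X*^2 have covariance zero, yet they are clearly dependent:

```python
from fractions import Fraction

# X uniform on {-1, 0, 1}, Y = X^2; exact arithmetic with fractions.
states = [(-1, 1), (0, 0), (1, 1)]
p = Fraction(1, 3)

ex = sum(p * x for x, _ in states)            # E[X]  = 0
ey = sum(p * y for _, y in states)            # E[Y]  = 2/3
exy = sum(p * x * y for x, y in states)       # E[XY] = 0
cov = exy - ex * ey
assert cov == 0                               # uncorrelated ...

# ... but not independent: P(X=0, Y=0) != P(X=0) * P(Y=0),
# because Y is a deterministic (nonlinear) function of X.
p_x0_y0 = Fraction(1, 3)
p_x0, p_y0 = Fraction(1, 3), Fraction(1, 3)
assert p_x0_y0 != p_x0 * p_y0
```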

Let *X*, *Y*, and *Z* be three random vectors. Generalizing the notion of statistical independence, we say that *X* is conditionally independent of *Y* given *Z* and write *X* ⊥ *Y* ∣ *Z* if *P*(*X*, *Y* ∣ *Z*) = *P*(*X* ∣ *Z*)*P*(*Y* ∣ *Z*). Bayes’ theorem states that

\( P\left(Y\mid X\right)=\frac{P\left(X\mid Y\right)P(Y)}{P(X)} \)    (30)

where *P*(*Y*) is called the prior probability and *P*(*Y* ∣ *X*) the posterior probability. Intuitively, the prior *P*(*Y*) encodes our a priori knowledge about *Y* (i.e., before observing *X*), and *P*(*Y* ∣ *X*) is our updated knowledge about *Y* a posteriori (i.e., after observing *X*).

The marginal probability of *X* can be computed as *P*(*X*) = ∑_{Y} *P*(*X*, *Y*) if *Y* is discrete, and similarly *P*(*X*) = ∫_{Y} *P*(*X*, *Y*) *dY* if *Y* is continuous. Here, *P*(*X*) is called the marginal and *P*(*X*, *Y*) the joint probability. This summation or integration is known as marginalization (Fig. 3).

Using *P*(*X*) = ∑_{Y} *P*(*X*, *Y*) = ∑_{Y} *P*(*X* ∣ *Y*)*P*(*Y*), Bayes’ theorem can also be rewritten as:

\( P\left(Y=y\mid X\right)=\frac{P\left(X\mid Y=y\right)P\left(Y=y\right)}{{\sum}_{y^{\prime}\in \mathcal{Y}}P\left(X\mid Y={y}^{\prime}\right)P\left(Y={y}^{\prime}\right)} \)

where \( P\left({y}^{\prime}\right)=P\left(Y={y}^{\prime}\right) \) and \( \mathcal{Y} \) is the state space of *Y*.

### Example 8 (Diagnostic Test)

Let the binary random variables *D* and *T* indicate disease status (*D* = 1, diseased) and test result (*T* = 1, positive), respectively. Let us assume that the prevalence of the disease is 0.5%, i.e., 0.5% of all people in the population are known to be affected. The test has a false positive rate (probability that somebody is tested positive who does not have the disease) of *P*(*T* = 1 ∣ *D* = 0) = 5% and a true positive rate (probability that somebody is tested positive who has the disease) of *P*(*T* = 1 ∣ *D* = 1) = 90%. Then, the posterior probability of a person having the disease given that he or she tested positive is

\( P\left(D=1\mid T=1\right)=\frac{P\left(T=1\mid D=1\right)P\left(D=1\right)}{P\left(T=1\mid D=1\right)P\left(D=1\right)+P\left(T=1\mid D=0\right)P\left(D=0\right)}=\frac{0.9\cdot 0.005}{0.9\cdot 0.005+0.05\cdot 0.995}\approx 8.3\% \)

Our prior belief in the disease status, *P*(*D*), has been modified in light of the test result by multiplication with *P*(*T* ∣ *D*) to obtain the updated belief *P*(*D* ∣ *T*). □

### Exercise 9 (Conditional Independence)

Let *X*, *Y* , and *Z* be random variables. Using the laws of probability, show that *X* and *Y* are conditionally independent given *Z* (i.e., *X* ⊥ *Y* ∣ *Z*) if and only if *P*(*X* ∣ *Y*, *Z*) = *P*(*X* ∣ *Z*).

## 2 Statistical Inference

Statistical models have parameters and a common task is to estimate the model parameters from observed data. The goal is to find the set of parameters with the best model fit. There are two major approaches to parameter estimation: maximum likelihood (ML) and Bayes.

In the likelihood framework, we assume that the observed data set \( \mathcal{D} \) was generated by the model with an unknown but fixed true parameter *θ*_{0} ∈ Θ. For the fixed data set \( \mathcal{D} \), the likelihood function of the model is

\( L\left(\theta \right)=P\left(\mathcal{D}\mid \theta \right) \)

regarded as a function of *θ*. For continuous random variables, the likelihood function is defined similarly in terms of the density function, \( L\left(\theta \right)=f\left(\mathcal{D}\mid \theta \right) \). Maximum likelihood estimation seeks the parameter *θ* ∈ Θ for which *L*(*θ*) is maximal. Rather than *L*(*θ*), it is often more convenient to maximize \( \ell \left(\theta \right)=\log L\left(\theta \right) \), the log-likelihood function. If the data are i.i.d., then:

\( \ell \left(\theta \right)={\sum}_{i=1}^N\log P\left({x}^{(i)}\mid \theta \right) \)

### Example 10 (Likelihood Function of the Binomial Model)

Suppose we observed *k* = 7 successes in a total of *N* = 10 Bernoulli trials. The likelihood function of the binomial model (Eq. 8) is

\( L(p)=\binom{N}{k}{p}^k{\left(1-p\right)}^{N-k} \)

where *p* is the success probability (Fig. 4). To maximize *L*, we consider the log-likelihood function:

\( \ell (p)=\log \binom{N}{k}+k\log p+\left(N-k\right)\log \left(1-p\right) \)

and solve the likelihood equation *dℓ*∕*dp* = 0. The ML estimate (MLE) is the solution \( {\hat{p}}_{\mathrm{ML}}=k/ N=7/ 10 \). Thus, the MLE of the success probability is just the relative frequency of successes. □
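We can confirm numerically that the log-likelihood is maximized at *k*∕*N* by evaluating it on a fine grid of values for *p* (a sketch, standard library only):

```python
from math import comb, log

k, N = 7, 10

def log_lik(p):
    """Binomial log-likelihood for k successes in N trials."""
    return log(comb(N, k)) + k * log(p) + (N - k) * log(1 - p)

# Grid search over the open interval (0, 1); the log-likelihood is
# concave, so the best grid point sits at the analytic maximum k/N.
grid = [i / 1000 for i in range(1, 1000)]
p_hat = max(grid, key=log_lik)
assert abs(p_hat - k / N) < 1e-9
```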

### Example 11 (Likelihood Function of the Hardy–Weinberg Model)

Suppose we sampled the genotypes of *N* individuals and observed the counts *n*_{AA}, *n*_{Aa}, and *n*_{aa} with the respective genotypes. Assuming Hardy–Weinberg equilibrium (Eqs. 1–3), we want to estimate the allele frequencies *p* and *q* = 1 − *p* of the population. The likelihood function of the Hardy–Weinberg model is \( L(p)=P{\left(\mathrm{AA}\right)}^{n_{\mathrm{AA}}}P{\left(\mathrm{Aa}\right)}^{n_{\mathrm{Aa}}}P{\left(\mathrm{aa}\right)}^{n_{\mathrm{aa}}} \) and the log-likelihood is

\( \ell (p)={n}_{\mathrm{AA}}\log {p}^2+{n}_{\mathrm{Aa}}\log \left[2p\left(1-p\right)\right]+{n}_{\mathrm{aa}}\log {\left(1-p\right)}^2 \)

The MLE of *p* ∈ [0, 1] can be found by maximizing *ℓ*. Solving the likelihood equation *dℓ*∕*dp* = 0 yields

\( {\hat{p}}_{\mathrm{ML}}=\frac{2{n}_{\mathrm{AA}}+{n}_{\mathrm{Aa}}}{2N} \)

where *N* = *n*_{AA} + *n*_{Aa} + *n*_{aa} is the total sample size. For example, if we sample *N* = 100 genotypes with *n*_{AA} = 81, *n*_{Aa} = 18, and *n*_{aa} = 1, then we find \( {\hat{p}}_{\mathrm{ML}}=\left(2\cdot 81+18\right)/ \left(2\cdot 100\right)=0.9 \) for the frequency of the major allele. □
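The closed-form MLE can be checked against a direct numerical maximization of the Hardy–Weinberg log-likelihood (a sketch, standard library only):

```python
from math import log

n_AA, n_Aa, n_aa = 81, 18, 1
N = n_AA + n_Aa + n_aa

def log_lik(p):
    """Hardy-Weinberg log-likelihood for allele frequency p."""
    return (n_AA * log(p**2)
            + n_Aa * log(2 * p * (1 - p))
            + n_aa * log((1 - p) ** 2))

# Closed-form MLE: relative frequency of the A allele among 2N alleles.
p_ml = (2 * n_AA + n_Aa) / (2 * N)
assert abs(p_ml - 0.9) < 1e-12

# Grid search agrees with the closed form (the log-likelihood is concave).
grid = [i / 1000 for i in range(1, 1000)]
assert abs(max(grid, key=log_lik) - p_ml) < 1e-9
```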

ML estimators have several desirable asymptotic properties: as *N* → *∞*, they are normally distributed, unbiased, and have minimal variance. The uncertainty in parameter estimation associated with the sampling variance of the finite data set can be quantified in confidence intervals. There are several ways to construct confidence intervals and statistical tests for MLEs based on the asymptotic behavior of the log-likelihood function \( \ell \left(\theta \right)=\log L\left(\theta \right) \) and its derivatives. For example, the asymptotic normal distribution of the MLE is

\( {\hat{\theta}}_{\mathrm{ML}}\sim \mathrm{Norm}\left({\theta}_0,{\left[J\left({\theta}_0\right)\right]}^{-1}\right) \)

where *I*(*θ*) = −*∂*^{2}*ℓ*∕*∂θ*^{2} is the Fisher information and *J*(*θ*) = E[*I*(*θ*)] the expected Fisher information. This result gives rise to the Wald confidence intervals:

\( {\hat{\theta}}_{\mathrm{ML}}\pm {z}_{1-\alpha / 2}{\left[J\left({\hat{\theta}}_{\mathrm{ML}}\right)\right]}^{-\frac{1}{2}} \)    (38)

where \( {z}_{1-\alpha / 2}={F}^{-1}\left(1-\alpha / 2\right) \) is the (1 − *α*∕2) quantile and *F* the cumulative distribution function of the standard normal distribution. Equation 38 still holds after replacing the standard error \( \mathrm{se}\left({\hat{\theta}}_{\mathrm{ML}}\right)={\left[J\left({\hat{\theta}}_{\mathrm{ML}}\right)\right]}^{-\frac{1}{2}} \) with \( {\left[I\left({\hat{\theta}}_{\mathrm{ML}}\right)\right]}^{-\frac{1}{2}} \), and it also generalizes to higher dimensions. Other common constructions of confidence intervals include those based on the asymptotic distribution of the score function *S*(*θ*) = *∂ℓ*∕*∂θ* and the log-likelihood ratio \( \log \left(L\left({\hat{\theta}}_{\mathrm{ML}}\right)/ L\left(\theta \right)\right) \) [8].

We now discuss another more generic approach to quantify parameter uncertainty, not restricted to ML estimation, which is applied frequently in practice due to its simple implementation. Bootstrapping [9] is a resampling method in which independent observations are resampled from the data with replacement. The resulting new data set consists of (some of) the original observations, and under i.i.d. assumptions, the bootstrap replicates have asymptotically the same distribution as the data. Intuitively, by sampling with replacement, one is pretending that the collection of replicates thus obtained is a good proxy for the distribution of data sets that one would have obtained, had we been able to actually replicate the experiment. In this way, the variability of an estimator (or more generally the distribution of any test statistic) can be approximated by evaluating the estimator (or the statistic) on a collection of bootstrap replicates. For example, the distribution of the ML estimator of a model parameter *θ* can be obtained from the bootstrap samples.

### Example 12 (Bootstrap Confidence Interval for the ML Allele Frequency)

Consider the genotype data (*n*_{AA}, *n*_{Aa}, *n*_{aa}) = (81, 18, 1) of Example 11. For each bootstrap sample, we draw *N* = 100 genotypes with replacement from the original data to obtain random integer vectors of length three summing to 100. The ML estimate is computed for each of a total of *B* bootstrap samples. The resulting distributions of \( {\hat{p}}_{\mathrm{ML}} \) are shown in Fig. 5, for *B* = 100, 1000, and 10,000. The means of these empirical distributions are 0.899, 0.9004, and 0.9001, respectively, and 95% bootstrap confidence intervals can be derived from the 2.5 and 97.5% quantiles of the distributions. For *B* = 100, 1000, and 10,000, we obtain, respectively, [0.8598, 0.9350], [0.860, 0.940], and [0.855, 0.940]. The basic bootstrap confidence intervals have several limitations, including bias of the bootstrap estimator and skewness of the bootstrap distribution. Other methods exist for constructing confidence intervals from the bootstrap distribution to address some of them [9]. □
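A percentile-bootstrap sketch for this example (standard library only; with a fixed random seed, so the exact interval will differ slightly from the figures above):

```python
import random

random.seed(0)

# Original sample: 81 AA, 18 Aa, 1 aa genotypes.
data = ["AA"] * 81 + ["Aa"] * 18 + ["aa"] * 1

def p_ml(sample):
    """ML allele frequency of A under Hardy-Weinberg (Example 11)."""
    n_AA = sample.count("AA")
    n_Aa = sample.count("Aa")
    return (2 * n_AA + n_Aa) / (2 * len(sample))

B = 1000
# Resample N = 100 genotypes with replacement, B times.
estimates = sorted(
    p_ml(random.choices(data, k=len(data))) for _ in range(B)
)

# Percentile bootstrap: 2.5% and 97.5% quantiles of the estimates.
lower, upper = estimates[int(0.025 * B)], estimates[int(0.975 * B)]
assert 0.84 < lower < 0.9 < upper < 0.96
assert abs(sum(estimates) / B - 0.9) < 0.01
```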

In Bayesian inference, the parameter *θ* is treated as a random variable, and we seek the posterior distribution of *θ* given the observed data \( \mathcal{D} \). By Bayes’ theorem (Eq. 30), we have

\( P\left(\theta \mid \mathcal{D}\right)=\frac{P\left(\mathcal{D}\mid \theta \right)P\left(\theta \right)}{P\left(\mathcal{D}\right)}\propto P\left(\mathcal{D}\mid \theta \right)P\left(\theta \right) \)

From the posterior, credible intervals of parameter estimates can be derived such that the parameter lies in the interval with a certain probability, say 95%. This is in contrast to a 95% confidence interval in the frequentist approach because, there, the parameter is fixed and the interval boundaries are random variables. The meaning of a confidence interval is that 95% of similar intervals would contain the true parameter, if intervals were constructed independently from additional identically distributed data.

The prior *P*(*θ*) encodes our a priori belief in *θ* before observing the data. It can be used to incorporate domain-specific knowledge into the model, but it may also be uninformative or objective, in which case all parameter values are equally likely, or nearly so, a priori. However, it can sometimes be difficult to find noninformative priors. In practice, conjugate priors are most often used. A conjugate prior is one that is invariant with respect to the distribution family under multiplication with the likelihood, i.e., the posterior belongs to the same family as the prior. Conjugate priors are mathematically convenient and computationally efficient because the posterior can be calculated analytically for a wide range of statistical models.

### Example 13 (Dirichlet Prior)

Let *T* = (*T*_{1}, …, *T*_{K}) be a continuous random variable with state space Δ_{K−1}. The Dirichlet distribution \( \mathrm{Dir}\left(\alpha \right) \) with parameters \( \alpha \in {\mathbb{R}}_{+}^K \) has probability density function:

\( f(t)=\frac{\Gamma \left({\sum}_{k=1}^K{\alpha}_k\right)}{{\prod}_{k=1}^K\Gamma \left({\alpha}_k\right)}{\prod}_{k=1}^K{t}_k^{{\alpha}_k-1} \)

The Dirichlet distribution is the conjugate prior of the multinomial likelihood: if \( X\sim \mathrm{Mult}\left(n,{\theta}_1,\dots, {\theta}_K\right) \) and \( \theta \sim \mathrm{Dir}\left(\alpha \right) \), then the posterior is again Dirichlet, \( \left(\theta \mid X\right)\sim \mathrm{Dir}\left(\alpha +X\right) \). For *K* = 2, this distribution is called the beta distribution. Hence, the beta distribution is the conjugate prior to the binomial likelihood. □

### Example 14 (Posterior Probability of Genotype Frequencies)

Let us regard the genotype data (*n*_{AA}, *n*_{Aa}, *n*_{aa}) = (81, 18, 1) as the result of a draw from a multinomial distribution \( \mathrm{Mult}\left(n,{\theta}_{\mathrm{AA}},{\theta}_{\mathrm{Aa}},{\theta}_{\mathrm{aa}}\right) \). Assuming a Dirichlet prior \( \mathrm{Dir}\left({\alpha}_{\mathrm{AA}},{\alpha}_{\mathrm{Aa}},{\alpha}_{\mathrm{aa}}\right) \), the posterior genotype probabilities follow the Dirichlet distribution \( \mathrm{Dir}\left({\alpha}_{\mathrm{AA}}+{n}_{\mathrm{AA}},{\alpha}_{\mathrm{Aa}}+{n}_{\mathrm{Aa}},{\alpha}_{\mathrm{aa}}+{n}_{\mathrm{aa}}\right) \). In Fig. 6, the prior \( \mathrm{Dir}\left(10,10,10\right) \) is shown on the left, the multinomial likelihood \( P\left(\left({n}_{\mathrm{AA}},{n}_{\mathrm{Aa}},{n}_{\mathrm{aa}}\right)=\left(81,18,1\right)\mid {\theta}_{\mathrm{AA}},{\theta}_{\mathrm{Aa}},{\theta}_{\mathrm{aa}}\right) \) in the center, and the resulting posterior \( \mathrm{Dir}\left(10+81,10+18,10+1\right) \) on the right. Note that the MLE is different from the mode of the posterior. As compared to the likelihood, the nonuniform prior has shifted the maximum of the posterior toward the center of the probability simplex. □

We often have two or more competing models and would like to assess which one best describes the given data. For example, we may have observed genotypes from the set {AA, Aa, aa} and want to test whether the Hardy–Weinberg model (Example 1) is a more appropriate description of the genotype data than the multinomial model of the previous Example 14. Intuitively, we might want to select the model that fits the data best, for example, by comparing their likelihoods. However, the Hardy–Weinberg model has only one parameter, namely the allele frequency *p*, whereas the multinomial model has three parameters subject to the constraint *θ*_{AA} + *θ*_{Aa} + *θ*_{aa} = 1. Hence, the number of free parameters is one and two, respectively, for the two models. This difference in the complexity of the models makes a comparison based only on the goodness of fit invalid, because models with more parameters, i.e., higher complexity, can generally provide a better fit. Estimating model complexity and scoring models based on both model complexity and goodness of fit is therefore essential for model comparison and model selection.

The goal of model selection is to find the model that best generalizes to unseen data, rather than just fits the observed data, because we seek the model capable of the most accurate predictions. A model that fits well but generalizes poorly is said to overfit the data. Models that are too complex tend to overfit the data. Model selection can be regarded as finding the right level of model complexity for the given data, such that the predictive performance is optimized. This involves defining a criterion of optimality and a procedure for finding the optimal model.

For comparing two nested or non-nested models \( {\mathrm{M}}_0 \) (the null model) and \( {\mathrm{M}}_1 \) (the alternative model), the likelihood ratio test is based on the statistic

\( \Lambda =\frac{\sup_{\theta \in {\Theta}_0}L\left(\theta \right)}{\sup_{\theta \in {\Theta}_1}L\left(\theta \right)} \)

If Λ is smaller than a given threshold *c*, we reject the null model and favor the alternative model. The choice of *c* should be informed by the distribution of Λ under the null. If the two models are nested, i.e., if \( {\mathrm{M}}_0 \) can be obtained from \( {\mathrm{M}}_1 \) by specifying a subset of the parameters, then \( -2\kern0.3em \log \Lambda \) is approximately *χ*^{2}-distributed with degrees of freedom equal to the difference in the number of free parameters between \( {\mathrm{M}}_1 \) and \( {\mathrm{M}}_0 \).

Alternative model selection criteria score each model \( {\mathrm{M}}_i \), *i* = 0, 1, by penalizing the maximized likelihood with a term that accounts for model complexity. Popular criteria of this form include the Bayesian information criterion (BIC; *see* Subheading 7) and the Akaike information criterion (AIC) [11].

### Exercise 15 (Poisson Distribution)

We wish to model the number of bacterial colonies in a Petri dish and assume that the count data of this experiment follows a Poisson distribution \( \mathrm{Pois}\left(\lambda \right) \) (Example 3). Derive the log-likelihood function of this model and calculate the MLE of the model parameter *λ*. Suppose now that the number of bacterial colonies on a Petri dish follows the Poisson distribution with mean *λ* = 5. What is the probability of finding exactly three colonies?

## 3 Hidden Data and the EM Algorithm

We often cannot observe all relevant random variables due to, for example, experimental limitations or study designs. In this case, a statistical model *P*(*X*, *Z* ∣ *θ* ∈ Θ) consists of the observed random variable *X* and the hidden (or latent) random variable *Z*, both of which can be multivariate. In this section, we write *X* = (*X*^{(1)}, …, *X*^{(N)}) for the random variables describing the *N* observations and refer to *X* also as the observed data. The hidden data for this model is *Z* = (*Z*^{(1)}, …, *Z*^{(N)}) and the complete data is (*X*, *Z*). For convenience, we assume the parameter space Θ to be continuous and the state spaces \( \mathcal{X} \) of *X* and \( \mathcal{Z} \) of *Z* to be discrete.

In the Bayesian framework, inference is based on the joint posterior

\( P\left(\theta, Z\mid X\right)\propto P\left(X\mid \theta, Z\right)P\left(\theta, Z\right) \)    (40)

which is *P*(*X*, *Z* ∣ *θ*)*P*(*θ*) if priors are independent, i.e., if *P*(*θ*, *Z*) = *P*(*θ*)*P*(*Z*). Alternatively, if the distribution of the hidden data *Z* is not of interest, it can be marginalized out. Then, the posterior (Eq. 40) becomes

\( P\left(\theta \mid X\right)\propto {\sum}_ZP\left(X,Z\mid \theta \right)P\left(\theta \right) \)

In the likelihood framework, ML estimation is based on the observed log-likelihood

\( {\ell}_{\mathrm{obs}}\left(\theta \right)=\log P\left(X\mid \theta \right)=\log {\sum}_ZP\left(X,Z\mid \theta \right) \)    (48)

The sum over the hidden data renders direct maximization difficult. The expectation–maximization (EM) algorithm addresses this problem. We introduce an arbitrary distribution *q*(*Z*) of the hidden data *Z* and write

\( {\ell}_{\mathrm{obs}}\left(\theta \right)=\log {\sum}_Zq(Z)\frac{P\left(X,Z\mid \theta \right)}{q(Z)} \)

Jensen’s inequality applied to the concave \( \log \) function asserts that \( \log \kern0.15em \mathrm{E}\left[Y\right]\ge \mathrm{E}\left[\log Y\right] \). Hence, the observed log-likelihood is bounded from below by \( \mathrm{E}\left[\log \left(P\left(X,Z\mid \theta \right)/ q(Z)\right)\right] \), or

\( {\ell}_{\mathrm{obs}}\left(\theta \right)\ge {\sum}_Zq(Z)\log \frac{P\left(X,Z\mid \theta \right)}{q(Z)} \)    (50)

The idea of the EM algorithm is to maximize this lower bound instead of *ℓ*_{obs}(*θ*) itself. Intuitively, this task is easier because the big sum over the hidden data in Eq. 48 disappears on the right-hand side of Eq. 50 upon taking expectations.

The EM algorithm alternates between two steps. In the E step, the lower bound is made tight by choosing *q* by setting *q*(*Z*) = *P*(*Z* ∣ *X*, *θ*^{(t)}), where *θ*^{(t)} is the current estimate of *θ*, and computing the expected value of the hidden log-likelihood:

\( Q\left(\theta \mid {\theta}^{(t)}\right)={\sum}_ZP\left(Z\mid X,{\theta}^{(t)}\right)\log P\left(X,Z\mid \theta \right) \)

In the M step, *Q* is maximized with respect to *θ* to obtain an improved estimate:

\( {\theta}^{\left(t+1\right)}=\arg \underset{\theta }{\max }\kern0.3em Q\left(\theta \mid {\theta}^{(t)}\right) \)

The sequence of estimates *θ*^{(1)}, *θ*^{(2)}, *θ*^{(3)}, … converges to a local maximum of the likelihood surface (Eq. 48). The global maximum and, hence, the MLE is generally not guaranteed to be found with this local optimization method. In practice, the EM algorithm is often run repeatedly with many different starting solutions *θ*^{(1)}, or with few very reasonable starting solutions obtained from other heuristics or educated guesses.
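To make the two steps concrete, here is a minimal EM sketch for a toy model of our own choosing (not from the text): a 50/50 mixture of two coins with unknown biases, where each observation is the number of heads in *m* tosses of one hidden coin.

```python
from math import comb

def em_two_coins(counts, m, theta, n_iter=200):
    """EM for a 50/50 mixture of two coins with biases theta = (t0, t1).

    counts: heads counts, each from m tosses of one (hidden) coin.
    """
    t0, t1 = theta
    for _ in range(n_iter):
        # E step: responsibility of coin 0 for each observation.
        h0 = w0 = h1 = w1 = 0.0  # expected heads / tosses per coin
        for k in counts:
            l0 = comb(m, k) * t0**k * (1 - t0) ** (m - k)
            l1 = comb(m, k) * t1**k * (1 - t1) ** (m - k)
            r0 = l0 / (l0 + l1)
            h0 += r0 * k
            w0 += r0 * m
            h1 += (1 - r0) * k
            w1 += (1 - r0) * m
        # M step: re-estimate each bias from its expected counts.
        t0, t1 = h0 / w0, h1 / w1
    return t0, t1

# Half the samples come from a coin with bias ~0.25, half from one with ~0.85.
counts = [2, 3, 2, 3, 8, 9, 8, 9]
t0, t1 = em_two_coins(counts, m=10, theta=(0.4, 0.6))
assert abs(t0 - 0.25) < 0.05 and abs(t1 - 0.85) < 0.05
```

Like any EM run, the fixed point reached depends on the starting values; swapping the initial biases swaps the roles of the two coins.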

### Example 16 (Naive Bayes)

Suppose our data consist of *L* different features (*X*_{1}, …, *X*_{L}) and we want to cluster observations into *K* distinct groups. For this purpose, we introduce a hidden random variable *Z* with state space \( \mathcal{Z}=\left[K\right]=\left\{1,\dots, K\right\} \) indicating class membership. The joint probability of (*X*_{1}, …, *X*_{L}) and *Z* is

\( P\left({X}_1,\dots, {X}_L,Z\right)=P(Z){\prod}_{n=1}^LP\left({X}_n\mid Z\right) \)

The statistical model defined by this factorization with hidden class variable *Z* is the unsupervised naive Bayes model. The observed variables *X*_{n} are often called features and *Z* the latent class variable (Fig. 7).

The parameters of the model are the class prior *P*(*Z*), which we assume to be constant and will ignore, and the conditional probabilities *θ*_{n,kx} = *P*(*X*_{n} = *x* ∣ *Z* = *k*). The complete-data likelihood of observed data *X* = (*X*^{(1)}, …, *X*^{(N)}) and hidden data *Z* = (*Z*^{(1)}, …, *Z*^{(N)}) is

\( L\left(\theta \right)={\prod}_{i=1}^N{\prod}_{n=1}^L{\prod}_{k=1}^K{\prod}_{x\in \mathcal{X}}{\theta}_{n, kx}^{I_{n, kx}\left({Z}^{(i)}\right)} \)

where the indicator variable *I*_{n,kx}(*Z*^{(i)}) is equal to one if and only if *Z*^{(i)} = *k* and \( {X}_n^{(i)}=x \), and zero otherwise.

In order to estimate *θ* without observing *Z*, we consider the hidden log-likelihood:

\( {\ell}_{\mathrm{hid}}\left(\theta \right)={\sum}_{i=1}^N{\sum}_{n=1}^L{\sum}_{k=1}^K{\sum}_{x\in \mathcal{X}}{I}_{n, kx}\left({Z}^{(i)}\right)\log {\theta}_{n, kx} \)

In the E step of the EM algorithm, we compute the expected values of the indicator variables with respect to the hidden class labels *Z*^{(i)}:

\( {\gamma}_{n, kx}^{(i)}=\mathrm{E}\left[{I}_{n, kx}\left({Z}^{(i)}\right)\mid X,{\theta}^{\prime}\right]=P\left({Z}^{(i)}=k\mid {X}^{(i)},{\theta}^{\prime}\right)\kern0.5em \mathrm{if}\ {X}_n^{(i)}=x,\ \mathrm{and}\ \mathrm{zero}\ \mathrm{otherwise} \)

where *θ′* is the current estimate of *θ*. The expected value \( {\gamma}_{n, kx}^{(i)} \) is sometimes referred to as the responsibility of class *k* for observation \( {X}_n^{(i)}=x \). The expected hidden log-likelihood can be written in terms of the expected counts \( {N}_{n, kx}={\sum}_{i=1}^N{\gamma}_{n, kx}^{(i)} \) as:

\( \mathrm{E}\left[{\ell}_{\mathrm{hid}}\left(\theta \right)\right]={\sum}_{n=1}^L{\sum}_{k=1}^K{\sum}_{x\in \mathcal{X}}{N}_{n, kx}\log {\theta}_{n, kx} \)

Maximizing this expression in the M step yields the update \( {\hat{\theta}}_{n, kx}={N}_{n, kx}/ {\sum}_{x^{\prime}\in \mathcal{X}}{N}_{n,k{x}^{\prime}} \). □

## 4 Markov Chains

A stochastic process is a collection of random variables indexed by time, where *X*_{t} is the state of the process at time *t*. A discrete-time stochastic process *X* = (*X*_{1}, *X*_{2}, *X*_{3}, … ) is called a Markov chain [14], if *X*_{n+1} ⊥ *X*_{n−1} ∣ *X*_{n} for all *n* ≥ 2 or, equivalently, if each state depends only on its immediate predecessor:

\( P\left({X}_{n+1}\mid {X}_1,\dots, {X}_n\right)=P\left({X}_{n+1}\mid {X}_n\right) \)

A finite Markov chain with state space [*K*] is parametrized by the initial state distribution Π ∈ Δ_{K−1}, where Π_{k} = *P*(*X*_{1} = *k*), and the stochastic *K* × *K* transition matrix *T* = (*T*_{kl}), where

\( {T}_{kl}=P\left({X}_{n+1}=l\mid {X}_n=k\right) \)

Generalizing the one-step transition probabilities *T*_{kl}, we denote by \( {T}_{kl}^{(n)} \) the probability of jumping from state *k* to state *l* in *n* time steps. Any (*n* + *m*)-step transition can be regarded as an *n*-step transition followed by an *m*-step transition. Because the intermediate state *i* is unknown, summing over all possible values yields the decomposition:

\( {T}_{kl}^{\left(n+m\right)}={\sum}_{i=1}^K{T}_{ki}^{(n)}{T}_{il}^{(m)} \)

These identities are known as the Chapman–Kolmogorov equations. In matrix notation, *T*^{(n+m)} = *T*^{(n)}*T*^{(m)}. It follows that the *n*-step transition matrix is the *n*-th matrix power of the one-step transition matrix, *T*^{(n)} = *T*^{n}.
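The identity *T*^{(n)} = *T*^{n} is easy to check for a small chain (a sketch, plain Python with nested lists; the transition matrix is ours):

```python
def matmul(A, B):
    """Product of two 2 x 2 matrices given as nested lists."""
    return [
        [sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
        for i in range(2)
    ]

T = [[0.9, 0.1],
     [0.4, 0.6]]

# Chapman-Kolmogorov: a 3-step transition is one step followed by two,
# or two steps followed by one.
T2 = matmul(T, T)
T3 = matmul(T, T2)
T3_alt = matmul(T2, T)
assert all(
    abs(T3[i][j] - T3_alt[i][j]) < 1e-12 for i in range(2) for j in range(2)
)
# Rows of every power remain probability distributions.
assert all(abs(sum(row) - 1.0) < 1e-12 for row in T3)
```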

A state *l* of a Markov chain is accessible from state *k* if \( {T}_{kl}^n>0 \) for some *n* ≥ 0. We say that *k* and *l* communicate with each other and write *k* ∼ *l* if they are accessible from one another. State communication is reflexive (*k* ∼ *k*), symmetric (*k* ∼ *l* ⇒ *l* ∼ *k*), and, by the Chapman–Kolmogorov equations, transitive (*j* ∼ *k* ∼ *l* ⇒ *j* ∼ *l*). Hence, it defines an equivalence relation on the state space. The Markov chain is irreducible if it has a single communication class, i.e., if any state is accessible from any other state.

A state is recurrent if the Markov chain will reenter it with probability one. Otherwise, the state is transient. In finite-state Markov chains, recurrent states are also positive recurrent, i.e., the expected time to return to the state is finite. A state is aperiodic if the process can return to it after any time *n* ≥ 1. Recurrence, positive recurrence, and aperiodicity are class properties: if they hold for a state *k*, then they also hold for all states communicating with *k*.

If a Markov chain is irreducible, positive recurrent, and aperiodic, it is called ergodic, and it has a unique stationary distribution *π* given by:

\( {\pi}_l=\underset{n\to \infty }{\lim }{T}_{kl}^{(n)},\kern1em \mathrm{independent}\ \mathrm{of}\ k \)

Equivalently, *π* is the solution of *π*^{t} = *π*^{t}*T*.

### Example 17 (Two-State Markov Chain)

Consider the two-state Markov chain with transition probabilities *T*_{12} = *α* > 0 and *T*_{21} = *β* > 0. Clearly, the chain is ergodic and its stationary distribution *π* is given by:

\( {\pi}^t={\pi}^tT\kern1em \Leftrightarrow \kern1em \alpha {\pi}_1=\beta {\pi}_2 \)

With *π*_{1} + *π*_{2} = 1, we obtain *π*^{t} = (*α* + *β*)^{−1}(*β*, *α*). □

In Example 17, if *α* = 0, then state 1 is called an absorbing state because once entered it is never left. In evolutionary biology and population genetics, Markov chains are often used to model evolving populations, and the fixation probability of an allele can be computed as the absorption probability in such models.

### Example 18 (Wright–Fisher Process)

Consider a diploid population of constant size *N*. The total number of A alleles in generation *n* is described by a Markov chain *X*_{n} with state space {0, 1, 2, …, 2*N*}. We assume that individuals mate randomly and that maternal and paternal alleles are chosen randomly such that \( \left({X}_{n+1}\mid {X}_n\right)\sim \mathrm{Binom}\left(2N,k/ (2N)\right) \), where *k* is the number of A alleles in generation *n*. The Markov chain has transition probabilities:

\( {T}_{kl}=\binom{2N}{l}{\left(\frac{k}{2N}\right)}^l{\left(1-\frac{k}{2N}\right)}^{2N-l} \)

The expected number of A alleles stays constant over time: if *X*_{1} = *k*, then E(*X*_{1}) = *k*. After binomial sampling, E(*X*_{2}) = 2*N*(*k*∕(2*N*)) = *k* and hence E(*X*_{n}) = *k* for all *n* ≥ 0. The Markov chain has the two absorbing states 0 and 2*N*, which correspond, respectively, to extinction and fixation of the A allele. To compute the fixation probability *h*_{k} of A given *k* initial copies of it, we use the constancy of E(*X*_{n}) = *k*, in the limit as *n* → *∞*, to obtain *h*_{k} = *k*∕(2*N*), the initial relative frequency of the allele. The Wright–Fisher process [15, 16] is a basic stochastic model for random genetic drift, i.e., for the variation in allele frequencies only due to random sampling. □

Suppose we have *N* independent observations *X* = (*X*^{(1)}, …, *X*^{(N)}) from a finite Markov chain \( \mathrm{MC}\left(\Pi, T\right) \) of length *L*. Then the likelihood is:

$$ P\left(X\mid \Pi, T\right)=\prod_k{\Pi}_k^{N_k}\prod_{k,l}{T}_{kl}^{N_{kl}}, $$

with *N*_{kl}(*X*^{(i)}) the number of observed transitions from state *k* into state *l* in observation *X*^{(i)}, \( {N}_{kl}={\sum}_{i=1}^N{N}_{kl}\left({X}^{(i)}\right) \) the total number of *k*-to-*l* transitions in the data, and similarly *N*_{k}(*X*^{(i)}) and *N*_{k} the number of times the *i*-th chain, respectively all chains, started in state *k*. Maximizing this likelihood yields the MLEs \( {\hat{\Pi}}_k={N}_k/ N \) and \( {\hat{T}}_{kl}={N}_{kl}/ {\sum}_m{N}_{km} \).
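The maximum likelihood estimates of Π and *T* are the normalized observed frequencies of start states and transitions. A small sketch with made-up two-state observations (states 0 and 1):

```python
from collections import Counter

# Three made-up observations of a two-state chain (states 0 and 1).
data = [[0, 0, 1, 1, 1, 0],
        [1, 1, 0, 0, 1, 1],
        [0, 1, 1, 0, 0, 0]]

starts = Counter(x[0] for x in data)                             # N_k
trans = Counter((a, b) for x in data for a, b in zip(x, x[1:]))  # N_kl

# The MLEs are the normalized counts.
N = len(data)
Pi_hat = {k: starts[k] / N for k in (0, 1)}
T_hat = {(k, l): trans[(k, l)] / sum(trans[(k, m)] for m in (0, 1))
         for k in (0, 1) for l in (0, 1)}
print(Pi_hat)
print(T_hat)
```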

### Exercise 19 (Markov Chains)

Let us consider a simple infectious disease model, where each individual is either healthy (H) or diseased (D). We assume the following two-state Markov chain to describe infection-related disease and recovery via clearance of the pathogen:

The probability of a healthy individual becoming sick due to infection is *α* = 0.6, and the probability of a diseased individual to clear the infection and recover is *β* = 0.9. The initial probabilities for health and disease are *P*(H) = 0.7 and *P*(D) = 0.3. Write down the transition matrix *T* of this Markov chain. What is the probability of observing the disease trajectories DDHHD and HDHDH? Calculate the stationary distribution of the Markov chain.

## 5 Continuous-Time Markov Chains

A stochastic process {*X*(*t*), *t* ≥ 0} with finite state space [*K*] is a continuous-time Markov chain if

$$ P\left(X\left(t+s\right)=l\mid X(s)=k,\kern0.3em X(u)=x(u),\kern0.3em 0\le u<s\right)=P\left(X\left(t+s\right)=l\mid X(s)=k\right) $$

for all *s*, *t* ≥ 0 and all *k*, *l*, *x*(*u*) ∈ [*K*], 0 ≤ *u* < *s*. The chain is homogeneous if this probability (Eq. 70) is independent of *s*. The transition probabilities are then denoted *T*_{kl}(*t*) = *P*(*X*(*t* + *s*) = *l* ∣ *X*(*s*) = *k*), and the transition matrix *T*(*t*) is the matrix exponential of a constant rate matrix *R* times *t*:

$$ T(t)={e}^{Rt}=\sum_{n=0}^{\infty}\frac{{\left( Rt\right)}^n}{n!}. $$

### Example 20 (Jukes–Cantor Model)

Consider a fixed position in a DNA sequence and let *T*_{kl}(*t*) be the probability that, due to mutation, nucleotide *k* changes to nucleotide *l* after time *t* at this position (Fig. 8). The Jukes–Cantor model [17] is the simplest DNA substitution model. It assumes that the transition rates from any nucleotide to any other are equal:

$$ R=\left(\begin{array}{cccc}-3\alpha & \alpha & \alpha & \alpha \\ {}\alpha & -3\alpha & \alpha & \alpha \\ {}\alpha & \alpha & -3\alpha & \alpha \\ {}\alpha & \alpha & \alpha & -3\alpha \end{array}\right). $$

The resulting transition probabilities are *T*_{kk}(*t*) = (1 + 3*e*^{−4αt})∕4 and *T*_{kl}(*t*) = (1 − *e*^{−4αt})∕4 for *k* ≠ *l*, and the stationary distribution in the limit *t* → *∞* is uniform, *π* = (1∕4, 1∕4, 1∕4, 1∕4)^{t}. □
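A quick numerical check, using the closed-form Jukes–Cantor transition probabilities (*α* and *t* below are arbitrary illustrative values):

```python
import math

def jc_prob(k, l, t, alpha=0.1):
    """Closed-form Jukes-Cantor transition probability T_kl(t)."""
    e = math.exp(-4 * alpha * t)
    return 0.25 + 0.75 * e if k == l else 0.25 - 0.25 * e

nts = "ACGT"
# Each row of T(t) is a probability distribution over the nucleotides
for k in nts:
    assert abs(sum(jc_prob(k, l, 0.5) for l in nts) - 1.0) < 1e-12
# In the limit t -> infinity the distribution becomes uniform
print([round(jc_prob("A", l, 100.0), 4) for l in nts])  # [0.25, 0.25, 0.25, 0.25]
```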

### Example 21 (The Poisson Process)

A stochastic process *X*(*t*) is a counting process if *X*(*t*) represents the total number of events that occur by time *t*. It is a Poisson process if, in addition, *X*(0) = 0, the increments are independent, and in any interval of length *t* the number of events is Poisson distributed with rate *λt*:

$$ P\left(X\left(t+s\right)-X(s)=n\right)={e}^{-\lambda t}\frac{{\left(\lambda t\right)}^n}{n!},\kern1em n=0,1,2,\dots \kern0.5em \square $$

### Example 22 (Exponential Distribution)

The exponential distribution \( \mathrm{Exp}\left(\lambda \right) \) with parameter *λ* > 0 is a common distribution for waiting times. It is defined by the density function:

$$ f(x)=\lambda {e}^{-\lambda x},\kern1em x\ge 0. $$

An exponentially distributed random variable *X* has expectation E(*X*) = *λ*^{−1} and variance \( \mathrm{Var}(X)={\lambda}^{-2} \). The exponential distribution is memoryless, which means that *P*(*X* > *s* + *t* ∣ *X* > *t*) = *P*(*X* > *s*), for all *s*, *t* > 0. An important consequence of the memoryless property is that the waiting times between successive events are i.i.d. For example, the waiting times *τ*_{n} (*n* ≥ 1) between the events of a Poisson process, the sequence of interarrival times, are exponentially distributed, \( {\tau}_n\sim \mathrm{Exp}\left(\lambda \right) \), for all *n* ≥ 1. □
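The connection between exponential interarrival times and Poisson counts can be illustrated by simulation; the rate, time horizon, and number of runs below are arbitrary choices:

```python
import random

def poisson_counts(lam, t, runs, rng):
    """Count events before time t, with Exp(lam) interarrival times."""
    counts = []
    for _ in range(runs):
        time, n = rng.expovariate(lam), 0
        while time <= t:
            n += 1
            time += rng.expovariate(lam)
        counts.append(n)
    return counts

# lam, t, and the number of runs are arbitrary illustrative values.
rng = random.Random(0)
lam, t = 2.0, 3.0
counts = poisson_counts(lam, t, 20000, rng)
print(sum(counts) / len(counts))  # close to E(X(t)) = lam * t = 6
```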

### Exercise 23 (Kimura Model)

## 6 Hidden Markov Models

A hidden Markov model (HMM) consists of hidden random variables *Z* = (*Z*_{1}, …, *Z*_{L}), which form a homogeneous Markov chain, and observed random variables *X* = (*X*_{1}, …, *X*_{L}). Each observed symbol *X*_{n} depends on the hidden state *Z*_{n}. The HMM is illustrated in Fig. 9. It encodes the following conditional independence statements: each hidden state *Z*_{n+1} is independent of all previous hidden states and observations given *Z*_{n}, and each observation *X*_{n} is independent of all other variables given *Z*_{n}. The parameters of the HMM are the initial state probabilities Π = *P*(*Z*_{1}), the transition probabilities *T*_{kl} = *P*(*Z*_{n} = *l* ∣ *Z*_{n−1} = *k*) of the Markov chain, and the emission probabilities *E*_{kx} = *P*(*X*_{n} = *x* ∣ *Z*_{n} = *k*) of symbols \( x\in \mathcal{X} \). The HMM is denoted \( \mathrm{HMM}\left(\Pi, T,E\right) \). For simplicity, we restrict ourselves here to finite state spaces \( \mathcal{Z}=\left[K\right] \) of *Z* and \( \mathcal{X} \) of *X*. The joint probability of (*Z*, *X*) factorizes as (Eq. 79):

$$ P\left(Z,X\right)=P\left({Z}_1\right)\prod_{n=2}^LP\left({Z}_n\mid {Z}_{n-1}\right)\prod_{n=1}^LP\left({X}_n\mid {Z}_n\right). $$

The HMM is typically used to model sequence data *x* = (*x*_{1}, *x*_{2}, …, *x*_{L}) generated by different mechanisms *z*_{n} which cannot be observed. Each observation *x* can be a time series or any other object with a linear dependency structure [19]. In computational biology, the HMM is frequently applied to DNA and protein sequence data, where it accounts for first-order spatial dependencies of nucleotides or amino acids [20].

### Example 24 (CpG Islands)

CpG islands are CG-enriched regions in a DNA sequence. They are typically a few hundreds to thousands of base pairs long. We want to use a simple HMM to detect CpG islands in genomic DNA. The hidden states \( {Z}_n\in \mathcal{Z}=\left\{-,+\right\} \) indicate whether sequence position *n* belongs to a CpG island (+ ) or not (−). The observed sequence is given by the nucleotide at each position, \( {X}_n\in \mathcal{X}=\left\{\mathtt{A},\mathtt{C},\mathtt{G},\mathtt{T}\right\} \).

Suppose we observe the sequence *x* = (C, A, C, G). Then, we can calculate the joint probability of *x* and any state path *z* by Eq. 79. For example, if *z* = (+, −, −, +), then *P*(*X* = *x*, *Z* = *z*) = Π_{+} *E*_{+,C}*T*_{+,−} *E*_{−,A}*T*_{−,−} *E*_{−,C}*T*_{−,+} *E*_{+,G}. □
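The computation in Example 24 can be sketched in code. The parameter values below are invented for illustration; only the structure of the product follows Eq. 79:

```python
# Joint probability P(X = x, Z = z) as in Example 24; the parameter
# values below are made up for illustration.
Pi = {"+": 0.1, "-": 0.9}
T = {("+", "+"): 0.8, ("+", "-"): 0.2, ("-", "-"): 0.9, ("-", "+"): 0.1}
E = {"+": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
     "-": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}

def joint(x, z):
    """P(X=x, Z=z) = Pi_{z_1} E_{z_1,x_1} * prod_n T_{z_{n-1},z_n} E_{z_n,x_n}."""
    p = Pi[z[0]] * E[z[0]][x[0]]
    for n in range(1, len(x)):
        p *= T[(z[n - 1], z[n])] * E[z[n]][x[n]]
    return p

x = ("C", "A", "C", "G")
print(joint(x, ("+", "-", "-", "+")))  # 1.728e-05 up to rounding
```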

A typical application of HMMs is to infer the hidden state path *z* = (*z*_{1}, *z*_{2}, …, *z*_{L}) that gave rise to the observation *x*. For biological sequences, *z* is often called the annotation of *x*. In Example 24, the genomic sequence is annotated with CpG islands. For generic parameters, any state path can give rise to a given observed sequence, but with different probabilities. The decoding problem is to find the annotation *z*^{∗} that maximizes the joint probability (Eq. 80):

$$ {z}^{\ast }={\mathrm{argmax}}_zP\left(Z=z,X=x\right). $$

There are *K*^{L} possible state paths such that already for sequences of moderate length, the optimization problem (Eq. 80) cannot be solved by enumerating all paths.

However, the maximum over all state paths (*Z*_{1}, …, *Z*_{L}) can be obtained by recursively computing maxima over each *Z*_{n}:

$$ \underset{z}{\max }P\left(z,x\right)=\underset{z_1}{\max}\left[P\left({z}_1\right)P\left({x}_1\mid {z}_1\right)\underset{z_2}{\max}\left[P\left({z}_2\mid {z}_1\right)P\left({x}_2\mid {z}_2\right)\cdots \underset{z_L}{\max}\left[P\left({z}_L\mid {z}_{L-1}\right)P\left({x}_L\mid {z}_L\right)\right]\cdots \right]\right]. $$

Each of the *L* terms in parentheses defines a probability distribution over *K* states by maximizing over *K* values. Hence, the time complexity of the algorithm is *O*(*LK*^{2}), despite the fact that the maximum is over *K*^{L} paths. This procedure is known as dynamic programming, and it is the workhorse of biological sequence analysis. For HMMs, it is known as the Viterbi algorithm [21].

To calculate the likelihood *P*(*X* = *x*) of an observed sequence *x*, we need to sum the joint probability *P*(*Z* = *z*, *X* = *x*) over all hidden state paths *z*. This sum has *K*^{L} terms, but it can be computed efficiently by the same dynamic programming principle used for the Viterbi algorithm: replacing the maxima by sums yields the forward algorithm, which computes the partial solutions *f*(*n*, *Z*_{n}) = *P*(*X*_{1}, …, *X*_{n}, *Z*_{n}).

The same decomposition can also be applied in the reverse direction, from *Z*_{L} down to *Z*_{1}. The resulting backward algorithm generates the partial solutions *b*(*n*, *Z*_{n}) = *P*(*X*_{n+1}, …, *X*_{L} ∣ *Z*_{n}). From the forward and backward quantities, one can also compute the position-wise posterior state probabilities:

$$ P\left({Z}_n=k\mid X=x\right)=\frac{f\left(n,k\right)\kern0.3em b\left(n,k\right)}{P\left(X=x\right)}. $$
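The forward and backward recursions and the resulting posterior state probabilities can be sketched as follows (the CpG-style parameters are made up for illustration):

```python
# Sketch of the forward-backward algorithm for a finite HMM.
Pi = {"+": 0.1, "-": 0.9}
T = {("+", "+"): 0.8, ("+", "-"): 0.2, ("-", "-"): 0.9, ("-", "+"): 0.1}
E = {"+": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
     "-": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}

def forward_backward(x, states, Pi, T, E):
    L = len(x)
    # forward: f(n, k) = P(x_1, ..., x_n, Z_n = k)
    f = [{k: Pi[k] * E[k][x[0]] for k in states}]
    for n in range(1, L):
        f.append({k: E[k][x[n]] * sum(f[-1][j] * T[(j, k)] for j in states)
                  for k in states})
    # backward: b(n, k) = P(x_{n+1}, ..., x_L | Z_n = k)
    b = [{k: 1.0 for k in states}]
    for n in range(L - 1, 0, -1):
        b.insert(0, {k: sum(T[(k, j)] * E[j][x[n]] * b[0][j] for j in states)
                     for k in states})
    px = sum(f[-1][k] for k in states)  # likelihood P(X = x)
    post = [{k: f[n][k] * b[n][k] / px for k in states} for n in range(L)]
    return px, post

px, post = forward_backward("CACG", "+-", Pi, T, E)
print(px)       # likelihood of the sequence
print(post[3])  # posterior state distribution at the last position
```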

### Example 25 (Pairwise Sequence Alignment)

A pairwise alignment of two sequences *x* and *y* over an alphabet \( \mathcal{A} \) arranges the sequences such that homologous characters are placed in the same column, with gap symbols (−) indicating insertions and deletions of sequence *y* relative to sequence *x*. A pair HMM emits pairs (*X*_{n}, *Y*_{n}) of aligned sequence characters with state space \( \left(\mathcal{A}\times \mathcal{A}\right)\setminus \left\{\left(-,-\right)\right\} \), with hidden states distinguishing matches (M) from gaps in either sequence. Thus, a pairwise alignment is a probabilistically generated sequence of pairs of symbols.

The choice of transition and emission probabilities corresponds to fixing a scoring scheme in nonprobabilistic formulations of sequence alignment. For example, the emission probabilities *P*[(*a*, *b*) ∣ M] from a match state encode pairwise amino acid preferences and can be modeled by substitution matrices, such as PAM and BLOSUM [20].

In the pair HMM, computing an optimal alignment between *x* and *y* means to find the most probable state path \( {z}^{\ast }={\mathrm{argmax}}_zP\Big(X=x \), *Y* = *y*, *Z* = *z*), which can be solved using the Viterbi algorithm. Using the forward algorithm, we can also compute efficiently the marginal probability of two sequences being related independent of their alignment, *P*(*X*, *Y* ) =∑_{Z}*P*(*X*, *Y*, *Z*). In general, this probability is more informative than the posterior *P*(*Z* ∣ *X*, *Y* ) of an optimal alignment *z*^{∗} because many alignments tend to have the same or nearly the same probability such that *P*(*Z* = *z*^{∗} ∣ *X*, *Y* ) can be very small. Finally, we can also compute the probability of two characters *x*_{n} and *y*_{m} being aligned by means of posterior decoding. □

### Example 26 (Profile HMM)

A profile HMM models a family of related sequences, such as a protein family, by position-specific emission probabilities *E*_{n}(*a*) = *P*(*X*_{n} = *a*). In its simplest form, allowing only gap-free alignments, the probability of an observation *x* is just \( P(x)={\prod}_{n=1}^L{E}_n\left({x}_n\right) \). The full profile HMM has match states *M*_{n}, which can emit symbols according to the probability tables *E*_{n}, insert states *I*_{n}, which usually emit symbols in an unspecific manner, and delete states *D*_{n}, which do not emit any symbols. The possible transitions between those states allow for modeling alignment gaps of any length.

A given profile HMM for a protein family can be used to detect new sequences that belong to the same family. For a query sequence *x*, we can either consider the most probable alignment of the sequence to the HMM, *P*(*X* = *x*, *Z* = *z*^{∗}), or the marginal probability independent of the alignment, *P*(*X* = *x*) =∑_{Z}*P*(*X* = *x*, *Z*), to decide about family membership. □

If the state paths are unobserved, the parameters *θ* = (*T*, *E*) of the HMM have to be estimated from the observed sequences alone. For ML estimation, we need to maximize the observed log-likelihood:

$$ \ell \left(\theta \right)=\sum_{i=1}^N\log P\left({X}^{(i)}\mid \theta \right)=\sum_{i=1}^N\log \sum_zP\left({X}^{(i)},Z=z\mid \theta \right), $$

where *X*^{(1)}, …, *X*^{(N)} are the i.i.d. observations. For each observation, we can rewrite the joint probability as (Eq. 86):

$$ P\left({X}^{(i)},{Z}^{(i)}\mid \theta \right)=P\left({Z}_1^{(i)}\right)\prod_{k,x}{E}_{kx}^{N_{kx}\left({Z}^{(i)}\right)}\prod_{k,l}{T}_{kl}^{N_{kl}\left({Z}^{(i)}\right)}, $$

where *N*_{kx}(*Z*^{(i)}) is the number of *x* emissions when in state *k* and *N*_{kl}(*Z*^{(i)}) the number of *k*-to-*l* transitions in state path *Z*^{(i)} (cf. Eq. 68).

In the E step of the EM algorithm, we compute the expected hidden log-likelihood with respect to *P*(*Z* ∣ *X*, *θ′*), where *θ′* is the current best estimate of *θ*. We use Eq. 86 and denote by *N*_{kx} and *N*_{kl} the expected values of \( {\sum}_i{N}_{kx}\left({Z}^{(i)}\right) \) and \( {\sum}_i{N}_{kl}\left({Z}^{(i)}\right) \), respectively, to obtain:

$$ \mathrm{E}\left[\log P\left(X,Z\mid \theta \right)\right]=\mathrm{const}+\sum_{k,x}{N}_{kx}\log {E}_{kx}+\sum_{k,l}{N}_{kl}\log {T}_{kl}. $$

*N*_{kx} and *N*_{kl} are the sufficient statistics [11] of the HMM, i.e., with respect to the model, they contain all information about the parameters available from the data. The expected counts can be computed using the forward and backward algorithms. In the M step, this expression is maximized with respect to *θ* = (*T*, *E*). We find the MLEs \( {\hat{T}}_{kl}={N}_{kl}/ {\sum}_m{N}_{km} \) and \( {\hat{E}}_{kx}={N}_{kx}/ {\sum}_y{N}_{ky} \). For HMMs, this instance of the EM algorithm is known as the Baum–Welch algorithm [22].

## 7 Bayesian Networks

A Bayesian network for random variables *X* = (*X*_{1}, …, *X*_{L}) consists of a directed acyclic graph (DAG) and local probability distributions (LPDs). The DAG *G* = (*V*, *E*) has vertex set *V* = [*L*] and edge set *E* ⊆ *V* × *V*. Each vertex *n* ∈ *V* is identified with the random variable *X*_{n}. If there is an edge *X*_{m} → *X*_{n} in *G*, then *X*_{m} is a parent of *X*_{n} and *X*_{n} is a child of *X*_{m}. For each vertex *n* ∈ *V*, there is an LPD \( P\left({X}_n\mid {X}_{\mathrm{pa}(n)}\right) \), where \( \mathrm{pa}(n) \) is the set of parents of *X*_{n} in *G*. The Bayesian network model is defined as the family of distributions for which the joint probability of *X* factors into conditional probabilities as:

$$ P\left(X\mid \theta \right)=\prod_{n=1}^LP\left({X}_n\mid {X}_{\mathrm{pa}(n)},{\theta}_n\right), $$

where *θ* = (*θ*_{1}, …, *θ*_{L}) denotes the parameters of the LPDs. We write \( \mathrm{BN}\left(G,\theta \right) \) for this model.

For the Bayesian network shown in Fig. 11, we find *P*(*U*, *V*, *W*, *X*, *Y* ) = *P*(*U*)*P*(*Y* )*P*(*V* ∣ *U*, *Y* )*P*(*W* ∣ *V* )*P*(*X* ∣ *U*). The graph encodes several conditional independence statements about (*U*, *V*, *W*, *X*, *Y* ), including, for example, *W* ⊥{*U*, *X*} ∣ *V* .
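The factorization for the network of Fig. 11 can be checked directly: with any choice of LPDs (the binary tables below are made up), the product of the local distributions sums to one over all joint states:

```python
from itertools import product

# Made-up binary LPDs for the network of Fig. 11; each table gives
# the probability that the child variable equals 1 given its parents.
pU, pY = 0.6, 0.3
pV = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.4, (1, 1): 0.9}  # P(V=1 | U, Y)
pW = {0: 0.2, 1: 0.7}                                      # P(W=1 | V)
pX = {0: 0.3, 1: 0.8}                                      # P(X=1 | U)

def bern(p, v):
    """Probability of outcome v under a Bernoulli(p) variable."""
    return p if v == 1 else 1 - p

def joint(u, v, w, x, y):
    """P(U)P(Y)P(V | U, Y)P(W | V)P(X | U)."""
    return (bern(pU, u) * bern(pY, y) * bern(pV[(u, y)], v)
            * bern(pW[v], w) * bern(pX[u], x))

total = sum(joint(*a) for a in product((0, 1), repeat=5))
print(total)  # the LPD product defines a proper distribution: 1.0 up to rounding
```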

### Example 27 (Markov Chain)

A Markov chain is a Bayesian network with the chain graph *X*_{1} → *X*_{2} → ⋯ → *X*_{L}, denoted *C*, and joint distribution:

$$ P(X)=P\left({X}_1\right)\prod_{n=1}^{L-1}P\left({X}_{n+1}\mid {X}_n\right). $$

The parameters are *θ*_{1} = *P*(*X*_{1}) = Π and *θ*_{n+1} = *P*(*X*_{n+1} ∣ *X*_{n}) = *T* for all *n* ∈ [*L* − 1] such that \( \mathrm{MC}\left(\Pi, T\right)=\mathrm{BN}\left(C,\theta \right) \). Similarly, HMMs are Bayesian networks with hidden variables *Z* and the factorized joint distribution given in Eq. 79. □

The parameter space *θ* of a Bayesian network depends on the family of distributions that has been chosen for the LPDs. In the general case of a discrete random variable with finite state space, *θ*_{n} is a conditional probability table. If each vertex *X*_{n} has *K* possible states, then *θ*_{n} has \( {K}^{\mid \mathrm{pa}(n)\mid}\left(K-1\right) \) free parameters. If *X*_{n} depends on all other variables, then *θ*_{n} has the maximal number of *K*^{L−1}(*K* − 1) parameters, which is exponential in the number of vertices. If, on the other hand, *X*_{n} is independent of all other variables, \( \mathrm{pa}(n)=\varnothing \), then *θ*_{n} has (*K* − 1) parameters, which is independent of *L*. For the chain (Example 27), where each vertex has exactly one outgoing and one incoming edge, we find a total of (*K* − 1) + (*L* − 1)*K*(*K* − 1) free parameters, which is of order *O*(*LK*^{2}).

A popular parametric model for continuous random variables *X*_{n} is the linear Gaussian model. Here, the LPDs are Gaussian distributions with mean a linear function of the parents:

$$ P\left({X}_n\mid {X}_{\mathrm{pa}(n)}\right)=\mathcal{N}\left({X}_n\mid \sum_{m\in \mathrm{pa}(n)}{w}_{nm}{X}_m+{b}_n,\kern0.3em {\sigma}_n^2\right), $$

where the weights *w*_{nm}, offsets *b*_{n}, and variances \( {\sigma}_n^2 \) are the parameters. In this case, the marginal distributions of (*X*_{1}, …, *X*_{L}) are also Gaussians.

Learning a Bayesian network \( \mathrm{BN}\left(G,\theta \right) \) from data \( \mathcal{D} \) amounts to estimating both the parameters *θ* and the network structure *G*^{∗}. The first step is a model selection problem as introduced in Subheading 2. In the Bayesian approach, network structures are scored by their marginal likelihood (Eq. 94),

$$ P\left(\mathcal{D}\mid G\right)=\int P\left(\mathcal{D}\mid G,\theta \right)P\left(\theta \mid G\right)\kern0.3em d\theta, $$

where *P*(*θ* ∣ *G*) is the prior distribution of parameters given the network topology. A tractable approximation is the Bayesian information criterion (BIC) [26],

$$ \mathrm{BIC}=\log P\left(\mathcal{D}\mid G,\hat{\theta}\right)-\frac{\nu }{2}\log N, $$

where *ν* is the number of free parameters of the model and *N* the size of the data. The BIC approximation can be derived under certain assumptions, including a unimodal likelihood. It replaces computation of the integral (Eq. 94) by evaluating the integrand at the MLE and adding the correction term \( -\left(\nu \log N\right)/ 2 \), which penalizes models of high complexity.

The model selection problem remains hard even with a tractable scoring function, such as BIC, because of the enormous search space. Local search methods, such as greedy hill climbing or simulated annealing, are often used in practice. They return a local maximum as a point estimate for the best network structure. Results can be improved by running several local searches from different starting topologies.

Often, data are sparse and we will find diffuse posterior distributions of network structures, which might not be represented very well by a single point estimate. In the fully Bayesian approach, we aim at estimating the full posterior \( P\left(G\mid \mathcal{D}\right)\propto P\left(\mathcal{D}\mid G\right)P(G) \). One way to approximate this distribution is to draw a finite number of samples from it. Markov chain Monte Carlo (MCMC) methods generate such a sample by constructing a Markov chain that converges to the target distribution [27].

In structure MCMC, we start with a random DAG *G*^{(0)} and then iteratively generate a new DAG *G*^{(n)} from the previous one *G*^{(n−1)} by drawing it from a proposal distribution *Q*. The proposed DAG is accepted with probability

$$ \min \left\{1,\kern0.3em \frac{P\left(\mathcal{D}\mid {G}^{(n)}\right)P\left({G}^{(n)}\right)Q\left({G}^{\left(n-1\right)}\mid {G}^{(n)}\right)}{P\left(\mathcal{D}\mid {G}^{\left(n-1\right)}\right)P\left({G}^{\left(n-1\right)}\right)Q\left({G}^{(n)}\mid {G}^{\left(n-1\right)}\right)}\right\} $$

and otherwise rejected, in which case *G*^{(n)} = *G*^{(n−1)} [28]. After a burn-in phase, the chain converges to the posterior \( P\left(G\mid \mathcal{D}\right) \), and we retain the sample *G*^{(m)}, …, *G*^{(N)}. Any feature *f* of the network (e.g., the presence of an edge or a subgraph) can be estimated as the expected value:

$$ \mathrm{E}(f)\approx \frac{1}{N-m+1}\sum_{n=m}^Nf\left({G}^{(n)}\right). $$

The efficiency of the MCMC scheme depends on the proposal distribution *Q*, which encodes the way the network space is explored. Because not all graphs, but only DAGs, are allowed, computing the transition probabilities *Q*(*G*^{(n)} ∣ *G*^{(n−1)}) is usually the main computational bottleneck.

Parameter estimation, i.e., solving (Eq. 93), can be done along the lines described in Subheading 2 following either the ML or the Bayesian approach. If the model contains hidden random variables, then the EM algorithm (Subheading 3) can be used. However, this approach is feasible only if efficient inference algorithms are available. For hidden Markov models (Subheading 6), the forward and backward algorithms provided an efficient way to compute marginal probabilities and the expected hidden log-likelihood. These algorithms can be generalized to the sum–product algorithm for tree-like graphs and the junction tree algorithm for general DAGs. The computational complexity of the junction tree algorithm is exponential in the size of the largest clique of the so-called moralized graph, which is obtained by dropping edge directions and adding edges between any two vertices that have a common child in the original DAG [11].

Gibbs sampling [29] is a special case of MCMC for sampling from a joint distribution *P*(*X*_{1}, …, *X*_{L}). The idea is to iteratively sample from the conditional probabilities of *P*(*X*_{1}, …, *X*_{L}), starting with \( {X}_1^{\left(n+1\right)}\sim P\left({X}_1\mid {X}_2^{(n)},\dots, {X}_L^{(n)}\right) \) and cycling through all variables in turns:

$$ {X}_k^{\left(n+1\right)}\sim P\left({X}_k\mid {X}_1^{\left(n+1\right)},\dots, {X}_{k-1}^{\left(n+1\right)},{X}_{k+1}^{(n)},\dots, {X}_L^{(n)}\right),\kern1em k=2,\dots, L. $$

Gibbs sampling is useful if it is easier to sample from the conditional probabilities *P*(*X*_{k} ∣ *X*_{∖k}) than from the joint distribution *P*(*X*_{1}, …, *X*_{L}), where *X*_{∖k} denotes all variables *X*_{n} except *X*_{k}. For graphical models, the conditional probability of each vertex *X*_{k} depends only on its Markov blanket *X*_{MB(k)}, defined as the set of its parents, children, and co-parents (vertices with the same children), *P*(*X*_{k} ∣ *X*_{∖k}) = *P*(*X*_{k} ∣ *X*_{MB(k)}).
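A minimal Gibbs sampler for two binary variables illustrates the scheme; the joint probability table is made up, and we check that the sampled marginal of *X*_{1} approaches its exact value:

```python
import random

# Gibbs sampling for two binary variables with a made-up joint
# distribution P(X1, X2); the exact marginal is P(X1 = 1) = 0.1 + 0.4 = 0.5.
P = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}

def sample_x1(x2, rng):
    """Draw X1 from the conditional P(X1 | X2 = x2)."""
    p1 = P[(1, x2)] / (P[(0, x2)] + P[(1, x2)])
    return int(rng.random() < p1)

def sample_x2(x1, rng):
    """Draw X2 from the conditional P(X2 | X1 = x1)."""
    p1 = P[(x1, 1)] / (P[(x1, 0)] + P[(x1, 1)])
    return int(rng.random() < p1)

rng = random.Random(42)
x1, x2 = 0, 0
ones = 0
for _ in range(50000):
    x1 = sample_x1(x2, rng)   # cycle through the variables in turns
    x2 = sample_x2(x1, rng)
    ones += x1
print(ones / 50000)  # close to 0.5
```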

### Example 28 (Phylogenetic Tree Models)

A phylogenetic tree model is a Bayesian network whose DAG is a tree. The tree *S* defines the branching order, and the branch lengths correspond to (phylogenetic) time. The LPDs are defined by a nucleotide substitution model (Subheading 5). Let *X*^{(i)} ∈ {A, C, G, T}^{L} denote the *i*-th column of a multiple sequence alignment of *L* observed species. We regard the alignment columns as independent observations of the evolutionary process. The character states of the hidden (extinct) ancestors are denoted *Z*^{(i)}. The likelihood of the observed sequence data *X* = (*X*^{(1)}, …, *X*^{(N)}) given the tree topology *S* and the branch lengths *t* is:

$$ P\left(X\mid S,t\right)=\prod_{i=1}^N\sum_{Z^{(i)}}P\left({X}^{(i)},{Z}^{(i)}\mid S,t\right), $$

where *P*(*X*^{(i)}, *Z*^{(i)} ∣ *S*, *t*) factors into conditional probabilities according to the tree structure. This marginal probability can be computed efficiently with an instance of the sum–product algorithm known as the peeling algorithm (or Felsenstein algorithm) [31]. In the Bayesian approach, a tree (*S*, *t*) given the data *X* has probability \( P\left(S,t\mid X\right)\propto P\left(X\mid S,t\right)P\left(S,t\right) \), with a prior distribution over tree topologies *S* and branch lengths *t*. Several software packages implement ML or Bayesian learning of phylogenetic tree models. □

In the simplest case, we suppose that the observed alignment columns are independent. However, it is more realistic to assume that nucleotide substitution rates vary across sites because of varying selective pressures. For example, there could be differences between coding and noncoding regions, among different regions of a protein (loops and catalytic sites), or among the three bases of a triplet coding for an amino acid. More sophisticated models can account for this rate heterogeneity. Let us assume site-specific substitution rates *r*_{i} such that the local probabilities become *P*(*X*^{(i)} ∣ *r*_{i}, *t*, *S*). To model the distribution of the rates, often a gamma distribution is used.

### Example 29 (Gamma Distribution)

The gamma distribution \( \mathrm{Gamma}\left(\alpha, \beta \right) \) has a shape parameter *α* and a rate parameter *β*. It is defined by the density function:

$$ f(x)=\frac{\beta^{\alpha }}{\Gamma \left(\alpha \right)}{x}^{\alpha -1}{e}^{-\beta x},\kern1em x>0. $$

Its expectation is E(*X*) = *α*∕*β* and its variance \( \mathrm{Var}(X)=\alpha / {\beta}^2 \). The gamma distribution generalizes several other distributions, for example \( \mathrm{Gamma}\left(1,\lambda \right)=\mathrm{Exp}\left(\lambda \right) \) (Example 22). □
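The identity \( \mathrm{Gamma}\left(1,\lambda \right)=\mathrm{Exp}\left(\lambda \right) \) can be verified directly from the densities (the evaluation points and rate below are arbitrary):

```python
import math

def gamma_pdf(x, a, b):
    """Density of Gamma(a, b) with shape a and rate b."""
    return b ** a * x ** (a - 1) * math.exp(-b * x) / math.gamma(a)

def exp_pdf(x, lam):
    """Density of Exp(lam)."""
    return lam * math.exp(-lam * x)

# Gamma(1, lambda) coincides with Exp(lambda)
for x in (0.1, 0.5, 2.0):
    assert abs(gamma_pdf(x, 1.0, 1.5) - exp_pdf(x, 1.5)) < 1e-12
print("Gamma(1, 1.5) matches Exp(1.5)")
```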

Another approach to accounting for varying mutation rates is the phylogenetic hidden Markov model (phylo-HMM) [32].

### Example 30 (Phylo-HMM)

### Exercise 31 (Inference in Bayesian Networks)

Consider the gene network on five genes denoted A, B, C, D, E, with the graph structure displayed below. Gene expression profiles under different conditions C1–C9 have been observed and are summarized in the table below, where a zero indicates that the gene is not expressed and a one that it is expressed.

- (a)
Specify the adjacency matrix of the directed graph.

- (b)
Determine the local probability distributions for each vertex of the graph. Use conditional counting to determine the conditional probabilities as:$$ P\left({X}_i\mid {X}_{\mathrm{pa}(i)}\right)\approx \frac{N\left({X}_i,{X}_{\mathrm{pa}(i)}\right)}{\sum_kN\left({X}_i=k,\kern0.3em {X}_{\mathrm{pa}(i)}\right)},\kern1.00em $$where \( N\left({X}_i,\kern0.3em {X}_{\mathrm{pa}(i)}\right) \) is the number of joint observations of *X*_{i} and its parents.

- (c)
What is the joint probability of (*X*_{A}, *X*_{B}, *X*_{C}, *X*_{D}, *X*_{E}) for this network?

- (d)
We now want to determine the most probable explanation for observing gene C to be active as a result of the influences of its upstream genes A and E. For this, one has to infer the posterior probabilities *P*(A ∣ C = 1) and *P*(E ∣ C = 1) using Bayes' theorem. Here, assume that the probabilities *P*(A) and *P*(E) derived from the expression data are suitable prior probabilities. Which constellation is most likely to trigger the expression of C?

## References

- 1.Ewens WJ, Grant GR (2005) Statistical methods in bioinformatics: an introduction, 2nd edn. Springer, Berlin
- 2.Deonier RC, Tavaré S, Waterman MS (2005) Computational genome analysis: an introduction. Springer, Berlin
- 3.Davison AC (2009) Statistical models. Cambridge series in statistical and probabilistic mathematics. Cambridge University Press, Cambridge
- 4.Ross SM (2007) Introduction to probability models. Academic, London
- 5.Hardy GH (1908) Mendelian proportions in a mixed population. Science 28:49–50
- 6.Weinberg W (1908) Über den Nachweis der Vererbung beim Menschen. Jahresh Ver Vaterl Naturkd Württ 64:369–382
- 7.Pachter L, Sturmfels B (2005) Algebraic statistics for computational biology. Cambridge University Press, Cambridge
- 8.Casella G, Berger RL (2002) Statistical inference. Thomson Learning, Pacific Grove
- 9.Efron B, Tibshirani R (1993) An introduction to the bootstrap. Chapman and Hall/CRC, Boca Raton
- 10.Gelman A, Carlin JB, Stern HS, Rubin DB (2004) Bayesian data analysis, 2nd edn. Chapman and Hall/CRC, Boca Raton
- 11.Bishop CM (2006) Pattern recognition and machine learning. Springer, Berlin
- 12.Kass R, Raftery A (1995) Bayes factors. J Am Stat Assoc 90:773–795
- 13.Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc Ser B 39:1–38
- 14.Norris JR (1998) Markov chains. Cambridge University Press, Cambridge
- 15.Wright S (1990) Evolution in Mendelian populations. Bull Math Biol 52:241–295
- 16.Fisher RA (1930) The genetical theory of natural selection. Clarendon Press, Oxford
- 17.Jukes TH, Cantor CR (1969) Evolution of protein molecules. Mamm Protein Metab 3:21–132
- 18.Kimura M (1980) A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J Mol Evol 16:111–120
- 19.Rabiner LR (1989) A tutorial on hidden Markov models and selected applications in speech recognition. Proc IEEE 77:257–286
- 20.Durbin R (1998) Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge University Press, Cambridge
- 21.Viterbi A (1967) Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Trans Inf Theory 13:260–269
- 22.Baum LE (1972) An inequality and associated maximization technique in statistical estimation for probabilistic functions of Markov processes. Inequalities 3:1–8
- 23.Husmeier D, Dybowski R, Roberts S (2005) Probabilistic modeling in bioinformatics and medical informatics. Springer, New York
- 24.Koller D, Friedman N (2009) Probabilistic graphical models: principles and techniques. The MIT Press, Cambridge
- 25.Jordan MI (1998) Learning in graphical models. Kluwer Academic Publishers, Dordrecht
- 26.Schwarz G (1978) Estimating the dimension of a model. Ann Stat 6:461–464
- 27.Neal RM (1993) Probabilistic inference using Markov chain Monte Carlo methods. Technical Report CRG-TR-93-1, Department of Computer Science, University of Toronto
- 28.Hastings WK (1970) Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57:97–109
- 29.Geman S, Geman D (1984) Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans Pattern Anal Mach Intell 6:721–741
- 30.Felsenstein J (2004) Inferring phylogenies. Sinauer Associates, Sunderland
- 31.Felsenstein J (1981) Evolutionary trees from DNA sequences: a maximum likelihood approach. J Mol Evol 17:368–376
- 32.Siepel A, Haussler D (2005) Phylogenetic hidden Markov models. In: Statistical methods in molecular evolution. Springer, New York, pp 325–351

## Copyright information

**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.