This section introduces TI from a statistical physics perspective. Statistical physics is a branch of physics that uses methods from probability theory and statistics to characterize the behavior of physical systems. One of its key concepts is that the state of a particle is described by a probability density, and that all physically relevant quantities can be derived once this distribution is known. Starting from the free energy, we show how key concepts from information theory developed from their counterparts in statistical physics, motivating the use of TI and providing a link to the variational Bayes approach conventionally used in DCM to approximate the log model evidence (LME).
Free energy: a perspective from statistical physics
In thermodynamics, the analogue of the model evidence is the so-called partition function \(Z\) of a system that consists of an ensemble of particles in thermal equilibrium. A classical discussion of the relationships presented here can be found in Jaynes (1957) and a more modern perspective in Ortega and Braun (2013). For example, let us consider an ideal monoatomic gas, in which the kinetic energy
$$\begin{array}{*{20}c} {\phi \left( \theta \right) = \frac{{m\theta^{2} }}{2} } \\ \end{array}$$
(1)
of individual particles is a function of their velocity \(\theta\) and mass \(m\). If the system is large enough, the velocity of a single particle can be treated as a continuous random variable. The internal energy \(U\) of this ideal gas is proportional to the expected energy per particle. It is computed as the weighted integral of the energies \(\phi \left( \theta \right)\) associated with all possible velocities, where the weights are given by the probability density \(q\left( \theta \right)\) of the particle having a certain velocity:
$$\begin{array}{*{20}c} {U \overset{\text{def}}{=} \int q\left( \theta \right)\phi \left( \theta \right)d\theta .} \\ \end{array}$$
(2)
A second important quantity in statistical physics is the differential entropy \(H\) of \(q\):
$$\begin{array}{*{20}c} {H\left[ q \right] \overset{\text{def}}{=} - k_{B} \int q\left( \theta \right)\ln q\left( \theta \right)d\theta .} \\ \end{array}$$
(3)
Here, \(k_{B}\) is the Boltzmann constant with units of energy per degree temperature. For an isolated system (i.e., no exchange of matter or energy with the environment), the second law of thermodynamics states that its entropy can only increase or stay constant. Thus, the system is at equilibrium when the associated entropy is maximized, subject to the constraints that the system’s internal energy is constant and equal to \(U\), and that \(q\) is a proper density, i.e., \(q\left( \theta \right) \ge 0\) and \(\int q\left( \theta \right)d\theta = 1.\)
This constrained maximization problem can be solved using Lagrange multipliers (for the derivation see the supplementary material S4), yielding the following distribution:
$$\begin{array}{*{20}c} {q\left( \theta \right) = \frac{1}{Z}\exp \left( { - \frac{\phi \left( \theta \right)}{{k_{B} T}}} \right),} \\ \end{array}$$
(4)
where \(Z\) is referred to as the partition function of the system:
$$\begin{array}{*{20}c} {Z \overset{\text{def}}{=} \int \exp \left( { - \frac{\phi \left( \theta \right)}{{k_{B} T}}} \right)d\theta .} \\ \end{array}$$
(5)
In a closed system, the Helmholtz free energy \(F_{H}\) is defined as the difference between the internal energy \(U\) of the system and its entropy \(H\) times the temperature \(T\):
$$\begin{array}{*{20}c} {F_{H} \overset{\text{def}}{=} U - TH.} \\ \end{array}$$
(6)
The Helmholtz free energy corresponds to the work (i.e., non-thermal energy in joules that is passed from the system to its environment) that can be extracted from a closed system. From Eq. 6, we see that a system with constant internal energy \(U\) is at equilibrium (i.e., maximum entropy) when the Helmholtz free energy is minimal. Substituting the internal energy (Eq. 2), the entropy (Eq. 3) and the expression for \(q\) (Eq. 4) into Eq. 6, it follows that the log of the partition function corresponds to the negative Helmholtz free energy divided by \(k_{B} T\):
$$\begin{array}{*{20}c} { - \frac{{F_{H} }}{{k_{B} T}} = \ln Z.} \\ \end{array}$$
(7)
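To make Eqs. 1–7 concrete, the following minimal numerical sketch (Python with NumPy; the particle mass, temperature, and velocity grid are arbitrary illustrative choices, not values used elsewhere in this paper) discretizes the velocity of a single particle, constructs the equilibrium distribution of Eq. 4, and checks the identity of Eq. 7.

```python
import numpy as np

# Physical constants and (illustrative) system parameters
k_B = 1.380649e-23          # Boltzmann constant [J/K]
m = 6.64e-27                # particle mass [kg], roughly a helium atom
T = 300.0                   # temperature [K]

# Discretize the velocity theta on a grid (1D for simplicity)
theta = np.linspace(-5e3, 5e3, 20001)   # [m/s]
dtheta = theta[1] - theta[0]

# Kinetic energy phi(theta) = m * theta^2 / 2   (Eq. 1)
phi = 0.5 * m * theta**2

# Partition function Z and equilibrium (Boltzmann) distribution q   (Eqs. 4-5)
Z = np.sum(np.exp(-phi / (k_B * T))) * dtheta
q = np.exp(-phi / (k_B * T)) / Z

# Internal energy U (Eq. 2) and entropy H (Eq. 3)
U = np.sum(q * phi) * dtheta
H = -k_B * np.sum(q * np.log(q)) * dtheta

# Helmholtz free energy (Eq. 6) and the identity of Eq. 7
F_H = U - T * H
print(-F_H / (k_B * T), np.log(Z))   # the two values should agree closely
```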
Free energy: a perspective from statistics
In order to link the perspectives on free energy from statistical physics and (Bayesian) statistics, we assume that the system is examined at a constant temperature \(T\) such that the term \(k_{B} T\) equals unity (normalization of temperature). This allows us to move from a physical perspective on free energy (expressed in joules) to a statistical formulation (expressed in information units proportional to bits). This is the common convention in the statistical literature; under it, all quantities become unitless information-theoretic terms. The physical concept of free energy described above then gives rise to an analogous concept of free energy in statistics when the energy function is given by the negative log joint probability \({-}\ln p(y,\theta |m)\) (Neal and Hinton 1998):
$$\begin{array}{*{20}c} {\phi \left( \theta \right) = - \ln p\left( {y,\theta |m} \right) = - \ln p\left( {y|\theta ,{ }m} \right)p\left( {\theta |m} \right).} \\ \end{array}$$
(8)
Hence, the negative log joint (which fully characterizes the system) takes the role of the kinetic energy in the ideal gas example above.
Inserting the expression for \(\phi\) (Eq. 8) into Eq. 4, we obtain the following expression:
$$\begin{array}{*{20}c} {q\left( \theta \right) = \frac{1}{Z}\exp \left( { - \phi \left( \theta \right)} \right) = \frac{1}{Z}\exp \left( {\ln p\left( {y,\theta {|}m} \right)} \right),} \\ \end{array}$$
(9)
which together with the definition of the partition function Z (Eq. 5), reveals that the equilibrium distribution of the system is the posterior distribution (i.e., the joint probability divided by the model evidence):
$$\begin{array}{*{20}c} { q\left( \theta \right) = \frac{{\exp \left( {\ln p\left( {y,\theta {|}m} \right)} \right)}}{{\int \exp \left( {\ln p\left( {y,\theta {|}m} \right)} \right)d\theta }} = \frac{{p\left( {y,\theta {|}m} \right)}}{{p\left( {y{|}m} \right)}} = p\left( {\theta {|}y,m} \right)} \\ \end{array}$$
(10)
Based on this result, we can derive the information theoretic version of the Helmholtz free energy, by inserting the expressions for the internal energy (Eq. 2) and the entropy (Eq. 3) into Eq. 6 and making use of the expression for \(\phi\) from Eq. 8:
$$\begin{array}{*{20}c} { - F_{H} = \int q\left( \theta \right)\ln p\left( {y,\theta |m} \right)d\theta - \int q\left( \theta \right)\ln q\left( \theta \right)d\theta .} \\ \end{array}$$
(11)
In analogy to Eq. 6, the first term on the right hand side is an expectation over an energy function (cf. Equation 8); while the second term represents the differential entropy \(H\left[ q \right] = - \int q\left( \theta \right)\ln q\left( \theta \right)d\theta\). Notably, under the choice of the energy function in Eq. 8, the partition function (Eq. 5) corresponds to the marginal of the joint probability \(p\left( {y,\theta |m} \right)\) with respect to \(\theta\). Comparing with Eq. 7, we see that the negative free energy is equal to the log model evidence (LME):
$$\begin{array}{*{20}c} { - F_{H} = \ln p\left( {y|m} \right).} \\ \end{array}$$
(12)
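As a concrete illustration of Eqs. 8–12, consider a simple conjugate Gaussian model, for which the LME is known in closed form. The following sketch (assuming NumPy and SciPy; the data value and variances are arbitrary illustrative choices) evaluates \(-F_H = \ln \int \exp(-\phi(\theta))d\theta\) on a grid and compares it to the analytic LME.

```python
import numpy as np
from scipy.stats import norm

# Toy conjugate Gaussian model (all values are illustrative choices):
# y ~ N(theta, s2_lik),  theta ~ N(mu0, s2_prior)
y, s2_lik = 1.5, 1.0
mu0, s2_prior = 0.0, 4.0

# Energy phi(theta) = -ln p(y, theta | m)   (Eq. 8)
theta = np.linspace(-20, 20, 40001)
dtheta = theta[1] - theta[0]
log_joint = (norm.logpdf(y, loc=theta, scale=np.sqrt(s2_lik))
             + norm.logpdf(theta, loc=mu0, scale=np.sqrt(s2_prior)))

# Negative free energy -F_H = ln Z = ln \int exp(-phi) dtheta   (Eqs. 12, 16)
neg_F_H = np.log(np.sum(np.exp(log_joint)) * dtheta)

# Closed-form log model evidence: y ~ N(mu0, s2_lik + s2_prior)
lme = norm.logpdf(y, loc=mu0, scale=np.sqrt(s2_lik + s2_prior))
print(neg_F_H, lme)   # should agree to within discretization error
```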
Replacing the joint in Eq. 11 by the product of likelihood and prior, the negative free energy can be decomposed into two terms that have important implications for evaluating the goodness of a model:
$$\begin{array}{*{20}c} { - F_{H} = \int q\left( \theta \right)\ln p\left( {y|\theta ,m} \right)p\left( {\theta |m} \right)d\theta - \int q\left( \theta \right)\ln q\left( \theta \right)d\theta ,} \\ \end{array}$$
(13)
$$\begin{array}{*{20}c} { - F_{H} = \int q\left( \theta \right)\ln p\left( {y|\theta ,m} \right)d\theta - \int q\left( \theta \right)\ln \frac{q\left( \theta \right)}{{p\left( {\theta |m} \right)}}d\theta ,} \\ \end{array}$$
(14)
$$\begin{array}{*{20}c} { - F_{H} = {\text{accuracy}} - {\text{complexity}}} \\ \end{array}$$
(15)
The first term (the expected log likelihood under the posterior) represents a measure of model fit or accuracy. The second term corresponds to the Kullback–Leibler (KL) divergence of the posterior from the prior, and can be viewed as an index of model complexity. Hence, maximizing the negative free energy (log evidence) of a model corresponds to finding a balance between accuracy and complexity. We will turn to this issue in more detail below and examine variations of this perspective under TI and VB, respectively.
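For the same conjugate Gaussian toy model, both terms of Eq. 14 are available in closed form, so the accuracy/complexity decomposition can be verified directly (a sketch under the same illustrative assumptions as above):

```python
import numpy as np
from scipy.stats import norm

# Same illustrative conjugate Gaussian model as before
y, s2_lik = 1.5, 1.0
mu0, s2_prior = 0.0, 4.0

# Analytic posterior N(mu_n, s2_n)
s2_n = 1.0 / (1.0 / s2_prior + 1.0 / s2_lik)
mu_n = s2_n * (mu0 / s2_prior + y / s2_lik)

# Accuracy: E[ln p(y|theta,m)] under the posterior (first term of Eq. 14),
# using E[(y - theta)^2] = (y - mu_n)^2 + s2_n
accuracy = -0.5 * np.log(2 * np.pi * s2_lik) - ((y - mu_n)**2 + s2_n) / (2 * s2_lik)

# Complexity: KL divergence of the posterior from the prior (second term of Eq. 14)
complexity = 0.5 * ((s2_n + (mu_n - mu0)**2) / s2_prior - 1.0 + np.log(s2_prior / s2_n))

lme = norm.logpdf(y, loc=mu0, scale=np.sqrt(s2_lik + s2_prior))
print(accuracy - complexity, lme)   # should match up to floating-point precision
```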
In the following, we will explicitly display the sign of the negative free energy for notational consistency. In order to highlight similarities with statistical physics and the concepts of energy and potential, we will continue to express the free energy as a functional of a (possibly non-normalized) log density, such that
$$\begin{array}{*{20}c} { - F_{H} \left[ \phi \right] = \ln \int \exp \left( { - \phi \left( \theta \right)} \right)d\theta ,} \\ \end{array}$$
(16)
where \(\phi \left( \theta \right)\) is equivalent to an energy or potential depending on \(\theta\). Figure 1 summarizes the conceptual analogies of free energy between statistical physics and Bayesian statistics.
Thermodynamic integration (TI)
We now turn to the problem of computing the negative free energy. As is apparent from Eq. 16, the free energy contains an integral over all possible \(\theta\), which is usually prohibitively expensive to compute and thus precludes direct evaluation. The basic idea behind TI is to move in small steps along a path from an initial state with known \({F}_{H}\) to the equilibrium state and add up changes in free energy for all steps (Gelman and Meng 1998). This idea was initially introduced in statistical physics to compute the difference in Helmholtz free energy between two states of a physical system (Kirkwood 1935). Other examples for the application of TI in statistical physics are presented in Landau (2015).
In Bayesian statistics, the same idea can be used to compute the LME of a model \(m\). This is because the difference in free energy associated with the two potentials corresponding to the negative log prior \(\phi_{0} \left( \theta \right) = - {\text{ln}}\,p(\theta |m)\) and the negative log joint \(\phi \left( \theta \right) = - \ln p\left( {y{|}\theta ,m} \right) - \ln p(\theta |m)\) (cf. Eq. 8) equals the negative LME. More precisely, provided the prior is properly normalized, i.e., \(\int p\left( {\theta {|}m} \right)d\theta =\) 1, substituting \(\phi_{0}\) and \(\phi\) into Eq. 16 yields
$$\begin{aligned} F_{H} \left[ \phi \right] - F_{H} \left[ {\phi_{0} } \right] = & - \ln \int p\left( {y{|}\theta ,m} \right)p(\theta {|}m{)}d\theta + \ln \int p\left( {\theta {|}m} \right)d\theta \\ = & - \ln p\left( {y{|}m} \right).\\ \end{aligned}$$
(17)
The goal is now to construct a piecewise differentiable path connecting prior and posterior, and then to compute the LME by integrating infinitesimal changes in the free energy along this path. A smooth transition between \(F_{H} \left[ \phi \right]\) and \(F_{H} \left[ {\phi_{0} } \right]\) can be constructed via the power posteriors \(p_{\beta } \left( {\theta {|}y,m} \right)\) (see Eq. 19 below), which are defined by the path \(\phi_{\beta } :\)
$$\begin{array}{*{20}c} {\phi_{\beta } \left( \theta \right) = - \beta \ln p\left( {y{|}\theta ,m} \right) - \ln p\left( {\theta {|}m} \right)} \\ \end{array}$$
(18)
with \(\beta \in \left[ {0,1} \right]\), such that \(\phi_{1} = \phi\). In the statistics literature, \(\beta\) is usually referred to as an inverse temperature because it has analogous properties to physical temperature in many respects. We will use this terminology and comment on the analogy in more detail below.
The power posterior is obtained by normalizing the exponential of \(- \phi_{\beta } \left( \theta \right)\):
$$\begin{array}{*{20}c} {p_{\beta } \left( {\theta {|}y,m} \right) = \frac{{p\left( {y{|}\theta ,m} \right)^{\beta } p\left( {\theta {|}m} \right)}}{{Z_{\beta } }},} \\ \end{array}$$
(19)
$$Z_{\beta } = \int p\left( {y{|}\theta ,m} \right)^{\beta } p\left( {\theta {|}m} \right)d\theta .$$
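In the conjugate Gaussian toy model used in the sketches above, the power posterior remains Gaussian for every \(\beta\): raising the likelihood to the power \(\beta\) simply scales its precision. A minimal check (same illustrative assumptions) that compares this closed form against direct numerical normalization of \(\exp(-\phi_\beta(\theta))\):

```python
import numpy as np
from scipy.stats import norm

# Illustrative conjugate Gaussian model: y ~ N(theta, s2_lik), theta ~ N(mu0, s2_prior)
y, s2_lik = 1.5, 1.0
mu0, s2_prior = 0.0, 4.0
beta = 0.3   # an arbitrary inverse temperature in (0, 1)

# Closed form: the power posterior is N(mu_b, s2_b), with the likelihood
# precision scaled by beta
lam_b = 1.0 / s2_prior + beta / s2_lik
s2_b = 1.0 / lam_b
mu_b = s2_b * (mu0 / s2_prior + beta * y / s2_lik)

# Numerical check: normalize exp(-phi_beta) on a grid (Eqs. 18-19)
theta = np.linspace(-20, 20, 40001)
dtheta = theta[1] - theta[0]
log_unnorm = (beta * norm.logpdf(y, loc=theta, scale=np.sqrt(s2_lik))
              + norm.logpdf(theta, loc=mu0, scale=np.sqrt(s2_prior)))
w = np.exp(log_unnorm)
Z_beta = np.sum(w) * dtheta            # partition function Z_beta (Eq. 19)
p_beta = w / Z_beta
mean_num = np.sum(p_beta * theta) * dtheta
var_num = np.sum(p_beta * (theta - mean_num)**2) * dtheta
print(mu_b, mean_num)   # means should agree
print(s2_b, var_num)    # variances should agree
```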
Combining this definition with Eq. 17, the LME can be expressed as:
$$\begin{array}{*{20}c} { - \ln p\left( {y{|}m} \right) = F_{H} \left[ \phi \right] - F_{H} \left[ {\phi_{0} } \right],} \\ \end{array}$$
(20)
$$\begin{array}{*{20}c} { = \mathop \int \limits_{\beta = 0}^{\beta = 1} \frac{d}{d\beta }F_{H} \left[ {\phi_{\beta } } \right]d\beta ,} \\ \end{array}$$
(21)
$$\begin{array}{*{20}c} { = - \mathop \int \limits_{\beta = 0}^{\beta = 1} \frac{d}{d\beta }\ln \int p\left( {y{|}\theta ,m} \right)^{\beta } p\left( {\theta {|}m} \right)d\theta d\beta .} \\ \end{array}$$
(22)
Applying the chain rule of differentiation (see supplementary material section S11 for a detailed derivation), the LME can be written in terms of an integral over an expectation with respect to the power posterior:
$$\begin{array}{*{20}c} {\ln p\left( {y{|}m} \right) = \mathop \int \limits_{\beta = 0}^{\beta = 1} \int \frac{{p\left( {y{|}\theta ,m} \right)^{\beta } p\left( {\theta {|}m} \right)}}{{Z_{\beta } }}\ln p\left( {y{|}\theta ,m} \right)d\theta d\beta ,} \\ \end{array}$$
(23)
$$\begin{array}{*{20}c} { = \mathop \int \limits_{\beta = 0}^{\beta = 1} {\text{E}}\left[ {\ln p\left( {y{|}\theta ,m} \right)} \right]_{{p_{\beta } \left( {\theta {|}y,m} \right)}} d\beta .} \\ \end{array}$$
(24)
We refer to Eq. 24 as the basic or fundamental TI equation (Gelman and Meng 1998).
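Because the power posterior stays Gaussian in the conjugate toy model, the inner expectation of Eq. 24 is available in closed form, and the fundamental TI equation can be verified by numerical integration over \(\beta\) (a sketch under the same illustrative assumptions):

```python
import numpy as np
from scipy.stats import norm

y, s2_lik = 1.5, 1.0
mu0, s2_prior = 0.0, 4.0

def expected_loglik(beta):
    # Moments of the Gaussian power posterior p_beta(theta|y,m)   (Eq. 19)
    s2_b = 1.0 / (1.0 / s2_prior + beta / s2_lik)
    mu_b = s2_b * (mu0 / s2_prior + beta * y / s2_lik)
    # E[ln p(y|theta,m)] under p_beta, using E[(y-theta)^2] = (y-mu_b)^2 + s2_b
    return -0.5 * np.log(2 * np.pi * s2_lik) - ((y - mu_b)**2 + s2_b) / (2 * s2_lik)

# Fundamental TI equation (Eq. 24): integrate over beta on a dense grid
betas = np.linspace(0.0, 1.0, 10001)
A = np.array([expected_loglik(b) for b in betas])
lme_ti = np.sum(0.5 * (A[1:] + A[:-1]) * np.diff(betas))   # trapezoidal rule

lme = norm.logpdf(y, loc=mu0, scale=np.sqrt(s2_lik + s2_prior))
print(lme_ti, lme)   # should agree to within quadrature error
```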
Notably, the TI equation can also be understood in terms of the definition of the free energy (Eq. 14), by noting that the latter can be written as the difference between a weighted expected log likelihood and a relative-entropy term (the KL divergence of the power posterior from the prior):
$$\begin{array}{*{20}c} { - F_{H} \left( \beta \right) = \beta \int p_{\beta } \left( {\theta {|}y,m} \right)\ln p\left( {y{|}\theta ,m} \right)d\theta - \int p_{\beta } \left( {\theta {|}y,m} \right)\ln \frac{{p_{\beta } \left( {\theta {|}y,m} \right)}}{{p\left( {\theta |m} \right)}}d\theta ,} \\ \end{array}$$
(25)
$$\begin{array}{*{20}c} { - F_{H} \left( \beta \right) = \beta A\left( \beta \right) - KL\left[ {p_{\beta } \left( {\theta {|}y,m} \right)||p\left( {\theta |m} \right)} \right].} \\ \end{array}$$
(26)
The first term, \(A\left( \beta \right) = - \partial F_{H} /\partial \beta\), is referred to as the accuracy of the model (see, for example, Penny et al. 2004a; Stephan et al. 2009), while the second term constitutes a complexity term. Note that Eq. 26 is typically presented in the statistical literature for the case of \(\beta = 1\) and describes the same accuracy vs. complexity trade-off previously expressed by Eq. 14, but now from the specific perspective of TI. Also note that \(A\left( \beta \right)\) is defined as the negative partial derivative of the free energy. In contrast to the full derivative, the partial derivative only considers the direct dependence of \(F_{H}\) on \(\beta\), and ignores the indirect dependence via the KL divergence term.
Figure 2 shows a graphical representation of the relation conveyed by the fundamental TI equation (Eqs. 24 and 26). For any given \(\beta\), the negative free energy at this position of the path, \(- F_{H} \left( \beta \right)\), can be interpreted as the signed area below the curve \(A\left( \beta \right) = - \partial F_{H} /\partial \beta\) (i.e., the integral over \(A\left( \cdot \right)\) from 0 to \(\beta\)), whereas the term \(\beta \times A\left( \beta \right)\) is the rectangular area below \(A\left( \beta \right)\). Equation 26 shows that the difference between these two areas, \(\beta A\left( \beta \right) + F_{H} \left( \beta \right)\), is the KL divergence of the corresponding power posterior from the prior.
This relationship holds because, for the power posteriors (Eq. 19), A(\(\beta\)) is a monotonically increasing function of \(\beta\). This is due to the fact that
$$\begin{array}{*{20}c} {\frac{\partial A\left( \beta \right)}{{\partial \beta }} = Var\left[ {\ln p\left( {y{|}\theta ,m} \right)} \right]_{{p_{\beta } \left( {\theta {|}y,m} \right)}} > 0.} \\ \end{array}$$
(27)
See Lartillot and Philippe (2006) for a derivation of this property. From this it follows that the negative free energy \(-F_{H}\) is a convex function of \(\beta\) (equivalently, \(F_{H}\) is concave).
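The identities in Eqs. 26 and 27 can likewise be checked numerically for the conjugate Gaussian toy model; the following sketch computes \(-F_H(\beta)\), \(A(\beta)\), and the KL term for several values of \(\beta\):

```python
import numpy as np
from scipy.stats import norm

y, s2_lik = 1.5, 1.0
mu0, s2_prior = 0.0, 4.0
theta = np.linspace(-20, 20, 40001)
dtheta = theta[1] - theta[0]

def terms(beta):
    # Gaussian power posterior moments (cf. Eq. 19)
    s2_b = 1.0 / (1.0 / s2_prior + beta / s2_lik)
    mu_b = s2_b * (mu0 / s2_prior + beta * y / s2_lik)
    # A(beta): expected log likelihood under the power posterior
    A = -0.5 * np.log(2 * np.pi * s2_lik) - ((y - mu_b)**2 + s2_b) / (2 * s2_lik)
    # KL divergence of the power posterior from the prior (two Gaussians)
    kl = 0.5 * ((s2_b + (mu_b - mu0)**2) / s2_prior - 1.0 + np.log(s2_prior / s2_b))
    # -F_H(beta) = ln Z_beta, computed by numerical integration
    log_unnorm = (beta * norm.logpdf(y, loc=theta, scale=np.sqrt(s2_lik))
                  + norm.logpdf(theta, loc=mu0, scale=np.sqrt(s2_prior)))
    neg_F = np.log(np.sum(np.exp(log_unnorm)) * dtheta)
    return A, kl, neg_F

for beta in (0.1, 0.5, 1.0):
    A, kl, neg_F = terms(beta)
    print(beta, neg_F, beta * A - kl)   # Eq. 26: the two should agree
```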
The theoretical considerations highlighted above and the relation to principles of statistical physics render TI an appealing choice for estimating the LME. However, the question remains how the LME estimator in Eq. 24 can be evaluated in practice. To solve this problem, TI relies on Monte Carlo estimates of the expected value \({\text{E}}\left[ {\ln p\left( {y{|}\theta ,m} \right)} \right]_{{p_{\beta } (\theta |y,m)}}\) in Eq. 24:
$$\begin{array}{*{20}c} {E_{MC} \left( \beta \right) \overset{\text{def}}{=} \frac{1}{K}\mathop \sum \limits_{k = 1}^{K} \ln p\left( {y|\theta_{k} ,m} \right) \approx E\left[ {\ln p\left( {y{|}\theta ,m} \right)} \right]_{{p_{\beta } \left( {\theta {|}y,m} \right)}} ,} \\ \end{array}$$
(28)
where samples \(\theta_{k}\) are drawn from the power posterior \(p_{\beta } (\theta |y,m)\). The remaining integral over \(\beta\) in Eq. 24 is a one dimensional integral, which can be computed through a quadrature rule using a predefined temperature schedule for \(\beta\) (\(0 = \beta_{0} < \beta_{1} < \cdots < \beta_{N - 1} < \beta_{N} = 1\)):
$$\begin{array}{*{20}c} {\ln p\left( {y{|}m} \right) \approx \frac{1}{2}\mathop \sum \limits_{j = 0}^{N - 1} \left( {\beta_{j + 1} - \beta_{j} } \right)\left( {E_{MC} \left( {\beta_{j + 1} } \right) + E_{MC} \left( {\beta_{j} } \right)} \right).} \\ \end{array}$$
(29)
The optimal temperature schedule in terms of minimal variance of the estimator and minimal error introduced by this discretization was outlined previously in the context of linear models by Gelman and Meng (1998) and Calderhead and Girolami (2009).
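In code, Eq. 29 is a one-line trapezoidal rule over the temperature schedule. A minimal helper is sketched below; the fifth-power schedule shown is one common choice motivated by Calderhead and Girolami (2009), not the only option.

```python
import numpy as np

def ti_trapezoid(betas, e_mc):
    """Trapezoidal TI estimate of the LME (Eq. 29).

    betas : increasing schedule with betas[0] == 0 and betas[-1] == 1
    e_mc  : Monte Carlo estimates E_MC(beta_j) (Eq. 28), one per beta_j
    """
    return np.sum(0.5 * np.diff(betas) * (e_mc[1:] + e_mc[:-1]))

# A commonly used power-law schedule, concentrating points near beta = 0
N = 64
betas = (np.arange(N + 1) / N) ** 5.0
```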
Note that each step \(\beta_{j}\) in the temperature schedule requires a new set of samples \(\theta_{k}\) to be drawn from the respective power posterior \(p_{{\beta_{j} }} (\theta |y,m)\), contributing to the high computational complexity of TI. However, since these sets of samples are independent from each other, this can in principle be done in parallel, provided suitable soft- and hardware capabilities are available. An efficient way to realize such a parallel sampling procedure is to adopt a population MCMC approach in which MCMC sampling is used to generate, for each \(\beta_{j}\), a chain of samples from the respective power posterior \(p_{{\beta_{j} }} (\theta |y,m)\). In addition, chains from neighboring \(\beta_{j}\) in the temperature schedule are allowed to interact by means of a “swap” accept-reject (AR) step (Swendsen and Wang 1986). This increases the sampling efficiency and speeds up convergence of the individual MCMC samplers. For readers unfamiliar with Monte Carlo methods, a primer on (population) MCMC is provided in the supplementary material S3. For a detailed treatment, we refer to McDowell et al. (2008) and Calderhead and Girolami (2009).
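The following deliberately simplified sketch illustrates the full pipeline for the conjugate Gaussian toy model: one Metropolis chain per \(\beta_j\), a swap step between neighboring temperatures, and the trapezoidal quadrature of Eq. 29. Gaussian random-walk proposals with a fixed step size and sequential (rather than parallel) execution are simplifying assumptions; an efficient implementation would tune the proposals and parallelize across chains.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Illustrative conjugate Gaussian model (as in the previous sketches)
y, s2_lik = 1.5, 1.0
mu0, s2_prior = 0.0, 4.0
loglik = lambda th: norm.logpdf(y, loc=th, scale=np.sqrt(s2_lik))
logprior = lambda th: norm.logpdf(th, loc=mu0, scale=np.sqrt(s2_prior))

# Temperature schedule and one chain per beta_j
N = 32
betas = (np.arange(N + 1) / N) ** 5.0
theta = rng.normal(mu0, np.sqrt(s2_prior), size=N + 1)  # current state per chain
ll = loglik(theta)                                      # cached log likelihoods
n_iter, burn_in, step = 20000, 5000, 1.0
ll_sums = np.zeros(N + 1)                               # accumulators for E_MC

for it in range(n_iter):
    # Within-chain Metropolis step targeting each power posterior (Eq. 19)
    prop = theta + step * rng.normal(size=N + 1)
    ll_prop = loglik(prop)
    log_alpha = betas * (ll_prop - ll) + logprior(prop) - logprior(theta)
    accept = np.log(rng.uniform(size=N + 1)) < log_alpha
    theta[accept], ll[accept] = prop[accept], ll_prop[accept]

    # Swap step between a random pair of neighboring temperatures;
    # the priors cancel, leaving only the tempered likelihood ratio
    i = rng.integers(N)
    log_swap = (betas[i] - betas[i + 1]) * (ll[i + 1] - ll[i])
    if np.log(rng.uniform()) < log_swap:
        theta[[i, i + 1]] = theta[[i + 1, i]]
        ll[[i, i + 1]] = ll[[i + 1, i]]

    if it >= burn_in:
        ll_sums += ll

e_mc = ll_sums / (n_iter - burn_in)                             # Eq. 28, per beta_j
lme_ti = np.sum(0.5 * np.diff(betas) * (e_mc[1:] + e_mc[:-1]))  # Eq. 29
print(lme_ti, norm.logpdf(y, loc=mu0, scale=np.sqrt(s2_lik + s2_prior)))
```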
So far, the computational requirement of sampling from an ensemble of distributions (one for each value of \(\beta\)) has limited the application of TI to high-performance computing environments and prevented its widespread use in neuroimaging. Fortunately, the increase in computing power of stand-alone workstations and the proliferation of graphics processing units (GPUs), coupled with efficient population MCMC samplers, offer possibilities to overcome this bottleneck, as will be demonstrated below for a selection of three examples involving synthetic and real-world datasets. First, however, we will complete the theoretical overview by briefly explaining the formal relationship between TI and variational Bayes.
Variational Bayes
Variational Bayes (VB) is a general approach for approximating intractable integrals by solving tractable optimization problems. Importantly, this optimization simultaneously yields an approximation to the posterior density and a lower bound on the LME.
The fundamental equality which underlies VB is based on introducing a tractable density \(q\left( \theta \right)\) to approximate the posterior \(p(\theta |y, m)\):
$$- F_{H} = \ln p\left( {y{|}m} \right) = \int q\left( \theta \right)\ln \frac{{p\left( {y{|}m} \right)q\left( \theta \right)}}{q\left( \theta \right)}d\theta ,$$
(30)
$$= \int q\left( \theta \right)\ln \frac{{p\left( {y,\theta |m} \right)q\left( \theta \right)}}{p(\theta |y,m)q\left( \theta \right)}d\theta ,$$
(31)
$${ } = \underbrace {{\int q\left( \theta \right)\ln p\left( {y{|}\theta ,m} \right)d\theta }}_{{\text{Approx. accuracy}}} - \underbrace {{\int q\left( \theta \right)\ln \frac{q\left( \theta \right)}{{p\left( {\theta {|}m} \right)}}d\theta }}_{{\text{Approx. complexity}}} + \underbrace {{\int q\left( \theta \right)\ln \frac{q\left( \theta \right)}{{p\left( {\theta {|}y,{ }m} \right)}}d\theta }}_{{\text{Error}}}$$
(32)
The last term in Eq. 32 is the KL divergence of the approximate density \(q\) from the unknown posterior density; it encodes the error or inaccuracy of the approximation. Given that the KL divergence is never negative, the first two terms in Eq. 32 represent a lower bound on the log evidence \(- F_{H}\); in the following, we will refer to this bound as the negative variational free energy \(- F_{VB}\).
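Note that Eq. 32 holds for any choice of \(q\), not only the optimal one. The following sketch evaluates all three terms for a deliberately suboptimal Gaussian \(q\) in the conjugate toy model used above, confirming that they sum to the LME while the first two terms alone under-estimate it:

```python
import numpy as np
from scipy.stats import norm

# Illustrative conjugate Gaussian model (as in earlier sketches)
y, s2_lik = 1.5, 1.0
mu0, s2_prior = 0.0, 4.0

# A deliberately suboptimal approximate density q = N(mu_q, s2_q)
mu_q, s2_q = 0.5, 1.0

# Analytic posterior N(mu_n, s2_n)
s2_n = 1.0 / (1.0 / s2_prior + 1.0 / s2_lik)
mu_n = s2_n * (mu0 / s2_prior + y / s2_lik)

def kl_gauss(m1, v1, m2, v2):
    # KL[N(m1, v1) || N(m2, v2)]
    return 0.5 * ((v1 + (m1 - m2)**2) / v2 - 1.0 + np.log(v2 / v1))

# The three terms of Eq. 32
acc = -0.5 * np.log(2 * np.pi * s2_lik) - ((y - mu_q)**2 + s2_q) / (2 * s2_lik)
compl = kl_gauss(mu_q, s2_q, mu0, s2_prior)
err = kl_gauss(mu_q, s2_q, mu_n, s2_n)

lme = norm.logpdf(y, loc=mu0, scale=np.sqrt(s2_lik + s2_prior))
print(acc - compl + err, lme)   # Eq. 32: should agree exactly
print(acc - compl, lme)         # -F_VB is a strict lower bound here
```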
In summary, the relation between the information theoretic version of the Helmholtz free energy \(- F_{H}\), the log model evidence \(\ln p\left( {y{|}m} \right)\), and the negative variational free energy \(- F_{VB}\) is therefore
$$- F_{H} = \ln p\left( {y{|}m} \right) = - F_{VB} + {\text{KL}}[q\left( \theta \right)||p\left( {\theta {|}y,m} \right)]$$
(33)
We highlight this relationship because the term ‘negative free energy’ is sometimes used in the literature to denote the logarithm of the partition function \(Z\) itself (i.e., \(- F_{H}\)), as we have done above, and sometimes to refer to a lower bound approximation of it (i.e., \(- F_{VB}\)); this understandably causes confusion. The ambiguity arises because the variational free energy \(- F_{VB}\) becomes identical to the negative free energy \({-}F_{H}\) when the approximate density \(q\) equals the posterior and hence their KL divergence becomes zero. In this special case
$$\begin{array}{*{20}c} {\mathop {\max }\limits_{{\text{q}}} \left[ { - F_{VB} \left[ q \right]} \right] = - F_{H} .} \\ \end{array}$$
(34)
To maintain consistency in the notation, we will distinguish between \(- F_{H}\) and \(- F_{VB}\) throughout the paper.
VB aims to reduce the KL divergence of \(q\) from the posterior density by maximizing the lower bound \(- F_{VB}\) as a functional of \(q\):
$$\begin{array}{*{20}c} { - F_{VB} \left[ q \right] = \int q\left( \theta \right)\ln p\left( {y{|}\theta ,m} \right)d\theta - \int q\left( \theta \right)\ln \frac{q\left( \theta \right)}{{p\left( {\theta {|}m} \right)}}d\theta .} \\ \end{array}$$
(35)
When the functional form of \(q\) is fixed and parametrized by a vector \(\eta\), VB can be reformulated as an optimization method in which \(\eta\) is updated according to the gradient \(\partial F_{VB} \left[ {q\left( {\theta {|}\eta } \right)} \right]/\partial \eta\) (Friston et al. 2007). Thus, the path followed by \(\eta\) during optimization can be formulated as
$$\begin{array}{*{20}c} {\dot{\eta } = - \frac{{\partial F_{VB} \left[ {q\left( {\theta {|}\eta } \right)} \right]}}{\partial \eta }.} \\ \end{array}$$
(36)
This establishes a connection between TI and VB. In TI, the analogue of the path of \(\eta\) is the path of \(\beta\) from 0 to 1, which is selected a priori subject to the conditions
$$\begin{array}{*{20}c} {q\left( {\theta {|}\beta = 0} \right) = p\left( \theta \right), q\left( {\theta {|}\beta = 1} \right) \propto p\left( {y|\theta } \right)p\left( \theta \right)} \\ \end{array}$$
(37)
and the gradients
$$\begin{array}{*{20}c} { - \frac{{\partial F_{H} \left[ {q\left( {\theta {|}\beta } \right)} \right]}}{\partial \beta }} \\ \end{array}$$
(38)
are used to numerically compute the free energy.
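To make the optimization of Eqs. 35–36 concrete, the following sketch fits a Gaussian \(q(\theta|\eta) = N(\mu, s^2)\) with \(\eta = (\mu, \ln s)\) to the conjugate Gaussian toy model by gradient ascent on \(-F_{VB}\) (closed-form gradients; the learning rate and iteration count are arbitrary choices). Because the model is conjugate, the bound becomes tight at the optimum:

```python
import numpy as np
from scipy.stats import norm

# Illustrative conjugate Gaussian model (as before)
y, s2_lik = 1.5, 1.0
mu0, s2_prior = 0.0, 4.0

def neg_F_VB(mu, s):
    # -F_VB = accuracy - complexity for q = N(mu, s^2)   (Eq. 35)
    acc = -0.5 * np.log(2 * np.pi * s2_lik) - ((y - mu)**2 + s**2) / (2 * s2_lik)
    kl = 0.5 * ((s**2 + (mu - mu0)**2) / s2_prior - 1.0 + np.log(s2_prior / s**2))
    return acc - kl

# Gradient ascent on eta = (mu, ln s), cf. Eq. 36
mu, log_s = 0.0, 0.0
lr = 0.05
for _ in range(2000):
    s = np.exp(log_s)
    d_mu = -(mu - y) / s2_lik - (mu - mu0) / s2_prior   # d(-F_VB)/d mu
    d_s = -s / s2_lik - s / s2_prior + 1.0 / s          # d(-F_VB)/d s
    mu += lr * d_mu
    log_s += lr * d_s * s                               # chain rule: d/d(ln s) = s * d/ds

lme = norm.logpdf(y, loc=mu0, scale=np.sqrt(s2_lik + s2_prior))
print(neg_F_VB(mu, np.exp(log_s)), lme)   # bound is tight for this conjugate model
```

For non-conjugate models such as DCMs, \(q\) cannot match the posterior exactly, and the converged bound remains below the LME by the KL term of Eq. 33.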
Different VB algorithms are defined by the particular functional form used for the approximate posterior. In the case of DCM, it is so far most common to use Variational Bayes under the Laplace approximation (VBL). A summary of VBL for DCM is available in the supplementary material S5, while an in-depth treatment is provided in Friston et al. (2007).