An introduction to thermodynamic integration and application to dynamic causal models

In generative modeling of neuroimaging data, such as dynamic causal modeling (DCM), one typically considers several alternative models, either to determine the most plausible explanation for observed data (Bayesian model selection) or to account for model uncertainty (Bayesian model averaging). Both procedures rest on estimates of the model evidence, a principled trade-off between model accuracy and complexity. In the context of DCM, the log evidence is usually approximated using variational Bayes. Although this approach is highly efficient, it makes distributional assumptions and is vulnerable to local extrema. This paper introduces the use of thermodynamic integration (TI) for Bayesian model selection and averaging in the context of DCM. TI is based on Markov chain Monte Carlo sampling which is asymptotically exact but orders of magnitude slower than variational Bayes. In this paper, we explain the theoretical foundations of TI, covering key concepts such as the free energy and its origins in statistical physics. Our aim is to convey an in-depth understanding of the method starting from its historical origin in statistical physics. In addition, we demonstrate the practical application of TI via a series of examples which serve to guide the user in applying this method. Furthermore, these examples demonstrate that, given an efficient implementation and hardware capable of parallel processing, the challenge of high computational demand can be overcome successfully. The TI implementation presented in this paper is freely available as part of the open source software TAPAS. Supplementary Information The online version contains supplementary material available at 10.1007/s11571-021-09696-9.


Introduction
Dynamic causal models (DCMs) are generative models that serve to infer latent neurophysiological processes and circuit properties-e.g., the effective connectivity between neuronal populations-from neuroimaging measurements such as functional magnetic resonance imaging (fMRI) or magneto-/electroencephalography (M/EEG) data (David et al. 2006;Friston et al. 2003). As reviewed by Daunizeau et al. (2011), DCMs consist of two hierarchically related layers: a set of state equations describing neuronal population activity, and an observation model which links neurophysiological states to observed signals and accounts for measurement noise. Equipped with a prior distribution over model parameters, a DCM specifies a full generative forward model that can be inverted using Bayesian techniques.
In addition to inference on model parameters, an important scientific problem is the comparison of competing hypotheses, for example, different network topologies, which are formalized as different models. Under the Bayesian framework, model comparison is based on the evidence or marginal likelihood of a model. The model evidence corresponds to the denominator from Bayes' theorem and represents the probability of the observed data under a given model. It is a widely used score of model quality that quantifies the trade-off between model fit and complexity (Bishop 2006;MacKay 2004).
Unfortunately, in most instances, it is not feasible to derive an analytical expression of the model evidence due to the intractable integrals that arise from the marginalization of the model parameters. While various asymptotical approximations exist, such as the Bayesian Information Criterion (BIC, Schwarz 1978) and more recently the Widely Applicable Bayesian Information Criterion (WBIC, Watanabe 2013), variational Bayes under the Laplace approximation (VBL, Friston et al. 2007) has established itself as the method of choice for DCM, partially because of its computational efficiency. Within the framework of variational Bayes (VB), a lower bound approximation of the log model evidence (LME) is obtained as a byproduct of model inversion: the variational negative free energy (which we refer to as ÀF VB throughout this paper).
While highly efficient, model comparison based on the variational free energy has several potential pitfalls. For example, under the Laplace approximation as used in the context of DCM, there is no guarantee that ÀF VB still represents a lower bound of the LME (Wipf and Nagarajan 2009). Furthermore, VB is commonly performed in combination with a mean field approximation, and the effect of this approximation on the posterior estimates can be difficult to predict (Daunizeau et al. 2011). Finally, in non-linear models, the posterior could become a multimodal density, a condition that aggravates the application of gradient ascent methods regularly used in combination with the Laplace approximation.
For these reasons, Markov Chain Monte Carlo (MCMC) sampling has been explored as an alternative inference technique for DCM (Aponte et al. 2016;Chumbley et al. 2007;Penny and Sengupta 2016;Sengupta et al. 2015Sengupta et al. , 2016Yao and Stephan 2020). MCMC is particularly attractive for variants of DCMs in which Gaussian assumptions might be less adequate, such as nonlinear DCMs for fMRI (Stephan et al. 2008), DCMs of electrophysiological data (Moran et al. 2013), or DCMs for layered fMRI signals . MCMC is also useful when extending DCM to more complex hierarchical models , in which the derivation of update equations for VB becomes difficult (but see Yao et al. 2018). MCMC is asymptotically exact and only assumes that the posterior distribution can be evaluated up to a multiplicative constant. However, in practice, its computational cost often leads to prohibitively long computation times for the datasets and models commonly encountered in neuroimaging. Furthermore, in contrast to VB, MCMC does not provide an estimate of the model evidence by default.
While several MCMC-based strategies for computing the LME in neuroimaging applications have been explored (e.g., Aponte et al. 2016;Penny and Sengupta 2016;Raman et al. 2016), one particularly powerful and theoretically attractive MCMC variant is thermodynamic integration (TI). This method, like VB, rests on the concept of free energy and has been proposed as gold standard for LME estimation (Calderhead and Girolami 2009;Lartillot and Philippe 2006). Despite strong theoretical advantages, so far, the computational costs of TI have prevented its practical use in neuroimaging.
This paper introduces the reader to thermodynamic integration (TI) and its application to DCM. In contrast to existing tutorials on TI (Annis et al. 2019), we provide an in-depth discussion of the theoretical foundations of TI and relate the tutorial specifically to DCM as a generative model that is frequently used in contemporary neuroimaging analyses. Our discussion covers the key concept of free energy starting from its historical origin in statistical physics, with the aim of conveying a deeper understanding of this method that goes beyond a purely technical treatment. In the second part, we present a series of examples involving both synthetic and real-world datasets. These include (1) a validation dataset based on a linear regression model with analytically tractable LME used to verify the accuracy of TI, (2) a synthetic fMRI dataset where the true data-generating model is known for each observation, and (3) a real-world fMRI dataset used to demonstrate LME estimation for nonlinear DCM.
In addition to showcasing the application of TI, these examples also serve to demonstrate that, given an efficient implementation and hardware capable of parallel processing, the challenge of high computational demand of TI can be overcome successfully. The software implementation of TI and DCM used in this paper is available as part of the open-source toolbox TAPAS (Translational Neuromodeling Unit 2014).
To keep this paper short and yet accessible to a broad audience, summaries of key topics such as DCM, Bayesian model selection (BMS), or Markov chain Monte Carlo (MCMC) are offered in the supplementary material (see sections S1, S2 and S3).

Thermodynamic integration and the origin of free energy
This section introduces TI from a statistical physics perspective. Statistical physics is a branch of physics that uses methods from probability theory and statistics to characterize the behavior of physical systems. One of the key concepts in statistical physics is that the probability of a particle being in a given state follows a probability density, and that all physically relevant quantities can be derived once this distribution is known. Starting from the free energy, we show how key concepts from information theory have developed from their counterparts in statistical physics, motivating the use of TI and providing a link to the variational Bayes approach conventionally used in DCM to approximate the log model evidence (LME).

Free energy: a perspective from statistical physics
In thermodynamics, the analogue of the model evidence is the so-called partition function Z of a system that consists of an ensemble of particles in thermal equilibrium. A classical discussion of the relationships presented here can be found in Jaynes (1957) and a more modern perspective in Ortega and Braun (2013). For example, let us consider an ideal monoatomic gas, in which the kinetic energy of individual particles is a function of their velocity h and mass m. If the system is large enough, the velocity of a single particle can be treated as a continuous random variable. The internal energy U of this ideal gas is proportional to the expected energy per particle. It is computed as the weighted sum of the energies / h ð Þ associated with all possible velocities, where the weights are given by the probability q h ð Þ of the particle being at a certain velocity: A second important quantity in statistical physics is the differential entropy H of q: Here, k B is the Boltzmann constant with units of energy per degree temperature. For an isolated system (i.e., no exchange of matter or energy with the environment), the second law of thermodynamics states that its entropy can only increase or stay constant. Thus, the system is at equilibrium when the associated entropy is maximized, subject to the constraint that the system's internal energy is constant and equal to U, and that q is a proper density, i.e.: q h ð Þ ! 0 and R q h ð Þdh ¼ 1: This constrained maximization problem can be solved using Lagrange multipliers (for the derivation see the supplementary material S4), yielding the following distribution: where Z is referred to as the partition function of the system: In a closed system, the Helmholtz free energy F H is defined as the difference between the internal energy U of the system and its entropy H times the temperature T: The Helmholtz free energy corresponds to the work (i.e., non-thermal energy in joules that is passed from the system to its environment) that can be attained from a closed system. From Eq. 6, we see that the system with constant internal energy U is at equilibrium (i.e., maximum entropy) when the Helmholtz free energy is minimal. Substituting the internal energy (Eq. 2), the entropy (Eq. 3) and the expression of q (Eq. 4) into Eq. 6, it follows that the log of the partition function corresponds to the negative Helmholtz free energy divided by k B T: Free energy: a perspective from statistics In order to link perspectives on free energy from statistical physics and (Bayesian) statistics, we assume that the system is examined at a constant temperature T such that the term k B T equals unity (normalization of temperature), allowing us to move from a physical perspective on free energy (expressed in joules) to a statistical formulation (expressed in information units proportional to bits). This is the common convention in the statistical literature, and thereby, all quantities become unit-less information theoretic terms. Under this convention, the physical concept of free energy described above gives rise to an analogous concept of free energy in statistics when the energy function is given by the negative log joint probability À ln pðy; hjmÞ (Neal and Hinton 1998): Hence, the log joint (which fully characterizes the system) takes the role of the kinetic energy in the ideal gas example above.
Inserting the expression for / (Eq. 8) into Eq. 4, we obtain the following expression: which together with the definition of the partition function Z (Eq. 5), reveals that the equilibrium distribution of the system is the posterior distribution (i.e., the joint probability divided by the model evidence): Based on this result, we can derive the information theoretic version of the Helmholtz free energy, by inserting the expressions for the internal energy (Eq. 2) and the entropy (Eq. 3) into Eq. 6 and making use of the expression for / from Eq. 8: In analogy to Eq. 6, the first term on the right hand side is an expectation over an energy function (cf. Equation 8); while the second term represents the differential entropy Notably, under the choice of the energy function in Eq. 8, the partition function (Eq. 5) corresponds to the marginal of the joint probability p y; hjm ð Þwith respect to h. Comparing with Eq. 7, we see that the negative free energy is equal to the log model evidence (LME): Replacing the joint in Eq. 11 by the product of likelihood and prior, the negative free energy can be decomposed into two terms that have important implications for evaluating the goodness of a model: The first term (the expected log likelihood under the posterior) represents a measure of model fit or accuracy. The second term corresponds to the Kullback-Leibler (KL) divergence of the posterior from the prior, and can be viewed as an index of model complexity. Hence, maximizing the negative free energy (log evidence) of a model corresponds to finding a balance between accuracy and complexity. We will turn to this issue in more detail below and examine variations of this perspective under TI and VB, respectively.
In the following, we will explicitly display the sign of the negative free energy for notational consistency. In order to highlight similarities with statistical physics and the concepts of energy and potential, we will continue to express the free energy as a functional of a (possibly nonnormalized) log density, such that where / h ð Þ is equivalent to an energy or potential depending on h. Figure 1 summarizes the conceptual analogies of free energy between statistical physics and Bayesian statistics.

Thermodynamic integration (TI)
We now turn to the problem of computing the negative free energy. As is apparent from Eq. 16, the free energy contains an integral over all possible h, which is usually prohibitively expensive to compute and thus precludes direct evaluation. The basic idea behind TI is to move in small steps along a path from an initial state with known F H to the equilibrium state and add up changes in free energy for all steps (Gelman and Meng 1998). This idea was initially introduced in statistical physics to compute the difference in Helmholtz free energy between two states of a physical system (Kirkwood 1935). Other examples for the application of TI in statistical physics are presented in Landau (2015).
In Bayesian statistics, the same idea can be used to compute the LME of a model m. This is because the difference in free energy associated with two potentials corresponding to the negative log prior / 0 h ð Þ ¼ ÀlnpðhjmÞ and the negative log joint / h ð Þ ¼ À ln p yjh; m ð ÞÀ ln pðhjmÞ (cp. Eq. 8) equals the LME. More precisely, provided the prior is properly normalized, i.e., R p hjm ð Þdh ¼ 1, substituting / 0 and / into Eq. 16 yields The goal is now to construct a piecewise differentiable path connecting prior and posterior and then compute the LME by integrating infinitesimal changes in the free energy along this path. A smooth transition between F / ½ and F / 0 ½ can be constructed by the power posteriors p b hjy; m ð Þ(see Eq. 19 below) which are defined by the path / b : with b 0; 1 ½ , such that / 1 ¼ /. In the statistics literature, b is usually referred to as an inverse temperature because it has analogous properties to physical temperature in many aspects. We will use this terminology and comment on the analogy in more detail below.
The power posterior is obtained by normalizing the exponential of À/ b h ð Þ: Combining this definition with Eq. 17, the LME can be expressed as: Applying the chain rule of differentiation (see supplementary material section S11 for a detailed derivation), the LME can be written in terms of an integral over an expectation with respect to the power posterior: which we refer to as the basic or fundamental TI equation (Gelman and Meng 1998).
Notably, the TI equation can also be understood in terms of the definition of the free energy (Eq. 14) by noting that the latter can be written as the sum of an expected log likelihood and a cross-entropy term (KL divergence of the power posterior from the prior): The first term, A b ð Þ ¼ ÀoF H =ob, is referred to as the accuracy of the model (see, for example, Penny et al. 2004a;Stephan et al. 2009), while the second term constitutes a complexity term. Note that Eq. 26 is typically presented in the statistical literature for the case of b ¼ 1 and describes the same accuracy vs. complexity trade-off previously expressed by Eq. 14, but now from the specific perspective of TI. Also note that A b ð Þ is defined as the negative partial derivative of the free energy. In contrast to the full derivative, the partial derivative only considers the direct dependence of F H on b, and ignores the indirect dependence via the KL divergence term. Figure 2 shows a graphical representation of the relation conveyed by the fundamental TI equation (Eqs. 24) and 26. For any given b, the negative free energy at this position of the path ÀF H b ð Þ can be interpreted as the signed area below the curve A b ð Þ ¼ ÀoF H =ob (i.e., the integral over shows that the area bA b ð Þ þ F H b ð Þ is the KL divergence of the corresponding power posterior from the prior. This relationship holds because, for the power posteriors (Eq. 19), A( b) is a monotonically increasing function of b. This is due to the fact that See Lartillot and Philippe (2006) for a derivation of this property. From this it follows that the negative free energy is a concave function along b.
The theoretical considerations highlighted above and the relation to principles of statistical physics render TI an appealing choice for estimating the LME. However, the question remains how the LME estimator in Eq. 24 can be evaluated in practice. To solve this problem, TI relies on Monte Carlo estimates of the expected value E ln p yjh; m ð Þ ½ p b ðhjy;mÞ in Eq. 24: where samples h k are drawn from the power posterior p b ðhjy; mÞ. The remaining integral over b in Eq. 24 is a one dimensional integral, which can be computed through a quadrature rule using a predefined temperature schedule for The optimal temperature schedule in terms of minimal variance of the estimator and minimal error introduced by this discretization was outlined previously in the context of linear models by Gelman and Meng (1998) and Calderhead and Girolami (2009).
Note that each step b j in the temperature schedule requires a new set of samples h k to be drawn from the respective power posterior p b j ðhjy; mÞ, contributing to the high computational complexity of TI. However, since these sets of samples are independent from each other, this can in principle be done in parallel, provided suitable soft-and hardware capabilities are available. An efficient way to realize such a parallel sampling procedure is to adopt a population MCMC approach in which MCMC sampling is used to generate, for each b j , a chain of samples from the respective power posterior p b j ðhjy; mÞ. In addition, chains from neighboring b j in the temperature schedule are allowed to interact by means of a ''swap'' accept-reject (AR) step (Swendsen and Wang 1986). This increases the sampling efficiency and speeds up convergence of the individual MCMC samplers. For readers unfamiliar with Monte Carlo methods, a primer on (population) MCMC is provided in the supplementary material S3. For a detailed treatment, we refer to McDowell et al. (2008) and Calderhead and Girolami (2009) .   Fig. 2 Graphical representation of the TI equation. The free energy is equal to the signed area below A ¼ ÀoF H =ob, and thus the area Að1Þ þ F H is equal to the KL divergence of the posterior from the prior. The same relation holds for any b 2 ½0; 1 So far, the computational requirement of sampling from an ensemble of distributions (one for each value of b) has limited the application of TI to high performance computing environments and prevented its widespread use in neuroimaging. Luckily, the increase in computing power of stand-alone workstations and the proliferation of graphical processing units (GPU), coupled with efficient population MCMC samplers, offer possibilities to overcome this bottleneck, which will be demonstrated below for a selection of three examples involving synthetic and real-world datasets. First, however, we will complete the theoretical overview by briefly explaining the formal relationship between TI and variational Bayes.

Variational bayes
Variational Bayes (VB) is a general approach to approximate intractable integrals with tractable optimization problems. Importantly, this optimization method simultaneously yields an approximation to the posterior density and a lower bound to the LME.
The fundamental equality which underlies VB is based on introducing a tractable density q h ð Þ to approximate the posterior pðhjy; mÞ.
The last term in Eq. 32 is the KL divergence of the approximate density q from the unknown posterior density; this encodes the error or inaccuracy of the approximation. Given that the KL divergence is never negative, the first two terms in Eq. 32 represent a lower bound on the log evidence ÀF H , and in the following we will refer to it as the negative variational free energy ÀF VB .
In summary, the relation between the information theoretic version oHelmholtz free energy ÀF H , log model evidence ln p yjm ð Þ, and variational negative free energy ÀF VB is therefore We highlight this relationship because many readers are rightfully confused that the term 'negative free energy' is sometimes used in the literature to denote the logarithm of the partition function Z itself (i.e., ÀF H ), as we have done above, and sometimes to refer to a lower bound approximation of it (i.e., À F VB ). This is because the variational free energy ÀF VB becomes identical to the negative free energy ÀF H when the approximate density q equals the posterior and hence their KL divergence becomes zero. In this special case To maintain consistency in the notation, we will distinguish ÀF H and ÀF VB . throughout the paper.
VB aims to reduce the KL divergence of q from the posterior density by maximizing the lower bound ÀF VB as a functional of q: When the functional form of q is fixed and parametrized by a vector g, VB can be reformulated as an optimization method in which g is updated according to gradient oF VB q hjg ð Þ ½ =og ). Thus, the path followed by g during optimization can be formulated as This establishes a connection between TI and VB. In the former, the path of g corresponds to the path of b from 0 to 1, which was selected a priori with the conditions that and the gradients À oF H q hjb ð Þ ½ ob ð38Þ are used to numerically compute the free energy.
Different VB algorithms are defined by the particular functional form used for the approximate posterior. In the case of DCM, it is so far most common to use Variational Bayes under the Laplace approximation (VBL). A summary of VBL for DCM is available in the supplementary material S5, while an in-depth treatment is provided in Friston et al. (2007).

Evaluating the accuracy of TI
In this section, we investigate the accuracy of LME estimates obtained with TI, and compare the performance of TI to that of two other sampling-based LME estimators, the prior arithmetic mean (AME) and posterior harmonic mean (HME) estimators. In contrast to TI, which requires sampling from an ensemble of distributions (see Eq. 29), AME and HME only require samples from the prior or posterior distributions, respectively. Hence, these two methods, which are described in detail in the supplementary material S6, are computationally significantly less demanding than TI.
For the purpose of this comparison, we turn to a Bayesian linear regression model with normal prior and likelihood. This is a useful case for benchmarking because the LME can be computed analytically. This model is described by the following prior and likelihood function: where h is the p Â 1 ½ vector of regression coefficients, y is the M Â 1 ½ vector of data points, X is the M Â p ½ design matrix, and P À1 p and P À1 e are the covariance matrices of the prior and errors, respectively. The analytic solution for the LME of this model is given by Penny (2012) P ¼ P p þ X T P e X; ð41Þ For our simulations, we chose M ¼ 100, P À1 p ¼ 16I p and P À1 e ¼ 10I e , where I p and I e are the corresponding identity matrices. The design matrix was chosen to have a block structure equivalent to a design for a one-way ANOVA with p levels (for those values of p that do not exactly divide by M, the excess data points were assigned to the last cell). Synthetic data where generated by sampling from the generative model defined in Eq. 39. By varying p from 2 to 32 in steps of 1, we created a series of models with increasing dimensionality. For each value of p, we repeated the data generation process 10 times, drawing a new set of values for the regression parameters h each time from the prior, and generating observations y according to the likelihood. We then estimated the LME using TI, AME and HME, and compared the estimates against the analytically computed LME.
The TI approximation to the LME was computed using 64 chains with a 5th order annealing schedule, i.e. a temperature schedule with 64 steps b j , with step size chosen according to a fifth order power rule (Calderhead and Girolami 2009). In each chain, we generated 6000 samples. We then computed AME based on the samples from the prior density and HME based on samples from the posterior. Figure 3 shows the error in the LME estimates as a function of the number of model parameters for the three approaches. Consistent with previous reports, the results show that HME overestimated the LME, while AME underestimated it (Lartillot and Philippe 2006). Only TI provided good estimates over the full range of models. This indicates that, despite comparing unfavorably in terms of computational efficiency, TI should still be preferred in practice due to the large estimation error of AME and HME, especially for higher-dimensional models (but see Penny and Sengupta (2016) for variants of AME and HME that solve some of the issues).

Application to DCM
Having established the accuracy of TI as an estimator for the LME in a case where the analytic solution for the LME is available, we now turn to the case of LME estimation and model selection in the context of DCM. For this purpose, we discuss two example applications. The first example considers a simulated dataset where the true model that generated each observation is known. This serves to determine the ability of TI to identify the datagenerating model. In the second example, we analyze the ''attention to motion'' fMRI dataset (Buchel 1997), which has been analyzed by numerous previous methodological studies. Primers on DCM and Bayesian model selection are provided in the supplementary material.

DCM: simulated data
In the first experiment, we used simulated data from 5 DCMs (linear: model 1; bilinear: models 2-4; nonlinear: model 5) with two inputs (u 1 and u 2 ). The DCMs are Fig. 3 Error in estimating the log evidence of linear models for three different sampling approaches. The curves show mean and standard deviation (error bars) over ten runs at each value of p (number of GLM parameters) for thermodynamic integration (TI), posterior harmonic mean estimator (HME) and prior arithmetic mean estimator (AME) displayed in Fig. 4 and are available for download via the ETH Research Collection (ETH Zurich 2020). The numerical values of the connectivity matrices are listed in the supplementary material S7. The BOLD signal data were simulated assuming a repetition time (TR) = 2 s and 720 scans per simulation. The driving inputs were entered with a sampling rate of 2:0Hz. Simulated time series were corrupted with Gaussian noise yielding a signal-to-noise ratio (SNR) of 1.0. Here, SNR was defined as the ratio of signal standard deviation to noise standard deviation (Welvaert and Rosseel 2013). This means that our simulated data contained identical amounts of noise and signal, representing a relatively challenging SNR scenario.
For each model, we generated 40 different datasets with different instantiations of Gaussian noise, such that the underlying time series remained constant. We then counted how often the data-generating model was assigned the largest model evidence and compared the ensuing values across the different estimators (i.e., AME, HME, TI, VBL). Notably, the absolute value of the log evidence of a given model is irrelevant for model scoring; instead, its difference to the log evidence of other models is decisive.
In a pretesting phase, we found that TI generated stable estimates of the LME using 64 chains. All simulations were executed with a burn-in phase of 1 Â 10 4 samples, followed by an additional 1 Â 10 4 samples used for analysis. We evaluated the convergence of the MCMC algorithm by examining the potential scale reduction fac-torR (Gelman & Rubin 1992) for samples of the log likelihood of all chains. We found thatR was below 1.1 in all but a few instances, indicating convergence. Estimated LME values are displayed in Fig. 5 and Table 1. Consistent with the linear model analysis in the previous section, the HME was always higher and the AME always lower than the TI estimate of the LME. VBL estimates were close to the TI estimate. To test for significant differences in accuracy of recovering the correct model by the different algorithms, v 2 tests were employed (inference method (i.e., TI, VB, HME, AME) vs inference result (i.e., number of times correct and incorrect model was selected)). TI and VBL were not significantly different (v 2 ¼ 0:3; p ¼ 0:56), but both TI and VBL (not shown) were significantly better than AME (v 2 ¼ 189:5; p\10 À5 ) and HME (v 2 ¼ 25:4; p\10 À5 ).
We then examined how often the data-generating model was identified correctly by model comparison, i.e., how often it showed the largest LME of all models. Of all estimators, AME failed most frequently to detect the datagenerating model (Table 2). HME identified the correct Fig. 4 Illustration of the five simulated 3-region DCMs used for cross-model comparison. Self-connections are not displayed. The variables u 1 and u 2 represent two different experimental conditions or inputs. All models represented different hypotheses of how the neuronal dynamics in area x 3 could be explained in terms of the two driving inputs and the effects of the other two regions x 1 and x 2 . Model m 1 can be understood as a 'null hypothesis' in which the activity of all the areas can be explained by the driving inputs. Models m 2 and m 3 correspond to two forms of bilinear effect on the forward connection of areas x 1 and x 2. Model m 4 represents the hypothesis that input u 1 affects the self-connection of area x 3 (not displayed). Model m 5 represents a non-linear interaction between regions x 1 and x 2 . Endogenous connections are depicted by gray arrows, driving inputs by black arrows, bilinear modulations by red arrows and nonlinear modulations by blue arrows. (Color figure online) model more consistently (Table 3). Both VBL and TI displayed a similar behavior (Tables 4 and 5), although model m 5 was identified slightly more consistently identified by VBL. However, as displayed in Table 1, according Fig. 5 Estimated LME for all models relative to TI when inverted with the corresponding data-generating model under SNR = 1for 40 different models. Right panel zooms in the left panel. Red triangles correspond to the HME, blue circles to the AME, and black squares to VBL. HME was always higher and AME always lower than the TI estimate. All LME estimates are shown after subtracting the TI-based estimate for the same model Table 2 Cross-model comparison results for AME in the case of synthetic data (SNR = 1) HME: synthetic data The row label indicates the data-generating model, the column index is the inferred model  The row label indicates the data-generating model, the column index is the inferred model  The row label indicates the data-generating model, whereas the column index is the inferred model Tables display the LME (summed across 40 simulations) of each combination of inverting and generating models. Columns have been normalized by the lowest LME: according to TI. Columns on the right and left tables share the same normalization and their absolute values can be directly compared. On most, but not all occasions, VBL underestimated the LME compared to TI. However, for both VBL and TI the datagenerating model obtained the highest LME (marked in bold) to both inversion schemes, the data-generating model was consistent with the model showing the highest LME.

Empirical data: attention to motion
In this example, we demonstrated TI-based parameter estimation and model comparison for DCM on an empirical dataset. Since the previous two examples have shown that TI consistently outperforms the other sampling-based LME estimators, AME and HME, we limit our comparison to TI and VB, from here on. For the analysis of empirical data, we selected the ''attention to motion'' fMRI dataset (Buchel 1997) that has been analyzed in numerous previous methodological studies (e.g., Friston et al. 2003;Marreiros et al. 2008;Penny et al. 2004a;Penny et al. 2004b;Stephan et al. 2008). The original study investigated the effect of attention on motion perception (Buchel 1997); in particular, the authors examined attentional effects on the connectivity between primary visual cortex (V1), motion-sensitive visual area (V5) and posterior parietal cortex (PPC). In brief, the experimental paradigm consisted of four conditions (all under constant fixation): fixation only (F), presentation of stationary dots (S), passive observation of radially moving dots (N), or attention to the speed of these dots (A). Four sessions were recorded and concatenated, yielding a total of 360 volumes (T E ¼ 40ms; TR ¼ 3:22sÞ. Three inputs were constructed using a combination of the three conditions: stimulus ¼ S þ N þ A, motion ¼ N þ A, attention ¼ A. Driving inputs were resampled at 0:8Hz; requiring a total of 1440 integration steps. Further details of the experimental design and analysis can be found in Buchel (1997).
One reason for selecting this dataset is that Stephan et al. (2008) previously demonstrated that a nonlinear model (model 4 in Fig. 6) had higher evidence than comparable bilinear models (model 1-3 in Fig. 6). This case is of interest for evaluating the quality of different LME estimators, as one would expect that the introduction of nonlinearities represents a challenging case for VBL.
For TI-based LME estimation, 16 Â 10 3 samples were collected from 64 chains, of which 8 Â 10 3 were discarded in the burn-in phase. The convergence of the algorithm was evaluated using theR statistic of the samples of the log likelihood of each chain and model. In all but one chain,R was below 1.1, indicating convergence. Table 6 summarizes the evidence estimates obtained with TI and VBL. In comparison to previous results see Table 8 in Stephan et al. (2008), three findings are worth highlighting. First, as shown in Table 6, the VBL algorithm reproduced the ranking of models reported in Stephan et al. (2008), although an earlier version of the VBL algorithm with different prior parameters and a different integration scheme was used by Stephan et al. (2008). Moreover, our TI implementation produced the same ranking as the one obtained under VBL.
Second, the difference between the VBL free energy estimates and the TI estimates varied considerably across models. To investigate this variability, we compared TI and VBL with regard to the accuracy term. The results are summarized in the lower section of Table 7. Table 6 shows that the discrepancies between VBL and TI varied across models, and the difference was particularly pronounced for the nonlinear model m 4 ([ 40 log units).
Third, while VBL detected the most plausible model, the findings from this dataset suggest that VBL-based inversion of DCMs might not always be fully robust. In particular, the difference between the algorithms could be attributed to the VBL algorithm converging to a local extremum. To assess the differences between TI and VBL more systematically, we initialized each algorithm 10 times from different starting values that were randomly sampled from the prior density. Figure 7 depicts the estimated model evidence and accuracy and Fig. S1 in the supplementary material S10 displays the predicted BOLD signal. VBL estimates of the accuracy and LME displayed much larger variance than the TI estimates. This suggests that the greater variance of the VBL estimates is due to the propensity of the gradient ascent used in VBL to converge to local maxima.
The observations listed above highlight two important challenges faced by VBL. Due to restricting the approximate posterior to a normal distribution, the negative free energy obtained with VBL is a lower bound approximation and hence, will always be smaller than the actual logmodel evidence, especially for nonlinear models with nonnormal posteriors. Independently from this, the presence of local maxima in the optimization process means that VBL The row label indicates the data-generating model, the column index is the inferred model may in practice even fail to find the best lower bound, i.e. the global maximum of the negative free energy. TI is able to address both these issues, which will be important for performing reliable subject-level inference.

Conclusion
In this paper, we have reviewed the theoretical foundation of thermodynamic integration. In the process, we have introduced the concept of free energy, which has found its  way into information theory and Bayesian statistics from its origin in statistical mechanics. Approaching TI from this dual perspective allowed us to highlight the parallels and analogous concepts shared between these different scientific fields.
A key result was obtained in Eq. 24 (the TI equation), which provided (1) a graphical interpretation of the LME as the signed area under the curve given by the accuracy A b ð Þ ¼ ÀoF H =ob; and (2) a reliable method for estimating LME via Monte Carlo samples drawn from the power posteriors. The application of this method was demonstrated in the second part of this paper on synthetic and real-world datasets. Specifically, we started with an experiment involving synthetic data from a linear regression model with analytical solutions for LME. This experiment demonstrated that TI produces accurate LME estimates and outperforms computationally less complex sampling-based LME estimators (AME and HME), justifying the additional complexity.
Finally, we used synthetic and real-world fMRI data to compare TI to VB, which is the current gold standard in the context of model inversion and LME estimation for DCM. Although VB was robust in most instances, we found evidence for variability in the estimates due to local optima in the objective function-especially in the case of the realworld dataset, where the model space included nonlinear DCMs, and for challenging scenarios where the number of network nodes and free parameters is high. While this problem can be ameliorated by initializing the VB algorithm from different starting points or using global optimization methods (see Lomakina et al. 2015), this would reduce computational efficiency, which is the main justification for VB as the default choice for standard applications of DCM.
Hence, sampling-based approaches like TI might become the method of choice when the robustness and validity of single-subject inference is paramount. For example, the utility of generative models for clinical applications, such as differential diagnosis based on model comparison or prediction of individual treatment responses (Stephan et al. 2017), depends on our ability to draw reliable and accurate conclusions from model-based estimates.
In addition, the experiments presented in this paper also demonstrated the practical feasibility of applying TI to complex generative models like DCM, which are characterized by high computational cost for evaluation of the likelihood function. This is made possible by an implementation that relies on parallel computing techniques, offering reasonable execution times on stand-alone workstations. Specifically, the computations for this paper were performed on a workstation equipped with an Intel Core i7 4770 K (CPU) and a Nvidia Geforce GTX 1080 (GPU), with a software implementation that allow obtaining as many as 10 5 samples of realistic DCMs in only a few minutes. Here, the important implication is that TI is no longer a method that is exclusively reserved for users with access to high performance computing clusters. The TI and DCM implementations used in this paper is available to the community as open source software (Translational Neuromodeling Unit 2014).
Finally, we would like to point out that although this paper mainly focuses on the estimation of the model evidence, which is the intended purpose of TI, TI also provides samples from the posterior distribution over model parameters. This is due to the fact that TI's temperature schedule always includes b N ¼ 1 (cf. Equation 29), meaning that the last power posterior being sampled from is equivalent to the posterior distribution. While this is conceptually different from the parametric distribution which VB provides as an approximation to the true posterior distribution, the posterior samples obtained by TI can be used to calculate summary statistics, such as posterior mean and variance or Bayesian credible intervals for the DCM parameters. This enables the user to obtain quantitative estimates of DCM parameters, such as the connection strength between brain regions or the strength of contextual modulations, in addition to the inference over network structure based on the comparison of LME.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons. org/licenses/by/4.0/.