1 Introduction

A central quantity in Bayesian statistics is the marginal likelihood

$$\begin{aligned} p(D|{\mathcal {M}}) = \int p(D,{\varvec{\theta }}|{\mathcal {M}}) d{\varvec{\theta }} = \int p(D|{\varvec{\theta }},{\mathcal {M}}) p({\varvec{\theta }}|{\mathcal {M}}) d{\varvec{\theta }} \end{aligned}$$
(1)

where \(D\) are the data, and \({\mathcal {M}}\) represents a given statistical model with parameter vector \({\varvec{\theta }}\). The difficulty in practically computing the marginal likelihood is exemplified by considering the Monte Carlo sum

$$\begin{aligned} X \; = \; \frac{1}{M} \sum _{i=1}^M p(D|{\varvec{\theta }}_i,{\mathcal {M}}) \end{aligned}$$
(2)

where \(\{\theta _i\}\) is an iid sample from \(p({\varvec{\theta }}|{\mathcal {M}})\). Under fairly general regularity conditions the estimator X converges almost surely to \(p(D|{\mathcal {M}})\), by the strong law of large numbers, and is asymptotically efficient, with asymptotic variance \(C/\sqrt{N}\) (where N is the size of \(D\)), by the central limit theorem. However, even for modestly complex systems, the constant in the numerator, C, can reach exorbitant magnitudes, rendering the scheme not viable for practical applications. The practical shortcomings of a variety of alternative numerical methods, like the harmonic mean estimator (Gelfand and Dey 1994), bridge sampling (Gelman and Meng 1998), or Chib’s method (Chib and Jeliazkov 2001), have been discussed in the statistics and machine learning literature (e.g. Murphy (2012)). The most widely used and robust method appears to be thermodynamic integration (TI). This method was originally proposed by Kirkwood (1935) and further developed in statistical physics for the mathematically equivalent problem of computing free energies; see e.g. Schlitter (1991) and Schlitter and Husmeier (1992). Gelman and Meng (1998) adapted TI to the computation of marginal likelihoods, Lartillot and Philippe (2006) demonstrated the application of TI to complex systems, and Friel and Pettitt (2008) and Calderhead and Girolami (2009) popularised TI more widely in the statistics community by demonstrating a computationally powerful combination with parallel tempering (Earl and Deem 2005).

Thermodynamic integration is based on an integral of the expected log likelihood along an inverse annealing path from the prior to the posterior distribution. The resulting estimator typically suffers from high variability, which particularly stems from the parameter prior regime. When comparing complex models with differences in a comparatively small number of parameters, these intrinsic errors from sampling fluctuations may outweigh the differences in the log marginal likelihood estimates. The objective of the present study is to explore the scope for variance reduction by directly targeting the log Bayes factor via a modified transition path between the two models such that the high-variance prior domain is avoided. This idea is not new. In statistical physics it is well known (Schlitter 1991; Schlitter and Husmeier 1992) that applying TI to the computation of a reaction free energy, which is mathematically equivalent to the log Bayes factor, is computationally more efficient than the separate computation of the standard free energies for the two reaction states involved (educt versus product states); the latter is mathematically equivalent to the difference of the log marginal likelihoods of two statistical models to be compared. Also in the statistics literature, the direct targeting of the log Bayes factor has been discussed before. For instance, path sampling (Gelman and Meng 1998) and annealed importance sampling (Neal 2001) have been conceived in a way to allow the direct computation of the ratio of two partition functions, \(Z_1\) and \(Z_2\), associated with two models \({\mathcal {M}}_1\) and \({\mathcal {M}}_2\). However, in the work of Neal (2001), \(Z_1\) is set to the normalisation factor of the prior distribution, and the method thus reduces to the computation of the log marginal likelihood.Footnote 1 Gelman and Meng (1998) do consider a direct comparison between two alternative models: a homoscedastic versus a heteroscedastic linear regression model. Rather than computing the Bayes factor, the authors apply their path sampling approach to infer the posterior distribution of the entire spectrum of intermediate models. While this is a more ambitious approach than model selection with Bayes factors, it will be computationally onerous beyond the one-dimensional regime considered in their example.

To the best of our knowledge, the present article presents the first systematic study of the variance reduction that can be achieved with a thermodynamic integration path that directly targets the log Bayes factor by transiting between the posterior distributions of the two models involved. The mathematical exposition and implementation of this scheme is combined with a comprehensive comparative performance assessment based on a set of standard benchmark data to quantify the improvement in variance reduction, accuracy and computationally efficiency that can be achieved over state-of-the-art established TI methods, in particular the recent improvement proposed by Friel et al. (2014).

This article is organised as follows. In Sect. 2 we provide a brief rationale for targeting Bayes factors directly rather than indirectly via the marginal likelihood. Section 3.1 reviews standard thermodynamic integration. In Sect. 3.2 we discuss a modified numerical integration and sampling scheme from statistical physics, termed non-equilibrium TI (NETI), to reduce numerical discretisation errors. Section 3.3 describes NETI-DIFF, the proposed new TI scheme along an alternative integration path between two posterior distributions. Sections 3.4, 3.5 describe practical numerical implementations based on Metropolis-Hastings and Gibbs sampling, and Sect. 3.6 proposes a new improved inverse temperature ladder. Section 4 provides an overview of a set of benchmark problems on which we have evaluated the methods, and Sect. 5 presents our empirical findings. We conclude this article in Sect. 6 with a discussion, a comparison with the controlled thermodynamic integral of Oates et al. (2016), and an outlook on future work.

2 Rationale

Consider two alternative models, \({\mathcal {M}}_1\) and \({\mathcal {M}}_2\), and define \(E_i= -\log p(D|{\varvec{\theta }},{\mathcal {M}}_i)\), the negative log likelihood of model i. Further, define the log likelihood ratio \(\varDelta E= E_2-E_1\), the negative unnormalised log posterior \({\tilde{E}}_i= E_i+\log p({\varvec{\theta }}|{\mathcal {M}}_i)\), the negative log posterior ratio \(\varDelta {\tilde{E}}= {\tilde{E}}_2-{\tilde{E}}_1\), and let \(\langle \ldots \rangle _{i}\) denote the posterior average with respect to the posterior distribution \(({\varvec{\theta }}|D,{\mathcal {M}}_i)\). We can then adapt Jarzynski’s theorem from statistical physics (Híjar and Zárate 2010) to show that

$$\begin{aligned} p(D|{\mathcal {M}}_i) = \Big \langle \exp (E_i[{\varvec{\theta }}])\Big \rangle _i^{-1}, \quad \quad \frac{p(D|{\mathcal {M}}_2)}{p(D|{\mathcal {M}}_1)} = \Big \langle \exp (-\varDelta {\tilde{E}}[{\varvec{\theta }}])\Big \rangle _1 \end{aligned}$$
(3)

A proof is given in the Appendix. In real applications with non-trivial models, the negative log likelihood is typically in the order of a two to three digit figure, which when put into the argument of the exponential function will lead to an astronomically large number. An estimator aiming to approximate \(p(D|{\mathcal {M}}_i)= \langle \exp (E_i[{\varvec{\theta }}])\rangle _i^{-1} \) from a limited sample drawn from the posterior distribution will inevitably suffer from substantial variation. For nested models or models with sufficient parameter overlap, on the other hand, \(\varDelta {\tilde{E}}({\varvec{\theta }})\) will typically be small, \( |\varDelta {\tilde{E}}({\varvec{\theta }})| \ll \min \{|E_1({\varvec{\theta }})|,|E_2({\varvec{\theta }})|\}\). We can therefore reduce the intrinsic estimation uncertainty considerably by computing the Bayes factor directly rather than indirectly via two separate marginal likelihood estimations.

3 Methodology

3.1 Thermodynamic integration for marginal likelihoods

Thermodynamic integration is based on an inverse annealing path from the prior to the posterior distribution, and computing the expectation of the log likelihood with respect to the following annealed posterior distributions at inverse temperatures \(\tau \in [0,1]\):

$$\begin{aligned} p_{\tau }({\varvec{\theta }}|D,{\mathcal {M}})= & {} \frac{1}{Z(D|\tau ,{\mathcal {M}})} p(D|{\varvec{\theta }},{\mathcal {M}})^{\tau } p({\varvec{\theta }}|{\mathcal {M}}), \nonumber \\ Z(D|\tau ,{\mathcal {M}})= & {} \int p(D|{\varvec{\theta }},{\mathcal {M}})^{\tau } p({\varvec{\theta }}|{\mathcal {M}}) d{\varvec{\theta }} \end{aligned}$$
(4)

Taking the derivative of \(\log Z(D|\tau ,{\mathcal {M}})\) gives:

$$\begin{aligned} \frac{d}{d\tau } \log Z(D|\tau ,{\mathcal {M}})= & {} \frac{1}{Z(D|\tau ,{\mathcal {M}})} \frac{d}{d\tau } Z(D|\tau ,{\mathcal {M}})\nonumber \\= & {} \frac{1}{Z(D|\tau ,{\mathcal {M}})} \int \frac{d}{d\tau } p(D|{\varvec{\theta }},{\mathcal {M}})^{\tau } p({\varvec{\theta }}|{\mathcal {M}}) d{\varvec{\theta }} \nonumber \\= & {} \int \log p(D|{\varvec{\theta }},{\mathcal {M}}) \frac{ p(D|{\varvec{\theta }},{\mathcal {M}})^{\tau } p({\varvec{\theta }}|{\mathcal {M}})}{Z(D|\tau ,{\mathcal {M}})} d{\varvec{\theta }}\nonumber \\= & {} \int p_{\tau }({\varvec{\theta }}|D,{\mathcal {M}}) \log p(D|{\varvec{\theta }},{\mathcal {M}}) d{\varvec{\theta }} \nonumber \\= & {} {\mathbb {E}}_{\tau }\Big [ \log p(D|{\varvec{\theta }},{\mathcal {M}}) \Big ] \end{aligned}$$
(5)

From Eq. (5) we get:

$$\begin{aligned} \log p(D|{\mathcal {M}})= & {} \log Z(D|\tau =1) - \log Z(D|\tau =0) \nonumber \\= & {} \int _0^1 \frac{d}{d \tau }\log Z(D|\tau ) d \tau = \int _0^1 {\mathbb {E}}_{\tau }\Big [\log p(D|{\varvec{\theta }},{\mathcal {M}})\Big ] d\tau \end{aligned}$$
(6)

This one-dimensional integral can be solved numerically, e.g. with the trapezoid rule:

$$\begin{aligned} \log (p(D|{\mathcal {M}})) \approx \sum _{k=2}^K \frac{\tau _k-\tau _{k-1}}{2} \left\{ {\mathbb {E}}_{\tau _k}\big [\log p(D|{\varvec{\theta }},{\mathcal {M}})\big ] + {\mathbb {E}}_{\tau _{k-1}}\big [\log p(D|{\varvec{\theta }},{\mathcal {M}}) \big ] \right\} \end{aligned}$$
(7)

Some care has to be taken with respect to the choice of discretisation points \(\tau _k, k=\{0,1,2,\ldots ,K\}\), as the major contributions to the integral usually come from a small region around \(\tau \rightarrow 0\). This motivates the form

$$\begin{aligned} \tau _k = \left( \frac{k-1}{K-1}\right) ^{\alpha }; \quad k \in \{1,2,\ldots ,K\} \end{aligned}$$
(8)

for \(\alpha > 1\). Theoretical results for the optimal choice of \(\alpha \) can be found in Schlitter (1991), but require knowledge that is usually not available in practice (like the functional dependence of \({\mathbb {E}}_{\tau }[\log p(D|{\varvec{\theta }},{\mathcal {M}})]\) on \(\tau \)). In practice, \(\alpha =5\) is widely used, as e.g. in Friel et al. (2014), and we have used this value in the present study. A potentially numerically more stable alternative was proposed in Friel et al. (2014). The authors show that:

$$\begin{aligned} \frac{d}{d\tau } \left\{ {\mathbb {E}}_{\tau }[\log (p(D|{\varvec{\theta }},{\mathcal {M}}))] \right\} _{\tau } = {\mathbb {V}}_{\tau }(\log (p(D|{\varvec{\theta }},{\mathcal {M}})) \end{aligned}$$
(9)

where \({\mathbb {V}}_{\tau }(.)\) is the variance w.r.t. the power posterior in Eq. (4). The second derivative of \({\mathbb {E}}_{\tau }[\log (p(D|{\varvec{\theta }},{\mathcal {M}}))]\) at a point \(\tau \in [\tau _{k-1},\tau _{k}]\) can then be approximated by the difference quotient of the first derivative of \({\mathbb {E}}_{\tau }[\log p(D|{\varvec{\theta }},{\mathcal {M}})]\) Eq. (9):

$$\begin{aligned} \frac{d^2}{dt^2} \left\{ {\mathbb {E}}_{t}\left[ \log (p(D|{\varvec{\theta }},{\mathcal {M}}))\right] \right\} _{t=\tau } \approx \frac{{\mathbb {V}}_{\tau _{k}}(\log (p(D|{\varvec{\theta }},{\mathcal {M}}))-{\mathbb {V}}_{\tau _{k-1}}(\log (p(D|{\varvec{\theta }},{\mathcal {M}}))}{\tau _{k}-\tau _{k-1}} \end{aligned}$$

Friel et al. (2014) then employ the corrected trapezoid ruleFootnote 2 to compute each sub-integral \(\int _{\tau _{k-1}}^{\tau _{k}}{\mathbb {E}}_{\tau }[\log (p(D|{\varvec{\theta }},{\mathcal {M}}))] d\tau \). This yields:

$$\begin{aligned} \log (p(D|{\mathcal {M}}))= & {} \int _{0}^{1} {\mathbb {E}}_{\tau } \big [ \log p(D|{\varvec{\theta }},{\mathcal {M}}) \big ] d \tau = \sum _{k=2}^K \int _{\tau _{k-1}}^{\tau _k} {\mathbb {E}}_{\tau } \big [ \log p(D|{\varvec{\theta }},{\mathcal {M}}) \big ] d \tau \nonumber \\\approx & {} \sum _{k=2}^K \frac{\tau _k-\tau _{k-1}}{2} \Bigg \{ {\mathbb {E}}_{\tau _k}\big [\log p(D|{\varvec{\theta }},{\mathcal {M}})\big ] + {\mathbb {E}}_{\tau _{k-1}}\big [\log p(D|{\varvec{\theta }},{\mathcal {M}}) \big ] \Bigg \} \nonumber \\&-\sum _{k=2}^K \frac{(\tau _k-\tau _{k-1})^2}{12} \Big \{ {\mathbb {V}}_{\tau _k}\big [\log p(D|{\varvec{\theta }},{\mathcal {M}})\big ] - {\mathbb {V}}_{\tau _{k-1}}\big [\log p(D|{\varvec{\theta }},{\mathcal {M}}) \big ] \Big \}\nonumber \\ \end{aligned}$$
(10)

3.2 Nonequilibrium thermodynamic integration

The computation of the expectation values \({\mathbb {E}}_{\tau _k}\big [\log p(D|{\varvec{\theta }},{\mathcal {M}})\big ] \) is expensive and limits the number of discretisation points K that can be practically applied. An alternative scheme we use in the present work is to approximate

$$\begin{aligned} \log p(D|{\mathcal {M}})= & {} \int _0^1 {\mathbb {E}}_{\tau }\Big [\log p(D|{\varvec{\theta }},{\mathcal {M}})\Big ] d\tau \approx \int _0^1 \log p(D|{\varvec{\theta }}^{(\tau )},{\mathcal {M}}) d\tau \nonumber \\\approx & {} \sum _{k=2}^K \frac{\tau _k-\tau _{k-1}}{2} \Big \{ \log p(D|{\varvec{\theta }}^{(\tau _k)},{\mathcal {M}}) + \log p(D|{\varvec{\theta }}^{(\tau _{k-1})},{\mathcal {M}}) \Big \}\nonumber \\ \end{aligned}$$
(11)

where \(\theta ^{(\tau )}\) is a single draw from the power posterior defined in Eq. (4), and \(0=\tau _1<\tau _2<\cdots <\tau _K=1\). The computational resources gained are used to choose K orders of magnitude larger than in equilibrium TI,Footnote 3 with the implication that \((\tau _k-\tau _{k-1}) \rightarrow 0\) and discretisation errors in numerical integration are avoided. This scheme was originally proposed in statistical physics (Schlitter and Husmeier 1992) under the name non-equilibrium thermodynamic integration (NETI), and is conceptionally similar to annealed importance sampling (Neal 2001). The underlying rationale is as follows: rather than use the computational resources for the computation of the expectation value at a limited number of discretisation points—and incur discretisation errors—spread the computational resources over the whole “temperature” range and use as fine a discretisation as possible. This avoids the problem that had to be addressed in Friel et al. (2014): how to select the inverse temperatures and minimise the numerical integration error in standard TI. The price to pay is a relaxation error as a consequence of the non-equilibrium nature of the method, as discussed by Schlitter and Husmeier (1992). The authors proposed a scheme for correcting this relaxation error, by running simulations over different simulation lengths \(N_{iter}\), regressing the estimates against an approximate upper bound on the relaxation error \({\mathcal {R}}\), and then extrapolating for \({\mathcal {R}}\rightarrow 0\). In preliminary investigations omitted from the present article, we found that a single simulation with an increased value of \(N_{iter}\) matching the total computational costs of the extrapolation scheme achieved similar results, and we used this conceptionally simpler approach in all our studies.Footnote 4

3.3 Novel thermodynamic integration for Bayes factors

When comparing two models, we are typically interested in the Bayes factor \(p(D|{\mathcal {M}}_2)/p(D|{\mathcal {M}}_1)\). The standard approach is to apply thermodynamic integration to both models \({\mathcal {M}}_1\) and \({\mathcal {M}}_2\) separately, by independently inversely annealing the prior distributions to the respective posterior distributions. This approach ignores the fact that both models usually have many aspects in common and share certain parameters. This applies particularly to nested models, where all the parameters of the less complex model are also included in the more complex model. One would expect to reduce the estimation uncertainty by following a direct transition path from the posterior distribution of the less complex model to that of the more complex model, rather than transiting through the uninformative prior distribution twice. Consider two models \({\mathcal {M}}_1\) and \({\mathcal {M}}_2\) with joint parameter vector \({\varvec{\theta }}\) and a joint parameter prior \(p({\varvec{\theta }}|{\mathcal {M}}_1,{\mathcal {M}}_2)\) defined such that it reduces to the parameter priors for the separate models by marginalisation:

$$\begin{aligned} p({\varvec{\theta }}|{\mathcal {M}}_1) = \int _{{\mathcal {M}}_2/{\mathcal {M}}_1} p({\varvec{\theta }}|{\mathcal {M}}_1,{\mathcal {M}}_2) d{\varvec{\theta }}, \quad p({\varvec{\theta }}|{\mathcal {M}}_2) = \int _{{\mathcal {M}}_1/{\mathcal {M}}_2} p({\varvec{\theta }}|{\mathcal {M}}_1,{\mathcal {M}}_2) d{\varvec{\theta }} \end{aligned}$$
(12)

where \({\mathcal {M}}_2/{\mathcal {M}}_1\) is the subset of parameters contained in model \({\mathcal {M}}_2\), but not in model \({\mathcal {M}}_1\), and \({\mathcal {M}}_1/{\mathcal {M}}_2\) is the subset of parameters contained in model \({\mathcal {M}}_1\), but not in model \({\mathcal {M}}_2\). A mathematically more accurate notation would split \({\varvec{\theta }}\) into three subsets, \({\varvec{\theta }}=\{{\varvec{\theta }}_1,{\varvec{\theta }}_2,{\varvec{\theta }}_{12}\}\) such that \({\varvec{\theta }}_1 \in {\mathcal {M}}_1/{\mathcal {M}}_2\), \({\varvec{\theta }}_2 \in {\mathcal {M}}_2/{\mathcal {M}}_1\) and \({\varvec{\theta }}_{12} \in {\mathcal {M}}_1\cap {\mathcal {M}}_2\). Eq. (12) implies that \(p({\varvec{\theta }}|{\mathcal {M}}_1)=p({\varvec{\theta }}_1,{\varvec{\theta }}_{12}|{\mathcal {M}}_1)\) and \(p({\varvec{\theta }}|{\mathcal {M}}_2)=p({\varvec{\theta }}_2,{\varvec{\theta }}_{12}|{\mathcal {M}}_2)\). For that reason we can use a mathematically redundant but less opaque notation that does not make the partition \({\varvec{\theta }}=\{{\varvec{\theta }}_1,{\varvec{\theta }}_2,{\varvec{\theta }}_{12}\}\) explicit. Define the tempered posterior distribution

$$\begin{aligned} p_{\tau }({\varvec{\theta }}|D,{\mathcal {M}}_1,{\mathcal {M}}_2) = \frac{p(D|{\varvec{\theta }},{\mathcal {M}}_2)^{\tau } p(D|{\varvec{\theta }},{\mathcal {M}}_1)^{1-\tau } p({\varvec{\theta }}|{\mathcal {M}}_1,{\mathcal {M}}_2)}{Z(D|\tau ,{\mathcal {M}}_1,{\mathcal {M}}_2)} \end{aligned}$$
(13)

where

$$\begin{aligned} Z(D|\tau ,{\mathcal {M}}_1,{\mathcal {M}}_2) = \int \left( \frac{p(D|{\varvec{\theta }},{\mathcal {M}}_2)}{p(D|{\varvec{\theta }},{\mathcal {M}}_1)} \right) ^{\tau } p(D|{\varvec{\theta }},{\mathcal {M}}_1) p({\varvec{\theta }}|{\mathcal {M}}_1,{\mathcal {M}}_2) d{\varvec{\theta }} \end{aligned}$$
(14)

From Eq. (12) we get:

$$\begin{aligned} p(D|{\mathcal {M}}_1) = Z(D|\tau =0,{\mathcal {M}}_1,{\mathcal {M}}_2), \quad p(D|{\mathcal {M}}_2) = Z(D|\tau =1,{\mathcal {M}}_1,{\mathcal {M}}_2) \end{aligned}$$
(15)

Taking the derivative of the partition function in Eq. (14) gives:

$$\begin{aligned}&\frac{d}{d\tau } \log Z(D|\tau ,{\mathcal {M}}_1,{\mathcal {M}}_2) = \frac{1}{Z(D|\tau ,{\mathcal {M}}_1,{\mathcal {M}}_2)} \frac{d}{d\tau } Z(D|\tau ,{\mathcal {M}}_1,{\mathcal {M}}_2) \nonumber \\&\quad = \frac{1}{Z(D|\tau ,{\mathcal {M}}_1,{\mathcal {M}}_2)} \int \frac{d}{d\tau } \left( \frac{p(D|{\varvec{\theta }},{\mathcal {M}}_2)}{p(D|{\varvec{\theta }},{\mathcal {M}}_1)} \right) ^{\tau } p(D|{\varvec{\theta }},{\mathcal {M}}_1)\nonumber \\&\qquad \times \, p({\varvec{\theta }}|{\mathcal {M}}_1,{\mathcal {M}}_2) d{\varvec{\theta }} \nonumber \\&\quad = \int \log \left( \frac{p(D|{\varvec{\theta }},{\mathcal {M}}_2)}{p(D|{\varvec{\theta }},{\mathcal {M}}_1)} \right) \frac{p(D|{\varvec{\theta }},{\mathcal {M}}_2)^{\tau } p(D|{\varvec{\theta }},{\mathcal {M}}_1)^{1-\tau } p({\varvec{\theta }}|{\mathcal {M}}_1,{\mathcal {M}}_2)}{Z(D|\tau ,{\mathcal {M}}_1,{\mathcal {M}}_2)} d{\varvec{\theta }} \nonumber \\&\quad = \int p_{\tau }({\varvec{\theta }}|D,{\mathcal {M}}_1,{\mathcal {M}}_2) \log \left( \frac{p(D|{\varvec{\theta }},{\mathcal {M}}_2)}{p(D|{\varvec{\theta }},{\mathcal {M}}_1)} \right) d{\varvec{\theta }} \nonumber \\&\quad = {\mathbb {E}}_{\tau }\left[ \log \left( \frac{p(D|{\varvec{\theta }},{\mathcal {M}}_2)}{p(D|{\varvec{\theta }},{\mathcal {M}}_1)}\right) \right] \end{aligned}$$
(16)

Combining Eqs. (15, 16) gives the following thermodynamic integral for the log Bayes factor:

$$\begin{aligned} \log \left( \frac{p(D|{\mathcal {M}}_2)}{p(D|{\mathcal {M}}_1)}\right)= & {} \log Z(D|\tau =1,{\mathcal {M}}_1,{\mathcal {M}}_2) - \log Z(D|\tau =0,{\mathcal {M}}_1,{\mathcal {M}}_2) \nonumber \\= & {} \int _0^1 \left[ \frac{d}{d\tau } \log Z(D|\tau ,{\mathcal {M}}_1,{\mathcal {M}}_2)\right] d \tau \nonumber \\= & {} \int _0^1 {\mathbb {E}}_{\tau }\left[ \log \left( \frac{p(D|{\varvec{\theta }},{\mathcal {M}}_2)}{p(D|{\varvec{\theta }},{\mathcal {M}}_1)}\right) \right] d \tau \end{aligned}$$
(17)

Again, we follow the idea of non-equilibrium thermodynamic integration and make the approximation

$$\begin{aligned} \log \left( \frac{p(D|{\mathcal {M}}_2)}{p(D|{\mathcal {M}}_1)}\right)\approx & {} \int _0^1 \left[ \log \left( \frac{p(D|{\varvec{\theta }}^{(\tau )},{\mathcal {M}}_2)}{p(D|{\varvec{\theta }}^{(\tau )},{\mathcal {M}}_1)}\right) \right] d \tau \nonumber \\\approx & {} \sum _{k=2}^K \frac{\tau _k-\tau _{k-1}}{2} \Bigg \{ \log \left( \frac{p(D|{\varvec{\theta }}^{(\tau _k)},{\mathcal {M}}_2)}{p(D|{\varvec{\theta }}^{(\tau _k)},{\mathcal {M}}_1)}\right) +\log \left( \frac{p(D|{\varvec{\theta }}^{(\tau _{k-1})},{\mathcal {M}}_2)}{p(D|{\varvec{\theta }}^{(\tau _{k-1})},{\mathcal {M}}_1)}\right) \Bigg \}\nonumber \\ \end{aligned}$$
(18)

where \({\varvec{\theta }}^{(\tau )}\) is a single draw from the tempered posterior distribution defined in Eq. (13), \(0=\tau _1<\tau _2<\cdots <\tau _K=1\), \(K\gg 1\), and \((\tau _{k}-\tau _{k-1})\ll 1\).

In comparison with statistical physics, the proposed scheme corresponds to the direct computation of a free energy difference (Schlitter 1991; Schlitter and Husmeier 1992), which is more efficient, in terms of reduced estimation variance for given computational costs, than computing the difference of two separately computed standard free energies. The analogy from classical statistics is model comparison via a paired test, which is known to have higher power than an unpaired test.

In what follows, we refer to the estimator defined by Eq. (18) as NETI-DIFF. We describe how to compute the variance of this estimator in the Appendix 7.2.

3.4 Metropolis–Hastings scheme

The implementation of a Metropolis-Hastings scheme to target the distribution in (13) is straightforward. Given the current parameters \({\varvec{\theta }}\), sample new parameters \({{\tilde{{\varvec{{\theta }}}}}}\) from a proposal distribution \(q({{\tilde{{\varvec{{\theta }}}}}}|{\varvec{\theta }})\), and accept these new parameters with the following acceptance probability:

$$\begin{aligned} a({{\tilde{{\varvec{{\theta }}}}}}|{\varvec{\theta }})= \min \left\{ \frac{p(D|{{\tilde{{\varvec{{\theta }}}}}},{\mathcal {M}}_2)^{\tau } p(D|{{\tilde{{\varvec{{\theta }}}}}},{\mathcal {M}}_1)^{1-\tau } p({{\tilde{{\varvec{{\theta }}}}}}|{\mathcal {M}}_1,{\mathcal {M}}_2)q({\varvec{\theta }}|{{\tilde{{\varvec{{\theta }}}}}})}{p(D|{\varvec{\theta }},{\mathcal {M}}_2)^{\tau } p(D|{\varvec{\theta }},{\mathcal {M}}_1)^{1-\tau } p({\varvec{\theta }}|{\mathcal {M}}_1,{\mathcal {M}}_2)q({{\tilde{{\varvec{{\theta }}}}}}|{\varvec{\theta }})} ,1 \right\} \end{aligned}$$
(19)

Otherwise, set \({{\tilde{{\varvec{{\theta }}}}}}={\varvec{\theta }}\), and follow this scheme iteratively.

3.5 Gibbs sampling for linear models

Consider a standard linear model with parameter vector \({\varvec{\theta }}\), design matrix \(\mathbf{{D}}\), and prior distribution

$$\begin{aligned} p({\varvec{\theta }}|\delta ^2,\sigma ^2) = N(\varvec{\mu }_0,\sigma ^2\delta ^2\mathbf{I}) \end{aligned}$$
(20)

The data, \(D=\{y_1,\ldots ,y_{T}\}\) or \(\mathbf{Y}= (y_1,\ldots ,y_{T})^{{}^{\mathrm{T}}}\), are assumed to be obtained under the assumption of independent and identically distributed normal noise, with variance \(\sigma ^2\):

$$\begin{aligned} p(\mathbf{y}|{\varvec{\theta }},\sigma ^2)= N(\mathbf{{D}}{\varvec{\theta }},\sigma ^2\mathbf{I}) \end{aligned}$$
(21)

We want to compare two competing models \({\mathcal {M}}_1\) and \({\mathcal {M}}_2\), represented by two alternative design matrices \(\mathbf{{D}}^{(1)}\) and \(\mathbf{{D}}^{(2)}\):

$$\begin{aligned} p(\mathbf{y}|{\varvec{\theta }},\sigma ^2,{\mathcal {M}}_1)= N(\mathbf{{D}}^{(1)}{\varvec{\theta }},\sigma ^2\mathbf{I}), \quad p(\mathbf{y}|{\varvec{\theta }},\sigma ^2,{\mathcal {M}}_2)= N(\mathbf{{D}}^{(2)}{\varvec{\theta }},\sigma ^2\mathbf{I}) \end{aligned}$$
(22)

For notational compactness we choose a representation that leaves the dimension of \({\varvec{\theta }}\) invariant with respect to changing model dimensions by padding obsolete entries in the design matrix with zeros. For instance, to compare the models \({\mathcal {M}}_1{:} y= \theta _1 x_1 + \theta _2 x_2\), and \({\mathcal {M}}_2{:} y= \theta _1 x_1 + \theta _3 x_3 + \theta _4 x_4\) based on a data set of n observations \( \{y_t,x_{1,t},x_{2,t},x_{3,t},x_{4,t}\}\), \(t=1,\ldots ,n\), we get the following design matrices:

$$\begin{aligned} \mathbf{{D}}^{(1)}= \left( \begin{array}{cccc} x_{1,1} &{}\quad x_{2,1} &{}\quad 0 &{}\quad 0 \\ x_{1,2} &{}\quad x_{2,2} &{}\quad 0 &{}\quad 0 \\ \vdots &{}\quad \vdots &{}\quad \vdots &{}\quad \vdots \\ x_{1,n} &{}\quad x_{2,n} &{}\quad 0 &{}\quad 0 \\ \end{array} \right) , \; \mathbf{{D}}^{(2)}= \left( \begin{array}{cccc} x_{1,1} &{}\quad 0 &{}\quad x_{3,1} &{}\quad x_{4,1} \\ x_{1,2} &{}\quad 0 &{}\quad x_{3,2} &{}\quad x_{4,2} \\ \vdots &{}\quad \vdots &{}\quad \vdots &{}\quad \vdots \\ x_{1,n} &{}\quad 0 &{}\quad x_{3,n} &{}\quad x_{4,n} \\ \end{array} \right) \end{aligned}$$

From (13) we get

$$\begin{aligned} p_{\tau }({\varvec{\theta }}|D,{\mathcal {M}}_1,{\mathcal {M}}_2)\propto & {} p(D|{\varvec{\theta }},{\mathcal {M}}_2)^{\tau } p(D|{\varvec{\theta }},{\mathcal {M}}_1)^{1-\tau } p({\varvec{\theta }}|{\mathcal {M}}_1,{\mathcal {M}}_2)\nonumber \\\propto & {} N(\mathbf{{D}}^{(1)}{\varvec{\theta }},\sigma ^2\mathbf{I})^{1-\tau } N(\mathbf{{D}}^{(2)}{\varvec{\theta }},\sigma ^2\mathbf{I})^{\tau } N({\varvec{\mu }}_0,\sigma ^2\delta ^2\mathbf{I}) \nonumber \\\propto & {} \exp \left( \frac{-(1-\tau )}{2\sigma ^2} \left[ \mathbf{{D}}^{(1)}{\varvec{\theta }}-\mathbf{y}\right] ^{{}^{\top }} \left[ \mathbf{{D}}^{(1)}{\varvec{\theta }}-\mathbf{y}\right] \right) \nonumber \\&\quad \exp \left( \frac{-\tau }{2\sigma ^2} \left[ \mathbf{{D}}^{(2)}{\varvec{\theta }}-\mathbf{y}\right] ^{{}^{\top }} \left[ \mathbf{{D}}^{(2)}{\varvec{\theta }}-\mathbf{y}\right] \right) \end{aligned}$$
(23)
$$\begin{aligned}&\exp \left( \frac{-1}{2\sigma ^2\delta ^2} [{\varvec{\theta }}-\varvec{\mu }_0]^{{}^{\top }} [{\varvec{\theta }}-\varvec{\mu }_0] \right) \nonumber \\= & {} \exp \left( \frac{-1}{2\sigma ^2} {\varvec{\theta }}^{{}^{\top }} \left[ \tau \{\mathbf{{D}}^{(2)}\}^{{}^{\top }}\mathbf{{D}}^{(2)}+ (1-\tau )\{\mathbf{{D}}^{(1)}\}^{{}^{\top }}\mathbf{{D}}^{(1)}+ \delta ^{-2}{} \mathbf{I}\right] {\varvec{\theta }}\right) \nonumber \\&\exp \left( \frac{1}{\sigma ^2} {\varvec{\theta }}^{{}^{\top }} \left( \left[ \tau \{\mathbf{{D}}^{(2)}\}^{{}^{\top }} + (1-\tau )\{\mathbf{{D}}^{(1)}\}^{{}^{\top }} \right] \mathbf{y}+ \delta ^{-2}\varvec{\mu }_{0}\right) \right) C(\mathbf{y})\nonumber \\ \end{aligned}$$
(24)

where the factor \(C(\mathbf{y})\) does not depend on \({\varvec{\theta }}\). Comparing this with the identity

$$\begin{aligned} N({\varvec{\theta }}|\varvec{\mu },\sigma ^2\mathbf{H}^{-1})\propto & {} \exp \left( \frac{-1}{2\sigma ^2} [{\varvec{\theta }}-\varvec{\mu }]^{{}^{\top }}{} \mathbf{H} [{\varvec{\theta }}-\varvec{\mu }] \right) \nonumber \\= & {} \exp \left( \frac{-1}{2\sigma ^2} {\varvec{\theta }}^{{}^{\top }}{} \mathbf{H}{\varvec{\theta }}\right) \exp \left( \frac{1}{\sigma ^2} {\varvec{\theta }}^{{}^{\top }}{} \mathbf{H}\varvec{\mu }\right) C(\varvec{\mu }) \end{aligned}$$

we get:

$$\begin{aligned} p_{\tau }({\varvec{\theta }}|D,{\mathcal {M}}_1,{\mathcal {M}}_2) \; = \; N({\varvec{\theta }}|\varvec{\mu },\sigma ^2\mathbf{H}^{-1}) \end{aligned}$$
(25)

where

$$\begin{aligned} \mathbf{H}= & {} \tau \{\mathbf{{D}}^{(2)}\}^{{}^{\top }} \mathbf{{D}}^{(2)}+ (1-\tau )\{\mathbf{{D}}^{(1)}\}^{{}^{\top }} \mathbf{{D}}^{(1)}+ \delta ^{-2} \mathbf{I}, \nonumber \\ \varvec{\mu }= & {} \mathbf{H}^{-1} \left( \left[ \tau \{\mathbf{{D}}^{(2)}\}^{{}^{\top }} + (1-\tau )\{\mathbf{{D}}^{(1)}\}^{{}^{\top }} \right] \mathbf{y}+ \delta ^{-2} {\varvec{\mu }}_0 \right) \end{aligned}$$
(26)

Hence, we can directly sample \({\varvec{\theta }}\) from the tempered conditional distributions in a Gibbs sampling scheme without having to resort to Metropolis-Hastings. For linear models where the variance \(\sigma ^2\) is not known and has to be sampled from the tempered posterior distribution too, we refer to Appendix 7.6.

3.6 Sigmoid inverse temperature ladder

Given a single model \({\mathcal {M}}\), conventional TI follows an inverse annealing path from the prior \(p({\varvec{\theta }}|{\mathcal {M}})\) to the posterior \(p({\varvec{\theta }}|{\mathcal {M}},D)\), symbolically \(p({\varvec{\theta }}|{\mathcal {M}})\rightarrow p({\varvec{\theta }}|{\mathcal {M}},D)\). Unlike TI, NETI-DIFF is based on a direct transition from the posterior of one model \({\mathcal {M}}_1\) to the posterior of another model \({\mathcal {M}}_2\), \(p({\varvec{\theta }}|{\mathcal {M}}_1,D)\rightarrow p({\varvec{\theta }}|{\mathcal {M}}_2,D)\). For nested models, e.g. \({\mathcal {M}}_1 \subset {\mathcal {M}}_2\), we start at the less complex model \({\mathcal {M}}_1\) and move towards the more complex model \({\mathcal {M}}_2\), e.g. using the power-law inverse temperature ladder, defined in Eq. (8). For a power \(\alpha >1\) the distances \(\tau _{i+1}-\tau _{i}\) between neighbouring discretisation points \(\tau _{k}\) and \(\tau _{k+1}\) increase in k and the discretisation points will be concentrated around the nested model, \({\mathcal {M}}_1\) (\(\tau =0\)), and fewer points will be set near \({\mathcal {M}}_2\) (\(\tau =1\)). However, in many applications non-nested models have to be compared, and it is then not clear which of the two models should be used as starting point. Imbalances can be avoided by choosing a sigmoid inverse temperature ladder, such that the discretisation points are mirrored at the midpoint \(\tau ^{\star }=0.5\) of the interval [0, 1]. Every discretisation point \(\tau <0.5\) closer to \({\mathcal {M}}_1\) then has its counterpart \(\tau ^{\star }=1-\tau \) with the same distance \(\tau \) to \({\mathcal {M}}_2\), and vice-versa.

To obtain a sigmoid inverse temperature ladder for NETI-DIFF we apply the following procedure. We first specify 50% of the discretisation points \(\tau _1<\cdots <\tau _{\frac{N_{iter}}{2}}\) within the interval [0, 0.5], and then we mirror the ladder at the midpoint \(\tau =0.5\).Footnote 5 This yields the remaining \(50\%\) of the discretisation points, \(\tau _{\frac{N_{iter}}{2}+i}=1-\tau _{\frac{N_{iter}}{2}+1-i}\) (\(i=1,\ldots ,\frac{N_{iter}}{2}\)). As we want the first \(50\%\) of the discretisation points to get as close as possible to the midpoint \(\tau =0.5\) subject to a power law with power \(\alpha \), we determine the minimal integer \(N^{\star }\) such that

$$\begin{aligned} \tau _i := \left( \frac{i}{N^{\star }}\right) ^\alpha < 0.5 \quad (i=1,\ldots ,\frac{N_{iter}}{2}) \end{aligned}$$
(27)

The solution is: \( N^{\star } = \lfloor x^{\star } \rfloor \), where

$$\begin{aligned} \left( \frac{N_{iter}}{2x^{\star }}\right) ^\alpha = 0.5 \Leftrightarrow x^{\star } = \frac{N_{iter}}{2} \cdot 0.5^{-\frac{1}{\alpha }} \end{aligned}$$
(28)

4 Benchmark problems and data

We have evaluated the proposed method on four benchmark data sets. Given data \(D\) the goal is to estimate the log Bayes factor \({\mathcal {B}}\) between two models \({\mathcal {M}}_1\) and \({\mathcal {M}}_2\). We assume the models to be equally likely a priori, \(p({\mathcal {M}}_1)=p({\mathcal {M}}_2)\), so that the Bayes factor is the ratio of marginal likelihoods:

$$\begin{aligned} {\mathcal {B}}({\mathcal {M}}_1,{\mathcal {M}}_2) = \log \left\{ \frac{ p({\mathcal {M}}_2|D)}{p({\mathcal {M}}_1|D)} \right\} = \log \left\{ \frac{ p(D|{\mathcal {M}}_2)}{p(D|{\mathcal {M}}_1)} \right\} \end{aligned}$$
(29)

For nonuniform prior distributions, \(p({\mathcal {M}}_1)\ne p({\mathcal {M}}_2)\), it is straightforward to add the correction factor \(\log p({\mathcal {M}}_2)/p({\mathcal {M}}_1)\), which is computationally cheap compared to the marginal likelihood ratio.

For method evaluation, we need to compare with a ground truth. For a linear model, we have a proper ground truth, as the Bayes factor can be computed analytically. This applies to the Radiata pine data (Sect. 4.1) and the Radiocarbon data (Sect. 4.3). For the Pima Indian data (Sect. 4.2), we use a generalised linear model, and for the biopathway data (Sect. 4.4), we use a nonlinear model. In these cases, a closed-form solution of the Bayes factor does not exist. For the Pima Indian data, we follow the method suggested in Friel et al. (2014) and use the numerical result from a very long MCMC run as an approximate gold standard. For the biopathway data, we use the knowledge of the true interaction structure of the system as a surrogate gold standard and assess the performance in terms of network reconstruction accuracy. We think this provides an adequate balance between using linear models, for which a strong ground truth exists, and generalised linear/non-linear models, for which a strong ground truth is intrinsically unavailable, and a weaker surrogate ground truth has to be used instead.

4.1 Radiata pine

The Radiata pine data have been used in Friel et al. (2014) and were originally published in Williams (1959). Like Friel et al. (2014) we focus on the log Bayes factor between two competing non-nested linear regression models for explaining the ’maximum compression strength’ y of \(n=42\) Radiata pine specimens. Both linear models contain an intercept and one single covariate. The first model (\({\mathcal {M}}_1\)) uses the ’density’ \(x_1\) and the second model (\({\mathcal {M}}_2\)) the ’adjusted density’ \(x_2\) of the specimen. After standardizing the observation vectors \({\mathbf{x}}_1\) and \({\mathbf{x}}_2\) of the two covariates to mean 0, the likelihood of model \({\mathcal {M}}_k\) (\(k=1,2\)) is:

$$\begin{aligned} {\mathcal {M}}_k: p({\mathbf{y}}|{\varvec{\theta }}^{(k)},\sigma ^2) = N({\mathbf{D}}^{(k)} {\varvec{\theta }}^{(k)}, \sigma ^2 {\mathbf{I}}) \end{aligned}$$
(30)

where \({\mathbf{y}}\) is the vector of the observed ’maximum compression strengths’, \({\mathbf{D}}^{(k)}=({\mathbf{1}},{\mathbf{x}}_k)\) is the n-by-2 design matrix and \({\varvec{\theta }}^{(k)}\) is the 2-dimensional vector of regression coefficients of model \({\mathcal {M}}_k\). Both models share the intercept parameter \(\theta _0\), but differ w.r.t. the second parameter, i.e. \({\varvec{\theta }}^{(k)}=(\theta _0,\theta _k)^{\top }\). For comparability we use exactly the same Bayesian model as in Friel et al. (2014), where an inverse Gamma prior is imposed on the noise variance: \(p(\sigma ^{-2})= GAM(3,2\cdot 300^2 )\) and Gaussian priors are used for the regression coefficient vectors:Footnote 6

$$\begin{aligned} {\mathcal {M}}_k: p({\varvec{\theta }}^{(k)}) =N \left( \left( \begin{array}{r}3000\\ 185\\ \end{array} \right) , \left( \begin{array}{cc}0.06^{-1}&{}0\\ 0&{}6^{-1}\end{array} \right) \right) \end{aligned}$$
(31)

This is a model with fully conjugate priors, so that the marginal likelihoods \(p({\mathbf{y}}|{\mathcal {M}}_k)\) can be computed in closed form (Friel et al. 2014). With Eq. (29) we obtain for the log Bayes factor \({\mathcal {B}}({\mathcal {M}}_1,{\mathcal {M}}_2)=8.8571\). Like Friel et al. (2014) we apply Gibbs sampling and re-sample the model parameters iteratively from their full conditional distributions: \(p(\sigma ^2|{\mathbf{y}},{\varvec{\theta }}^{(k)})\) and \(p({\varvec{\theta }}^{(k)}|{\mathbf{y}},\sigma ^2)\).

4.2 Pima Indians

The Pima Indians data have also been used in Friel et al. (2014) and were originally published in Smith et al. (1988). Like Friel et al. (2014) we focus on the log Bayes factor between two nested logistic regression models for explaining the binary ’diabetes disease status’ y of \(n=532\) female Pima Indians. The first model (\({\mathcal {M}}_1\)) contains an intercept and 4 covariates, namely ’the number of pregnancies’, ’the plasma glucose concentration’, ’the body mass index’, and ’the diabetes pedigree function’, while the second model (\({\mathcal {M}}_2\)) extends model \({\mathcal {M}}_1\) by including one additional covariable ’age’. After standardizing all covariates to mean 0 and variance 1, the likelihood of model \({\mathcal {M}}_k\) (\(k=1,2\)) is:

$$\begin{aligned} {\mathcal {M}}_k: p({\mathbf{y}}| {\varvec{\theta }}^{(k)} ) = \prod _{i=1}^{n} \frac{\left\{ \exp (-{\mathbf{x}}_{i,k}^{\top }{\varvec{\theta }}^{(k)}) \right\} ^{y_i} }{1+\exp (-{\mathbf{x}}_{i,k}^{\top }{\varvec{\theta }}^{(k)})} \end{aligned}$$

where the i-th element of \({\mathbf{y}}\), \(y_i\in \{0,1\}\), is the diabetes status of female i, \({\mathbf{x}}_{i,k}\) is the corresponding vector of covariates, including an initial 1 for the intercept, and \({\varvec{\theta }}^{(k)}\) is the vector of regression coefficients of dimension \(m=5\) (\({\mathcal {M}}_1\)) or \(m=6\) (\({\mathcal {M}}_2\)). Again we follow Friel et al. (2014) and impose the following Gaussian priors on the regression coefficient vectors: \(p({\varvec{\theta }}^{(k)}|\delta ^2) = N({\mathbf{0}}, \delta ^2 {\mathbf{I}} )\), where \(\delta ^2=100\) gives rather uninformative priors. For the logistic regression neither the marginal likelihoods nor the full conditional distributions can be computed in closed form. We therefore use the Metropolis Hastings based Markov chain Monte Carlo (MCMC) sampling scheme from Friel et al. (2014), which employs the following proposal mechanism: In each iteration a new candidate regression coefficient vector is obtained by adding a sample \({\mathbf{u}}\) from an m-dimensional multivariate Gaussian distribution to the current vector \({\varvec{\theta }}^{(k)}\). The Gaussian distribution of \({\mathbf{u}}\) has a zero mean vector and a diagonal covariance matrix, whose diagonal entries \(d_1,\ldots ,d_m\) depend on the inverse temperature \(\tau \in [0,1]\) of the power posterior. For the TI approaches we set: \(d_i=\min \{0.01\tau ^{-1},100\}\), as in Friel et al. (2014). For the proposed NETI-DIFF approach we use \(d_6=\min \{0.01 \tau ^{-1},100\}\), while we fix the first five diagonal entries \(d_1,\ldots ,d_5=0.01\). This modification is required, as the first five regression coefficients appear in both models \({\mathcal {M}}_1\) and \({\mathcal {M}}_2\). That is, they effectively appear constantly with inverse temperature \(\tau =1\) throughout NETI-DIFF simulations. The marginal likelihoods cannot be computed in closed-form. We therefore use those values reported in Friel et al. (2014), which were obtained from long TI simulations, as gold-standard: \(\log \{p({\mathbf{y}}|{\mathcal {M}}_1)\}=-257.2342\) and \(\log \{p({\mathbf{y}}|{\mathcal {M}}_2)\}=-259.8519\). Equation (29) yields the log Bayes factor: \({\mathcal {B}}({\mathcal {M}}_1,{\mathcal {M}}_2)=-2.6177\).

4.3 Radiocarbon dating

We use the Radiocarbon data from Pearson and Qua (1993) to compute the Bayes factors among 10 nested linear regression models. For predicting the ’true calendar age’ y of \(n=343\) Irish oaks from one single covariable: ’the Radiocarbon dating process’ x, we fit polynomial calibration curves \({\mathcal {M}}_i\) (\(i=1,\ldots ,10\)) of the following type:

$$\begin{aligned} {\mathcal {M}}_i: y = \theta _0 + \sum _{j=1}^i \theta _j x^j \end{aligned}$$
(32)

The likelihood of model \({\mathcal {M}}_i\) is then

$$\begin{aligned} {\mathcal {M}}_i: p({\mathbf{y}}|\sigma ^2,{\varvec{\theta }}^{(i)}) = N({\mathbf{D}}^{(i)} {\varvec{\theta }}^{(i)} , \sigma ^2 {\mathbf{I}}) \end{aligned}$$
(33)

where \({\mathbf{y}}\) is the vector of calendar ages, \({\varvec{\theta }}^{(i)}=(\theta _0,\theta _1,\ldots ,\theta _i)^{\top }\) is the vector of regression coefficients, and \({\mathbf{D}}^{(i)}\) is the n-by-\((i+1)\) design matrix. The first column of the design matrix consists entirely of ones (for the intercept), and the subsequent columns are built from the observation vector \({\mathbf{x}}\), \({\mathbf{D}}^{(i)}=({\mathbf{1}},{\mathbf{x}}^1,\ldots ,{\mathbf{x}}^i)\), where \({\mathbf{x}}^j\) denotes an element-wise power operation on \({\mathbf{x}}\). We impose conjugate priors on the parameters. For \(\sigma ^2\) we use an inverse Gamma distribution: \(p(\sigma ^{-2})= GAM(\frac{a}{2},\frac{b}{2})\), and on \({\varvec{\theta }}^{(i)}\) we impose Gaussian priors:

$$\begin{aligned} p({\varvec{\theta }}^{(i)}|\sigma ^2,\delta ^2) = N({\mathbf{0}},\sigma ^2\delta ^2 {\mathbf{I}}) \end{aligned}$$
(34)

For fixed hyperparameters a, b, and \(\delta ^2\) the marginal likelihood for a model \({\mathcal {M}}\) with design matrix \({\mathbf{D}}\) is given by:

$$\begin{aligned} p({\mathbf{y}}|{\mathcal {M}}) = \frac{ \varGamma (\frac{n+a}{2}) \cdot b^{\frac{a}{2}} \cdot (b + {\mathbf{y}}^{\top } ({\mathbf{I}}+\delta ^2 {\mathbf{D}} {\mathbf{D}}^{\top })^{-1} {\mathbf{y}})^{-\frac{n+a}{2} } }{ \varGamma (\frac{a}{2}) \cdot \pi ^{\frac{n}{2}} \cdot \det \left( {\mathbf{I}} + \delta ^2 {\mathbf{D}} {\mathbf{D}}^{\top } \right) } \end{aligned}$$

so that the log Bayes factors \({\mathcal {B}}({\mathcal {M}}_{i},{\mathcal {M}}_{l})\) for two models \({\mathcal {M}}_i\) and \({\mathcal {M}}_l\) can be computed in closed form with Eq. (29). For the Radiocarbon data we fix \(a=b=0.2\), \(\delta ^2=1\), and we sample the parameters iteratively from their conditional distributions \(p(\sigma ^2|{\mathbf{y}},{\varvec{\theta }}^{(i)})\) and \(p({\varvec{\theta }}^{(i)}|{\mathbf{y}},\sigma ^2)\) with Gibbs sampling.

Fig. 1
figure 1

Gene regulatory networks of the circadian clock in Arabidopsis thaliana: wildtype and mutant. The network displayed in panel a is the P2010 network proposed by Pokhilko et al. (2010). Panel b shows a mutant network, in which the proteins PRR9 and PRR7 are dysfunctional and can no longer form a protein complex with NI. The nodes in the network represent proteins and genes, the edges indicate interactions. Arrows symbolize activations and bars inhibitions. Solid lines show protein-gene interactions; dashed lines show protein interactions. The regulatory influence of light is symbolized by a sun symbol. Grey boxes group sets of regulators or regulated components. Figure reproduced from Aderhold et al. (2014)

4.4 Biopathway

The objective of the last application is model selection with respect to two alternative candidate interaction structures of ten genes in the circadian gene regulatory network of Arabidopsis thaliana, shown in Fig. 1. The statistical model used for inference is a semi-mechanistic Bayesian hierarchical model for transcriptional regulation (Aderhold et al. (2017)). Let \(x_i(t)\) denote the mRNA concentration of gene i at time t, and \(\pi _i\) the set of its regulators. For instance, in the gene network of Fig. 1a, the regulators of gene PRR9 are two other genes, TOC1 and LHY. So if \(i=PRR9\), then \(\pi _i=\{TOC1,LHY\}\). A regulator can either act as activator or as repressor, and we represent that with the binary variable \(I_{u,i}\), with \(I_{u,i}=1\) indicating that gene u is an activator for gene i, and \(I_{u,i}=0\) indicating that gene u is an inhibitor for gene i. For the example above, LHY is an activator for PRR9, hence \(I_{u,i}=1\), while TOC1 is an inhibitor for PRR9, hence \(I_{u,i}=0\). From the fundamental equation of transcriptional regulation based on Michaelis–Menten kinetics we have for the gradient of \(x_i\) (Barenco et al. 2006):

$$\begin{aligned} \frac{d x_i(t)}{dt}|_{t=t^{\star }} = -v_{0,i} x_i(t^{\star }) + \sum _{u\in \pi _i} v_{u,i} \frac{I_{u,i} x_u(t^{\star })+(1-I_{u,i}) k_{u,i}}{x_u(t^{\star })+k_{u,i}} \end{aligned}$$
(35)

where the sum is over all genes u that are in the regulator set of \(\pi _i\) of gene i. The first term, \(-v_{0,i} x_i(t^{\star })\), takes the degradation of \(x_i\) into account, while \(v_{u,i}\) and \(k_{u,i}\) are the maximum reaction rate and Michaelis–Menten parameters for the regulatory effect of gene \(u\in \pi _i\) on gene i, respectively. See the supplementary material of Pokhilko et al. (2010, 2012) for similar examples in the mathematical biology literature. Without loss of generality, we now assume that \(\pi _i\) is given by \(\pi _i= \{x_1,\ldots ,x_s\}\). Equation (35) can then be written in vector notation:

$$\begin{aligned} \frac{d x_i(t)}{dt}|_{t=t^{\star }} = {\mathbf{D}}_{i,t^{\star }}^{\top } {\mathbf{V}}_i \end{aligned}$$
(36)

where \({\mathbf{V}}_i=(v_{0,i},v_{1,i}\ldots ,v_{s,i})^{\top }\) is the vector of the maximum reaction rate parameters, and the vector \({\mathbf{D}}_{i,t^{\star }}\) depends on the measured concentrations \(x_u(t^{\star })\) and the Michaelis–Menten parameters \(k_{u,i}\) (\(u\in \pi _i\)) via Eq. (35):

$$\begin{aligned} {\mathbf{D}}_{i,t^{\star }}^{\top } = \Big (-x_i(t^{\star }),\frac{I_{1,i} x_1(t^{\star })+(1-I_{1,i}) k_{1,i}}{x_1(t^{\star })+k_{1,i}},\ldots , \frac{I_{s,i} x_{s}(t^{\star })+(1-I_{s,i}) k_{s,i}}{x_s(t^{\star })+k_{s,i}} \Big )\nonumber \\ \end{aligned}$$
(37)

We combine the s Michaelis–Menten parameters \(k_{u,i}\) in a vector \({\mathbf{K}}_i =(k_{1,i}\ldots ,k_{s,i})^{\top }\). For n time points \(t^{\star }\in \{t_1,\ldots ,t_n\}\) we obtain n row vectors from Eq. (37), and we can arrange them successively in an n-by-\((|\pi _i|+1)\) design matrix \({\mathbf{D}}_i={\mathbf{D}}_i({\mathbf{K}}_i)\). The corresponding gradient vector is given by \({\mathbf{y}}_i:=(y_{i,1},\ldots ,y_{i,n})^{\top }\), where \(y_{i,j}\) is the gradient of \(x_i\) at time point \(t_j\). With \({\mathbf{y}}_i\) being the response vector the likelihood is:

$$\begin{aligned} p({\mathbf{y}}_i|{\mathbf{K}}_i,{\mathbf{V}}_i,\sigma _i^2)= (2\pi \sigma _i^2)^{-\frac{n}{2}} e^{-\frac{1}{2\sigma ^2_{i}} ({\mathbf{y}}_i - {\mathbf{D}}_i {\mathbf{V}}_i)^{\top } ({\mathbf{y}}_i - {\mathbf{D}}_i {\mathbf{V}}_i)} \end{aligned}$$

where \({\mathbf{D}}_i={\mathbf{D}}_i({\mathbf{K}}_i)\) is the design matrix, given the Michaelis–Menten parameter vector \({\mathbf{K}}_i\). To ensure non-negative Michaelis–Menten parameters, truncated Normal prior distributions are used:

$$\begin{aligned} {\mathbf{K}}_i \sim {\mathcal {N}}_{\left\{ {\mathbf{K}}_i\ge 0 \right\} }({\mathbf{1}},\nu {\mathbf{I}}) \end{aligned}$$
(38)

where \(\nu >0\) is a hyperparameter, and the subscript, \(\left\{ {\mathbf{K}}_i\ge 0 \right\} \), indicates the truncation condition, i.e. that each element of \({\mathbf{K}}_i\) has to be non-negative. For the maximum reaction rates, we use a truncated ridge regression prior:

$$\begin{aligned} {\mathbf{V}}_i|\sigma ^2_i,\delta ^2_i \sim {\mathcal {N}}_{\left\{ {\mathbf{V}}_i\ge 0\right\} }( {\mathbf{1}}, \delta ^2_i \sigma _i^2 {\mathbf{I}} ) \end{aligned}$$
(39)

where \(\delta ^2_i\) is a hyperparameter that regulates the prior strength. For \(\sigma _i^2\) and \(\delta ^2_i\) we use inverse Gamma priors, \(\sigma _i^{2}\sim IG(a_{\sigma },b_{\sigma })\) and \(\delta ^2_i \sim IG(a_{\delta },b_{\delta })\). A graphical model representation can be found in Fig. 2.

Fig. 2
figure 2

Hierarchical Bayesian model used for gene regulatory network reconstruction. Grey nodes refer to fixed quantities such as the observed response data or low-level hyperparameters. White nodes refer to quantities that can change, which includes the model parameters and high-level hyperparameters. Note that the design matrix \(\mathbf{D}_i\) is not fixed because it depends on the Michaelis–Menten parameter vector \(\mathbf{K}_i\)

The posterior distribution of the parameters and hyperparameters has no closed-form solution, and we therefore resort to an MCMC scheme to sample from it. From the graphical model in Fig. 2 it can be seen that with the sole exception of the Michaelis–Menten parameters \(\mathbf{K}_i\), the conditional distribution of each parameter conditional on its Markov blanketFootnote 7 is of standard form (due to conjugacy) and can be sampled from directly. The MCMC scheme is therefore of the form of a Gibbs sampler, in which all parameters are sampled directly from their conditional distributions, except for \(\mathbf{K}_i\), which is sampled via a Metropolis-Hastings within Gibbs step. The conditional distribution of the maximum rate parameter vector \(\mathbf{V}_i\) is obtained from Eqs. (2526) by replacing \({\varvec{\theta }}\) by \( \mathbf{V}_i, \) and adding an index i, for association with gene i, to all other quantities except for the identity matrix \(\mathbf{I}\) and the inverse temperature \(\tau \). The derivation of the other conditional distributions is straightforward. Pseudo code of the standard MCMC algorithm can be found in Aderhold et al. (2017). Pseudo code of the modified MCMC algorithm integrated into the proposed NETI-DIFF scheme is provided in the Appendix, Table 2.

The data used for inference were obtained from Aderhold et al. (2014). These are synthetic gene expression time series, which were generated from a biologically realistic simulation of the molecular interactions in these networks, using the mathematical framework described in Guerriero et al. (2012) and implemented in the Biopepa software package (Ciocchetta and Hillston 2009). These time series correspond to gene expression measurements in 2h intervals over 24 h, repeated 11 times for different experimental conditions related to various gene knockouts. We repeated the simulations twice, for both of the two networks shown in Fig. 1. Hence, the true interaction network is known, which can be used to evaluate the accuracy of Bayesian model selection based on the modelling framework described above.

5 Results

In this section, we compare the efficiency and accuracy of three algorithms: standard thermodynamic integration (TI-standard) and optimal thermodynamic integration (TI-optimal) for computing the log marginal likelihood, and the proposed non-equilibrium thermodynamic integration for directly targeting the difference of the log marginal likelihood (NETI-DIFF).

In TI-standard we compute, based on Eq. (5), the expectation of the log likelihood w.r.t. the power posterior, \( {\mathbb {E}}_{\tau }[ \log p(D|{\varvec{\theta }},{\mathcal {M}})] \), for a set of a priori fixed inverse temperatures \(\{\tau _i\}\), \(i=1,\ldots ,K\), spaced according to the power law of Eq. (8). Following Friel et al. (2014) we have set \(K \in \{10,20,50,100\}\) and \(\alpha =5\) in Eq. (8). The log marginal likelihood is computed with the trapezoid rule (Eq. 7).

TI-optimal uses the two improvements proposed in Friel et al. (2014): the log marginal likelihood is computed with the improved numerical integration (Eq. 10), and the inverse temperatures are set iteratively according to an optimality criterion that aims to minimise the expected uncertainty; see Friel et al. (2014) for details.Footnote 8

Finally, NETI-DIFF is the algorithm proposed in the present article.

For each inverse temperature \(\tau \) in TI-standard and TI-optimal, we discarded the first 20% of the MCMC steps as burn-in (following Friel et al. (2014)). For NETI-DIFF, we discarded the first 1000 MCMC steps with the inverse temperature kept fixed at \(\tau =0\), as burn-in.Footnote 9 We recorded the total number of non-burn-in MCMC steps for all three algorithms, \(N_{iter}\). As discussed in Appendix 7.7 this is a measure of the total computational complexity.

We repeated the MCMC simulations \(N_{simu}=5\) times from different initialisations. Let \({\mathcal {B}}_i\) denote the log Bayes factor obtained from the ith MCMC simulation, and \({\mathcal {B}}_{true}\) the ‘true’ log Bayes factor. For the Bayesian linear regression models applied to the Radiata and Radiocarbon data, a closed-form expression for \({\mathcal {B}}_{true}\) is available. For the Bayesian logistic regression model applied to the Pima Indians data, and the hierarchical Bayesian model from Fig. 2 for biopathway data, the log Bayes factor is not analytically tractable, and \({\mathcal {B}}_{true}\) was obtained from a very long simulation, as in Friel et al. (2014). We assessed the intrinsic estimation uncertainty in terms of the variance:

$$\begin{aligned} {\mathbb {V}}= \frac{1}{N_{simu}-1} \sum _{i=1}^{N_{simu}} [{\mathcal {B}}_i - \overline{{\mathcal {B}}}]^2, \quad \quad \overline{{\mathcal {B}}} = \frac{1}{N_{simu}} \sum _{i=1}^{N_{simu}} {\mathcal {B}}_i \end{aligned}$$
(40)

and the accuracy in terms of the mean absolute error:

$$\begin{aligned} {\mathbb {A}}= \frac{1}{N_{simu}} \sum _{i=1}^{N_{simu}} |{\mathcal {B}}_i - {\mathcal {B}}_{true}| \end{aligned}$$
(41)

5.1 Radiata pine and Pima Indians

We start our empirical evaluation study with the analysis of the Radiate pine data (Sect. 4.1) and the Pima Indians data (Sect. 4.2). These two data sets have been used in the literature before for the evaluation of the TI method proposed by Friel et al. (2014), and in both cases the goal is to estimate the Bayes factor between two competing Bayesian regression models. For the Radiata pine data we compare two non-nested linear regression models. For the Pima Indians data we compare two logistic regression models, where the first model, \({\mathcal {M}}_1\), is nested in the second, \({\mathcal {M}}_2\) . We apply the NETI-DIFF approach with a sigmoid inverse temperature ladder, defined in Sect. 3.6, and we instantiate NETI-DIFF such that in both applications the transition path runs from the first model, \({\mathcal {M}}_1\) (\(\tau =0\)), to the second, \({\mathcal {M}}_2\) (\(\tau =1\)). For the Pima Indians data this is the natural path, as \({\mathcal {M}}_1\) is nested within \({\mathcal {M}}_2\).

Figures 3 and 4 show the average absolute deviations (Eq. 41) between the analytically computed log Bayes factors and the estimated log Bayes factors for different total iteration numbers \(N_{iter}\). Figure 6 compares the variance of the log Bayes factor estimates for the three different methods: TI-standard, TI-optimal, and NETI-DIFF. Figure 7 shows ratios of the variances obtained with TI-optimal and NETI-DIFF.

For the Radiata data, NETI-DIFF only achieves a slight reduction in the absolute deviation (Fig. 3) and the variance (Figs. 6, 7) for the lowest number of iterations, \(N_{iter}=64k\); otherwise NETI-DIFF and TI-optimal are on a par. Note that the two alternative linear regression models applied to the Radiata data only share the intercept, while their sets of covariables are disjunct. This lack of model overlap presents the least favourable scenario for NETI-DIFF, and our results confirm that there is little room for improvement over standard TI.

Fig. 3
figure 3

Average absolute error on the Radiata pine data: TI versus NETI-DIFF. The figure shows the average absolute deviation between the estimated and the true log Bayes factor in dependence on the total number of MCMC iterations \(N_{iter}\). In each panel the same NETI-DIFF results are shown, while the two TI approaches (TI-standard and TI-optimal) were applied with different numbers of discretisation points (10, 20, 50 and 100). The error bars represent standard deviations. The horizontal axes give the total number of (power posterior) MCMC iterations, \(N_{iter}\)

For the Pima Indians data, NETI-DIFF achieves a significant reduction in the absolute deviation (Fig. 4) and the variance (Figs. 6, 7) and clearly outperforms both TI methods: TI-standard and TI-optimal. The variance reduction ranges between ratios of 5 and 50. As opposed to the models applied to the Radiata data, the alternative logistic regression models applied to the Pima Indians data are nested, with the parameters of the less complex model forming a subset of those of the more complex one. Our results demonstrate that in this scenario, the new thermodynamic integration path of NETI-DIFF has potential for significant improvement over the established TI methods.

Fig. 4
figure 4

Average absolute error on the Pima Indians data: TI versus NETI-DIFF. The figure shows the average absolute deviation between the estimated and the true log Bayes factor in dependence on the total number of MCMC iterations \(N_{iter}\). In each panel the same NETI-DIFF results are shown, while the two TI approaches (TI-standard and TI-optimal) were applied with different numbers of discretisation points (10, 20, 50 and 100). The error bars represent standard deviations. The horizontal axes give the total number of (power posterior) MCMC iterations, \(N_{iter}\)

Fig. 5
figure 5

Comparison of two inverse temperature ladders and two NETI-DIFF paths. The vertical bars show the average absolute deviations between the estimated and true log Bayes factor, with error bars representing standard deviations. The horizontal axes give the total number of MCMC iterations \(N_{iter}\). The two inverse temperature ladders compared are the power law, Eq. (8), versus the sigmoid function, defined in Sect. 3.6. The alternative NETI-DIFF path swaps the initial model at \(\tau =0\) with the final model at \(\tau =1\). Top row: Radiata pine data. Bottom row: Pima Indians data

We also investigated the effect of the inverse temperature ladder (’sigmoid’ vs. ’power 5’) and the starting point (\({\mathcal {M}}_1\) vs. \({\mathcal {M}}_2\)). To this end, we systematically applied the proposed NETI-DIFF approach with all four combinations (two inverse temperature ladders times two starting points) to the two data sets: Radiata pine and Pima Indians. The results can be found in Fig. 5. First, consider the Pima Indians data, where the two alternative models are nested, and the power inverse temperature scheme of Eq. (8) has been applied. There is a clear advantage of starting the thermodynamic integration at the less complex model over starting at the more complex model: the absolute errors are significantly higher in the latter case. This is not surprising. It is well known from standard TI for computing marginal likelihoods that for the power law of Eq. (8), the optimal transition path is from the prior to the posterior, with the majority of the inverse temperature points at the prior end. Applying this principle to NETI-DIFF, starting the transition path for the differential parameters (i.e. the parameters that are only in the more complex model) at the prior, implies that the overall inverse temperature transition path has to lead from the less complex to the more complex model, in confirmation of our findings. Interestingly, for the sigmoid temperature ladder from Sect. 3.6, the difference between the two directions is substantially reduced, which is a natural consequence of the symmetry inherent in this scheme. There is no significant performance difference between the sigmoid and the power law inverse temperature paths when the models are nested (Pima Indians data, top row in Fig. 5). For the Radiata pine data on the other hand (bottom row in Fig. 5), where the alternative models are not nested, the power law of Eq. (8) is intrinsically suboptimal,Footnote 10 and the sigmoidal inverse temperature path of Sect. 3.6 is to be preferred.

5.2 Radiocarbon dating

Next, we consider model selection amongst different polynomial orders for polynomial regression on the Radiocarbon data. Since this is a linear model, the log Bayes factor is known and can be used for evaluating the accuracy of the different thermodynamic integration schemes. Besides comparing the proposed NETI-DIFF scheme with the established TI methods, we investigate the influence of the inverse temperature ladder and the transition path. Due to the comparatively low computational costs, we have increased the number of discretisation points from \(K \in \{10,20,50,100\}\) to \(K \in \{20,50,100,200\}\).

Figure 8 shows the absolute error (see Eq. 41) for NETI-DIFF and the better of the two established TI methods: TI-optimal. The task is to compute the log Bayes factor for the pairwise comparison of various polynomial orders, as indicated by the horizontal axis of each panel. It turns out that for TI-optimal, the accuracy of the estimate deteriorates with increasing difference of the model orders (black bars in the top panels of Fig. 8), while NETI-DIFF is unaffected by model choice.Footnote 11 In addition, NETI-DIFF considerably outperforms TI-optimal for the lower iteration numbers, as again seen from the top row in Fig. 8.

The right column of Fig. 6 compares the variances between NETI-DIFF and TI-optimal, and the right column of Fig. 7 shows the corresponding variance ratios. It is seen that NETI-DIFF consistently outperforms TI-optimal, with the variance ratios ranging between 5 and 2000. It appears that for low iteration numbers \(N_{iter}\), the improvement is most pronounced when the alternative models differ substantially (polynomial order 1 vs. 9), while for high iteration numbers \(N_{iter}\), the clearest improvement is achieved when the alternative models are more similar (polynomial orders 4 vs. 6).

Fig. 6
figure 6

Variance of log Bayes factor estimators. Left panel: Radiata pine data, centre panel: Pima Indians data, right panel: Radiocarbon data. The vertical bars show the variance \({\mathbb {V}}\), Eq. (40), for the TI-standard, TI-optimal and NETI-DIFF estimators of the log Bayes factor. For the Radiata pine and the Pima Indians data we varied the number of total MCMC iteration (horizontal axes). For the Radiocarbon data we performed \(N_{iter}=1024k\) iterations and considered four different pairwise model comparisons (horizontal axis). The rows represent different numbers of discretisation points for TI (NETI-DIFF is unaffected). The three columns refer to the four panels in Fig. 3 (right), Fig. 4 (center) and Fig. 8 (right). The corresponding ratios of the variances are shown in Fig. 7

Fig. 7
figure 7

Variance ratios of log Bayes factors estimators. Left panel: Radiata pine data, centre panel: Pima Indians data, right panel: Radiocarbon data. The vertical bars show the variance ratios of the log Bayes factor estimators: TI-standard versus NETI-DIFF, and TI-optimal versus NETI-DIFF (obtained from the variances in Fig. 6). The horizontal reference line at value 1 indicates equal performance; values above 1 indicate that NETI-DIFF achieves a variance reduction over the established TI schemes. For the Radiata pine and the Pima Indians data we varied the number of total MCMC iterations \(N_{iter}\) (horizontal axes). For the Radiocarbon data we performed \(N_{iter}=1024k\) iterations and considered four different pairwise model comparisons (horizontal axis). The rows represent different numbers of discretisation points for TI (NETI-DIFF is unaffected). The three columns refer to the four panels in Fig. 3 (left), Fig. 4 (center) and Fig. 8 (right)

Fig. 8
figure 8

Average absolute error on the Radiocarbon data: NETI versus TI-optimal. The figure shows the average absolute deviation between the estimated and the true log Bayes factor. In each panel the same NETI-DIFF results are shown, while TI-optimal was applied with different numbers of discretisation points (20, 50, 100 and 200). The bars represent standard deviations and the horizontal axes indicate different model comparisons (polynomials of orders i vs. j). The total number of (power posterior) MCMC iterations was kept fixed at \(N_{iter}=1024k\)

Fig. 9
figure 9

Influence of the inverse temperature ladder and the transition path. The bars show the average absolute deviation \({\mathbb {A}}\), Eq. (41), between the estimated and true log Bayes factor, computed with NETI-DIFF for the Radiocarbon data. The error bars show standard deviations. The horizontal axes indicate different model comparisons (polynomials of orders i vs. j). The total number of (power posterior) MCMC iterations was kept fixed at \(N_{iter}=1024k\). Left panel: Comparison of two NETI-DIFF inverse temperature ladders (sigmoid vs. power 5). Right panel: Comparison of three NETI-DIFF transition strategies (staggered vs. intermediate vs. direct)

Fig. 10
figure 10

Comparison of the two inverse temperature ladders—Radiocarbon data. The figures show the standard deviation of the partial NETI-DIFF integral, Eq. (18), over the partial inverse temperature range \([0,\tau ]\), obtained from five independent NETI-DIFF simulations. The right panel shows a section of the left panel at higher resolution. Dashed line: power law, Eq. (8). Solid line: sigmoid function, defined in Sect. 3.6

The left panel of Fig. 9 compares the two inverse temperature ladders: the power law of Eq. (8) versus the sigmoidal form of Sect. 3.6. Since the models are nested, we would expect the polynomial scheme to perform well, like for the Pima Indians data discussed above. Interestingly, the sigmoidal scheme achieves a better stabilization of the results w.r.t. model order, and a slightly better performance for the largest difference between the polynomial orders of the two alternative models considered. To shed more light on this trend, we have investigated the evolution of the standard deviation of the thermodynamic integral up to a given inverse temperature \(\tau \). The results are shown in Fig. 10. While the power law indeed achieves a lower standard deviation than the sigmoidal scheme at the low-inverse-temperature end (near the low-complex model), it contributes a larger proportion to the standard deviation at the high-inverse-temperature end (near the high-complex model). This suggests that the sparsity of inverse temperatures at the high-inverse-temperature end can be counterproductive due to insufficient sample size.

We finally investigated different model transition paths, with a comparison of three alternative schemes: (1) a staggered path from the low-complexity to the high-complexity model via a series of all intermediate models; (2) a transition via one intermediate model of medium complexity; and (3) a direct transition. The results are shown in the right panel of Fig. 9. The differences are small without a clear trend. This suggests that NETI-DIFF is remarkably robust w.r.t. the choice of the model transition path.

5.3 Biopathway

For the biopathway example, we considered two types of data. The first type was obtained from the wild type gene regulatory network shown in Fig. 1a; the second type was obtained from the mutant network shown in Fig. 1b. As we do not have a closed-form expression of the log Bayes factor we chose, as a proxy, the average of the log Bayes factors obtained with the longest TI and NETI-DIFF simulations, which tended to be in reasonably good agreement. Table 1 shows the values of the log Bayes factor thus obtained, which confirms that Bayesian model selection based on the hierarchical model of Fig. 2 consistently identifies the true gene network.

Table 1 Log Bayes factor for the biopathway data.
Fig. 11
figure 11

Mean absolute error and mean variance for different inverse temperature ladders and biopathway data. Panels a, b show a comparison of the mean absolute error \({\mathbb {A}}\) (Eq. 41) and panels c, d show a comparison of the mean variance \({\mathbb {V}}\) (Eq. 40) between two inverse temperature ladders: the power law from Eq. (8) as black boxes, and the sigmoid form from Sect. 3.6 as white boxes. Results were obtained from 5 independent data instantiations from the wildtype biopathway of Fig. 1a, and 5 independent data instantiations from the PRR7/PRR9 mutant biopathway of Fig. 1b. Histogram height: average. Error bars: standard deviation

In a preliminary study, we compared the two inverse temperature ladders for NETI-DIFF: power law (see Eq. (8)) with power 5, as in Friel et al. (2014), versus the sigmoid transfer function of Sect. 3.6. We repeated the simulations on the 5 data sets of Table 1. From these data sets, we computed the mean of the variance \({\mathbb {V}}\), Eq. (40), and the mean absolute error \({\mathbb {A}}\), Eq. (41). The results are shown in Fig. 11. The trend is not as clear as in Fig. 5. However, the sigmoid inverse temperature ladder achieves more often a performance improvement over the power law (in terms of lower mean absolute error \({\mathbb {A}}\) and average variance \({\mathbb {V}}\)) than the other way round, and we therefore adopted it for all subsequent studies.

The main question of interest is to compare TI and NETI-DIFF with respect to accuracy, estimation uncertainty and computational efficiency. To improve the clarity of the presentation, we only show the comparison between NETI-DIFF and TI-optimal, i.e. the TI scheme with the improvements proposed by Friel et al. (2014). In what follows, we refer to “TI-optimal” simply as “TI”. The simulations were repeated for different total iteration lengths, \(N_{iter}\), ranging from \(N_{iter}=10{,}000\) to \(N_{iter}=6{,}400{,}000\) MCMC steps. We repeated TI for different numbers of inverse temperatures, K, ranging from \(K=10\) to \(K=100\) [(the same values as used in Friel et al. (2014)].

Fig. 12
figure 12

Log Bayes factors for the biopathway data: comparison between NETI-DIFF and TI. The figure shows the distribution of the log Bayes factor \(\log p(D|{\mathcal {M}}_2)/p(D|{\mathcal {M}}_1)\), where \({\mathcal {M}}_1\) is the biopathway from Fig. 1a (wildtype), and \({\mathcal {M}}_2\) is the biopathway from Fig. 1b (PRR7/PRR9 mutant). NETI-DIFF is the same for all four rows. Left column: data generated from \({\mathcal {M}}_1\); negative log Bayes factors select the correct model. Right column: data generated from \({\mathcal {M}}_2\); positive log Bayes factors select the correct model. The horizontal line shows the ‘true’ value of the log Bayes factor (in the sense defined in the text). The box plots show distributions over 5 independent MCMC runs. The horizontal axis shows \(N_{iter}\), the total number of iterations, ranging from 10 to 6400 k a TI with \(K=10\) inverse temperatures. b TI with \(K=20\) inverse temperatures. c TI with \(K=50\) inverse temperatures. d TI with \(K=100\) inverse temperatures

Fig. 13
figure 13

Variance of log Bayes factor estimation: comparison between NET-DIFF and TI on the biopathway data. The variance measures are obtained from five repeated simulations of the same data set. The mean measures correspond to the average from five different data instantiations, obtained from the biopathway of Fig. 1a (wildtype, left column), and from the biopathway of Fig. 1b (PRR7/PRR9 mutant, right column) a Mean of variance \({\mathbb {V}}\), defined in Eq. (40) with standard deviations (error bars). b Ratio of the mean variance obtained with TI, divided by the average variance obtained with NETI-DIFF: \({\overline{{\mathbb {V}}}}(\mathrm {TI})/{\overline{{\mathbb {V}}}}(\mathrm {NETI-DIFF})\). Values above 1 indicate a performance improvement with NETI-DIFF over TI. c Distribution of the variance ratios \({\mathbb {V}}(\mathrm {TI})/{\mathbb {V}}(\mathrm {NETI-DIFF})\) for TI with \(K=10\) inverse temperatures. Values above 1 indicate a performance improvement with NETI-DIFF over TI. d Same as panel c, but for TI with \(K=(20,50,100)\) inverse temperatures

Fig. 14
figure 14

Mean absolute deviation of log Bayes factor estimation: comparison between NETI-DIFF and TI on the biopathway data. Simulations were repeated for 5 independent data instantiations, obtained from the biopathway of Fig. 1a (wild type), and from the biopathway of Fig. 1b (PRR7/PRR9 mutant). Shown is the mean absolute deviation \({\mathbb {A}}\), defined in Eq. (41). Vertical bar height: average over the five data instantiations. Error bars: standard deviation. The horizontal axis shows the total number of iterations \(N_{iter}\). For each value of \(N_{iter}\), the leftmost bar represents NETI-DIFF. The other bars with different grey shadings represent TI with different numbers of inverse temperatures, ranging from 10 to 100

Figure 12 shows the distribution of estimated log Bayes factors obtained from \(N_{simu}=5\) independent MCMC runs.Footnote 12 The two columns refer to the different data types (from the wild type network, left column, and the mutant network, right column), and the rows (Panels 12a–d) to the number of inverse temperatures used for TI (from \(K=10\) to \(K=100\); note that NETI-DIFF is unaffected by that choice). The horizontal dashed lines show the ‘true’ value, as described above. As expected, the distribution width tends to decrease with increasing computational costs, \(N_{iter}\), and for the highest value, TI and NETI-DIFF tend to be in close agreement, with distributions tightly focused on the ‘true’ values. However, for lower computational costs, \(N_{iter} \le 400k\), bias and uncertainty tend to be considerably lower for NETI-DIFF than for TI, irrespective of the number of inverse temperatures used for TI.

For a more systematic investigation, we repeated the MCMC simulations on ten independent data instantiations, for the ten data sets used in Table 1. Five data sets were obtained from the biopathway of Fig. 1a (wildtype), and five data sets were obtained from the biopathway of Fig. 1b (PRR7/PRR9 mutant). For each data set, we computed the mean absolute deviation \({\mathbb {A}}\), defined in Eq. (41), and the variance \({\mathbb {V}}\), as defined in Eq. (40).

The top row in Fig. 13 shows the average variance \({\overline{{\mathbb {V}}}}\), averaged over all data instantiations. The second row shows the ratio of the average variance obtained with TI, divided by the average variance obtained with NETI-DIFF, averaged over all five data instantiations: \({\overline{{\mathbb {V}}}}(\mathrm {TI})/{\overline{{\mathbb {V}}}}(\mathrm {NETI-DIFF})\). The third and fourth rows show the distribution of the variance ratios \({\mathbb {V}}(\mathrm {TI})/{\mathbb {V}}(\mathrm {NETI-DIFF})\) over the five different data instantiations, for different numbers of inverse temperatures (for TI), and different total interation numbers \(N_{iter}\). For all ratios, values above 1 indicate a performance improvement with NETI-DIFF over TI. Our results indicate that NETI-DIFF consistently achieves a considerable variance reduction over TI. This reduction is particularly pronounced for small numbers of inverse temperatures, where it reaches up to three orders of magnitude. However, even for the highest number of inverse temperatures the variance reduction NETI-DIFF achieves over TI still varies between one and two orders of magnitude. This clear reduction in estimation uncertainty is matched by a consistent reduction in the estimation error, as quantified in terms of \({\mathbb {A}}\) and shown in Fig. 14. The reduction becomes stronger with decreasing iteration numbers \(N_{iter}\) and decreasing numbers of inverse temperatures, which indicates that the performance improvement of NETI-DIFF over TI is particularly relevant in the regime of limited computational resources.

6 Discussion

The objective of our work has been the direct targeting of the log Bayes factor via a modified thermodynamic integration path. This has been motivated by statistical physics, where the computation of a reaction free energy (mathematically equivalent to the log Bayes factor) is computationally more efficient than the computation of the difference of standard free energies (equivalent to the difference of log marginal likelihoods). The modified transition path directly connects the posterior distributions of the two models involved. In this way, the high variance prior regime is avoided. We have carried out a comparative evaluation with the state-of-the-art TI method of Friel et al. (2014). Our study confirms that a substantial variance reduction can be achieved when the models to be compared are nested. There is little room for improvement when comparing non-nested models with non-overlapping parameter sets. However, even in this least favourable case, the performance achieved with the proposed method, referred to as NET-DIFF in the present manuscript, is still on a par with established TI methods. For inference in a complex systems described by coupled nonlinear differential equations (biopathway), we found that NETI-DIFF reduces the variance by up to two orders of magnitude over state-of-the-art TI methods. Our work has also revealed that NETI-DIFF achieves a considerable performance stabilisation with respect to a variation of the parameter prior.

When the task is model selection out of a set of cardinality m, carrying out direct pairwise comparisons is of computational complexity \(m^2\) and may not be viable in practice. However, rather than reverting to the standard TI scheme and computing the marginal likelihoods

$$\begin{aligned} p(D|{\mathcal {M}}_1),\ldots ,p(D|{\mathcal {M}}_m) \end{aligned}$$
(42)

it appears more sensible to compute the Bayes factors

$$\begin{aligned} \frac{p(D|{\mathcal {M}}_1)}{p(D|{\mathcal {M}}_0)},\ldots ,\frac{p(D|{\mathcal {M}}_m)}{p(D|{\mathcal {M}}_0)} \end{aligned}$$
(43)

where \({\mathcal {M}}_0\) is a typical or representative model chosen from the set of models compared. The results for the Radiocarbon data, reported in Sect. 5.2, have demonstrated a remarkable robustness of the proposed method w.r.t. a variation of the model transition path, meaning that there is no significant difference in efficiency and accuracy between the direct computation of \(\log \frac{p(D|{\mathcal {M}}_1)}{p(D|{\mathcal {M}}_2)}\), and the indirect computation via \(\log \frac{p(D|{\mathcal {M}}_1)}{p(D|{\mathcal {M}}_0)}\) and \(\log \frac{p(D|{\mathcal {M}}_2)}{p(D|{\mathcal {M}}_0)}\). This suggests that 1-out-of-m model selection can also be improved with the method we have proposed. It is beyond the scope of this article to investigate this conjecture at greater depth, but it appears plausible that targeting Bayes factors along an annealing path starting from a reference posterior distribution associated with a reference model should give smaller posterior variance than conventionally targeting marginal likelihoods along an annealing path starting from the prior distribution.

If there are only those m models, \({\mathcal {M}}_1,\ldots ,{\mathcal {M}}_m\), then the m Bayes factors in Eq. (43) together with the (pre-defined) model prior probabilities \(p({\mathcal {M}}_i)\) (\(i=1,\ldots ,m\)) and the normalisation condition fully specify the model posterior probabilities \(p({\mathcal {M}}_i|D)\). With the definition:

$$\begin{aligned} b_{i,j} := \frac{p(D|{\mathcal {M}}_i)}{p(D|{\mathcal {M}}_j)} \cdot \frac{p({\mathcal {M}}_i)}{p({\mathcal {M}}_j)} = \frac{p(D|{\mathcal {M}}_i)}{p(D|{\mathcal {M}}_0)} \cdot \left( \frac{p(D|{\mathcal {M}}_j)}{p(D|{\mathcal {M}}_0)}\right) ^{-1} \cdot \frac{p({\mathcal {M}}_i)}{p({\mathcal {M}}_j)} \;\;\;\;\;\; \left( i,j\in \{1,\ldots ,m\}\right) \end{aligned}$$

where the two Bayes factors on the right are known from Eq. (43), we get:

$$\begin{aligned} p({\mathcal {M}}_i|D) = \frac{p(D|{\mathcal {M}}_i)\cdot p({\mathcal {M}}_i)}{\sum _{j=1}^{m} p(D|{\mathcal {M}}_j)\cdot p({\mathcal {M}}_j) } = \frac{p(D|{\mathcal {M}}_i)\cdot p({\mathcal {M}}_i)}{\sum _{j=1}^{m} \frac{p(D|{\mathcal {M}}_i)\cdot p({\mathcal {M}}_i)}{b_{i,j}} } = \left( \sum _{j=1}^m b_{i,j}^{-1} \right) ^{-1} \end{aligned}$$
(44)

Equation (44) is formally equivalent to Eq. (4) in Berger and Delampady (1987). We have m models with discrete prior probabilities \(\pi _i = P({\mathcal {M}}_i)>0\) and \(\sum _{i=1}^m \pi _i=1\). We get, e.g., for model \({\mathcal {M}}_1\):

$$\begin{aligned} p({\mathcal {M}}_1|D)= & {} \left( 1 + \sum _{j=2}^m b_{1,j}^{-1} \right) ^{-1} = \left( 1 + \sum _{j=2}^m \frac{\pi _j}{\pi _1} \cdot \frac{p(D|{\mathcal {M}}_j)}{p(D|{\mathcal {M}}_1)} \right) ^{-1} \\= & {} \left( 1 + \frac{1}{\pi _1 } \sum _{j=2}^m \pi _j \cdot \frac{p(D|{\mathcal {M}}_j)}{p(D|{\mathcal {M}}_1)} \right) ^{-1} = \left( 1 + \frac{1-\pi _1}{\pi _1} \sum _{j=2}^m \frac{\pi _j}{1-\pi _1} \cdot \frac{p(D|{\mathcal {M}}_j)}{p(D|{\mathcal {M}}_1)} \right) ^{-1}\\= & {} \left( 1 + \frac{1-\pi _1}{\pi _1} \cdot \frac{\sum _{j=2}^m p(D|{\mathcal {M}}_j) \cdot \frac{\pi _j}{1-\pi _1}}{p(D|{\mathcal {M}}_1)} \right) ^{-1} = \left( 1 + \frac{1-\pi _1}{\pi _1} \cdot \frac{1}{B} \right) ^{-1} \end{aligned}$$

where B is the Bayes factor:

$$\begin{aligned} B:=\frac{p(D|{\mathcal {M}}_1)}{\sum _{j=2}^m p(D|{\mathcal {M}}_j)\cdot g({\mathcal {M}}_j)} = \frac{p(D|{\mathcal {H}}_0)}{p(D|{\mathcal {H}}_1)} \;\; with \;\; g({\mathcal {M}}_j):=\frac{\pi _j}{1-\pi _1} = \frac{\pi _j}{\sum _{j=2}^m \pi _j} \end{aligned}$$
(45)

and the hypotheses stand for: \({\mathcal {H}}_0{:}{\mathcal {M}}={\mathcal {M}}_1\) and \({\mathcal {H}}_1{:}{\mathcal {M}}\in \{{\mathcal {M}}_2,\ldots ,{\mathcal {M}}_m\}\) which are assumed to be true with the prior probabilities \(\pi _1\) and \(1-\pi _1\), respectively.  Equation (45) corresponds to Eq. (2) in Berger and Delampady (1987).Footnote 13

One of the referees raised the interesting question of how the proposed method is applied to graphical Gaussian models and mixture models.

We have included an additional section in the Appendix 7.4 where we discuss in detail how the proposed method can be applied to Graphical Gaussian models. We have also carried out an additional simulation study to illustrate the application of our method to Graphical Gaussian models. The key idea is to not apply the method to the configuration space of precision matrices directly, which would be cumbersome due to the constrained topology of this space (restriction to positive definite matrices). Instead, we make use of the theorem that every multivariate normal density can be represented by a Gaussian belief network, and vice versa; see Geiger and Heckerman (1994). This effectively defines an isomorphism between the space of Gaussian graphical models and the space of Gaussian belief networks. We exploit this isomorphism by defining the proposed NETI scheme in the space of Gaussian belief networks, as discussed in detail in Appendix 7.4.

For mixture models, the proposed NETI method will not achieve any improvement over the standard thermodynamic integration scheme. The reason is that according to Eq. (18), the modified thermodynamic integration path that we have proposed has the potential for a variance reduction if the two model likelihoods in the numerator and denominator share a substantial number of parameters. For mixture models, this is not the case, due to the intrinsic identifiability problem. In Appendix 7.5, we demonstrate on an empirical simulation study that for a mixture model, the proposed new method and the established thermodynamic integration scheme are on a par.

The focus of our study has been a comparison with the improved TI method proposed in Friel et al. (2014). Recently, a powerful new method for variance reduction in thermodynamic integration based on control variates, termed CTI (controlled thermodynamic integral), has been proposed (Oates et al. 2016). The idea is to add a zero-mean function from a given function family (e.g. a polynomial) to the integrand and then apply variational calculus to minimise the variance of the estimator. The resulting optimality equations depend on expectation values w.r.t. the unknown posterior distribution, which the authors approximate with samples from initial MCMC simulations.

On the Radiata data, CTI outperforms NETI-DIFF, due to the fact that NETI-DIFF offers little room for improvement on non-nested models with disjunct parameter sets, as discussed above. On the Pima Indians data, both NETI-DIFF and CTI achieve a significant variance reduction over the state-of-the-art TI method of Friel et al. (2014). Oates et al. (2016) applied their method with the standard trapezoid sum of Eq. (7), CTI-1, and with the improved trapezoid sum of Eq. (10), CTI-2. A comparison between Fig. 7 in the present paper and Fig. 3 in Oates et al. (2016) shows that the performance of NETI-DIFF, which reduces the variance over state-of-the-art TI by a whole order of magnitude, lies between CTI-1 and CTI-2. Oates et al. (2016) argue that the linear curvature sum of Eq. (7) is known to be biased, and the quadratic curvature rule of Eq. (10) should be used. However, in Aderhold et al. (2017) it was demonstrated that quadratic curvature can lead to an increase in the estimation error when vague prior distributions are used, and it is therefore not always the automatic method of choice.

Current work in statistics is increasingly aiming to tackle more complex models, e.g. based on coupled nonlinear differential equations, like the biopathway model discussed in Sect. 4.4. For data generated from an ordinary differential equation model of circadian regulation (Goodwin oscillator), Oates et al. (2016) found that CTI achieved little improvement over state-of-the-art TI. The authors discuss that a potential problem CTI faces for complex models is multimodality of the posterior distributions, rendering the approximation of the posterior expectation values, which enter the optimality equations from variational calculus, less reliable. NETI-DIFF, on the other hand, does not rely on such estimates. In fact, our results, presented in Fig. 13, suggest that NETI-DIFF achieves the most substantial variance reduction over state-of-the-art TI for the most complex, nonlinear biopathway model, reaching up to and exceeding two orders of magnitude.

We conclude that CTI and NETI-DIFF are not competing methods, but rather conceptionally different approaches with the potential to complement each other. CTI aims to achieve variance reduction by adding control variates to the integrand; it requires a reliable estimation of posterior averages of quantities related to these control variates from initial MCMC runs. NETI-DIFF aims to achieve variance reduction by modifying the thermodynamic integration path; it works best for models with substantial parameter overlap. Both approaches can be combined, that is, the natural next step is to add control variates and change the integration path, i.e. to target the log Bayes factor with the principles of CTI. This combination of NETI-DIFF and CTI has the potential to further extend the feasibility of Bayesian model selection to ever more complex models, and a closer investigation of such a hybrid approach poses a promising avenue for future research.