A numerically stable algorithm for integrating Bayesian models using Markov melding

When statistical analyses consider multiple data sources, Markov melding provides a method for combining the source-specific Bayesian models. Markov melding joins together submodels that have a common quantity. One challenge is that the prior for this quantity can be implicit, and its prior density must be estimated. We show that error in this density estimate makes the two-stage Markov chain Monte Carlo sampler employed by Markov melding unstable and unreliable. We propose a robust two-stage algorithm that estimates the required prior marginal self-density ratios using weighted samples, dramatically improving accuracy in the tails of the distribution. The stabilised version of the algorithm is pragmatic and provides reliable inference. We demonstrate our approach using an evidence synthesis for inferring HIV prevalence, and an evidence synthesis of A/H1N1 influenza.


Introduction
Many modern applied statistical analyses consider several data sources, which differ in size and complexity. The wide variety of problems and information sources has produced numerous methods for multi-source inference (Lanckriet et al. 2004; Coley et al. 2017; Besbeas and Morgan 2019), as well as general methodologies including evidence synthesis methods (Sutton and Abrams 2001; Ades and Sutton 2006; Spiegelhalter et al. 2004), and the data fusion model (Kedem et al. 2017). These methods require an appropriate joint model for all data, which can be challenging to specify.
An alternative approach is to model smaller, simpler aspects of the data, such that designing these submodels is easier, then combine the submodels. The premise is that the combination of many smaller submodels will serve as a good approximation to a larger joint model, which may be methodologically or computationally infeasible. Markov melding (Goudie et al. 2019) provides a method for combining these submodels. Specifically, Markov melding joins together submodels that share a common quantity φ into a single joint model. Consider M submodels indexed by m = 1, . . . , M that share φ, have submodel-specific parameters ψ_m and submodel-specific data Y_m, denoting the m-th submodel p_m(φ, ψ_m, Y_m). Markov melding forms a single joint melded model p_meld(φ, ψ_1, . . . , ψ_M, Y_1, . . . , Y_M), which enables information to flow through φ from one model to another. The melded model posterior thus incorporates uncertainty from all sources of data.
Multi-stage sampling methods are useful, pragmatic tools for estimating complex joint models — such as p_meld — in a computationally feasible manner, and have been applied in settings including statistical genetics and phylogeny (Tom et al. 2010), meta-analysis (Lunn et al. 2013; Blomstedt et al. 2019), spatial statistics (Hooten et al. 2019), and joint models in survival analysis (Mauff et al. 2020). Whilst it is preferable to sample the joint posterior directly, this is often infeasible due to the complexity of the model, the size of the data, the limitations of probabilistic programming languages such as JAGS and Stan, or the complications of re-expressing complicated submodels in a common programming language (Johnson et al. 2020). Improving the stability of multi-stage estimation techniques is thus of interest to applied statisticians.
Evidence synthesis models consider multiple sources of data (evidence), including randomised controlled trials or observational studies, to understand complex phenomena. Each source of data has an associated submodel and set of parameters it informs; combining all the submodels requires assuming deterministic or probabilistic relationships between the submodel-specific parameters. For example, De Angelis et al. collected many surveys of partially overlapping subpopulations, albeit at different frequencies, and combined these in an evidence synthesis model to estimate human immunodeficiency virus (HIV) prevalence in the United Kingdom. An introduction to evidence synthesis can be found in Chapter 8 of Spiegelhalter et al. (2004); other applications include estimating the prevalence of campylobacteriosis (Albert et al. 2011) and influenza.
We can form evidence synthesis models by applying Markov melding to the various sources of data and their submodels. However, the common quantity φ may be a complex, non-invertible function of the parameters in one of the submodels. This is a challenge for Markov melding, as the method requires the prior marginal density of φ under each submodel, p_m(φ), which may not be analytically tractable. Instead, prior samples of φ are drawn, and the prior marginal density is estimated using a kernel density estimate (KDE), denoted p̂_m(φ) (Wand and Jones 1995). However, the use of a KDE in lieu of the analytic density function has poor implications for the numerical stability of the Markov chain Monte Carlo (MCMC) method used to estimate the melded posterior, even in our low dimensional examples. Specifically, we illustrate that the multi-stage MCMC sampler of Goudie et al. (2019) is sensitive to error in p̂_m(φ), particularly in low probability regions.
To address this sensitivity, we first note that Markov melding strictly only requires an estimate of the self-density ratio (Hiraoka et al. 2014), r(φ_nu, φ_de) = p_m(φ_nu) / p_m(φ_de), as we will show in Sect. 2. In Sect. 3 we develop methodology that reduces the error in the self-density ratio estimate r̂(φ_nu, φ_de) by using weighted-sample KDEs (Vardi 1985; Jones 1991), which are more accurate in low probability regions. Multiple weighted-sample estimates of r(φ_nu, φ_de) are combined via a weighted average to further improve performance. We call this methodology weighted-sample self-density ratio estimation (WSRE), and demonstrate its effectiveness in two examples. The first is a toy example from Ades and Cliffe (2002). We show that output from the multi-stage estimation process that uses WSRE is closer to reference samples than the naive approach, which uses a single KDE for p_m(φ). The second example is an involved evidence synthesis, previously considered in Goudie et al. (2019). Here we show that the multi-stage estimation process that employs WSRE produces plausible samples, whilst the naive approach produces nonsensical results. In these examples φ is a 1- or 2-dimensional quantity. We discuss the applicability of our method for higher dimensional φ in Sect. 6.

Markov melding
The Markov melding framework is able to join together any number of submodels which share a common component φ.
As the examples in this paper only consider two submodels, we limit our exposition to the M = 2 case; for the more general case see Goudie et al. (2019). Markov melding constructs a joint model using the conditional distributions for submodel-specific parameters ψ_m and data Y_m, denoted p_m(ψ_m, Y_m | φ). These conditional distributions are then combined with a global prior for φ called the pooled prior p_pool(φ), which we discuss in Sect. 2.1. Mathematically, assuming that the supports of the relevant conditional, joint, and marginal distributions containing φ are appropriate, we define the melded joint distribution as

p_meld(φ, ψ_1, ψ_2, Y_1, Y_2) = p_pool(φ) p_1(ψ_1, Y_1 | φ) p_2(ψ_2, Y_2 | φ)    (1)

= p_pool(φ) [p_1(φ, ψ_1, Y_1) / p_1(φ)] [p_2(φ, ψ_2, Y_2) / p_2(φ)].    (2)

The submodel-specific conditional densities p_m(ψ_m, Y_m | φ) may be analytically intractable. Hence, it is easier to work with the submodel joint densities p_m(φ, ψ_m, Y_m) and their prior marginal distributions p_m(φ), as specified in Eq. (2), because the former can be factorised into the data generating process specified during submodel construction.

Forming the pooled prior
The pooled prior should represent previous knowledge of φ in the absence of other information. A general approach to constructing p pool (φ) is to consider a weighted combination of the prior marginal distributions p m (φ), with submodel weights λ m . Selection of the pooling method and specific values of the weights is a topic covered in much detail elsewhere (Clemen and Winkler 1999;O'Hagan et al. 2006); a full summary of this field is beyond the scope of this article.
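For concreteness, two standard pooling functions from the opinion pooling literature are linear and logarithmic pooling; the text above does not prescribe a particular choice, so these are given purely as illustrations:

```latex
% Linear pooling: a mixture of the prior marginals
p_{\mathrm{pool}}(\phi) \propto \sum_{m=1}^{M} \lambda_m \, p_m(\phi),
  \qquad \lambda_m \ge 0, \quad \sum_{m=1}^{M} \lambda_m = 1.

% Logarithmic pooling: a weighted geometric mean of the marginals
p_{\mathrm{pool}}(\phi) \propto \prod_{m=1}^{M} p_m(\phi)^{\lambda_m}.
```

With equal weights λ_m = 1 for all m, logarithmic pooling reduces to product-of-experts pooling.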

Two-stage Markov chain Monte Carlo sampler
Directly estimating the melded model's posterior distribution necessitates simultaneously evaluating both submodels. This can be impractical if the submodels are implemented in different probabilistic programming languages or have bespoke implementations. We use the two-stage Markov chain Monte Carlo (MCMC) sampler of Goudie et al. (2019) to sample from the melded posterior

p_meld(φ, ψ_1, ψ_2 | Y_1, Y_2) ∝ p_pool(φ) [p_1(φ, ψ_1, Y_1) / p_1(φ)] [p_2(φ, ψ_2, Y_2) / p_2(φ)],    (3)

without the need to evaluate Eq. (3) all at once. The sampler first draws from a partial product of the terms in Eq. (3), then uses these samples as a proposal distribution in the second stage. The result is a convenient cancellation of the common terms in the stage two acceptance probability, whilst still ensuring that the final samples come from the melded posterior distribution of Eq. (3).
In stage one of the sampler we may, for example, opt to target the first submodel p_1, but with an (improper) flat prior for φ, i.e. the stage one target p_meld,1(φ, ψ_1 | Y_1) ∝ p_1(φ, ψ_1, Y_1) / p_1(φ). We construct a standard Markov chain in which a proposed move from (φ, ψ_1) → (φ*, ψ*_1), with proposal density q(φ*, ψ*_1 | φ, ψ_1), is accepted with probability

α = min{1, [p_1(φ*, ψ*_1, Y_1) p_1(φ) q(φ, ψ_1 | φ*, ψ*_1)] / [p_1(φ, ψ_1, Y_1) p_1(φ*) q(φ*, ψ*_1 | φ, ψ_1)]}.    (4)

This Markov chain asymptotically emits samples from p_meld,1. In stage two we update φ and ψ_2 using Metropolis-within-Gibbs updates, targeting the full melded posterior distribution of Eq. (3). Updating φ uses the stage one samples as a proposal distribution. For a sample of size N from p_meld,1, denoted {φ_n^(meld,1)}_{n=1}^N, we sample an index n* uniformly at random between 1 and N, and use the corresponding value as the proposal φ* = φ_{n*}^(meld,1). This results in a stage two acceptance probability for a move from φ → φ* of

α = min{1, [p_pool(φ*) p_2(φ*, ψ_2, Y_2) p_2(φ)] / [p_pool(φ) p_2(φ, ψ_2, Y_2) p_2(φ*)]},    (5)

since all stage one terms cancel, providing a form of "modularisation" in the algorithm. The update for ψ_2, drawn from a proposal distribution q(ψ*_2 | ψ_2), has an acceptance probability for a move from ψ_2 → ψ*_2 of

α = min{1, [p_2(φ, ψ*_2, Y_2) q(ψ_2 | ψ*_2)] / [p_2(φ, ψ_2, Y_2) q(ψ*_2 | ψ_2)]},

as all terms that do not contain ψ_2 cancel. Samples from the melded posterior distribution for ψ_1, p_meld(ψ_1 | Y_1, Y_2), can be obtained by storing the indices n* used to draw values of φ from {φ_n^(meld,1)}_{n=1}^N in stage two; the stored indices are then used to resample the corresponding stage one values of ψ_1. An interesting property of Eqs. (4) and (5) is that our interaction with the unknown prior marginal distribution is limited to the self-density ratio r(φ, φ*) = p_m(φ) / p_m(φ*). In Sect. 3 we develop methodology that uses self-density ratios to improve the numerical stability of the acceptance probability calculations.
We do not have to target p meld,1 (φ, ψ 1 | Y 1 ) with an improper prior in stage one; we are free to choose any of the components of Eq. (3). The choice of stage one components will affect MCMC mixing, yet is often constrained by the practicalities of sampling the subposterior distributions. In the example of Sect. 5 the common quantity φ is a non-invertible function of parameters in p 1 , and it is possible to sample from the subposterior p 1 (φ, ψ 1 | Y 1 ) using JAGS. Hence, we draw stage one samples from p 1 (φ, ψ 1 | Y 1 ), with stage two, implemented partially in Stan, accounting for the remaining terms: 1 / p 1 (φ), p 2 (φ, ψ 2 | Y 2 ) / p 2 (φ), and p pool (φ). This process highlights another interesting advantage of Markov melding; we can use samples produced from one statistical software package in combination with a model implemented in another, mixing and matching as is most convenient.
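The mechanics of the stage two φ update can be sketched on a toy problem. The following is not the melded model itself: all densities are illustrative stand-ins, with stage one samples acting as an independence proposal, and the terms that do not cancel collapsed into a single hypothetical function `extra_log_term`, chosen so that the stage two target is known and the output can be checked.

```python
import numpy as np

rng = np.random.default_rng(1)

# "Stage one": samples from an illustrative stand-in target, phi ~ N(0, 1)
stage_one = rng.standard_normal(50_000)

def extra_log_term(phi):
    # log of the terms that do NOT cancel in the stage two acceptance
    # ratio; chosen here so the stage two target is N(0.5, 1):
    # log N(phi; 0.5, 1) - log N(phi; 0, 1)
    return -0.5 * (phi - 0.5) ** 2 + 0.5 * phi ** 2

phi = stage_one[0]
draws = np.empty(20_000)
for i in range(draws.size):
    # propose by resampling a stage one value uniformly at random;
    # the stage one target and proposal density cancel exactly
    phi_star = stage_one[rng.integers(stage_one.size)]
    if np.log(rng.uniform()) < extra_log_term(phi_star) - extra_log_term(phi):
        phi = phi_star
    draws[i] = phi

print(draws.mean())   # should sit near the stage two target mean, 0.5
```

Because the proposal is simply a uniform draw from the stored stage one samples, only the "extra" terms appear in the acceptance ratio, mirroring the cancellation in Eq. (5).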

Naive prior marginal estimation
The expressions in Eqs. (4) and (5) explicitly include both submodels' prior marginal distributions p_m(φ) for m = 1, 2, and implicitly include them in p_pool(φ). In our examples we do not have analytic expressions for these marginals. More generally, if φ is not a root node in the directed acyclic graph representation of either submodel (see e.g. π_12 in Fig. 1), or is the aggregate output of a non-invertible deterministic link function, then the analytic form of p_m(φ) will likely be intractable.
The approach proposed by Goudie et al. (2019), which we call the naive approach, estimates the prior marginal distributions by sampling p_m(φ, ψ_m, Y_m) for each submodel using simple Monte Carlo, as the samples of φ will be distributed according to the correct marginal, and employs a standard KDE p̂_m(φ) (Wand and Jones 1995). The two-stage sampler then targets the corresponding estimate of the melded posterior,

p̂_meld(φ, ψ_1, ψ_2 | Y_1, Y_2) ∝ p̂_pool(φ) [p_1(φ, ψ_1, Y_1) / p̂_1(φ)] [p_2(φ, ψ_2, Y_2) / p̂_2(φ)],    (6)

where p̂_pool(φ) is the approximation to p_pool(φ) obtained by plugging in p̂_m(φ) for m = 1, 2.

Numerical issues in the naive approach
Sampling the melded posterior using Eq. (6) can be numerically unstable. Say we propose a move from φ → φ*, where φ* is particularly improbable under p_m. The KDE estimate at this value, p̂_m(φ*), is poor in terms of relative error, particularly in the tails of the distribution (Koekemoer and Swanepoel 2008). In our experience, the KDE is typically an underestimate in the tails, which can lead to an explosion in the self-density ratio estimate r̂(φ, φ*) = p̂_m(φ) / p̂_m(φ*). Hence, improbable values for φ* are accepted far too often. Once at this improbable value, i.e. when φ is improbable under p_m(φ), the error in the KDE then leads to a dramatically reduced value for the acceptance probability. This results in Markov chains that get stuck at improbable values; see, for example, the top left panel of Fig. 5. The stage in which this instability arises depends on which prior marginal densities are intractable, and how the terms in Eq. (3) are apportioned across the stages. In the example of Sect. 4, p_1(φ) is unknown and is part of both stage one (in Eq. (4)) and stage two (via p_pool(φ) in Eq. (5)). Thus both stages are numerically brittle. Our second example, contained in Sect. 5, represents a more typical scenario, where the first submodel posterior is used as the proposal for the melded posterior. In this case, all unknown prior marginal terms are factorised into the stage two target, and the instability is confined to the second stage.
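This failure mode can be reproduced in a few lines. The sketch below assumes a standard normal stand-in for the intractable prior marginal, fits a Gaussian KDE with Silverman's bandwidth, and compares it with the true density in the body and the far tail; everything here is illustrative rather than part of the method itself.

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.standard_normal(3_000)    # draws from the "prior marginal"

h = 1.06 * samples.std() * samples.size ** (-1 / 5)   # Silverman bandwidth

def kde(x):
    # Gaussian-kernel density estimate at a point x
    z = (x - samples) / h
    return np.exp(-0.5 * z ** 2).mean() / (h * np.sqrt(2 * np.pi))

def true_pdf(x):
    return np.exp(-0.5 * x ** 2) / np.sqrt(2 * np.pi)

# accurate in the body of the distribution ...
body_rel_err = abs(kde(0.5) - true_pdf(0.5)) / true_pdf(0.5)

# ... but far in the tail the KDE typically underestimates, which
# inflates the plug-in self-density ratio p(0) / p(5)
tail_under = kde(5.0) < true_pdf(5.0)
ratio_kde = kde(0.0) / kde(5.0)
ratio_true = true_pdf(0.0) / true_pdf(5.0)
print(body_rel_err, tail_under, ratio_kde / ratio_true)
```

The inflated plug-in ratio is exactly the quantity that enters the acceptance probability, which is why tail error in the KDE destabilises the sampler.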

Self-density ratio estimation
As described in Sect. 2.4, the self-density ratios associated with both p_1(φ) and p_2(φ) may be required by the two-stage MCMC algorithm. To simplify notation, we consider in this section a generic joint density p(φ, γ) that we can evaluate pointwise, but whose marginal p(φ) = ∫ p(φ, γ) dγ we cannot obtain analytically. Our interest is in the self-density ratio evaluated at φ_nu and φ_de (the subscripts abbreviate numerator and denominator respectively), which we denote

r(φ_nu, φ_de) = p(φ_nu) / p(φ_de).

In our examples we set φ_nu = φ and φ_de = φ* for use in Eqs. (4) and (5), and define γ = (ψ_m, Y_m) and p = p_m, where m = 1 or 2 as appropriate (see Sects. 4 and 5 for details).
To avoid the numerical issues associated with the naive approach, we need to improve the ratio estimate r̂(φ_nu, φ_de) for improbable values of φ_nu and φ_de, e.g. values more than two standard deviations away from the mean. The fundamental flaw in the naive approach in this context is that it minimises the absolute error in the high density region (HDR) of p(φ), i.e. the region R_ε(p(φ)) = {φ : p(φ) > ε}. But this is not necessarily the sole region of interest, and we are concerned with minimising the relative error. To address this we reweight p(φ) towards a particular region, and thus obtain a more accurate estimate in that region. We then exploit the fact that we only interact with the prior marginal distribution via its self-density ratio to combine estimates from multiple reweighted distributions.

Single weighting function
We can shift p(φ) by multiplying the joint distribution p(φ, γ) by a known weighting function w(φ; ξ), controlled by a parameter ξ, then account for this shift in our KDE. This will improve the accuracy of the KDE in the region to which we shift the marginal. We first generate N samples, denoted {(φ_n, γ_n)}_{n=1}^N, from a weighted version of the joint distribution,

s(φ, γ; ξ) = p(φ, γ) w(φ; ξ) / Z_1,    (7)

where Z_1 = ∫∫ p(φ, γ) w(φ; ξ) dφ dγ. The samples {φ_n}_{n=1}^N, obtained by ignoring the samples of γ_n, are distributed according to a weighted version s(φ; ξ) of the marginal distribution p(φ), with s(φ; ξ) = p(φ) w(φ; ξ) / Z_2, where Z_2 = ∫ p(φ) w(φ; ξ) dφ. We compute a KDE ŝ(φ; ξ) using the samples drawn from (7), and note that inverting the weighting yields an unnormalised estimate of the marginal,

p̂(φ) = Z_3 ŝ(φ; ξ) / w(φ; ξ),    (8)

for an unknown normalisation constant Z_3. We then form our weighted-sample self-density ratio estimate

r̂(φ_nu, φ_de) = p̂(φ_nu) / p̂(φ_de) = [ŝ(φ_nu; ξ) w(φ_de; ξ)] / [ŝ(φ_de; ξ) w(φ_nu; ξ)].
The cancellation of the normalisation constant Z_3 is crucial, as accurately estimating constants like Z_3 is known to be challenging.
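The single-weighting-function estimator can be sketched on a tractable example where the truth is known. Here p(φ) is a standard normal stand-in for the intractable prior marginal, w(φ; ξ) is a Gaussian weighting function centred in the tail, and, purely for illustration, sampling from the weighted target is done by importance resampling rather than MCMC; all names and settings are assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(2)
xi_mean, xi_sd = 3.0, 0.5

def w(phi):
    # Gaussian weighting function w(phi; xi), centred in the tail of p
    return np.exp(-0.5 * ((phi - xi_mean) / xi_sd) ** 2)

base = rng.standard_normal(50_000)          # draws from p(phi) = N(0, 1)
# stand-in for MCMC on the weighted joint: importance resampling
weighted = rng.choice(base, size=5_000, p=w(base) / w(base).sum())

h = 1.06 * weighted.std() * weighted.size ** (-1 / 5)   # Silverman bandwidth

def s_hat(x):
    # KDE of the weighted marginal s(phi; xi), proportional to p(phi) w(phi)
    z = (x - weighted) / h
    return np.exp(-0.5 * z ** 2).mean() / (h * np.sqrt(2 * np.pi))

def r_hat(phi_nu, phi_de):
    # the unknown normalising constant cancels in the ratio
    return (s_hat(phi_nu) / w(phi_nu)) / (s_hat(phi_de) / w(phi_de))

est = r_hat(2.5, 3.0)
true = np.exp(-0.5 * (2.5 ** 2 - 3.0 ** 2))   # p(2.5) / p(3.0), known here
print(est, true)
```

Both evaluation points sit deep in the tail of p(φ), where a naive KDE is unreliable, but inside the HDR of the weighted target, where the estimate is accurate.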

Choice of weighting function
The choice of w(φ; ξ) affects both the validity and efficacy of our methodology. The weighted marginal s(φ; ξ) must satisfy the requirements for a density for our method to be valid. Hence, the specific form of w(φ; ξ) is subject to some restrictions. Our first requirement is that w(φ; ξ) > 0 for all φ in the support of p(φ, γ ). We also require that the weighted joint distribution, defined in (7), has finite integral, to ensure that it can be normalised to a probability distribution, and that the marginal s(φ; ξ) is positive over the support of interest, also with finite integral.

Multiple weighting functions
The methodology of Sect. 3.1 produces a single estimate r̂(φ_nu, φ_de) using p̂(φ) from Eq. (8). It is accurate for values in the HDR of s(φ; ξ), i.e. R_ε(s(φ; ξ)), and we can control the location of R_ε(s(φ; ξ)) through ξ. This is similar to importance sampling, with s(φ; ξ) acting as the proposal density. Nakayama (2011) notes that importance sampling can be used to improve the mean square error (MSE) of a KDE in a specific local region, at the cost of an increase in global MSE. To ameliorate the decrease in global performance, we specify multiple regions in which we want accurate estimates of p̂(φ), and then combine the corresponding estimates of r(φ_nu, φ_de) to provide a single estimate that is accurate across all regions. We elect to use W different weighting functions, indexed by w = 1, . . . , W, with function-specific parameters ξ_w, denoted w(φ; ξ_w). Samples are then drawn from each of the W weighted distributions s_w(φ; ξ_w) ∝ p(φ) w(φ; ξ_w). Denote the samples from the w-th weighted distribution by {φ_n^(s_w)}_{n=1}^N. Each set of samples produces a separate ratio estimate r̂_w(φ_nu, φ_de) in the manner described in Sect. 3.1.
Each individual r̂_w is accurate (in terms of relative accuracy) only in the HDR of s_w(φ; ξ_w). Thus, when combining multiple ratio estimates, simply taking the mean of all w = 1, . . . , W estimates (for a specific value of φ_nu and φ_de) would not make use of our knowledge about the region in which r̂_w is accurate. We therefore propose a weighted average of all the individual ratio estimates, with weights that approximately come from s_w(φ_nu; ξ_w) s_w(φ_de; ξ_w); this quantity is largest when r̂_w(φ_nu, φ_de) is most accurate. This ensures the more accurate terms are given more weight in our final estimate. Specifically, we use {φ_n^(s_w)}_{n=1}^N to form KDEs ŝ_w(φ; ξ_w) and set

μ_w(φ_nu, φ_de) = ŝ_w(φ_nu; ξ_w) ŝ_w(φ_de; ξ_w).

Finally, we form the weighted-sample self-density ratio estimate r̂_WSRE(φ_nu, φ_de), which is a weighted mean of the individual ratio estimates,

r̂_WSRE(φ_nu, φ_de) = [Σ_{w=1}^{W} μ_w(φ_nu, φ_de) r̂_w(φ_nu, φ_de)] / [Σ_{w=1}^{W} μ_w(φ_nu, φ_de)].
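The combination step can be sketched under the same tractable stand-in as before (p(φ) standard normal, Gaussian weighting functions, importance resampling in place of MCMC; all settings are assumptions of the sketch). Each per-weighting ratio estimate is averaged with weights ŝ_w(φ_nu)ŝ_w(φ_de), so components whose weighted targets do not cover the evaluation points contribute negligibly.

```python
import numpy as np

rng = np.random.default_rng(3)
base = rng.standard_normal(50_000)          # draws from p(phi) = N(0, 1)
centres = [-3.0, -1.5, 0.0, 1.5, 3.0]       # xi_w for W = 5 weightings
xi_sd = 0.6

components = []
for mu in centres:
    wfun = np.exp(-0.5 * ((base - mu) / xi_sd) ** 2)
    sub = rng.choice(base, size=4_000, p=wfun / wfun.sum())
    h = 1.06 * sub.std() * sub.size ** (-1 / 5)     # per-component bandwidth
    components.append((sub, h, mu))

def s_hat(x, sub, h):
    # KDE of one weighted marginal s_w(phi; xi_w)
    z = (x - sub) / h
    return np.exp(-0.5 * z ** 2).mean() / (h * np.sqrt(2 * np.pi))

def weight_fn(x, mu):
    return np.exp(-0.5 * ((x - mu) / xi_sd) ** 2)

def r_wsre(phi_nu, phi_de):
    num = den = 0.0
    for sub, h, mu in components:
        s_nu = s_hat(phi_nu, sub, h)
        s_de = s_hat(phi_de, sub, h)
        wgt = s_nu * s_de                   # large where r_w is accurate
        if wgt == 0.0:
            continue                        # component covers neither point
        r_w = (s_nu / weight_fn(phi_nu, mu)) / (s_de / weight_fn(phi_de, mu))
        num += wgt * r_w
        den += wgt
    return num / den

est = r_wsre(2.0, 2.8)
true = np.exp(-0.5 * (2.0 ** 2 - 2.8 ** 2))   # p(2.0) / p(2.8), known here
print(est, true)
```

For this pair of tail points the components centred at 1.5 and 3.0 dominate the weighted mean, while the remaining components receive negligible weight.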

Choosing values for ξ_w
Denote by A the region of interest, i.e. the set of values of φ at which the two-stage sampler requires accurate self-density ratio evaluations; we are interested in accurately evaluating the self-density ratio for pairs of points in this region. We will obtain W choices for ξ_w by specifying V weighting functions for each of the D components of φ, such that W = V^D.
Assume that the weighting function w(φ; ξ) is composed of D independent component weighting functions,

w(φ; ξ) = ∏_{d=1}^{D} w(φ^[d]; ξ^[d]),

where ξ^[d] is the d-th component of ξ. We can then define the marginal of the weighted target,

t(φ^[d]; ξ^[d]) = ∫ s(φ; ξ) dφ^[−d],

where φ^[−d] represents the D − 1 components of φ that are not φ^[d]. For typical choices of ξ and w(φ; ξ), the corresponding HDR of t(φ^[d]; ξ^[d]) does not span the region of interest along the d-th component, i.e. the projection of A onto that component, denoted A^[d]. Our aim is to choose, for each of the D components, values ξ_{1,d}, . . . , ξ_{V,d} such that the union of the corresponding HDRs covers A^[d]. We employ the following heuristic argument, first choosing a "minimum" ξ_{1,d} and a "maximum" ξ_{V,d} such that the lower limit of A^[d] lies in R_ε(t(φ^[d]; ξ_{1,d})) and the upper limit of A^[d] lies in R_ε(t(φ^[d]; ξ_{V,d})). In words, we choose a minimum value ξ_{1,d} so that the corresponding HDR of the weighted target includes the lower limit of the region of interest; an analogous argument is used to choose the maximum ξ_{V,d}. We then interpolate V − 2 values between ξ_{1,d} and ξ_{V,d}, ensuring that there is sufficient, but not complete, overlap between the corresponding HDRs.
Denote an element of the set of all W possible values for the parameter of the weighting function by ξ_w = (ξ_{v_1,1}, . . . , ξ_{v_D,D}), noting that ξ_w is a D-vector. The practitioner typically has some knowledge of p(φ) and A from prior predictive checks and previous attempts at running the two-stage sampler. Thus only a small number of trial-and-error attempts should be needed to determine ξ_{1,d} and ξ_{V,d} for all dimensions. These attempts are also used to check for overlap between the HDRs, and to increase V if the overlap is insufficient. Section 6 contains further discussion of this selection process and its relationship to umbrella sampling (Torrie and Valleau 1977).

Practicalities and software
In our examples we use Gaussian density functions for the component weighting functions w(φ^[d]; ξ_{v,d}), with ξ_{v,d} comprising a mean and variance parameter. Our definition of sufficient overlap is that the 0.95 empirical quantile of t(φ^[d]; ξ_{v,d}) is equal to, or slightly greater than, the 0.05 empirical quantile of t(φ^[d]; ξ_{v+1,d}), for v = 1, . . . , V − 1.
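The overlap check can be sketched as follows, again with a standard normal stand-in for p(φ), Gaussian component weighting functions on an assumed grid, and importance resampling in place of MCMC; the empirical 0.95 and 0.05 quantiles of consecutive weighted targets are compared directly.

```python
import numpy as np

rng = np.random.default_rng(4)
base = rng.standard_normal(100_000)         # draws from p(phi) = N(0, 1)
mus = np.linspace(-2.0, 2.0, 5)             # assumed grid of means, V = 5
sigma_w = 0.6                               # assumed common std deviation

q05, q95 = [], []
for mu in mus:
    wfun = np.exp(-0.5 * ((base - mu) / sigma_w) ** 2)
    sub = rng.choice(base, size=5_000, p=wfun / wfun.sum())
    q05.append(np.quantile(sub, 0.05))
    q95.append(np.quantile(sub, 0.95))

# sufficient overlap: each weighted target's 0.95 quantile reaches past
# the 0.05 quantile of the next one, so the HDRs tile without gaps
overlaps = [q95[v] >= q05[v + 1] for v in range(len(mus) - 1)]
print(overlaps)
```

If any entry of `overlaps` were False, the heuristic of Sect. 3.4 would call for increasing V or widening the weighting functions.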
Our implementation of the WSRE methodology is available as an R (R Core Team 2021) package at https://github.com/hhau/wsre. It is built on top of Stan (Carpenter et al. 2017) and Rcpp (Eddelbuettel and François 2011). Package users supply a joint density p(φ, γ) in the form of a Stan model, choose the parameters ξ_w of each of the W weighting functions, and specify the number of samples N drawn from each s_w(φ; ξ_w). The combined estimate r̂_WSRE(φ_nu, φ_de) is returned. A vignette on using wsre is included in the package, and documents the specific form of Stan model required.

An evidence synthesis for estimating the efficacy of HIV screening
To illustrate our approach we artificially split an existing joint model into two submodels, then compare the melded posterior estimates obtained by the two-stage algorithm using the naive and WSRE approaches. Artificially splitting this joint model serves several purposes: it demonstrates that the numerical instability can occur in a simple, low dimensional setting; we can obtain a good parametric approximation to the prior marginal to use as a reference; and the simplicity of the model allows us to focus on our methodology, not the complexity of the model.

Model
The model is a simple evidence synthesis model for inferring the efficacy of HIV screening in prenatal clinics (Ades and Cliffe 2002), and has 8 basic parameters ρ 1 , ρ 2 , . . . , ρ 8 , which are group membership probabilities for particular risk groups and subgroups thereof. The first risk group partitions the prenatal clinic attendees into those born in sub-Saharan Africa (SSA), injecting drug users (IDU), and the remaining women. These groups have corresponding membership probabilities ρ 1 , ρ 2 , and 1 − ρ 1 − ρ 2 . The groups are subdivided based on whether they are infected with HIV, with group specific HIV positivity ρ 3 , ρ 4 and ρ 5 respectively; and if they had already been diagnosed before visiting the clinic, with pre-clinical diagnosis probabilities ρ 6 , ρ 7 and ρ 8 . An additional probability is also included in the model, denoted ρ 9 , which considers the prevalence of HIV serotype B. This parameter enables the inclusion of study 12, which further informs the other basic parameters. Table 1 summarises the full joint model, including the s = 1, . . . , 12 studies with observations y s and sample size n s ; the basic parameters ρ 1 , . . . , ρ 9 ; and the link functions that relate the study proportions π 1 , . . . , π 12 to the basic parameters.
We make one small modification to the original model of Ades and Cliffe (2002), to better highlight the impact of WSRE on the melded posterior estimate. The original model adopts a flat, Beta(1, 1) prior for ρ_9. This induces a prior on π_12 that is not flat, but not overly informative, making it difficult to demonstrate the issues caused by an inaccurate density estimate of the tail of the prior marginal distribution. Instead, we adopt a Beta(3, 1) prior for ρ_9. This prior would have been reasonable for the time and place in which the original evidence synthesis was constructed, since the distribution of HIV serotypes differs considerably between North America and sub-Saharan Africa (Hemelaar 2012).
The prior p 1 (π 12 ) on the common quantity φ = π 12 is implicitly defined, so its analytic form is unknown, hence it needs to be estimated.

Self-density ratio estimation
We now compute the self-density ratio estimate r̂_WSRE(φ_nu, φ_de) of p_1(φ_nu) / p_1(φ_de). In the notation defined in Sect. 3.4, this example has D = 1, and we use V = W = 7 Gaussian weighting functions. We fix the variance parameter of the weighting functions at σ²_w = 0.08² for all w, and use the heuristic described in Sect. 3.4 to choose values for the mean parameter of each weighting function. Specifically, we set the minimum to ξ_{1,1} = μ_1 = 0.05, the maximum to ξ_{7,1} = μ_7 = 0.8, and place 5 additional, equally spaced values between these extrema. We draw 3000 MCMC samples in total, apportioned equally across the 7 weighting functions; we thus draw 428 post-warmup MCMC samples from each weighted target.

Results
We compare the melded posterior obtained by the naive approach and using WSRE. For a fair comparison, we estimate the prior marginal distribution of interest p_1(φ) using 3000 Monte Carlo samples. This set-up is slightly advantageous for the naive approach, which uses Monte Carlo samples rather than the MCMC samples of the self-density ratio estimate; the naive approach makes use of a sample comprised of 3000 effective samples, whilst the self-density ratio estimate uses fewer than 3000 effective samples. A reference estimate of the melded posterior is obtained using a parametric density estimate p̂_ref,1(φ) for the unknown prior marginal, based on 5 × 10^6 prior samples. The reference sample also contains some error, as p̂_ref,1(φ) is not perfect; however, in the absence of an analytic form for p_1(φ) it serves as a very close approximation. We estimate the melded posterior using the two-stage sampler of Sect. 2.2, targeting

p_meld,1(φ, ψ_1 | Y_1) ∝ p_1(φ, ψ_1, Y_1) / p̂_1(φ)    (9)

in stage one and the full melded posterior in stage two. To demonstrate the numerical instability of interest, we run 24 chains that target p_meld,1 in (9) using the naive approach. The top panel of Fig. 2 displays the trace plot of the post-warmup samples. Many chains have already converged to a spurious mode around φ ≈ 0.02, and other chains jump to this mode after a variable number of additional iterations. As discussed in Sect. 2.4, this mode is an artefact of the naive KDE employed for p̂_1(φ), and is also visible in the corresponding stage two trace plot (bottom panel of Fig. 2).

[Fig. 2 Top: Stage one trace plot for φ using the naive method. At any moment in time chains can jump to the spurious mode, which is an artefact of p̂_1(φ). Bottom: Corresponding stage two trace plot. The stage two target has the same numerical instability, and because the stage one samples are the proposal distribution, all chains encounter the instability.]
Because the stage one samples act as the proposal for stage two, all stage two chains quickly jump to the spurious mode.
The samples surrounding the spurious mode introduce substantial bias in the estimate of the melded posterior under the naive approach. This is visible in the quantile-quantile plot in Fig. 3, where the naive approach produces an implausible estimate compared to the reference quantiles. In contrast, the WSRE approach rectifies the numerical instability, and uses the two-stage sampler to produce a sensible estimate of the melded posterior.

An evidence synthesis to estimate the severity of the H1N1 pandemic
We now consider a more involved example, where the prior for the common quantity does not have an analytical form under either submodel, and the two priors contain a substantially different quantity of information. Presanis et al. (2014) undertook a large evidence synthesis in order to estimate the severity of the H1N1 pandemic amongst the population of England. This model combines independent data on the number of suspected influenza cases in hospital intensive care units (ICUs) into a large severity model. Here, we reanalyse the model introduced in Goudie et al. (2019) that uses Markov melding to join the independent ICU model (m = 1) with a simplified version of the larger, complex severity model (m = 2). In this example the melded model has no obvious implied joint model, so there are no simple "gold standard" joint model estimates to use as a baseline reference. However, we demonstrate that the naive approach is highly unstable, whereas the WSRE approach produces stable results. The code to reproduce all figures and outputs for this example is available at https://github.com/hhau/full-melding-example.

ICU submodel
The data for the ICU submodel (m = 1) are aggregate weekly counts of patients in the ICU of all the hospitals in England, for 78 days between December 2010 and February 2011.
Observations were recorded of the number of children (a = 1) and adults (a = 2) in the ICU on days U = {8, 15, . . . , 78}; we denote a specific weekly observation as y_{a,t} for t ∈ U. To appropriately model the temporal nature of the weekly ICU data we use a time-inhomogeneous, thinned Poisson process with rate parameter λ_{a,t} for t ∈ T, where T = {1, 2, . . . , 78}. This rate is the expected number of new ICU admissions; the corresponding age-group-specific ICU exit rate is μ_a. There is also a discrepancy between the observation times U and the daily support T of our Poisson process, which we address in the observation model through the different supports for t in Eqs. (10) and (11). An identifiability assumption of η_{a,1} = 0 is required, which enforces the reasonable assumption that no H1N1 influenza patients were in the ICU at time t = 0. Weekly virological positivity data are available at weeks V = {1, . . . , 11}, and inform the proportion of influenza cases attributable to the H1N1 virus, π^pos_{a,t}. The virology data consist of the number of H1N1-positive swabs z^pos_{a,v} and the total number of swabs n^pos_{a,v} tested for influenza that week. These counts are related to π^pos_{a,t} via a truncated uniform prior,

π^pos_{a,t} ∼ Unif(ω_{a,v}, 1), t ∈ T,
z^pos_{a,v} ∼ Bin(n^pos_{a,v}, ω_{a,v}), v ∈ V,

with v = 1 for t = 1, 2, . . . , 14, and v = ⌊(t − 1) / 7⌋ for t = 15, 16, . . . , 78, to align the temporal indices. The positivity proportion π^pos_{a,t} is combined with λ_{a,t} to compute a lower bound on the total number of H1N1 cases,

φ_a = Σ_{t∈T} π^pos_{a,t} λ_{a,t},

where φ = (φ_1, φ_2) is the quantity common to both submodels. This summation is a non-invertible function, which necessitates either considering this model in stage one of our two-stage sampler, or appropriately augmenting the definition of φ_a such that it is invertible. We elect to consider this submodel in stage one, and further discuss model ordering in Sect. 6.

Severity submodel
A simplified version of the large severity model (m = 2) of Presanis et al. (2014) is considered here, in which parts of the severity model are collapsed into informative priors. The cumulative number of ICU admissions φ_a is assumed to be an underestimate of the true number of ICU admissions due to H1N1, χ_a. This motivates

φ_a ∼ Bin(χ_a, π_det), π_det ∼ Beta(6, 4), χ_1 ∼ LN(4.93, 0.17²), χ_2 ∼ LN(7.71, 0.23²),

where π_det is the age-group-constant probability of detection, and the priors on χ_a appropriately summarise the remainder of the large severity model.

[Fig. 4 Heatmap of the severity submodel prior p_2(φ), ICU submodel prior p_1(φ), and the stage one (ICU submodel) posterior p_1(φ | Y_1).]

Figure 4 displays p_m(φ) for both submodels, as well as the subposterior for the ICU (m = 1) submodel, p_1(φ | Y_1). The melded posterior will be largely influenced by the product of p_1(φ | Y_1) and p_2(φ), since p_1(φ) is effectively uniform (see the centre panel of Fig. 4), and there are no data observed in the severity submodel, i.e. Y_2 = ∅. In stage one we target the ICU submodel posterior p_1(φ, ψ_1 | Y_1), enabling the use of the original JAGS (Plummer 2018) implementation. The resulting samples of φ are displayed in the right panel of Fig. 4: whilst there is substantial overlap with p_2(φ) (left panel), p_1(φ | Y_1) is more dispersed, particularly for φ_1. Our region of interest is thus the HDR of p_1(φ | Y_1), as the two-stage sampler involves evaluating the samples from p_1(φ | Y_1) under p_2(φ).

Self-density ratio estimation
The stage two acceptance probability for a move from φ to a proposed value φ* depends on the prior marginal self-density ratios p_m(φ) / p_m(φ*). In both the severity and ICU submodels, the prior marginal distribution p_m(φ) is unknown, which necessitates estimating the self-density ratio for both p_1(φ) and p_2(φ). However, the uniformity of p_1(φ) corresponds to a self-density ratio that is effectively 1 everywhere. In contrast, the severity submodel prior marginal p_2(φ) is clearly not uniform over our region of interest; appropriately estimating the melded posterior thus requires an accurate estimate of p_2(φ) / p_2(φ*).
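The naive estimator of this self-density ratio evaluates a single KDE, fitted to samples from the prior marginal, at both arguments. A minimal sketch, with a standard normal standing in for the unknown p_2(φ) (the function name is illustrative, not from the paper):

```python
import numpy as np
from scipy.stats import gaussian_kde, norm

rng = np.random.default_rng(3)

# Stand-in for samples from an unknown prior marginal p(φ); here p = N(0, 1).
samples = rng.standard_normal(50_000)
kde = gaussian_kde(samples)

def self_density_ratio(phi, phi_star):
    """Naive KDE estimate of p(phi) / p(phi_star)."""
    return kde(phi)[0] / kde(phi_star)[0]

# Near the mode the naive estimate is accurate:
est = self_density_ratio(0.0, 1.0)
true = norm.pdf(0.0) / norm.pdf(1.0)   # = exp(1/2)
```

Close to the bulk of the samples this works well; the failure mode, shown next, is in the tails.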

Results
We compare the melded posterior estimate obtained using WSRE against the naive approach. For the latter we draw 10^5 samples from p_2(φ), so that both approaches use the same number of samples, although the naive approach has a larger effective sample size. Figure 5 displays trace plots of 15 stage two MCMC chains, where α(φ*, φ) is computed using the naive approach (left column) and using WSRE (right column). The erroneous behaviour displayed in the left column is due to underestimation of the tails of p_2(φ) by a standard KDE. This underestimation results in an overestimation of the acceptance probability for proposals in the tails of p_2(φ), since the proposal term p_2(φ*) is in the denominator of Eq. (12). Hence, moves to improbable values of φ* have acceptance probabilities that are dominated by Monte Carlo error. Once at such an improbable value, the error has the opposite effect: the underestimate leaves chains unable to move back to probable values. This produces the monotonic step-like behaviour seen in the top left panel of Fig. 5. Although this behaviour is not visible in all 15 chains, it will eventually occur if the chains are run for more iterations, as a sufficiently improbable value of φ* will be proposed. The results from this sampler are thus unstable.
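The mechanism behind the instability is easy to reproduce in miniature. In this sketch a standard normal stands in for p_2(φ); the KDE assigns essentially zero density to a far-tail proposal, so the estimated ratio with that proposal in the denominator explodes relative to the truth:

```python
import numpy as np
from scipy.stats import gaussian_kde, norm

rng = np.random.default_rng(4)

# Stand-in for the prior marginal p_2; a standard normal for illustration only.
kde = gaussian_kde(rng.standard_normal(10_000))

phi, phi_star = 0.0, 6.0           # φ* lies deep in the tail of p_2

tail_true = norm.pdf(phi_star)     # true density at φ*
tail_kde = kde(phi_star)[0]        # KDE estimate: no samples near 6, so ≈ 0

# p_2(φ*) sits in the denominator of the acceptance ratio, so the
# underestimate inflates the estimated ratio p_2(φ) / p_2(φ*).
with np.errstate(divide="ignore"):
    est_ratio = kde(phi)[0] / tail_kde
true_ratio = norm.pdf(phi) / tail_true
```

An acceptance probability built from `est_ratio` is dominated by this Monte Carlo error, which is exactly the step-like pathology visible in the naive trace plots.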
Whilst there is no baseline "truth" to compare to in this example, the sampler that employs the WSRE estimate r̂_{WSRE}(φ, φ*) of p_2(φ) / p_2(φ*) produces plausible results, in contrast to the naive approach. No step-like behaviour is visible when employing the WSRE approach (right column of Fig. 5). Whilst the between-chain mixing is not optimal, this can be ameliorated by running the chains for longer, which cannot be said for the naive method. This improved behaviour is obtained using the same number of samples from the prior marginal distribution, or weighted versions thereof. Users of this algorithm can thus be much more confident that the results are not artefactual.

Discussion
The complexity of many phenomena necessitates intricate, large models. Markov melding allows the practitioner to channel modelling efforts into smaller, simpler submodels, each of which may have data associated with it, then coherently combine these smaller models and disparate data. Multi-stage, sequential sampling methods, such as the sampler used for Markov melding, are important tools for estimating these models in a pragmatic, computationally feasible manner.

Fig. 5 Trace plots of 15 replicate stage two chains for φ_1 and φ_2, using the naive approach (left column) and the WSRE approach (right column)
In particular, when an analytic form of the prior marginal distribution is not available, we have demonstrated that the two-stage sampling process is particularly sensitive to the corresponding KDE in regions of low probability. Tail probability estimation is an important and recurrent challenge in statistics (Hill 1975;Béranger et al. 2019). We addressed this issue in the Markov melding context by noting that we can limit our focus to the self-density ratio estimate, and sample weighted distributions to improve performance in low probability areas, for lower computational cost than simple Monte Carlo. Our examples show that for equivalent sample sizes, we improve the estimation of the melded posterior compared to the naive approach.
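The key property of the weighted-sample construction is that the self-density ratio can be recovered without any normalising constants: if s(φ) = p(φ)w(φ)/Z, then p(x)/p(y) = [s(x)/w(x)] / [s(y)/w(y)] and Z cancels. A toy illustration, not the paper's implementation: with p = N(0, 1) and a Gaussian weighting function w centred in the tail, the weighted distribution s is itself Gaussian, so we can sample it exactly and check the ratio estimate:

```python
import numpy as np
from scipy.stats import gaussian_kde, norm

rng = np.random.default_rng(5)

# Target p = N(0, 1); weighting function w(φ) = N(φ; 3, 1), centred in the tail.
# The weighted distribution s(φ) ∝ p(φ) w(φ) is exactly N(1.5, 1/2) here,
# so we can sample from it directly for this toy example.
s_samples = rng.normal(1.5, np.sqrt(0.5), size=20_000)
s_kde = gaussian_kde(s_samples)

def wsre_ratio(x, y):
    """Estimate p(x) / p(y) from weighted samples; the normalising
    constant of s cancels because s(φ) = p(φ) w(φ) / Z."""
    num = s_kde(x)[0] / norm.pdf(x, loc=3.0, scale=1.0)
    den = s_kde(y)[0] / norm.pdf(y, loc=3.0, scale=1.0)
    return num / den

est = wsre_ratio(0.0, 3.0)            # estimate of p(0) / p(3)
true = norm.pdf(0.0) / norm.pdf(3.0)  # = exp(9/2) ≈ 90
```

Because the weighting function places samples between the two evaluation points, both densities are estimated where the KDE has support, avoiding the naive estimator's tail failure.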
The issue addressed in this paper arises due to differences in the intermediary distributions of the two-stage sampling process, particularly where the proposal distribution is wider than the target distribution. The presence or absence of this issue depends upon the order in which the components of the melded model are considered in the sampling process, which is often constrained by the link function used to define φ in each model. In both our examples the link function is non-invertible. Goudie et al. (2019) show that extensions of the link function that render it invertible are valid; that is, the model is theoretically invariant to the choice of extension. However, the practical performance of the two-stage sampler is heavily dependent on the appropriateness of such extensions, and designing them is extremely challenging. Hence, the ordering of the submodels in the two-stage sampler is often predetermined; we are practically constrained by the non-invertible link function. In our examples this corresponds to sampling the less informative model for φ first. If we are free to choose the ordering of the two-stage sampler, we may still prefer to sample the wider model first, as the melded posterior is more likely to be captured in a reweighted sample from a wide distribution than in such a sample from a narrow distribution. However, if the melded posterior distribution is substantially narrower than the stage-one target distribution then we are susceptible to the sample degeneracy and impoverishment problem (Li et al. 2014). Addressing this issue in the melding context, whilst retaining the computational advantages of the two-stage sampler, is an avenue for future work.
The examples we consider contain a one- or two-dimensional φ. For higher dimensional φ we anticipate encountering issues associated with the curse of dimensionality. Specifically, the accuracy of any KDE decreases, and the required number of weighting functions increases, exponentially with dimension. Applying the argument in Sect. 3.4 to locate these additional weighting functions will be challenging. As such we recommend WSRE, like other KDE methods, for settings where dim(φ) ≤ 5 (Wand and Jones 1995). This requirement may be relaxed when there is structure in φ that allows it to be split into lower-dimensional components, such as when φ contains a collection of subject-specific parameters that are independent a priori. More generally, in high dimensions almost everywhere is a 'region of low probability' and the performance of KDEs is known to be poor, making it difficult to choose both an appropriate number of weighting functions and their parameters. Machine learning methods have proven to be effective for estimating densities of moderate to high dimension (see Wang and Scott (2019) for a review); however, the performance of these methods in low probability regions has not, to our knowledge, been thoroughly investigated.
There are potential alternatives to our weighted-sample self-density ratio estimation technique. Umbrella sampling (Torrie and Valleau 1977; Matthews et al. 2018) aims to accurately estimate the tails of a density p(φ) by constructing an estimate p̂(φ) from W sets of weighted samples {φ_{n,w}}_{n=1}^{N} ∼ s_w(φ; ξ_w). However, umbrella sampling requires estimates of the normalising constants Z_{2,w} = ∫ s_w(φ; ξ_w) dφ to combine the density estimates computed from each weighted sample. Our approach avoids computing normalising constants by focusing on the self-density ratio. Umbrella sampling also requires choosing the location of the weighting functions, i.e. choosing ξ_w appropriately. A heuristic strategy, similar to that of Sect. 3.4, is seen as necessary by Torrie and Valleau (1977). Adaptive procedures that automatically choose values of ξ_w based on other criteria exist, but these assume that s_w(φ) is a Gaussian distribution (Mitsuta et al. 2018) or operate on a predefined grid of possible values (Wojtas-Niziurski et al. 2013). We cannot use the generic tempering methodology advocated by Matthews et al. (2018), as sampling from p(φ, γ)^{1/τ}, for τ > 1, does not generally produce marginal samples from p(φ)^{1/τ}. Another possibility would be to sample p_meld using a pseudo-marginal approach (Andrieu and Roberts 2009). A necessary condition of the pseudo-marginal approach is that we possess an unbiased estimate of the target distribution, but kernel density estimation produces biased estimates of p(φ) for finite N. A KDE can be debiased (Calonico et al. 2018; Cheng and Chen 2019), but doing so requires substantial computational effort. Moreover, we also require an unbiased estimate of 1 / p(φ). Debiasing estimates of 1 / p(φ) is possible with pseudo-marginal methods like Russian roulette (Lyne et al. 2015), but Park and Haran (2018) observe prohibitive computational costs when doing so.
The presence of both p_pool(φ) and 1 / p(φ) in the melded posterior further complicates the production of an unbiased estimate, particularly when p_pool(φ) is formed via logarithmic pooling.