Stochastic approximation cut algorithm for inference in modularized Bayesian models

Bayesian modelling enables us to accommodate complex forms of data and make comprehensive inferences, but the effect of partial misspecification of the model is a concern. One approach in this setting is to modularize the model and prevent feedback from suspect modules, using a cut model. After observing data, this leads to the cut distribution, which normally does not have a closed form. Previous studies have proposed algorithms to sample from this distribution, but these algorithms have unclear theoretical convergence properties. To address this, we propose a new algorithm called the stochastic approximation cut (SACut) algorithm as an alternative. The algorithm consists of two parallel chains: the main chain targets an approximation to the cut distribution, while the auxiliary chain is used to form an adaptive proposal distribution for the main chain. We prove convergence of the samples drawn by the proposed algorithm and present the exact limit. Although SACut is biased, since the main chain does not target the exact cut distribution, we prove that this bias can be reduced geometrically by increasing a user-chosen tuning parameter. In addition, parallel computing can easily be adopted for SACut, which greatly reduces computation time.

• Reasonable overlap: For neighbouring $\phi_0^{(i)}$ and $\phi_0^{(j)}$, we require that $p(\theta|Y, \phi_0^{(i)})$ and $p(\theta|Y, \phi_0^{(j)})$ have a reasonable overlap. Hence, the probability of accepting a new proposal of $\phi \in \Phi_0$ is reasonably large given a fixed $\theta$ for the auxiliary chain. This ensures that the auxiliary chain can mix well.
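As a practical diagnostic for this overlap condition, one can compare MCMC draws of $\theta$ made at neighbouring grid points. The sketch below is our own illustration, not part of the SACut specification: it uses a histogram-based overlap coefficient on a scalar component of $\theta$, with synthetic draws standing in for the two conditional posteriors.

```python
import numpy as np

def marginal_overlap(samples_i, samples_j, bins=50):
    """Histogram overlap coefficient between two 1-D sample sets, in [0, 1].

    Values near 0 mean p(theta | Y, phi_0^(i)) and p(theta | Y, phi_0^(j))
    barely overlap, so moves between these grid points would rarely be
    accepted by the auxiliary chain.
    """
    lo = min(samples_i.min(), samples_j.min())
    hi = max(samples_i.max(), samples_j.max())
    edges = np.linspace(lo, hi, bins + 1)
    p, _ = np.histogram(samples_i, bins=edges, density=True)
    q, _ = np.histogram(samples_j, bins=edges, density=True)
    return np.sum(np.minimum(p, q)) * (edges[1] - edges[0])

# Stand-ins for draws of theta given two neighbouring grid points.
rng = np.random.default_rng(0)
theta_i = rng.normal(0.0, 1.0, 5000)
theta_j = rng.normal(0.3, 1.0, 5000)
print(f"overlap coefficient: {marginal_overlap(theta_i, theta_j):.2f}")
```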
The grid $\Phi_0$ is chosen by following the Max-Min procedure (Liang et al., 2016); the purpose is to use this discrete set as a representation of the domain of $p(\phi|Z)$. After we have decided the number of auxiliary parameters (that is, $m$), the auxiliary parameter set $\Phi_0$ is formed by selecting $\phi$ from a larger set $\Phi^{(M)} = \{\phi^{(1)}, \cdots, \phi^{(M)}\}$ (for an arbitrary but substantially larger $M > m$) drawn from the marginal posterior $p(\phi|Z)$ by any standard MCMC algorithm. Before starting the iterative process, we standardize all $\phi^{(i)}$, $i = 1, \cdots, M$ (i.e., $\phi_{\mathrm{new}} = (\phi - \phi_{\min})/(\phi_{\max} - \phi_{\min})$). We then build the grid by the following iterative process, initialised by randomly selecting one draw as the first grid point $\Phi_0^{(1)}$. Suppose at the $k$th step we have the current grid $\Phi_0^{(k)}$; for each $\phi$ that has not yet been selected, we calculate its minimum distance to the set $\Phi_0^{(k)}$ according to some pre-defined distance measure (e.g., Euclidean distance). This is the 'Min' process. We then find the $\phi$ that has not been selected but has the maximum such distance to the set $\Phi_0^{(k)}$. This is the 'Max' process. We then add this particular $\phi$ into $\Phi_0^{(k)}$ to form $\Phi_0^{(k+1)}$, and repeat until $m$ points have been selected.

Here we use a toy example to illustrate the procedure for selecting $m$. To make clear visualization of $\phi$ straightforward, we use the example derived from Section 4.1 in the main text but reduce the dimension of $\phi$ to 2.
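Before walking through the toy example, note that the Max-Min step just described is easy to implement directly. A minimal Python sketch, assuming the standardized benchmark draws are stored as rows of an array `phi` and using Euclidean distance (the function and variable names are ours):

```python
import numpy as np

def max_min_grid(phi, m, seed=0):
    """Select m grid points from the M benchmark draws by Max-Min.

    phi : (M, d) array of standardized draws from p(phi | Z).
    Returns the indices of the selected grid Phi_0.
    """
    rng = np.random.default_rng(seed)
    selected = [int(rng.integers(phi.shape[0]))]     # random first point
    # 'Min' step: distance from every draw to the current grid.
    d_min = np.linalg.norm(phi - phi[selected[0]], axis=1)
    for _ in range(m - 1):
        k = int(np.argmax(d_min))                    # 'Max' step
        selected.append(k)
        d_min = np.minimum(d_min, np.linalg.norm(phi - phi[k], axis=1))
    return np.array(selected)

# Toy benchmark draws in 2 dimensions, standardized to [0, 1].
phi = np.random.default_rng(1).normal(size=(2000, 2))
phi = (phi - phi.min(axis=0)) / (phi.max(axis=0) - phi.min(axis=0))
grid_idx = max_min_grid(phi, m=50)
```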
We first draw a large number ($M$) of samples of $\phi$ from its marginal posterior $p(\phi|Z)$ by a standard MCMC method and pool all samples together as a benchmark sample set. This step is feasible because $p(\phi|Z)$ is a standard posterior distribution, so there is no double intractability. If $M$ is large enough, it is appropriate to regard the convex hull of the benchmark sample set as a good approximation of the true domain of $p(\phi|Z)$.
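Continuing the sketch above, the hull of a candidate grid can be compared against the benchmark hull; the coverage ratio $R_m$ made precise in the next paragraph is the natural summary. The version below uses exact hull volumes from scipy (standing in for the R package geometry used in the paper), which is feasible in low dimensions; Monte Carlo approximation is the general fallback.

```python
from scipy.spatial import ConvexHull

def coverage_ratio(phi, grid_idx):
    """Hull volume of the selected grid relative to the benchmark hull.

    The grid points are a subset of the benchmark draws, so their hull
    is contained in the benchmark hull and the ratio lies in (0, 1].
    """
    return ConvexHull(phi[grid_idx]).volume / ConvexHull(phi).volume

for m in (10, 20, 50, 100, 500):
    idx = max_min_grid(phi, m)          # from the sketch above
    print(m, round(coverage_ratio(phi, idx), 3))
```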
Next, we select several candidate values of $m$. In this illustration, we simply select $m$ from $\{10, 20, 50, 100, 500\}$. Given a selected value of $m$, we use the Max-Min procedure (Liang et al., 2016) to construct $\Phi_0^{(m)}$, and calculate its corresponding convex hull $C_{\Phi_0^{(m)}}$ numerically using the R package geometry. See the attached Figure 1. It can be seen that the shape of the convex hull approximates the benchmark convex hull increasingly accurately as $m$ increases. Let $C_{\Phi^{(M)}}$ be the benchmark convex hull, and define the coverage ratio $R_m$ by
$$R_m = \frac{\mu\big(C_{\Phi_0^{(m)}}\big)}{\mu\big(C_{\Phi^{(M)}}\big)},$$
where $R_m$ can be numerically approximated using Monte Carlo. The value of $m$ can be selected by choosing the smallest $m$ that gives a high coverage ratio (e.g., $R_m > 0.95$). In this example, since the coverage ratio is $> 0.98$, we choose $m = 50$ to satisfy the first condition (full representation). We then check whether this value of $m$ also satisfies the second condition (reasonable overlap). This involves checking that the $m$ different conditional posteriors $p(\theta|Y, \phi_0^{(i)})$, $i = 1, \dots, m$, overlap sufficiently. To do this, for each $\phi_0^{(i)}$, we draw samples of $\theta$ using a standard MCMC method and compare the empirical distributions of $\theta \sim p(\theta|Y, \phi_0^{(i)})$ across neighbouring grid points.

In general, $m$ grows with the dimension of $\phi$. However, the exact relationship between them depends on the context of the real problem. Also, a large $m$ is not a necessary condition for the proposed theorems to be theoretically valid. This is because $m$ is involved in the construction of the proposal distribution for the importance sampling procedure that forms the numerator and denominator of $P^*_n(\theta|Y, \phi)$, and we could use any proposal distribution provided it has the correct domain, although a proposal distribution based on a smaller $m$ could lead to slow convergence of the auxiliary chain. In the special case where the marginal cut distribution $p(\theta|Y, \phi)$ is not sensitive to changes in $\phi$, a small $m$ might be good enough. A larger $m$ brings fewer benefits once the two conditions hold in practice; this differs significantly from the requirement that $n$ go to infinity. Hence, increasing $m$ has a diminishing marginal utility after the two conditions hold. Notwithstanding, when the dimension of $\phi$ increases, it will be more difficult to check whether the two conditions hold, because the numerical calculation of the convex hull becomes extremely time-consuming and checking for overlap visually becomes infeasible. A more practical approach could be simply checking whether the auxiliary chain converges well within a reasonable time period by running some preliminary trials, as suggested in Liang et al. (2016).

B Construction of $P^*_n(\theta|Y, \phi)$

Given a particular $\phi'$, the measure $P^*_n(\theta|Y, \phi')$ is actually a weighted average of $1_{\{\theta \in B\}}$. It is used to approximate $\int_B p(Y|\theta, \phi')\, p(\theta)\, \mathrm{d}\theta \,/\, p(Y|\phi')$, where the numerator and denominator are separately approximated by dynamic importance sampling. The only difference between the two is the domain of integration (i.e., $B$ versus $\Theta$), so here we only show the numerator. We have
$$\int_B p(Y|\theta, \phi')\, p(\theta)\, \mathrm{d}\theta = K(\phi') \int_B \frac{p(Y|\theta, \phi')\, p(\theta)}{K(\phi')}\, \mathrm{d}\theta = K(\phi')\, \mathbb{E}\big[ 1_{\{\theta \in B\}} \big],$$
where the expectation is with respect to $\theta$ drawn from the normalized target and $K(\phi')$ is an intractable normalizing constant of the target distribution. Hence, we cannot sample $\theta$ from $p(Y|\theta, \phi')\, p(\theta)/K(\phi')$ directly. We have shown in the main text that, when the iteration number $j$ is large enough, we actually sample $(\tilde\theta, \tilde\phi)$ from an iteration-specific proposal distribution, whose density we write here as $q_j$. The above expectation can be approximated by the step-$j$ single-sample Monte Carlo estimator
$$\frac{1}{K(\phi')}\, \frac{p(Y|\tilde\theta_j, \phi')\, p(\tilde\theta_j)}{q_j(\tilde\theta_j, \tilde\phi_j)}\, 1_{\{\tilde\theta_j \in B\}}.$$
Note that $K(\phi')$ will be canceled out in the numerator and denominator of measure (5), so we can omit it.
Now that we have a total of $n$ samples $\{(\tilde\theta_j, \tilde\phi_j)\}_{j=1}^n$, we add all the single-sample Monte Carlo estimators and calculate the average
$$\frac{1}{n} \sum_{j=1}^n \frac{1}{K(\phi')}\, \frac{p(Y|\tilde\theta_j, \phi')\, p(\tilde\theta_j)}{q_j(\tilde\theta_j, \tilde\phi_j)}\, 1_{\{\tilde\theta_j \in B\}}.$$
Similarly, the factor $1/n$ will be canceled out.
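As a computational illustration of this construction, the sketch below estimates such a ratio by self-normalized importance sampling. It is a simplification of SACut's dynamic scheme: the step-specific proposal densities $q_j$ are assumed to be evaluable, and all names (`p_star_estimate`, `log_target`, `log_q`, `in_B`) are ours.

```python
import numpy as np

def p_star_estimate(theta_tilde, log_target, log_q, in_B):
    """Self-normalized importance-sampling estimate of P*_n(B | Y, phi').

    theta_tilde : (n,) or (n, d) array of auxiliary draws theta~_j;
    log_target  : callable, log of the unnormalized target
                  p(Y | theta, phi') p(theta) -- K(phi') is never needed;
    log_q       : (n,) log proposal densities, one per step j;
    in_B        : boolean (n,) array, indicator that theta~_j lies in B.
    Both K(phi') and the factor 1/n cancel in the ratio, as in the text.
    """
    log_w = np.array([log_target(t) for t in theta_tilde]) - log_q
    w = np.exp(log_w - log_w.max())          # stabilized weights
    return float(np.sum(w * in_B) / np.sum(w))

# Toy check: unnormalized N(0,1) target, N(0, 2^2) proposal, B = (-inf, 0].
rng = np.random.default_rng(0)
draws = rng.normal(0.0, 2.0, 10_000)
est = p_star_estimate(
    draws,
    log_target=lambda t: -0.5 * t**2,        # normalizing constant omitted
    log_q=-0.5 * (draws / 2.0) ** 2 - np.log(2.0 * np.sqrt(2.0 * np.pi)),
    in_B=draws <= 0.0,
)
print(est)   # close to 0.5
```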

C Density Function Approximation by Simple Function
Here, we show how a density function $f$ can be approximated by a simple function that is constant on each hypercube. We show that the degree of approximation can be easily controlled, and is dependent on the gradient of $f$. The use of a simple function to approximate a density function has been discussed previously (Fu and Wang, 2002; Malefaki and Iliopoulos, 2009), but here we use a different partition of the support of the function, determined by rounding to a user-specified number of decimal places.

For a compact set $\Psi \subset \mathbb{R}^d$ of dimension $d$, define a map $R_\kappa : \Psi \to \Psi$ that rounds every element of $\psi \in \Psi$ to $\kappa$ decimal places, where $\kappa \in \mathbb{Z}$, as $R_\kappa(\psi) = \lfloor 10^{\kappa}\psi + 0.5 \rfloor / 10^{\kappa}$. Since $\Psi$ is compact, $R_\kappa(\Psi)$ is a finite set and we let $R_\kappa$ also denote its cardinality. We partition $\Psi$ in terms of (partial) hypercubes $\Psi_r$ whose centres $\psi_r \in R_\kappa(\Psi)$ are the rounded elements of $\Psi$, and the boundary set $\bar{\Psi}_\kappa$:
$$\Psi_r = \big\{ \psi \in \Psi : \|\psi - \psi_r\|_\infty \le \tfrac{1}{2}\, 10^{-\kappa} \big\}, \quad r = 1, \dots, R_\kappa, \qquad (2)$$
$$\bar{\Psi}_\kappa = \bigcup_{r \neq r'} \big( \Psi_r \cap \Psi_{r'} \big). \qquad (3)$$
It is clear that $\bigcup_{r=1}^{R_\kappa} \Psi_r = \Psi$. Hence $\{\Psi_r \setminus \bar{\Psi}_\kappa\}_{r=1}^{R_\kappa}$ and $\bar{\Psi}_\kappa$ form a partition of $\Psi$. Using this partition, we are able to construct a simple function density that approximates a density function. Let $C$ be the set of all continuous and integrable probability density functions $f : \Psi \to \mathbb{R}$, and let $S$ be the set of all simple functions $f : \Psi \to \mathbb{R}$. Define a map $S_\kappa : C \to S$ for any $f \in C$ as
$$S_\kappa(f)(\psi) = \sum_{r=1}^{R_\kappa} \frac{\int_{\Psi_r} f(\psi')\, \mathrm{d}\psi'}{\mu(\Psi_r)}\, 1_{\{\psi \in \Psi_r \setminus \bar{\Psi}_\kappa\}}.$$
The sets $\Psi_r$, $r = 1, \dots, R_\kappa$, are the level sets of the simple function approximation, and the value $S_\kappa(f)(\psi)$, $\psi \in \Psi \setminus \bar{\Psi}_\kappa$, is the (normalized) probability of a random variable with density $f$ taking a value in $\Psi_r$, $r = 1, \dots, R_\kappa$. Note that, when $\Psi_r$ is a full hypercube, $\mu(\Psi_r) = 10^{-d\kappa}$; and if the set $\Psi$ is known, then $\mu(\Psi_r)$ is obtainable for partial hypercubes. Figure 3 illustrates how this simple function approximates the truncated standard normal density function $f_{\mathrm{norm}} : [-4, 4] \to \mathbb{R}$, for $\kappa = 0$ and $\kappa = 1$. Note that this is the optimal simple function for the approximation in terms of Kullback–Leibler divergence (Malefaki and Iliopoulos, 2009).
Hence, $S_\kappa(f)$ is a well-defined density function.
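As a concrete illustration, the following sketch computes $S_\kappa(f)$ for the truncated standard normal density on $[-4, 4]$ used in Figure 3. The implementation choices are ours: cell integrals come from CDF differences, and boundary cells are clipped to the support, so partial cells get the correct measure $\mu(\Psi_r)$.

```python
import numpy as np
from scipy.stats import norm

def simple_function_density(f_cdf, lo, hi, kappa):
    """Piecewise-constant (simple function) approximation on [lo, hi].

    Cells are centred on values rounded to kappa decimal places; the
    value on each cell is its probability divided by its measure.
    """
    h = 10.0 ** (-kappa)                       # cell side length
    centres = np.round(np.arange(lo, hi + h / 2, h), kappa)
    left = np.clip(centres - h / 2, lo, hi)    # partial cells at the ends
    right = np.clip(centres + h / 2, lo, hi)
    mass = f_cdf(right) - f_cdf(left)          # integral of f over each cell
    return centres, mass / (right - left)      # average density per cell

# Truncated standard normal on [-4, 4], as in Figure 3.
Z = norm.cdf(4.0) - norm.cdf(-4.0)
f_cdf = lambda x: (norm.cdf(x) - norm.cdf(-4.0)) / Z
centres, values = simple_function_density(f_cdf, -4.0, 4.0, kappa=1)
```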
We have the following theorem: for any $\psi \in E$ and any $\epsilon > 0$, there exists $N \in \mathbb{Z}$ such that, for all $\kappa > N$,
$$\big| S_\kappa(f)(\psi) - f(\psi) \big| < \epsilon.$$
When the density function $f$ is also continuously differentiable, we can obtain the following result about the rate of convergence.
In addition, the global convergence holds:
$$\sup_{\psi \in E} \big| S_\kappa(f)(\psi) - f(\psi) \big| \le \sqrt{d}\, 10^{-\kappa} \sup_{\psi \in \Psi} \|\nabla f(\psi)\|_2 \longrightarrow 0 \quad \text{as } \kappa \to \infty.$$

Proof. Following the result of Theorem 1, for a given $\psi \in E$, we have
$$\big| S_\kappa(f)(\psi) - f(\psi) \big| \le f_{\psi,\max} - f_{\psi,\min},$$
where $f_{\psi,\max}$ and $f_{\psi,\min}$ are the maximum and minimum of $f$ over the partition element $\Psi^{(\kappa)}_\psi$ containing $\psi$. Since $f$ has a continuous gradient on a compact set, by the mean value theorem we have
$$f_{\psi,\max} - f_{\psi,\min} = \big| \langle \nabla f(y),\, \psi_{\max} - \psi_{\min} \rangle \big|,$$
where $\langle \cdot, \cdot \rangle$ denotes the inner product, $f(\psi_{\max}) = f_{\psi,\max}$, $f(\psi_{\min}) = f_{\psi,\min}$, and $y \in \Psi^{(\kappa)}_\psi$. By the Cauchy–Schwarz inequality, we have
$$f_{\psi,\max} - f_{\psi,\min} \le \|\nabla f(y)\|_2\, \|\psi_{\max} - \psi_{\min}\|_2.$$

Now we prove the local convergence result. Since $\nabla f$ is continuous on the $d$-dimensional compact set $\Psi$, we can write
$$\varepsilon(\psi, \kappa) := \sup_{y \in \Psi^{(\kappa)}_\psi} \|\nabla f(y) - \nabla f(\psi)\|_2.$$
Since $\mu(\Psi^{(\kappa)}_\psi) \to 0$, it is easy to check that $\varepsilon(\psi, \kappa) \to 0$ when $\kappa \to \infty$. Moreover, both $\psi_{\max}$ and $\psi_{\min}$ are in the set $\Psi^{(\kappa)}_\psi$, and we have
$$\sup_{a, b \in \Psi^{(\kappa)}_\psi} \|a - b\|_2 \le \sqrt{d}\, 10^{-\kappa}.$$
Then by the triangle inequality, $\|\nabla f(y)\|_2 \le \|\nabla f(\psi)\|_2 + \varepsilon(\psi, \kappa)$, and hence
$$\big| S_\kappa(f)(\psi) - f(\psi) \big| \le \big( \|\nabla f(\psi)\|_2 + \varepsilon(\psi, \kappa) \big) \sqrt{d}\, 10^{-\kappa}.$$

Now we prove the global convergence result. Since $\nabla f$ is continuous on the compact set $\Psi$, $\|\nabla f\|_2$ is bounded. We have
$$\big| S_\kappa(f)(\psi) - f(\psi) \big| \le \sqrt{d}\, 10^{-\kappa} \sup_{\psi \in \Psi} \|\nabla f(\psi)\|_2, \quad \text{for all } \psi \in E.$$
Note that this means $|S_\kappa(f)(\psi) - f(\psi)|$ is uniformly bounded. Hence, it implies the global convergence result.

Corollary 1 shows that the rate of convergence of $S_\kappa(f)$ to $f$ is geometric. It states that (a) for any $\psi \in E$, the rate of convergence is locally controlled by its gradient $\|\nabla f(\psi)\|_2$; and (b) the rate of convergence is uniformly controlled by the upper bound of the gradient. Hence, as is intuitively expected, convergence is faster if the target function $f$ has a smaller total variation on the set $E$.
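Under the same assumptions as the sketch above (reusing `simple_function_density`, `f_cdf`, and `Z`), the geometric rate in Corollary 1 can be checked numerically; for the truncated standard normal, $\sup_{\psi} \|\nabla f(\psi)\|_2 = 1/(Z\sqrt{2\pi e})$.

```python
# Sup-norm error at the cell centres versus the geometric bound
# sqrt(d) * 10^(-kappa) * sup |f'|, here with d = 1.
f_pdf = lambda x: norm.pdf(x) / Z
sup_grad = 1.0 / (Z * np.sqrt(2.0 * np.pi * np.e))
for kappa in (0, 1, 2, 3):
    centres, values = simple_function_density(f_cdf, -4.0, 4.0, kappa)
    err = np.max(np.abs(values - f_pdf(centres)))
    print(kappa, err, sup_grad * 10.0 ** (-kappa))   # err stays below bound
```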
Remark 1. When the scale of each component of $\psi \in \Psi$ is not the same, a more complex partition can be formed by choosing component-specific precision parameters $\boldsymbol{\kappa} = (\kappa_1, \dots, \kappa_d)$. Denoting by $\bullet$ the Hadamard product and writing $10^{\pm\boldsymbol{\kappa}} := (10^{\pm\kappa_1}, \dots, 10^{\pm\kappa_d})$, we redefine
$$R_{\boldsymbol{\kappa}}(\psi) = \lfloor 10^{\boldsymbol{\kappa}} \bullet \psi + 0.5 \rfloor \bullet 10^{-\boldsymbol{\kappa}},$$
and we build a (partial) $d$-orthotope around each $\psi_r \in R_{\boldsymbol{\kappa}}(\Psi)$ in the same way as before. We do not discuss this more complex partition further, but all results in this paper that are based on the basic partition in (2) and (3) can be easily extended to it.

D.1 Proof of Lemma 1
We write the explicit form of $p^{(\kappa)}(\theta|Y, \phi)$. Thus, using Equation 10 from the main text, it is clear that the stated equality holds for every $\theta$ outside the boundary set $\bar{\Theta}_\kappa$. Since $\mu(\bar{\Theta}_\kappa) = 0$, we are done.

D.2 Proof of Theorem 1
The theorem naturally holds when $n = 1$, so we consider the case $n \ge 2$. Since the dimension $d$ and precision parameter $\kappa$ are known and fixed, we suppose that the parameter space $\Theta$ is equally partitioned, the total number of $d$-orthotopes is $R_\kappa$, and each orthotope is indexed as $\Theta_r$, $r = 1, \dots, R_\kappa$. Since we suppose that the auxiliary chain has converged before we start collecting the auxiliary variables $\tilde\theta$, by equation 6 in the main text we can write the probability of the original proposal distribution $P^*_n$ taking a value in each partition component $\Theta_r$ as the integral with respect to the target distribution $p(\theta|Y, \phi)$:
$$W_\infty(\Theta_r | Y, \phi) = \int_{\Theta_r} p(\theta|Y, \phi)\, \mathrm{d}\theta, \quad r = 1, \dots, R_\kappa.$$
Now we define binary random variables $I_r$, $r = 1, \dots, R_\kappa$, as
$$I_r = \begin{cases} 1 & \text{if orthotope } r \text{ is never visited by the auxiliary variables } \tilde\theta_i,\ i = 1, \dots, n; \\ 0 & \text{otherwise.} \end{cases}$$
We then have $E(I_r) = (1 - W_\infty(\Theta_r | Y, \phi))^n$, so the expected number of orthotopes visited is
$$E\big|\Theta^{(\kappa)}_n\big| = R_\kappa - \sum_{r=1}^{R_\kappa} \big( 1 - W_\infty(\Theta_r | Y, \phi) \big)^n.$$
By the method of Lagrange multipliers, writing $W_r := W_\infty(\Theta_r | Y, \phi)$ for brevity, we write the Lagrange function as
$$\mathcal{L}(W_1, \dots, W_{R_\kappa}, \lambda) = R_\kappa - \sum_{r=1}^{R_\kappa} (1 - W_r)^n + \lambda \Big( \sum_{r=1}^{R_\kappa} W_r - 1 \Big).$$
Taking first-order partial derivatives, we have
$$\frac{\partial \mathcal{L}}{\partial W_r} = n (1 - W_r)^{n-1} + \lambda = 0, \quad r = 1, \dots, R_\kappa.$$
These equations hold when $W_r = 1/R_\kappa$, $r = 1, \dots, R_\kappa$. We now consider the second-order derivatives:
$$\frac{\partial^2 \mathcal{L}}{\partial W_r^2} = -n(n-1)(1 - W_r)^{n-2} < 0, \qquad \frac{\partial^2 \mathcal{L}}{\partial W_r \partial W_{r'}} = 0, \quad r \neq r'.$$
Hence the Hessian matrix is negative definite, and $E|\Theta^{(\kappa)}_n|$ achieves its maximum when $W_r = 1/R_\kappa$, $r = 1, \dots, R_\kappa$. If we additionally require this to hold for every precision parameter $\kappa$, the target distribution $p(\theta|Y, \phi)$ has to be the uniform distribution.
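The maximizer can also be checked numerically. A small sketch (names ours): it compares $E|\Theta^{(\kappa)}_n| = R_\kappa - \sum_r (1 - W_r)^n$ under the uniform allocation with a randomly perturbed allocation.

```python
import numpy as np

def expected_visited(W, n):
    """Expected number of orthotopes visited after n auxiliary draws."""
    return len(W) - np.sum((1.0 - W) ** n)

rng = np.random.default_rng(0)
R_kappa, n = 100, 200
W_uniform = np.full(R_kappa, 1.0 / R_kappa)
W_skewed = rng.dirichlet(0.5 * np.ones(R_kappa))   # non-uniform, sums to 1

print(expected_visited(W_uniform, n))   # the maximum over allocations
print(expected_visited(W_skewed, n))    # strictly smaller
```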
Since the support of $p^{(\kappa)}_n$ is $\Theta$, we have $g(\phi) > 0$ for all $\phi \in \Phi$. In addition, since each element of $W_n(\phi)$ is a continuous function on the compact set $\Phi$ (see equations 5 and 9 in the main text), $g(\phi)$ is also a continuous function on $\Phi$. Since $\Phi$ is compact, $g(\phi)$ attains its minimum $\varepsilon = \min_{\phi \in \Phi} g(\phi) > 0$.

Thus $p^{(\kappa)}_n(\theta|Y, \phi) > \varepsilon$ for all $\theta \in \Theta$ and $\phi \in \Phi$, and local positivity holds. By the same reasoning, this is also true for the proposal distribution with density $p^{(\kappa)}$. The definition can be extended to cases with more than two measures in a natural way.

D.6 Proof of Lemma 4
Given the filtration $\mathcal{G}_n$, the transition kernels $U^{(1)}$ and $V^{(1)}_n$ both admit an irreducible and aperiodic Markov chain by assumption. Therefore, to prove that the transition kernel $T^{(1)}_n$ has the same property, it suffices to prove that, for any $s \in \mathbb{N}$, $(\theta_0, \phi_0) \in \Theta \times \Phi$, and Borel set $B$ with $V^{(s)}_n(B) > 0$, we have $T^{(s)}_n(B \mid \theta_0, \phi_0, \mathcal{G}_n) > 0$. We prove this by mathematical induction.
The claim holds at step $s = 1$ by assumption; suppose it holds at step $s = s^*$. We need to show that it also holds at step $s = s^* + 1$. For an initial value $(\theta_0, \phi_0)$, consider a Borel set $B$ such that $V^{(s^*+1)}_n(B) > 0$. We proceed by contradiction. Suppose that
$$T^{(s^*+1)}_n(B \mid \theta_0, \phi_0, \mathcal{G}_n) = \int_{\Theta \times \Phi} T^{(1)}_n(B \mid x, \mathcal{G}_n)\, T^{(s^*)}_n(\mathrm{d}x \mid \theta_0, \phi_0, \mathcal{G}_n) = 0.$$
This implies that the function $T^{(1)}_n(B \mid \cdot, \mathcal{G}_n) = 0$ almost surely with respect to the measure $T^{(s^*)}_n$. Because the induction assumption holds at step $s^*$, any Borel set with positive $V^{(s^*)}_n$-measure also has positive $T^{(s^*)}_n$-measure, so the function $T^{(1)}_n(B \mid \cdot, \mathcal{G}_n) = 0$ almost surely with respect to the measure $V^{(s^*)}_n$. This further implies that the function $V^{(1)}_n(B \mid \cdot, \mathcal{G}_n) = 0$ almost surely with respect to the measure $V^{(s^*)}_n$, whence $V^{(s^*+1)}_n(B) = \int V^{(1)}_n(B \mid x, \mathcal{G}_n)\, V^{(s^*)}_n(\mathrm{d}x) = 0$. It is clear that this contradicts the fact that $V^{(s^*+1)}_n(B) > 0$. Hence, we are done.