Product-form estimators: exploiting independence to scale up Monte Carlo

We introduce a class of Monte Carlo estimators that aim to overcome the rapid growth of variance with dimension often observed for standard estimators by exploiting the target's independence structure. We identify the most basic incarnations of these estimators with a class of generalized U-statistics and thus establish their unbiasedness, consistency, and asymptotic normality. Moreover, we show that they achieve the minimum possible variance amongst a broad class of estimators, and we investigate their computational cost and delineate the settings in which they are most efficient. We exemplify the merger of these estimators with other well-known Monte Carlo estimators so as to better adapt the latter to the target's independence structure and improve their performance. We do this via three simple mergers: one with importance sampling, another with importance sampling squared, and a final one with pseudo-marginal Metropolis-Hastings. In all cases, we show that the resulting estimators are well-founded and achieve lower variances than their standard counterparts. Lastly, we illustrate the various variance reductions through several examples.


Introduction
Monte Carlo methods are sometimes said to overcome the curse of dimensionality because, regardless of the target's dimension, their rates of convergence are square root in the number of samples drawn. In practice, however, one encounters several problems when computing high-dimensional integrals using Monte Carlo, prominent among which is the issue that the constants present in the convergence rates typically grow rapidly with the target's dimension. Hence, even if we are able to draw independent samples from a high-dimensional target, the number of samples necessary to obtain estimates of a satisfactory accuracy is often prohibitively large [63,64,9,1]. However, many of these targets possess strong independence structures (e.g. see [26,25,39,36,11] and the many references therein). In this paper, we investigate whether the rapid growth of these constants can be mitigated by exploiting these structures.
Variants of the following toy example are sometimes given to illustrate the issue (e.g. [15, p.95]). Let µ be a K-dimensional isotropic Gaussian distribution with unit means and variances, and consider the standard Monte Carlo estimator for the mean µ(ϕ) = 1 of the product ϕ(x) := x_1 x_2 ⋯ x_K of its components:
\[
\mu^N(\varphi) := \frac{1}{N}\sum_{n=1}^N X_1^n \cdots X_K^n, \tag{1}
\]
where $(X_1^n,\dots,X_K^n)_{n=1}^N$ denote i.i.d. samples drawn from µ. Because the estimator's asymptotic variance equals $2^K - 1$, the number of samples required to obtain a reasonable estimate of µ(ϕ) grows exponentially with the target's dimension. Hence, it is impractical to use $\mu^N(\varphi)$ if K is even modestly large. For instance, if K = 20, we would require $\approx 10^{10}$ samples to obtain an estimate with standard deviation $0.01 = 1\%\,\mu(\varphi)$, reaching the limits of most present-day personal computers; and if K = 30, we would require $\approx 10^{13}$ samples, exceeding those limits.
There is, however, a trivial way of overcoming the issue for the above example that does not require any knowledge about µ beyond the fact that it is product-form. Because µ is the product µ_1 × ⋯ × µ_K of K univariate unit-mean-and-variance Gaussian distributions µ_1, …, µ_K and ϕ is the product ϕ_1 ⋯ ϕ_K of K univariate functions ϕ_1(x_1) = x_1, …, ϕ_K(x_K) = x_K, we can express µ(ϕ) as the product µ_1(ϕ_1) ⋯ µ_K(ϕ_K) of the corresponding K univariate means µ_1(ϕ_1), …, µ_K(ϕ_K). As we will see in Section 2.1, estimating each of these means separately and taking the resulting product, we obtain an estimator for µ(ϕ) whose asymptotic variance is K:
\[
\mu^N_\times(\varphi) := \frac{1}{N^K}\sum_{n_1,\dots,n_K=1}^N \varphi\big(X_1^{n_1},\dots,X_K^{n_K}\big) = \prod_{k=1}^K\left(\frac{1}{N}\sum_{n=1}^N X_k^n\right). \tag{2}
\]
Consequently, the number of samples necessary for $\mu^N_\times(\varphi)$ to yield a reasonable estimate of µ(ϕ) only grows linearly with the dimension, allowing us to practically deal with Ks in the millions.
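The variance gap between the two estimators is easy to observe numerically. The following sketch (illustrative only; the particular K, N, and seed are arbitrary choices) compares the standard estimator, which averages row-wise products, against the product-form estimator, which multiplies column-wise averages:

```python
import numpy as np

rng = np.random.default_rng(0)
K, N = 20, 100_000
# N i.i.d. K-dimensional samples from the unit-mean, unit-variance Gaussian.
X = rng.normal(loc=1.0, scale=1.0, size=(N, K))

# Standard estimator: average of the row-wise products.
std_est = X.prod(axis=1).mean()

# Product-form estimator: product of the column-wise sample averages.
pf_est = X.mean(axis=0).prod()
```

With these settings the product-form estimate typically lands within a few hundredths of the true value µ(ϕ) = 1, whereas the standard estimate fluctuates on the scale of $\sqrt{(2^K-1)/N} \approx 3$.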
The middle term of (2) makes sense regardless of whether ϕ is the product of univariate test functions. It defines a type of (unbiased, consistent, and asymptotically normal) Monte Carlo estimator for general ϕ and product-form µ, which we refer to as a product-form estimator. Their salient feature is that they achieve lower variances than the standard estimator (1) given the same number of samples from the target. The reason behind the variance reduction is simple: if $(X_1^n)_{n=1}^N, \dots, (X_K^n)_{n=1}^N$ are independent sequences of samples respectively drawn from µ_1, …, µ_K, then every 'permutation' of these samples has law µ, that is,
\[
(X_1^{n_1},\dots,X_K^{n_K}) \sim \mu \quad \text{for all } n_1,\dots,n_K \le N.
\]
Hence, $\mu^N_\times(\varphi)$ in (2) averages over $N^K$ tuples with law µ while its conventional counterpart (1) only averages over N such tuples. This increase in tuple number leads to a decrease in estimator variance, and we say that the product-form estimator is more statistically efficient than the standard one. Moreover, obtaining these $N^K$ tuples does not require drawing any further samples from µ and, in this sense, product-form estimators make the most out of every sample available (indeed, we will show in Theorem 2 that they are minimum variance unbiased estimators, or MVUEs, for product-form targets). However, in contrast with the tuples in (1), those in (2) are not independent (the same components are repeated across several tuples). For this reason, product-form estimators achieve the same $O(N^{-1/2})$ rate of convergence that the standard ones do, and the variance reduction materializes only in lower proportionality constants (i.e. $\lim_{N\to\infty} \mathrm{Var}(\mu^N_\times(\varphi))/\mathrm{Var}(\mu^N(\varphi)) = C$ for some constant $C \le 1$).
The space complexity of product-form estimators scales linearly with dimension: to utilize all $N^K$ permuted tuples in (2) we need only store the KN numbers $(X_1^1,\dots,X_1^N;\ \dots;\ X_K^1,\dots,X_K^N)$. However, in the absence of any sort of special structure in the test function, their time complexity scales exponentially with dimension: brute-force computation of the sum in (2) requires $O(N^K)$ operations. Consequently, the use of product-form estimators for general ϕ proves to be a balancing act in which one must weigh the cost of acquiring new samples from µ (be it a computational one if the samples are obtained from simulations, or a real-life one if they are obtained from experiments) against the extra overhead required to evaluate these estimators, and in practice it is limited to Ks no greater than about ten.
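For intuition on this trade-off, here is a minimal K = 2 sketch (the test function exp(x₁x₂) is an arbitrary stand-in for a ϕ without product structure): the product-form estimator touches all N² permuted tuples, yet it stores only the 2N sampled components.

```python
import numpy as np

rng = np.random.default_rng(1)
N, K = 50, 2  # brute force over N^K tuples is only feasible for small N, K
X = rng.normal(size=(K, N))  # X[k] holds N i.i.d. samples from mu_k

def phi(x1, x2):
    # an arbitrary non-product test function
    return np.exp(x1 * x2)

# Standard estimator: N matched tuples.
std_est = np.mean([phi(X[0, n], X[1, n]) for n in range(N)])

# Product-form estimator: all N^2 permuted tuples, O(N^K) time but
# only O(KN) memory (the tuples themselves are never stored).
pf_est = np.mean([phi(X[0, i], X[1, j]) for i in range(N) for j in range(N)])

# The same sum evaluated with vectorized O(N^2) arithmetic.
pf_vec = phi(X[0][:, None], X[1][None, :]).mean()
```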
If, however, the test function ϕ also possesses some 'product structure', then µ N × (ϕ) can often be evaluated in far fewer than O(N K ) operations. The most extreme examples of such ϕ are functions that factorize fully and sums thereof (which we refer to as 'sums of products' or 'SOPs'), for which the evaluation cost is easily lowered to just O(KN ). For instance, in the case of the toy Gaussian example above, we can evaluate the product-form estimator in O(KN ) operations by expressing it as the product of the component-wise sample averages and computing each average separately (i.e. using the final expression in (2)). This cheaper approach amounts to nothing more than a dimensionality-reduction technique: we re-write a high-dimensional integral as a polynomial of low-dimensional integrals, estimate each low-dimensional integral separately, and plug the estimates back into the polynomial to obtain an estimate of the original integral. More generally, if the test function can be expressed as a sum of partially-factorized functions, it is often possible to lower the cost to O(N d ), where d < K depends on the amount of factorization, and taking this approach also amounts to a type of dimensionality reduction (this time featuring nested integrals).

This paper has two goals. First, to provide a comprehensive theoretical characterization of product-form estimators. Second, to illustrate their use for non-product targets when combined with, or embedded within, other more sophisticated Monte Carlo methodology. It is in these settings, where product-form estimators are deployed to tackle the aspects of the problem exhibiting product structure or conditional independences, that we believe these estimators find their greatest use. To avoid unnecessary technical distractions, and in the interest of accessibility, we achieve the second goal using simple examples. While we anticipate that the most useful such combinations or embeddings will not be so simple, we believe that the underlying ideas and guiding principles will be the same.
Relation to the literature.
In their basic form, product-form estimators (2) are a subclass of generalized U-statistics (see [44,41] for comprehensive surveys): multisample U-statistics with 'kernels' ϕ that take as arguments a single sample per distribution for several distributions (K > 1). Even though product-form estimators are unnatural examples of U-statistics because the original unisample U-statistics [34] fundamentally involve symmetric kernels that take as arguments multiple samples from a single distribution (K = 1), the methods used to study either of these overlap significantly. The arguments required in the basic product-form case are simpler than those necessary for the most general case (multiple samples from multiple distributions) and, by focusing on the results that are of greatest interest from the Monte Carlo perspective, we are able to present readily accessible, intuitive, and compact proofs for the theoretical properties of (2). This said, whenever a result given here can be extracted from the U-statistics literature, we provide explicit references.
U-statistics have been extensively studied since Hoeffding's seminal work [34] and are commonly employed in a variety of statistical tests (e.g. independence tests [35] and two-sample tests [30]).
Section 3 gives simple examples illustrating how one may embed product-form estimators within standard Monte Carlo methodology and extend their use beyond product-form targets. In particular, we combine them with importance sampling and obtain estimators applicable to targets that are absolutely continuous with respect to product-form distributions (Section 3.1) and partially-factorized distributions (Section 3.2), and we consider their use within pseudo-marginal MCMC (Section 3.3). We then examine the numerical performance of these extensions on a simple hierarchical model (Section 3.4).
The paper has six appendices. The first five contain proofs: Appendix A those for the basic properties of product-form estimators, Appendix B that for their MVUE property, Appendix C those for the basic properties of the 'partially product-form' estimators introduced in Section 3.2, Appendix D that for the latter's MVUE property, and Appendix E that for the statistical efficiency of the product-form pseudo-marginal MCMC estimators (vis-à-vis their non-product counterparts) in Section 3.3. Appendix F contains an additional, simple extension of product-form estimators (to targets that are mixtures of product-form distributions), omitted from the main text in the interest of brevity.

Product-form estimators
Consider the basic Monte Carlo problem: given a probability distribution µ on a measurable space $(S, \mathcal{S})$ and a function ϕ belonging to the space $L^2_\mu$ of square µ-integrable real-valued functions on S, estimate the average
\[
\mu(\varphi) := \int_S \varphi(x)\,\mu(dx).
\]
Throughout this section, we focus on the question 'by exploiting the product-form structure of a target µ, can we design estimators of µ(ϕ) that are more efficient than the usual ones?'. By product-form, we mean that µ is the product µ_1 × ⋯ × µ_K of K > 1 distributions µ_1, …, µ_K on measurable spaces $(S_1, \mathcal{S}_1), \dots, (S_K, \mathcal{S}_K)$ satisfying $S = S_1 \times \dots \times S_K$ and $\mathcal{S} = \mathcal{S}_1 \times \dots \times \mathcal{S}_K$, where the latter denotes the product sigma-algebra. Furthermore, if A is a non-empty subset of [K] := {1, …, K}, then we use $\mu_A := \prod_{k\in A}\mu_k$ to denote the product of the µ_k s indexed by ks in A and $\mu_A(\varphi)$ to denote the measurable function on $\prod_{k\notin A} S_k$ obtained by integrating the arguments of ϕ indexed by ks in A with respect to µ_A:
\[
\mu_A(\varphi)(x_{A^c}) := \int \varphi(x)\,\mu_A(dx_A),
\]
under the assumption that this integral is well-defined for all $x_{A^c}$ in $\prod_{k\notin A} S_k$.

Theoretical characterisation
Suppose that we have at our disposal N i.i.d. samples $X^1, \dots, X^N$ drawn from µ. We can view these samples as N tuples $(X^1_1,\dots,X^1_K), \dots, (X^N_1,\dots,X^N_K)$ of i.i.d. samples $X^1_1,\dots,X^N_1,\ \dots,\ X^1_K,\dots,X^N_K$ independently and respectively drawn from µ_1, …, µ_K. As we will see in Section 2.2, the product-form estimator
\[
\mu^N_\times(\varphi) := \frac{1}{N^K}\sum_{\boldsymbol{n}\in[N]^K}\varphi(X^{\boldsymbol{n}}),
\]
where $X^{\boldsymbol{n}}$ with $\boldsymbol{n} = (n_1,\dots,n_K)$ denotes the 'permuted' tuple $(X^{n_1}_1,\dots,X^{n_K}_K)$ (i.e. a tuple obtained as one of the $N^K$ component-wise permutations of the original samples), is a better estimator for µ(ϕ) than the conventional choice
\[
\mu^N(\varphi) := \frac{1}{N}\sum_{n=1}^N \varphi(X^n)
\]
using the same samples, regardless of whether the test function ϕ possesses any sort of product structure. The reason behind this is as follows: while the conventional estimator directly approximates the target with the samples' empirical distribution $\mu^N := N^{-1}\sum_{n=1}^N \delta_{X^n}$, the product-form estimator instead first approximates the marginals µ_1, …, µ_K of the target with the corresponding component-wise empirical distributions $\mu^N_k := N^{-1}\sum_{n=1}^N \delta_{X^n_k}$, and then takes the product of these to obtain an approximation of µ:
\[
\mu^N_\times := \prod_{k=1}^K \mu^N_k.
\]
The built-in product structure in $\mu^N_\times$ makes it a better suited approximation to the product-form target µ than the non-product $\mu^N$. Before pursuing this further, we take a moment to show that $\mu^N_\times(\varphi)$ is a well-founded estimator for µ(ϕ) and to obtain expressions for its variance.
Theorem 1. If ϕ is µ-integrable, then $\mu^N_\times(\varphi)$ is unbiased: $\mathbb{E}[\mu^N_\times(\varphi)] = \mu(\varphi)$ for all N > 0. If, furthermore, ϕ belongs to $L^2_\mu$, then $\mu_{A^c}(\varphi)$ belongs to $L^2_{\mu_A}$ for all subsets A of [K]. The estimator's variance is given by
\[
\mathrm{Var}\big(\mu^N_\times(\varphi)\big) = \sum_{\emptyset\neq A\subseteq[K]} \frac{1}{N^{|A|}} \sum_{B\subseteq A} (-1)^{|A|-|B|}\,\sigma^2_{A,B}\big(\mu_{A^c}(\varphi)\big) \tag{8}
\]
for every N > 0, where |A| and |B| denote the cardinalities of A and B and, for all ψ in $L^2_{\mu_A}$,
\[
\sigma^2_{A,B}(\psi) := \mu_B\big([\mu_{A\setminus B}(\psi)]^2\big) - \mu_A(\psi)^2. \tag{9}
\]
Moreover, $\mu^N_\times(\varphi)$ is strongly consistent and asymptotically normal:
\[
\mu^N_\times(\varphi) \to \mu(\varphi) \ \text{almost surely as } N\to\infty, \qquad \sqrt{N}\big[\mu^N_\times(\varphi) - \mu(\varphi)\big] \Rightarrow \mathcal{N}\big(0, \sigma^2_\times(\varphi)\big),
\]
where ⇒ denotes convergence in distribution and $\sigma^2_\times(\varphi) := \sum_{k=1}^K \sigma^2_k(\varphi)$ with
\[
\sigma^2_k(\varphi) := \mu_k\big([\mu_{\{k\}^c}(\varphi)]^2\big) - \mu(\varphi)^2.
\]
As mentioned in Section 1, product-form estimators are special cases of multisample U-statistics and Theorem 1 can be pieced together from results in the U-statistics literature, e.g. see [41, p.35] for the unbiasedness, [41, p.38] for the variance expressions, [41, Theorem 3.2.1] for the consistency (which also holds for µ-integrable ϕ), and [41, Theorem 4.5.1] for the asymptotic normality. To keep the paper self-contained, we include a simple proof of Theorem 1, specially adapted for product-form estimators, in Appendix A. It has two key ingredients, the first being the following decomposition expressing the 'global approximation error' $\mu^N_\times - \mu$ as a sum of products of 'marginal approximation errors' $\mu^N_k - \mu_k$:
\[
\mu^N_\times - \mu = \sum_{\emptyset\neq A\subseteq[K]} \Big[\prod_{k\in A}(\mu^N_k - \mu_k)\Big] \times \mu_{A^c}.
\]
The other is the following expression for the $L^2$ norm of a generic product of marginal errors (see [41, p.152] for its multisample U-statistics analogue). It tells us that the product of l of these errors has $O(N^{-l/2})$ norm, as one would expect given that the errors are independent and that classical theory (e.g. [15, p.168]) tells us that the norm of each is $O(N^{-1/2})$.

Lemma 1. If A is a non-empty subset of [K], ψ belongs to $L^2_{\mu_A}$, and $\sigma^2_{A,B}(\psi)$ is as in (9), then
\[
\mathbb{E}\left[\left(\Big[\prod_{k\in A}(\mu^N_k - \mu_k)\Big](\psi)\right)^2\right] = \frac{1}{N^{|A|}} \sum_{B\subseteq A} (-1)^{|A|-|B|}\,\sigma^2_{A,B}(\psi) = O(N^{-|A|}).
\]
Proof. The lemma follows by applying, one component at a time, the single-marginal identity
\[
\mathbb{E}\big[(\mu^N_k - \mu_k)(f)\,(\mu^N_k - \mu_k)(g)\big] = \frac{1}{N}\big[\mu_k(fg) - \mu_k(f)\,\mu_k(g)\big], \qquad f, g \in L^2_{\mu_k},
\]
which holds because the samples $X^1_k, \dots, X^N_k$ are i.i.d. with law µ_k.

Theorem 2. If ϕ belongs to $L^2_\mu$, then $\mu^N_\times(\varphi)$ is a minimum variance unbiased estimator (MVUE) of µ(ϕ): no other unbiased estimator of µ(ϕ) that is a function of the samples $X^1, \dots, X^N$ achieves a smaller variance.
Proof. See Appendix B.
While it is well-known that unisample U-statistics are MVUEs (e.g. [17]), we have been unable to locate an explicit proof that covers the general multisample case and, in particular, that of product-form estimators. Instead, we adapt the argument given in [44,Chap. 1] (whose origins trace back to [32]) for unisample U-statistics and prove Theorem 2 in Appendix B.
Theorem 2 implies that product-form estimators achieve lower variances than their conventional counterparts:
Corollary 1. If ϕ belongs to $L^2_\mu$, then $\mathrm{Var}(\mu^N_\times(\varphi)) \le \mathrm{Var}(\mu^N(\varphi))$ for all N > 0 and, consequently, $\sigma^2_\times(\varphi) \le \sigma^2(\varphi)$, where $\sigma^2(\varphi) := \mu(\varphi^2) - \mu(\varphi)^2$ denotes the asymptotic variance of the conventional estimator.
Proof. See Appendix B.
In other words, product-form estimators are more statistically efficient than their standard counterparts: using the same number of independent samples drawn from the target, $\mu^N_\times(\varphi)$ achieves a lower variance than $\mu^N(\varphi)$. The reason behind this variance reduction was outlined in Section 1: the product-form estimator uses the empirical distribution of the collection $(X^{\boldsymbol{n}})_{\boldsymbol{n}\in[N]^K}$ of permuted tuples as an approximation to µ. Because µ is product-form, each of these permuted tuples is as much a sample drawn from µ as any of the original unpermuted tuples $(X^n)_{n=1}^N$. Hence, product-form estimators transform N samples drawn from µ into $N^K$ samples and, consequently, lower the variance. However, the permuted tuples are not independent and we get a diminishing-returns effect: the more permutations we make, the greater the correlations among them, and the less 'new information' each new permutation affords us. For this reason, the estimator variance remains $O(N^{-1})$, cf. (8), instead of the $O(N^{-K})$ that the standard estimator would achieve using $N^K$ independent samples. As we discuss in Section 4, there is also a pragmatic middle ground here: use $N < M < N^K$ permutations instead of all $N^K$ possible ones. In particular, by choosing these M permutations to be as uncorrelated as possible (e.g. so that they have few overlapping entries), it might be possible to retain most of the variance reduction while avoiding the full $O(N^K)$ cost (cf. [40] and references therein for similar feats in the U-statistics literature).
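This middle ground is not developed further here, but a naive instantiation is easy to sketch: draw M index tuples uniformly at random (a hypothetical choice made purely for illustration; more careful designs would minimize overlapping entries) and average ϕ over those tuples only.

```python
import numpy as np

rng = np.random.default_rng(2)
N, K, M = 400, 5, 10_000  # M is far smaller than N^K = 400^5 permuted tuples
X = rng.normal(loc=1.0, scale=1.0, size=(K, N))  # unit-mean Gaussians

def phi(cols):
    # the product test function from the toy example, applied column-wise
    return np.prod(cols, axis=0)

# Subsampled variant: M uniformly random permuted tuples (one index per
# component), averaged at O(MK) cost.
idx = rng.integers(0, N, size=(K, M))
subsampled_est = phi(X[np.arange(K)[:, None], idx]).mean()

# Full product-form estimator, evaluated cheaply as a product of means
# (possible here only because phi is fully factorized).
full_pf_est = np.prod(X.mean(axis=1))
```

Both estimates target µ(ϕ) = 1; the subsampled variant is unbiased but somewhat noisier than the full product-form estimator.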
Given that the variances of both estimators are (asymptotically) proportional to each other, we are now faced with the question 'how large might the proportionality constant be?'. If the test function is additive or constant, e.g. $S_1 = \dots = S_K = \mathbb{R}$ and
\[
\varphi(x) = \sum_{k=1}^K \varphi_k(x_k), \tag{14}
\]
then the two estimators trivially coincide, no variance reduction is achieved, and the constant is one. These, however, are cases in which the standard estimator performs well (e.g. for (14), $\mu^N(\varphi)$'s variance breaks down into a sum of K univariate integrals and, consequently, grows slowly with the dimension K). If, on the other hand, the test function includes dependencies between the components, then the proportionality constant can be arbitrarily large and the variance reduction unbounded, as the following example illustrates.
Example 1. Let K = 2, let µ_1 = µ_2 be standard normal distributions, and let $\varphi(x_1,x_2) := \mathbb{1}_{(\alpha,\infty)}(x_1)\,\mathbb{1}_{(\alpha,\infty)}(x_2)$ for some threshold α > 0, so that $\mu(\varphi) = [1-\Phi(\alpha)]^2$, where Φ denotes the CDF of a standard normal. In this case, $\mu_1(\varphi)(x_2) = [1-\Phi(\alpha)]\,\mathbb{1}_{(\alpha,\infty)}(x_2)$, and similarly for $\mu_2(\varphi)(x_1)$. In addition,
\[
\sigma^2(\varphi) = [1-\Phi(\alpha)]^2\big(1 - [1-\Phi(\alpha)]^2\big), \qquad \sigma^2_\times(\varphi) = 2\,\Phi(\alpha)[1-\Phi(\alpha)]^3.
\]
It then follows that
\[
\frac{\sigma^2(\varphi)}{\sigma^2_\times(\varphi)} = \frac{1 - [1-\Phi(\alpha)]^2}{2\,\Phi(\alpha)[1-\Phi(\alpha)]} \to \infty \quad \text{as } \alpha \to \infty.
\]
It is not difficult to glean some intuition as to why the product-form estimator yields far more accurate estimates than its standard counterpart for large α. In these cases, unpermuted tuples with both components greater than α are extremely rare (they occur with probability $[1-\Phi(\alpha)]^2$) and, until one arises, the standard estimator is stuck at zero (a relative error of 100%). On the other hand, for the product-form estimator to return a non-zero estimate, we only require unpermuted tuples with a single component greater than α, which are generated much more frequently (with probability $[1-\Phi(\alpha)]$).
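A quick numerical rendering of this rare-event effect (the choices α = 3 and N = 10⁵ are arbitrary):

```python
import numpy as np
from math import erf, sqrt

alpha = 3.0
p = 0.5 * (1.0 - erf(alpha / sqrt(2.0)))  # 1 - Phi(alpha), about 1.35e-3
exact = p ** 2                            # mu(phi), about 1.8e-6

rng = np.random.default_rng(3)
N = 100_000
x1, x2 = rng.normal(size=(2, N))

# Standard estimator: non-zero only if some tuple has BOTH entries above alpha.
std_est = np.mean((x1 > alpha) & (x2 > alpha))

# Product-form estimator: only needs single entries above alpha, which occur
# roughly 1/p times more often than joint exceedances.
pf_est = np.mean(x1 > alpha) * np.mean(x2 > alpha)
```

With these settings, single-component exceedances occur about 135 times on average, so the product-form estimate is essentially never zero, whereas joint exceedances have expected count $N[1-\Phi(\alpha)]^2 \approx 0.18$ and the standard estimate is usually zero.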
Of particular interest is the case of high-dimensional targets (i.e. large K) for which obtaining accurate estimates of µ(ϕ) proves challenging. Even though the exact manner in which the variance reduction achieved by the product-form estimator scales with dimension of course depends on the precise target and test function, it is straightforward to gain some insight by revisiting our starting example.

Example 2. Let $\mu := \rho \times \dots \times \rho$ (K times) and $\varphi(x) := \psi(x_1)\cdots\psi(x_K)$ for some univariate distribution ρ and test function ψ satisfying ρ(ψ) ≠ 0. In this case,
\[
\sigma^2_\times(\varphi) = K\,\mathrm{CV}^2\,\rho(\psi)^{2K}, \qquad \sigma^2(\varphi) = \big[(1+\mathrm{CV}^2)^K - 1\big]\,\rho(\psi)^{2K}, \tag{15}
\]
where
\[
\mathrm{CV} := \frac{\sqrt{\rho([\psi - \rho(\psi)]^2)}}{|\rho(\psi)|} \tag{16}
\]
denotes the coefficient of variation of ψ w.r.t. ρ. Hence,
\[
\frac{\sigma^2(\varphi)}{\sigma^2_\times(\varphi)} = \frac{(1+\mathrm{CV}^2)^K - 1}{K\,\mathrm{CV}^2},
\]
and we see that the reduction in variance grows exponentially with the dimension K. At first glance, (15) might appear to imply that the number of samples required for $\mu^N_\times(\varphi)$ to yield a reasonable estimate of µ(ϕ) grows exponentially with K if |ρ(ψ)| > 1. However, what we deem a 'reasonable estimate' should take into account the magnitude of the average $\mu(\varphi) = \rho(\psi)^K$ we are estimating. In particular, it is natural to ask for the standard deviation of our estimates to be ε|µ(ϕ)| for some prescribed relative tolerance ε > 0. In this case, we find that the number of samples required by the product-form estimator is approximately
\[
N_\times \approx \frac{\sigma^2_\times(\varphi)}{\varepsilon^2\mu(\varphi)^2} = \frac{K\,\mathrm{CV}^2}{\varepsilon^2}.
\]
In the case of the conventional estimator $\mu^N(\varphi)$, the number required to achieve the same accuracy is instead
\[
N \approx \frac{\sigma^2(\varphi)}{\varepsilon^2\mu(\varphi)^2} = \frac{(1+\mathrm{CV}^2)^K - 1}{\varepsilon^2}.
\]
That is, the number of samples necessary to obtain a reasonable estimate grows linearly with dimension for $\mu^N_\times(\varphi)$ and exponentially for $\mu^N(\varphi)$.

Notice that the univariate coefficient of variation CV features heavily in Example 2's analysis: the greater it is, the greater the variance reduction, and the difference gets amplified exponentially with the dimension K. This observation might be explained as follows: if µ is highly peaked (so that the coefficient is close to zero), then the unpermuted tuples are clumped together around the peak (Fig. 1a), permuting their entries only yields further tuples around the peak (Fig. 1b), and the empirical average changes little. If, on the other hand, µ is spread out (so that the coefficient is large), then the unpermuted pairs are scattered across the space (Fig. 1c), permuting their entries reveals unexplored regions of the space (Fig. 1d), and the estimates improve. Of course, how spread out the target is must be measured in terms of the test function, and we end up with the coefficient of variation in (16).
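As a numerical sanity check on this analysis, the following sketch assumes ρ = Uniform[0, 2] and ψ(x) = x, so that ρ(ψ) = 1 and CV² = 1/3, and compares the empirical variance of the product-form estimator over repeated runs with the asymptotic prediction K·CV²·ρ(ψ)^{2K}/N:

```python
import numpy as np

rng = np.random.default_rng(4)
K, N, repeats = 5, 200, 2000
cv2 = (4.0 / 12.0) / 1.0 ** 2  # CV^2 of Uniform[0, 2]: variance 1/3, mean 1

# Empirical variance of the product-form estimator over many repeats.
X = rng.uniform(0.0, 2.0, size=(repeats, K, N))
pf_ests = X.mean(axis=2).prod(axis=1)  # product of component-wise means
empirical_var = pf_ests.var()

# Asymptotic prediction: sigma_x^2(phi) / N with rho(psi) = 1.
asymptotic_var = K * cv2 / N
```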

Computational efficiency
As shown in the previous section, product-form estimators are always at least as statistically efficient as their conventional counterparts: the variances of the former are bounded above by those of the latter. These gains in statistical efficiency come at a computational cost: even though both conventional and product-form estimators share the same O(N ) memory needs, the latter requires evaluating the test function N K times, while the former requires only N evaluations. For this reason, the question of whether product-form estimators are more computationally efficient than their conventional counterparts (i.e. achieve smaller errors given the same computational budget) is not as straightforward. In short, sometimes but not always.
One way to answer the computational-efficiency question is to compare the cost incurred by each estimator in order to achieve a desired given variance $\bar\sigma^2$. To do so, we approximate the variance of $\mu^N_\times(\varphi)$ with its asymptotic variance divided by the sample number (as justified by Theorem 1). The number of samples required for the variance to equal $\bar\sigma^2$ is $N := \sigma^2(\varphi)/\bar\sigma^2$ for the conventional estimator and (approximately) $N_\times := \sigma^2_\times(\varphi)/\bar\sigma^2$ for the product-form one. The costs of evaluating the former with N samples and the latter with $N_\times$ samples are
\[
N C_\varphi + N C_X + N \qquad\text{and}\qquad N_\times^K C_\varphi + N_\times C_X + N_\times^K,
\]
respectively, where $C_\varphi$ and $C_X$ are the costs, relative to that of a single elementary arithmetic operation, of evaluating ϕ and of generating a sample from µ, and the rightmost N and $N_\times^K$ terms account for the cost of computing the corresponding sample average once all evaluations of ϕ are carried out. It follows that $\mu^N_\times(\varphi)$ is (asymptotically) at least as computationally efficient as $\mu^N(\varphi)$ if and only if the ratio of their respective costs is no smaller than one or, after some re-arranging,
\[
N(C_r + 1) \;\ge\; N_\times^K C_r + N_\times, \tag{17}
\]
where $C_r := (C_\varphi + 1)/C_X$ denotes the relative cost of evaluating the test function and drawing samples. Our first observation here is that, because $\sigma^2(\varphi) \ge \sigma^2_\times(\varphi)$ (Corollary 1), the above is always satisfied in the limit $C_r \to 0$. This corresponds to the case where the cost of acquiring the samples dwarfs the overhead of evaluating the sample averages (for instance, if the samples are obtained from long simulations or real-life experiments). If so, we do really want to make the most of the samples we have, and product-form estimators help us to do so. Conversely, if samples are cheap to generate and the test function is expensive to evaluate (i.e. $C_r \to \infty$), then we are better off using the basic estimator.
To investigate the case where the costs of generating samples and evaluating the test function are comparable ($C_r \approx 1$), note that the variance approximation $\mathrm{Var}(\mu^{N_\times}_\times(\varphi)) \approx \sigma^2_\times(\varphi)/N_\times$ and, consequently, (17) are valid only if $\sigma^2_\times(\varphi) > \bar\sigma^2$. Otherwise, $N_\times = 1$ and the product-form estimator simply equals $\varphi(X^1)$ with variance $\sigma^2(\varphi)$. In the high-dimensional (i.e. large K) case which is of particular interest, (17) then (approximately) reduces to
\[
\frac{\sigma^2(\varphi)}{\sigma^2_\times(\varphi)} \;\gtrsim\; N_\times^{K-1}. \tag{18}
\]
To gain insight into whether it is reasonable to expect the above to hold, we revisit Example 2.
Example 3. Setting once again our desired standard deviation to be proportional to the magnitude of the target average (i.e. $\bar\sigma = \varepsilon|\mu(\varphi)| = \varepsilon|\rho(\psi)|^K$) and calling on (15, 16), we re-write (18) as
\[
\frac{(1+\mathrm{CV}^2)^K - 1}{K\,\mathrm{CV}^2} \;\gtrsim\; \left(\frac{K\,\mathrm{CV}^2}{\varepsilon^2}\right)^{K-1}.
\]
Because the right-hand side grows super-exponentially in K while the left-hand side grows only exponentially, the expression shows that, in this full $O(N^K)$-cost case, $\mu^N(\varphi)$ outperforms $\mu^N_\times(\varphi)$ in computational terms for large dimensions K (and, even more so, for small relative tolerances ε).
In summary, unless the cost of generating samples is significantly larger than that of evaluating ϕ, we expect the basic estimator to outperform the product-form one. Simply put, independent samples are more valuable for estimation than correlated permutations thereof. Hence, if independent samples are cheap to generate, then we are better off drawing further independent samples instead of permuting the ones we have.
That is, unless one can find a way to evaluate the product-form estimator that does not require summing over all $N^K$ permutations. Indeed, the above analysis is out of place for Example 3 because, in this case, we can express the product-form estimator as the product of the univariate sample averages $\mu^N_1(\psi), \dots, \mu^N_K(\psi)$ and evaluate each of these separately at a total O(KN) cost. Given that the number of samples required for $\mu^N_\times(\varphi)$ to yield a reasonable estimate scales linearly with dimension (Example 2), it follows that the cost incurred by computing such an estimate scales quadratically with dimension. In the case of $\mu^N(\varphi)$, the number of samples required, and hence the cost, scales exponentially with dimension; making the product-form estimator the clear choice for this simple case. This type of trick significantly expands the usefulness of product-form estimators, as we see in the following section.

Efficient computation
Recall our starting example from Section 1. In that case, the product-form estimator trivially breaks down into the product of K univariate sample averages (2) and, consequently, we can evaluate it in O(KN) operations. We can exploit this trick whenever the test function possesses product-like structure: if ϕ is a sum of fully-factorized functions,
\[
\varphi(x) = \sum_{j=1}^J \varphi_{j1}(x_1)\cdots\varphi_{jK}(x_K),
\]
the product-form estimator decomposes into a sum of products (SOP) of univariate averages,
\[
\mu^N_\times(\varphi) = \sum_{j=1}^J \prod_{k=1}^K \mu^N_k(\varphi_{jk}),
\]
and we are able to evaluate $\mu^N_\times(\varphi)$ in O(KN) operations. (Of course, 'univariate' need not mean that the function is defined on ℝ, and we can be strategic in our choice of component spaces: for some functions $f : \mathbb{R}^2 \to \mathbb{R}$ and $g : \mathbb{R} \to \mathbb{R}$, say with $\varphi(x) = f(x_1, x_2)\,g(x_3)$, we could pick K := 2, $S_1 := \mathbb{R}^2$, and $S_2 := \mathbb{R}$.) In these cases, the use of product-form estimators amounts to nothing more than a dimensionality-reduction technique: we exploit the independence of the target to express our K-dimensional integral as an SOP of one-dimensional integrals, estimate each of these separately, and replace the one-dimensional integrals in the SOP with their estimates to obtain an estimate for the K-dimensional integral. By so exploiting the structure in µ and ϕ, the product-form estimator achieves a lower variance than the standard estimator (Corollary 1). Moreover, evaluating each univariate sample average $\mu^N_k(\varphi_{jk})$ requires only O(N) operations and, consequently, the computational complexity of $\mu^N_\times(\varphi)$ is O(KN). The running time can be further reduced by calculating the univariate sample averages in parallel.
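The SOP evaluation can be sketched as follows (the factors ϕ_jk below are arbitrary monomials chosen for illustration); for small K, the O(N^K) brute-force sum over all permuted tuples can be computed as well, and the two agree exactly:

```python
import numpy as np

rng = np.random.default_rng(5)
N, K, J = 100, 3, 2
X = rng.normal(size=(K, N))  # X[k] holds N samples from mu_k

# An SOP test function: phi(x) = sum_j prod_k phi_jk(x_k), with
# phi_jk(x) = x^(j + k + 1) chosen arbitrarily for illustration.
def phi_jk(j, k, x):
    return x ** (j + k + 1)

# O(JKN) evaluation: a sum of products of univariate sample averages.
sop_est = sum(
    np.prod([phi_jk(j, k, X[k]).mean() for k in range(K)])
    for j in range(J)
)

# Brute-force O(N^K) evaluation over all permuted tuples, for comparison.
grids = np.meshgrid(*X, indexing="ij")  # K arrays of shape (N, N, N)
brute = sum(
    np.prod([phi_jk(j, k, grids[k]) for k in range(K)], axis=0)
    for j in range(J)
).mean()
```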
Similar considerations apply if the test function ϕ is a product of low-dimensional functions (and sums thereof) instead of univariate ones, e.g.
\[
\varphi(x) = \prod_{i=1}^I \varphi_i(x_{A_i})
\]
for a collection of factors $\varphi_1, \dots, \varphi_I$ with arguments indexed by subsets $A_1, \dots, A_I$ of [K]. As in the SOP case, one should aim to swap as many summation and product signs in $\mu^N_\times(\varphi)$ as the factors permit. Exactly how best to do this is obvious in simple situations such as that of Example 5 in Section 3.1. For more complicated ones, we advise using the 'variable elimination' algorithm (e.g. [39, Chapter 9]) commonly employed for inference in discrete graphical models. The complexity of the resulting procedure essentially depends on the order in which one attempts the swapping (it is, however, easy to find bounds thereon: for instance, it is bounded below by both the maximum cardinality of $A_1, \dots, A_I$ and half the length of the longest cycle in ϕ's factor graph). While finding the lowest-complexity ordering for general partially-factorized ϕ is itself a problem whose worst-case complexity is exponential in K, good suboptimal orderings can often be found using cheap heuristics (cf. [39, Section 9.4.3]).

For general ϕ lacking any sort of product structure, we are sometimes able to extend the linear-cost approach by approximating ϕ with SOPs (e.g. using truncated Taylor expansions for analytic ϕ). The idea is that, if $\varphi \approx \varphi_{\mathrm{sop}}$ for some SOP $\varphi_{\mathrm{sop}}$, then $\mu^N_\times(\varphi) \approx \mu^N_\times(\varphi_{\mathrm{sop}})$ and $\mu(\varphi) \approx \mu(\varphi_{\mathrm{sop}})$, and we can use $\mu^N_\times(\varphi_{\mathrm{sop}}) \approx \mu(\varphi_{\mathrm{sop}})$ as a linear-cost estimator for µ(ϕ) without significantly affecting the variance reduction. This of course comes at the expense of introducing a bias in our estimates, albeit one that can often be made arbitrarily small by using more and more refined approximations (these biases may in principle be removed using multi-level randomization [50, 59]). The choice of approximation quality itself proves to be a balancing act, as more refined approximations typically incur higher evaluation costs. If these costs are high enough, then any potential computational gains afforded by the reduction in variance are lost.
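A minimal instance of this sum-product swapping (not the variable-elimination algorithm in full generality, just a hand-worked K = 3 case with hypothetical factors f and g sharing the variable x₂):

```python
import numpy as np

rng = np.random.default_rng(6)
N = 60
x1, x2, x3 = rng.normal(size=(3, N))  # samples from mu_1, mu_2, mu_3

# A partially factorized test function: phi(x) = f(x1, x2) * g(x2, x3).
f = lambda a, b: np.exp(-(a - b) ** 2)
g = lambda b, c: 1.0 / (1.0 + (b + c) ** 2)

# Brute force over all N^3 permuted tuples: O(N^3).
brute = np.mean(
    f(x1[:, None, None], x2[None, :, None])
    * g(x2[None, :, None], x3[None, None, :])
)

# Variable elimination: sum out x1 and x3 first, then x2: O(N^2).
F = f(x1[:, None], x2[None, :]).mean(axis=0)  # (1/N) sum_a f(x1^a, x2^b)
G = g(x2[:, None], x3[None, :]).mean(axis=1)  # (1/N) sum_c g(x2^b, x3^c)
eliminated = np.mean(F * G)
```

The equality is exact because $\sum_{a,b,c} f(x_1^a, x_2^b)\,g(x_2^b, x_3^c) = \sum_b \big[\sum_a f(x_1^a, x_2^b)\big]\big[\sum_c g(x_2^b, x_3^c)\big]$; the elimination merely reorders the arithmetic.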
In summary, this SOP-approximation approach is most beneficial for test functions (a) that are effectively approximated by SOP functions (so that the bias is low), (b) whose SOP approximations are relatively cheap to compute (so that the cost is low), and (c) that have a high-dimensional product-form component to them (so that the variance reduction is large, cf. Section 2.2). In these cases, the gains in performance can be substantial, as illustrated by the following toy example.

Example 4. Let $\mu_1, \dots, \mu_K$ be uniform distributions on the interval [0, a] of length a > 1 and consider the test function $\varphi(x) := e^{x_1 \cdots x_K}$. The integral µ(ϕ) can be expressed in terms of the generalized hypergeometric function $_pF_q$ and grows super-exponentially with the dimension K (see Fig. 2a). Because
\[
\varphi(x) = \sum_{j=0}^\infty \frac{(x_1\cdots x_K)^j}{j!} \approx \sum_{j=0}^J \frac{(x_1\cdots x_K)^j}{j!} =: \varphi_J(x)
\]
for large enough truncation cutoffs J, we have that $\mu(\varphi) \approx \mu(\varphi_J)$. Using $\mu^N_\times(\varphi_J)$ as an estimator for µ(ϕ), we lower the computational cost from $O(N^K)$ to O(KN). In exchange, we introduce a bias: because
\[
\mu(\varphi) - \mu(\varphi_J) = \sum_{j=J+1}^\infty \frac{1}{j!}\,\frac{a^{jK}}{(j+1)^K},
\]
whose terms decay factorially once j exceeds $a^K$, the bias decays super-exponentially with the cutoff J, at least for sufficiently large J. In practice, we found it to be significant for Js smaller than $0.8\,a^K$ and negligible for Js larger than $1.2\,a^K$ (Fig. 2b). In particular, the cutoff J necessary for $\mu^N_\times(\varphi_J)$ to yield estimates with small bias grows exponentially with the dimension K.
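A sketch of the truncated-SOP estimator for this example, with the arbitrary small choices a = 1, K = 2, J = 20. For uniforms on [0, a] we have $\mu\big((x_1\cdots x_K)^j\big) = [a^j/(j+1)]^K$, so the reference value $\mu(\varphi) = \sum_j a^{jK}/(j!\,(j+1)^K)$ can be summed directly:

```python
import numpy as np
from math import factorial

a, K, J, N = 1.0, 2, 20, 20_000
rng = np.random.default_rng(7)
X = rng.uniform(0.0, a, size=(K, N))

# Reference value: mu(phi) = sum_j a^(jK) / (j! (j+1)^K).
exact = sum(a ** (j * K) / (factorial(j) * (j + 1) ** K) for j in range(60))

# Truncated-SOP product-form estimator, evaluated in O(JKN) operations:
# mu_x(phi_J) = sum_j (1/j!) prod_k [(1/N) sum_n (X_k^n)^j].
est = sum(
    np.prod([(X[k] ** j).mean() for k in range(K)]) / factorial(j)
    for j in range(J + 1)
)
```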
However, similar manipulations to those above reveal that
\[
\sigma^2_\times(\varphi_J) = K\left[\sum_{i,j=0}^J \frac{1}{i!\,j!}\,\frac{a^{(i+j)K}}{(i+j+1)\,[(i+1)(j+1)]^{K-1}} - \mu(\varphi_J)^2\right],
\]
and we find that the variance reduction achieved by $\mu^N_\times(\varphi_J)$ far outpaces the growth in K of the cutoff (and, thus, of the computational cost of $\mu^N_\times(\varphi_J)$) necessary to achieve a small bias (Fig. 2c). Indeed, the asymptotic-standard-deviation-to-mean ratio $\sigma(\varphi)/\mu(\varphi)$ rapidly diverges with K in the case of the standard estimator (Fig. 2d, solid). In that of the biased product-form estimator, the ratio $\sigma_\times(\varphi_J)/\mu(\varphi)$ also diverges with K, but at a much slower rate (Fig. 2d, dashed). For this reason, the number of samples necessary to obtain a, say, 1%-accuracy estimate of µ(ϕ) using $\mu^N_\times(\varphi_J)$ remains manageable for a substantially larger range of as and Ks than in the case of $\mu^N(\varphi)$, even after factoring in the extra cost required to evaluate $\mu^N_\times(\varphi_J)$ for Js large enough that the bias is insignificant. For instance, with an interval length of 1.5, ten dimensions, a cutoff of seventy, one million samples, and less than one minute of computation time, $\mu^N_\times(\varphi_J)$ produces a 1%-accuracy estimate of $\mu(\varphi) \approx 6.68 \times 10^7$ (Fig. 2e). Using the same one million samples and the standard estimator, we obtain very poor estimates (Fig. 2f). Indeed, its asymptotic variance equals $4.45 \times 10^{29}$ and, so, we would need approximately $10^{18}$ samples to obtain 1%-accuracy estimates using $\mu^N(\varphi)$, something far beyond current computational capabilities.

[Fig. 2(e): a kernel density estimator with plug-in bandwidth [67] (blue), obtained with a = 1.5, K = 10, J = 70, and 100 repeats of $\mu^N_\times(\varphi_J)$ each involving $N = 10^6$ samples, is a good match to the corresponding sampling distribution (magenta) predicted by the CLT in Theorem 1; comparing with the target average (yellow), the mean absolute error across repeats is $6.73 \times 10^5 \approx 1\%\,\mu(\varphi)$. Fig. 2(f): as in (e) but for $\mu^N(\varphi)$; this time, the predicted sampling distribution is extremely wide (with a standard deviation of $6.7 \times 10^{14}$) and a poor match to the kernel density estimator (almost a Dirac delta close to zero), and the mean absolute error is $6.67 \times 10^7 \approx \mu(\varphi)$.]

The standard estimator's failure stems from the extreme rarity of samples $X^n$ achieving very large values of $\varphi(X^n)$ (i.e. those with components that are all close to a). Because the components are independent, they are extremely unlikely to simultaneously be close to a, and such samples are not observed for realistic ensemble sizes N. The product-form estimator avoids this issue by averaging over each component separately.


Extensions to non-product-form targets

Despite arising in many settings (from the stochastic reaction networks of [3,14] to the mean-field approximations used in variational inference [58,10]), most target distributions encountered in practice are not product-form. In this section, we demonstrate how to combine product-form estimators with other Monte Carlo methodology and expand their utility beyond the product-form case.
We consider three simple extensions: one to targets that are absolutely continuous with respect to fully-factorized distributions (Section 3.1), resulting in a product-form variant of importance sampling (e.g. [15,Chapter 8]); another to targets that are absolutely continuous with respect to partially-factorized distributions (Section 3.2), resulting in a product-form version of importance sampling squared [66]; and a final one to targets with intractable densities arising from latent variable models (Section 3.3), resulting in a product-form variant of pseudo-marginal MCMC [61]. In all cases, we show theoretically that the product-form estimators achieve smaller variances than their standard counterparts. Then, we investigate their performance numerically by applying them to a simple hierarchical model (Section 3.4).
Lastly, we mention here that a further extension, this time to targets that are mixtures of product-form distributions, can be found in Appendix F. Because many distributions may be approximated with these mixtures, this extension potentially opens the door to tackling still more complicated targets (at the expense of introducing some bias).
Corollary 2 tells us that γ^N_×(ϕ) is more statistically efficient than the conventional IS estimator γ^N(ϕ) regardless of whether the target γ is product-form or not. In a nutshell, µ^N_× is a better approximation to the proposal µ than µ^N and, consequently, γ^N_×(dx) = w(x)µ^N_×(dx) is a better approximation to γ(dx) = w(x)µ(dx) than γ^N(dx) = w(x)µ^N(dx). Indeed, by constructing all N^K tuples obtainable by mixing the components of the samples X^1, ..., X^N, we reveal other areas of similar µ-probability and further explore the state space. This can be particularly useful when the proposal and target are mismatched, as it can amplify the number of tuples landing in the target's high-probability regions (i.e. achieving high weights w) and, consequently, substantially improve the quality of the finite-sample approximation (Figure 3).
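To make this concrete, the following is a small, self-contained sketch (in Python, with hypothetical `w` and `phi`; it is not the paper's code) of the product-form IS estimator alongside its standard counterpart. For a product-form test function and flat weights, the N^K-tuple average factorizes into a product of marginal means, which the final assertion checks:

```python
import itertools
import numpy as np

def pf_is(samples, w, phi):
    """Product-form IS estimate: average w * phi over all N^K mixed tuples.

    samples: (N, K) array of i.i.d. draws from a product-form proposal.
    """
    N, K = samples.shape
    total = 0.0
    for idx in itertools.product(range(N), repeat=K):
        x = np.array([samples[idx[k], k] for k in range(K)])
        total += w(x) * phi(x)
    return total / N ** K

def standard_is(samples, w, phi):
    """Standard IS estimate: average w * phi over the N original samples."""
    return np.mean([w(x) * phi(x) for x in samples])

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))      # proposal mu = N(0,1) x N(0,1)
w = lambda x: 1.0                 # trivial weights: target equals proposal
phi = lambda x: x[0] * x[1]       # product-form test function

# For w = 1 and phi(x) = x_1 * x_2, the tuple average factorizes into the
# product of the two marginal sample means.
est = pf_is(X, w, phi)
assert abs(est - X[:, 0].mean() * X[:, 1].mean()) < 1e-10
```

For realistic K one would exploit this factorization rather than enumerate all N^K tuples; the enumeration above is purely illustrative.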
Similarly, the self-normalized version π^N_×(ϕ) := γ^N_×(ϕ)/γ^N_×(S) of the product-form IS estimator γ^N_×(ϕ) is a consistent and asymptotically normal estimator for averages π(ϕ) with respect to the normalized target π := γ/γ(S). As in the case of the standard self-normalized importance sampling (SNIS) estimator π^N(ϕ) := γ^N(ϕ)/γ^N(S), the ratio in π^N_×(ϕ)'s definition introduces an O(N^{-1}) bias and stops us from obtaining analytical expressions for the finite-sample variance (that the bias is O(N^{-1}) follows from an argument similar to that given for standard SNIS in [47, p. 35] and requires making assumptions on the higher moments of ϕ(X^1)). Otherwise, π^N_×(ϕ)'s theoretical properties are analogous to those of the product-form estimator µ^N_×(ϕ) and its importance sampling extension γ^N_×(ϕ): Corollary 3. If wϕ belongs to L²_µ, then π^N_×(ϕ) is strongly consistent, asymptotically normal, and its asymptotic variance is bounded above by that of π^N(ϕ), where w_π denotes the normalized weight function w/γ(S) and σ²_×(w_π[ϕ − π(ϕ)]) is as in Theorem 1. Proof. Given Theorem 1 and Corollary 1, the arguments here follow closely those for standard SNIS. In particular,

π^N_×(ϕ) − π(ϕ) = µ^N_×(w[ϕ − π(ϕ)]) / µ^N_×(w).

Given that µ^N_×(w) tends to µ(w) almost surely (and, hence, in probability) as N approaches infinity (Theorem 1), the strong consistency and asymptotic normality of π^N_×(ϕ) then follow from those of µ^N_×(w_π[ϕ − π(ϕ)]) (Theorem 1) and Slutsky's theorem. The asymptotic variance bound follows from that in Corollary 1.
This type of approach is best suited for targets π possessing at least some product structure. The structure manifests itself in partially-factorized weight functions w and substantially lowers the evaluation costs of γ^N_×(ϕ) and π^N_×(ϕ) for simple test functions ϕ, as the following example illustrates. Example 5 (A simple hierarchical model). Consider the basic hierarchical model in (21): it has a single unknown parameter, the variance θ of the latent variables X_1, ..., X_K, which we infer using a Bayesian approach. That is, we choose a prior p(dθ) on θ and draw inferences from the corresponding posterior π in (22), where y = (y_1, ..., y_K) denotes the vector of observations. For most priors, no analytic expressions for the normalizing constant can be found and we are forced to proceed numerically. One option is to choose a fully factorized proposal µ, in which case the weight function factorizes accordingly. (Note that, were we using standard IS instead of the product-form variant, the partially-factorized proposal in (24) would be the natural choice, a point we return to after the example.) Hence, to estimate the normalizing constant, or any integral w.r.t. a univariate marginal of the posterior, we need to draw samples from µ and evaluate the product-form estimator µ^N_×(ϕ) for a test function of the form ϕ(θ, x) = f(θ) Π_{k=1}^K g_k(θ, x_k), the cost of which totals O(KN^2) operations because

µ^N_×(ϕ) = N^{-1} Σ_{i=1}^N f(θ^i) Π_{k=1}^K [N^{-1} Σ_{n=1}^N g_k(θ^i, X^n_k)].

We return to this in Section 3.4, where we will make use of the closed-form expression (25) for the (unnormalized) posterior's θ-marginal, available due to the Gaussianity in (21). Clearly, that expression opens the door to simpler and more effective methods for computing integrals with respect to this marginal than estimators targeting the full posterior. However, the estimators we discuss can be applied analogously to the many commonplace hierarchical models (e.g. see [26,25,39,36,11] and the many references therein) for which such expressions are not available.
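The O(KN^2) evaluation just described can be sketched as follows (a Python illustration with hypothetical stand-ins for f and g_k, validated against a brute-force average over all mixed tuples; it is not the paper's code):

```python
import itertools
import numpy as np

def pf_estimate_fast(theta, X, f, g):
    """O(K N^2) evaluation of the product-form estimator for
    phi(theta, x) = f(theta) * prod_k g(theta, x_k): average over the
    theta-samples of f(theta_i) * prod_k mean_n g(theta_i, X[n, k])."""
    N, K = X.shape
    return np.mean([f(t) * np.prod([np.mean(g(t, X[:, k])) for k in range(K)])
                    for t in theta])

def pf_estimate_bruteforce(theta, X, f, g):
    """O(N^(K+1)) direct average over all mixed tuples (validation only)."""
    N, K = X.shape
    total = 0.0
    for i in range(N):
        for idx in itertools.product(range(N), repeat=K):
            total += f(theta[i]) * np.prod([g(theta[i], X[idx[k], k])
                                            for k in range(K)])
    return total / N ** (K + 1)

rng = np.random.default_rng(1)
theta = rng.gamma(2.0, size=4)               # stand-ins for prior draws
X = rng.normal(size=(4, 2))                  # stand-ins for latent proposals
f = lambda t: t                              # hypothetical f
g = lambda t, x: np.exp(-0.5 * x ** 2 / t)   # hypothetical g_k (same for all k)

assert abs(pf_estimate_fast(theta, X, f, g)
           - pf_estimate_bruteforce(theta, X, f, g)) < 1e-10
```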
When applying IS, or extensions thereof like SMC, one should choose the proposal to be as close as possible to the target (e.g. see [1]). In this regard, the product-form IS approach is not entirely satisfactory for the above example: by definition, the proposal must be fully factorized while the target, π in (22), is only partially so (the latent variables are independent only when conditioned on the parameter variable). As we show in the next section, it is straightforward to adapt this product-form IS approach to match such partially-factorized targets.

Partially-factorized targets and proposals
Consider a target or proposal µ over a product space (Θ × S, T × S) with the same partial product structure as the target in Example 5:

µ(dθ, dx) = µ_0(dθ) Π_{k=1}^K M_k(θ, dx_k),

where, for each k in [K], θ ↦ M_k(θ, dx_k) denotes a Markov kernel mapping from (Θ, T) to (S_k, S_k). Suppose that we are given M i.i.d. samples θ_1, ..., θ_M drawn from µ_0 and, for each of these, N (conditionally) i.i.d. samples X_{m,1}, ..., X_{m,N} drawn from the product kernel M(θ, dx) := Π_{k=1}^K M_k(θ, dx_k) evaluated at θ_m. Given a test function ϕ on Θ × S, consider the following 'partially product-form' estimator for µ(ϕ):

µ^{M,N}_×(ϕ) := M^{-1} Σ_{m=1}^M N^{-K} Σ_{n∈[N]^K} ϕ(θ_m, X_{m,n_1,1}, ..., X_{m,n_K,K}) for all M, N > 0. (27)

It is well-founded (for simplicity, we only consider the estimator's asymptotics as M → ∞ with N fixed, but other limits can be studied by combining the approaches in Appendix A with those in the proof of Theorem 3): Theorem 3. The estimator in (27) is unbiased and, for all N > 0, strongly consistent. If, furthermore, ϕ belongs to L²_µ, then M^{[K]\A}(ϕ) belongs to L²_{µ_0 ⊗ M^A} for all subsets A of [K], where M^A(θ, dx_A) := Π_{k∈A} M_k(θ, dx_k), and the estimator is asymptotically normal in the sense of (28), where ⇒ denotes convergence in distribution. Proof. See Appendix C.
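A minimal sketch of the partially product-form estimator, assuming a test function of the form ϕ(θ, x) = f(θ) Π_k g(x_k) so that the inner N^K-tuple average factorizes (the functions and sample sizes below are arbitrary stand-ins, not the paper's code):

```python
import itertools
import numpy as np

def partially_pf_estimate(theta, X, f, g):
    """Partially product-form estimator for phi(theta, x) = f(theta) * prod_k g(x_k):
    X[m, n, k] is the n-th conditional draw of component k given theta[m].
    The inner N^K-tuple average factorizes, so the cost is O(K M N)."""
    M, N, K = X.shape
    return np.mean([f(theta[m]) * np.prod([np.mean(g(X[m, :, k])) for k in range(K)])
                    for m in range(M)])

rng = np.random.default_rng(2)
M, N, K = 3, 4, 2
theta = rng.normal(size=M)
X = rng.normal(size=(M, N, K))
f = lambda t: 1.0 + t ** 2      # hypothetical f
g = np.sin                      # hypothetical g_k (same for every k)

# validate against the direct average over all N^K tuples per theta-sample
brute = np.mean([f(theta[m])
                 * np.mean([np.prod([g(X[m, i[k], k]) for k in range(K)])
                            for i in itertools.product(range(N), repeat=K)])
                 for m in range(M)])
assert abs(partially_pf_estimate(theta, X, f, g) - brute) < 1e-12
```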
In fact, modulo a small caveat (cf. Remark 1 below), µ^{M,N}_×(ϕ) yields the best unbiased estimates of µ(ϕ) achievable using only the knowledge that µ is partially-factorized and M i.i.d. samples drawn from µ_0 ⊗ M^N: a perhaps unsurprising fact given that it is the composition of two minimum variance unbiased estimators (Theorem 2). Theorem 4. If µ_0 possesses no atoms, i.e. (29) holds, then µ^{M,N}_×(ϕ) is a minimum variance unbiased estimator of µ(ϕ).

Proof. See Appendix D.
Remark 1 (The importance of (29)). Consider the extreme scenario in which µ_0 is a Dirac delta at some θ*, so that θ_1 = ··· = θ_M = θ* with probability one. In this case, we are clearly better off (at least in terms of estimator variance) stacking all of our X samples into one big ensemble and replacing the partially product-form estimator with the (fully) product-form estimator µ^{MN}_×(ϕ) built from the stacked ensemble, where (X_l)_{l∈[MN]} denotes (X_{m,n})_{m∈[M], n∈[N]} in vectorized form (indeed, Theorem 2 implies that µ^{MN}_×(ϕ) is a minimum variance unbiased estimator in this situation). More generally, note that µ_0 not possessing atoms, i.e. (29), is equivalent to µ_0^2({θ_1 = θ_2}) = 0. It is then straightforward to argue that (29) is equivalent to the almost-sure impossibility of several θ_m coinciding. Were this not the case, the estimator in (27) would not possess the MVU property. To recover it, we would need to amend the estimator as follows: 'if several θ_ms take the same value, first stack their corresponding X_{m,1}, ..., X_{m,N} samples, and then apply a product-form estimator to the stacked samples'. However, so as to not overly complicate this section's exposition and Theorem 4's proof, we restricted ourselves to distributions satisfying (29).
We are now in a position to revisit Example 5 and better adapt the proposal to the target. This leads to a special case of an algorithm known as 'importance sampling squared' or 'IS²' [66]: Example 6 (A simple hierarchical model, revisited). Consider again the model in Example 5. Recall that our previous choice of proposal did not quite capture the conditional independence structure in the target π: the former was fully factorized while the latter is only partially so. It seems more natural to instead use the proposal in (24), which is also easy to sample from but both mirrors π's independence structure and leads to further cancellations in the weight function (in particular, it no longer depends on θ). It follows that, to estimate the normalizing constant, or any integral w.r.t. a univariate marginal of the posterior, we need to draw samples from µ and evaluate the partially product-form estimator µ^{M,N}_×(ϕ) for a test function of the form ϕ(θ, x) = f(θ) Π_{k=1}^K g_k(x_k). The total cost then reduces to O(KMN), because

µ^{M,N}_×(ϕ) = M^{-1} Σ_{m=1}^M f(θ_m) Π_{k=1}^K [N^{-1} Σ_{n=1}^N g_k(X_{m,n,k})].

We also return to this in Section 3.4.

Grouped independence Metropolis-Hastings
As a further example of how one may embed product-form estimators within more sophisticated Monte Carlo methodology and exploit the independence structure present in the problem, we revisit Beaumont's Grouped Independence Metropolis-Hastings (GIMH, [8]), a simple and well-known pseudo-marginal MCMC sampler [4]. Like many of these samplers, it is intended to tackle targets whose densities cannot be evaluated pointwise but are marginals of higher-dimensional distributions whose densities can be evaluated pointwise. Our inability to evaluate the target's density precludes us from directly applying the Metropolis-Hastings algorithm (MH, e.g. see [7, Chap. XIII]) as we cannot compute the necessary acceptance probabilities. For instance, in the case of a target π(dθ) on a space (Θ, T) and an MH proposal Q(θ, dθ̃) with respective densities π(θ) and Q(θ, θ̃), we would need to evaluate the ratio π(θ̃)Q(θ̃, θ)/[π(θ)Q(θ, θ̃)], where θ denotes the chain's current state and θ̃ ∼ Q(θ, ·) the proposed move. GIMH instead replaces the intractable π(θ) and π(θ̃) in the above with importance sampling estimates (31) thereof: if π(θ, x) denotes the density of the higher-dimensional distribution π(dθ, dx) whose marginal is π(dθ), and w(θ, x) := π(θ, x)/M(θ, x) for a given Markov kernel M(θ, dx) with density M(θ, x), these estimates are sample averages of the weights w evaluated at i.i.d. draws from M(θ, dx) and M(θ̃, dx). As explained in [4,5], the algorithm's correctness does not require the density estimates to be generated by (31), only for them to be unbiased. In particular, if the estimates are unbiased, GIMH may be interpreted as an MH algorithm on an expanded state space with an extension of π(dθ) as its invariant distribution. Consequently, provided that the density estimator is suitably well-behaved, GIMH returns consistent and asymptotically normal estimates of the target under conditions comparable to those for standard MH algorithms (e.g. the GIMH chain is uniformly ergodic whenever the associated 'marginal' chain is and the estimator is uniformly bounded [4]; see [5] for further refinements).
Consequently, if the kernel is product-form (i.e. M(θ, dx) is product-form for each θ), we may replace the estimates in (31) with their product-form counterparts (32), which average the weights over all N^K mixed tuples, where K denotes the dimensionality of the x-variables (the unbiasedness follows from X^n and X̃^n having respective laws M(θ, dx) and M(θ̃, dx) for any n in [N]^K). Thanks to the results in [6], it is straightforward to show that this choice leads to lower estimator variances, at least asymptotically: Corollary 5. Let (θ^{m,N})_{m=1}^∞ and (θ^{m,N}_×)_{m=1}^∞ be the GIMH chains generated using (31) and (32), respectively, and the same proposal Q(θ, dθ̃). If ϕ belongs to L²_π, then the asymptotic variance of the ergodic averages of ϕ along (θ^{m,N}_×)_{m=1}^∞ is bounded above by that along (θ^{m,N})_{m=1}^∞. Given the argument used in the proof, the results of [6] (Theorem 10 in particular) imply much more than the variance bound in the corollary's statement. For instance, if the target is not concentrated on points, then the spectral gap of (θ^{m,N}_×)_{m=1}^∞ is bounded below by that of (θ^{m,N})_{m=1}^∞. We finish the section by returning to our running example: there, the weight function factorizes, resulting in an evaluation cost of O(KN) for (31, 32) and, regardless of which density estimates we use, a total cost of O(KMN), where M denotes the number of steps we run the chain for. We return to this in the following section.
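The pseudo-marginal mechanics with a product-form density estimate can be sketched in a few lines of Python. The latent model, weight function, and proposal below are hypothetical stand-ins chosen so that the weights factorize across components; this is a toy illustration, not the paper's implementation:

```python
import numpy as np

K, N = 5, 10

def pf_density_estimate(theta, rng):
    """Unbiased product-form estimate of a marginal density of the form
    pi(theta) proportional to E[prod_k w_k(theta, X_k)], X_k ~ M_k(theta, .).
    Because the weight factorizes, the N^K-tuple average collapses into a
    product of per-component sample means, at O(K N) cost.
    (M_k and w_k here are hypothetical stand-ins.)"""
    X = rng.normal(scale=np.sqrt(theta), size=(N, K))   # X_k ~ N(0, theta)
    w = np.exp(-0.5 * (X - 1.0) ** 2)                   # stand-in weights w_k
    return np.prod(w.mean(axis=0))

# A short pseudo-marginal random-walk chain on theta > 0; the reflected
# normal proposal is symmetric, so the acceptance ratio involves only the
# (estimated) target densities.
rng = np.random.default_rng(3)
theta, z = 1.0, pf_density_estimate(1.0, rng)
chain = []
for _ in range(200):
    prop = abs(theta + 0.5 * rng.normal())
    z_prop = pf_density_estimate(prop, rng)
    if rng.uniform() < min(1.0, z_prop / z):
        theta, z = prop, z_prop
    chain.append(theta)
```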

Numerical comparison
Here, we apply the estimators discussed throughout Sections 3.1-3.3 to the simple hierarchical model introduced in Example 5 and examine their performance. To benchmark the latter, we choose the prior to be conditionally conjugate to the model's likelihood: p(dθ) is the Inv-Gamma(α/2, αβ/2) distribution, in which case the posterior's full conditional distributions are available in closed form and we can alternatively approximate the posterior, π(dθ, dx) in (22), using a Gibbs sampler. Note that these closed-form expressions are unnecessary for the evaluation of the estimators in Sections 3.1-3.3.
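A sketch of the resulting Gibbs sweep, assuming the standard conjugate conditionals for this model: θ | x ~ Inv-Gamma((α + K)/2, (αβ + Σ_k x_k²)/2) and x_k | θ, y_k ~ N(y_k θ(1 + θ)^{-1}, θ(1 + θ)^{-1}) (the latter matches the conditional quoted later in this section; both are assumptions of this sketch and should be re-derived before serious use):

```python
import numpy as np

def gibbs_sweeps(y, alpha, beta, n_steps, rng):
    """Gibbs sampler sketch for the hierarchical model of Example 5,
    assuming the conjugate conditionals stated in the lead-in above."""
    K = len(y)
    theta, thetas = 1.0, []
    for _ in range(n_steps):
        s = theta / (1.0 + theta)
        x = rng.normal(y * s, np.sqrt(s))          # x_k | theta, y_k
        # theta | x ~ Inv-Gamma((alpha + K)/2, (alpha*beta + sum x^2)/2),
        # sampled as scale-parameter over a Gamma draw
        theta = ((alpha * beta + (x ** 2).sum()) / 2) / rng.gamma((alpha + K) / 2)
        thetas.append(theta)
    return np.array(thetas)

rng = np.random.default_rng(6)
y = rng.normal(size=100)                           # stand-in observations
thetas = gibbs_sweeps(y, 1.0, 1.0, 500, rng)
```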
To compare with standard methodology that also does not require such expressions, we additionally approximate the posterior using Random Walk Metropolis (RWM) with the proposal variance tuned so that the mean acceptance probability (approximately) equals 25%. To keep the comparison honest, we run these two chains for N² steps and set M = N for the estimators in Sections 3.2 and 3.3, in which case all estimators incur a similar O(KN²) cost. We further fix K := 100, α := 1, β := 1, and N := 100, and generate artificial observations y_1, ..., y_100 by running (21) with θ := 1. Figure 4 shows approximations to the posterior's θ-marginal π(dθ) obtained using a Gibbs sampler, RWM, IS (Section 3.1), IS² (Section 3.2), GIMH (Section 3.3), and the last three's product-form variants (PFIS, PFIS², and PFGIMH, respectively). In the cases of Gibbs, RWM, GIMH, and PFGIMH, we used a 20% burn-in period and approximated the marginal with the empirical distribution of the θ-components of the states visited by the chain. For GIMH and PFGIMH, we also used a random walk proposal with its variance tuned so that the mean acceptance probability hovered around 25%. For IS, PFIS, IS², and PFIS², we used the proposals specified in Examples 5 and 6 and computed the approximations using the corresponding self-normalized estimators.
(Note that for IS, we are using N² samples instead of N so that its cost is also O(KN²).) Our first observation is that the approximations produced by IS, IS², and GIMH are very poor. The first two exhibit severe weight degeneracy (in either case, a single particle had over 50% of the probability mass and three had over 90%), something unsurprising given the target's moderately high dimension of 101. The third possesses a large spurious peak close to zero (with over 70% of the mass) caused by large numbers of rejections in that vicinity. Replacing the i.i.d. estimators embedded within these algorithms with their product-form counterparts removes both the weight degeneracy and the spurious peak, and PFIS, PFIS², and PFGIMH return much improved approximations. The best approximation is the one returned by the Gibbs sampler: an expected outcome given that the sampler's use of the conditional distributions makes it the estimator most 'tailored' or 'well-adapted' to the target. However, these distributions are not available for most models (precluding the application of such samplers to those models), and even just taking the, usually obvious, independence structure into account can make a substantial difference: the quality of the approximations returned by PFIS and PFIS² exceeds that returned by the common, or even default, choice of RWM. Note that this is the case even though the proposal variance in RWM was tuned, while that in the other two was simply set to 1 (a reasonable choice given that θ = 1 was used to generate the data, but likely not the optimal one). In fact, for this simple model, it is easy to sensibly incorporate the observations into the PFIS and PFIS² proposals (e.g. use p(dθ) Π_{k=1}^K N(dx_k; y_k, 1) for PFIS and p(dθ) Π_{k=1}^K N(dx_k; y_k θ(1 + θ)^{-1}, θ(1 + θ)^{-1}) for PFIS²) and potentially improve their performance.
Figure 4: Empirical cumulative distribution functions for π(dθ) obtained using the eight approximations discussed in the text (black solid lines). As guides, we also plot a high-quality reference approximation π_REF(dθ) computed using (25) and quadrature, and the pointwise absolute difference between π_REF and the eight approximations (grey lines). Insets: Wasserstein-1 distance (W_1) between π_REF and the panel's approximation (area under the grey line, e.g. see [62, p. 64]) and the corresponding Kolmogorov-Smirnov statistic (KS, maximum of the grey line).
To benchmark the approaches more thoroughly, we generated R := 100 replicates of the eight full posterior approximations and computed various error metrics (Tables 1 and 2). For the θ-component, we used the high-quality reference approximation π_REF described in Figure 4's caption to obtain the average (across repeats) W_1 distance and KS statistic (as described in the caption), and the average absolute error of the posterior mean and standard deviation estimates normalized by the true mean or standard deviation (i.e. M_θ^{-1} R^{-1} Σ_{r=1}^R |M_θ^r − M_θ| for the posterior mean estimates, where M_θ denotes the true mean and M_θ^r the r-th estimate thereof, and similarly for the standard deviation estimates). For the x-components, we instead used high-accuracy estimates of the component-wise means and standard deviations (obtained by running a Gibbs sampler for N^4 = 10^8 steps) to compute the corresponding total absolute errors across replicates and components (Σ_{k=1}^K Σ_{r=1}^R |M_k^r − M_k|, where M_k denotes the true mean for the k-th x-component and M_k^r the r-th estimate thereof, and similarly for the standard deviation estimates).
Once again, the product-form estimators far outperformed their i.i.d. counterparts. Moreover, they perform just as well as or better than RWM. PFIS²'s estimates are particularly accurate: a fact that does not surprise us given that its proposal has the same partially-factorized structure as the target, making it, in this sense, the estimator best adjusted to the problem; best, that is, except for the Gibbs sampler, which exploits the conditional distributions (encoding more information than this structure). We conclude with an interesting detail: PFIS² and PFIS perform similarly when approximating the θ-marginal (cf. Table 1) but not the latent variable marginals (cf. Table 2). This is perhaps not too surprising because, in the case of the θ-marginal approximation, both PFIS² and PFIS employ the same number N of θ-samples, while, in that of the k-th latent variable, PFIS² uses N² x_k-samples but PFIS uses only N such samples.

Table 2: Total absolute error for the mean and standard deviation estimates of π(dx)'s univariate marginals. Note that no results are given for GIMH and PFGIMH since these algorithms directly target the θ-marginal π(dθ).

Discussion
The main message of this paper is that, when using Monte Carlo estimators to tackle problems possessing some sort of product structure, one should endeavour to exploit this structure and improve the estimators' performance. The resulting product-form estimators are not a panacea for the curse of dimensionality in Monte Carlo, but they are a useful, and sometimes overlooked, tool in the practitioner's arsenal, and they make certain problems solvable that otherwise would not be. More specifically, whenever the target, or proposal, we are drawing samples from is product-form, these estimators achieve a smaller variance than their conventional counterparts. In our experience (e.g. Examples 2 and 4), the gap in variance grows at least exponentially with dimension whenever the integrand does not decompose into a sum of low-dimensional functions as in the trivial case (14). For the reasons given in Section 2.2, we expect the variance reduction to be further accentuated by targets that are 'spread out' rather than highly peaked. The gains in statistical efficiency come at a computational price: in the absence of exploitable structure in the test function, product-form estimators incur an O(N^K) cost limiting their applicability to K < 10, while conventional estimators only carry an O(N) cost (although, in practice, the cost of obtaining reasonable estimates using the latter often scales poorly with K, with the effect hidden in the proportionality constant, e.g. Examples 2 and 4). Hence, for general test functions, product-form estimators are of most use when the variance reduction is particularly pronounced or when samples are expensive to acquire (both estimators require drawing the same number N of samples) or store (as, for example, when one employs physical random numbers and requires reproducibility [54]).
In the latter case, product-form estimators enable us to extract the most possible from the samples we have gathered so far: by permuting the samples' components, the estimators artificially generate further samples. Of course, the more permutations we make, the more correlated our sample ensemble becomes, and we get a diminishing-returns effect that results in an O(N^{-1/2}) rate of convergence instead of the O(N^{-K/2}) rate we would achieve using N^K independent samples. There is a middle ground here that remains unexplored: using N < M < N^K permutations instead of all N^K possible, so lowering the cost to O(M) at the expense of some of the variance reduction (see [45,46] for similar ideas in the Monte Carlo literature). In particular, by choosing the M permutations so that the correlations among them are minimized (e.g. the M permutations with least overlap among their components), it might be possible to substantially reduce the cost without sacrificing too much of the variance reduction. Indeed, by setting the number M of permutations to be such that M evaluations of the test function incur a cost comparable to that of generating the N unpermuted tuples, one can ensure that the overall cost of the resulting estimator never greatly exceeds that of the conventional estimator. This type of approach has been studied in the sparse grid literature [28], and is closely related to the theory of incomplete U-statistics [44, Chap. 4.3], an area in which there are ongoing efforts directed at designing good reduced-cost estimators (e.g. [40]).
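The middle ground just described is easy to prototype: fix a subset of index tuples and average the test function over just those. A sketch follows, where the 'diagonal-shifted' tuples are one hypothetical low-overlap choice among many (no two tuples reuse the same sample at the same coordinate):

```python
import itertools
import numpy as np

def pf_subset(samples, phi, tuples):
    """Average phi over a chosen subset of index tuples. Using all N^K
    tuples recovers the full product-form estimator; a smaller subset
    trades some variance reduction for an O(len(tuples)) cost."""
    N, K = samples.shape
    return np.mean([phi(np.array([samples[i[k], k] for k in range(K)]))
                    for i in tuples])

rng = np.random.default_rng(4)
X = rng.normal(size=(6, 3))
phi = lambda x: np.exp(x.sum())     # product-form test function

all_tuples = list(itertools.product(range(6), repeat=3))
full = pf_subset(X, phi, all_tuples)

# one cheap choice: M = N "diagonal-shifted" tuples
shifted = [tuple((n + k) % 6 for k in range(3)) for n in range(6)]
partial = pf_subset(X, phi, shifted)

# sanity check: the full enumeration factorizes for product-form phi
assert abs(full - np.prod(np.exp(X).mean(axis=0))) < 1e-9
```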
There are, however, settings in which product-form estimators should be applied without hesitation: if the integrand is a sum of products (SOP) of univariate functions, the cost comes down to O(N) without affecting the variance reduction (Section 2.4), for instance, when estimating ELBO gradients to optimize mean-field approximations [58] of posteriors e^v with SOP potentials v. More generally, if the test function is a sum of partially-factorized functions, the estimators' evaluation costs can often be substantially reduced (see also Section 2.4) so that the variance reduction far outweighs the milder increases in cost, as we saw with the applications of importance sampling and its product-form variant in Section 3.4.
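As an illustration of the SOP route, the truncated-exponential integrand of Example 4 can be estimated at O(JKN) cost by exploiting the factorization of each series term. A sketch, with the cutoff and sample sizes chosen arbitrarily:

```python
import numpy as np

def sop_truncated_estimate(X, J):
    """Estimate E[exp(X_1 * ... * X_K)] for independent components via the
    truncated series sum_{j<=J} (x_1...x_K)^j / j!, each term of which
    factorizes into a product of per-component moment estimates:
    cost O(J K N) instead of O(N^K)."""
    N, K = X.shape
    est, fact = 0.0, 1.0
    for j in range(J + 1):
        if j > 0:
            fact *= j
        est += np.prod([(X[:, k] ** j).mean() for k in range(K)]) / fact
    return est

rng = np.random.default_rng(5)
X = rng.uniform(0.0, 1.5, size=(1000, 10))   # a = 1.5, K = 10 as in Example 4
est = sop_truncated_estimate(X, 70)
```

For K = 1 and a large cutoff, the truncated-series estimate agrees with the plain sample mean of e^x, which provides a convenient correctness check.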
For integrands lacking this sort of structure, and at the expense of introducing some bias, these types of cost reductions can sometimes be retained if one is able to find a good SOP approximation to the integrand (Example 4). How to construct these approximations for generic functions (or for function classes of interest in given applications) is an open question upon whose resolution the success of this type of approach hinges. In reality, combining product-form estimators with SOP approximations amounts to nothing more than an approximate dimensionality-reduction technique: we approximate a high-dimensional integral with a linear combination of products of low-dimensional integrals, estimate each of the latter separately, and plug the estimates back into the linear combination to obtain an estimate of the original integral. It is certainly not without precedent: for instance, [57,49,27,12] all propose, in rather different contexts, similar approximations, except that the low-dimensional integrals are computed using closed-form expressions or quadrature (for a very well-known example, see the delta method for moments [51]). In practice, the best option will likely involve a mix of these: use closed-form expressions where available, quadrature where possible, and Monte Carlo (or quasi-Monte Carlo) for everything else.
Regarding the computational resources required to evaluate product-form estimators, and the allocation thereof, we ought to mention one interesting variant of the estimators that we omitted from the main text to keep the exposition simple. In particular, throughout we assumed that the same number of samples is drawn from each marginal µ_1, ..., µ_K of the product-form target or proposal µ. This need not be the case: straightforward extensions of our arguments show that the estimator behaves much as (4) does even if a different number N_k of samples is used for each marginal µ_k. This variant potentially allows us to concentrate our computational budget on 'the most important dimensions', an idea that has found significant success in other areas of numerical integration (e.g. [28,29,53]). In our case, this could be done using the pertinent generalizations of the variance expressions in Theorem 1, which are identical except that N^{|A|} therein must be replaced by Π_{k∈A} N_k (these can be obtained by retracing the steps in the theorem's proof). In particular, one could estimate the terms in these expressions and adjust the sample sizes so that the estimator variance is minimized, potentially in an iterative manner leading to an adaptive scheme.

Combining product-form estimators with other Monte Carlo methodology expands their utility beyond product-form targets. We illustrated this in Section 3 by describing the three simplest and most readily accessible such combinations we could think of: their merger with importance sampling, applicable to targets that are absolutely continuous with respect to product-form distributions (Section 3.1); that with importance sampling squared, applicable to targets that are absolutely continuous with respect to partially-factorized distributions (Section 3.2, see also [66]); and that with pseudo-marginal MCMC, applicable to targets with intractable densities (Section 3.3, see also [61]).
In all of these cases, we demonstrated theoretically that the resulting estimators are more statistically efficient than their standard counterparts (Corollaries 2-5). Many other extensions are possible. For instance, one can embed product-form estimators within random-weight particle filters [60,23,24] (and, more generally, algorithms reliant on unbiased estimation) in much the same way as we did for IS² and GIMH in Sections 3.2-3.3. For an example in a slightly different vein, see Appendix F, where we consider 'mixture-of-product-form' estimators applicable to targets which are mixtures of product-form distributions and, by combining these with importance sampling, obtain a product-form version of (stratified) mixture importance sampling estimators [52,33] that is particularly appropriate for multi-modal targets. For further examples, see the divide-and-conquer SMC algorithm [46,43], obtained by combining product-form estimators with SMC, and Tensor Monte Carlo [2], obtained by merging the estimators with variational autoencoders.
When choosing among the resulting (and at times bewildering) constellation of estimators, we recommend following one simple principle: pick estimators that somehow 'resemble' or 'mirror' the target. Good examples of this are well-parametrized Gibbs samplers which generate new samples using the target's exact conditional distributions and, consequently, often outperform other Monte Carlo algorithms (e.g. Section 3.4). While for many targets these conditional distributions cannot be obtained (nor are good parametrizations known), their (conditional) independence structure is usually obvious (e.g. see [26,25,39,36,11] and the many references therein) and can be mirrored using product-form estimators within one's methodology of choice. Indeed, in the case of the simple hierarchical model (Example 5), it was the PFIS 2 estimator utilizing samples with exactly the same independence structure as the model's that performed best (besides the Gibbs sampler). Of course, this model's independence structure was particularly simple, and so were the resulting estimators. However, we believe that broadly the same considerations apply to models with more complex structures and that product-form estimators can be adapted to such structures by following analogous steps.
To summarize, we believe that product-form estimators are of greatest use not on their own, but embedded within more complicated Monte Carlo routines to tackle the aspects of the problem exhibiting product structure. There remains much work to be done in this direction.

A Proof of Theorem 1
We begin with the proof of Lemma 1. Proof of Lemma 1. We start with a simple identity, valid because the alternating binomial sum Σ_{i=0}^j (−1)^i (j choose i) vanishes unless j = 0. Next, we show that, for any non-empty A ⊆ [K] and ψ belonging to L²_{µ_A}, ψ_A in (13) satisfies (34). Because µ_B(ψ_A) = µ_{B\{k}}(µ_k(ψ_A)) for any k in a non-empty subset B of A, we need only show that µ_k(ψ_A) = 0 for any k in A. Fix any such k; integrating both sides of the expression defining ψ_A with respect to µ_k, we obtain µ_k(ψ_A) = 0, and (34) follows. Returning to (13), assume, without loss of generality, that A = {1, ..., |A|}, and expand the product defining ψ_A(X^i), and similarly for X^j. Suppose that i_k ≠ j_k for some k and let F_{i_k, j_k} denote the sigma algebra generated by (X^{i_l}_l, X^{j_l}_l)_{l≠k}. Because the components are independent and both X^{i_k}_k and X^{j_k}_k have law µ_k, the corresponding conditional expectation vanishes, where the last equality follows from (34). Because A is non-empty by assumption and (−1)^{|B|} = (−1)^{−|B|} for all B ⊆ A, (33) then yields the desired expression, and our starting identity completes the proof. We are now ready to tackle the proof of Theorem 1, which we do in steps. Step 1: unbiasedness. Because X^n ∼ µ for all n in [N]^K, this follows directly from the estimator's definition in (4).
Step 2: square integrability of µ_{A^c}(ϕ). That, for any subset A of [K], µ_{A^c}(ϕ) is square µ_A-integrable follows from our assumption that ϕ is square µ-integrable and Jensen's inequality. Step 3: variance expressions. By (12, 13), the variance decomposes into terms involving µ^N_A(ψ_A), where ψ_A is as in (13) with ψ := µ_{A^c}(ϕ). Fix any non-empty subsets A, A′ of [K] such that A\A′ is non-empty, set F_{A′} to be the sigma algebra generated by all (X^n_l)_{n=1}^N with l in A′, and condition on F_{A′}. Once again assuming without loss of generality that A = {1, ..., |A|} and applying (34), the conditional expectation vanishes almost surely, and we find that E[µ^N_A(ψ_A) µ^N_{A′}(ψ_{A′})] = 0. Because an analogous argument shows that this is the case if A′\A is non-empty, (37) reduces to a sum over matching subsets, and the variance expressions in (8) follow from Lemma 1.
Step 4: consistency and asymptotic normality. From (12), it follows that, for all $N > 0$, $\mu^N_\times(\varphi)$ decomposes into a leading term and a remainder $R^N$. Given the square integrability in Step 2 and the fact that $\mu(\varphi) = \mu_k(\mu_{\{k\}^c}(\varphi))$, the classical law of large numbers and central limit theorem imply that $\lim_{N \to \infty} \mu^N_k(\mu_{\{k\}^c}(\varphi)) - \mu(\varphi) = 0$ almost surely, for all $k$ in $[K]$. Because $(X^n_1)_{n=1}^\infty, \dots, (X^n_K)_{n=1}^\infty$ are independent sequences, so are the corresponding empirical averages, and the continuous mapping theorem yields (38) as $N \to \infty$. Hence, we need only show that the remainder term $R^N$ tends to zero fast enough, i.e. that (39) holds:
then (10) and (11) follow from (38) and Slutsky's theorem. To do so, note that the same argument as in Step 3 yields a bound on the second moment of $R^N$, and Markov's inequality then bounds $P(|R^N| > \varepsilon)$ for each $\varepsilon > 0$. Summing over $N$ and applying a Borel–Cantelli argument, we obtain the first limit in (39). For the second, set $\varepsilon := N^{-1/2}\delta$ for any $\delta > 0$ and take the limit $N \to \infty$.
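To make the unbiasedness and variance reduction established in Steps 1–3 concrete, the following is a minimal numerical sketch of our own (not taken from the paper), assuming $K = 2$, standard Gaussian marginals, and the test function $\varphi(x, y) = xy$; all function names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_estimate(N):
    """Standard Monte Carlo: phi averaged over N i.i.d. joint draws."""
    x, y = rng.standard_normal(N), rng.standard_normal(N)
    return (x * y).mean()

def product_form_estimate(N):
    """Product-form estimator: phi averaged over all N**2 pairings of
    N i.i.d. draws per coordinate (K = 2), computed via broadcasting."""
    x, y = rng.standard_normal(N), rng.standard_normal(N)
    return (x[:, None] * y[None, :]).mean()

# Toy product target mu = N(0,1) x N(0,1) with mu(phi) = 0. Both estimators
# are unbiased, but the product-form one reuses each draw N times.
reps, N = 2000, 100
std = np.array([mc_estimate(N) for _ in range(reps)])
pf = np.array([product_form_estimate(N) for _ in range(reps)])
var_std, var_pf = std.var(), pf.var()
```

For this $\varphi$, the product-form estimator reduces to the product of the per-coordinate sample means, so its variance is of order $N^{-2}$ rather than $N^{-1}$, which `var_pf` versus `var_std` reflects empirically.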

B Proofs of Theorem 2 and its corollary
Throughout this appendix we use colons to indicate ranges, with $p{:}q = p, p+1, \dots, q$ for integers $p \leq q$, and use ranges in indices to compactly describe tuples. For instance, $x^{1:N}$ or $x^{1:N}_{1:K}$ refers to a tuple in $S^N$, $x^{1:N}_k$ to one in $S^N_k$, $x^n_{1:K}$ to one in $S$, etc. The key to Theorem 2's proof is the following.
Iterating this argument gives us the desired bound:

C Proof of Theorem 3 and its corollary
Theorem 3's proof is straightforward, and the unbiasedness follows directly. Because $Z^N_m$ is a function of $(\theta_m, X^{m,1}, \dots, X^{m,N})$ that does not depend on $m$, and the sequence $((\theta_m, X^{m,1}, \dots, X^{m,N}))_{m=1}^M$ is i.i.d. and drawn from $\mu_0 \otimes M^N$, $(Z^N_m)_{m=1}^M$ forms an i.i.d. sequence and the consistency follows from the law of large numbers. Similarly, because Jensen's inequality implies the requisite square integrability, the central limit theorem for sums of i.i.d. square-integrable random variables tells us that, if $\varphi$ belongs to $L^2_\mu$, then (28) holds with $\sigma^2_{\times,N}(\varphi) = \mathrm{Var}(Z^N_1)$. The expression for the asymptotic (in $M$) variance then follows from the law of total variance, where, for each $\theta$ in $\Theta$, $V^N_{x,\times}(\theta)$ denotes the RHS of (8) with $M(\theta, dx)$ replacing $\mu(dx)$ therein. Lastly, the result follows using once again the independence of $(Z^N_m)_{m=1}^M$.
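The structure underlying this proof — i.i.d. outer draws $\theta_m \sim \mu_0$, an inner product-form estimate $Z^N_m$ built from $N$ draws per coordinate from $M(\theta_m, dx)$, and an average over $m$ — can be sketched as follows. The toy $\mu_0$, kernel $M$, test function, and all names here are our own assumptions, chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def Z(theta, N):
    """Inner product-form estimate of int phi(x) M(theta, dx), for a toy
    kernel M(theta, .) = N(theta,1) x N(theta,1) and phi(x1, x2) = x1 + x2."""
    x1 = theta + rng.standard_normal(N)
    x2 = theta + rng.standard_normal(N)
    return (x1[:, None] + x2[None, :]).mean()   # average over all N**2 pairs

# The (theta_m, X^{m,1:N}) are i.i.d., hence so are the Z^N_m, and their
# average is unbiased and consistent for mu(phi) by the law of large numbers.
M_outer, N = 4000, 25
thetas = rng.standard_normal(M_outer)           # theta_m ~ mu_0 = N(0, 1)
est = np.mean([Z(t, N) for t in thetas])        # here mu(phi) = 2 E[theta] = 0
```

The law-of-total-variance split in the proof shows up here as the sum of the variance induced by $\theta_m$ and the average inner (product-form) variance.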

D Proof of Theorem 4
In this appendix we use colon notation analogous to that introduced in Appendix B. The proof is a conceptually straightforward (although notationally arduous) extension of that given for Theorem 2 in Appendix B, and we only sketch it. In short, one needs to exploit the two symmetries present in the problem, in which case (41) applies.

E Proof of Corollary 5
This corollary follows from Theorem 10 and Remark 12 in [6] and the fact that, for all $\theta$ in $\Theta$ and $N > 0$, $\pi^N_\times(\theta)$ is bounded above by $\pi^N(\theta)$ in the convex order: (49) holds for all convex functions $f : \mathbb{R} \to \mathbb{R}$. To argue the above, fix any such $f$, $\theta$, and $N$, and note that, because the tuple in question has law $\pi(dx_2, \dots, dx_K)$ and is independent of $(X^1_1, \dots, X^N_1)$, $(Y^N_m)_{m \in [N]^{K-1}}$ is a collection of identically distributed random variables. Moreover, by definition, $\pi^N(\theta) = Y^N_{(N-1, \dots, N-1)}$, so the $(Y^N_m)_{m \in [N]^{K-1}}$ all share the same distribution as $\pi^N(\theta)$, and (49) follows from Jensen's inequality.

F Estimators for targets that are mixtures of product-form distributions
Suppose that our target $\mu$ is not a product-form distribution but a mixture of several, where the mixture weights $\theta_1, \dots, \theta_I > 0$ satisfy $\sum_{i=1}^I \theta_i = 1$ and, for each $i$ in $[I]$, $\mu^i$ is product-form. Notice that, to further broaden the applicability of the ensuing estimators, we allow for mixture components that are products over some but not all dimensions (in the case where every mixture component factorizes fully, $K_i = K$ and $K^i_k = \{k\}$ for each $i$ in $[I]$ and $k$ in $[K]$).

Fix a square $\mu$-integrable test function $\varphi$ and suppose we wish to compute $\mu(\varphi)$. It is well known [52,33] that the basic Monte Carlo estimator, where $X^1, \dots, X^N$ denote i.i.d. samples drawn from $\mu$, is outperformed by its stratified variant, where $(X^{1,1}, \dots, X^{1,N_1}), \dots, (X^{I,1}, \dots, X^{I,N_I})$ are $I$ independent sequences of i.i.d. samples respectively drawn from $\mu^1, \dots, \mu^I$, and $N_1 = \theta_1 N, \dots, N_I = \theta_I N$ for some $N > 0$ (for convenience, we assume that these are integers). In particular, $\mu^N_s(\varphi)$ is a consistent, unbiased, and asymptotically normal estimator for $\mu(\varphi)$ whose variance is bounded above by that of $\mu^N(\varphi)$.

Given our assumption that the distributions in the mixture are product-form, we easily improve on $\mu^N_s(\varphi)$ using the product-form analogues of $\mu^{1,N_1}(\varphi), \dots, \mu^{I,N_I}(\varphi)$: the resulting estimator, $\mu^N_{s,\times}(\varphi)$ in (52), is more statistically efficient than $\mu^N_s(\varphi)$. In particular:

Corollary 6. If $\varphi$ is $\mu$-integrable, then $\mu^N_{s,\times}(\varphi)$ in (52) is an unbiased estimator for $\mu(\varphi)$. If $\varphi$ belongs to $L^2_\mu$ and $N_1, \dots, N_I$ are integers proportional to $N$ (i.e. $N_1 = \alpha_1 N, \dots, N_I = \alpha_I N$ for some $\alpha_1, \dots, \alpha_I > 0$), then $\mu^N_{s,\times}(\varphi)$ is strongly consistent and asymptotically normal, with $\mathrm{Var}(\mu^N_{s,\times}(\varphi)) = \sum_{i=1}^I \theta_i^2 \mathrm{Var}(\mu^{i,N_i}_\times(\varphi))$. If, in particular, $\alpha_1 = \theta_1, \dots, \alpha_I = \theta_I$, then the variances of $\mu^N_{s,\times}$ are bounded above by those of $\mu^N_s$.

Proof. Follows from Theorem 1, Corollary 1, and the continuous mapping theorem.
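The estimator $\mu^N_{s,\times}$ can be sketched as follows for a toy two-component mixture whose components are products over $K = 2$ coordinates. The weights, the Gaussian components, and all names below are our own assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy mixture: mu = 0.3 * N(-2,1) x N(-2,1) + 0.7 * N(2,1) x N(2,1).
weights, means = np.array([0.3, 0.7]), np.array([-2.0, 2.0])

def stratified_product_estimate(phi, N):
    """Weighted sum of per-component product-form estimates, with samples
    allocated proportionally to the mixture weights (N_i = theta_i * N)."""
    total = 0.0
    for theta_i, m in zip(weights, means):
        N_i = int(theta_i * N)               # assumed to be an integer
        x = m + rng.standard_normal(N_i)     # N_i i.i.d. draws per coordinate
        y = m + rng.standard_normal(N_i)
        # product-form average over all N_i**2 pairs, via broadcasting
        total += theta_i * phi(x[:, None], y[None, :]).mean()
    return total

# phi(x, y) = x + y, so mu(phi) = 0.3*(-4) + 0.7*4 = 1.6.
est = stratified_product_estimate(lambda x, y: x + y, N=1000)
```

Each stratum contributes a product-form estimate exactly as in Theorem 1, so the unbiasedness and variance bound of Corollary 6 carry over stratum by stratum.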
Employing a Lagrange multiplier as in [7, p. 153], we find that $N^*_i \propto \theta_i \sigma^i_\times(\varphi)$ for all $i$ in $[I]$ achieves the smallest possible asymptotic variance, $(\sum_{i=1}^I \theta_i \sigma^i_\times(\varphi))^2$, for fixed $N = \sum_{i=1}^I N_i$. Of course, $\sigma^1_\times(\varphi), \dots, \sigma^I_\times(\varphi)$ are unknown in practice and must be estimated, which would lead to an adaptive scheme similar to those for $\mu^N_s(\varphi)$ in [56,22,20].
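In code, the allocation $N^*_i \propto \theta_i \sigma^i_\times(\varphi)$ amounts to a Neyman-style rule with pilot estimates standing in for the unknown $\sigma^i_\times(\varphi)$; the pilot values and function name below are made up for illustration.

```python
import numpy as np

# Allocation N_i* proportional to theta_i * sigma^i_x(phi), with the unknown
# per-component standard deviations replaced by (made-up) pilot estimates.
weights = np.array([0.3, 0.7])
pilot_sigmas = np.array([5.0, 1.0])

def optimal_allocation(weights, sigmas, N):
    raw = weights * sigmas
    alloc = np.floor(N * raw / raw.sum()).astype(int)
    alloc[np.argmax(raw)] += N - alloc.sum()   # assign the rounding remainder
    return alloc

alloc = optimal_allocation(weights, pilot_sigmas, 1000)   # -> array([682, 318])
```

Note that the high-variance first component receives far more than its proportional share $\theta_1 N = 300$, which is precisely the point of the rule.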