Convergence rates for optimised adaptive importance samplers

Adaptive importance samplers are adaptive Monte Carlo algorithms that estimate expectations with respect to some target distribution and adapt themselves to obtain better estimators over a sequence of iterations. Although it is straightforward to show that they have the same O(1/√N) convergence rate as standard importance samplers, where N is the number of Monte Carlo samples, the behaviour of adaptive importance samplers over the number of iterations has been left relatively unexplored. In this work, we investigate an adaptation strategy based on convex optimisation which leads to a class of adaptive importance samplers termed optimised adaptive importance samplers (OAIS). These samplers rely on the iterative minimisation of the χ²-divergence between an exponential family proposal and the target. The analysed algorithms are closely related to the class of adaptive importance samplers which minimise the variance of the weight function. We first prove non-asymptotic error bounds for the mean squared errors (MSEs) of these algorithms, which explicitly depend on the number of iterations and the number of samples together.
The non-asymptotic bounds derived in this paper imply that when the target belongs to the exponential family, the L_2 errors of the optimised samplers converge to the optimal rate of O(1/√N) and the rate of convergence in the number of iterations is explicitly provided. When the target does not belong to the exponential family, the rate of convergence is the same but the asymptotic L_2 error increases by a factor √ρ⋆ > 1, where ρ⋆ − 1 is the minimum χ²-divergence between the target and an exponential family proposal.


Introduction
The class of adaptive importance sampling (AIS) methods is a key Monte Carlo methodology for estimating integrals that cannot be obtained in closed form (Robert and Casella 2004). This problem arises in many settings, such as Bayesian signal processing and machine learning (Bugallo et al. 2015, 2017) or optimal control (Kappen and Ruiz 2016), where the quantities of interest are usually defined as intractable expectations. Adaptive importance samplers are versions of classical importance samplers (IS) which iteratively improve the proposals to generate samples better suited to the estimation problem at hand. Their variants include, for example, population Monte Carlo methods (Cappé et al. 2004) and adaptive mixture importance sampling (Cappé et al. 2008). Since there has been a surge of papers on the topic of AIS recently, a comprehensive review is beyond the scope of this article; see, e.g. Bugallo et al. (2017) for a recent review. [Funding: Ö.D.A. is funded by the Lloyds Register Foundation programme on Data Centric Engineering through the London Air Quality project. This work was supported by The Alan Turing Institute for Data Science and AI under EPSRC Grant EP/N510129/1. J.M. acknowledges the support of the Spanish Agencia Estatal de Investigación (awards TEC2015-69868-C2-1-R ADVENTURE and RTI2018-099655-B-I00 CLARA) and the Office of Naval Research (Award No. N00014-19-1-2226).]
Due to the popularity of adaptive importance samplers, their theoretical performance has also received attention in the past few years. As with conventional IS methods, AIS schemes enjoy the classical O(1/√N) convergence rate of the L_2 error, where N is the number of Monte Carlo samples used in the approximations; see, e.g. Robert and Casella (2004) and Agapiou et al. (2017). However, since an adaptation is performed over the iterations and the goal of this adaptation is to improve the proposal quality, an insightful convergence result would provide a bound which explicitly depends on the number of iterations t (which we sometimes refer to as time) and the number of samples N. Although there are convergence results for adaptive methods (see Douc et al. (2007) for a convergence theory for population Monte Carlo based on minimizing the Kullback-Leibler divergence), none of the available results yields an explicit bound of the error in terms of the number of iterations and the number of particles at the same time.
One difficulty in proving such a result for adaptive mixture samplers is that the adaptive mixtures form an interacting particle system, and it is unclear what kind of adaptation they perform or whether the adapted proposals actually get closer to the target in some metric. An alternative to adaptation using mixtures is the idea of minimizing a cost function in order to adapt the proposal. This idea has been popular in the literature; in particular, minimizing the variance of the weight function has received significant attention, see, e.g. Arouna (2004a, b), Kawai (2008), Lapeyre and Lelong (2011), Ryu and Boyd (2014) and Kawai (2017, 2018). Particularly relevant to us is the work of Ryu and Boyd (2014), who proposed an algorithm called Convex Adaptive Monte Carlo (Convex AdaMC). This scheme is based on minimizing the variance of the IS estimator, which is a quantity related to the χ²-divergence between the target and the proposal. Ryu and Boyd (2014) showed that the variance of the IS estimator is a convex function of the parameters of the proposal when the latter is chosen within the exponential family. Based on this observation, they formulated Convex AdaMC, which draws one sample at each iteration and constructs the IS estimator, and which requires access to the normalised target. They proved a central limit theorem (CLT) for the resulting sampler. The idea was further extended to self-normalised importance samplers by Ryu (2016), who considered minimising the α-divergence between the target and an exponential family and, similarly, proved a CLT for the resulting sampler. Similar ideas were also considered by Kawai (2017, 2018), who likewise aimed at minimizing the variance expression. In particular, Kawai (2018) showed that the variance of the weight function is convex when the proposal family is suitably chosen and provided general conditions for such proposals.
Kawai (2018) has also developed an adaptation technique based on stochastic approximation, which is similar to the scheme we analyse in this paper. There have been other results considering the χ²-divergence and relating it to the necessary sample size of IS methods, see, e.g. Sanz-Alonso (2018). Following the approach of Chatterjee et al. (2018), Sanz-Alonso (2018) ties the necessary sample size to the χ²-divergence; in particular, it is shown that the necessary sample size grows with the χ²-divergence, hence implying that minimizing it can lead to more efficient importance sampling procedures.
In this work, we develop and analyse a family of adaptive importance samplers, coined optimised adaptive importance samplers (OAIS), which relies on a particular adaptation strategy based on convex optimisation. We adapt the proposal with respect to a quantity (essentially the χ²-divergence between the target and the proposal) that also happens to be the constant in the error bounds of the IS [see, e.g. Agapiou et al. (2017)]. Assuming that the proposal distributions belong to the exponential family, we recast the adaptation of the proposal as a convex optimisation problem and then develop a procedure which essentially optimises the L_2 error bound of the algorithm. By using results from convex optimisation, we obtain error rates depending on the number of iterations, denoted t, and the number of Monte Carlo samples, denoted N, together. In this way, we explicitly display the trade-off between these two essential quantities. To the best of our knowledge, none of the papers on the topic provides convergence rates depending explicitly on the number of iterations and the number of particles together, as we do herein.
The paper is organised as follows. In Sect. 2, we introduce the problem definition, the IS and the AIS algorithms. In Sect. 3, we introduce the OAIS algorithms. In Sect. 4, we provide the theoretical results regarding optimised AIS and show its convergence using results from convex optimisation. Finally, we make some concluding remarks in Sect. 5.

Notation
For L ∈ ℕ, we use the shorthand [L] = {1, …, L}. We denote the state space as X and assume X ⊆ R^{d_x}, d_x ≥ 1. The space of bounded real-valued functions and the set of probability measures on the space X are denoted as B(X) and P(X), respectively. Given ϕ ∈ B(X) and π ∈ P(X), the expectation of ϕ with respect to (w.r.t.) π is written as (ϕ, π) = ∫ ϕ(x) π(dx). The unnormalised density associated with π is denoted as Π(x). We denote the proposal as q_θ ∈ P(X), with an explicit dependence on the parameter θ ∈ Θ. The parameter space is assumed to be a subset of the d_θ-dimensional Euclidean space, i.e. Θ ⊆ R^{d_θ}.
Whenever necessary, we denote both the probability measures, π and q θ , and their densities with the same notation. To be specific, we assume that both π(dx) and q θ (dx) are absolutely continuous with respect to the Lebesgue measure, and we denote their associated densities as π(x) and q θ (x). The use of either the measure or the density will be clear from both the argument (sets or points, respectively) and the context.

Background
In this section, we review importance and adaptive importance samplers.

Importance sampling
Consider a target density π ∈ P(X) and a bounded function ϕ ∈ B(X). Often, the main interest is to compute an integral of the form

(ϕ, π) = ∫_X ϕ(x) π(x) dx.   (2.1)

While perfect Monte Carlo can be used to estimate this expectation when it is possible to sample exactly from π(x), this is in general not tractable. Hereafter, we consider the cases when the target can be evaluated exactly and up to a normalising constant, respectively. Importance sampling (IS) uses a proposal distribution which is easy to sample from and evaluate. The method consists in weighting the samples from the proposal in order to correct the discrepancy between the target and the proposal, and finally constructing an estimator of the integral. To be precise, let q_θ ∈ P(X) be the proposal, parameterised by the vector θ ∈ Θ. The unnormalised target density is denoted as Π : X → R_+. Therefore, we have

π(x) = Π(x) / Z_π,  where  Z_π := ∫_X Π(x) dx < ∞.

Next, we define the functions w_θ, W_θ : X → R_+,

w_θ(x) = π(x) / q_θ(x)  and  W_θ(x) = Π(x) / q_θ(x),

respectively. For a chosen proposal q_θ, the IS proceeds as follows. First, a set of independent and identically distributed (iid) samples {x^{(i)}}_{i=1}^N is generated from q_θ. When π(x) can be evaluated, one constructs the empirical approximation of the probability measure π, denoted π_θ^N, as

π_θ^N(dx) = (1/N) ∑_{i=1}^N w_θ(x^{(i)}) δ_{x^{(i)}}(dx),

where δ_{x'}(dx) denotes the Dirac delta measure that places unit probability mass at x = x'. For this case, the IS estimate of the integral in (2.1) can be given as

(ϕ, π_θ^N) = (1/N) ∑_{i=1}^N w_θ(x^{(i)}) ϕ(x^{(i)}).   (2.2)

However, in most practical cases, the target density π(x) can only be evaluated up to an unknown normalizing proportionality constant (i.e. we can evaluate Π(x) but not Z_π). In this case, we construct the empirical measure π_θ^N as

π_θ^N(dx) = ∑_{i=1}^N w̄_θ^{(i)} δ_{x^{(i)}}(dx),  where  w̄_θ^{(i)} = W_θ(x^{(i)}) / ∑_{j=1}^N W_θ(x^{(j)}),

which yields the self-normalised IS (SNIS) estimator

(ϕ, π_θ^N) = ∑_{i=1}^N w̄_θ^{(i)} ϕ(x^{(i)}).   (2.3)

For both estimators, the following error bound holds (Agapiou et al. 2017).

Theorem 1 Let ρ(θ) := E_{q_θ}[w_θ²(X)] < ∞. Then, for any ϕ ∈ B(X),

E[((ϕ, π_θ^N) − (ϕ, π))²] ≤ c_ϕ ρ(θ) / N,   (2.4)

where c_ϕ = 4‖ϕ‖_∞².
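As an illustration, the estimators (2.2) and (2.3) can be sketched in a few lines. The following minimal example assumes a hypothetical one-dimensional setting with target π = N(1, 1), proposal q_θ = N(0, 2²) and the bounded test function ϕ(x) = 1{|x| ≤ 3}; none of these particular choices come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: target pi = N(1, 1), proposal q_theta = N(0, 2^2).
# We estimate (phi, pi) for the bounded test function phi(x) = 1{|x| <= 3}.
def log_pi(x):   # normalised target density, log scale
    return -0.5 * (x - 1.0) ** 2 - 0.5 * np.log(2 * np.pi)

def log_q(x):    # proposal density, log scale
    return -0.5 * (x / 2.0) ** 2 - np.log(2.0) - 0.5 * np.log(2 * np.pi)

phi = lambda x: (np.abs(x) <= 3.0).astype(float)

N = 100_000
x = 2.0 * rng.standard_normal(N)             # x^{(i)} ~ q_theta

# IS estimator (2.2): weights w_theta = pi / q_theta, average of w * phi.
w = np.exp(log_pi(x) - log_q(x))
is_est = np.mean(w * phi(x))

# SNIS estimator (2.3): only the unnormalised target Pi = Z_pi * pi is used;
# the unknown constant Z_pi cancels after normalising the weights.
log_Pi = log_pi(x) + np.log(7.3)             # pretend Z_pi = 7.3 is unknown
W = np.exp(log_Pi - log_q(x))
snis_est = np.sum(W * phi(x)) / np.sum(W)
```

Both estimates approximate P_π(|X| ≤ 3) ≈ 0.977, and their agreement illustrates that the normalising constant plays no role in the SNIS estimate.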
Remark 1 For the IS estimator (2.2), this bound can be improved so that c_ϕ = ‖ϕ‖_∞². However, this improvement does not affect our results in this paper; hence, we present a single bound of the form in (2.4) for the estimators (2.2) and (2.3) for conciseness.
Note that ρ(θ) can also be expressed in terms of the variance of the weight function w_θ, which coincides with the χ²-divergence, i.e. ρ(θ) = var_{q_θ}(w_θ(X)) + 1.
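This identity is easy to verify numerically. The sketch below assumes a hypothetical pair π = N(μ, 1) and q_θ = N(0, 1), for which the closed form ρ(θ) = exp(μ²) is available; these densities are chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical example: pi = N(mu, 1), q_theta = N(0, 1).  For this pair
# the chi^2 divergence is known in closed form: rho(theta) = exp(mu^2).
mu = 0.5
N = 1_000_000
x = rng.standard_normal(N)                         # x ~ q_theta = N(0, 1)
w = np.exp(-0.5 * (x - mu) ** 2 + 0.5 * x ** 2)    # w_theta = pi / q_theta

rho_from_moment = np.mean(w ** 2)                  # rho(theta) = E_q[w^2]
rho_from_var = np.var(w) + 1.0                     # rho(theta) = var_q(w) + 1
```

Both estimates agree with exp(0.25) ≈ 1.284 up to Monte Carlo error, illustrating the two equivalent expressions for ρ(θ).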
Remark 3 Remark 2 implies that when both π and q_θ belong to the same parametric family (i.e. there exists θ⋆ ∈ Θ such that π = q_{θ⋆}), one readily obtains min_{θ∈Θ} ρ(θ) = ρ(θ⋆) = 1.

Remark 4 For the IS estimator (2.2), the bound in Theorem 1 can be modified so that it holds for unbounded test functions ϕ as well; see, e.g. Ryu and Boyd (2014). Therefore, a quantity similar to ρ(θ), which includes ϕ while still retaining convexity, can be optimised for this case. Unfortunately, obtaining such a bound is not straightforward for the SNIS estimator (2.3), as shown by Agapiou et al. (2017). In order to significantly simplify the presentation, we restrict ourselves to the class of bounded test functions, i.e. we assume ‖ϕ‖_∞ < ∞.
Finally, we present a bias result from Agapiou et al. (2017).

Parametric adaptive importance samplers
Standard importance sampling may be inefficient in practice when the proposal is poorly calibrated with respect to the target. In particular, as implied by the error bound provided in Theorem 1, the error made by the IS estimator can be high if the χ²-divergence between the target and the proposal is large. Therefore, it is more common to employ an iterative version of importance sampling, also called adaptive importance sampling (AIS). The AIS algorithms are importance sampling methods which aim at iteratively improving the proposal distributions. More specifically, the AIS methods specify a sequence of proposals (q_t)_{t≥1} and perform importance sampling at each iteration. The aim is to improve the proposal so that the samples are better matched with the target, which results in less variance and more accuracy in the estimators. There are several variants, the most popular one being population Monte Carlo methods (Cappé et al. 2004), which use previous samples in the proposal.

Algorithm 1 Parametric AIS
1: Choose a parametric proposal q_θ with initial parameter θ = θ_0.
2: for t ≥ 1 do
3:   Adapt the proposal, θ_t = T_t(θ_{t−1}).
4:   Sample x_t^{(i)} ∼ q_{θ_t}, i = 1, …, N.
5:   Compute the weights.
6:   Report the point-mass probability measure π_{θ_t}^N and the estimator (ϕ, π_{θ_t}^N).
7: end for
In this section, we review one particular AIS, which we refer to as parametric AIS. In this variant, the proposal distribution is a parametric distribution, denoted q θ . Over time, this parameter θ is updated (or optimised) with respect to a predefined criterion resulting in a sequence (θ t ) t≥1 . This yields a sequence of proposal distributions denoted as (q θ t ) t≥1 .
One iteration of the algorithm goes as follows. Assume that at time t − 1 we are given a proposal distribution q_{θ_{t−1}}. At time t, we first update the parameter of this proposal, θ_t = T_t(θ_{t−1}), where (T_t)_{t≥1} is a sequence of (deterministic or stochastic) maps, e.g. gradient mappings, constructed so that they minimise a certain cost function. Then, in the same way as in conventional IS, we sample x_t^{(i)} ∼ q_{θ_t}, i = 1, …, N, and finally construct the empirical measure π_{θ_t}^N. The estimator of the integral (2.1) is then computed as in Eq. (2.3). The full procedure of the parametric AIS method is summarized in Algorithm 1. Since this is a valid IS scheme, this algorithm enjoys the same guarantee provided in Theorem 1. In particular, we have the following theorem.

Theorem 3 Assume that, given a sequence of proposals (q_{θ_t})_{t≥1}, the estimators are constructed as in Algorithm 1. Then, for any ϕ ∈ B(X) and every t ≥ 1,

E[((ϕ, π_{θ_t}^N) − (ϕ, π))²] ≤ c_ϕ ρ(θ_t) / N.
Proof The proof is identical to the proof of Theorem 1. We have just re-stated the result to introduce the iteration index t.
However, this theorem does not give insight into how the bound behaves as the number of iterations increases, i.e. as t → ∞. Ideally, the adaptation of the AIS should improve this bound with time; in other words, in the ideal case, the error should decrease as t grows. Fortunately, Theorem 3 suggests that the maps T_t : Θ → Θ can be chosen so that the function ρ is minimised over time. More specifically, the sequence (θ_t)_{t≥1} can be chosen so that it leads to a decreasing sequence (ρ(θ_t))_{t≥1} (at least in expectation). In the following sections, we summarize the deterministic and stochastic strategies to achieve this aim.
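For concreteness, the loop of Algorithm 1 can be sketched as follows. The target, the proposal family and the moment-matching adaptation map T_t used here are placeholders chosen only so that the sketch is self-contained; the maps analysed in the remainder of the paper are gradient based.

```python
import numpy as np

rng = np.random.default_rng(2)

# Skeleton of the parametric AIS loop (Algorithm 1) for a hypothetical 1-D
# setting: unnormalised target Pi (here N(3, 1) up to a constant) and the
# Gaussian location family q_theta = N(theta, 1).  The adaptation map T_t
# used below (moment matching on the weighted sample) is only a placeholder.
log_Pi = lambda x: -0.5 * (x - 3.0) ** 2
log_q = lambda x, th: -0.5 * (x - th) ** 2

theta, N = 0.0, 1_000
for t in range(50):
    x = theta + rng.standard_normal(N)          # sample x^{(i)}_t ~ q_{theta_t}
    logW = log_Pi(x) - log_q(x, theta)          # unnormalised log-weights
    wbar = np.exp(logW - logW.max())
    wbar = wbar / wbar.sum()                    # self-normalised weights
    estimate = np.sum(wbar * x)                 # SNIS estimate of E_pi[X]
    theta = estimate                            # placeholder map T_t
```

After a few iterations the proposal mean settles near the target mean, and the SNIS estimate stabilises accordingly.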

Remark 5
We define the unnormalised version of ρ(θ) and denote it as R(θ). It is characterised as follows:

R(θ) := E_{q_θ}[W_θ²(X)] = ∫_X (Π²(x) / q_θ(x)) dx.

Hence, R(θ) can also be expressed as

R(θ) = Z_π² ρ(θ).   (2.6)

AIS with exponential family proposals
Following Ryu and Boyd (2014), we note that when q_θ is chosen as an exponential family density, the function ρ(θ) is convex. In particular, we define

q_θ(x) = exp(⟨θ, T(x)⟩ − A(θ)) h(x),

where T(x) is the vector of sufficient statistics, A(θ) is the log-partition function and h(x) is a base density. Then, we have the following lemma adapted from Ryu and Boyd (2014).

Lemma 1 If q_θ is an exponential family density of the form above, then ρ(θ) is a convex function of θ.
Lemma 1 shows that ρ is a convex function; therefore, optimising it can give us provably convergent algorithms (as t increases). The next lemma, also borrowed from Ryu and Boyd (2014), shows that ρ is differentiable and that its gradient can indeed be computed as an expectation.

Lemma 2
The gradient ∇ρ(θ) can be written as

∇ρ(θ) = E_{q_θ}[w_θ²(X)(∇A(θ) − T(X))].   (2.8)

The proof is straightforward, since q_θ is from an exponential family and A(θ) is differentiable.

Remark 6
Note that Eqs. (2.6) and (2.8) together imply that

∇R(θ) = E_{q_θ}[W_θ²(X)(∇A(θ) − T(X))].   (2.9)

We also note (see Remark 5) that

∇R(θ) = Z_π² ∇ρ(θ).   (2.10)

In the following sections, we assume that ρ(θ) is a convex function; thus, Lemma 1 constitutes an important motivation for our approach. We leave general proposals, which lead to a nonconvex ρ(θ), for future work.
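The identity (2.8) can be checked by Monte Carlo. The sketch below assumes the hypothetical Gaussian natural family q_θ = N(θ, 1), for which T(x) = x and ∇A(θ) = θ, together with the target π = N(m, 1); for this pair the closed form ∇ρ(θ) = 2(θ − m) exp((θ − m)²) is available for comparison.

```python
import numpy as np

rng = np.random.default_rng(3)

# Check of the gradient identity (2.8) for the hypothetical Gaussian natural
# family q_theta = N(theta, 1), with T(x) = x and grad A(theta) = theta, and
# target pi = N(m, 1).  Closed form: grad rho = 2 (theta - m) e^{(theta-m)^2}.
m, theta, N = 1.0, 0.3, 2_000_000
x = theta + rng.standard_normal(N)                        # x ~ q_theta
w = np.exp(-0.5 * (x - m) ** 2 + 0.5 * (x - theta) ** 2)  # w_theta(x)
grad_est = np.mean(w ** 2 * (theta - x))                  # E_q[w^2 (grad A - T)]
grad_true = 2.0 * (theta - m) * np.exp((theta - m) ** 2)
```

The Monte Carlo average matches the closed form up to sampling error, confirming that the gradient is indeed an expectation under q_θ and can therefore be estimated with the same samples used for importance sampling.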
In the remainder of the paper, we consider three scenarios: (i) the exact gradient of ρ(θ) can be computed, (ii) an unbiased estimate of the gradient of ρ(θ) can be obtained, and (iii) an unbiased estimate of the gradient of R(θ) can be obtained.
Scenario (i) is unrealistic in practice but gives us a guideline to further develop the idea. In particular, the error bounds for the more complicated cases follow the same structure as this case; therefore, the results obtained in case (i) provide a good qualitative understanding of the results introduced later. Scenario (ii) can be realized in cases where it is possible to evaluate π(x), in which case the IS leads to unbiased estimators. Scenario (iii) is what a practitioner would most often encounter: the target can only be evaluated up to the normalizing constant, i.e. Π(x) can be evaluated but π(x) cannot. We finally remark that, for the cases where we assume a stochastic gradient can be obtained for ρ and R (namely, cases (ii) and (iii), respectively), we consider two possible algorithms to perform the adaptation. The first method is a vanilla SGD algorithm (Bottou et al. 2016) and the second is an SGD scheme with iterate averaging (Schmidt et al. 2017). While vanilla SGD is easier to implement and algorithmically related to population-based Monte Carlo methods, iterate-averaged SGD results in a better theoretical bound and has some desirable variance reduction properties.

Exact gradient OAIS
We first introduce the OAIS scheme where we assume that the exact gradients of ρ(θ) are available. Since ρ is defined as an expectation (an integral), this assumption is unrealistic. However, the results we can prove for this procedure shed light onto the results that will be proved for practical scenarios in the following sections.
In particular, in this scheme, given θ_{t−1}, we specify T_t as

θ_t = Proj_Θ(θ_{t−1} − γ ∇ρ(θ_{t−1})),   (3.1)

where γ > 0 is the step-size parameter of the map and Proj_Θ denotes the projection onto the compact parameter space Θ. This is a classical projected gradient descent scheme on ρ(θ). In Sect. 4.1, we provide non-asymptotic results for this scheme. However, as we have noted, this idea does not lead to a practical scheme and cannot be used in most cases in practice, as the gradients of ρ in exact form are rarely available.

Remark 7
We use a projection operator in Eq. (3.1) because we assume throughout the analysis in Sect. 4 that the parameter space Θ is compact.

Stochastic gradient OAIS
Although it has a nice and simple form, exact-gradient OAIS is often intractable as, in most practical cases, the gradient can only be estimated. In this section, we first look at the case where π(x) can be evaluated, which means that an unbiased estimate of ∇ρ(θ) can be obtained. Then, we consider the general case, where one can only evaluate Π(x) and can obtain an unbiased estimate of ∇ R(θ ).
In the following subsections, we consider an algorithm where the gradient is estimated using samples which can also be used to construct importance sampling estimators. The procedure is outlined in Algorithm 3 for the case in which only Π(x) can be evaluated and ∇ R(θ ) is estimated.

Normalised case
If we assume that the density π(x) can be evaluated exactly, then the algorithm can be described as follows. Given (θ_k)_{1≤k≤t−1}, at iteration t, we compute the next parameter iterate as

θ_t = Proj_Θ(θ_{t−1} − γ_t g_t),

where g_t is an unbiased estimator of ∇ρ(θ_{t−1}). We note that, due to the analytical form of ∇ρ (see Eq. (2.8)), the samples and weights generated at iteration t − 1, i.e. {x_{t−1}^{(i)}, w_{θ_{t−1}}(x_{t−1}^{(i)})}_{i=1}^N,
can be reused to estimate the gradient. This makes an algorithmic connection to the population Monte Carlo methods where previous samples and weights are used to adapt the proposal (Cappé et al. 2004).
Given the updated parameter θ_t, the algorithm first samples x_t^{(i)} ∼ q_{θ_t}, i = 1, …, N, from the updated proposal and then proceeds to construct the IS estimator as in (2.2). Namely,

(ϕ, π_{θ_t}^N) = (1/N) ∑_{i=1}^N w_{θ_t}(x_t^{(i)}) ϕ(x_t^{(i)}).
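A minimal sketch of this normalised scheme, assuming a hypothetical Gaussian pair π = N(m, 1) and q_θ = N(θ, 1) (so that w_θ is computable and the gradient estimate is unbiased) and a step-size sequence γ_t ∝ 1/√t, could read:

```python
import numpy as np

rng = np.random.default_rng(4)

# Sketch of stochastic gradient OAIS in the normalised case, for the
# hypothetical pair pi = N(m, 1), q_theta = N(theta, 1): w_theta = pi/q_theta
# is computable and g_t is an unbiased estimate of grad rho(theta_{t-1}).
m, N = 1.0, 1_000
theta = 0.5
proj = lambda th: float(np.clip(th, -10.0, 10.0))  # projection onto compact Theta

for t in range(1, 501):
    x = theta + rng.standard_normal(N)             # x^{(i)}_t ~ q_{theta_{t-1}}
    w = np.exp(-0.5 * (x - m) ** 2 + 0.5 * (x - theta) ** 2)
    g = np.mean(w ** 2 * (theta - x))              # unbiased estimate of grad rho
    theta = proj(theta - 0.05 / np.sqrt(t) * g)    # projected SGD step

# After adaptation, run plain IS from q_theta as in (2.2):
x = theta + rng.standard_normal(N)
w = np.exp(-0.5 * (x - m) ** 2 + 0.5 * (x - theta) ** 2)
is_est = np.mean(w * (x <= m))                     # estimates P_pi(X <= m) = 0.5
```

The iterates drift towards the minimiser θ⋆ = m, at which point the weights become nearly constant and the IS estimator behaves like plain Monte Carlo from π.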

Self-normalised case
For the general case, where we can only evaluate Π(x), the algorithm proceeds similarly. Given (θ_k)_{1≤k≤t−1}, the method proceeds by first updating the parameter

θ_t = Proj_Θ(θ_{t−1} − γ_t g̃_t),

where g̃_t is an unbiased estimator of ∇R(θ_{t−1}). Given the updated parameter, we first sample x_t^{(i)} ∼ q_{θ_t}, i = 1, …, N, and then construct the SNIS estimate as in (2.3), i.e.

(ϕ, π_{θ_t}^N) = ∑_{i=1}^N w̄_{θ_t}^{(i)} ϕ(x_t^{(i)}).

Algorithm 2 Stochastic gradient OAIS
1: Choose a parametric proposal q_θ with initial parameter θ = θ_0.
2: for t ≥ 1 do
3:   Update the proposal parameter, θ_t = Proj_Θ(θ_{t−1} − γ_t g̃_t), where g̃_t is computed by approximating the expectation in Eq. (2.9) using the samples of iteration t − 1.
4:   Sample x_t^{(i)} ∼ q_{θ_t}, i = 1, …, N.
5:   Compute the weights.
6:   Report the point-mass probability measure π_{θ_t}^N and the estimator (ϕ, π_{θ_t}^N).
7: end for

Stochastic gradient OAIS with averaged iterates
Next, we describe a variant of the stochastic gradient OAIS that uses averages of the iterates generated by the SGD scheme (Schmidt et al. 2017) in order to compute the proposal densities, generate samples and compute weights. In Sect. 4, we show that the convergence rate for this method is better than the rate that can be guaranteed for Algorithm 2.

Normalised case
We assume first that the density π(x) can be evaluated. At the beginning of the t-th iteration, the algorithm has generated the sequence (θ_k)_{1≤k≤t−1}. First, in order to perform the adaptive importance sampling steps, we set

θ̄_t = (1/t) ∑_{k=0}^{t−1} θ_k,   (3.3)

and sample x̄_t^{(i)} ∼ q_{θ̄_t} for i = 1, …, N. Following the standard parametric AIS procedure (Algorithm 1), we obtain the estimate of (ϕ, π) as

(ϕ, π_{θ̄_t}^N) = (1/N) ∑_{i=1}^N w_{θ̄_t}(x̄_t^{(i)}) ϕ(x̄_t^{(i)}).

Next, we update the parameter vector using the projected stochastic gradient step

θ_t = Proj_Θ(θ_{t−1} − γ_t g_t),   (3.4)

where g_t is an unbiased estimate of ∇ρ(θ_{t−1}), i.e. E[g_t] = ∇ρ(θ_{t−1}), and Proj_Θ denotes the projection onto the set Θ. Note that, in order to estimate this gradient using (2.8), we sample x_t^{(i)} ∼ q_{θ_{t−1}} for i = 1, …, N, and estimate the expectation in (2.8). It is worth noting that the samples {x_t^{(i)}}_{i=1}^N used for the gradient estimate are different from the samples {x̄_t^{(i)}}_{i=1}^N used to construct the estimator.

Self-normalised case
In general, π(x) cannot be evaluated exactly; hence, a stochastic unbiased estimate of ∇ρ(θ) cannot be obtained. When the target can only be evaluated up to a normalisation constant, i.e. only Π(x) can be computed, we can use the SNIS procedure as explained in Sect. 2. Therefore, we introduce here the most general version of the stochastic method, coined stochastic gradient OAIS, which uses the averaged iterates in (3.3) to construct the proposal functions. The scheme is outlined in Algorithm 3. To run this algorithm, given the parameter vector θ̄_t in (3.3), we first generate a set of samples {x̄_t^{(i)}}_{i=1}^N from q_{θ̄_t}. Then, the integral estimate given by the SNIS can be written as

(ϕ, π_{θ̄_t}^N) = ∑_{i=1}^N w̄_{θ̄_t}^{(i)} ϕ(x̄_t^{(i)}).

Finally, for the adaptation step, we obtain the unbiased estimate g̃_t of the gradient ∇R(θ_{t−1}) and adapt the parameter as

θ_t = Proj_Θ(θ_{t−1} − γ_t g̃_t),

where g̃_t is an unbiased estimate of ∇R(θ_{t−1}), i.e. E[g̃_t] = ∇R(θ_{t−1}). Note that, as in the normalised case, this gradient is estimated by approximating the expectation in (2.9) using iid samples x_t^{(i)} ∼ q_{θ_{t−1}}, i = 1, …, N. These samples are different, again, from the set {x̄_t^{(i)}}_{i=1}^N used to compute the SNIS estimator.

Algorithm 3 Stochastic gradient OAIS with averaged iterates
1: Choose a parametric proposal q_θ with initial parameter θ = θ_0.
2: for t ≥ 1 do
3:   Compute the average parameter vector θ̄_t as in (3.3).
4:   Sample x̄_t^{(i)} ∼ q_{θ̄_t}, i = 1, …, N.
5:   Compute the weights.
6:   Report the point-mass probability measure π_{θ̄_t}^N and the estimator (ϕ, π_{θ̄_t}^N).
7:   Update the parameter vector, θ_t = Proj_Θ(θ_{t−1} − γ_t g̃_t), where g̃_t is an estimate of ∇R(θ_{t−1}) computed by approximating the expectation in Eq. (2.9) using a set of iid samples x_t^{(i)} ∼ q_{θ_{t−1}}, i = 1, …, N.
8: end for
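A compact sketch of Algorithm 3, again under a hypothetical Gaussian family and assuming only the unnormalised target Π = Z_π N(m, 1) is available (the values of m, Z_π, the step sizes and the iteration budget are arbitrary illustration choices), could read:

```python
import numpy as np

rng = np.random.default_rng(5)

# Sketch of stochastic gradient OAIS with averaged iterates (Algorithm 3):
# only the unnormalised target Pi = Z * N(m, 1) is evaluated, the SGD runs
# on R(theta) = Z^2 rho(theta), and the SNIS estimator uses theta_bar.
m, Z, N = 1.0, 4.0, 1_000
log_Pi = lambda x: np.log(Z) - 0.5 * (x - m) ** 2
log_q = lambda x, th: -0.5 * (x - th) ** 2
theta, theta_sum = 0.5, 0.0                        # theta_0 and running sum

for t in range(1, 301):
    theta_bar = (theta_sum + theta) / t            # averaged iterate (3.3)
    xb = theta_bar + rng.standard_normal(N)        # samples for the estimator
    Wb = np.exp(log_Pi(xb) - log_q(xb, theta_bar))
    snis_est = np.sum(Wb * xb) / np.sum(Wb)        # SNIS estimate of E_pi[X]
    # Adaptation: fresh samples from q_{theta_{t-1}} estimate grad R via (2.9).
    x = theta + rng.standard_normal(N)
    W = np.exp(log_Pi(x) - log_q(x, theta))
    g = np.mean(W ** 2 * (theta - x))              # unbiased estimate of grad R
    theta_sum += theta
    theta = float(np.clip(theta - 0.01 / np.sqrt(t) * g, -10.0, 10.0))
```

Note how the samples used for the SNIS estimate (drawn from q_{θ̄_t}) are kept separate from the fresh samples used for the gradient estimate, mirroring the structure of the algorithm.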

Remark 8 In Algorithm 3, the samples {x̄_t^{(i)}}_{i=1}^N are not used to compute the gradient estimator g̃_t which, in turn, is needed to generate the next iterate θ_t. Therefore, if we can afford to generate T iterates, θ_0, …, θ_{T−1}, with T known beforehand, and we are only interested in the estimator (ϕ, π_{θ̄_T}^N) obtained at the last iteration (once the proposal function has been optimized), then it is possible to skip steps 3-6 in Algorithm 3 up to time T − 1. Only at time t = T would we compute the average parameter vector θ̄_T, sample x̄_T^{(i)} ∼ q_{θ̄_T}, i = 1, …, N, and construct the estimator.

Analysis
Theorem 1 yields an intuitive result about the performance of IS methods in terms of the divergence between the target π and the proposal q θ . We now apply ideas from convex optimisation theory in order to minimize ρ(θ) and obtain finite-time, finite-sample convergence rates for the AIS procedures outlined in Sect. 3.

Convergence rate with exact gradients
Let us first assume that we can compute the gradient of ρ(θ) exactly. In particular, we consider the update rule in Eq. (3.1). For the sake of the analysis, we impose some regularity assumptions on ρ(θ).
Assumption 1 Let ρ(θ) be a convex function with Lipschitz derivatives on the compact space Θ. To be specific, ρ is convex and differentiable, and there exists a constant L < ∞ such that

‖∇ρ(θ) − ∇ρ(θ′)‖ ≤ L ‖θ − θ′‖

for any θ, θ′ ∈ Θ.
Remark 9 Assumption 1 holds when the density q θ (x) belongs to an exponential family (see Sect. 2.3) and Θ is compact (Ryu and Boyd 2014), even if it may not hold in general for θ ∈ R d θ .
Lemma 3 If Assumption 1 holds and we set a step-size γ ≤ 1/L, then the inequality

ρ(θ_t) − ρ(θ⋆) ≤ ‖θ_0 − θ⋆‖² / (2 γ t)   (4.1)

is satisfied for the sequence (θ_t)_{t≥0} generated by the recursion (3.1), where θ⋆ is a minimum of ρ.
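The bound (4.1) is straightforward to verify numerically. The following sketch runs gradient descent on a convex quadratic surrogate standing in for ρ; the surrogate is not an IS quantity, and the Hessian, step size and initial point are arbitrary illustration choices.

```python
import numpy as np

# Numerical check of the bound in Lemma 3 on a convex quadratic surrogate
# f(theta) = 0.5 * theta^T H theta standing in for rho: f is L-smooth with
# L = lambda_max(H), the minimiser is theta_star = 0 and f(theta_star) = 0,
# so (4.1) reads f(theta_t) <= ||theta_0||^2 / (2 gamma t).
H = np.diag([1.0, 10.0])
L = 10.0                                    # Lipschitz constant of the gradient
f = lambda th: 0.5 * th @ H @ th
theta0 = np.array([3.0, -2.0])
gamma = 1.0 / L

theta = theta0.copy()
ok = True
for t in range(1, 101):
    theta = theta - gamma * (H @ theta)     # gradient step as in (3.1); the
    # iterates stay bounded here, so no explicit projection is needed
    bound = theta0 @ theta0 / (2 * gamma * t)
    ok = ok and bool(f(theta) <= bound + 1e-12)
```

The flag `ok` stays true over the whole run, i.e. the suboptimality never exceeds the O(1/t) envelope of (4.1).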
The rate in (4.1) is one of the most fundamental results in convex optimisation. Lemma 3 enables us to prove the following result for the MSE of the AIS estimator adapted via the exact gradient recursion (3.1).
Theorem 4 If Assumption 1 holds and γ ≤ 1/L, then, for the sequence (θ_t)_{t≥0} generated by recursion (3.1),

E[((ϕ, π_{θ_t}^N) − (ϕ, π))²] ≤ c_ϕ ρ⋆ / N + c_ϕ ‖θ_0 − θ⋆‖² / (2 γ t N),   (4.2)

where ρ⋆ = ρ(θ⋆) = min_{θ∈Θ} ρ(θ).

Proof See Appendix A.3.

Remark 10
Theorem 4 sheds light onto several facts. We first note that ρ⋆ in the error bound (4.2) can be interpreted as an indicator of the quality of the parametric proposal. We recall that ρ⋆ = 1 when both π and q_θ belong to the same exponential family. For this special case, Theorem 4 implies that

lim_{t→∞} E[((ϕ, π_{θ_t}^N) − (ϕ, π))²]^{1/2} ≤ √c_ϕ / √N.

In other words, when the target and the proposal are both from the same exponential family, this adaptation strategy leads to an asymptotically optimal Monte Carlo estimator (optimal meaning that we attain the same rate as a Monte Carlo estimator with N iid samples from π). On the other hand, when π and q_θ do not belong to the same family, we obtain

lim_{t→∞} E[((ϕ, π_{θ_t}^N) − (ϕ, π))²]^{1/2} ≤ √(c_ϕ ρ⋆) / √N,

i.e. the L_2 rate is again asymptotically optimal, but the constant in the error bound is worse (bigger) by a factor of √ρ⋆ > 1.
This bound shows that as t → ∞, what we are left with is essentially the minimum attainable IS error for a given parametric family {q_θ}_{θ∈Θ}. Intuitively, when the proposal q_θ is from a different parametric family than π, the gradient OAIS optimises the error bound in order to obtain the best possible proposal. In particular, the MSE has two components: first, an O(1/(tN)) component which can be made to vanish over time by improving the proposal, and second, an O(1/N) component which is related to ρ⋆. The quantity ρ⋆ is related to the minimum χ²-divergence between the target and the proposal family. This means that the discrepancy between the target and the optimal proposal (according to the χ²-divergence) can only be tackled by increasing N. This intuition carries over to the schemes we analyse in the next sections, although the rate with respect to the number of iterations necessarily worsens because of the uncertainty in the gradient estimators.

Remark 11
When γ = 1/L, Theorem 4 implies that if t = O(L/ρ⋆) and N = O(ρ⋆/ε), for some ε > 0, then the MSE in (4.2) is O(ε). We remark that once we choose the number of samples N = O(ρ⋆/ε), the number of iterations t needed for adaptation is independent of N and ε.

Remark 12
One can use different maps T_t for the optimisation. For example, one can use Nesterov's accelerated gradient descent (which has more parameters than just a step size), in which case one could prove (by a similar argument) the inequality (Nesterov 2013)

ρ(θ_t) − ρ(θ⋆) ≤ 2 L ‖θ_0 − θ⋆‖² / (t + 1)²,

which is an improved convergence rate, going from O(1/t) to O(1/t²) in the first term of the bound.
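A sketch of this accelerated variant, again on a convex quadratic surrogate f standing in for ρ (not an IS quantity; all constants are arbitrary illustration choices), with the O(L/t²) guarantee checked up to a slack factor:

```python
import numpy as np

# Nesterov's accelerated gradient method on a convex quadratic surrogate
# f(theta) = 0.5 * theta^T H theta (L-smooth, minimiser theta_star = 0).
# We check that the suboptimality stays within an O(L / t^2) envelope,
# here 4 L ||theta_0||^2 / (t+1)^2 (the standard 2L guarantee with slack).
H = np.diag([1.0, 10.0])
L = 10.0
f = lambda th: 0.5 * th @ H @ th
theta0 = np.array([3.0, -2.0])

theta, theta_prev = theta0.copy(), theta0.copy()
ok = True
for t in range(1, 101):
    y = theta + (t - 1) / (t + 2) * (theta - theta_prev)  # momentum extrapolation
    theta_prev = theta
    theta = y - (1.0 / L) * (H @ y)                       # gradient step at y
    ok = ok and bool(f(theta) <= 4 * L * (theta0 @ theta0) / (t + 1) ** 2 + 1e-12)
```

Compared with the plain gradient recursion, the momentum term yields a visibly faster decay of the suboptimality, consistent with the O(1/t²) rate quoted above.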

Convergence rate with averaged SGD iterates
While, for the purpose of analysis, it is convenient to assume that the minimization of ρ(θ) can be done deterministically, this is rarely the case in practice. The 'best' realistic case is that we can obtain an unbiased estimator of the gradient. Hence, we address this scenario, under the assumption that the actual gradient functions ∇ρ and ∇ R are bounded in Θ (i.e. ρ(θ) is Lipschitz in Θ).

Assumption 2
The gradient functions ∇ρ(θ) and ∇R(θ) are bounded in Θ. To be specific, there exist finite constants G_ρ and G_R such that

sup_{θ∈Θ} ‖∇ρ(θ)‖ ≤ G_ρ  and  sup_{θ∈Θ} ‖∇R(θ)‖ ≤ G_R.

We note that this is a mild assumption in the case of interest in this paper, where Θ ⊂ R^{d_θ} is assumed to be compact.

Normalised target
First, we assume that we can evaluate π(x), which means that at iteration t, we can obtain an unbiased estimate of ∇ρ(θ t−1 ), denoted g t . We use the optimisation algorithms called stochastic gradient methods, which use stochastic and unbiased estimates of the gradients to optimise a given cost function (Robbins and Monro 1951). Particularly, we focus on optimised samplers using stochastic gradient descent (SGD) as an adaptation strategy.
We start proving that the stochastic gradient estimates (g t ) t≥0 have a finite mean-squared error (MSE) w.r.t. the true gradients. To prove this result, we need an additional regularity condition.
Remark 13 Let us rewrite D_π^j in Assumption 3 in terms of the weight function. When q_θ(x) belongs to the exponential family, we readily obtain

∂w_θ(x)/∂θ_j = w_θ(x)(∂_j A(θ) − T_j(x)),

where T_j(X) is the j-th sufficient statistic for q_θ(x). Let us construct a bounding function for the weights of the form K(θ) := sup_{x∈X} w_θ(x). If we choose the compact space Θ in such a way that K(θ) is bounded, then the relevant moments can be bounded using the fact that ∂_j A(θ) = E_{q_θ}[T_j(X)]. Therefore, if the weights remain bounded in Θ, a sufficient condition for Assumption 3 to hold is that the sufficient statistics of the proposal distribution all have finite variances, i.e. max_{i∈{1,…,d_θ}} var_{q_θ}(T_i(X)) < ∞. There are alternative conditions that, when satisfied, lead to Assumption 3 holding true. As an example, in Appendix A.4, we provide an alternative sufficient condition in terms of the function ρ₂(θ) := E[w_θ⁴(X)].
Now, we show that when g t is an iid Monte Carlo estimator of ∇ρ, we have the following finite-sample bound for the MSE.

Lemma 4 If Assumption 3 holds, the following inequality holds, where d_θ is the parameter dimension and c_ρ, D^π < ∞ are constants independent of N.
Proof See Appendix A.5.
In order to obtain convergence rates for the estimator (ϕ, π^N_{θ_t}), we first recall a classical result for SGD [see, e.g. Bubeck et al. (2015)].

Lemma 5 Let Assumptions 2 and 3 hold, apply recursion (3.4) and let (g_t)_{t≥0} be the stochastic gradient estimates in Lemma 4. If we choose the step-size sequence γ_k = β/√k for 1 ≤ k ≤ t, with β > 0, the expected suboptimality of the averaged iterates decays at rate O(1/√t).
Proof See Appendix A.6 for a self-contained proof.
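For reference, the classical guarantee for projected SGD with uniform iterate averaging has, in standard references such as Bubeck et al. (2015), the schematic form below; here D denotes the diameter of the compact set Θ and G a bound on the second moment of the stochastic gradients, and the constants are illustrative rather than the paper's exact ones.

```latex
\[
  \mathbb{E}\bigl[\rho(\bar\theta_t)\bigr] - \rho(\theta^\star)
  \;\le\; \frac{1}{\sqrt{t}}\left(\frac{D^2}{2\beta} + \beta G^2\right),
  \qquad
  \gamma_k = \frac{\beta}{\sqrt{k}},
  \qquad
  \bar\theta_t = \frac{1}{t}\sum_{k=0}^{t-1}\theta_k .
\]
```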
We can now state the first core result of the paper: the convergence rate for the AIS algorithm using an SGD adaptation of the parameter vectors θ_t.
Proof See Appendix A.7.

Remark 14
Note that the expectation on the left-hand side of (4.4) is taken w.r.t. the distribution of the measure-valued random variable π^N_{θ_t}. Theorem 5 can be interpreted similarly to Theorem 4. One can see that the overall rate of the MSE bound is O(1/(√t·N) + 1/N). This means that, as t → ∞, we are only left with a rate that is optimal for AIS with a given parametric proposal family. In particular, ρ^⋆ is again related to the minimal χ²-divergence between the target π and the parametric proposal family q_θ. When the proposal and the target are from the same family, we are back to the case ρ^⋆ = 1, and the adaptation leads to the optimal Monte Carlo rate O(1/√N) for the L_2 error in this setting as well.
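Schematically, results of this type combine the importance-sampling MSE bound with the optimisation rate; assuming the bound factorises as below (a sketch consistent with the stated constants, not the exact display (4.4)):

```latex
\[
  \mathbb{E}\Bigl[\bigl((\varphi,\pi^N_{\theta_t}) - (\varphi,\pi)\bigr)^2\Bigr]
  \;\lesssim\; \frac{c_\varphi}{N}\Bigl(\rho^\star + \frac{C}{\sqrt{t}}\Bigr)
  \;=\; \mathcal{O}\!\left(\frac{1}{\sqrt{t}\,N} + \frac{1}{N}\right),
  \qquad c_\varphi = 4\,\Vert\varphi\Vert_\infty^2 .
\]
```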

Self-normalised estimators
We have noted that it is possible to obtain an unbiased estimate of ∇ρ(θ) when the normalised target π(x) can be evaluated. However, if we can only evaluate the unnormalised density Π(x) instead of π(x) and use the self-normalised IS estimator, the estimate of ∇ρ(θ) is no longer unbiased. We refer to Sec. 5 of Tadić and Doucet (2017) for stochastic optimisation with biased gradients for adaptive Monte Carlo, where the discussion revolves around minimising the Kullback–Leibler divergence rather than the χ²-divergence. The results presented in Tadić and Doucet (2017), however, are asymptotic, whereas here we are interested in finite-time bounds. Due to the structure of the AIS scheme, it is possible to avoid working with biased gradient estimators. In particular, we can implement the proposed AIS schemes using unbiased estimators of ∇R(θ) instead of biased estimators of ∇ρ(θ). Since optimising the unnormalised function R(θ) leads to the same minima as optimising the normalised function ρ(θ), we can simply use ∇R(θ) for the adaptation in the self-normalised case.
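In the self-normalised setting, the same gradient construction goes through with unnormalised weights W_θ = Π/q_θ; the sketch below reuses the illustrative one-dimensional Gaussian setup (Π = Z_π · N(2, 1), with Z_π = 0.5 an arbitrary, notionally unknown constant) and is an assumption-laden example, not the paper's implementation.

```python
import math
import random

random.seed(1)

Z_PI = 0.5  # arbitrary normalising constant, treated as unknown in practice

def log_unnorm_target(x):
    """Unnormalised target Pi(x) = Z_PI * N(2, 1) density."""
    return math.log(Z_PI) - 0.5 * (x - 2.0) ** 2 - 0.5 * math.log(2 * math.pi)

def log_proposal(x, theta):
    """Proposal q_theta = N(theta, 1)."""
    return -0.5 * (x - theta) ** 2 - 0.5 * math.log(2 * math.pi)

def grad_R_estimate(theta, n):
    """Unbiased estimate of grad R(theta), with R(theta) = E_q[W_theta^2] and
    unnormalised weights W_theta = Pi / q_theta; no knowledge of Z_pi is
    required beyond whatever constant Pi already carries."""
    g = 0.0
    for _ in range(n):
        x = random.gauss(theta, 1.0)
        W = math.exp(log_unnorm_target(x) - log_proposal(x, theta))
        g += -(W ** 2) * (x - theta)   # d/dtheta log q_theta(x) = x - theta
    return g / n

# Estimated gradient at theta = 1; the true value here is
# Z_PI^2 * grad rho(1) = 0.25 * (-2e) = -e/2, approximately -1.36.
g = grad_R_estimate(1.0, 5000)
print(g)
```

Since R(θ) = Z_π² ρ(θ), the estimated gradient points in the same direction as ∇ρ(θ), which is exactly why the unnormalised objective can replace the normalised one for adaptation.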
Similarly to the argument in Sect. 4.2.1, we first state the assumption below, which is the obvious counterpart of Assumption 3.

Assumption 4
The unnormalised target Π(x) and the proposal densities q_θ(x) satisfy the corresponding inequalities. Remark 13 applies directly to Assumption 4 as long as Z_π < ∞. Next, we prove an MSE bound for the stochastic gradients (g_t)_{t≥0} employed in recursion (3.5), i.e. the unbiased stochastic estimates of ∇R(θ).

Lemma 6 If Assumptions 2 and 4 hold, the following inequality holds.
Proof The proof is identical to the proof of Lemma 4.
We can now obtain explicit rates for the convergence of the unnormalised function R, evaluated at the averaged iterates θ̄_t.
Lemma 7 If Assumptions 2 and 4 hold and the sequence (θ_t)_{t≥1} is computed via recursion (3.5), with step-sizes γ_k = β/√k for 1 ≤ k ≤ t and β > 0, we obtain inequality (4.5), where the constants are finite and independent of t and N. Relationship (4.5) implies (4.6). Proof The proof of the rate in (4.5) is identical to the proof of Lemma 5. The rate in (4.6) follows by observing that ρ(θ) = R(θ)/Z_π² for every θ ∈ Θ.
Finally, using Lemma 7, we can state our main result: an explicit error rate for the MSE of Algorithm 3 as a function of the number of iterations t and the number of samples N .
Theorem 6 Let Assumptions 2 and 4 hold and let the sequence (θ_t)_{t≥1} be computed via recursion (3.5), with step-sizes γ_k = β/√k for 1 ≤ k ≤ t and β > 0. Then we have the following inequality for the sequence of proposal distributions (q_{θ̄_t})_{t≥1}, where the constants and c_ϕ = 4‖ϕ‖_∞² are finite and independent of t and N.
Proof The proof follows from Lemma 7, mimicking the exact same steps as in the proof of Theorem 5.
Remark 15 Theorem 6, as in Remark 10, provides relevant insights into the performance of the stochastic gradient OAIS algorithm. In particular, for a general target π, we obtain a bound showing that the L_2 error is asymptotically optimal. As in previous cases, if the target π is in the exponential family, then the asymptotic convergence rate is O(1/√N) as t → ∞.
Remark 16 Theorem 6 also yields a practical heuristic for tuning the step-size and the number of particles together. Assume that 0 < β < 1 and let N = 1/β (which we assume to be an integer without loss of generality). In this case, the rate in (4.7) simplifies so that the error can be controlled through the step-size of the optimisation scheme, provided that the other parameters of the algorithm are chosen accordingly. The same argument also holds for Theorem 5.
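To see the heuristic concretely, suppose the MSE bound takes the schematic form (ρ^⋆ + C/√t)/N up to constants; with N = 1/β, the Monte Carlo and optimisation errors are then driven by the single parameter β. This is a sketch under that assumed form, not display (4.7) itself.

```latex
\[
  \mathrm{MSE}
  \;\lesssim\; \frac{1}{N}\Bigl(\rho^\star + \frac{C}{\sqrt{t}}\Bigr)
  \;\overset{N = 1/\beta}{=}\;
  \beta\,\rho^\star + \frac{\beta\,C}{\sqrt{t}} ,
\]
% so shrinking the step-size beta (while enlarging N = 1/beta accordingly)
% tightens both contributions to the bound at once.
```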

Remark 17
It is not straightforward to compare the rate in inequality (4.7) (for the unnormalised target Π(x)) with that in inequality (4.4) (for the normalised target π(x)). Even if (4.7) may "look better" by a constant factor than the rate in (4.4), this is usually not the case. Indeed, the variance of the errors in the unnormalised gradient estimators is typically higher, and this is reflected in the variance of the moment estimators. Another way to look at this issue is to note that, very often, Z_π ≪ 1, which makes the bound in (4.7) much larger than the bound in (4.4).
Finally, we can adapt Theorem 2 to our case, providing a convergence rate of the bias of the importance sampler given by Algorithm 3.
Theorem 7 Under the setting of Theorem 6, we have inequality (4.8), where C_1, C_2, C_3, C_4 are the finite constants given in Theorem 6, independent of t and N.
Proof The proof follows from Theorem 2, mimicking the proof technique used to prove Theorem 6.

Convergence rate with vanilla SGD
The arguments of Section 4.2 carry over to the analysis of Algorithm 2, where the proposal functions q_{θ_t}(x) are constructed using the iterates θ_t rather than the averages θ̄_t. Unfortunately, achieving the optimal O(1/√t) rate for vanilla SGD is difficult in general. The best available rate without significant restrictions on the step-size is given by Shamir and Zhang (2013). In particular, we can adapt (Shamir and Zhang 2013, Theorem 2) to our setting in order to state the following lemma.
The obtained rate is, in general, O(log t/√t). This is known to be suboptimal, and it can be improved to the information-theoretically optimal O(1/√t) rate by choosing a specific step-size schedule, see, e.g. Jain et al. (2019). However, in this case the schedule of (γ_t)_{t≥1} depends directly on the total number of iterates to be generated, in such a way that the O(1/√t) error is guaranteed only for the last iterate, at the final time t.
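The gap between the two rates is easy to quantify numerically; the short check below (an illustration, not part of the paper's analysis) shows that the ratio between the last-iterate rate log t/√t and the optimal 1/√t is exactly log t, which grows without bound.

```python
import math

# Compare the suboptimal last-iterate rate log(t)/sqrt(t) for vanilla SGD
# (Shamir and Zhang 2013) with the optimal 1/sqrt(t) rate available for
# averaged iterates or specifically tuned step-size schedules.
for t in (10**2, 10**4, 10**6):
    vanilla = math.log(t) / math.sqrt(t)
    optimal = 1.0 / math.sqrt(t)
    print(t, round(vanilla / optimal, 1))   # ratio equals log(t): 4.6, 9.2, 13.8
```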
We can extend Lemma 8 to obtain the following result.
Theorem 8 Apply recursion (3.5) for the computation of the iterates (θ_t)_{t≥1}, choose the step-sizes γ_k = β/√k for 1 ≤ k ≤ t, where β > 0, and let Assumptions 2 and 4 hold. If we construct the sequence of proposal distributions (q_{θ_t})_{t≥1} from these iterates, we obtain the following MSE bounds, where the constants and c_ϕ = 4‖ϕ‖_∞² are finite and independent of t and N.
Proof The proof follows from Lemma 8 with the exact same steps as in the proof of Theorem 5.
Finally, it is also straightforward to adapt the bias result in Theorem 7 to this case, which results in a similar bound. We omit it for space reasons, as it has the same form as the bound in Theorem 7 with an extra log t factor.

Conclusions
We have presented and analysed optimised parametric adaptive importance samplers and provided non-asymptotic convergence bounds for their MSE. Our results display the precise interplay between the number of iterations t and the number of samples N. In particular, we have shown that the optimised samplers converge to an optimal proposal as t → ∞, leading to an asymptotic L_2 rate of O(√(ρ^⋆/N)). This intuitively shows that the number of samples N should be set in proportion to the minimum χ²-divergence between the target and the exponential-family proposal, as we have shown that the adaptation (in the sense of minimising the χ²-divergence or, equivalently, the variance of the weight function) cannot improve the error rate beyond O(√(ρ^⋆/N)). The error in this regime may be dominated by how close the target is to the exponential family. Note that the algorithms we have analysed require a constant computational load at each iteration; the load does not grow with t, since we do not reuse samples from past iterations. Such schemes, however, can also be considered and analysed in the same manner. More specifically, in the present setup, the computational cost of each iteration depends on the cost of evaluating Π(x).
Our work opens up several other paths for research. One direction is to analyse the methods with more advanced optimisation algorithms. Another challenging direction is to consider more general proposals than the natural exponential family, which may lead to non-convex adaptation problems. Analysing and providing guarantees for this general case would provide foundational insights for general adaptive importance sampling procedures. Also, as shown by Ryu (2016), similar theorems can be proved for α-divergences.
Another related piece of work arises from variational inference (Wainwright and Jordan 2008). In particular, Dieng et al. (2017) have recently considered performing variational inference by minimising the χ 2 -divergence, which is close to the setting in this paper. In particular, the variational approximation of the target distribution in the variational setting coincides with the proposal distribution we consider within the importance sampling context in this paper. This also implies that our results may be used to obtain finite-time guarantees for the expectations estimated using the variational approximations of target distributions.
Finally, the adaptation procedure can be modified to handle the non-convex case as well. In particular, the SGD step can be converted into a stochastic gradient Langevin dynamics (SGLD) step. The SGLD method can be used as a global optimiser when ρ and R are non-convex, and a global convergence rate can be obtained using standard SGLD results, see, e.g. Raginsky et al. (2017); Zhang et al. (2019). Global convergence results for other adaptation schemes, such as stochastic gradient Hamiltonian Monte Carlo (SGHMC), can also be achieved using results from the non-convex optimisation literature, see, e.g. Akyildiz and Sabanis (2020).
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

A.1 Proof of Theorem 1
We first note the following inequalities. Taking squares of both sides and applying the inequality (a + b)² ≤ 2(a² + b²) to further bound the right-hand side, we then take the expectation of both sides. Note that both terms on the right-hand side are perfect Monte Carlo estimates of the corresponding integrals. Bounding the MSE of these estimates yields the desired inequality, which we can then write explicitly. It remains to show the relation of the bound to the χ²-divergence. Note that ρ is not exactly the χ²-divergence, which is given by ρ − 1. Plugging everything into our bound yields the result.
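A schematic version of this final chain of bounds, consistent with the constant c_ϕ = 4‖ϕ‖_∞² used in the main text (the intermediate displays are reconstructions, not the paper's own):

```latex
\[
  \mathbb{E}\Bigl[\bigl((\varphi,\pi^N_\theta)-(\varphi,\pi)\bigr)^2\Bigr]
  \;\le\;
  \frac{2\,\mathbb{E}_{q_\theta}\!\bigl[\varphi^2(X)\,w_\theta^2(X)\bigr]}{N}
  + \frac{2\,\Vert\varphi\Vert_\infty^2\,\mathbb{E}_{q_\theta}\!\bigl[w_\theta^2(X)\bigr]}{N}
  \;\le\;
  \frac{4\,\Vert\varphi\Vert_\infty^2\,\rho(\theta)}{N},
\]
% where rho(theta) = E_{q_theta}[w_theta^2(X)] = 1 + chi^2(pi || q_theta),
% using |varphi| <= ||varphi||_inf in both Monte Carlo error terms.
```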

A.2 Proof of Lemma 1
We adapt this proof from Ryu and Boyd (2014), following the same steps. We first show that A(θ) is convex by showing that exp(A(θ)) is convex. Choose 0 < η < 1; applying Hölder's inequality and taking the log of both sides shows the convexity of A(θ). Note that A(θ) − θ⊤T(x) is convex in θ, since it is the sum of a convex function and a linear function of θ. Since exp is an increasing convex function and the composition of an increasing convex function with a convex function is convex, M(θ, x) := exp(A(θ) − θ⊤T(x)) is convex in θ. Finally, we prove that ρ(θ) is convex: writing it in terms of M(θ, x) and applying the resulting sequence of inequalities concludes the claim.
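The Hölder step invoked here has, in the standard exponential-family argument, the following form; this is a reconstruction under the parameterisation q_θ(x) = h(x) exp(θ⊤T(x) − A(θ)).

```latex
\[
  e^{A(\eta\theta_1 + (1-\eta)\theta_2)}
  = \int h(x)\, e^{\eta\,\theta_1^\top T(x)}\, e^{(1-\eta)\,\theta_2^\top T(x)}\,\mathrm{d}x
  \;\le\;
  \Bigl(\int h(x)\, e^{\theta_1^\top T(x)}\,\mathrm{d}x\Bigr)^{\!\eta}
  \Bigl(\int h(x)\, e^{\theta_2^\top T(x)}\,\mathrm{d}x\Bigr)^{\!1-\eta}
  = e^{\eta A(\theta_1) + (1-\eta) A(\theta_2)},
\]
% Hölder's inequality with exponents p = 1/eta and q = 1/(1-eta), after
% splitting h = h^eta h^{1-eta}; taking logarithms gives the convexity of A.
```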

A.3 Proof of Theorem 4
First, note that using Theorem 3 we have the following bound, where the last inequality follows from Lemma 3.
Proposition 1 Let ρ_2 be Lipschitz with Lipschitz derivatives, let q_θ(x) belong to the exponential family and let Θ be compact. If the sufficient statistics T(X) of the distribution q_θ all have finite variances, i.e. max_{i=1,...,d_θ} Var(T_i(X)) < ∞, then Assumption 3 holds.
Proof Using the fact that ∂q_θ(x)/∂θ_i = q_θ(x) ∂log q_θ(x)/∂θ_i, one can readily compute the second-order derivatives of ρ_2(θ). The expectation is then a standard Monte Carlo error for the corresponding test function. Assumption 3, together with Lemma A.1 in Crisan and Míguez (2014), yields the claimed bound, and we arrive at the result, where θ̄_t = (1/t) Σ_{k=0}^{t−1} θ_k.