Functional Central Limit Theorem and Strong Law of Large Numbers for Stochastic Gradient Langevin Dynamics

We study the mixing properties of an important optimization algorithm of machine learning: the stochastic gradient Langevin dynamics (SGLD) with a fixed step size. The data stream is not assumed to be independent hence the SGLD is not a Markov chain, merely a Markov chain in a random environment, which complicates the mathematical treatment considerably. We derive a strong law of large numbers and a functional central limit theorem for SGLD.


Introduction
We consider a recursive stochastic scheme called "stochastic gradient Langevin dynamics" (SGLD), first suggested by Welling and Teh [17]. Let λ > 0 be the step size, let the measurable function H : R^d × R^m → R^d be the updating function, and define the R^d-valued stochastic process θ_n, n ≥ 1 recursively by

θ_{n+1} = θ_n − λ H(θ_n, Y_{n+1}) + √(2λ) ξ_{n+1}, n ∈ N. (1)

Here ξ_n, n ≥ 1 is an independent sequence of standard d-dimensional Gaussian random variables and Y_n, n ∈ Z is an R^m-valued strict-sense stationary process, independent of (ξ_n)_{n∈N}, which represents the data stream fed into this procedure. Furthermore, we assume (for simplicity) that the initial value θ_0 ∈ R^d is deterministic.
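A minimal numerical sketch of the recursion (1) may be helpful. The updating function H(θ, y) = θ − y and the bounded, dependent data stream (an AR(1) process passed through tanh) below are illustrative assumptions, not choices made in the paper:

```python
import numpy as np

def sgld_step(theta, y, lam, H, rng):
    # One update of (1): theta_{n+1} = theta_n - lam * H(theta_n, Y_{n+1}) + sqrt(2*lam) * xi_{n+1}
    xi = rng.standard_normal(theta.shape)
    return theta - lam * H(theta, y) + np.sqrt(2.0 * lam) * xi

# Toy ingredients (illustrative only):
H = lambda theta, y: theta - y               # dissipative and Lipschitz in theta
rng = np.random.default_rng(0)
lam = 0.01                                   # fixed step size lambda
theta = np.zeros(2)                          # deterministic initial value theta_0
w = 0.0
for n in range(2000):
    w = 0.8 * w + 0.6 * rng.standard_normal()   # AR(1) driving noise: dependent data
    y = np.tanh(w) * np.ones(2)                 # bounded, strictly stationary data stream
    theta = sgld_step(theta, y, lam, H, rng)
```

Note that the data stream here is deliberately not i.i.d., matching the setting of the paper; the chain (θ_n) is then a Markov chain only in a random environment.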
The algorithm (1) is used for approximate sampling from high-dimensional probability distributions that are not necessarily log-concave. More precisely, let U : R^d → R_+ be differentiable with derivative h = ∇U such that h(θ) = E[H(θ, Y_0)], θ ∈ R^d. Assume U has a unique minimum at θ†. For λ small and n large, Law(θ_n) is expected to be close to the probability π defined by

π(A) := ∫_A e^{−U(θ)} dθ / ∫_{R^d} e^{−U(θ)} dθ, A ∈ B(R^d), (2)

see e.g. [17, 1, 8]. If √(2λ) in (1) is replaced by √(2λ/β) for some β > 0, then the procedure samples from a distribution with density proportional to e^{−βU(x)}, which means, for β large, that θ_n concentrates near the minimizer θ† for n large enough and λ small enough. (In this paper we keep β = 1 for simplicity.)

* The second author was supported by the National Research, Development and Innovation Office within the framework of the Thematic Excellence Program 2021; National Research subprogramme "Artificial intelligence, large networks, data security: mathematical foundation and applications".
Example 1.1. We consider a regularized logistic regression where m ≥ 2, d := m − 1 and (Q_n, Z_n) ∈ {0, 1} × R^{m−1}, n ∈ Z, is a stationary sequence of random variables. The purpose is to optimize the regression parameters θ ∈ R^d in such a way that the functional

U(θ) := E[−Q_0 ln σ(⟨θ, Z_0⟩) − (1 − Q_0) ln(1 − σ(⟨θ, Z_0⟩))] + c|θ|²

is minimized, where σ(x) = 1/(1 + e^{−x}) is the sigmoid function and c > 0 is a constant. One thus tries to guess the binary variable Q from the variables Z. We then have H_i(θ, (q, z)) = −(q − σ(⟨θ, z⟩))z_i + 2cθ_i for all i = 1, . . ., d.
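The updating function of Example 1.1 can be written out directly. The sketch below checks the formula H(θ, (q, z)) = −(q − σ(⟨θ, z⟩))z + 2cθ against a finite-difference gradient of the regularized per-sample loss; the constant c and the sample values are illustrative assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def H_logreg(theta, q, z, c):
    # H(theta, (q, z)) = -(q - sigmoid(<theta, z>)) * z + 2*c*theta, as in Example 1.1
    return -(q - sigmoid(theta @ z)) * z + 2.0 * c * theta

def per_sample_loss(theta, q, z, c):
    # Regularized negative log-likelihood of one observation (q, z)
    p = sigmoid(theta @ z)
    return -(q * np.log(p) + (1 - q) * np.log(1 - p)) + c * theta @ theta

# Finite-difference check at an arbitrary point (illustrative values)
theta = np.array([0.3, -0.7, 0.2])
q, z, c = 1, np.array([1.0, 0.5, -2.0]), 0.1
eps = 1e-6
num_grad = np.array([
    (per_sample_loss(theta + eps * e, q, z, c)
     - per_sample_loss(theta - eps * e, q, z, c)) / (2 * eps)
    for e in np.eye(3)
])
```

The two gradients agree to finite-difference accuracy, confirming that H is indeed the per-sample gradient of the regularized logistic loss.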
As can be easily verified, this functional satisfies Assumption 2.1. The SGLD algorithm in this context could be applied to standard sentiment analysis problems where, based on the occurrences of key words (represented by the coordinates of Z), it should be decided whether a given review on a webshop is positive or not (Q = 1 or Q = 0), see e.g. [3].
Review data arrive continuously and often exhibit temporal dependencies and non-i.i.d. characteristics. This is because customers' reviews can be influenced by previous reviews, current trends, or the changing sentiment of other customers, leading to dependencies between reviews. Consequently, the occurrence of certain key words and the overall sentiment may not be independent across reviews. For such sentiment analysis problems, variants of stochastic gradient descent are commonly used. However, due to the lack of convexity, it is worth considering the use of SGLD.
Furthermore, sentiment analysis faces the challenge of concept drift, which refers to the situation where the underlying sentiment distribution of the data changes over time. This could be due to various factors, such as changes in product features, external events, or trends. The SGLD algorithm is capable of adapting to concept drift scenarios by continuously updating the model parameters as new data arrive.
One would try to numerically approximate the integral in (2) by the empirical averages

(1/n) ∑_{k=0}^{n−1} φ(θ_k). (3)

However, to guarantee the consistency of such a procedure, one needs to establish a corresponding law of large numbers.
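To illustrate such averaging, the sketch below runs the recursion with the toy choice H(θ, y) = θ, so that U(θ) = θ²/2 and π is the standard one-dimensional Gaussian (up to an O(λ) discretization bias), and averages φ(θ) = θ² along a single trajectory. The step size and horizon are arbitrary illustrative values:

```python
import math, random

random.seed(1)
lam, n = 0.01, 200_000
theta, acc = 0.0, 0.0
for k in range(n):
    acc += theta ** 2                      # accumulate phi(theta_k) with phi(x) = x^2
    # SGLD step with H(theta, y) = theta (the data stream plays no role here)
    theta = theta - lam * theta + math.sqrt(2 * lam) * random.gauss(0.0, 1.0)
avg = acc / n                              # (1/n) * sum_{k=0}^{n-1} phi(theta_k)
# For U(x) = x^2/2, pi is N(0, 1), so avg should be close to 1 for small lam and large n.
```

The empirical average approximates ∫ φ dπ = 1; the residual discrepancy combines the O(λ) bias of µ_λ and the Monte Carlo error that the law of large numbers and the central limit theorem quantify.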
In the case where λ in (1) is replaced by λ_n with a decreasing sequence λ_n, n ≥ 0, under suitable assumptions, the averages converge almost surely to ∫_{R^d} φ(z) π(dz) for appropriate functions φ, as shown in [13], where a related central limit theorem is also established.
In the case of fixed λ, [16] estimated the L² distance of the averages from the mean of π. Both these papers, like most available studies, assume that Y_n, n ∈ Z, are i.i.d. This does not hold true in several applications, prominently in the case of financial time series, see e.g. [10], where stochastic approximation schemes were treated in a setting with possibly dependent data. See also [1, 8, 14, 11] for more about SGLD with dependent data.
When the Y_n are independent, θ_n is a Markov chain. However, the case of general stationary Y_n is an order of magnitude more involved mathematically, since θ_n is then only a Markov chain in a random environment; see Section 2 for details.
In this article, we establish a law of large numbers (LLN) for functionals as in (3) when employing a fixed step size λ > 0. Additionally, we establish an invariance principle. These results serve as crucial theoretical guarantees for the consistency of estimates such as (3), and form the foundation for constructing confidence intervals for these estimates. Our work builds upon and extends the findings in [6], where an LLN and a CLT were shown for the stochastic gradient method with dependent data, specifically in the special case of a linear updating rule.
Our arguments are based on results of [7], which require establishing mixing properties for the process θ_t. The recent paper [15] is closely related to this part of our work: it shows mixing for a certain class of processes. That setting, however, does not cover ours, since the strong minorization property A2 in [15] does not hold for our processes.
Section 2 states and explains our main results. Their proofs are presented in Section 3 in a series of subsections.

The main result
First we formulate our working assumption on the stochastic iterative scheme given by (1).
Assumption 2.1. There are constants ∆ > 0, b ≥ 0 and K > 0 such that, for all θ, θ_1, θ_2 ∈ R^d and y ∈ R^m,

⟨H(θ, y), θ⟩ ≥ ∆|θ|² − b, (4)

and

|H(θ_1, y) − H(θ_2, y)| ≤ K|θ_1 − θ_2|. (5)

Furthermore, we assume that the process (Y_t)_{t∈Z} is strictly stationary, and there is M > 0 such that

|Y_0| ≤ M almost surely. (6)

Condition (4) is a standard dissipativity requirement; (5) is also mild, requiring H to be Lipschitz-continuous in its first variable, uniformly in the second. By stationarity, (6) implies uniform boundedness of the data stream. This may look stringent from the mathematical point of view, but it is evidently applicable in practice for two main reasons. First, many real-world applications involve data that can be naturally bounded within certain ranges. For example, pixel values in images are confined to specific ranges (e.g., 0 to 255 for grayscale images). Second, scaling the data to a compact domain is a common preprocessing step in machine learning. In conclusion, the assumptions we have made are met by a wide range of learning problems of considerable practical importance.
Next, we briefly recall the main concepts of α-mixing. Throughout this paper the probability space is (Ω, F, P), and for any two sub-σ-algebras G, H ⊂ F, we define the measure of dependence

α(G, H) := sup_{A∈G, B∈H} |P(A ∩ B) − P(A)P(B)|. (7)

Furthermore, for an arbitrary sequence of random variables (W_t)_{t∈Z}, we define the σ-algebras F^W_{i,j} := σ(W_k; i ≤ k ≤ j), −∞ ≤ i ≤ j ≤ ∞, and introduce the dependence coefficients α^W_j(n) := α(F^W_{−∞,j}, F^W_{j+n,∞}), n ≥ 1. The mixing coefficient of W is α^W(n) := sup_{j∈Z} α^W_j(n), n ≥ 1, which is obviously non-increasing in n. Note that, for strictly stationary W, α^W_j(n) does not depend on j, and thus α^W(n) = α^W_0(n).

In [11] it was established (under somewhat weaker conditions than Assumption 2.1) that Law(θ_n) converges in total variation to a limiting probability µ_λ as n → ∞. A rate estimate of the order exp(−n^{1/3}) was obtained. Clearly, µ_λ differs from π, but the bias is O(√λ) under suitable conditions, see [8].
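For σ-algebras generated by finitely many events, the supremum in the dependence measure α(G, H) can be computed by enumeration. The sketch below does this for two {0, 1}-valued random variables; it is a toy illustration of the definition, not part of the paper's argument:

```python
def alpha_dependence(joint):
    # joint[x][y] = P(X = x, Y = y) for binary X, Y.
    # alpha(sigma(X), sigma(Y)) = sup_{A, B} |P(A and B) - P(A)P(B)|,
    # where A runs over the events of sigma(X) and B over those of sigma(Y).
    px = [joint[0][0] + joint[0][1], joint[1][0] + joint[1][1]]
    py = [joint[0][0] + joint[1][0], joint[0][1] + joint[1][1]]
    events = [[], [0], [1], [0, 1]]        # all events in a two-atom sigma-algebra
    best = 0.0
    for A in events:
        for B in events:
            p_ab = sum(joint[x][y] for x in A for y in B)
            p_a = sum(px[x] for x in A)
            p_b = sum(py[y] for y in B)
            best = max(best, abs(p_ab - p_a * p_b))
    return best

independent = [[0.25, 0.25], [0.25, 0.25]]   # X and Y independent: alpha = 0
fully_coupled = [[0.5, 0.0], [0.0, 0.5]]     # X = Y a.s.: alpha attains its maximum 1/4
```

The two test distributions recover the extreme cases: independence gives α = 0, while a fully coupled pair attains the well-known upper bound α ≤ 1/4.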
In this paper, using results of [5], we prove an exponential convergence rate of Law(θ_n) to µ_λ under Assumption 2.1. More importantly, a functional central limit theorem is established under the additional Assumption 2.2. In the sequel, φ : R^d → R denotes an at most polynomially growing measurable function, i.e.

|φ(θ)| ≤ c_φ(1 + |θ|^r), θ ∈ R^d, (8)

for fixed but arbitrary constants c_φ, r > 0. Our main results are summarized in the next two theorems.
Theorem 2.3. Let Assumption 2.1 be in force, and let 0 < λ ≤ ∆/K² be fixed. Then there is a strictly stationary process (θ*_t)_{t∈N} on R^d and there are constants c, κ > 0 depending only on λ, ∆, b, K and M such that for any k ∈ N and indices 0 ≤ i_1 < . . . < i_k,

d_TV(Law(θ_{i_1}, . . ., θ_{i_k}), Law(θ*_{i_1}, . . ., θ*_{i_k})) ≤ c e^{−κ i_1}.

Furthermore, provided that (Y_t)_{t∈Z} is ergodic, we have, almost surely and in L^p for all p ≥ 1,

(1/n) ∑_{t=0}^{n−1} φ(θ_t) → E[φ(θ*_0)], n → ∞.
Remark 2.5. As we shall see later (c.f. Corollary 3.2 and Lemma 3.11), it is also true that the stationary sequence (φ(θ*_t))_{t∈N} satisfies the invariance principle.

Proofs
Throughout the rest of the paper, we use the notation X := R^d and Y := R^m; moreover, B(X) will be used for the standard Borel σ-algebra of X. As pointed out in Section 5 of [11] and also in [14], the recursive stochastic scheme (1) can be considered as a Markov chain in an exogenous random environment (MCRE), which means that there is a parametric kernel Q such that

P(θ_{t+1} ∈ A | θ_t, (Y_k)_{k∈Z}) = Q(Y_{t+1}; θ_t, A)

almost surely, for all A ∈ B(X). In our case, the transition kernel is given by

Q(y; θ, A) = P(θ − λ H(θ, y) + √(2λ) ξ_0 ∈ A),

where ξ_0 is as in the recursion (1), i.e. a standard d-dimensional Gaussian random variable.
Here we give a brief explanation of the proof strategy. First, we fix a trajectory of Y (that is, we consider the "quenched" version of the process) and, using a standard representation of MCREs by iterated random functions, we deduce an upper estimate for the coupling probability between realizations of the chain starting from different, possibly random, initial values (Lemma 3.10). To achieve this, we demonstrate that small sets, where coupling can occur with positive probability, are visited frequently enough with large probability. The so-called "annealed" version of this crucial result (Lemma 3.11) allows us to establish that the process (θ_t)_{t∈N} inherits the mixing properties of the environment (Lemma 3.14). The proof of Theorem 2.3 also heavily relies on this inequality. We actually prove a bit more: we show that there exists an almost surely finite random time at which suitable versions of (θ_t)_{t∈N} and (θ*_t)_{t∈N} are coupled to each other. Finally, the proof of the invariance principle (Theorem 2.4) boils down to verifying the conditions of Corollary 1 in Herrndorf's paper [7]: we verify that the mixing coefficients decrease sufficiently fast and that the covariance function of the process θ converges to its stationary counterpart.

Drift and minorization conditions for θ t
At this point, we establish suitable versions of the standard drift and minorization conditions, known from the theory of Markov chains (see e.g. [12]), for (θ_t)_{t∈N}. According to the next lemma, there is an a = a(λ) > 0 such that for the Lyapunov function V(θ) = exp(a|θ|²) and for the parametric kernel Q, a Foster–Lyapunov-type drift condition holds.
Proof. We can write, where by Assumption 2.1, for |y| ≤ M, we have the stated estimate. To sum up, we obtained constants for which the drift condition holds, which completes the proof.

Corollary 3.2. By induction, it easily follows that the drift inequality iterates over any collection {y_a, y_{a+1}, . . .} of environment values; hence, by the tower rule, we can estimate further and obtain the corresponding multi-step drift bound.

From now on, let us fix λ ∈ (0, ∆/K²) and a > 0 as in Lemma 3.1.
In the theory of Markov chains, the Foster–Lyapunov condition is often accompanied by a minorization condition on suitable "small sets". In the current model, we do have such a minorization condition on every compact set (in other words, compact sets are small). To see this, for fixed R > 0, |θ| ≤ R and |y| ≤ M, we can bound the transition density from below, where f_{ξ_0} is the probability density function of ξ_0 and m_{R,M,λ,K} > 0 is the resulting minorizing constant. We record this observation in the next lemma.
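The minorization mass can be made explicit in one dimension: if, for |θ| ≤ R and |y| ≤ M, the one-step means θ − λH(θ, y) all lie in an interval [−R′, R′], then the infimum over such means of the N(m, 2λ) density integrates to 2(1 − Φ(R′/√(2λ))), where Φ is the standard normal cdf. The sketch below, with illustrative constants not taken from the paper, checks this closed form against numerical integration:

```python
import math

def gauss_density(x, s):
    # density of N(0, s^2) at x
    return math.exp(-x * x / (2 * s * s)) / (s * math.sqrt(2 * math.pi))

def Phi(x):
    # standard normal cumulative distribution function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

lam, R_prime = 0.5, 1.5                 # illustrative step size and mean radius
s = math.sqrt(2 * lam)                  # noise scale sqrt(2*lam)

closed_form = 2.0 * (1.0 - Phi(R_prime / s))

# Numerical check: integrate inf_{|m| <= R'} density(x - m) over a wide grid.
# For fixed x, the infimum is attained at the mean farthest from x, at distance |x| + R'.
h, lo, hi = 1e-3, -10.0, 10.0
grid_mass, x = 0.0, lo + h / 2          # midpoint rule
while x < hi:
    grid_mass += h * gauss_density(abs(x) + R_prime, s)
    x += h
```

The positivity of this mass is what makes every compact set small: the Gaussian noise in (1) spreads mass over the whole space, uniformly over bounded means.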
Lemma 3.3. For every R > 0, there is a Borel probability measure ν_R on B(X) and a coefficient α_R ∈ (0, 1) such that for |y| ≤ M and |θ| ≤ R,

Q(y; θ, A) ≥ α_R ν_R(A), A ∈ B(X). (10)

Stationary initialization
We need to show that, starting from a suitable random initial state θ*_0, the process (θ*_t)_{t∈N} has a stationary version (in the strict sense).
Let M^Y be the set of Borel probability laws on X × Y^Z whose second marginal equals the law of (Y_t)_{t∈Z}, and let M^Y_b denote the set of those µ ∈ M^Y for which the process (θ′_t)_{t∈N}, started from a random initial state θ′_0 with Law((θ′_0, (Y_t)_{t∈Z})) = µ, satisfies a uniform-in-time moment bound. By Corollary 3.2 and the Markov inequality, such a bound holds for every random variable θ′_0 with E[V(θ′_0)] < ∞. In particular, for any deterministic θ_0 ∈ X, δ_{θ_0} ⊗ Law((Y_t)_{t∈Z}) ∈ M^Y_b, where δ_{θ_0} stands for the Dirac measure concentrated at θ_0.

Lemma 3.4. There exists µ* ∈ M^Y_b such that the process (θ*_t)_{t∈N}, started from Law((θ*_0, (Y_t)_{t∈Z})) = µ*, is strictly stationary.

Proof. The statement follows from Rásonyi and Gerencsér's recent result, Theorem 3.10 in [5]. They also prove that Law((θ*_t, (Y_{k+t})_{k∈Z})) = µ* for each t ∈ N. Since (θ*_t, (Y_{t+k})_{k∈Z}), t ∈ N, is a time-homogeneous Markovian process, strong stationarity follows.
Remark 3.5. By Corollary 3.2 and Lemma 3.4, for any arbitrary but deterministic θ_0 ∈ X, we have the corresponding moment bound, and thus, by the strong stationarity of (Y_t)_{t∈Z}, it immediately follows that µ* ∈ M^Y_b.
Remark 3.6. With the above form of the drift and minorization conditions in hand, and using a recent result of Truquet (Theorem 1 in [15]), we could as well have deduced the existence of a stationary process (θ*_t)_{t∈Z}. However, we will need a bit more. We aim to show that there is a coupling between the iterations (θ_t)_{t∈N} initialized with the deterministic value θ_0 ∈ X and an appropriate version of (θ*_t)_{t∈Z}. This is why we preferred the approach presented in [5].
It is also shown in [15] that the distribution of (θ*_t)_{t∈Z} is unique; moreover, the process (θ*_t, Y_t)_{t∈Z} is ergodic provided that (Y_t)_{t∈Z} is ergodic. The latter will be very important for us, since the proof of Theorem 2.4 relies on this result.
In addition, Truquet proved that under a milder form of the drift and minorization conditions (see Assumptions A2 and A3 in [15]), Law(θ_t) → Law(θ*_0) in total variation as t → ∞. However, as Truquet remarked, the assumptions of [15] do not make it possible to obtain a rate of convergence for Law(θ_t).
In the rest of this subsection, we present an alternative approach yielding a somewhat stronger result on the convergence of (Law(θ_t))_{t∈N}, using the results of the recent paper [4]. The reader can skip this part without affecting the understanding of what follows. The (1 + V)-weighted total variation distance between Borel probability measures µ, ν on B(X) is defined by

d^{1+V}_{TV}(µ, ν) := sup_{|ψ| ≤ 1+V} |∫_X ψ dµ − ∫_X ψ dν|,

where the supremum is taken over measurable ψ : X → R with |ψ| ≤ 1 + V.

Lemma 3.7. Law(θ_t) converges to Law(θ*_0) at a geometric rate in the distance d^{1+V}_{TV}.

Proof. With the above choice of V, we have E(V(θ_0)² + V(θ_1)²) < ∞, hence the moment condition on initial values, i.e. Assumption 2.6 in [4], is in force, and the other assumptions of [4] are clearly met too (with the quantities λ, α, K constant and with ℓ ≡ 0, since Y is bounded). Hence Theorem 2.11 of [4] implies the convergence of Law(θ_t) towards the limiting distribution Law(θ*_0) at a geometric rate in d^{1+V}_{TV}.
Corollary 3.8. It is clear from the definition of d^{1+V}_{TV} and from Lemma 3.7 that, for any φ satisfying (8), E[φ(θ_t)] → E[φ(θ*_0)] at a geometric rate as t → ∞.

Coupling construction
Let R > 0 be a constant which we fix later, and let (ε_t)_{t∈Z} be a sequence of i.i.d. uniform random variables on [0, 1], independent of (Y_k)_{k∈Z} and also of (ξ_n)_{n≥1}. The next lemma is a standard representation result for parametric kernels satisfying the minorization condition (10).

Lemma 3.9. There exists a measurable map T : X × Y × [0, 1] → X such that, for every θ ∈ X and |y| ≤ M, Law(T(θ, y, ε_0)) = Q(y; θ, ·) and, with probability at least α_R, the map θ ↦ T(θ, y, ε_0) is constant on the ball {θ : |θ| ≤ R}.
We drop the dependence of the mappings T on ε_t in the notation and simply write T_t(y)θ := T(θ, y, ε_t). For s ∈ Z and θ ∈ X, define the family of auxiliary processes

Z^{θ,y}_{s,t} := θ, t ≤ s, and Z^{θ,y}_{s,t} := T_t(y_{t−1}) Z^{θ,y}_{s,t−1}, t > s,

where y = (. . ., y_{−1}, y_0, y_1, . . .) ∈ Y^Z is a fixed trajectory. Clearly, for any random variable (θ′_0, (Y_k)_{k∈Z}) and s ∈ N, Z^{θ′_s,Y}_{s,t}, t ≥ s, is a version of the process (θ′_t)_{t∈N} defined through the iterative scheme (1), starting from θ′_0 and driven by (Y_k)_{k∈Z}. Furthermore, the process Z^{θ_0,y}_{s,t}, t ≥ s, is a time-inhomogeneous Markov chain that follows the dynamics of θ_t, t ∈ N, with the environment being "frozen". Since the process (Y_k)_{k∈Z} is almost surely bounded by M > 0, we can restrict ourselves to trajectories y ∈ Y^Z satisfying sup_{k∈Z} |y_k| ≤ M, and thus Z^{θ_0,y}_{s,t}, t ≥ s, is a Harris recurrent chain. The next lemma controls the coupling time between processes starting from different initial values.

Lemma 3.10. Let θ_1, θ_2 ∈ X be arbitrary but fixed and let y ∈ Y^Z be such that sup_{k∈Z} |y_k| ≤ M. Then there exist constants κ > 0 and N ∈ N, depending only on λ, ∆, b, K and M, such that for n ≥ N the no-coupling probability P(Z^{θ_1,y}_{0,n} ≠ Z^{θ_2,y}_{0,n}) admits an upper bound decaying at rate e^{−κn}.

Proof. First, we fix γ < γ′ < 1 and choose R > 0 so large that 2C < (γ′ − γ)e^{(a/2)R²}. Furthermore, we introduce the notations Z_n := (Z^{θ_1,y}_{0,n}, Z^{θ_2,y}_{0,n}), Z̄_n := max(|Z^{θ_1,y}_{0,n}|, |Z^{θ_2,y}_{0,n}|) and the sequence of successive visiting times of the small set, and thus for k ≥ 1 and s ≥ 0, we obtain the corresponding estimate. Iteration of this argument leads to the following estimation.
Along similar lines, we can show the analogous bound. Let us fix γ′′ such that γ′ < γ′′ < 1. For the generating function of the time elapsed between the kth and (k + 1)th visits, we get the stated estimate, and similarly for k = 0; hence, by the Markov inequality and the tower rule, for 0 < m < n, we obtain a tail bound for the visiting times. Again we fix a constant γ′′′ such that γ′′ < γ′′′ < 1, and define m_n accordingly. Obviously, for n so large that m_n ≥ 1, we have the corresponding inequality. Next, we estimate the probability of no coupling on events where the small set is visited at least m_n times. According to Lemma 3.9, for j = 1, . . ., m_n, the map θ ↦ T(θ, y, ε_{σ_j+1}) is constant on the ball {θ : |θ| ≤ R} with probability at least α_R, hence we can write the product bound, where we used that for every j, ε_{σ_j+1} is independent of F^ε_{−∞,σ_j}. At last, we combine this estimate with the one obtained for the tail probability of the visiting times, and obtain

P(Z^{θ_1,y}_{0,n} ≠ Z^{θ_2,y}_{0,n}) ≤ P(Z^{θ_1,y}_{0,n} ≠ Z^{θ_2,y}_{0,n}, σ_{m_n} < n) + P(σ_{m_n} ≥ n),

which completes the proof.
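The exact Nummelin-type coupling used above is not straightforward to simulate, but the geometric contraction it quantifies can be illustrated with a simpler synchronous coupling: two copies of the quenched chain driven by the same noise, started from different initial values. With the toy linear choice H(θ, y) = θ − y (an assumption for illustration only), the shared noise and data cancel and the difference of the two chains decays deterministically like (1 − λ)^n:

```python
import math, random

random.seed(2)
lam = 0.05
theta1, theta2 = 5.0, -3.0                # two different initial values
y = 0.3                                   # a frozen ("quenched") environment value
for n in range(200):
    xi = random.gauss(0.0, 1.0)           # shared noise: synchronous coupling
    theta1 = theta1 - lam * (theta1 - y) + math.sqrt(2 * lam) * xi
    theta2 = theta2 - lam * (theta2 - y) + math.sqrt(2 * lam) * xi
gap = abs(theta1 - theta2)
expected = (1 - lam) ** 200 * 8.0         # deterministic contraction of the initial gap
```

Unlike the coupling of Lemma 3.10, the synchronous coupling never makes the chains exactly equal; it only illustrates the geometric forgetting of the initial condition that underlies the mixing estimates.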
The following annealed version of Lemma 3.10 will be important later.
Lemma 3.11. Let θ_1, θ_2 be random variables independent of F^ε_{m+1,∞} for some m ∈ N. Then the corresponding annealed coupling estimate holds.

Proof. Estimating the conditional probability using Lemma 3.10 yields the claimed bound with the shifted environment, where S^m Y refers to the m-times left-shifted trajectory of Y, i.e. (S^m Y)_k = Y_{k+m}, k ∈ Z. Finally, we take expectations and obtain the claimed inequality.

Mixing properties
In what follows, we show that the mixing properties of the exogenous environment transfer to the process θ_t, t ∈ N. For any system of sub-σ-algebras A_i ⊂ F, i ∈ I, we use the notation ∨_{i∈I} A_i for the σ-algebra generated by the system (A_i)_{i∈I}.
Remark 3.13. We need the special case when A_1, A_2, B_1, B_2 ⊂ F are σ-algebras such that A_2 and B_2 are independent too. For this, Lemma 3.12 gives α(A_1 ∨ A_2, B_1 ∨ B_2) ≤ α(A_1, B_1). By the definition of the measure of dependence between σ-algebras (7), the reverse inequality trivially holds, hence α(A_1 ∨ A_2, B_1 ∨ B_2) = α(A_1, B_1).

The next lemma provides an upper bound for the strong mixing coefficient of the chain (θ_t)_{t∈N} in terms of α^Y.

Lemma 3.14. For the dependence coefficient α^θ_j(n), we have an upper estimate in terms of α^Y and an exponentially decaying term, where κ and N are as in Lemma 3.10.
Proof. We introduce the notations θ_{→t} := (θ_0, θ_1, . . ., θ_t) and θ_{t→} := (θ_t, θ_{t+1}, . . .), and also Z^{θ_0,y}_{s,→t} := (Z^{θ_0,y}_{s,s}, . . ., Z^{θ_0,y}_{s,t}), Z^{θ_0,y}_{s,t→} := (Z^{θ_0,y}_{s,t}, Z^{θ_0,y}_{s,t+1}, . . .). Let A ∈ F^θ_{0,j} and B ∈ F^θ_{j+n,∞} be arbitrary events. Then, by the definition of the generated σ-algebra, there exist A_X ∈ B(X^{j+1}) and B_X ∈ B(X^N) such that A and B can be expressed in terms of θ_{→j} and θ_{j+n→}, respectively. So, for any r_n satisfying 0 ≤ r_n ≤ n − N, we can write the corresponding decomposition. Observe that the relevant σ-algebras factorize, and the σ-algebras F^ε_{1,j} and F^ε_{j+r_n+1,j+n} are independent of each other; hence, by Remark 3.13 and the stationarity of (Y_k)_{k∈Z}, we obtain (14). By Lemma 3.11 and Corollary 3.2, we can estimate the second term on the right-hand side of (13). Combining this with (14), and taking the supremum on the left-hand side of (13), yields the estimate for any 0 ≤ r_n ≤ n − N. Choosing r_n = ⌊n/2⌋, we obtain the desired inequality.

Proof of Theorem 2.3
Lemma 3.15. The sequence (φ(θ_t))_{t∈N} is uniformly L^p-bounded for every 1 ≤ p < ∞, that is,

sup_{t∈N} E|φ(θ_t)|^p < ∞. (15)

Proof. Using x^s ≤ Γ(s + 1)e^x, x, s ≥ 0, and (8), by Corollary 3.2 we can write an upper bound that does not depend on t, hence (15) clearly holds.
By interchanging the roles of (θ_t)_{t∈N} and (θ*_t)_{t∈N}, we obtain the reverse estimate. Next, we take the supremum on the left-hand side over A ∈ B(X^k), and then, by Lemma 3.11 and Remark 3.5, we arrive at the claimed bound, where N ∈ N is as in Lemmas 3.10 and 3.11.
In what follows, we proceed with the proof of the law of large numbers, both in the strong and in the L^p sense. Again by Lemma 3.11 and Remark 3.5, there exists an almost surely finite random variable τ such that suitable versions of the processes coincide from time τ on. Furthermore, the tail distribution of τ decays geometrically. Finally, by Lemma 3.15, the sequence (1/n)(φ(θ_0) + . . . + φ(θ_{n−1})), n ≥ 1, is uniformly integrable at every power 1 ≤ p < ∞, and thus the law of large numbers holds in the L^p sense too, for 1 ≤ p < ∞, which completes the proof.

Proof of Theorem 2.4
The subsequent lemma establishes a stability result for the autocovariance function of the sequence (φ(θ_k))_{k∈N}. Additionally, it provides an explicit upper bound for sup_{k∈N} |Cov(φ(θ_k), φ(θ_{k+l}))| in terms of the α-mixing coefficient of Y. For the sake of readability, the proof is deferred to Appendix A. Here ǫ > 0 is as in Assumption 2.2.
Finally, by Lemma 3.14, the mixing coefficient α^X_j(n) satisfies the required bound, and thus so does α^X(n). To sum up, we have shown that all the conditions of Corollary 1 in [7] are satisfied, hence we can conclude that, if σ > 0, then the sequence of random functions (B_n)_{n≥1}, given by the appropriately rescaled partial sums of (φ(θ_t))_{t∈N}, converges in law to a standard Brownian motion, which completes the proof of Theorem 2.4.

A Proof of Lemma 3.16
The following auxiliary result is a variation of Theorem 17.2.2 from the renowned book by Ibragimov and Linnik [9]. In the interest of a self-contained exposition, and for future reference, we have chosen to present this result here in the required form.
We can estimate the covariance term by term, and thus we arrive at a bound on the covariance in question; moreover, by the Cauchy–Schwarz inequality, we obtain the corresponding moment estimate. At last, we set L = α(A_1, A_2)^{−ǫ/2}, and since α(A_1, A_2) ≤ 1, we obtain the claimed bound, which completes the proof.
Proof of Lemma 3.16. i) Let l ∈ N be arbitrary. Then, by the Cauchy–Schwarz inequality and Lemma 3.15, we can write the first bound. Let k ≥ N, where N is as in Lemmas 3.10 and 3.11. We can estimate further, which completes the proof.
The coupling estimate holds with constants κ > 0 and N ∈ N as in Lemmas 3.10 and 3.11. When the data stream (Y_k)_{k∈Z} is ergodic, by Remark 3.6, the process Z^{θ*_0,Y}_{0,n}, n ∈ N, is also ergodic; moreover, by Remark 3.5, E(φ(θ*_0)) < ∞ for any φ : X → R satisfying (8), hence by Birkhoff's ergodic theorem,

(1/n) ∑_{t=0}^{n−1} φ(Z^{θ*_0,Y}_{0,t}) → E(φ(θ*_0)), n → ∞, P-a.s.

Combining this with the above result on the almost surely finite coupling time yields the strong law of large numbers for φ(Z^{θ_0,Y}_{0,n}), n ∈ N. As we mentioned earlier, the discrete-time processes (θ_n)_{n∈N} and (Z^{θ_0,Y}_{0,n})_{n∈N} are versions of each other, hence the strong law of large numbers holds for (φ(θ_n))_{n∈N} as well.