A proof of convergence for stochastic gradient descent in the training of artificial neural networks with ReLU activation for constant target functions

In this article we study the stochastic gradient descent (SGD) optimization method in the training of fully-connected feedforward artificial neural networks with ReLU activation. The main result of this work proves that the risk of the SGD process converges to zero if the target function under consideration is constant. In the established convergence result the considered artificial neural networks consist of one input layer, one hidden layer, and one output layer (with $d \in \mathbb{N}$ neurons on the input layer, $H \in \mathbb{N}$ neurons on the hidden layer, and one neuron on the output layer). The learning rates of the SGD process are assumed to be sufficiently small and the input data used in the SGD process to train the artificial neural networks is assumed to be independent and identically distributed.


Introduction
Artificial neural networks (ANNs) are nowadays widely used in many real-world applications, including, e.g., text analysis, image recognition, autonomous driving, and game intelligence. Stochastic gradient descent (SGD) optimization methods provide the standard schemes used for the training of ANNs. Nonetheless, to this day there is no complete mathematical analysis in the scientific literature which rigorously explains the success of SGD optimization methods in the training of ANNs in numerical simulations.
However, there are several interesting directions of research regarding the mathematical analysis of SGD optimization methods in the training of ANNs. The convergence of SGD optimization schemes for convex target functions is well understood, cf., e.g., [4,33,34,35,38] and the references mentioned therein. For abstract convergence results for SGD optimization methods without convexity assumptions we refer, e.g., to [1,7,13,14,18,26,29,31] and the references mentioned therein. We also refer, e.g., to [10,24,32,41] and the references mentioned therein for lower bounds and divergence results for SGD optimization methods. For more detailed overviews and further references on SGD optimization schemes we refer, e.g., to [8], [18, Section 1.1], [23, Section 1], and [39]. The effect of random initializations in the training of ANNs was studied, e.g., in [6,20,21,25,32,42] and the references mentioned therein. Another promising branch of research has investigated the convergence of SGD for the training of ANNs in the so-called overparametrized regime, where the number of ANN parameters has to be sufficiently large. In this situation SGD can be shown to converge to global minima with high probability; see, e.g., [12,16,17,22,30,43] for the case of shallow ANNs and, e.g., [2,3,15,40,44] for the case of deep ANNs. These works consider the empirical risk, which is measured with respect to a finite set of data.
Another direction of research is to study the true risk landscape of ANNs and characterize the saddle points and local minima, which was done in Cheridito et al. [11] for the case of affine target functions. The question under which conditions gradient-based optimization algorithms cannot converge to saddle points was investigated, e.g., in [27,28,36,37] for the case of deterministic GD optimization schemes and, e.g., in [19] for the case of SGD optimization schemes.
In this work we study the plain vanilla SGD optimization method in the training of fully-connected feedforward ANNs with ReLU activation in the special situation where the target function is a constant function. The main result of this work, Theorem 3.12 in Subsection 3.6, proves that the risk of the SGD process converges to zero in the almost sure and the $L^1$-sense if the learning rates are sufficiently small but fail to be summable. We thereby extend the findings in our previous article Cheridito et al. [9] by proving convergence for the SGD optimization method instead of merely for the deterministic GD optimization method, by allowing the gradient to be defined as the limit of the gradients of appropriate general approximations of the ReLU activation function instead of a specific choice for the approximating sequence, by allowing the learning rates to be non-constant and to vary over time, by allowing the input data to be multi-dimensional, and by allowing the law of the input data to be an arbitrary probability distribution on $[a,b]^d$ with $a \in \mathbb{R}$, $b \in (a,\infty)$, $d \in \mathbb{N}$ instead of the continuous uniform distribution on $[0,1]$.
To illustrate the findings of this work in more detail, we present in Theorem 1.1 below a special case of Theorem 3.12. Before presenting the rigorous mathematical statement of Theorem 1.1, we provide an informal description of the statement and briefly explain some of the mathematical objects that appear in it.
In Theorem 1.1 we study the SGD optimization method in the training of fully-connected feedforward artificial neural networks (ANNs) with three layers: the input layer, one hidden layer, and the output layer. The input layer consists of $d \in \mathbb{N} = \{1, 2, \dots\}$ neurons (the input is thus $d$-dimensional), the hidden layer consists of $H \in \mathbb{N}$ neurons (the hidden layer is thus $H$-dimensional), and the output layer consists of $1$ neuron (the output is thus one-dimensional).
In between the $d$-dimensional input layer and the $H$-dimensional hidden layer an affine linear transformation from $\mathbb{R}^d$ to $\mathbb{R}^H$ is applied with $Hd + H$ real parameters, and in between the $H$-dimensional hidden layer and the $1$-dimensional output layer an affine linear transformation from $\mathbb{R}^H$ to $\mathbb{R}^1$ is applied with $H + 1$ real parameters. Overall the considered ANNs are thus described through
\[
\mathfrak{d} = (Hd + H) + (H + 1) = Hd + 2H + 1 \tag{1}
\]
real parameters. In Theorem 1.1 we assume that the target function which we intend to learn is a constant and the real number $\xi \in \mathbb{R}$ in Theorem 1.1 specifies this constant. The real numbers $a \in \mathbb{R}$, $b \in (a,\infty)$ in Theorem 1.1 specify the set in which the input data for the training process lies in the sense that we assume that the input data is given through $[a,b]^d$-valued i.i.d. random variables.
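To fix ideas, the following NumPy sketch evaluates such a three-layer ReLU network whose $Hd + 2H + 1$ parameters are stored in a single vector; the particular ordering of the parameters inside the vector is a hypothetical choice for illustration (the paper fixes its own ordering in (5) below).

    import numpy as np

    def realization(phi, x, d, H):
        """Realization of a d -> H -> 1 ReLU network whose H*d + 2*H + 1
        parameters are stored in the single vector phi. The ordering of
        the parameters inside phi is a hypothetical choice for
        illustration, not the ordering fixed in the paper."""
        W1 = phi[: H * d].reshape(H, d)          # inner weights (H*d parameters)
        b1 = phi[H * d : H * d + H]              # inner biases (H parameters)
        w2 = phi[H * d + H : H * d + 2 * H]      # outer weights (H parameters)
        b2 = phi[-1]                             # outer bias (1 parameter)
        return w2 @ np.maximum(W1 @ x + b1, 0.0) + b2

    d, H = 3, 5
    assert H * d + 2 * H + 1 == 26               # total number of parameters, cf. (1)
    phi = np.random.standard_normal(H * d + 2 * H + 1)
    print(realization(phi, np.ones(d), d, H))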
In Theorem 1.1 we study the SGD optimization method in the training of ANNs with the rectifier function $\mathbb{R} \ni x \mapsto \max\{x,0\} \in \mathbb{R}$ as the activation function. This type of activation is often also referred to as rectified linear unit activation (ReLU activation). The ReLU activation function $\mathbb{R} \ni x \mapsto \max\{x,0\} \in \mathbb{R}$ fails to be differentiable at the origin and can therefore in general not be used to define gradients of the considered risk function and the corresponding gradient descent process. In implementations, perhaps the most common procedure to overcome this issue is to formally apply the chain rule as if all involved functions were differentiable and to define the "derivative" of the ReLU activation function as the left derivative of the ReLU activation function. This is precisely the way SGD is implemented in TensorFlow and we refer to Subsection 3.7 for a short illustrative example Python code for the computation of such generalized gradients of the risk function.
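For illustration, the following minimal TensorFlow snippet shows this convention at work: the reported gradient of the ReLU function at the origin equals $0$, which is exactly the left derivative.

    import tensorflow as tf

    x = tf.Variable([-1.0, 0.0, 1.0])
    with tf.GradientTape() as tape:
        y = tf.reduce_sum(tf.nn.relu(x))
    # TensorFlow reports the gradient [0., 0., 1.]; in particular the
    # "derivative" at 0 is 0, i.e., the left derivative of max{x, 0}.
    print(tape.gradient(y, x).numpy())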
In this article we mathematically formalize this procedure (see (2), (69), and item (ii) in Proposition 3.2) by employing appropriate continuously differentiable functions which approximate the ReLU activation function in the sense that the employed approximating functions converge to the ReLU activation function and that the derivatives of the employed approximating functions converge to the left derivative of the ReLU activation function. More specifically, in Theorem 1.1 the function $R_\infty \colon \mathbb{R} \to \mathbb{R}$ is the ReLU activation function and the functions $R_r \colon \mathbb{R} \to \mathbb{R}$, $r \in \mathbb{N}$, serve as continuously differentiable approximations of the ReLU activation function $R_\infty$. In particular, in Theorem 1.1 we assume that for all $x \in \mathbb{R}$ it holds that $R_\infty(x) = \max\{x,0\}$ and
\[
\sup_{r \in \mathbb{N}} \sup_{y \in [-|x|,|x|]} |(R_r)'(y)| < \infty
\qquad\text{and}\qquad
\limsup_{r \to \infty} \bigl( |R_r(x) - R_\infty(x)| + |(R_r)'(x) - \mathbb{1}_{(0,\infty)}(x)| \bigr) = 0. \tag{2}
\]
In Theorem 1.1 the realization functions associated to the considered ANNs are described through the functions $\mathcal{N}^\phi_r \colon \mathbb{R}^d \to \mathbb{R}$ for $\phi \in \mathbb{R}^{\mathfrak d}$, $r \in \mathbb{N} \cup \{\infty\}$ (cf. (5) below). The input data which is used to train the considered ANNs is provided through the random variables $X_{n,m} \colon \Omega \to [a,b]^d$, $n,m \in \mathbb{N}_0$, which are assumed to be i.i.d. random variables. Here $(\Omega, \mathcal{F}, \mathbb{P})$ is the underlying probability space. The function $\mathcal{L} \colon \mathbb{R}^{\mathfrak d} \to \mathbb{R}$ in Theorem 1.1 specifies the risk function associated to the considered supervised learning problem and, roughly speaking, for every neural network parameter $\phi \in \mathbb{R}^{\mathfrak d}$ the value $\mathcal{L}(\phi) \in [0,\infty)$ of the risk function measures how well the realization function $\mathcal{N}^\phi_\infty \colon \mathbb{R}^d \to \mathbb{R}$ of the neural network associated to $\phi$ approximates the constant target function $[a,b]^d \ni x \mapsto \xi \in \mathbb{R}$. The sequence of natural numbers $(M_n)_{n \in \mathbb{N}_0} \subseteq \mathbb{N}$ describes the sizes of the mini-batches in the SGD process. The SGD optimization method is described through the SGD process $\Theta \colon \mathbb{N}_0 \times \Omega \to \mathbb{R}^{\mathfrak d}$ in Theorem 1.1 and the real numbers $\gamma_n \in [0,\infty)$, $n \in \mathbb{N}_0$, specify the learning rates of the SGD process. The learning rates are assumed to be sufficiently small in the sense that
\[
\sup_{n \in \mathbb{N}_0} \gamma_n \le (5 + 5\|\Theta_0\|)^{-2} (\max\{|\xi|,|a|,|b|,d\})^{-5} \tag{4}
\]
and the learning rates are assumed to fail to be summable in the sense that $\sum_{k=0}^\infty \gamma_k = \infty$. Under these assumptions Theorem 1.1 proves that the true risk $\mathcal{L}(\Theta_n)$ converges to zero in the almost sure and the $L^1$-sense as the number of gradient descent steps $n \in \mathbb{N}$ increases to infinity. We now present Theorem 1.1 and thereby precisely formalize the informal description above.
Theorem 1.1. In the setting described above, assume that $\Theta_0$ and $(X_{n,m})_{(n,m) \in (\mathbb{N}_0)^2}$ are independent and assume for all $n \in \mathbb{N}_0$, $\omega \in \Omega$ that
\[
\Theta_{n+1}(\omega) = \Theta_n(\omega) - \gamma_n \, \mathcal{G}^n(\Theta_n(\omega), \omega),
\qquad
18 \, (\max\{|\xi|,|a|,|b|,d\})^5 \, \gamma_n \le (1 + \|\Theta_0(\omega)\|)^{-2},
\]
and $\sum_{k=0}^\infty \gamma_k = \infty$. Then

(i) there exists $C \in \mathbb{R}$ such that $\mathbb{P}\bigl( \sup_{n \in \mathbb{N}_0} \|\Theta_n\| \le C \bigr) = 1$,

(ii) it holds that $\mathbb{P}\bigl( \limsup_{n \to \infty} \mathcal{L}(\Theta_n) = 0 \bigr) = 1$, and

(iii) it holds that $\limsup_{n \to \infty} \mathbb{E}[\mathcal{L}(\Theta_n)] = 0$.

Theorem 1.1 is a direct consequence of Corollary 3.13 in Subsection 3.6 below. Corollary 3.13, in turn, follows from Theorem 3.12. Theorem 3.12 proves that the true risk of the considered SGD processes $(\Theta_n)_{n \in \mathbb{N}_0}$ converges to zero both in the almost sure and the $L^1$-sense in the special case where the target function is constant. In Section 2 we establish an analogous result for the deterministic GD optimization method. More specifically, Theorem 2.16 demonstrates that the true risk of the considered GD processes converges to zero if the target function is constant. Our proofs of Theorem 2.16 and Theorem 3.12 make use of similar Lyapunov estimates as in Cheridito et al. [9]; in addition, in the proofs of Theorem 2.16 and Theorem 3.12 we employ a contradiction argument to deal with the case of non-constant learning rates.
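To make the setting of Theorem 1.1 concrete, the following NumPy sketch implements mini-batch SGD for a shallow ReLU network with a constant target function, using the left-derivative convention for the ReLU gradient; all concrete choices (dimensions, batch size, learning-rate schedule, initialization) are illustrative assumptions and not part of the theorem.

    import numpy as np

    rng = np.random.default_rng(0)
    d, H, xi, a, b = 2, 8, 1.5, 0.0, 1.0        # illustrative choices
    W1 = rng.standard_normal((H, d)); b1 = rng.standard_normal(H)
    w2 = rng.standard_normal(H); b2 = rng.standard_normal()

    def relu(z):  return np.maximum(z, 0.0)
    def drelu(z): return (z > 0.0).astype(float)   # left derivative: 0 at z = 0

    M = 32                                       # mini-batch size (constant here)
    for n in range(20000):
        gamma = 0.01 / (1.0 + 0.001 * n)         # small, non-summable learning rates
        X = rng.uniform(a, b, size=(M, d))       # i.i.d. input data in [a,b]^d
        z = X @ W1.T + b1                        # (M, H) pre-activations
        out = relu(z) @ w2 + b2                  # (M,) network outputs
        err = out - xi                           # residuals w.r.t. constant target
        # generalized gradients of the empirical risk (1/M) * sum_m err_m^2
        g_w2 = 2.0 * relu(z).T @ err / M
        g_b2 = 2.0 * err.mean()
        back = (2.0 * err[:, None] * w2[None, :]) * drelu(z) / M
        g_W1 = back.T @ X
        g_b1 = back.sum(axis=0)
        W1 -= gamma * g_W1; b1 -= gamma * g_b1
        w2 -= gamma * g_w2; b2 -= gamma * g_b2

    # the empirical risk should be close to zero after training
    print(((relu(X @ W1.T + b1) @ w2 + b2 - xi) ** 2).mean())

The schedule $\gamma_n = 0.01/(1 + 0.001\,n)$ is chosen so that the learning rates are small and satisfy $\sum_n \gamma_n = \infty$, in line with the hypotheses of Theorem 1.1.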

Convergence of gradient descent (GD) processes
In this section we establish in Theorem 2.16 in Subsection 2.8 below that the true risks of GD processes converge to zero in the training of ANNs with ReLU activation if the target function under consideration is a constant. Theorem 2.16 imposes the mathematical framework in Setting 2.1 in Subsection 2.1 below and in Setting 2.1 we formally introduce, among other things, the considered target function $f \colon [a,b]^d \to \mathbb{R}$ (which is assumed to be an element of the set $C([a,b]^d, \mathbb{R})$ of continuous functions from $[a,b]^d$ to $\mathbb{R}$), the realization functions $\mathcal{N}^\phi_\infty \colon \mathbb{R}^d \to \mathbb{R}$, $\phi \in \mathbb{R}^{\mathfrak d}$, of the considered ANNs (see (8) in Setting 2.1), the true risk function $\mathcal{L}_\infty \colon \mathbb{R}^{\mathfrak d} \to \mathbb{R}$, a sequence of smooth approximations $R_r \colon \mathbb{R} \to \mathbb{R}$, $r \in \mathbb{N}$, of the ReLU activation function (see (7) in Setting 2.1), as well as the appropriately generalized gradient function $\mathcal{G} = (\mathcal{G}_1, \dots, \mathcal{G}_{\mathfrak d}) \colon \mathbb{R}^{\mathfrak d} \to \mathbb{R}^{\mathfrak d}$ associated to the true risk function. In the elementary result in Proposition 2.2 in Subsection 2.2 below we also explicitly specify a simple example for the considered sequence of smooth approximations of the ReLU activation function. Proposition 2.2 is proved, e.g., as Cheridito et al. [9, Proposition 2.2].
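For concreteness, one admissible family of such smooth approximations (a sketch satisfying the conditions recalled above, not necessarily the specific choice made in Proposition 2.2) is the following:
\[
R_r(x) =
\begin{cases}
0 & \colon x \le 0, \\
\tfrac{r x^2}{2} & \colon 0 < x < \tfrac{1}{r}, \\
x - \tfrac{1}{2r} & \colon x \ge \tfrac{1}{r},
\end{cases}
\qquad
(R_r)'(x) = \min\{\max\{r x, 0\}, 1\}.
\]
Each such $R_r$ is continuously differentiable with $0 \le (R_r)' \le 1$, satisfies $\sup_{x \in \mathbb{R}} |R_r(x) - \max\{x,0\}| \le \frac{1}{2r}$, and fulfills $(R_r)'(x) \to \mathbb{1}_{(0,\infty)}(x)$ as $r \to \infty$ for every $x \in \mathbb{R}$; in particular, $(R_r)'(0) = 0$ matches the left derivative of the ReLU function at the origin.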
Item (ii) in Theorem 2.16 in Subsection 2.8 below shows that the true risk $\mathcal{L}_\infty(\Theta_n)$ of the GD process $\Theta \colon \mathbb{N}_0 \to \mathbb{R}^{\mathfrak d}$ converges to zero as the number of gradient descent steps $n \in \mathbb{N}$ increases to infinity. In our proof of Theorem 2.16 we use upper estimates for the standard norm of the generalized gradient function $\mathcal{G} \colon \mathbb{R}^{\mathfrak d} \to \mathbb{R}^{\mathfrak d}$ as well as the Lyapunov estimate for GD processes in Lemma 2.12 in Subsection 2.7 below. Our proof of Lemma 2.12 uses the elementary representation result for the gradient function of the Lyapunov function $V \colon \mathbb{R}^{\mathfrak d} \to \mathbb{R}$ in Proposition 2.8 in Subsection 2.6 below as well as the identities for the gradient flow dynamics of the Lyapunov function in Proposition 2.9 and Corollary 2.10 in Subsection 2.6 below.
The findings in this section extend and/or generalize the findings in Section 2 and Section 3 in Cheridito et al. [9] to the more general and multi-dimensional setup considered in Setting 2.1.
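In schematic terms, the convergence mechanism established in this section can be summarized as follows (a heuristic outline under the notation above, not a precise statement of the lemmas):
\[
V(\Theta_{n+1}) \;\le\; V(\Theta_n) \;-\; c \, \gamma_n \, \mathcal{L}_\infty(\Theta_n), \qquad n \in \mathbb{N}_0,
\]
for a suitable constant $c \in (0,\infty)$ (assumed here for illustration), so that the values $V(\Theta_n)$ are non-increasing and summing over $n$ yields $\sum_{n=0}^\infty \gamma_n \, \mathcal{L}_\infty(\Theta_n) \le c^{-1} V(\Theta_0) < \infty$; combined with $\sum_{n=0}^\infty \gamma_n = \infty$ this forces $\liminf_{n \to \infty} \mathcal{L}_\infty(\Theta_n) = 0$, and the local Lipschitz estimates on $\mathcal{L}_\infty$ upgrade this to $\lim_{n \to \infty} \mathcal{L}_\infty(\Theta_n) = 0$ (cf. the proof of Theorem 2.16 below).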

Description of artificial neural networks (ANNs) with ReLU activation

Properties of the approximating true risk functions and their gradients

Proof of Proposition 2.3. Throughout this proof let
Observe that the assumption that for all $r \in \mathbb{N}$ it holds that $R_r \in C^1(\mathbb{R}, \mathbb{R})$ implies that for all $r \in \mathbb{N}$, $x \in \mathbb{R}$ we have that $R_r(x) = R_r(0) + \int_0^x (R_r)'(y) \, \mathrm{d}y$. This, the assumption that for all $x \in \mathbb{R}$ it holds that $\sup_{r \in \mathbb{N}} \sup_{y \in [-|x|,|x|]} |(R_r)'(y)| < \infty$, and the fact that $\sup_{r \in \mathbb{N}} |R_r(0)| < \infty$ prove that for all

This, the assumption that for all $r \in \mathbb{N}$ it holds that $R_r \in C^1(\mathbb{R}, \mathbb{R})$, the chain rule, and the dominated convergence theorem establish items (i) and (ii). Next note that for all $r \in \mathbb{N}$,

The fact that for all

and the dominated convergence theorem hence prove that $\lim_{r \to \infty} \mathcal{L}_r(\phi) = \mathcal{L}_\infty(\phi)$. This establishes item (iii). Moreover, observe that (11), the dominated convergence theorem, and the fact that

Next note that for all

and

Furthermore, observe that (11) shows that for all $r \in \mathbb{N}$,

The dominated convergence theorem and (13) hence prove that for all $i \in \{1, 2, \dots, H\}$, $j \in \{1, 2, \dots, d\}$ we have that

Moreover, note that (14), (15), and the dominated convergence theorem demonstrate that for all $i \in \{1, 2, \dots, H\}$, $j \in \{1, 2, \dots, d\}$ it holds that

Furthermore, observe that for all

In addition, note that (11) ensures that for all

This, (18), and the dominated convergence theorem demonstrate that for all $i \in \{1, 2, \dots, H\}$ it holds that

Combining this, (12), (16), and (17) establishes items (iv) and (v). The proof of Proposition 2.3 is thus complete.
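As a quick numerical sanity check of the convergences used in this proof, the following snippet evaluates the hypothetical approximating family sketched above together with its derivatives (an illustrative experiment, not part of the proof):

    import numpy as np

    def R(r, x):
        # C^1 approximation of the ReLU function from the sketch above
        return np.where(x <= 0, 0.0,
                        np.where(x < 1.0 / r, r * x**2 / 2, x - 1.0 / (2 * r)))

    def dR(r, x):
        # derivative min{max{r x, 0}, 1} of the approximation
        return np.clip(r * x, 0.0, 1.0)

    xs = np.linspace(-2.0, 2.0, 10001)
    for r in [1, 10, 100, 1000]:
        sup_err = np.abs(R(r, xs) - np.maximum(xs, 0.0)).max()
        print(r, sup_err, dR(r, 0.0))   # sup_err -> 0 and dR(r, 0) = 0 for all r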

Local Lipschitz continuity properties of the true risk functions
Furthermore, note that the fact that $K$ is compact ensures that there exists $\kappa \in [1, \infty)$ such that for all $\varphi \in K$ it holds that $\|\varphi\| \le \kappa$.
Note that (23) and (24) show that there exists $\mathfrak{L} \in \mathbb{R}$ which satisfies for all $\phi, \psi \in K$ that

Hence, we obtain that for all $\phi, \psi \in K$ it holds that

This, (24), (25), and the fact that for all

Combining this with (25) establishes (22). The proof of Lemma 2.4 is thus complete.
Proof of Corollary 2.6. Observe that Lemma 2.4 and the assumption that $K$ is compact ensure that $\sup_{\phi \in K} \mathcal{L}_\infty(\phi) < \infty$. This and Lemma 2.5 complete the proof of Corollary 2.6.

Upper estimates associated to Lyapunov functions
Proof of Lemma 2.7. Observe that the fact that for all $\phi \in \mathbb{R}^{\mathfrak d}$ it holds that $|\phi_{\mathfrak d} - 2\xi|^2 \ge 0$ assures that for all $\phi \in \mathbb{R}^{\mathfrak d}$ we have that

Furthermore, note that the fact that for all $x, y \in \mathbb{R}$ it holds that $(x - y)^2 \le 2(x^2 + y^2)$ ensures that for all $\phi \in \mathbb{R}^{\mathfrak d}$ it holds that

Combining this with (36) establishes (35). The proof of Lemma 2.7 is thus complete.

Proof of Proposition 2.8. Observe that the assumption that for all $\phi \in \mathbb{R}^{\mathfrak d}$ it holds that

Moreover, note that item (i) establishes item (ii). The proof of Proposition 2.8 is thus complete.

Proposition 2.9. Assume Setting 2.1 and let $\phi \in \mathbb{R}^{\mathfrak d}$. Then

Proof of Proposition 2.9. Observe that Proposition 2.8 demonstrates that

This and (10) imply that

Hence, we obtain that

This completes the proof of Proposition 2.9.
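Although the displayed estimates of Lemma 2.7 are not reproduced above, the comparison between the Lyapunov function and the parameter norm can be illustrated as follows (a sketch assuming the definition $V(\phi) = \|\phi\|^2 + |\phi_{\mathfrak d} - 2\xi|^2$ that is recalled in the proof of Lemma 3.8 below):
\[
\|\phi\|^2 \le V(\phi)
\qquad\text{and}\qquad
V(\phi) \le \|\phi\|^2 + 2\phi_{\mathfrak d}^2 + 8\xi^2 \le 3\|\phi\|^2 + 8\xi^2,
\]
where the first inequality uses $|\phi_{\mathfrak d} - 2\xi|^2 \ge 0$ and the second uses $(x-y)^2 \le 2(x^2 + y^2)$ with $x = \phi_{\mathfrak d}$ and $y = 2\xi$, together with $\phi_{\mathfrak d}^2 \le \|\phi\|^2$.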

Lyapunov type estimates for GD processes
Proof of Lemma 2.12. Throughout this proof let $e \in \mathbb{R}^{\mathfrak d}$ satisfy $e = (0, 0, \dots, 0, 1)$ and let $g \colon \mathbb{R} \to \mathbb{R}$ satisfy for all $t \in \mathbb{R}$ that

Observe that (47) and the fundamental theorem of calculus prove that

Corollary 2.10 hence demonstrates that

Proposition 2.8 therefore proves that

Hence, we obtain that

The proof of Lemma 2.12 is thus complete.
Proof of Corollary 2.13. Note that Lemma 2.5 and Lemma 2.7 demonstrate that

Lemma 2.12 therefore shows that

The proof of Corollary 2.13 is thus complete.

Corollary 2.14. Assume Setting 2.1, assume for all
Proof of Corollary 2.14. Observe that Corollary 2.13 establishes (55). The proof of Corollary 2.14 is thus complete.
Proof of Lemma 2.15. Throughout this proof let $g \in \mathbb{R}$ satisfy $g = \sup_{n \in \mathbb{N}_0} \gamma_n$. We now prove (56) by induction on $n \in \mathbb{N}_0$. Note that Corollary 2.14 and the fact that $\gamma_0 \le g$ imply that

This establishes (56) in the base case $n = 0$. For the induction step let $n \in \mathbb{N}$ satisfy for all $m \in \{0, 1, \dots, n-1\}$ that

Observe that (58) shows that $V(\Theta_n) \le V(\Theta_{n-1}) \le \dots \le V(\Theta_0)$. The fact that $\gamma_n \le g$ and Corollary 2.14 hence demonstrate that

Induction therefore establishes (56). The proof of Lemma 2.15 is thus complete.

Convergence analysis for GD processes in the training of ANNs
Proof of Theorem 2.16. Throughout this proof let $\eta \in (0, \infty)$ satisfy $\eta = 8(1 - [\sup_{n \in \mathbb{N}_0} \gamma_n][\mathbf{a}^2(d+1) V(\Theta_0) + 1])$ and let $\varepsilon \in \mathbb{R}$ satisfy $\varepsilon = \frac{1}{3}[\min\{1, \limsup_{n \to \infty} \mathcal{L}_\infty(\Theta_n)\}]$. Note that Lemma 2.15 implies that for all $n \in \mathbb{N}_0$ we have that $V(\Theta_n) \le V(\Theta_{n-1}) \le \dots \le V(\Theta_0)$. Combining this and the fact that for all $n \in \mathbb{N}_0$ it holds that $\|\Theta_n\| \le [V(\Theta_n)]^{1/2}$ establishes item (i). Next observe that Lemma 2.15 implies for all $N \in \mathbb{N}$ that

Hence, we have that

This and the assumption that $\sum_{n=0}^\infty \gamma_n = \infty$ ensure that $\liminf_{n \to \infty} \mathcal{L}_\infty(\Theta_n) = 0$. We intend to complete the proof of item (ii) by a contradiction. In the following we thus assume that

Note that (62) implies that

This shows that there exist $(m_k, n_k) \in \mathbb{N}^2$, $k \in \mathbb{N}$, which satisfy for all $k \in \mathbb{N}$ that $m_k < n_k < m_{k+1}$, $\mathcal{L}_\infty(\Theta_{m_k}) > 2\varepsilon$, and $\mathcal{L}_\infty(\Theta_{n_k}) < \varepsilon \le \min_{j \in \mathbb{N} \cap [m_k, n_k)} \mathcal{L}_\infty(\Theta_j)$. Observe that (61) and the fact that for all $k \in \mathbb{N}$, $j \in \mathbb{N} \cap [m_k, n_k)$ it holds that $1 \le \tfrac{1}{\varepsilon} \mathcal{L}_\infty(\Theta_j)$

Next note that Corollary 2.6 and item (i) ensure that there exists $C \in \mathbb{R}$ which satisfies that

Observe that the triangle inequality, (64), and (65) prove that

Moreover, note that Lemma 2.4 and item (i) demonstrate that there exists $\mathfrak{L} \in \mathbb{R}$ which satisfies for all $m, n \in \mathbb{N}_0$ that $|\mathcal{L}_\infty(\Theta_m) - \mathcal{L}_\infty(\Theta_n)| \le \mathfrak{L} \|\Theta_m - \Theta_n\|$. This and (66) show that

Combining this and the fact that for all $k \in \mathbb{N}_0$ it holds that $\mathcal{L}_\infty(\Theta_{n_k}) < \varepsilon < 2\varepsilon < \mathcal{L}_\infty(\Theta_{m_k})$ ensures that

This contradiction establishes item (ii). The proof of Theorem 2.16 is thus complete.
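Schematically, the contradiction at the end of this proof can be summarized as follows (a compact restatement of the argument above):
\[
\varepsilon \le \mathcal{L}_\infty(\Theta_{m_k}) - \mathcal{L}_\infty(\Theta_{n_k})
\le \mathfrak{L} \, \|\Theta_{m_k} - \Theta_{n_k}\| \longrightarrow 0 \quad (k \to \infty),
\]
which is impossible since $\varepsilon > 0$; here the convergence $\|\Theta_{m_k} - \Theta_{n_k}\| \to 0$ along the excursion intervals $[m_k, n_k)$ is obtained from the estimates (64) through (66) above.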

Convergence of stochastic gradient descent (SGD) processes
In this section we establish in Theorem 3.12 in Subsection 3.6 below that the true risks of SGD processes converge to zero in the training of ANNs with ReLU activation if the target function under consideration is a constant. In this section we thereby transfer the convergence analysis for GD processes from Section 2 above to a convergence analysis for SGD processes. Theorem 3.12 in Subsection 3.6 postulates the mathematical setup in Setting 3.1 in Subsection 3.1 below. In Setting 3.1 we formally introduce, among other things, the constant $\xi \in \mathbb{R}$ with which the target function coincides, the realization functions $\mathcal{N}^\phi_\infty \colon \mathbb{R}^d \to \mathbb{R}$, $\phi \in \mathbb{R}^{\mathfrak d}$, of the considered ANNs (see (70) in Setting 3.1), the true risk function $\mathcal{L} \colon \mathbb{R}^{\mathfrak d} \to \mathbb{R}$, the sizes $M_n \in \mathbb{N}$, $n \in \mathbb{N}_0$, of the mini-batches employed in the SGD optimization method, the empirical risk functions $\mathcal{L}^n_\infty \colon \mathbb{R}^{\mathfrak d} \times \Omega \to \mathbb{R}$, $n \in \mathbb{N}_0$, a sequence of smooth approximations $R_r \colon \mathbb{R} \to \mathbb{R}$, $r \in \mathbb{N}$, of the ReLU activation function (see (69) in Setting 3.1), the learning rates $\gamma_n \in [0, \infty)$, $n \in \mathbb{N}_0$, used in the SGD optimization method, the appropriately generalized gradient functions $\mathcal{G}^n = (\mathcal{G}^n_1, \dots, \mathcal{G}^n_{\mathfrak d}) \colon \mathbb{R}^{\mathfrak d} \times \Omega \to \mathbb{R}^{\mathfrak d}$, $n \in \mathbb{N}_0$, associated to the empirical risk functions, as well as the SGD process $\Theta = (\Theta_n)_{n \in \mathbb{N}_0} \colon \mathbb{N}_0 \times \Omega \to \mathbb{R}^{\mathfrak d}$.

Items (ii) and (iii) in Theorem 3.12 in Subsection 3.6 below prove that the true risk $\mathcal{L}(\Theta_n)$ of the SGD process $\Theta \colon \mathbb{N}_0 \times \Omega \to \mathbb{R}^{\mathfrak d}$ converges in the almost sure and $L^1$-sense to zero as the number of stochastic gradient descent steps $n \in \mathbb{N}$ increases to infinity. In our proof of Theorem 3.12 we proceed, very roughly speaking, by transferring the results of Section 2 to the SGD setting: Proposition 3.2 in Subsection 3.2 below transfers Proposition 2.3 in Subsection 2.3 above to the SGD setting, Lemma 3.6 in Subsection 3.4 below transfers Lemma 2.5 in Subsection 2.5 above to the SGD setting, Lemma 3.7 in Subsection 3.4 below transfers Corollary 2.6 in Subsection 2.5 above to the SGD setting, Lemma 3.8 in Subsection 3.5 below transfers Corollary 2.10 in Subsection 2.6 above to the SGD setting, Lemma 3.9 in Subsection 3.5 below transfers Lemma 2.12 in Subsection 2.7 above to the SGD setting, Lemma 3.10 in Subsection 3.5 below transfers Corollary 2.14 in Subsection 2.7 above to the SGD setting, Corollary 3.11 in Subsection 3.5 below transfers Lemma 2.15 in Subsection 2.7 above to the SGD setting, and Theorem 3.12 in Subsection 3.6 below transfers Theorem 2.16 in Subsection 2.8 above to the SGD setting.

Setting 3.1. Let $d, H, \mathfrak{d} \in \mathbb{N}$, $\xi, a, \mathbf{a} \in \mathbb{R}$, $b \in (a, \infty)$ satisfy $\mathfrak{d} = dH + 2H + 1$ and $\mathbf{a} = \max\{|a|, |b|, 1\}$, let $R_r \colon \mathbb{R} \to \mathbb{R}$, $r \in \mathbb{N} \cup \{\infty\}$, satisfy for all $x \in \mathbb{R}$ that $\bigl( \cup_{r \in \mathbb{N}} \{R_r\} \bigr) \subseteq C^1(\mathbb{R}, \mathbb{R})$, $R_\infty(x) = \max\{x, 0\}$, and

Proof of Proposition 3.3.
Observe that the assumption that $X_{n,m} \colon \Omega \to [a,b]^d$, $n, m \in \mathbb{N}_0$, are i.i.d. random variables ensures that for all $n \in \mathbb{N}_0$, $\phi \in \mathbb{R}^{\mathfrak d}$ it holds that

The proof of Proposition 3.3 is thus complete.
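Since the displayed formulas of Setting 3.1 and Proposition 3.3 did not survive above, we record for orientation the presumable form of the objects involved (a sketch consistent with the surrounding statements, not a verbatim quotation): the mini-batch empirical risks average squared errors over the batch,
\[
\mathcal{L}^n_r(\phi, \omega) = \frac{1}{M_n} \sum_{m=1}^{M_n} \bigl( \mathcal{N}^\phi_r(X_{n,m}(\omega)) - \xi \bigr)^2,
\qquad n \in \mathbb{N}_0, \ r \in \mathbb{N} \cup \{\infty\},
\]
so that the i.i.d. assumption yields the unbiasedness identity $\mathbb{E}\bigl[\mathcal{L}^n_\infty(\phi)\bigr] = \mathcal{L}(\phi)$ for all $n \in \mathbb{N}_0$, $\phi \in \mathbb{R}^{\mathfrak d}$.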
Lemma 3.4. Assume Setting 3.1 and let $\mathcal{F}_n \subseteq \mathcal{F}$, $n \in \mathbb{N}_0$, satisfy for all $n \in \mathbb{N}$ that $\mathcal{F}_0 = \sigma(\Theta_0)$ and $\mathcal{F}_n = \sigma\bigl( \Theta_0, (X_{k,m})_{(k,m) \in (\mathbb{N}_0 \cap [0,n)) \times \mathbb{N}_0} \bigr)$. Then

(i) it holds for all $n \in \mathbb{N}_0$ that $\mathbb{R}^{\mathfrak d} \times \Omega \ni (\phi, \omega) \mapsto \mathcal{G}^n(\phi, \omega) \in \mathbb{R}^{\mathfrak d}$ is $(\mathcal{B}(\mathbb{R}^{\mathfrak d}) \otimes \mathcal{F}_{n+1})/\mathcal{B}(\mathbb{R}^{\mathfrak d})$-measurable,

(ii) it holds for all $n \in \mathbb{N}_0$ that $\Theta_n$ is $\mathcal{F}_n/\mathcal{B}(\mathbb{R}^{\mathfrak d})$-measurable, and

(iii) it holds for all $m, n \in \mathbb{N}_0$ that $\sigma(X_{n,m})$ and $\mathcal{F}_n$ are independent.
Proof of Lemma 3.4. Note that Lemma 2.4 and (72) prove that for all $n \in \mathbb{N}_0$, $r \in \mathbb{N}$, $\omega \in \Omega$ it holds that $\mathbb{R}^{\mathfrak d} \ni \phi \mapsto (\nabla_\phi \mathcal{L}^n_r)(\phi, \omega) \in \mathbb{R}^{\mathfrak d}$ is continuous. Furthermore, observe that (72) and the fact that for all $n, m \in \mathbb{N}_0$ it holds that $X_{n,m}$ is

This and, e.g., [5, Lemma 2.4] show that for all $n \in \mathbb{N}_0$, $r \in \mathbb{N}$ it holds that

Combining this with item (ii) in Proposition 3.2 demonstrates that for all $n \in \mathbb{N}_0$ it holds that

is $(\mathcal{B}(\mathbb{R}^{\mathfrak d}) \otimes \mathcal{F}_{n+1})/\mathcal{B}(\mathbb{R}^{\mathfrak d})$-measurable. This establishes item (i). In the next step we prove item (ii) by induction on $n \in \mathbb{N}_0$. Note that the fact that $\mathcal{F}_0 = \sigma(\Theta_0)$ ensures that $\Theta_0$ is $\mathcal{F}_0/\mathcal{B}(\mathbb{R}^{\mathfrak d})$-measurable. For the induction step let $n \in \mathbb{N}_0$ satisfy that $\Theta_n$ is $\mathcal{F}_n/\mathcal{B}(\mathbb{R}^{\mathfrak d})$-measurable. Observe that item (i) and the fact that $\mathcal{F}_n \subseteq \mathcal{F}_{n+1}$ ensure that $\mathcal{G}^n(\Theta_n)$ is $\mathcal{F}_{n+1}/\mathcal{B}(\mathbb{R}^{\mathfrak d})$-measurable. Combining this, the fact that $\mathcal{F}_n \subseteq \mathcal{F}_{n+1}$, and the assumption that $\Theta_{n+1} = \Theta_n - \gamma_n \mathcal{G}^n(\Theta_n)$ demonstrates that $\Theta_{n+1}$ is $\mathcal{F}_{n+1}/\mathcal{B}(\mathbb{R}^{\mathfrak d})$-measurable. Induction thus establishes item (ii). Next note that the assumption that $X_{n,m}$, $n, m \in \mathbb{N}_0$, are independent and the assumption that $\Theta_0$ and $(X_{n,m})_{(n,m) \in (\mathbb{N}_0)^2}$ are independent establish item (iii). The proof of Lemma 3.4 is thus complete.
Hence, we obtain that for all $n \in \mathbb{N}_0$ it holds that $\mathcal{L}^n_\infty(\Theta_n) = \mathcal{L}^n(X_{n,1}, \dots, X_{n,M_n}, \Theta_n)$.
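Together with Lemma 3.4 above, this factorization is what makes the conditional-expectation computation behind the Lyapunov estimates work; schematically (a heuristic consequence of the sketched empirical risk, assuming the form displayed after Proposition 3.3 above):
\[
\mathbb{E}\bigl[ \mathcal{L}^n_\infty(\Theta_n) \,\big|\, \mathcal{F}_n \bigr] = \mathcal{L}(\Theta_n) \qquad \mathbb{P}\text{-a.s.},
\]
since $\Theta_n$ is $\mathcal{F}_n$-measurable (item (ii) in Lemma 3.4) and the batch $(X_{n,m})_{m \in \mathbb{N}}$ is independent of $\mathcal{F}_n$ (item (iii) in Lemma 3.4).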
Lemma 3.7. Assume Setting 3.1 and let $K \subseteq \mathbb{R}^{\mathfrak d}$ be compact. Then

Proof of Lemma 3.7. Observe that Lemma 2.4 proves that there exists $C \in \mathbb{R}$ which satisfies for all $\phi \in K$ that $\sup_{x \in [a,b]^d} |\mathcal{N}^\phi_\infty(x)| \le C$. The fact that for all $n, m \in \mathbb{N}_0$, $\omega \in \Omega$ it holds that $X_{n,m}(\omega) \in [a,b]^d$ hence establishes that for all $n \in \mathbb{N}_0$, $\phi \in K$, $\omega \in \Omega$ we have that

Combining this and Lemma 3.6 completes the proof of Lemma 3.7.

Lyapunov type estimates for SGD processes
Proof of Lemma 3.8. Note that the fact that $V(\phi) = \|\phi\|^2 + |\phi_{\mathfrak d} - 2\xi|^2$ ensures that

This and (73) imply that

Hence, we obtain that

The proof of Lemma 3.8 is thus complete.
Proposition 2.8 therefore proves that

Hence, we obtain that

The proof of Lemma 3.9 is thus complete.
Proof of Lemma 3.10. Note that Lemma 2.7 and Lemma 3.6 prove that for all $n \in \mathbb{N}_0$ it holds that

Lemma 3.9 hence demonstrates that for all $n \in \mathbb{N}_0$ it holds that

The proof of Lemma 3.10 is thus complete.
Then it holds for all $n \in \mathbb{N}_0$ that

Proof of Corollary 3.11. Throughout this proof let $g \in \mathbb{R}$ satisfy $g = \sup_{n \in \mathbb{N}_0} \gamma_n$. We now prove (99) by induction on $n \in \mathbb{N}_0$. Observe that Lemma 3.10 and the fact that $\gamma_0 \le g$ imply that it holds $\mathbb{P}$-a.s. that

This establishes (99) in the base case $n = 0$. For the induction step let $n \in \mathbb{N}$ satisfy that for all $m \in \{0, 1, \dots, n-1\}$ it holds $\mathbb{P}$-a.s. that

Note that (101) shows that it holds $\mathbb{P}$-a.s. that $V(\Theta_n) \le V(\Theta_{n-1}) \le \dots \le V(\Theta_0)$. The fact that $\gamma_n \le g$ and Lemma 3.10 hence demonstrate that it holds $\mathbb{P}$-a.s. that

Induction therefore establishes (99). The proof of Corollary 3.11 is thus complete.
This contradiction proves that $\limsup_{n \to \infty} \mathcal{L}(\Theta_n(\omega)) = 0$. This and the fact that $\mathbb{P}(A) = 1$ establish item (ii). Next note that item (i) and the fact that $\mathcal{L}$ is continuous show that there exists $C \in \mathbb{R}$ which satisfies that $\mathbb{P}\bigl( \sup_{n \in \mathbb{N}_0} \mathcal{L}(\Theta_n) \le C \bigr) = 1$. This, item (ii), and the dominated convergence theorem establish item (iii). The proof of Theorem 3.12 is thus complete.
Theorem 3.12 hence establishes items (i), (ii), and (iii). The proof of Corollary 3.13 is thus complete.