Beyond the Bakushinskii veto: Regularising linear inverse problems without knowing the noise distribution

This article deals with the solution of an ill-posed equation $K\hat{x}=\hat{y}$ for a given compact linear operator $K$ on separable Hilbert spaces. Often, one only has a corrupted version $y^{\delta}$ of $\hat{y}$ at hand, and the Bakushinskii veto tells us that we are not able to solve the equation if we do not know the noise level $\| \hat{y}-y^{\delta}\|$. But in applications it is unrealistic to know the error of a measurement a priori. In practice, the error of a measurement is usually estimated by averaging multiple measurements. In this paper, we integrate what is probably the most natural approach to this into our analysis, ending up with a scheme that allows one to solve the ill-posed equation without any specific assumption on the error distribution of the measurement. More precisely, we consider multiple noisy measurements $Y_1,\dots,Y_n$ of the true value $\hat{y}$. Furthermore, assuming that the noisy measurements are unbiased and independently and identically distributed according to an unknown distribution, the natural approach would be to use $(Y_1+\dots+Y_n)/n$ as an approximation to $\hat{y}$ with the estimated error $s_n/\sqrt{n}$, where $s_n$ is an estimate of the standard deviation of one measurement. We study whether and in what sense this natural approach converges. In particular, we show that using the discrepancy principle yields, in a certain sense, optimal convergence rates.


Introduction
The goal is to solve the ill-posed equation $Kx=\hat{y}$, where $K:\mathcal{X}\to\mathcal{Y}$ is a compact operator between separable infinite-dimensional Hilbert spaces. We do not know the right-hand side $\hat{y}$ exactly, but we are given several measurements $Y_1,Y_2,\dots$ of it, which are independent, identically distributed and unbiased ($\mathbb{E}Y_i=\hat{y}$) random variables. The measurements naturally lead to an estimator of $\hat{y}$, namely the sample mean
$$\bar{Y}_n := \frac{1}{n}\sum_{i\le n} Y_i.$$
But, in general, $K^+\bar{Y}_n \not\to K^+\hat{y}$ for $n\to\infty$, because the generalised inverse (section 6.2, Definition 6) of a compact operator is not continuous. So the inverse is replaced with a family of continuous approximations $(R_\alpha)_{\alpha>0}$, called a regularisation. The regularisation parameter $\alpha$ has to be chosen according to the data $\bar{Y}_n$ and the true data error $\delta_n^{true} := \|\bar{Y}_n-\hat{y}\|$, which is also a random variable. Since $\hat{y}$ is unknown, $\delta_n^{true}$ is also unknown and has to be guessed. A natural guess is
$$\delta_n^{est} := \frac{s_n}{\sqrt{n}},$$
where either $s_n=1$ is constant or $s_n=\sqrt{\sum_{i\le n}\|Y_i-\bar{Y}_n\|^2/(n-1)}$ is the square root of the sample variance. The natural approach is now to use a (deterministic) regularisation method together with $\bar{Y}_n$ and $\delta_n^{est}$. We are in particular interested in the discrepancy principle, which is known to provide optimal convergence rates in the classical deterministic setting. The following main result states that, in a certain sense, the natural approach converges and asymptotically yields the optimal deterministic rates.
Moreover, if $K^+\hat{y}$ satisfies $\|K^+\hat{y}\|_\nu\le\rho$ for $\rho>0$ and $0<\nu<\nu_0$ (section 6.2, Definitions 11 and 16), then for all $\varepsilon>0$,
Moreover, it is shown that the approach does not yield convergence in $L^2$ for the discrepancy principle, but it does for a priori regularisation. We also briefly discuss how one has to estimate the error to obtain almost sure convergence.
The main difference to existing results is that we make no assumptions on the error distribution. The natural approach can easily be used by anyone who can measure multiple times. Stochastic inverse problems are an active field of research with close ties to high-dimensional statistics ([14], [13], [25]). In general, there are two approaches to tackle an ill-posed problem with stochastic noise. The Bayesian setting considers the solution of the problem itself as a random quantity, on which one has some a priori knowledge (see [18]). This is opposed to the frequentist setting, where the inverse problem is assumed to have a deterministic, exact solution ([7], [5]). We are working in the frequentist setting, but compared to the existing literature we stay much closer to the classical deterministic theory of linear inverse problems ([10], [26], [27]). In the existing literature the error is often modelled as a Hilbert space process (for example Gaussian white noise); that is, in contrast to our more classical error model, it is not an element of the Hilbert space itself. Still, additional distribution-specific assumptions are usually made. This more general error model makes it impossible to determine the regularisation parameter through the discrepancy principle [24], which is known to work optimally in the classical deterministic setting. Instead, typical methods to determine the regularisation parameter are cross validation [29], Lepski's balancing principle [23] or penalised empirical risk minimisation [8]. Methods motivated by the discrepancy principle were studied recently ([21], [6], [22]). In the frequentist literature, the overwhelming aim is to prove (optimal) convergence in $L^2$ (often called convergence of the mean squared error) via so-called oracle inequalities. Another approach is to transfer results from the classical deterministic theory using the Ky-Fan metric, which metrises convergence in probability ([16], [12]).
Finally, aspects of the Bakushinskii veto for stochastic inverse problems are discussed in [3], [4] and [30] under assumptions on the noise distribution. It may be surprising that the natural approach has not been studied yet. We conclude the introduction with a brief look at the main difficulties, which are in our opinion the reason for this. By the strong law of large numbers (section 6.1, Theorem 10) for integrable independent random variables it holds that
$$\mathbb{P}\Big(\lim_{n\to\infty}\delta_n^{true}=0\Big)=1.$$
Then the definition of a regularisation method gives
In fact, by Bakushinskii's veto [2] it is not possible to achieve convergence without knowing an upper bound for the data error $\delta_n^{true}$. We briefly motivate why $\delta_n^{est}$ is a natural guess for $\delta_n^{true}$. In our independent setting, the central limit theorem (section 6.1, Theorem 9) gives us the asymptotic distribution of $\delta_n^{true}$. Roughly speaking, for large $n$,
$$\sqrt{n}\,\delta_n^{true}=\big\|\sqrt{n}(\bar{Y}_n-\hat{y})\big\|\approx\|Z\|,$$
where $Z$ is a centred Gaussian random variable with the same covariance structure as $Y_1$ (in particular, $\mathbb{E}\|Z\|^2=\mathbb{E}\|Y_1-\hat{y}\|^2$). Moreover, we can calculate the second moment of $\delta_n^{true}$ for any $n$:
$$\mathbb{E}\big(\delta_n^{true}\big)^2=\frac{\mathbb{E}\|Y_1-\hat{y}\|^2}{n}. \tag{4}$$
So, in some sense, $\delta_n^{true}\sim 1/\sqrt{n}$. But, for $n\to\infty$, $\sqrt{n}\,\delta_n^{true}$ converges (weakly) to the norm of a Gaussian random variable, while $\sqrt{n}\,\delta_n^{est}$ converges (almost surely) to a constant, see Figure 1. That is, the estimator $\delta_n^{est}$ often underestimates the true data error $\delta_n^{true}$. This is one reason why this naive approach is not widely used in the community. It is more popular to choose estimators which rather overestimate the true data error ([9], [28]). But we believe that many people in applications, especially those without a mathematical background, use the naive way presented here to estimate their data (error). In fact, the underestimation is only a minor problem for a priori regularisation, since it is clear from the deterministic theory that we are allowed to underestimate the error by a constant, and in our case we underestimate by a random constant.
On the other hand, using the discrepancy principle together with an underestimated error yields non-convergence in a general classical setting. It turns out that with our error model the problem of underestimating is compensated by the following: In deterministic regularisation one always considers the worst case error, that is, the noise may come from arbitrarily bad directions (i.e. it is supported mainly on eigenvectors corresponding to arbitrarily small eigenvalues). Here, our estimator $\bar{Y}_n$, when multiplied by $\sqrt{n}$, stabilises along (random) directions determined by $Z$ (and thus by the distribution of $Y_1$). In the following section we apply our approach to a priori regularisations and in the main part we consider the widely used discrepancy principle, which is known to work optimally in the classical deterministic theory. After that we briefly show how to choose $\delta_n^{est}$ to obtain almost sure convergence and we give some numerical demonstrations of the main Theorems 3 and 4. In section 6 we collect definitions and facts from probability theory and deterministic regularisation theory.

Fig. 1 Here we see the distributions of $\lim_{n\to\infty}\sqrt{n}\,\delta_n^{true}$ (blue) and $\lim_{n\to\infty}\sqrt{n}\,\delta_n^{est}$ (red and yellow), for the one-dimensional case and $\mathbb{E}|Y_1-\hat{y}|^2=1/\sqrt{2}$. The first one is a half-normal distribution, while the latter ones are in fact degenerate delta distributions, due to the almost sure convergence. Moreover, if we use the sample variance for $\delta_n^{est}$, we can quantify the asymptotic probability of underestimating the true data error $\delta_n^{true}$ with Lemma 4 (section 6.1). For large $n$ and any $\tau>0$ it holds that $\mathbb{P}(\tau\delta_n^{est}>\delta_n^{true})\approx 2\Phi(\tau)-1$, where $\Phi$ is the cumulative distribution function of a standard Gaussian. In particular, for $\tau=1$ we have that $\mathbb{P}(\delta_n^{est}<\delta_n^{true})\approx 0.32$, which equals the area under the blue graph to the right of the yellow line.
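The underestimation probability quoted in the caption of Fig. 1 can be checked numerically. The following is a minimal sketch for the one-dimensional case only; the uniform noise, the true value $\hat{y}=0$ and the helper names `prob_not_underestimating` and `empirical_prob` are illustrative assumptions, not part of the original analysis.

```python
import math
import numpy as np

def std_normal_cdf(t):
    """Cumulative distribution function of a standard Gaussian."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def prob_not_underestimating(tau):
    """Asymptotic value of P(tau * delta_est > delta_true) in one dimension."""
    return 2.0 * std_normal_cdf(tau) - 1.0

def empirical_prob(n, trials, tau=1.0, seed=0):
    """Monte Carlo estimate of P(tau * delta_est > delta_true).

    Assumption: scalar measurements with true value 0 and uniform noise.
    """
    rng = np.random.default_rng(seed)
    Y = rng.uniform(-0.5, 0.5, size=(trials, n))
    delta_true = np.abs(Y.mean(axis=1))             # |mean - true value|
    delta_est = Y.std(axis=1, ddof=1) / np.sqrt(n)  # s_n / sqrt(n)
    return float(np.mean(tau * delta_est > delta_true))

if __name__ == "__main__":
    print("asymptotic:", prob_not_underestimating(1.0))
    print("empirical: ", empirical_prob(n=200, trials=20000))
```

For $\tau=1$ the asymptotic value is $2\Phi(1)-1\approx 0.683$, so the error is underestimated in roughly 32% of the repetitions.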

The error model
We have in mind that one is able to measure the true value $\hat{y}$ multiple times and that the measurements are correct in expectation. This can be formally modelled as an independent and identically distributed sequence $Y_1,Y_2,\dots:\Omega\to\mathcal{Y}$ of random variables with values in $\mathcal{Y}$, such that $\mathbb{E}Y_1=\hat{y}\in D(K^+)$. In order to use the central limit theorem we require that $0<\mathbb{E}\|Y_1\|^2<\infty$. A justification for this model is given in Theorem 6 (section 6.1). Note that some popular statistical error models, for example Gaussian white noise, are not included, since we require each measurement to be an element of the Hilbert space.
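In a discretised setting, the sample mean and the estimated data error can be computed as follows. This is a minimal sketch that assumes the $n$ measurements are stored as the rows of a NumPy array and that the Euclidean norm stands in for the Hilbert space norm; the function name `sample_mean_and_error` is hypothetical.

```python
import numpy as np

def sample_mean_and_error(Y, estimate_variance=True):
    """Return (Y_bar, delta_est) for measurements Y of shape (n, m).

    delta_est = s_n / sqrt(n), where s_n is either 1 or the square root
    of the sample variance sum_i ||Y_i - Y_bar||^2 / (n - 1).
    """
    n = Y.shape[0]
    y_bar = Y.mean(axis=0)
    if estimate_variance:
        s_n = np.sqrt(np.sum((Y - y_bar) ** 2) / (n - 1))
    else:
        s_n = 1.0
    return y_bar, s_n / np.sqrt(n)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    y_hat = np.linspace(0.0, 1.0, 50)                     # assumed true data
    Y = y_hat + rng.uniform(-0.5, 0.5, size=(1000, 50))   # unbiased i.i.d. noise
    y_bar, delta_est = sample_mean_and_error(Y)
    print(delta_est, np.linalg.norm(y_bar - y_hat))
```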

A priori regularisation
Here we apply the above approach to a priori parameter choice strategies $\alpha(y^\delta,\delta)=\alpha(\delta)$. The deterministic theory suggests that one should choose $\delta_n^{est}=1/\sqrt{n}$, that is, not to estimate the variance. Otherwise it would no longer be an a priori regularisation method, since the sample variance depends on the data. This choice has the advantage that $\delta_n^{est}$ and hence $\alpha(\delta_n^{est})$ are deterministic. Since we also know that $\mathbb{E}(\delta_n^{true})^2=\mathbb{E}\|Y_1-\hat{y}\|^2/n$ (see equation (4)), it is natural to try to prove convergence of $\mathbb{E}\|R_{\alpha(\delta_n^{est})}\bar{Y}_n-K^+\hat{y}\|^2$. It turns out that the convergence proof goes through without any problems.
Theorem 1 (a priori regularisation) Assume that $K:\mathcal{X}\to\mathcal{Y}$ is a compact operator between separable Hilbert spaces and that $Y_1,$ $\longrightarrow 0$. Set $\bar{Y}_n:=\sum_{i\le n}Y_i/n$ and $\delta_n^{est}:=n^{-1/2}$. Then $\lim_{n\to\infty}\mathbb{E}\|R_{\alpha(\delta_n^{est})}\bar{Y}_n-K^+\hat{y}\|^2=0$.
Proof Note that in fact it suffices to assume $K$ to be bounded and linear. Remember
Therefore,
As in the deterministic case, under source conditions (section 6.2, Definition 11) we can prove convergence rates. It is common to consider regularisations induced by a filter (section 6.2, Definition 14). We make the following standard assumptions for the filter:
Remark 1 The generating filters of the following regularisation methods fulfil Assumption 1:
1. Tikhonov regularisation,
2. generalised Tikhonov regularisation,
3. truncated singular value regularisation,
4. Landweber iteration.
Theorem 2 (order of convergence for a priori regularisation) Assume that $K:\mathcal{X}\to\mathcal{Y}$ is a compact operator between separable Hilbert spaces and that $Y_1,Y_2,\dots$
Let $R_\alpha$ be induced by a filter fulfilling Assumption 1. Set $\bar{Y}_n:=\sum_{i\le n}Y_i/n$ and $\delta_n^{est}=n^{-1/2}$. Assume that for $0<\nu\le\nu_0$ and $\rho>0$ we have that $K^+\hat{y}\in\mathcal{X}_\nu$, $\|K^+\hat{y}\|_\nu\le\rho$. Then for
Proof We proceed similarly to the proof of Theorem 1, additionally using Propositions 2 and 3 of section 6.2.
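Remark 2 In the case of Theorem 1 one could alternatively argue as follows: The spaces $\mathbb{X}:=L^2(\Omega,\mathcal{X})=\{X:\Omega\to\mathcal{X}:\mathbb{E}\|X\|^2<\infty\}$ and $\mathbb{Y}:=L^2(\Omega,\mathcal{Y})$ are also Hilbert spaces, with scalar products $(X,\tilde{X})_{\mathbb{X}}:=\mathbb{E}(X,\tilde{X})_{\mathcal{X}}$ and $(\cdot,\cdot)_{\mathbb{Y}}$ defined similarly. The compact operator $K:\mathcal{X}\to\mathcal{Y}$ naturally induces a linear operator $\mathbb{K}:\mathbb{X}\to\mathbb{Y}$, $X\mapsto KX$. Clearly we have that $\hat{y}\in\mathbb{Y}$, and $(\bar{Y}_n)_n$ is a sequence in $\mathbb{Y}$ which fulfils
$$\|\bar{Y}_n-\hat{y}\|_{\mathbb{Y}}=\sqrt{\mathbb{E}\|\bar{Y}_n-\hat{y}\|^2}=\frac{\sqrt{\mathbb{E}\|Y_1-\hat{y}\|^2}}{\sqrt{n}},$$
and we can use the classical deterministic results for $\mathbb{K}:\mathbb{X}\to\mathbb{Y}$, $\bar{Y}_n$ and $\delta_n^{est}$. The same argument does not carry over directly to Theorem 2, since the induced operator $\mathbb{K}$ is not compact.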

The discrepancy principle
In practice the above parameter choice strategies are of limited interest, since they require knowledge of the abstract smoothness parameters $\nu$ and $\rho$. The classical discrepancy principle would be to choose $\alpha_n$ such that
$$\|(KR_{\alpha_n}-\mathrm{Id})\bar{Y}_n\|\approx\delta_n^{true}, \tag{5}$$
which is not possible because $\delta_n^{true}$ is unknown. So we replace it with our estimator $\delta_n^{est}$ and implement the discrepancy principle with Algorithm 1.
Algorithm 1 Discrepancy principle with estimated data error
1: Given measurements $Y_1,\dots,Y_n$;
2: Set $\bar{Y}_n:=\sum_{i\le n}Y_i/n$ and $\delta_n^{est}=s_n/\sqrt{n}$ with $s_n=1$ or $s_n=\sqrt{\sum_{i\le n}\|Y_i-\bar{Y}_n\|^2/(n-1)}$;

Algorithm 1 converges under two assumptions. The first one is also necessary in the classical deterministic setting. Equation (5) has a solution for all $\delta_n^{true}>0$ and $\hat{y}\in\mathcal{Y}$ if $K$ is injective. Since we replace $\delta_n^{true}$ with $\delta_n^{est}$, we have to ensure that $\delta_n^{est}\neq 0$. This may fail if we choose to use the sample variance, since it may happen that $Y_1=\dots=Y_n$. The assumption $\mathbb{E}\|Y_1-\hat{y}\|^2>0$ guarantees that, with probability 1, this happens only finitely many times. In any case, the distribution of $Y_1$ is usually absolutely continuous, which implies that $\mathbb{P}(Y_1=\dots=Y_n)=0$ for all $n\in\mathbb{N}$. Under these assumptions, at first glance, one may try to prove convergence in squared expectation, similar to the previous section. But here we have the problem that $\alpha_n$ is random, since it depends by definition on the random data. Indeed, we show that Algorithm 1 does not yield such a convergence. The rough idea is the following: with a diminishing probability $p$ we are underestimating the data error significantly, thus the discrepancy principle gives a way too small $\alpha$ and we still have $p\,\|R_\alpha\|\gg 1$. We give a small rigorous example. As a compact operator, $K$ admits a singular value decomposition (section 6.2, Definition 10), which we denote by $(\sigma_l,v_l,u_l)$.
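The parameter search of Algorithm 1 can be sketched as follows. This is a hedged illustration, not the authors' implementation: the geometric grid $\alpha=q^k$ starting at $\alpha=1$ is inferred from the later references to $q$, $R_1$ and $\alpha_n/q$ in the proofs, Tikhonov regularisation is used because the numerical section uses it, and the names `tikhonov`, `discrepancy_principle` and the guard `max_iter` are hypothetical.

```python
import numpy as np

def tikhonov(K, y, alpha):
    """Tikhonov regularised solution R_alpha y = (K^T K + alpha I)^{-1} K^T y."""
    m = K.shape[1]
    return np.linalg.solve(K.T @ K + alpha * np.eye(m), K.T @ y)

def discrepancy_principle(K, y_bar, delta_est, q=0.5, max_iter=200):
    """Decrease alpha = q^k until ||K R_alpha y_bar - y_bar|| <= delta_est.

    Assumption: 0 < q < 1 and the search starts at alpha = 1.
    max_iter is a safety guard that is not part of the algorithm itself.
    """
    alpha = 1.0
    x = tikhonov(K, y_bar, alpha)
    for _ in range(max_iter):
        if np.linalg.norm(K @ x - y_bar) <= delta_est:
            break
        alpha *= q
        x = tikhonov(K, y_bar, alpha)
    return alpha, x

if __name__ == "__main__":
    K = np.diag([1.0, 0.5, 0.1])
    y_bar = K @ np.ones(3)
    alpha_n, x_n = discrepancy_principle(K, y_bar, delta_est=0.05)
    print(alpha_n, x_n)
```

The returned $\alpha_n$ is the first grid point at which the residual drops below $\delta_n^{est}$, so the residual at $\alpha_n/q$ still exceeds $\delta_n^{est}$ whenever $\alpha_n<1$.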

A counter example for L 2 convergence
To simplify calculations we pick Gaussian noise and the truncated singular value regularisation and we set δ est n = 1/ √ n. We choose X := l 2 (N) with the standard basis {u k := (0, ..., 0, 1, 0, ...)} and consider the diagonal operator Hence the σ l = (1/100) l 2 are the Eigenvalues of K and We assume that the noise is distributed along y := ∑ l≥2 1/ l(l − 1)u l , so we have that ∑ l>n (y, u l ) 2 = 1/n and thus y ∈ l 2 (N). That is we setȲ ..n}, a (very unlikely) event on which we significantly underestimate the true data error. We get that P(Ω n ) := P(Z 1 ≥ 1) n ≥ 1/10 n . Moreover, by the definition of the discrepancy principle

It follows that
That is, the probability of the events $\Omega_n$ is not small enough to compensate for the huge error we have on these events, so in the end $\mathbb{E}\|R_{\alpha_n}\bar{Y}_n-K^+\hat{y}\|^2\to\infty$ for $n\to\infty$.

Convergence in probability of the discrepancy principle
We saw that, unlike in the a priori setting, we have no convergence in squared expectation. The following main theorem of the section assures convergence in probability. For convergence in probability it does not matter how large the error is on sets with diminishing probability. We again consider regularisations induced by a filter; compared to Assumption 1 we need an additional monotonicity property for the filter (section 6.2, Definition 15).
Assumption 2 $(F_\alpha)_{\alpha>0}$ is a regularising and monotone filter with
Still, the additional assumption is compatible with the prominent regularisation methods.
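Theorem 3 Assume that $K$ is a compact and injective operator between separable Hilbert spaces $\mathcal{X}$ and $\mathcal{Y}$ and that $Y_1,Y_2,\dots$ are i.i.d. $\mathcal{Y}$-valued random variables with $\mathbb{E}Y_1=\hat{y}\in D(K^+)$ and $0<\mathbb{E}\|Y_1-\hat{y}\|^2<\infty$. Let $R_\alpha$ be induced by a filter fulfilling Assumption 2. Applying Algorithm 1 yields a sequence $(\alpha_n)_n$. Then we have that for all $\varepsilon>0$
$$\lim_{n\to\infty}\mathbb{P}\big(\|R_{\alpha_n}\bar{Y}_n-K^+\hat{y}\|\le\varepsilon\big)=1.$$
Remark 4 If one tried to argue as in Remark 1 to show convergence in squared expectation, one would have to determine the regularisation parameter not as given by equation (5), but such that $\mathbb{E}\|(KR_\alpha-\mathrm{Id})\bar{Y}_n\|^2\approx\delta_n^{est}$, which is not practicable since we cannot calculate the expectation on the left-hand side.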
The popularity of the discrepancy principle is a result of the fact that it guarantees optimal convergence rates if the true solution fulfils some abstract smoothness conditions. More precisely, the general classical result is the following: Assuming that the true solution fulfils $\|K^+\hat{y}\|_{\mathcal{X}_\nu}\le\rho$, there is a constant $C>0$ such that
$$\|R_{\alpha(\delta,y^\delta)}y^\delta-K^+\hat{y}\|\le C\rho^{\frac{1}{\nu+1}}\delta^{\frac{\nu}{\nu+1}}. \tag{6}$$
The next theorem shows the analogous result for the natural approach: A bound similar to (6) holds with increasing probability, where $\delta^{\frac{\nu}{\nu+1}}$ is replaced with the maximum of $(\delta_n^{est})^{\frac{\nu}{\nu+1}}$ and $(\delta_n^{true})^{\frac{\nu}{\nu+1}}(\delta_n^{true}/\delta_n^{est})^{\frac{1}{\nu+1}}$. That is, with a probability tending to 1, if $\delta_n^{true}\le\delta_n^{est}$ the deterministic bound (6) holds with $\delta$ replaced by $\delta_n^{est}$. This is consistent, since it is no problem to overestimate the true data error. On the other hand, if one underestimates the data error, that is if $\delta_n^{true}>\delta_n^{est}$, the optimal bound (6) holds only up to a penalty factor $(\delta_n^{true}/\delta_n^{est})^{\frac{1}{\nu+1}}$, whose asymptotic distribution is governed by $\|Z\|$ for a Gaussian $Z$. Note that the bound is optimal if $\delta_n^{est}=\delta_n^{true}$, which gives a reason to estimate the sample variance. Note that in the deterministic setting, determining the regularisation parameter with some $\tilde{\delta}<\delta$ would yield non-convergence in general.
Theorem 4 (Discrepancy principle) Assume that $K$ is a compact and injective operator between separable Hilbert spaces $\mathcal{X}$ and $\mathcal{Y}$. Moreover, $Y_1,Y_2,\dots$ are i.i.d. $\mathcal{Y}$-valued random variables with $\mathbb{E}Y_1=\hat{y}\in D(K^+)$ and $0<\mathbb{E}\|Y_1-\hat{y}\|^2<\infty$. Let $R_\alpha$ be induced by a filter fulfilling Assumption 2. Moreover, assume that there is a $0<\nu\le\nu_0-1$ and a $\rho>0$ such that $K^+\hat{y}\in\mathcal{X}_\nu$ and $\|K^+\hat{y}\|_\nu\le\rho$. Applying Algorithm 1 yields a sequence $(\alpha_n)_n$. Then there is a constant $L$ such that
While we see here the similarities to the classical deterministic case, one may wish to have a deterministic bound on $\|R_{\alpha_n}\bar{Y}_n-K^+\hat{y}\|$ (for $n$ large). Because of the central limit theorem, we expect that the error behaves like $1/\sqrt{n}$. Indeed, we will see this rate asymptotically.
Corollary 2 Under the assumptions of Theorem 4, for all ε > 0 it holds that

Almost sure convergence
The results so far delivered either convergence in probability or convergence in $L^2$. We give a short remark on how one can obtain almost sure convergence. Roughly speaking, one has to multiply $\delta_n^{est}$ by a $\sqrt{\log\log n}$ term. This is a simple consequence of the following theorem.
Theorem 5 (Law of the iterated logarithm) Assume that $Y_1,Y_2,\dots$ is an i.i.d. sequence with values in some separable Hilbert space $\mathcal{Y}$. Moreover, assume that $\mathbb{E}Y_1=0$ and $\mathbb{E}\|Y_1\|^2<\infty$. Then we have that
$$\mathbb{P}\Big(\limsup_{n\to\infty}\frac{\|\sum_{i\le n}Y_i\|}{\sqrt{2n\log\log n}}\le\sqrt{\mathbb{E}\|Y_1\|^2}\Big)=1.$$
Proof This is a simple consequence of Corollary 8.8 in [20].
that is, with probability 1 it holds that $\delta_n^{true}\le\sqrt{2\,\mathbb{E}\|Y_1-\hat{y}\|^2\,\frac{\log\log n}{n}}$ for $n$ large enough. Consequently, for some $\tau>1$ the estimator should be $\delta_n^{est}:=\tau\sigma_n\sqrt{\frac{2\log\log n}{n}}$, where $\sigma_n$ is the square root of the sample variance. Since $\mathbb{P}(\lim_{n\to\infty}\sigma_n^2=\mathbb{E}\|Y_1-\hat{y}\|^2)=1$ and $\tau>1$, it holds that $\sqrt{\mathbb{E}\|Y_1-\hat{y}\|^2}\le\tau\sigma_n$ for $n$ large enough with probability 1, and thus $\delta_n^{true}\le\delta_n^{est}$ for $n$ large enough with probability 1. In other words, there is an event $\Omega_0\subset\Omega$ with $\mathbb{P}(\Omega_0)=1$ such that for any $\omega\in\Omega_0$ there is an $N(\omega)\in\mathbb{N}$ with $\delta_n^{true}(\omega)\le\delta_n^{est}(\omega)$ for all $n\ge N(\omega)$. So we can use $\bar{Y}_n$ and $\delta_n^{est}$ together with any deterministic regularisation method to get almost sure convergence.
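The inflated estimator described above is straightforward to compute. The following minimal sketch assumes discretised measurements stored as rows of a NumPy array; the function name `delta_est_almost_sure` and the default $\tau=1.1$ are illustrative choices, not values from the text.

```python
import numpy as np

def delta_est_almost_sure(Y, tau=1.1):
    """Estimator tau * sigma_n * sqrt(2 log log n / n) for Y of shape (n, m).

    sigma_n is the square root of the sample variance. Requires n >= 3,
    since log log n is positive only for n > e.
    """
    n = Y.shape[0]
    if n < 3:
        raise ValueError("need at least 3 measurements so that log log n > 0")
    if tau <= 1.0:
        raise ValueError("tau must be larger than 1")
    y_bar = Y.mean(axis=0)
    sigma_n = np.sqrt(np.sum((Y - y_bar) ** 2) / (n - 1))
    return tau * sigma_n * np.sqrt(2.0 * np.log(np.log(n)) / n)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Y = rng.uniform(-0.5, 0.5, size=(1000, 20))   # assumed noise, true value 0
    print(delta_est_almost_sure(Y), np.linalg.norm(Y.mean(axis=0)))
```

Compared with the plain guess $\sigma_n/\sqrt{n}$, the estimate is larger by the slowly growing factor $\tau\sqrt{2\log\log n}$, which is the price for almost sure convergence.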

Proofs of Theorems 3 and 4
We have to check that the inevitable underestimating of $\delta_n^{true}$ does not yield a too small regularisation parameter. In the classical proof one uses some bound $f$ given by the true data, $\|(KR_\alpha-\mathrm{Id})\hat{y}\|\le f(\alpha)$, to control the size of $\alpha$. Following these arguments we get
by the definition of the discrepancy principle. The classical worst case error bound (which is for example $\|KR_\alpha-\mathrm{Id}\|\le 1$ for the Tikhonov regularisation) would give $f(\alpha)\ge\delta_n^{est}-\delta_n^{true}$, that is, no control of $\alpha_n$ if $\delta_n^{true}>\delta_n^{est}$. But in our setting, although the noise is random, it does not come from arbitrarily bad directions; it stabilises via the central limit theorem. Roughly speaking, $\sqrt{n}(\bar{Y}_n-\hat{y})\approx Z$ for a Gaussian variable $Z$ and therefore $\sqrt{n}\,\|(KR_{\alpha_n}-\mathrm{Id})(\bar{Y}_n-\hat{y})\|\approx\|(KR_{\alpha_n}-\mathrm{Id})Z\|\to 0$ for $n\to\infty$, since $\alpha_n\to 0$. Thus for large $n$ we have that
for any constant $c>0$. To make this rigorous, we have to carefully decouple the two limits involved. This can be done by a monotonicity assumption on the filter (section 6.2, Definition 15), which is natural and which is fulfilled by all the standard filters used in practice. To prove convergence we introduce events $\Omega_{n,c}$ on which equation (7) holds. For $0<c<1$ we define
Here we need $\|(KR_1-\mathrm{Id})\bar{Y}_n\|>\delta_n^{est}$ to guarantee that Algorithm 1 terminates with a $k>0$, which implies that $\|(KR_{\alpha_n/q}-\mathrm{Id})\bar{Y}_n\|>\delta_n^{est}$. We first have to treat the special case where $K^+\hat{y}=R_\alpha\hat{y}$ for $\alpha$ small enough and then we show that $\mathbb{P}(\Omega_{n,c})\to 1$ for $n\to\infty$ otherwise. First note that the monotonicity implies that
for all $\alpha\le\beta$ and $y\in\mathcal{Y}$, since
We begin with the special case, where the problem is in fact well-posed. Here one may already see the key ideas of the proof of the main Lemma 2.
Lemma 1 Assume that it holds that there is an a 0 such that R αŷ = K +ŷ for all α ≤ a 0 (this may happen ifŷ has a finite expression in terms of the {u l } l∈N and if R α is the truncated singular value regularisation). Then, for any sequence (x n ) n converging monotonically to 0, it holds that Proof We need to control (α n ) n , so the first step is to show that P(α n ≥ qx n ) → 1, where q is defined in Algorithm 1. For x n ≤ a 0 it holds that (KR x n − Id)ŷ = 0. So, for n large enough and m ≤ n because of equation (8) and (9). Using the central limit theorem Finally, by Portemanteau's lemma (section 6.1, Lemma 3) for all m ∈ N. By pointwise convergence we have that Now we set Ω n := α n ≥ qx n , δ est n ≤ δ true n /x n }.
So equation (13) and (14) yield (for n large enough) for n → ∞ and hence P(Ω n ) → 1 for n → ∞. Moreover, Because the above holds by assumption for all α ≤ a 0 , the boundedness of F α implies that there is a L ∈ N, such that (ŷ, u l ) = 0 for all l ≥ L. So We deduce that for some monotonically to 0 converging sequence (x n ) n∈N . After redefining x n := x n , lim n→∞ P R α nȲ n − K +ŷ ≤ δ true n /x n ≥ lim n→∞ P(χ Ω n ) = 1.

Now we can formulate the central lemma.
Lemma 2 Assume that $R_\alpha\hat{y}\neq K^+\hat{y}$ for all $\alpha>0$. Then it holds that $\mathbb{P}(\Omega_{n,c})\to 1$ for $n\to\infty$, for all $0<c<1$.
Proof As in the special case, we need a deterministic bound for $\alpha_n$ (with high probability) to separate the two limits. By the strong law of large numbers we have that $\mathbb{P}(\lim_{n\to\infty}\bar{Y}_n=\hat{y})=1$. In particular, for all $k\in\mathbb{N}$ we have that
So we define
Thus we have that $1\le K_1\le K_2\le\dots$ (note that $K_n=\infty$ is possible). It holds that
Moreover, again by the strong law of large numbers, we have that
where $\gamma$ is either $1$ or $\sqrt{\mathbb{E}\|Y_1-\hat{y}\|^2}$, depending on whether we use the sample variance or not.
Now we follow the ideas of the proof of Lemma 1(equation (10) to (14)). For m ≤ n We conclude that where we used the monotonicity of P in line (26), the monotinicity of F α in line (27), the subadditivity of P in line (29) and equation (20) and (22) to (24) in line (30). The same argumentation shows that lim n→∞ P(Ω 3 n,c ) = 1 too. From the inequality (21) it follows that for n large enough Proof (Theorem 3) By Lemma 1, it suffices to consider the case where R αŷ = K +ŷ for all α > 0. Then, by Lemma 2, lim n→∞ P(Ω n,c ) = 1. So From lim n→∞ α n χ Ω n,c = 0 for n → ∞ it follows that nχ Ω n,c → 0 for n → ∞, because of Slutsky's theorem (section 6.1, Theorem 8). Since P( √ nδ est n = γ) = 1 with γ = 1 or γ = E Y 1 −ŷ 2 , it suffices in fact to show R α n δ est n χ Ω n,c → 0. By definition of Ω n,c we have that α n χ Ω n,c < 1 for n large enough and therefore So by Proposition 2, Note that the qualification ν 0 of (F α ) α is greater than 1. That is, there exist constants Let ε > 0 be arbitrary. There is an N ∈ N such that ∑ l>N (x, v l ) 2 < ε/2C 2 1 , so for α n /q = α small enough (that is n large enough) Thus lim n→∞ R α n δ est n χ Ω n,c = 0 also. Together with (31) it follows that

Proof (Theorem 4) We split
Because K is injective it holds that KK +ŷ =ŷ, so by Proposition 3 Therefore by definition of Ω n,c we get

Now we treat the second term. Proposition 2 yields
By Proposition 3 we have that for α n < 1 By the definition of Ω n,c , It follows that Putting it all together R α nȲ n − K +ŷ χ Ω n,c ≤ R α nŷ − K +ŷ χ Ω n,c + R α nȲ n − R α nŷ χ Ω n,c .

Numerical demonstration
In this section we give examples for each of the Theorems 3 and 4. We state the idealised infinite-dimensional problems. The operators $K$ in the examples will be Fredholm integral operators of the first kind and hence compact. In the numerical simulation we use high-dimensional discretisations. The chosen regularisation method is Tikhonov regularisation together with the discrepancy principle. The large discretised systems of linear equations are solved iteratively using the conjugate gradient method. In order to use Tikhonov regularisation we need the adjoint $K^*$. As a discretisation for this we use $(K_{disc})^*$, that is, the conjugate transpose of our discretisation $K_{disc}$ of $K$. Another option would be to discretise $K^*$ directly, but then $K_{disc}(K^*)_{disc}$ would not be symmetric and positive semidefinite.
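A Tikhonov step of this kind can be sketched by applying the conjugate gradient method to the normal equations $(K_{disc}^*K_{disc}+\alpha\,\mathrm{Id})x=K_{disc}^*y$. The code below is an illustrative sketch for a real matrix (so the adjoint is the transpose), not the authors' code; the function name `tikhonov_cg` and the tolerance defaults are assumptions.

```python
import numpy as np

def tikhonov_cg(K, y, alpha, tol=1e-10, max_iter=1000):
    """Solve (K^T K + alpha I) x = K^T y with the conjugate gradient method.

    Using the transpose of the discretised operator keeps the system
    matrix symmetric and positive definite for alpha > 0.
    """
    def apply_A(v):
        return K.T @ (K @ v) + alpha * v

    b = K.T @ y
    x = np.zeros(K.shape[1])
    b_norm = np.linalg.norm(b)
    if b_norm == 0.0:
        return x
    r = b.copy()          # residual b - A x for the start value x = 0
    p = r.copy()          # first search direction
    rs = r @ r
    for _ in range(max_iter):
        if np.sqrt(rs) <= tol * b_norm:
            break
        Ap = apply_A(p)
        step = rs / (p @ Ap)
        x += step * p
        r -= step * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    K = rng.standard_normal((20, 10))
    y = rng.standard_normal(20)
    print(tikhonov_cg(K, y, alpha=0.1))
```

Only matrix-vector products with $K_{disc}$ and its transpose are needed, which is what makes the iterative approach attractive for large discretisations.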

Differentiation of a binary option
A natural example arises if the data is acquired by a Monte Carlo simulation; here we consider an example from mathematical finance. The buyer of a binary option receives a payoff $Q$ after $T$ days if a certain stock price $S_T$ is then higher than the strike value $K$. Otherwise he gets nothing. Thus the value $V$ of the binary option depends on the expected evolution of the stock price. We denote by $r$ the risk-free rate, at which we could have invested the buying price of the option until the expiry date $T$. If we already knew for sure today that the stock price will hit the strike (insider information), we would pay $V=e^{-rT}Q$ for the binary option ($e^{-rT}$ is called the discount factor). Otherwise, if we believed that the stock price will hit the strike with probability $p$, we would pay $V=e^{-rT}Qp$. In the Black-Scholes model one assumes that the relative change of the stock price in a short time interval is normally distributed, that is $(S_{t+\delta t}-S_t)/S_t\sim N(\mu\delta t,\sigma^2\delta t)$. Under this assumption one can show that (see [17])
$$S_T=S_0e^{sT},$$
where $S_0$ is the initial stock price and $s\sim N(\mu-\sigma^2/2,\sigma^2/T)$. Under these assumptions one has $V=e^{-rT}Q\Phi(d)$, with
$$d=\frac{\ln(S_0/K)+(\mu-\sigma^2/2)T}{\sigma\sqrt{T}}.$$
Ultimately we are interested in the sensitivity of $V$ with respect to the starting stock price $S_0$, that is $\partial V(S_0)/\partial S_0$. We formulate this as the inverse problem of differentiation. Set $\mathcal{X}=\mathcal{Y}=L^2([0,1])$ and define
Note that this is only an academic example for demonstration. In the given situation one would rather proceed as in [15], [1] or [11].

Image deblurring
When taking a picture, sharp edges of objects are usually blurred. An idealised monochrome picture may be modelled as a function $v\in L^2([0,1]^2)$ and the blurring process is written as $\hat{y}=Kx$, compare to figure 5.2. Monochrome pictures are stored on a computer as matrices $(a_{ij})_{ij}\in\mathbb{R}^{k\times l}$, where $a_{ij}$ is the intensity of the $(i,j)$-th pixel. Hence we replace $L^2([0,1]^2)$ with step functions on a homogeneous grid with $m=kl=150^2=22500$ elements. For a step function $g$ we calculate $Kg$ using the Matlab function imgaussfilt with $\sigma=2$. We added uniform noise, i.e. $Y_i:=\hat{y}+U_i 1$, where $U_i$ is uniformly distributed on $[-1/2,1/2]$. We use $n=20000$ samples. Clearly the true solution $\hat{x}$ has no smoothness at all, that is, we can expect only subpolynomial convergence.

Basics from probability in Hilbert spaces and regularisation theory
Here we collect the required definitions and results from probability theory and classical regularisation theory.

Probability theory
We start with the definitions of a probability space, random variables and independency.
Definition 1 (measurable space, probability space) A set $\Omega$ together with a sigma algebra $\mathcal{A}$ is called a measurable space $(\Omega,\mathcal{A})$. A measurable space $(\Omega,\mathcal{A})$ together with a probability measure $\mathbb{P}$ on $\mathcal{A}$ is called a probability space $(\Omega,\mathcal{A},\mathbb{P})$. A topological space $\mathcal{Y}$ naturally carries the Borel sigma algebra $\mathcal{B}$ generated by the family of open sets in $\mathcal{Y}$.
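Definition 2 (random variable) A measurable function $X:\Omega\to\mathcal{Y}$ from a probability space $(\Omega,\mathcal{A},\mathbb{P})$ to a measurable space $(\mathcal{Y},\mathcal{G})$ is called a random variable. A random variable $Y$ naturally induces a probability measure $\mathbb{P}_Y$ on $(\mathcal{Y},\mathcal{G})$ through $\mathbb{P}_Y(G):=\mathbb{P}(Y^{-1}(G))$ for $G\in\mathcal{G}$ (the pushforward). A random variable generates a sub sigma algebra $\sigma(Y):=\{Y^{-1}(G):G\in\mathcal{G}\}\subset\mathcal{A}$. For an integrable random variable we define the expectation $\mathbb{E}Y:=\int_\Omega Y\,d\mathbb{P}$.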

Definition 3 (independency)
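Let $I$ be some index set. The family of sub sigma algebras $(\mathcal{A}_\alpha)_{\alpha\in I}$ on the probability space $(\Omega,\mathcal{A},\mathbb{P})$ is called independent, if for all $J\subset I$ with $|J|<\infty$ and all $A_j\in\mathcal{A}_j$, $j\in J$, it holds that
$$\mathbb{P}\Big(\bigcap_{j\in J}A_j\Big)=\prod_{j\in J}\mathbb{P}(A_j).$$
A family of random variables is called independent, if the family of its generated sigma algebras is independent.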
The following theorem assures, that our error model is well defined.
Theorem 6 (i.i.d. sequence of random variables) Let (Y , G ) be a measurable space and µ be a probability measure on (Y , G ). Then there exists a probability space (Ω , A , P) and random variables Y i : Ω → Y such that the family (Y i ) i∈N is independent and P Y i = µ for all i ∈ N.
2. The sequence (X n ) n is said to converge in L p for p ∈ [1, ∞] to X (X n 3. The sequence (X n ) n is said to converge in probability to X (X n X) is measurable). 4. The sequence (X n ) n is said to converge weakly (or in distribution) to X (X n w −→ X), if for all continuous and bounded f : Note that there is no topology inducing almost sure convergence. Almost sure convergence and convergence in L p each imply convergence in probability, which implies weak convergence. The following three statements are well known facts for weak convergence, see for example [19] for the proofs.
Lemma 3 (Portmanteau lemma) Suppose we have random variables $X_n$, $X$ taking values in a separable Hilbert space. Then the following are equivalent:
Proof Theorem 13.1 from [19].
Theorem 7 (Continuous mapping theorem) Assume that T : X → Y is a continuous map between separable Hilbert spaces and X n , X : Ω → X are random variables with X n w −→ X. Then it follows that T (X n ) w −→ T (X).
Theorem 8 (Slutsky's theorem) Assume that $(X_n)_n$ and $(Y_n)_n$ are sequences of random variables on a separable Hilbert space $\mathcal{X}$. If $X_n\xrightarrow{w}X$ for some random variable $X:\Omega\to\mathcal{X}$ and $Y_n\xrightarrow{\mathbb{P}}c$ for some constant $c\in\mathcal{X}$, it holds that $X_n+Y_n\xrightarrow{w}X+c$. Moreover, for $\mathcal{X}=\mathbb{R}$, if $\mathbb{P}(Y_n=0)=0$ for all $n\in\mathbb{N}$ and $c\neq 0$, then $X_n/Y_n\xrightarrow{w}X/c$.
Proof Theorem 13.18 from [19].
We conclude with some generalisations of familiar facts for real valued random variables to random variables in a Hilbert space.
Proposition 1 For independent random variables $X,Y:\Omega\to\mathcal{X}$ on a separable Hilbert space with $\mathbb{E}\|X\|^2,\mathbb{E}\|Y\|^2<\infty$ it holds that $\mathbb{E}(X,Y)=(\mathbb{E}X,\mathbb{E}Y)$.
Corollary 3 Consider independent random variables $X,Y$ in a separable Hilbert space $\mathcal{X}$ with $\mathbb{E}X=0$ and $\mathbb{E}\|X\|^2<\infty$ and a constant $c\in\mathcal{X}$. Then
Definition 5 A random variable $X$ on a Banach space $\mathcal{X}$ is called Gaussian, if for all linear and bounded $f:\mathcal{X}\to\mathbb{R}$ it holds that
Lemma 4 Let $X$ be a centred Gaussian with values in a Banach space and $\sigma^2:=\mathbb{E}\|X\|^2$. Denote by $\Psi$ the distribution function of a standard normal variable. Then, for all $t>0$
in particular, $\mathbb{P}(\|X\|\le\sigma)\ge 0.6827$.
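Theorem 9 (Central limit theorem) Let $X$ be a random variable with values in a separable Hilbert space $\mathcal{X}$, such that $\mathbb{E}X=0$ and $\mathbb{E}\|X\|^2<\infty$. Then for i.i.d. copies $(X_n)_n$ it holds that
$$\frac{1}{\sqrt{n}}\sum_{i\le n}X_i\xrightarrow{w}Z,$$
where $Z$ is a centred Gaussian with the same covariance structure as $X$. That is, $\mathbb{E}(Z,x)(Z,x')=\mathbb{E}(X,x)(X,x')$ for all $x,x'\in\mathcal{X}$. Note that this fully determines the distribution of $Z$.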

Regularisation theory
We use the following standard definitions from deterministic linear regularisation theory.
Definition 6 (Moore-Penrose inverse) For a compact operator K : X → Y between separable Hilbert spaces we define the Moore-Penrose inverse K + : R(K) ⊕ R(K) ⊥ → X as the unique solution of the normal equation K * Kx = K * y in N (K) ⊥ . K + is linear. K + is continuous if and only if the range R(K) := {Kx ∈ Y : x ∈ X } of K is closed.
Definition 10 (singular value decomposition) For a compact operator $K:\mathcal{X}\to\mathcal{Y}$ there exists a monotone sequence $\|K\|=\sigma_1\ge\sigma_2\ge\dots>0$ with either $\sigma_i\to 0$ for $i\to\infty$ or there exists an $N\in\mathbb{N}$ with $\sigma_i=\sigma_N$ for all $i\ge N$. Moreover there are families of orthonormal vectors $(u_i)_{i\in\mathbb{N}}$ and $(v_i)_{i\in\mathbb{N}}$ with $\overline{\mathrm{span}}(u_i:i\in\mathbb{N})=\overline{R(K)}$, $\overline{\mathrm{span}}(v_i:i\in\mathbb{N})=N(K)^\perp$ such that $Kv_i=\sigma_iu_i$, $K^*u_i=\sigma_iv_i$.
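For a discretised operator the relations $Kv_i=\sigma_iu_i$ and $K^*u_i=\sigma_iv_i$ can be checked numerically. The sketch below uses a random real matrix as a stand-in for $K$ (an assumption for illustration only); the helper name `svd_relations_hold` is hypothetical.

```python
import numpy as np

def svd_relations_hold(K):
    """Check K v_i = sigma_i u_i and K^T u_i = sigma_i v_i for a real matrix K."""
    U, sigma, Vh = np.linalg.svd(K, full_matrices=False)
    for i in range(len(sigma)):
        u_i, v_i = U[:, i], Vh[i, :]
        if not np.allclose(K @ v_i, sigma[i] * u_i):
            return False
        if not np.allclose(K.T @ u_i, sigma[i] * v_i):
            return False
    # The singular values are sorted in decreasing order and the largest
    # one equals the operator (spectral) norm of K.
    return bool(np.all(np.diff(sigma) <= 0)
                and np.isclose(sigma[0], np.linalg.norm(K, 2)))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    print(svd_relations_hold(rng.standard_normal((5, 3))))
```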
In order to guarantee certain rates of convergence, the true solutionx has to fullfill a source condition.
Many popular linear regularisation methods are of the following class.

Landweber iteration, induced by
where the relaxation parameter $\omega$ is chosen between $0$ and $1/\|K\|^2$.
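Since the filter definitions are not reproduced in full here, the following sketch uses the standard textbook forms of three of the filters named in Remark 1, written as functions of an eigenvalue $\lambda$ of $K^*K$. These forms, the identification of the Landweber iteration count with $\lfloor 1/\alpha\rfloor$, and all function names are assumptions and may differ from the parametrisation used in the definitions of this section.

```python
import numpy as np

def tikhonov_filter(lam, alpha):
    """Assumed Tikhonov filter F_alpha(lam) = 1 / (lam + alpha)."""
    lam = np.asarray(lam, dtype=float)
    return 1.0 / (lam + alpha)

def tsvd_filter(lam, alpha):
    """Assumed truncated SVD filter: 1 / lam for lam >= alpha, else 0."""
    lam = np.asarray(lam, dtype=float)
    return np.where(lam >= alpha, 1.0 / lam, 0.0)

def landweber_filter(lam, alpha, omega):
    """Assumed Landweber filter after floor(1 / alpha) iterations.

    omega is the relaxation parameter; it should satisfy
    0 < omega * lam <= 1 for all eigenvalues lam.
    """
    lam = np.asarray(lam, dtype=float)
    steps = int(np.floor(1.0 / alpha))
    return (1.0 - (1.0 - omega * lam) ** steps) / lam

if __name__ == "__main__":
    lam = np.array([0.5, 0.1])
    for alpha in (1e-1, 1e-2, 1e-4):
        # lam * F_alpha(lam) should approach 1 as alpha tends to 0
        print(alpha,
              lam * tikhonov_filter(lam, alpha),
              lam * tsvd_filter(lam, alpha),
              lam * landweber_filter(lam, alpha, omega=1.0))
```

The printed products $\lambda F_\alpha(\lambda)$ tend to 1 as $\alpha\to 0$, which is the regularising property that all three filters share.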
Definition 16 (qualification of a filter) For a filter (F α ) α>0 , the maximal ν 0 > 0, such that for all ν ∈ (0, ν 0 ] there exist a constant C ν > 0 with sup λ ∈(0, is called the qualification of the filter (respectively the qualification of the induced regularisation). For example, ν 0 = 2 for the Tikhonov regularisation and ν 0 = ∞ for the Landweber iteration or the truncated singular value regularisation.
We give some properties of regularisations defined by filters which fulfil Assumption 1 or 2.