1 Introduction

The goal is to solve the ill-posed equation \(K{\hat{x}}={\hat{y}}\), where \({\hat{x}}\in {\mathscr {X}}\) and \({\hat{y}}\in {\mathscr {Y}}\) are elements of infinite dimensional Hilbert spaces and K is either linear and bounded with non-closed range, or more specifically compact. We do not know the right hand side \({\hat{y}}\) exactly, but we are given several measurements \(Y_1,Y_2,\ldots \) of it, which are independent, identically distributed and unbiased (\({\mathbb {E}}Y_i = {\hat{y}}\)) random variables. Thus we assume that we are able to measure the right hand side multiple times, and a crucial requirement is that the solution does not change, at least on small time scales. Let us stress that using multiple measurements to decrease the data error is a standard engineering practice under the name ‘signal averaging’, see, e.g., [27] for an introductory monograph or [20] for a survey article. Examples with low or moderate numbers of measurements (up to a hundred) can be found in [9] or [28] on image averaging or [13] on satellite radar measurements. For the recent first image of a black hole, even up to \(10^9\) samples were averaged, cf. [1].

The given multiple measurements naturally lead to an estimator of \({\hat{y}}\), namely the sample mean

$$\begin{aligned} {\bar{Y}}_n:=\frac{\sum _{i \le n}Y_i}{n}. \end{aligned}$$

But, in general, \(K^+{\bar{Y}}_n \not \rightarrow K^+{\hat{y}}\) for \(n\rightarrow \infty \), because the generalised inverse (Definition 2.2 of [12]) of K is not continuous. So the inverse is replaced by a family of continuous approximations \((R_{\alpha })_{\alpha >0}\), called a regularisation, e.g. the Tikhonov regularisation \(R_{\alpha }:=\left( K^*K+\alpha Id\right) ^{-1}K^*\), where \(Id:{\mathscr {X}}\rightarrow {\mathscr {X}}\) is the identity. The regularisation parameter \(\alpha \) has to be chosen according to the data \({\bar{Y}}_n\) and the true data error

$$\begin{aligned} \delta _n^{true}:=\Vert {\bar{Y}}_n-{\hat{y}}\Vert , \end{aligned}$$

which is also a random variable. Since \({\hat{y}}\) is unknown, \(\delta _n^{true}\) is also unknown and has to be guessed. Natural guesses are

$$\begin{aligned} \delta _n^{est}:= \frac{1}{\sqrt{n}}\quad \text{ or } \quad \delta _n^{est}:=\frac{\sqrt{\sum _{i\le n}\Vert Y_i -{\bar{Y}}_n\Vert ^2/(n-1)}}{\sqrt{n}}. \end{aligned}$$
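For concreteness, here is a minimal sketch of the sample mean and the two noise-level guesses for discretised measurements; the representation of each \(Y_i\) as a vector in \({\mathbb {R}}^m\) is an assumption of this illustration only.

import numpy as np

def sample_mean_and_noise_level(Y):
    """Y: array of shape (n, m), each row one measurement Y_i discretised in R^m.

    Returns the sample mean bar{Y}_n, the simple guess delta_est = 1/sqrt(n)
    and the sample-variance based guess delta_est = s_n/sqrt(n)."""
    n = Y.shape[0]
    Y_bar = Y.mean(axis=0)
    delta_simple = 1.0 / np.sqrt(n)
    s_n_sq = np.sum(np.linalg.norm(Y - Y_bar, axis=1) ** 2) / (n - 1)   # sample variance
    delta_sample = np.sqrt(s_n_sq) / np.sqrt(n)
    return Y_bar, delta_simple, delta_sample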

A first natural approach is to use a (deterministic) regularisation method together with \({\bar{Y}}_n\) and \(\delta _n^{est}\). We are in particular interested in the discrepancy principle [30], which is known to provide optimal convergence rates (for some \({\hat{y}}\)) in the classical deterministic setting. The following main result states that, in a certain sense, this natural approach converges and asymptotically yields the optimal deterministic rates.

Corollary 1

(to Theorems 3 and 4) Assume that \(K:{\mathscr {X}}\rightarrow {\mathscr {Y}}\) is a compact operator with dense range between Hilbert spaces and that \(Y_1,Y_2,\ldots \) are i.i.d. \({\mathscr {Y}}\)-valued random variables which fulfill \({\mathbb {E}}[ Y_1] = {\hat{y}}\in {\mathscr {R}}(K)\) and \(0<{\mathbb {E}}\Vert Y_1-{\hat{y}}\Vert ^2<\infty \). Define the Tikhonov regularisation \(R_{\alpha }:=\left( K^*K+\alpha Id\right) ^{-1}K^*\) (or the truncated singular value regularisation, or Landweber iteration). Determine \((\alpha _n)_n\) through the discrepancy principle using \(\delta _n^{est}\) (see Algorithm 1). Then \(R_{\alpha _n}{\bar{Y}}_n\) converges to \(K^+{\hat{y}}\) in probability, that is

$$\begin{aligned} {\mathbb {P}}\left( \Vert R_{\alpha _n}{\bar{Y}}_n-K^+{\hat{y}}\Vert \le \varepsilon \right) \rightarrow 1,\quad n\rightarrow \infty ,\quad \forall \varepsilon >0. \end{aligned}$$

Moreover, if \(K^+{\hat{y}}=\left( K^*K\right) ^{\nu /2}w\) with \(w\in {\mathscr {X}}\) and \(\Vert w\Vert \le \rho \) for \(\rho >0\) and \(0<\nu <\nu _0-1\) (where \(\nu _0\) is the qualification of the chosen method, see Assumption 1), then for all \(\varepsilon >0\),

$$\begin{aligned} {\mathbb {P}}\left( \Vert R_{\alpha _n}{\bar{Y}}_n-K^+{\hat{y}}\Vert \le \rho ^\frac{1}{\nu +1}\left( \frac{1}{\sqrt{n}}\right) ^{\frac{\nu }{\nu +1}-\varepsilon }\right) \rightarrow 1,\quad n\rightarrow \infty . \end{aligned}$$

Moreover, it is shown that the approach in general does not yield \(L^2\) convergence for a naive use of the discrepancy principle, but it does for a priori regularisation. We also briefly discuss how one has to estimate the error to obtain almost sure convergence.

To solve an inverse problem, as already mentioned, typically some a priori information about the noise is required. This may be, in the classical deterministic case, the knowledge of an upper bound of the noise level, or, in the stochastic case, some knowledge of the error distribution or the restriction to certain classes of distributions, for example to Gaussian distributions. Here we present the first rigorous convergence theory for noisy measurements without any knowledge of the error distribution. The approach can easily be used by anyone who can measure multiple times.

Stochastic or statistical inverse problems are an active field of research with close ties to high dimensional statistics [16, 17, 31]. In general, there are two approaches to tackle an ill-posed problem with stochastic noise. The Bayesian setting considers the solution of the problem itself as a random quantity, on which one has some a priori knowledge (see [23]). This opposes the frequentist setting, where the inverse problem is assumed to have a deterministic, exact solution [6, 10]. We are working in the frequentist setting, but we stay close to the classic deterministic theory of linear inverse problems [12, 32, 33]. For statistical inverse problems, typical methods to determine the regularisation parameter are cross validation [34], Lepski’s balancing principle [29] or penalised empirical risk minimisation [11]. Modifications of the discrepancy principle were studied recently [7, 8, 25, 26]. In [8], it was first shown how to obtain optimal convergence in \(L^2\) under Gaussian white noise with a modified version of the discrepancy principle.

Another approach is to transfer results from the classical deterministic theory using the Ky-Fan metric, which metrises convergence in probability. In [15, 21] it is shown how to obtain convergence if one knows the Ky-Fan distance between the measurements and the true data. Aspects of the Bakushinskii veto [3] for stochastic inverse problems are discussed in [4, 5, 35] under assumptions on the noise distribution. In particular, [5] gives an explicit nontrivial example of a convergent regularisation, without knowing the exact error level, under Gaussian white noise. We extend this here to arbitrary distributions, provided one has multiple measurements.

In the articles mentioned above, the error is usually modelled as a Hilbert space process (such as white noise), so it is impossible to determine the regularisation parameter directly through the discrepancy principle. This is in contrast to our more classical error model, where the measurement is an element of the Hilbert space itself. Under the popular assumption that the operator K is Hilbert-Schmidt, one could in principle extend our results to a general Hilbert space process error model (considering the symmetrised equation \(K^*K{\hat{x}}=K^*{\hat{y}}\) instead of \(K{\hat{x}}={\hat{y}}\), as it is done for example in [8]). But we postpone the discussion of the white noise case to a follow-up paper.

To summarise the connection to the Bakushinskii veto, let us state the following. The Bakushinskii veto states that the inverse problem can only be solved with a deterministic regularisation if the noise level of the data is known. In this article we show that, if one has access to multiple i.i.d. measurements of an unknown distribution, one may use the average as data together with the estimated noise level, and one obtains the optimal deterministic rate with high probability as the number of measurements tends to infinity. That is, one can estimate the error from the data. Finally, the measurements potentially contain more information, which is not used here. For example, one could estimate the whole covariance structure of one measurement and use this to rescale the measurements and the operator, eventually increasing the relative smoothness of the data. Also, one could directly regularise the non-averaged measurements.

In the following section we apply our approach to a priori regularisations, and in the main part we consider the widely used discrepancy principle, which is known to work optimally in the classical deterministic theory. After that we briefly show how to choose \(\delta _n^{est}\) to obtain almost sure convergence, and we compare the methods numerically.

2 A priori regularisation

We use the usual definition that \(R_\alpha :{\mathscr {Y}}\rightarrow {\mathscr {X}}\) is called a linear regularisation, if \(R_\alpha \) is a bounded linear operator for all \(\alpha >0\) and if \(R_{\alpha }y\rightarrow K^+y\) for \(\alpha \rightarrow 0\) for all \(y\in {\mathscr {D}}(K^+)\). A regularisation method is a combination of a regularisation and a parameter choice strategy \(\alpha : {\mathbb {R}}^+ \times {\mathscr {Y}} \rightarrow {\mathbb {R}}^+\), such that \(R_{\alpha (\delta ,y^{\delta })}y^{\delta } \rightarrow K^+y\) for \(\delta \rightarrow 0\), for all \(y \in {\mathscr {D}}(K^+)\) and for all \((y^{\delta })_{\delta > 0}\subset {\mathscr {Y}}\) with \(\Vert y^{\delta } - y \Vert \le \delta \). The method is called a priori, if the parameter choice does not depend on the data, that is if \(\alpha (\delta ,y)=\alpha (\delta )\).

The measurements can be formally modelled as realisations of an independent and identically distributed sequence \(Y_1,Y_2,\ldots : \varOmega \rightarrow {\mathscr {Y}}\) of random variables with values in \({\mathscr {Y}}\), such that \({\mathbb {E}}Y_1 ={\hat{y}}\in {\mathscr {D}}(K^+)\). Moreover, we require that \(0<{\mathbb {E}}\Vert Y_1 \Vert ^2 < \infty \), that is the measurements are (almost surely) in the Hilbert space.

In the following we apply the above approach to a priori parameter choice strategies \(\alpha (y^{\delta },\delta )=\alpha (\delta )\). We restrict to \(\delta _n^{est}=1/\sqrt{n}\), that is we do not estimate the variance (otherwise the parameter choice would depend on the data). Since \(\delta _n^{est}\) and hence \(\alpha (\delta _n^{est})\) are then deterministic, the situation is very simple and the results are not surprising (see Remark 2).

Theorem 1

(Convergence of a priori regularisation) Assume that \(K:{\mathscr {X}}\rightarrow {\mathscr {Y}}\) is a bounded linear operator with non-closed range between Hilbert spaces and that \(Y_1,Y_2,\ldots \) are i.i.d. \({\mathscr {Y}}\)-valued random variables which fulfill \({\mathbb {E}}[ Y_1] = {\hat{y}}\in {\mathscr {D}}(K^+)\) and \(0<{\mathbb {E}}\Vert Y_1\Vert ^2<\infty \). Take an a priori regularisation scheme, with \(\alpha (\delta ) {\mathop {\longrightarrow }\limits ^{\delta \rightarrow 0}} 0\) and \(\Vert R_{\alpha (\delta )} \Vert \delta {\mathop {\longrightarrow }\limits ^{\delta \rightarrow 0}} 0\). Set \({\bar{Y}}_n:= \sum _{i\le n} Y_i/n\) and \(\delta _n^{est}:=n^{-1/2}\). Then \(\lim _{n\rightarrow \infty }{\mathbb {E}}\Vert R_{\alpha (\delta _n^{est})} {\bar{Y}}_n -K^+{\hat{y}}\Vert ^2 =0\).

Proof

Because of linearity, \({\mathbb {E}}\left[ R_{\alpha } Y_1 \right] = R_{\alpha }{\mathbb {E}}\left[ Y_1\right] = R_{\alpha }{\hat{y}}\) and thus by (3)

$$\begin{aligned} {\mathbb {E}}\Vert R_{\alpha } {\bar{Y}}_n - R_{\alpha } {\hat{y}}\Vert ^2&= \frac{1}{n^2}{\mathbb {E}}\left\Vert \sum _{i=1}^n R_{\alpha }\left( Y_i-{\hat{y}}\right) \right\Vert ^2 = \frac{{\mathbb {E}}\Vert R_{\alpha }Y_1 - R_{\alpha }{\hat{y}}\Vert ^2}{n}, \end{aligned}$$

since \(R_{\alpha }Y_i \in {\mathscr {R}}(K^*)\) where the latter is separable. Therefore, by the bias-variance-decomposition,

$$\begin{aligned} {\mathbb {E}}\Vert R_{\alpha (\delta _n^{est})}{\bar{Y}}_n - K^+{\hat{y}}\Vert ^2&= {\mathbb {E}}\Vert R_{\alpha (\delta _n^{est})}{\bar{Y}}_n - R_{\alpha (\delta _n^{est})} {\hat{y}} + R_{\alpha (\delta _n^{est})}{\hat{y}} - K^+{\hat{y}}\Vert ^2\\&= {\mathbb {E}}\Vert R_{\alpha (\delta _n^{est})}{\bar{Y}}_n - R_{\alpha (\delta _n^{est})} {\hat{y}}\Vert ^2 + \Vert R_{\alpha (\delta _n^{est})}{\hat{y}} - K^+{\hat{y}}\Vert ^2\\&= \frac{{\mathbb {E}}\Vert R_{\alpha (\delta _n^{est})}Y_1 - R_{\alpha (\delta _n^{est})}{\hat{y}}\Vert ^2}{n} + \Vert R_{\alpha (\delta _n^{est})}{\hat{y}} - K^+{\hat{y}}\Vert ^2\\&\le \frac{\Vert R_{\alpha (\delta _n^{est})}\Vert ^2}{n} {\mathbb {E}}\Vert Y_1 - {\hat{y}}\Vert ^2 + \Vert R_{\alpha (\delta _n^{est})}{\hat{y}} - K^+{\hat{y}}\Vert ^2\\&= \Vert R_{\alpha (\delta _n^{est})}\Vert ^2{\delta _n^{est}}^2 {\mathbb {E}}\Vert Y_1 - {\hat{y}}\Vert ^2 + \Vert R_{\alpha (\delta _n^{est})}{\hat{y}} - K^+{\hat{y}}\Vert ^2\\&\rightarrow 0 \qquad \text{ for }\quad n\rightarrow \infty . \end{aligned}$$

\(\square \)

As in the deterministic case, under additional source conditions we can prove convergence rates. We restrict to regularisations \(R_\alpha :=F_{\alpha }\left( K^*K\right) K^*\) defined via the spectral decomposition (see [12]) with the following assumptions for the generating filter.

Assumption 1

\((F_{\alpha })_{\alpha >0}\) is a regularising filter, i.e. a family of piecewise continuous real valued functions on \([0,\Vert K\Vert ^2]\), continuous from the right, with \(\lim _{\alpha \rightarrow 0}F_{\alpha }(\lambda )=\frac{1}{\lambda }\) for all \(\lambda \in (0,\Vert K \Vert ^2]\) and \(\lambda F_{\alpha }(\lambda )\le C_R\) for all \(\alpha >0\) and all \(\lambda \in \left( 0,\Vert K\Vert ^2\right] \), where \(C_R>0\) is some constant. Moreover, it has qualification \(\nu _0>0\), i.e. \(\nu _0\) is maximal such that for all \(\nu \in [0,\nu _0]\) there exists a constant \(C_{\nu }>0\) with

$$\begin{aligned} \sup _{\lambda \in (0,\Vert K\Vert ^2]}\lambda ^{\nu /2}| 1 -\lambda F_{\alpha }(\lambda )|\le C_{\nu }\alpha ^{\nu /2}. \end{aligned}$$

Finally, there is a constant \(C_F>0\) such that \(|F_{\alpha }(\lambda )|\le C_F/\alpha \) for all \(0<\lambda \le \Vert K\Vert ^2\).
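As a worked example (a standard computation, not part of Assumption 1 itself), the Tikhonov filter \(F_{\alpha }(\lambda )=1/(\lambda +\alpha )\) fulfills Assumption 1 with \(C_R=C_F=1\) and qualification \(\nu _0=2\): clearly \(\lambda F_{\alpha }(\lambda )=\lambda /(\lambda +\alpha )\le 1\) and \(|F_{\alpha }(\lambda )|\le 1/\alpha \), and for \(0\le \nu \le 2\),

$$\begin{aligned} \sup _{\lambda>0}\lambda ^{\nu /2}|1-\lambda F_{\alpha }(\lambda )| = \sup _{\lambda>0}\frac{\lambda ^{\nu /2}\alpha }{\lambda +\alpha } = \alpha ^{\nu /2}\sup _{\lambda>0}\frac{\lambda ^{\nu /2}\alpha ^{1-\nu /2}}{\lambda +\alpha }\le \alpha ^{\nu /2}, \end{aligned}$$

by Young's inequality \(\lambda ^{\nu /2}\alpha ^{1-\nu /2}\le \frac{\nu }{2}\lambda +(1-\frac{\nu }{2})\alpha \); for \(\nu >2\) the supremum is only of order \(\alpha \), so that indeed \(\nu _0=2\).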

Remark 1

The generating filters of the following regularisation methods fulfill Assumption 1 (a sketch of the corresponding filter functions follows the list):

  1. Tikhonov regularisation (qualification 2),
  2. n-times iterated Tikhonov regularisation (qualification 2n),
  3. truncated singular value regularisation (infinite qualification),
  4. Landweber iteration (infinite qualification).
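A sketch of the corresponding filter functions in code (the formulas are the standard ones, cf. [12]; the Landweber step size \(\omega \) and the identification \(\alpha \approx 1/k\) are conventions of this illustration):

import numpy as np

def tikhonov_filter(lam, alpha):
    # F_alpha(lambda) = 1/(lambda + alpha), qualification 2
    return 1.0 / (lam + alpha)

def iterated_tikhonov_filter(lam, alpha, m_it=2):
    # m_it-times iterated Tikhonov, qualification 2*m_it
    return (1.0 - (alpha / (lam + alpha)) ** m_it) / lam

def tsvd_filter(lam, alpha):
    # truncated singular value regularisation (assumes lam > 0), infinite qualification
    return np.where(lam >= alpha, 1.0 / lam, 0.0)

def landweber_filter(lam, k, omega=1.0):
    # k Landweber steps with step size omega (omega * ||K||^2 <= 1), alpha ~ 1/k
    return (1.0 - (1.0 - omega * lam) ** k) / lam

def apply_regularisation(U, s, Vt, y, filt, alpha):
    """R_alpha y = F_alpha(K*K) K* y, evaluated via the SVD K = U diag(s) V^T."""
    return Vt.T @ (filt(s ** 2, alpha) * s * (U.T @ y))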

Theorem 2

(Rate of convergence of a priori regularisation) Assume that \(K:{\mathscr {X}}\rightarrow {\mathscr {Y}}\) is a bounded linear operator with non-closed range between Hilbert spaces and that \(Y_1,Y_2,\ldots \) are i.i.d. \({\mathscr {Y}}\)-valued random variables which fulfill \({\mathbb {E}}[ Y_1] = {\hat{y}}\in {\mathscr {D}}(K^+)\) and \(0<{\mathbb {E}}\Vert Y_1\Vert ^2<\infty \). Let \(R_{\alpha }\) be induced by a filter fulfilling Assumption 1. Set \({\bar{Y}}_n:= \sum _{i\le n} Y_i/n\) and \(\delta _n^{est}=n^{-1/2}\). Assume that for \(0<\nu \le \nu _0\) and \(\rho >0\) we have that \(K^+{\hat{y}}=(K^*K)^{\nu /2}w\) for some \(w\in {\mathscr {X}}\) with \(\Vert w \Vert \le \rho \). Then if for constants \(0<c<C\),

$$\begin{aligned} c \left( \frac{\delta _n^{est}}{\rho }\right) ^\frac{2}{\nu +1} \le \alpha (\delta _n^{est}) \le C \left( \frac{\delta _n^{est}}{\rho }\right) ^\frac{2}{\nu +1}, \end{aligned}$$

we have that \(\sqrt{{\mathbb {E}}\Vert R_{\alpha (\delta _n^{est})}{\bar{Y}}_n - K^+{\hat{y}} \Vert ^2} \le {C^{\prime }} {\delta _n^{est}}^\frac{\nu }{\nu +1} \rho ^\frac{1}{\nu +1} = {\mathscr {O}}\left( n^{-\frac{\nu }{2(\nu +1)}}\right) \) for some constant \({C^{\prime }}>0\).

Proof

We proceed similarly to the proof of Theorem 1, using additionally Proposition 1 of Sect. 4.

$$\begin{aligned} {\mathbb {E}}\Vert R_{\alpha (\delta _n^{est})}{\bar{Y}}_n - K^+{\hat{y}}\Vert ^2&= {\mathbb {E}}\Vert R_{\alpha (\delta _n^{est})}{\bar{Y}}_n - R_{\alpha (\delta _n^{est})} {\hat{y}}\Vert ^2 + \Vert R_{\alpha (\delta _n^{est})}{\hat{y}} - K^+{\hat{y}}\Vert ^2\\&\le \Vert R_{\alpha (\delta _n^{est})}\Vert ^2{\delta _n^{est}}^2 {\mathbb {E}}\Vert Y_1 - {\hat{y}}\Vert ^2 + \Vert R_{\alpha (\delta _n^{est})}{\hat{y}} - K^+{\hat{y}}\Vert ^2\\&\le C_RC_F{\mathbb {E}}\Vert Y_1 - {\hat{y}}\Vert ^2 \frac{{\delta _n^{est}}^2}{\alpha (\delta _n^{est})} + C_{\nu }^2 \rho ^2 \alpha (\delta _n^{est})^{\nu }\\&\le \frac{C_RC_F{\mathbb {E}}\Vert Y_1 - {\hat{y}}\Vert ^2}{c} {\delta _n^{est}}^\frac{-2}{\nu +1} \rho ^\frac{2}{\nu +1} {\delta _n^{est}}^2 \\&\quad + C_{\nu }^2 C^{\nu } {\delta _n^{est}}^\frac{2\nu }{\nu +1} \rho ^\frac{-2\nu }{\nu +1} \rho ^2\\&\le C^{{\prime 2}} {\delta _n^{est}}^\frac{2\nu }{\nu +1} \rho ^\frac{2}{\nu +1}. \end{aligned}$$

\(\square \)
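To illustrate Theorem 2 numerically, the following is a small self-contained sketch with a diagonal operator and Tikhonov regularisation; the operator, the noise model and all parameter values are choices made for this illustration and are not taken from the paper's experiments.

import numpy as np

rng = np.random.default_rng(0)

m, n = 200, 10_000                         # discretisation size, number of measurements
s = (1.0 / np.arange(1, m + 1)) ** 1.5     # singular values of a diagonal operator K
nu, rho = 1.0, 1.0
w = np.ones(m) / np.sqrt(m)                # ||w|| = 1 = rho
x_true = s ** nu * w                       # source condition K^+ y = (K*K)^(nu/2) w
y_true = s * x_true

Y = y_true + 0.1 * rng.standard_normal((n, m))     # i.i.d. unbiased measurements
Y_bar = Y.mean(axis=0)

delta_est = 1.0 / np.sqrt(n)
alpha = (delta_est / rho) ** (2.0 / (nu + 1.0))    # a priori choice of Theorem 2 (with c = C = 1)

x_rec = s / (s ** 2 + alpha) * Y_bar               # Tikhonov: R_alpha y = (K*K + alpha Id)^{-1} K* y
print("relative error:", np.linalg.norm(x_rec - x_true) / np.linalg.norm(x_true))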

Remark 2

For separable Hilbert spaces one could alternatively argue as follows: The spaces \({\mathscr {X}}^{\prime }:=L^2(\varOmega ,{\mathscr {X}})=\{X:\varOmega \rightarrow {\mathscr {X}}:{\mathbb {E}}\Vert X\Vert ^2<\infty \}\) and \({\mathscr {Y}}^{\prime }:=L^2(\varOmega ,{\mathscr {Y}})\) are also Hilbert spaces, with scalar products \((X,{\tilde{X}})_{{\mathscr {X}}^{\prime }}:={\mathbb {E}}(X,{\tilde{X}})_{{\mathscr {X}}}\) and \((\cdot ,\cdot )_{{\mathscr {Y}}^{\prime }}\) defined similarly. Then \(K:{\mathscr {X}}\rightarrow {\mathscr {Y}}\) naturally induces a bounded linear operator \(K^{\prime }:{\mathscr {X}}^{\prime }\rightarrow {\mathscr {Y}}^{\prime },X\mapsto KX\). Clearly we have that \({\hat{y}}\in {\mathscr {Y}}^{\prime }\), and \(({\bar{Y}}_n)_n\) is a sequence in \({\mathscr {Y}}^{\prime }\) which fulfills

$$\begin{aligned} \Vert {\bar{Y}}_n-{\hat{y}}\Vert _{{\mathscr {Y}}^{\prime }}:=\sqrt{({\bar{Y}}_n-{\hat{y}},{\bar{Y}}_n-{\hat{y}})_{{\mathscr {Y}}^{\prime }}} = \sqrt{\frac{{\mathbb {E}}\Vert Y_1-{\hat{y}}\Vert ^2}{n}}=\sqrt{{\mathbb {E}}\Vert Y_1-{\hat{y}}\Vert ^2}\delta _n^{est} \end{aligned}$$

and we can use the classic deterministic results for \(K^{\prime }:{\mathscr {X}}^{\prime }\rightarrow {\mathscr {Y}}^{\prime }\) and \({\bar{Y}}_n\) and \(\delta _n^{est}\).

3 The discrepancy principle

In this section we restrict to compact operators with dense range. Note that then \({\mathscr {Y}}=\overline{{\mathscr {R}}(K)}\) will be automatically separable. In practice the above parameter choice strategies are of limited interest, since they require the knowledge of the abstract smoothness parameters \(\nu \) and \(\rho \). The classical discrepancy principle would be to choose \(\alpha _n\) such that

$$\begin{aligned} \Vert (KR_{\alpha _n}-Id){\bar{Y}}_n\Vert \approx \delta _n^{true} = \Vert {\bar{Y}}_n-{\hat{y}}\Vert , \end{aligned}$$
(1)

which is not possible, because of the unknown \(\delta _n^{true}\). So we replace it with our estimator \(\delta _n^{est}\) and implement the discrepancy principle via Algorithm 1 with or without the optional emergency stop.

Algorithm 1: Discrepancy principle with estimated noise level \(\delta _n^{est}\) (with optional emergency stop)
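The following is a minimal sketch of Algorithm 1 as described in the text: a geometric search over \(\alpha \in \{q^k:k\in {\mathbb {N}}_0\}\), \(q\in (0,1)\), that stops as soon as the residual falls below \(\delta _n^{est}\), with the optional emergency stop enforcing \(\alpha _n\ge 1/n\). The SVD-based evaluation with Tikhonov regularisation is an implementation choice of this sketch (U, s, Vt as returned by np.linalg.svd of a discretised K); it assumes \(\delta _n^{est}>0\).

import numpy as np

def discrepancy_principle(U, s, Vt, Y_bar, delta_est, n, q=0.5, emergency_stop=False):
    """Choose alpha_n as the largest q^k with ||(K R_{q^k} - Id) Y_bar|| < delta_est,
    for Tikhonov regularisation R_alpha = (K*K + alpha Id)^{-1} K*."""
    coeff = U.T @ Y_bar
    # component of Y_bar outside the column space of U; it contributes to the residual for every alpha
    out_of_range = np.linalg.norm(Y_bar - U @ coeff) ** 2
    k = 0
    while True:
        alpha = q ** k
        residual = np.sqrt(np.sum((alpha / (s ** 2 + alpha)) ** 2 * coeff ** 2) + out_of_range)
        if residual < delta_est:
            break
        if emergency_stop and alpha <= 1.0 / n:
            break
        k += 1
    alpha_n = max(alpha, 1.0 / n) if emergency_stop else alpha
    x_rec = Vt.T @ (s / (s ** 2 + alpha_n) * coeff)   # R_{alpha_n} Y_bar
    return alpha_n, x_rec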

Remark 3

To our knowledge, the idea of an emergency stop first appeared in [8]. It provides a deterministic lower bound for the regularisation parameter, which may avoid overfitting. We use an elementary form of an emergency stop here, which does not require the knowledge of the singular value decomposition of K. It would be interesting to see how more sophisticated versions of the emergency stop would work here, which is not clear to us since in our general setting we cannot rely on the concentration properties of Gaussian noise.

Algorithm 1 will terminate if we use the emergency stop. Otherwise, we can guarantee that Algorithm 1 terminates if K has dense range (or equivalently, if \(K^*\) is injective) and if \(\delta _n^{est}>0\). This is because then \(\lim _{\alpha \rightarrow 0} KR_{\alpha }=P_{\overline{{\mathscr {R}}(K)}}=Id\) pointwise, so \(\Vert (KR_{q^k}-Id){\bar{Y}}_n\Vert < \delta _n^{est}\) for k large enough. If we decide to use the sample variance, it may happen that \(\delta _n^{est}=0\). But assuming \({\mathbb {E}}\Vert Y_1-{\hat{y}}\Vert ^2>0\), it follows that \({\mathbb {P}}\left( \delta _n^{est}=0\right) ={\mathbb {P}}\left( Y_1=\cdots =Y_n\right) \rightarrow 0\) for \(n\rightarrow \infty \) (with exponential rate). If the distribution of \(Y_1\) possesses a density (with respect to the Gaussian measure, for example), then actually \({\mathbb {P}}(Y_1=\cdots =Y_n)=0\) for all \(n\in {\mathbb {N}}\).

Unlike in the previous section, here the \(L^2\) error will not converge in general, even if \(Y_1\) has a density. The regularisation parameter \(\alpha _n\) is now random, since it depends on the potentially bad random data. With a small but not negligible probability p we underestimate the data error significantly; the discrepancy principle then yields a too small \(\alpha \), and we may still have \(p\Vert R_{\alpha }\Vert \gg 1\) in such a case.

In the following we will need the singular value decomposition of the compact operator K with dense range (see [10]): there exists a monotone sequence \(\Vert K \Vert =\sigma _1\ge \sigma _2 \ge \cdots >0\) with \(\sigma _l{\rightarrow }0\) for \(l\rightarrow \infty \). Moreover, there are families of orthonormal vectors \((u_l)_{l\in {\mathbb {N}}}\) and \((v_l)_{l\in {\mathbb {N}}}\) with \(\overline{span}( u_l:l\in {\mathbb {N}})={\mathscr {Y}}\), \(\overline{span}(v_l:l\in {\mathbb {N}})= {\mathscr {N}}(K)^\bot \) such that \(Kv_l=\sigma _lu_l\) and \(K^*u_l=\sigma _lv_l\).
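In the discretised setting of the code sketches above, the singular system \((\sigma _l,u_l,v_l)\) can simply be obtained from the matrix representing K (an implementation remark, not part of the theory):

import numpy as np

def singular_system(K_mat):
    # columns of U correspond to u_l, entries of s to sigma_l, rows of Vt to v_l
    U, s, Vt = np.linalg.svd(K_mat, full_matrices=False)
    return U, s, Vt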

3.1 A counter example for convergence

We now show that a naive use of the discrepancy principle, as implemented in Algorithm 1 without the emergency stop, may fail to converge in \(L^2\). To simplify the calculations we pick Gaussian noise and the truncated singular value regularisation, and we set \(\delta _n^{est}=1/\sqrt{n}\). We choose \({\mathscr {X}}:=l^2({\mathbb {N}})\) with the standard basis \(\{u_l:=(0,\ldots ,0,1,0,\ldots )\}\) and consider the diagonal operator

$$\begin{aligned} K:l^2({\mathbb {N}})\rightarrow l^2({\mathbb {N}}),\quad u_l \mapsto \left( \frac{1}{100}\right) ^\frac{l}{2} u_l \end{aligned}$$

with \({\hat{x}}=0={\hat{y}}=K{\hat{x}}\). Hence the \(\sigma _l=(1/100)^\frac{l}{2}\) are the singular values of K and

$$\begin{aligned} R_{\alpha }:l^2({\mathbb {N}})\rightarrow l^2({\mathbb {N}}), \quad y \mapsto \sum _{l:\sigma _l^2\ge \alpha } \sigma _l^{-1}(y,u_l)u_l. \end{aligned}$$

We assume that the noise is distributed along \(y:= \sum _{l\ge 2} 1/\sqrt{l(l-1)}\, u_l\), so we have that \(\sum _{l> n} (y,u_l)^2=1/n\) and thus \(y\in l^2({\mathbb {N}})\). That is, we set \(Y_i:=Z_iy\), so that \({\bar{Y}}_n=\sum _{i\le n} Z_iy/n={\bar{Z}}_ny\), where the \(Z_i\) are i.i.d. standard Gaussians. We define \(\varOmega _n:=\{Z_i\ge 1, i=1\ldots n\}\), a (very unlikely) event on which we significantly underestimate the true data error. We get that \({\mathbb {P}}(\varOmega _n)={\mathbb {P}}(Z_1\ge 1)^n\ge 1/10^n\). Moreover, by the definition of the discrepancy principle

$$\begin{aligned} \frac{1}{n} \chi _{\varOmega _n}={\delta _n^{est}}^2 \chi _{\varOmega _n}&\ge \Vert (KR_{\alpha _n}-Id){\bar{Y}}_n\Vert ^2 \chi _{\varOmega _n} = |{\bar{Z}}_n|^2 \Vert (KR_{\alpha _n}-Id)y\Vert ^2 \chi _{\varOmega _n}\\&\ge \Vert (KR_{\alpha _n}-Id)y\Vert ^2\chi _{\varOmega _n}\\&=\sum _{l:\sigma _l^2<\alpha _n} (y,u_l)^2 \chi _{\varOmega _n}= \sum _{l:(1/100)^l<\alpha _n} (y,u_l)^2\chi _{\varOmega _n}\\&= \sum _{l> \frac{\log (\alpha _n)}{\log (1/100)}}(y,u_l)^2 \chi _{\varOmega _n}\ge \frac{\log (1/100)}{\log (\alpha _n)}\chi _{\varOmega _n}\\ \Longrightarrow \alpha _n \chi _{\varOmega _n}&< \frac{1}{100^n}. \end{aligned}$$

It follows that

$$\begin{aligned} {\mathbb {E}}\Vert R_{\alpha _n} {\bar{Y}}_n - K^+ {\hat{y}}\Vert ^2&= {\mathbb {E}}\Vert R_{\alpha _n}{\bar{Y}}_n\Vert ^2 \ge {\mathbb {E}}\left[ \Vert R_{\alpha _n} {\bar{Y}}_n \Vert ^2\chi _{\varOmega _n}\right] \\&= {\mathbb {E}}\left[ {\bar{Z}}_n^2\Vert R_{\alpha _n}y\Vert ^2 \chi _{\varOmega _n}\right] \ge {\mathbb {E}}\left[ \Vert R_{1/100^n} y\Vert ^2 \chi _{\varOmega _n}\right] \\&\ge \sum _{l:\sigma _l^{2}\ge 1/100^n} \sigma _l^{-2}(y,u_l)^2 {\mathbb {P}}(\varOmega _n) \ge \frac{1}{10^n} \sum _{l\le n} \sigma _l^{-2}(y,u_l)^2\\&\ge \frac{1}{10^n} 100^n (y,u_n)^2 = \frac{10^n}{n(n-1)}\rightarrow \infty . \end{aligned}$$

That is, the probability of the events \(\varOmega _n\) is not small enough to compensate the huge error we have on these events, so in the end \({\mathbb {E}}\Vert R_{\alpha _n}{\bar{Y}}_n-K^+{\hat{y}}\Vert ^2\rightarrow \infty \) for \(n\rightarrow \infty \).

3.2 Convergence in probability of the discrepancy principle

In this section we show that the discrepancy principle yields convergence in probability, asymptotically matching the optimal deterministic rate. The proofs of Theorems 3 and 4 and of Corollary 3 are given in the following section.

Theorem 3

(Convergence of the discrepancy principle) Assume that K is a compact operator with dense range between Hilbert spaces \({\mathscr {X}}\) and \({\mathscr {Y}}\) and that \(Y_1,Y_2,\ldots \) are i.i.d. \({\mathscr {Y}}\)-valued random variables with \({\mathbb {E}}Y_1={\hat{y}}\in {\mathscr {R}}(K)\) and \(0<{\mathbb {E}}\Vert Y_1-{\hat{y}}\Vert ^2 < \infty \). Let \(R_{\alpha }\) be induced by a filter fulfilling Assumption 1 with \(\nu _0>1\). Applying Algorithm 1 with or without the emergency stop yields a sequence \((\alpha _n)_n\). Then we have that for all \(\varepsilon > 0\)

$$\begin{aligned} {\mathbb {P}}\left( \Vert R_{\alpha _n}{\bar{Y}}_n - K^+{\hat{y}}\Vert \le \varepsilon \right) {\mathop {\longrightarrow }\limits ^{n\rightarrow \infty }}1, \end{aligned}$$

i.e. \(R_{\alpha _n}{\bar{Y}}_n {\mathop {\longrightarrow }\limits ^{{\mathbb {P}}}}K^+{\hat{y}}\).

Remark 4

If one tried to argue as in Remark 2 to show \(L^2\) convergence, one would have to determine the regularisation parameter not as given by Eq. (1), but such that \({\mathbb {E}}\Vert (KR_{\alpha }-Id){\bar{Y}}_n\Vert ^2 \approx {\delta _n^{est}}^2\), which is not practicable since we cannot calculate the expectation on the left hand side.

The popularity of the discrepancy principle stems from the fact that it guarantees optimal convergence rates under an additional source condition: Assuming that there is a \(0<\nu \le \nu _0-1\) (where \(\nu _0\) is the qualification of the chosen regularisation method) such that \(K^+{\hat{y}}=\left( K^*K\right) ^\frac{\nu }{2}w\) for a \(w\in {\mathscr {X}}\) with \(\Vert w \Vert \le \rho \), then

$$\begin{aligned} \sup _{y^{\delta }:\Vert y^{\delta }-{\hat{y}}\Vert \le \delta }\Vert R_{\alpha (y^{\delta },\delta )}y^{\delta }-K^+{\hat{y}}\Vert \le C \rho ^\frac{1}{\nu +1}\delta ^\frac{\nu }{\nu +1} \end{aligned}$$
(2)

for some constant \(C>0\). The next theorem shows a concentration result for the discrepancy principle as implemented in Algorithm 1, with a bound similar to (2).

Theorem 4

(Rate of convergence of the discrepancy principle) Assume that K is a compact operator with dense range between Hilbert spaces \({\mathscr {X}}\) and \({\mathscr {Y}}\). Moreover, \(Y_1,Y_2,\ldots \) are i.i.d. \({\mathscr {Y}}\)-valued random variables with \({\mathbb {E}}Y_1={\hat{y}}\in {\mathscr {R}}(K)\) and \(0<{\mathbb {E}}\Vert Y_1-{\hat{y}}\Vert ^2 < \infty \). Let \(R_{\alpha }\) be induced by a filter fulfilling Assumption 1 with \(\nu _0>1\). Moreover, assume that there is a \(0<\nu \le \nu _0-1\) and a \(\rho >0\) such that \(K^+{\hat{y}}=(K^*K)^{\nu /2}w\) for some \(w\in {\mathscr {X}}\) with \(\Vert w \Vert \le \rho \). Applying Algorithm 1 with or without the emergency stop yields a sequence \((\alpha _n)_{n\in {\mathbb {N}}}\). Then there is a constant L, such that

$$\begin{aligned} {\mathbb {P}}\left( \Vert R_{\alpha _n}{\bar{Y}}_n - K^+{\hat{y}}\Vert \le L \rho ^\frac{1}{\nu +1} \max \left\{ {\delta _n^{est}}^\frac{\nu }{\nu +1},{\delta _n^{true}}^\frac{\nu }{\nu +1}\left( \delta _n^{true}/\delta _n^{est}\right) ^\frac{1}{\nu +1}\right\} \right) {\mathop {\longrightarrow }\limits ^{n\rightarrow \infty }}1. \end{aligned}$$

From this we deduce a bound for \(\Vert R_{\alpha _n}\bar{Y_n}-K^+{\hat{y}}\Vert \) in terms of n alone, holding with probability tending to one.

Corollary 2

Under the assumptions of Theorem 4, for all \(\varepsilon >0\) it holds that

$$\begin{aligned} {\mathbb {P}}\left( \Vert R_{\alpha _n}{\bar{Y}}_n - K^+{\hat{y}}\Vert \le \rho ^\frac{1}{\nu +1} \left( \frac{1}{\sqrt{n}}\right) ^{\frac{\nu }{\nu +1}-\varepsilon }\right) {\mathop {\longrightarrow }\limits ^{n\rightarrow \infty }}1. \end{aligned}$$

Proof (Corollary 2)

By the second assertion in Lemma 1 and Markov’s inequality, for any \(c,\varepsilon >0\),

$$\begin{aligned} \lim _{n\rightarrow \infty }{\mathbb {P}}\left( \delta _n^{est},\delta _n^{true}\le cn^{-\frac{1}{2}+\varepsilon }\right) =1. \end{aligned}$$

Moreover, \(\sqrt{n}\,\delta _n^{est}\rightarrow \gamma >0\) in probability by Lemma 1, so that \(\delta _n^{true}/\delta _n^{est}\) grows at most like \(n^{\varepsilon }\) with probability tending to one. Inserting these bounds into Theorem 4 and absorbing all constants into the exponent \(\varepsilon \) yields the claim. \(\square \)

The ad hoc emergency stop \(\alpha _n\ge 1/n\) additionally ensures that the \(L^2\) error will not explode (unlike in the counter example of the previous subsection). Under the assumption that \({\mathbb {E}}\Vert Y_1-{\hat{y}}\Vert ^4<\infty \), one can guarantee that the \(L^2\) error converges.

Corollary 3

Under the assumptions of Theorem 3, consider the sequence \(\alpha _n\) determined by Algorithm 1 with emergency stop. Then there is a constant C such that \({\mathbb {E}}\Vert R_{\alpha _n}{\bar{Y}}_n-K^+{\hat{y}}\Vert ^2\le C\) for all \(n\in {\mathbb {N}}\). If additionally \({\mathbb {E}}\Vert Y_1-{\hat{y}}\Vert ^4<\infty \), then it holds that \({\mathbb {E}}\Vert R_{\alpha _n} {\bar{Y}}_n -K^+{\hat{y}}\Vert ^2 \rightarrow 0\) for \(n\rightarrow \infty \).

3.3 Almost sure convergence

The results so far delivered either convergence in probability or convergence in \(L^2\). We give a short remark on how one can obtain almost sure convergence. Roughly speaking, one has to multiply \(\delta _n^{est}\) by a \(\sqrt{\log \log n}\) factor. This is a simple consequence of the following theorem.

Theorem 5

(Law of the iterated logarithm) Assume that \(Y_1,Y_2,\ldots \) is an i.i.d. sequence with values in some separable Hilbert space \({\mathscr {Y}}\). Moreover, assume that \({\mathbb {E}}Y_1 = 0\) and \({\mathbb {E}}\Vert Y_1\Vert ^2<\infty \). Then we have that

$$\begin{aligned} {\mathbb {P}}\left( \limsup _{n\rightarrow \infty } \frac{\Vert \sum _{i\le n} Y_i \Vert }{\sqrt{2 {\mathbb {E}}\Vert Y_1 \Vert ^2n\log \log n}}\le 1\right) = 1. \end{aligned}$$

Proof

This is a simple consequence of Corollary 8.8 in [24]. \(\square \)

So if \({\mathbb {E}}Y_1 = {\hat{y}} \in {\mathscr {Y}}\) we have for \(\delta _n^{true}=\Vert {\bar{Y}}_n-{\hat{y}}\Vert \)

$$\begin{aligned} {\mathbb {P}}\left( \limsup _{n\rightarrow \infty } \frac{\sqrt{n}\delta _n^{true}}{\sqrt{2{\mathbb {E}}\Vert Y_1-{\hat{y}}\Vert ^2\log \log n}} \le 1\right) =1, \end{aligned}$$

that is, with probability 1 it holds that \(\delta _n^{true}\le \sqrt{\frac{2{\mathbb {E}}\Vert Y_1-{\hat{y}}\Vert ^2\log \log n}{n}}\) for n large enough. Consequently, for some \(\tau >1\) the estimator should be

$$\begin{aligned} \delta _n^{est}:= \tau s_n \sqrt{\frac{2\log \log n}{n}}, \end{aligned}$$

where \( s_n\) is the square root of the sample variance. Since \({\mathbb {P}}(\lim _{n\rightarrow \infty } s_n^2={\mathbb {E}}\Vert Y_1-{\hat{y}}\Vert ^2)=1\) and \(\tau >1\), it holds that \(\sqrt{{\mathbb {E}}\Vert Y_1-{\hat{y}}\Vert ^2}\le \tau s_n\) for n large enough with probability 1 and thus \(\delta _n^{true}\le \delta _n^{est}\) for n large enough with probability 1. In other words, there is an event \(\varOmega _0 \subset \varOmega \) with \({\mathbb {P}}(\varOmega _0)=1\) such that for any \(\omega \in \varOmega _0\) there is an \(N(\omega )\in {\mathbb {N}}\) with \(\delta _n^{true}(\omega )\le \delta _n^{est}(\omega )\) for all \(n\ge N(\omega )\). So we can use \({\bar{Y}}_n\) and \(\delta _n^{est}\) together with any deterministic regularisation method to get almost sure convergence.
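A sketch of this modified estimator (the concrete value of \(\tau \) is a free choice; \(\tau =1.5\) below is only for illustration):

import numpy as np

def delta_est_almost_sure(Y, tau=1.5):
    """delta_est = tau * s_n * sqrt(2 log log n / n) for measurements Y of shape (n, m); needs n >= 3."""
    n = Y.shape[0]
    s_n = np.sqrt(np.sum(np.linalg.norm(Y - Y.mean(axis=0), axis=1) ** 2) / (n - 1))
    return tau * s_n * np.sqrt(2.0 * np.log(np.log(n)) / n)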

4 Proofs of Theorems 3 and 4

4.1 Proofs without the emergency stop

We will repeatedly use the Pythagorean theorem for independent separable Hilbert space valued random variables \(Z_i\) with \({\mathbb {E}}\Vert Z_i\Vert ^2<\infty \) and \({\mathbb {E}}Z_i=0\),

$$\begin{aligned} {\mathbb {E}}\left\Vert \sum _{i=1}^n Z_i \right\Vert ^2 = \sum _{i,i^{\prime }=1}^n \sum _{l=1}^\infty {\mathbb {E}}\left[ (Z_i,e_l)(Z_{i^{\prime }},e_l)\right] = \sum _{i=1}^n{\mathbb {E}}\left[ \sum _{l=1}^\infty (Z_i,e_l)^2\right] =\sum _{i=1}^n {\mathbb {E}}\left\Vert Z_i \right\Vert ^2, \end{aligned}$$
(3)

where \((e_l)_{l\in {\mathbb {N}}}\) is an orthonormal basis. Based on this, the central ingredient will be the following lemma, which strengthens the pointwise worst case error bound \(\Vert (KR_{\alpha }-Id)({\bar{Y}}_n-{\hat{y}})\Vert \le C_0 \delta _n^{true}\) in some sense.

Lemma 1

For all \(\varepsilon >0\) and (deterministic) sequences \((q_n)_{n\in {\mathbb {N}}}\) with \(q_n>0\) and \(\lim _{n\rightarrow \infty }q_n=0\), it holds that

$$\begin{aligned} {\mathbb {P}}\left( \Vert (KR_{q_n}-Id)({\bar{Y}}_n-{\hat{y}})\Vert \ge \varepsilon /\sqrt{n}\right) \rightarrow 0 \end{aligned}$$

and

$$\begin{aligned} {\mathbb {P}}\left( |\sqrt{n}\delta _n^{est}-\gamma |\ge \varepsilon \right) \rightarrow 0 \end{aligned}$$

for \(n\rightarrow \infty \), where \(\gamma =1\) if we use \(\delta _n^{est}=1/\sqrt{n}\) and \(\gamma =\sqrt{{\mathbb {E}}\Vert Y_1-{\hat{y}}\Vert ^2}\) if we use the sample variance.

Proof

By Chebyshev’s inequality and (3)

$$\begin{aligned} {\mathbb {P}}\left( \Vert (KR_{q_n}-Id)({\bar{Y}}_n-{\hat{y}})\Vert \ge \varepsilon /\sqrt{n}\right)&\le \frac{n}{\varepsilon ^2}{\mathbb {E}}\Vert (KR_{q_n}-Id)({\bar{Y}}_n-{\hat{y}})\Vert ^2\\&= \frac{1}{\varepsilon ^2}{\mathbb {E}}\Vert (KR_{q_n}-Id)(Y_1-{\hat{y}})\Vert ^2. \end{aligned}$$

Since K has dense range, \(KR_{q_n}-Id\) converges to 0 pointwise for \(n\rightarrow \infty \) and it follows that \((KR_{q_n}-Id)(Y_1-{\hat{y}})\) also converges pointwise to 0. By inequality (6) of Proposition 1 below, \(\Vert (KR_{q_n}-Id)(Y_1-{\hat{y}}) \Vert ^2 \le C_0^2 \Vert Y_1-{\hat{y}}\Vert ^2\), so \({\mathbb {E}}\Vert (KR_{q_n}-Id)(Y_1-{\hat{y}})\Vert ^2\rightarrow 0\) for \(n\rightarrow \infty \) by the dominated convergence theorem. The second assertion only needs a proof for \(\gamma =\sqrt{{\mathbb {E}}\Vert Y_1-{\hat{y}}\Vert ^2}\) and then

$$\begin{aligned} n {\delta _n^{est}}^2&= \frac{1}{n-1} \sum _{i=1}^n\Vert Y_i-{\bar{Y}}_n\Vert ^2 = \frac{n}{n-1} \left( \frac{1}{n}\sum _{i=1}^n\Vert Y_i\Vert ^2- \Vert {\bar{Y}}_n\Vert ^2\right) \\&\rightarrow {\mathbb {E}}\Vert Y_1\Vert ^2-\Vert {\hat{y}}\Vert ^2 ={\mathbb {E}}\Vert Y_1-{\hat{y}}\Vert ^2= \gamma ^2 \end{aligned}$$

almost surely (thus in particular in probability) for \(n\rightarrow \infty \) by the strong law of large numbers (Corollary 7.10 in [24]) and the bias-variance-decomposition. Therefore \(\sqrt{n}\delta _n^{est}\rightarrow \gamma \) in probability for \(n\rightarrow \infty \). \(\square \)

For convergence in probability it does not matter how large the error is on sets of diminishing probability, and with Lemma 1 we will show that the probability of certain ‘good events’ tends to 1 in the limit of infinitely many measurements.

Define for \(q\in (0,1)\) (as chosen in Algorithm 1)

$$\begin{aligned} \psi _q:~&{\mathbb {R}}^+ \rightarrow \left\{ q^k~:~k\in {\mathbb {N}}_0\right\} \nonumber \\&\alpha \mapsto \max \left\{ q^k: q^k\le \alpha \right\} . \end{aligned}$$
(4)

So \(\min (q\alpha ,1)\le \psi _q(\alpha )\le \alpha \) and, by definition, if \(\Vert \left( KR_{\psi _q(\alpha )}-Id\right) {\bar{Y}}_n\Vert <\delta _n^{est}\), then \(\alpha _n\ge \min (q\alpha ,1)\), where \(\alpha _n\) is the output of Algorithm 1.
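In code, \(\psi _q\) can be evaluated directly (a small helper for \(\alpha >0\) and \(0<q<1\)):

import math

def psi_q(alpha, q):
    """psi_q(alpha) = max{ q^k : k in N_0, q^k <= alpha }."""
    if alpha >= 1.0:
        return 1.0
    k = math.ceil(math.log(alpha) / math.log(q))   # smallest k with q^k <= alpha
    return q ** k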

We will also need some well known properties of regularisations defined by filters which fulfill Assumption 1. These are mostly easy modifications from [12].

Proposition 1

The constants in the following are defined as in Assumption 1. We assume that K is bounded and linear with non-closed range. Assume that \((R_{\alpha })_{\alpha >0}\) is induced by a regularising filter fulfilling \(|F_{\alpha }(\lambda )|\le C_F/\alpha \) for all \(0<\lambda \le \Vert K\Vert ^2\). Then

$$\begin{aligned} \Vert R_{\alpha }\Vert&\le \sqrt{C_RC_F}/\sqrt{\alpha } \end{aligned}$$
(5)
$$\begin{aligned} \Vert Id-KR_{\alpha }\Vert&\le C_0 \end{aligned}$$
(6)

for all \(\alpha >0\), with \(C_0\ge 1\). If moreover, the filter has qualification \(\nu _0>0\) and there is a \(w \in {\mathscr {X}}\) with \(\Vert w \Vert \le \rho \) such that \(K^+{\hat{y}}=\left( K^*K\right) ^\frac{\nu }{2}w\) for some \(0<\nu \le \nu _0\), then

$$\begin{aligned} \Vert R_{\alpha }{\hat{y}}-K^+{\hat{y}}\Vert&\le C_{\nu } \rho \alpha ^{\nu /2} \end{aligned}$$
(7)
$$\begin{aligned} \Vert R_{\alpha }{\hat{y}}-K^+{\hat{y}}\Vert&\le \Vert KR_{\alpha }{\hat{y}}-KK^+{\hat{y}}\Vert ^\frac{\nu }{\nu +1}C_0^\frac{1}{\nu +1}\rho ^\frac{1}{\nu +1} \end{aligned}$$
(8)

for all \(\alpha >0\). If additionally, \(\nu _0\ge \nu +1>1\), then

$$\begin{aligned} \Vert KR_{\alpha }{\hat{y}}-KK^+{\hat{y}}\Vert \le C_{\nu +1}\rho \alpha ^\frac{\nu +1}{2}. \end{aligned}$$
(9)

Moreover, if K is compact, then for all \(x\in {\mathscr {X}}\) there is a function \(g:{\mathbb {R}}^+\rightarrow {\mathbb {R}}^+\) with \(g(\alpha )\rightarrow \infty \) for \(\alpha \rightarrow 0\), such that

$$\begin{aligned} \lim _{\alpha \rightarrow 0}\Vert (KR_{\psi _q\left( \alpha g(\alpha )\right) }-Id)Kx\Vert /\sqrt{\alpha } = 0, \end{aligned}$$
(10)

where \(\psi _q\) is given in (4).

Proof (Proposition 1)

(5) and (8) are shown in the proofs of Theorem 4.2 and Theorem 4.17 in [12]. (7) and (9) are Theorem 4.3 in [12]. (6) follows directly from Assumption 1.

For (10), let \(x\in {\mathscr {X}}\) be fixed and set

$$\begin{aligned} {\tilde{g}}(\alpha ):=\sup \left\{ t>0~:~\left\Vert \left( KR_{\psi _q\left( \alpha t\right) }-Id\right) Kx\right\Vert /\sqrt{\alpha }\le t^{-1}\right\} . \end{aligned}$$

W.l.o.g. \({\tilde{g}}\) is finite for any \(\alpha >0\). Now we first show that

$$\begin{aligned} \lim _{\alpha \rightarrow 0}\Vert (KR_{\alpha }-Id)Kx\Vert /\sqrt{\alpha }=0. \end{aligned}$$
(11)

We mimic the proof of Theorem 3.1.17 of [31] and let \(\varepsilon >0\). Writing \({\hat{x}}:=x\), we fix L such that \(C_1^2 \sum _{l=L+1}^\infty ({\hat{x}},v_l)^2<\varepsilon \). Then

$$\begin{aligned} \Vert (KR_{\alpha }-Id)K{\hat{x}}\Vert ^2/\alpha&= \sum _{l=1}^\infty \left( F_{\alpha }(\sigma _l^2)\sigma _l^2-1\right) ^2 \frac{\sigma _l^2}{\alpha }({\hat{x}},v_l)^2\\&\le \left( \sup _{\lambda>0} \lambda ^\frac{\nu _0}{2}|F_{\alpha }(\lambda )\lambda -1|\right) ^2\Vert {\hat{x}}\Vert ^2\sum _{l=1}^L \frac{\sigma _l^{2(1-\nu _0)}}{\alpha }\\&\quad + \left( \sup _{\lambda >0} \lambda ^\frac{1}{2}|F_{\alpha }(\lambda )\lambda -1|\right) ^2\frac{\sum _{l=L+1}^\infty ({\hat{x}},v_l)^2}{\alpha }\\&\le C_{\nu _0}^2L \sigma _L^{2(1-\nu _0)}\Vert {\hat{x}}\Vert ^2\alpha ^{\nu _0-1}+C_1^2 \sum _{l=L+1}^\infty ({\hat{x}},v_l)^2< 2 \varepsilon \end{aligned}$$

for all \(\alpha <\left( \varepsilon ^{-1} C_{\nu _0}^2L\sigma _L^{2(1-\nu _0)}\Vert {\hat{x}}\Vert ^2\right) ^{-\frac{1}{\nu _0-1}}\), therefore \(\Vert (KR_{\alpha }-Id)Kx\Vert /\sqrt{\alpha }\rightarrow 0\) for \(\alpha \rightarrow 0\), i.e. (11) holds. So for any \(t>0\)

$$\begin{aligned} \left\Vert \left( KR_{\psi _q\left( \alpha t\right) }-Id\right) Kx\right\Vert /\sqrt{\alpha }&= \sqrt{\frac{\psi _q\left( \alpha t\right) }{\alpha }}\left\Vert \left( KR_{\psi _q\left( \alpha t\right) }-Id\right) Kx\right\Vert /\sqrt{\psi _q\left( \alpha t\right) }\le \frac{1}{t} \end{aligned}$$

for \(\alpha \) small enough, because of (11) and since \(\psi _q(\alpha t)\le \alpha t\). So \({\tilde{g}}(\alpha )\rightarrow \infty \) for \(\alpha \rightarrow 0\) and by definition of \({\tilde{g}}\) the claim holds for \(g(\alpha ):={\tilde{g}}(\alpha )-1\) (g is well defined for \(\alpha \) small enough). \(\square \)

Proof (Theorem 4)

Set \(q_n:=\psi _q(b_n)\) where \(b_n:=\left( \frac{1}{\rho }\frac{\gamma }{4C_{\nu +1}\sqrt{n}}\right) ^\frac{2}{\nu +1}\) with \(\gamma =1\) or \(\gamma =\sqrt{{\mathbb {E}}\Vert Y_1-{\hat{y}}\Vert ^2}\), depending on if we used the sample variance or not, and \(\psi _q\) given in (4). Define

$$\begin{aligned} \varOmega _n:=\varOmega _n(q_n,\gamma ):=\left\{ |\sqrt{n}\delta _n^{est}-\gamma |< \gamma /2~,~\Vert (KR_{q_n}-Id)({\bar{Y}}_n-{\hat{y}})\Vert < \frac{\gamma }{4\sqrt{n}}\right\} . \end{aligned}$$
(12)

Then by (9) and since \(q_n\le b_n\),

$$\begin{aligned} \Vert (KR_{q_n}-Id){\bar{Y}}_n\Vert \chi _{\varOmega _n}&\le \Vert (KR_{q_n}-Id){\hat{y}}\Vert \chi _{\varOmega _n} + \Vert (KR_{q_n}-Id)({\bar{Y}}_n-{\hat{y}})\Vert \chi _{\varOmega _n}\nonumber \\&\le C_{\nu +1}\rho b_n^\frac{\nu +1}{2}\chi _{\varOmega _n} + \frac{\gamma }{4\sqrt{n}}\chi _{\varOmega _n} =\frac{\gamma }{2\sqrt{n}}\chi _{\varOmega _n}<\delta _n^{est}\chi _{\varOmega _n}, \end{aligned}$$
(13)

so \(\alpha _n \chi _{\varOmega _n} \ge q b_n \chi _{\varOmega _n} \ge q \left( \frac{\delta _n^{est}}{6C_{\nu +1}\rho }\right) ^\frac{2}{\nu +1}\chi _{\varOmega _n}\) for n large enough. By (6), (8) and since K has dense range,

$$\begin{aligned} \Vert R_{\alpha _n}{\hat{y}} - K^+{\hat{y}}\Vert&\le \Vert KR_{\alpha _n}{\hat{y}}-KK^+{\hat{y}}\Vert ^\frac{\nu }{\nu +1}C_0^\frac{1}{\nu +1}\rho ^\frac{1}{\nu +1}= \Vert KR_{\alpha _n}{\hat{y}}-{\hat{y}}\Vert ^\frac{\nu }{\nu +1}C_0^\frac{1}{\nu +1}\rho ^\frac{1}{\nu +1}\\&\le \left( \Vert (KR_{\alpha _n}-Id){\bar{Y}}_n\Vert + \Vert (KR_{\alpha _n}-Id)({\hat{y}} - {\bar{Y}}_n)\Vert \right) ^\frac{\nu }{\nu +1}C_0^\frac{1}{\nu +1}\rho ^\frac{1}{\nu +1}\\&\le \left( \delta _n^{est} + \Vert (KR_{\alpha _n}-Id)({\hat{y}} - {\bar{Y}}_n)\Vert \right) ^\frac{\nu }{\nu +1}C_0^\frac{1}{\nu +1}\rho ^\frac{1}{\nu +1}\\&\le \left( \delta _n^{est} + C_0\delta _n^{true}\right) ^\frac{\nu }{\nu +1}C_0^\frac{1}{\nu +1}\rho ^\frac{1}{\nu +1}\le \left( \delta _n^{est} + \delta _n^{true}\right) ^\frac{\nu }{\nu +1}C_0\rho ^\frac{1}{\nu +1} . \end{aligned}$$

Finally,

$$\begin{aligned}&\Vert R_{\alpha _n}{\bar{Y}}_n - K^+{\hat{y}}\Vert \chi _{\varOmega _n}\\&\quad \le \Vert R_{\alpha _n}{\hat{y}} - K^+{\hat{y}}\Vert \chi _{\varOmega _n} + \Vert R_{\alpha _n}{\bar{Y}}_n - R_{\alpha _n}{\hat{y}}\Vert \chi _{\varOmega _n}\\&\quad \le \left( \delta _n^{est}+\delta _n^{true} \right) ^{\frac{\nu }{\nu +1}} C_0\rho ^{\frac{1}{\nu +1}}\chi _{\varOmega _n} +\sqrt{C_RC_F} \frac{\delta _n^{true}}{\sqrt{\alpha _n}}\chi _{\varOmega _n}\\&\quad \le \left( 2\max \left( \delta _n^{est},\delta _n^{true}\right) \right) ^{\frac{\nu }{\nu +1}} C_0\rho ^{\frac{1}{\nu +1}} \chi _{\varOmega _n}+\sqrt{C_RC_F} \rho ^\frac{1}{\nu +1}\left( \frac{6C_{\nu +1}}{\delta _n^{est}}\right) ^\frac{1}{\nu +1}\frac{\delta _n^{true}}{\sqrt{q}}\chi _{\varOmega _n}\\&\quad \le L\rho ^\frac{1}{\nu +1}\max \left\{ {\delta _n^{est}}^\frac{\nu }{\nu +1},{\delta _n^{true}}^\frac{\nu }{\nu +1}\left( \frac{\delta _n^{true}}{\delta _n^{est}}\right) ^\frac{1}{\nu +1}\right\} , \end{aligned}$$

with \(L:=2^\frac{\nu }{\nu +1}C_0+\sqrt{C_RC_F/q}\left( 6C_{\nu +1}\right) ^\frac{1}{\nu +1}\) and the proof is finished, because \({\mathbb {P}}\left( \varOmega _n\right) \rightarrow 1\) for \(n\rightarrow \infty \) by Lemma 1. \(\square \)

Proof (Theorem 3)

W.l.o.g. we may assume that there are arbitrarily large \(l\in {\mathbb {N}}\) with \(({\hat{y}},u_l)\ne 0\), since otherwise we could apply Theorem 4 with any \(\nu >0\). Let \(\varepsilon ^{\prime }>0\). Then there is a \(L\in {\mathbb {N}}\) such that \(({\hat{y}},u_L)\ne 0\) and \(\left( F_{q^k}(\sigma _L^2)\sigma _L^2-1\right) ^2>1/2\) for all \(k\in {\mathbb {N}}_0\) with \(q^k\ge \varepsilon ^{\prime }\) (because the \(F_{q^k}\) are bounded and \(\sigma _l\rightarrow 0\) for \(l\rightarrow \infty \)). Set

$$\begin{aligned} \varOmega _n:=\left\{ | \sqrt{n}\delta _n^{est}-\gamma |< \gamma ~,~({\bar{Y}}_n,u_L)^2\ge ({\hat{y}},u_L)^2/2\right\} . \end{aligned}$$
(14)

Then for \(n\ge 16\gamma ^2/({\hat{y}},u_L)^2\),

$$\begin{aligned} \delta _n^{est}\chi _{\varOmega _n}&\le \frac{2\gamma }{\sqrt{n}}\chi _{\varOmega _n} < \sqrt{\frac{({\hat{y}},u_L)^2}{4}}\chi _{\varOmega _n} \le \sqrt{\left( F_{q^k}(\sigma _L^2)\sigma _L^2-1\right) ^2 ({\bar{Y}}_n,u_L)^2}\chi _{\varOmega _n}\\&\le \sqrt{ \sum _{l=1}^\infty \left( F_{q^k}(\sigma _l^2)\sigma _l^2-1\right) ^2\left( {\bar{Y}}_n,u_l\right) ^2}\chi _{\varOmega _n} = \Vert (KR_{q^k}-Id){\bar{Y}}_n\Vert \chi _{\varOmega _n} \end{aligned}$$

for all \(k\in {\mathbb {N}}_0\) with \(q^k\ge \varepsilon ^{\prime }\). Thus for \(\varOmega _n\) given in (14)

$$\begin{aligned} \lim _{n\rightarrow \infty }{\mathbb {P}}\left( \alpha _n\le \varepsilon ^{\prime }\right) \ge \lim _{n\rightarrow \infty }{\mathbb {P}}\left( \varOmega _n\right) = 1 \end{aligned}$$
(15)

by Lemma 1 and since \(({\bar{Y}}_n,u_L)=\sum _{i=1}^n(Y_i,u_L)/n\rightarrow {\mathbb {E}}(Y_1,u_L)=({\hat{y}},u_L)\ne 0\) almost surely for \(n\rightarrow \infty \). Set \(q_n:=\psi _q\left( b_n\right) \) with \(b_n:=n^{-1}g(n^{-1})\) and g and \(\psi _q\) given in (4) and (10). Define

$$\begin{aligned} \varOmega _n:=\left\{ | \sqrt{n}\delta _n^{est}-\gamma |< \gamma /2~,~\Vert (KR_{q_n}-Id)({\bar{Y}}_n-{\hat{y}})\Vert < \gamma /4\sqrt{n}\right\} . \end{aligned}$$
(16)

Then for n large enough (such that \(\Vert (KR_{q_n}-Id){\hat{y}}\Vert \sqrt{n}\le \gamma /4\), see (10) with \(\alpha =n^{-1}\)),

$$\begin{aligned}&\Vert (KR_{q_n}-Id){\bar{Y}}_n\Vert \chi _{\varOmega _n}\nonumber \\&\le \frac{1}{\sqrt{n}}\sqrt{n}\Vert (KR_{q_n}-Id){\hat{y}}\Vert \chi _{\varOmega _n} + \Vert (KR_{q_n}-Id)({\bar{Y}}_n-{\hat{y}})\Vert \chi _{\varOmega _n}\nonumber \\&\le \frac{\gamma }{4\sqrt{n}}\chi _{\varOmega _n}+ \frac{\gamma }{4\sqrt{n}}\chi _{\varOmega _n} \le \frac{\gamma }{2\sqrt{n}}\chi _{\varOmega _n}\le \delta _n^{est}\chi _{\varOmega _n}. \end{aligned}$$
(17)

That is \(\alpha _n \chi _{\varOmega _n}\ge q b_n \chi _{\varOmega _n}\ge q n^{-1}g(n^{-1}) \chi _{\varOmega _n}\) for n large enough. Finally set

$$\begin{aligned} {\tilde{\varOmega }}_n :=\left\{ \delta _n^{true}\le \sqrt{\frac{\sqrt{g(n^{-1})}}{n}}~,~\Vert R_{\alpha _n}{\hat{y}}-K^+{\hat{y}}\Vert \le \frac{\varepsilon }{2}\right\} \cap \varOmega _n, \end{aligned}$$

with \(\varOmega _n\) given in (16). So \({\mathbb {P}}\left( {\tilde{\varOmega }}_n\right) \rightarrow 1\) for \(n\rightarrow \infty \), since \({\mathbb {P}}\left( \delta _n^{true}\le \sqrt{\sqrt{g(n^{-1})}/n}\right) \rightarrow 1\), because of \(g(n^{-1})\rightarrow \infty \), \({\mathbb {P}}\left( \varOmega _n\right) \rightarrow 1\) by Lemma 1 and \({\mathbb {P}}\left( \Vert R_{\alpha _n}{\hat{y}}-K^+{\hat{y}}\Vert \le \frac{\varepsilon }{2}\right) \rightarrow 1\) by (15) (\(\varepsilon ^{\prime }>0\) is arbitrary). Thus for n large enough (so that \(C_RC_F/q\sqrt{g(n^{-1})} \le \frac{\varepsilon ^2}{4}\))

$$\begin{aligned} \Vert R_{\alpha _n}{\bar{Y}}_n-K^+{\hat{y}}\Vert \chi _{{\tilde{\varOmega }}_n}&\le \Vert R_{\alpha _n}{\hat{y}}-K^+{\hat{y}}\Vert \chi _{{\tilde{\varOmega }}_n} + \Vert R_{\alpha _n}({\bar{Y}}_n-{\hat{y}})\Vert \chi _{{\tilde{\varOmega }}_n}\\&\le \frac{\varepsilon }{2}+ \sqrt{\frac{C_RC_F}{\alpha _n}}\delta _n^{true}\chi _{{\tilde{\varOmega }}_n} \le \frac{\varepsilon }{2}+\sqrt{\frac{C_RC_F}{q\sqrt{g(n^{-1})}}} \le \varepsilon , \end{aligned}$$

and \({\mathbb {P}}\left( \Vert R_{\alpha _n}{\bar{Y}}_n-K^+{\hat{y}}\Vert \le \varepsilon \right) \ge {\mathbb {P}}\left( {\tilde{\varOmega }}_n\right) \rightarrow 1\) for \(n\rightarrow \infty \). \(\square \)

4.2 Proofs for the emergency stop case

Again, denote by \(\alpha _n\) the output of Algorithm 1 without the emergency stop. For the emergency stop, we have to consider \(\Vert R_{\max \{\alpha _n,1/n\}}{\bar{Y}}_n-K^+{\hat{y}}\Vert \). It suffices to show that \({\mathbb {P}}\left( \alpha _n\ge 1/n\right) \rightarrow 1\) for \(n\rightarrow \infty \).

First assume that \(K^+{\hat{y}}=(K^*K)^\frac{\nu }{2}w\) for some \(w\in {\mathscr {X}}\) with \(\Vert w \Vert \le \rho \) and \(0<\nu \le \nu _0-1\). With (13) it follows that

$$\begin{aligned} {\mathbb {P}}\left( \alpha _n \ge q\left( \frac{\gamma }{4 \rho C_{\nu +1} \sqrt{n}}\right) ^\frac{2}{\nu +1}\right) \ge {\mathbb {P}}\left( \varOmega _n\right) \rightarrow 1 \end{aligned}$$
(18)

for \(n\rightarrow \infty \), with \(\varOmega _n\) given in (12). Otherwise, if there are no such \(\nu , \rho \) and w, then (17) implies that

$$\begin{aligned} {\mathbb {P}}\left( \alpha _n \ge qg(n^{-1})/n\right) \ge {\mathbb {P}}\left( \varOmega _n\right) \rightarrow 1 \end{aligned}$$
(19)

for \(n\rightarrow \infty \), with \(g(n^{-1})\rightarrow \infty \) and \(\varOmega _n\) given in (16). Then (18) and (19) together yield \({\mathbb {P}}\left( \alpha _n\ge 1/n\right) \rightarrow 1\) for \(n\rightarrow \infty \) and therefore the result. \(\square \)

4.3 Proof of Corollary 3

Proof (Corollary 3)

Fix \(\varepsilon >0\). Denote by \(\alpha _n\) the output of the discrepancy principle with emergency stop and set

$$\begin{aligned} \varOmega _n:=\{ \Vert R_{\alpha _n}{\bar{Y}}_n-K^+{\hat{y}}\Vert \le \varepsilon \}. \end{aligned}$$
(20)

We have

$$\begin{aligned} \Vert R_{\alpha }{\hat{y}}-K^+{\hat{y}}\Vert \le \Vert R_{\alpha }K-Id\Vert \Vert {\hat{x}} \Vert \le C \end{aligned}$$
(21)

for all \(\alpha >0\). Using \(\Vert a+b\Vert ^2\le 2\Vert a\Vert ^2+2\Vert b\Vert ^2\),

$$\begin{aligned} {\mathbb {E}}\Vert R_{\alpha _n}{\bar{Y}}_n - K^+{\hat{y}}\Vert ^2&\le 2{\mathbb {E}}\Vert R_{\alpha _n}{\bar{Y}}_n - R_{\alpha _n}{\hat{y}}\Vert ^2 + 2{\mathbb {E}}\Vert R_{\alpha _n}{\hat{y}}-K^+{\hat{y}}\Vert ^2\\&\le 2{\mathbb {E}}\left[ \Vert R_{\alpha _n}\Vert ^2{\delta _n^{true}}^2\right] + 2C^2 \le 2C_RC_F{\mathbb {E}}\left[ {\delta _n^{true}}^2/\alpha _n\right] + 2C^2\\&\le 2nC_RC_F{\mathbb {E}}{\delta _n^{true}}^2 + 2C^2= 2C_RC_F {\mathbb {E}}\Vert Y_1-{\hat{y}}\Vert ^2 + 2C^2\le C^{\prime }, \end{aligned}$$

where \(C^{\prime }\) does not depend on n and where we used (21) and (5) in the second and third steps and \(\alpha _n\ge 1/n\) in the fourth. By (20) there holds \(\Vert R_{\alpha _n}{\bar{Y}}_n-K^+{\hat{y}}\Vert \chi _{\varOmega _n} \le \varepsilon \), so

$$\begin{aligned} {\mathbb {E}}\Vert R_{\alpha _n}{\bar{Y}}_n-K^+{\hat{y}}\Vert ^2&= {\mathbb {E}}\left[ \Vert R_{\alpha _n}{\bar{Y}}_n-K^+{\hat{y}}\Vert ^2\chi _{\varOmega _{n}}\right] + {\mathbb {E}}\left[ \Vert R_{\alpha _n}{\bar{Y}}_n-K^+{\hat{y}}\Vert ^2\chi _{\varOmega _{n}^C}\right] \\&\le \varepsilon ^2 +{\mathbb {E}}\left[ \Vert R_{\alpha _n}{\bar{Y}}_n-K^+{\hat{y}}\Vert ^2\chi _{\varOmega _{n}^C}\right] . \end{aligned}$$

We apply the Cauchy-Schwarz inequality to the second term

$$\begin{aligned} {\mathbb {E}}\left[ \Vert R_{\alpha _n}{\bar{Y}}_n-K^+{\hat{y}}\Vert ^2\chi _{\varOmega _{n}^C}\right]&\le \sqrt{{\mathbb {E}}\Vert R_{\alpha _n}{\bar{Y}}_n-K^+{\hat{y}}\Vert ^4{\mathbb {E}}\chi _{\varOmega _{n}^C}^2}\\&=\sqrt{{\mathbb {E}}\Vert R_{\alpha _n}{\bar{Y}}_n-K^+{\hat{y}}\Vert ^4~{\mathbb {P}}\left( \varOmega _{n}^C\right) } \end{aligned}$$

and we claim that there is a constant A with \({\mathbb {E}}\Vert R_{\alpha _n}{\bar{Y}}_n-K^+{\hat{y}}\Vert ^4\le A\) for all \(n\in {\mathbb {N}}\). Indeed,

$$\begin{aligned}&{\mathbb {E}}\Vert R_{\alpha _n}{\bar{Y}}_n-K^+{\hat{y}}\Vert ^4\\&\quad \le 4\left( {\mathbb {E}}\Vert R_{\alpha _n}{\bar{Y}}_n-R_{\alpha _n}{\hat{y}}\Vert ^4+2{\mathbb {E}}\left[ \Vert R_{\alpha _n}{\bar{Y}}_n-R_{\alpha _n}{\hat{y}}\Vert ^2\Vert R_{\alpha _n}{\hat{y}}-K^+{\hat{y}}\Vert ^2\right] \right. \\&\qquad \left. +{\mathbb {E}}\Vert R_{\alpha _n}{\hat{y}}-K^+{\hat{y}}\Vert ^4\right) \\&\quad \le 4\left( {\mathbb {E}}\left[ \Vert R_{\alpha _n}\Vert ^4 {\delta _n^{true}}^4\right] + 2C^2{\mathbb {E}}\left[ \Vert R_{\alpha _n}\Vert ^2 {\delta _n^{true}}^2\right] +C^4\right) \\&\quad \le B\left( {\mathbb {E}}\left[ {\delta _n^{true}}^4/\alpha _n^2\right] + {\mathbb {E}}\left[ {\delta _n^{true}}^2/\alpha _n\right] + 1\right) \end{aligned}$$

for some constant B, where we used (21) in the second step. First,

$$\begin{aligned}&{\mathbb {E}}\left[ {\delta _n^{true}}^4/\alpha _n^2\right] \\&\quad \le n^2 {\mathbb {E}}\Vert {\bar{Y}}_n-{\hat{y}}\Vert ^4 = n^2{\mathbb {E}}\left[ \sum _{j,j^{\prime }\ge 1}\left( {\bar{Y}}_n-{\hat{y}},u_j\right) ^2\left( {\bar{Y}}_n-{\hat{y}},u_{j^{\prime }}\right) ^2\right] \\&\quad = \frac{1}{n^2}\left( \sum _{j,j^{\prime }\ge 1} \sum _{i,i^{\prime },l,l^{\prime }=1}^n {\mathbb {E}}\left[ \left( Y_i-{\hat{y}},u_j\right) \left( Y_l-{\hat{y}},u_j\right) \left( Y_{i^{\prime }}-{\hat{y}},u_{j^{\prime }}\right) \left( Y_{l^{\prime }}-{\hat{y}},u_{j^{\prime }}\right) \right] \right) \\&\quad \le \frac{1}{n^2}\sum _{j,j^{\prime }\ge 1} \left( n {\mathbb {E}}\left[ \left( Y_1-{\hat{y}},u_j\right) ^2\left( Y_1-{\hat{y}},u_{j^{\prime }}\right) ^2\right] \right. \\&\qquad + n^2 {\mathbb {E}}\left[ \left( Y_1-{\hat{y}},u_j\right) ^2\right] {\mathbb {E}}\left[ \left( Y_1-{\hat{y}},u_{j^{\prime }}\right) ^2\right] \\&\qquad + \left. 2n^2 \left( {\mathbb {E}}\left[ \left( Y_1-{\hat{y}},u_j\right) \left( Y_1-{\hat{y}},u_{j^{\prime }}\right) \right] \right) ^2\right) \\&\quad \le \frac{n+2n^2}{n^2}{\mathbb {E}}\left[ \sum _{j,j^{\prime }\ge 1} \left( Y_1-{\hat{y}},u_j\right) ^2\left( Y_1-{\hat{y}},u_{j^{\prime }}\right) ^2\right] \\&\qquad +{\mathbb {E}}\left[ \sum _{j\ge 1}\left( Y_1-{\hat{y}},u_j\right) ^2\right] {\mathbb {E}}\left[ \sum _{j^{\prime }\ge 1}\left( Y_1-{\hat{y}},u_{j^{\prime }}\right) ^2\right] \\&\quad \le \frac{n+2n^2}{n^2}{\mathbb {E}}\left[ \left( \sum _{j\ge 1} \left( Y_1-{\hat{y}},u_j\right) ^2\right) ^2\right] +\left( {\mathbb {E}}\left[ \sum _{j\ge 1}\left( Y_1-{\hat{y}},u_j\right) ^2\right] \right) ^2\\&\quad = \frac{n+2n^2}{n^2}{\mathbb {E}}\Vert Y_1-{\hat{y}}\Vert ^4 + \left( {\mathbb {E}}\left[ \Vert Y_1-{\hat{y}}\Vert ^2\right] \right) ^2 \le B_1 \end{aligned}$$

for some constant \(B_1\), where in the fourth step we used that the \(Y_i\) are i.i.d., that \({\mathbb {E}}\left( Y_1-{\hat{y}},u_j\right) =\left( {\mathbb {E}}[Y_1]-{\hat{y}},u_j\right) =0\) and that \({\mathbb {E}}[XY]={\mathbb {E}}[X]{\mathbb {E}}[Y]\) for independent (and integrable) random variables (so the relevant cases are the ones where either all indices \(i,i^{\prime },l,l^{\prime }\) are equal or they coincide in exactly two pairs). Then we used Jensen’s inequality in the fifth step. Moreover, \({\mathbb {E}}\left[ {\delta _n^{true}}^2/\alpha _n\right] \le n {\mathbb {E}}\left[ {\delta _n^{true}}^2\right] = {\mathbb {E}}\Vert Y_1-{\hat{y}}\Vert ^2=B_2\), so the claim holds for \(A=B(B_1+B_2+1)\). By Theorem 3 it holds that \({\mathbb {P}}\left( \varOmega _n\right) \rightarrow 1\) for \(n\rightarrow \infty \), thus \({\mathbb {P}}\left( \varOmega _n^C\right) \le \varepsilon ^4/A\) for n large enough and

$$\begin{aligned} {\mathbb {E}}\Vert R_{\alpha _n}{\bar{Y}}_n-K^+{\hat{y}}\Vert ^2&\le \varepsilon ^2 {\mathbb {E}}[\chi _{\varOmega _n}]+\sqrt{{\mathbb {E}}\Vert R_{\alpha _n}{\bar{Y}}_n-K^+{\hat{y}}\Vert ^4~{\mathbb {P}}\left( \varOmega _{n}^C\right) } \le 2\varepsilon ^2. \end{aligned}$$

\(\square \)

5 Numerical demonstration

We conclude with some numerical results.

5.1 Differentiation of binary option prices

A natural example arises when the data is acquired by a Monte Carlo simulation; here we consider an example from mathematical finance. The buyer of a binary call option receives after T days a payoff Q if then a certain stock price \(S_T\) is higher than the strike value K. Otherwise he gets nothing. Thus the value V of the binary option depends on the expected evolution of the stock price. We denote by r the risk-free rate, at which we could have invested the buying price of the option until the expiry date T. If we already knew today for sure that the stock price will hit the strike (insider information), we would pay \(V=e^{-rT}Q\) for the binary option (\(e^{-rT}\) is called discount factor). Otherwise, if we believed that the stock price will hit the strike with probability p, we would pay \(V=e^{-rT}Qp\). In the Black-Scholes model one assumes that the relative change of the stock price in a short time interval is normally distributed, that is

$$\begin{aligned} \frac{S_{t+\delta t}-S_t}{S_t} \sim {\mathscr {N}}(\mu \delta t,\sigma ^2 \delta t). \end{aligned}$$

Under this assumption one can show that (see [22])

$$\begin{aligned} S_T = S_0 e^{sT}, \end{aligned}$$

where \(S_0\) is the initial stock price and \(s \sim {\mathscr {N}}\left( \mu -\sigma ^2/2,\sigma ^2/T\right) \). Under these assumptions one has \(V=e^{-rT}Q\varPhi (d)\), with

$$\begin{aligned} \varPhi (x):=\frac{1}{\sqrt{2\pi }}\int _{-\infty }^x e^{-\frac{\xi ^2}{2}}d\xi ,\qquad d=\frac{\log \frac{S_0}{K}+T\left( \mu -\frac{\sigma ^2}{2}\right) }{\sigma \sqrt{T}}. \end{aligned}$$

Ultimately we are interested in the sensitivity of V with respect to the starting stock price \(S_0\), that is \(\partial V(S_0)/\partial S_0\). We formulate this as the inverse problem of differentiation. Set \({\mathscr {X}}={\mathscr {Y}}=L^2([0,1])\) and define

$$\begin{aligned} K:&L^2([0,1])\rightarrow L^2([0,1])\\&f \mapsto Kf=g: x \mapsto \int _0^xf(y)dy. \end{aligned}$$

Then our true data is \({\hat{y}}=V=e^{-rT}Q\varPhi (d)\). To demonstrate our results we now approximate \(V: S_0\mapsto e^{-rT}Qp(S_0)\) through a Monte Carlo approach. That is, we generate independent Gaussian random variables \(Z_1,Z_2,\ldots \) identically distributed to s and set \(Y_i:=e^{-rT}Q \chi _{\{S_0e^{TZ_i}\ge K \}}\). Then we have \({\mathbb {E}}Y_i = e^{-rT}Q{\mathbb {P}}(S_0e^{TZ_i}\ge K)=e^{-rT}Qp(S_0)=V(S_0)\) and \({\mathbb {E}}\Vert Y_i \Vert ^2\le e^{-2rT}Q^2<\infty \). We replace \(L^2([0,1])\) with piecewise linear continuous splines on a homogeneous grid with \(m=50,000\) elements (we can calculate Kg exactly for such a spline g). We use in total \(n=10,000\) random variables for each simulation. As parameters we chose \(r=0.0001, T=30, K=0.5, Q=1, \mu = 0.01, \sigma =0.1\). It is easy to see that \({\hat{x}} =K^+{\hat{y}}\in {\mathscr {X}}_{\nu }\) for all \(\nu >0\), using the transformation \(z(\xi )=0.5e^{\sqrt{0.3}\xi -0.15}\). Since the qualification of the Tikhonov regularisation is 2, Theorem 4 gives an error bound which is asymptotically proportional to \(\left( 1/\sqrt{n}\right) ^\frac{1}{2}\). In Fig. 1 we plot the \(L^2\) average of 100 simulations of the discrepancy principle together with the (translated) optimal error bound. In this case the emergency stop did not trigger once; this is plausible, since the true solution is very smooth, which yields comparably large values of the regularisation parameter, and also the error distribution is Gaussian and the problem is only mildly ill-posed.
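For illustration, a sketch of the Monte Carlo data generation described above (with a much coarser grid than in the actual experiment; the spline projection, the regularisation step and the plotting are omitted):

import numpy as np

rng = np.random.default_rng(1)

r, T, K_strike, Q, mu, sigma = 1e-4, 30.0, 0.5, 1.0, 0.01, 0.1
m, n = 500, 10_000                              # grid for S_0 in [0,1], number of samples

S0 = np.linspace(0.0, 1.0, m)
Z = rng.normal(mu - sigma ** 2 / 2.0, sigma / np.sqrt(T), size=n)   # Z_i distributed as s

# Y_i(S_0) = e^{-rT} Q * 1{ S_0 exp(T Z_i) >= K }
Y = np.exp(-r * T) * Q * (S0[None, :] * np.exp(T * Z[:, None]) >= K_strike).astype(float)
Y_bar = Y.mean(axis=0)                          # Monte Carlo estimate of the option value V(S_0)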

Let us stress that this is only an academic example to demonstrate the possibility of using our new generic approach in the context of Monte Carlo simulations. Explicit solution formulas for standard binary options are well-known, and for more complex financial derivatives with discontinuous payoff profiles (such as autocallables or Coco-bonds) one would rather resort to stably differentiable Monte Carlo methods [2] or [14] or use specific regularization methods for numerical differentiation [18].

Fig. 1: Estimated risk of a binary option

Fig. 2: Comparison of Tikhonov regularisation with discrepancy principle (dp, Algorithm 1), discrepancy principle with emergency stop (dp + es, Algorithm 1 (optional)) and a priori choice for ‘heat’. Boxplots of the relative errors \(\Vert R_{\alpha _n}{\bar{Y}}_n-K^+{\hat{y}}\Vert /\Vert K^+{\hat{y}}\Vert \) for 200 simulations with three different sample sizes

Fig. 3: Estimated relative \(L^2\) error for ‘heat’, that is \(\sqrt{\sum _{i=1}^{200} e_i^2/200}\), where \(e_i\) is the relative error \(\Vert R_{\alpha _n}{\bar{Y}}_n-K^+{\hat{y}}\Vert /\Vert K^+{\hat{y}}\Vert \) of the i-th run

5.2 Inverse heat equation

We consider the toy problem ‘heat’ from [19]. We chose the discretisation level \(m=100\) and set \(\sigma =0.7\). Under this choice, the last seven singular values (calculated with the function ‘csvd’) fall below the machine precision of \(10^{-16}\). The discretised large systems of linear equations are solved iteratively using the conjugate gradient method (‘pcg’ from MATLAB) with a tolerance of \(10^{-8}\). As a regularisation method we chose Tikhonov regularisation and we compared the a priori choice \(\alpha _n=1/\sqrt{n}\), the discrepancy principle (dp) and the discrepancy principle with emergency stop (dp+es), as implemented in Algorithm 1 with \(q=0.7\) and estimated sample variance. The unbiased i.i.d. measurements fulfill \(\sqrt{{\mathbb {E}}\Vert Y_i-{\hat{y}}\Vert ^2}\approx 1.16\) and \({\mathbb {E}}\Vert Y_i - {\mathbb {E}}Y_i \Vert ^k=\infty \) for \(k\ge 3\). Concretely, we chose \(Y_i:={\hat{y}}+E_i\) with \(E_i:=U_i*Z_i*v\), where the \(U_i\) are independent and uniformly distributed on \([-1/2,1/2]\), the \(Z_i\) are independent Pareto distributed (MATLAB function ‘gprnd’ with parameters 1/3, 1/2 and 3/2), and v is a uniform permutation of \(1,1/2^\frac{3}{4},\ldots ,1/m^\frac{3}{4}\). Thus we chose a rather ill-posed problem together with a heavy-tailed error distribution. We considered three different sample sizes \(n=10^3,10^4,10^5\) with 200 simulations for each one. The results are presented as boxplots in Fig. 2. It is visible that the results are much more concentrated for the a priori regularisation and the discrepancy principle with emergency stop, indicating the \(L^2\) convergence (strictly speaking we do not know if the discrepancy principle with emergency stop converges in \(L^2\), since the additional assumption of Corollary 3 is violated here). Moreover, the statistics of the discrepancy principle with and without emergency stop become more similar with increasing sample size, with the crucial difference that the outliers (by which we denote the red crosses above the blue boxes, i.e. the cases where the method performed badly) are only present for the discrepancy principle without emergency stop, causing non-convergence in \(L^2\), see Fig. 3. Thus here the discrepancy principle with emergency stop is superior to the one without, in particular for large sample sizes. Besides that, the error decays more slowly for the a priori parameter choice. The number of outliers falls with increasing sample size from 37 for \(n=10^3\) to 18 for \(n=10^5\), indicating the (slow) convergence in probability of the discrepancy principle. Note that \(\delta _n^{true}/\delta _n^{est}\approx 1.9\) (on average) if we only consider the runs yielding outliers. This illustrates that the lack of convergence in \(L^2\) is caused by the occasional underestimation of the data error.
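For reproducibility, a sketch of the heavy-tailed error model described above in Python; drawing the generalised Pareto samples via the standard quantile function, and the matching of MATLAB's ‘gprnd’ parameter order as (shape, scale, location), are assumptions of this sketch.

import numpy as np

rng = np.random.default_rng(2)
m, n = 100, 1000

k_shape, scale, loc = 1.0 / 3.0, 0.5, 1.5        # parameters 1/3, 1/2, 3/2 as in the text
U = rng.uniform(-0.5, 0.5, size=n)               # U_i uniform on [-1/2, 1/2]
# generalised Pareto via the inverse CDF: x = loc + scale*((1-u)^(-k) - 1)/k
Z = loc + scale * ((1.0 - rng.uniform(size=n)) ** (-k_shape) - 1.0) / k_shape
v = rng.permutation(1.0 / np.arange(1, m + 1) ** 0.75)   # random permutation of 1, 1/2^(3/4), ..., 1/m^(3/4)

E = (U * Z)[:, None] * v[None, :]                # errors E_i = U_i Z_i v, shape (n, m)
# the measurements Y_i = y_hat + E_i are then fed, together with the sample-variance
# based estimate delta_est, into Algorithm 1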