Iteratively regularized Newton-type methods for general data misfit functionals and applications to Poisson data

We study Newton type methods for inverse problems described by nonlinear operator equations $F(u)=g$ in Banach spaces where the Newton equations $F'(u_n;u_{n+1}-u_n) = g-F(u_n)$ are regularized variationally using a general data misfit functional and a convex regularization term. This generalizes the well-known iteratively regularized Gauss-Newton method (IRGNM). We prove convergence and convergence rates as the noise level tends to 0 both for an a priori stopping rule and for a Lepski{\u\i}-type a posteriori stopping rule. Our analysis includes previous order optimal convergence rate results for the IRGNM as special cases. The main focus of this paper is on inverse problems with Poisson data where the natural data misfit functional is given by the Kullback-Leibler divergence. Two examples of such problems are discussed in detail: an inverse obstacle scattering problem with amplitude data of the far-field pattern and a phase retrieval problem. The performence of the proposed method for these problems is illustrated in numerical examples.


Introduction
This study has been motivated by applications in photonic imaging, e.g. positron emission tomography [47], deconvolution problems in astronomy and microscopy [8], phase retrieval problems [29] or semi-blind deconvolution problems, i.e. deconvolution with partially unknown convolution kernel [44]. In these problems, data consist of counts of photons which have interacted with the object of interest. The inverse problem of recovering the information on the object of interest from such photon counts can be formulated as an operator equation if one introduces an operator F : B ⊂ X → Y mapping a mathematical description u ∈ B of the object of interest to the photon density g ∈ Y ⊂ L 1 (M) on the manifold M at which measurements are taken. In this paper we focus on problems where the operator F is nonlinear. For fundamental physical reasons, photon count data are described by a Poisson process with the exact data g † as mean if read-out noise and finite averaging volume of detectors is neglected. Ignoring this a priori information often leads to non-competitive reconstruction methods.
To avoid technicalities in this introduction, let us consider a discrete version where the exact data vector g † belongs to [0, ∞) J , and g † j is the expected number of counts of the jth detector. Then the observed count data are described by a vector g obs ∈ N J 0 of J independent Poisson distributed random variables with mean g † . A continuous version will be discussed in section 6. Since − ln P[g obs |g] = − ln j e −g j g g obs Note that both S and KL are convex in their second arguments.
A standard way to solve perturbed nonlinear operator equations (1) is the Gauß-Newton method. If F denotes the Gateaux derivative of F , it is given by given by u n+1 := argmin u∈B F (u n ) + F (u n ; u − u n ) − g obs 2 . As explained above, for data errors with a non-Gaussian distribution it is in general not appropriate to use a squared norm as data misfit functional. Therefore, we will consider general data misfit functionals S : Y obs × Y → (−∞, ∞] where Y obs is a space of (possibly discrete) observations g obs .
Since inverse problems are typically ill-posed in the sense that F and its derivatives F (u n ; ·) do not have continuous inverses, regularization has to be used. Therefore, we add a proper convex penalty functional R : X → (−∞, ∞], which should be chosen to incorporate a priori knowledge about the unknown solution u † . This leads to the iteratively regularized Newton-type method u n+1 := argmin u∈B S g obs ; F (u n ) + F (u n ; u − u n ) + α n R (u) (4a) which will be analyzed in this paper. The regularization parameters α n are chosen such that α 0 ≤ 1, α n 0, 1 ≤ α n α n+1 ≤ C dec for all n ∈ N (4b) for some constant C dec , typically α n = α 0 C −n dec with C dec = 3/2. If Y = R J , F (u) = (F j (u)) j=1,...,d , and S is given by (2), we obtain the convex minimization problems u n+1 := argmin u∈Bn J j=1 F j (u n ) + F j (u n ; u − u n ) − − g obs j ln(F j (u n ) + F j (u n ; u − u n )) + α n R (u) (5) in each Newton step where B n := {u ∈ B S g obs ; F (u) + F (u n ; u − u n ) < ∞}. In principle, several methods for the solution of (5) are available. In particular we mention inverse scale space methods [13,38] for linear operator equations and total variation penalties R. EM-type methods cannot readily be used for the solution of the convex minimization problems (5) (or subproblems of the inverse scale space method as in [13]) if F (u n ; ·) is not positivity preserving as in our examples. A simple algorithm for the solution of subproblems of the type (5) is discussed in section 7. We consider the design of more efficient algorithms for minimizing the functionals (5) for large scale problems as an important problem for future research.
The most common choice of the data misfit functional is S (ĝ; g) = g −ĝ 2 Y with a Hilbert space norm · Y . This can be motivated by the case of (multi-variate) Gaussian errors. If the penalty term is also given by a Hilbert space norm R (u) = u − u 0 2 X , (4) becomes the iteratively regularized Gauss-Newton method (IRGNM) which is one of the most popular methods for solving nonlinear ill-posed operator equations [2,3,9,32]. If the penalty term u − u 0 2 X is replaced by u − u n 2 X one obtains the Levenberg-Marquardt method, which is well-known in optimization and has first been analyzed as regularization method in [21]. Recently, a generalization of the IRGNM to Banach spaces has been proposed and analyzed by Kaltenbacher & Hofmann [31].
As an alternative to (4) we mention Tikhonov-type or variational regularization methods of the form u α := argmin u∈B S g obs ; F (u) + αR (u) .
Here α > 0 is a regularization parameter. For nonlinear operators this is in general a non-convex optimization problem even if S g obs ; · and R are convex. Hence, (6) may have many local minima and it cannot be guaranteed that the global minimum can be found numerically. Let us summarize some recent convergence results on this method: Bardsley [4] shows stability and convergence for linear operators and S = KL. Benning & Burger [7] prove rates of convergence for linear operators under the special source condition F * ω ∈ ∂R(u † ). Generalizations to nonlinear operators and general variational source conditions were published simultaneously by Bot & Hofmann [12], Flemming [17], and Grasmair [20]. Given some rule to choose the stopping index n * our main results (Theorems 2.3 and 4.2) establish rates of convergence of the method (4), i.e. uniform estimates of the error of the final iterate in terms of some data noise level err for some increasing, continuous function ϕ : [0, ∞) → [0, ∞) satisfying ϕ(0) = 0. For the classical deterministic error model g obs − g ≤ δ and S g obs ; g = g − g obs r with some r ≥ 1 we have err = δ r . In this case we recover most of the known convergence results on the IRGNM for weak source conditions. Our main results imply error estimates for Poisson data provided a concentration inequality holds true. In this case err = 1 √ t where t can be interpreted as an exposure time proportional to the expected total number of photons, and an estimate of the form (7) holds true with the right hand side replaced by an expected error. As opposed to a Hilbert or Banach space setting our data misfit functional S does not necessarily fulfill a triangle inequality. Therefore, it is necessary to use more general formulations of the noise level and the tangential cone condition, which controls the degree of nonlinearity of the operator F . Both coincide with the usual assumptions if S is given by a norm. Our analysis uses variational methods rather than methods based on spectral theory, which have recently been studied in the context of inverse problems by a number of authors (see, e.g., [14,25,31,41,43]). The plan of this paper is as follows: In the following section we formulate our first main convergence theorem (Theorem 2.3) and discuss its assumptions. The proof will be given in section 3. In the following section 4 we discuss the case of additive variational inequalities and state a convergence rates result for a Lepskiȋ-type stopping rule (Theorem 4.2). In section 5 we compare our result to previous results on the iteratively regularized Gauss-Newton method. Section 6 is devoted to the special case of Poisson data, which has been our main motivation. We conclude our paper with numerical results for an inverse obstacle scattering problem and a phase retrieval problem in optics in section 7.

Assumptions and convergence theorem with a priori stopping rule
Throughout the paper we assume the following mapping and differentiability properties of the forward operator F : Assumption 1 (Assumptions on F and R): Let X and Y be Banach spaces and let B ⊂ X a convex subset.
Assume that the forward operator F : B → Y and the penalty functional R : X → (−∞, ∞] have the following properties: 3. R is proper and convex.
At interior points u ∈ B the second assumption amounts to Gateaux differentiability of F .
To motivate our assumptions on the data misfit functional, let us consider the case that g obs = F (u † ) + ξ, and ξ is Gaussian white noise on the Hilbert space Y, i.e. ξ, g ∼ N (0, g 2 ) and E ξ, g ξ,g = g,g for all g,g ∈ Y. If Y = R J , then the negative log-likelihood functional is given by S g obs ; g = g − g obs 2 2 . However, in an infinite dimensional Hilbert space Y we have g obs Y = ∞ almost surely, and S g obs ; · ≡ ∞ is obviously not a useful data misfit term. Therefore, one formally subtracts g obs 2 Y (which is independent of g) to obtain S g obs ; g := g 2 Y − 2 g obs , g Y . For exact data g † we can of course use the data misfit functional T g † ; g = g − g † 2 Y . As opposed to S, the functional T is nonnegative and does indeed describe the size of the error in the data space Y. It will play an important role in our analysis. It may seem cumbersome to work with two different types data misfit functionals S and T , and a straightforward idea to fix the free additive constant in S is to introducẽ S g obs ; g := S g obs ; g −s withs := inf g S g obs ; g . Then we obtain indeed that S g † ; g = T g † ; g . However, the expected error E S g obs ; g − s − T g † ; g 2 is not minimized for s =s, but for s = ES g obs ; g − T g † ; g = − g † 2 . Note that s depends on the unknown g † , but this does not matter since the value of s does not affect the numerical algorithms. For this choice of s the error has the convenient representation S g obs ; g + g † 2 − T g † ; g = −2 ξ, g Y . Bounds on sup g∈Ỹ | ξ, g Y | with high probabilities for certain subsetsỸ ⊂ Y (concentration inequalities) have been studied intensively in probability theory (see e.g. [35]). Such results can be used in case of Gaussian errors to show that the following deterministic error assumption holds true with high probability and uniform bounds on err(g) for g ∈Ỹ. Assumption 2 (data errors, properties of S and T ): Let u † ∈ B ⊂ X be the exact solution and denote by g † := F u † ∈ Y the exact data. Let Y obs be a set containing all possible observations and g obs ∈ Y obs the observed data. Assume that: 1. The fidelity term T : F (B) × Y → [0, ∞] with respect to exact data fulfills T g † ; g † = 0.
2. T and the fidelity term S : Y obs × Y → (−∞, ∞] with respect to noisy data are connected as follows: There exists a constant C err ≥ 1 and functionals err : Y → [0, ∞] and s : for all g ∈ Y.
2. For randomly perturbed data a general recipe for the choice of S, T and s is to define S as the log-likelihood functional, s(g † ) := E g † S g obs ; g † and T g † ; g := E g † S g obs ; g − s(g † ). Then we always have T g † ; g † = 0, but part 2. of Assumption 2 has to be verified case by case.
3. Poisson data. For discrete Poisson data we have already seen in the introduction that the general recipe of the previous point yields S given by (2), T = KL and It is easy to see that KL g † ; g ≥ 0 for all g † and g. Then (8) holds true with C err = 1 and Obviously, it will be necessary to show that err (g) is finite and even small in some sense for all g for which the inequalities (8) are applied (see section 6).
To simplify our notation we will assume in the following analysis that s ≡ 0 or equivalently replace S g obs ; g by S g obs ; g − s(g † ). As already mentioned in the motivation of Assumption 2, it is not relevant that s(g † ) is unknown since the value of this additive constant does not influence the iterates u n in (4a). Typically S and T will be convex in their second arguments, but we do not need this property in our analysis. However, without convexity it is not clear if the numerical solution of (4a) is easier than the numerical solution of (6).
Assumption 3 (Existence): For any n ∈ N the problem (4a) has a solution.
Remark 2.2. By standard arguments the following properties are sufficient to ensure existence of a solution to (4a) for convex S g obs ; · (see [17,25,40]): There are possibly weaker topologies τ X , τ Y on X , Y respectively such that 1. B is sequentially closed w.r.t. τ X , 2. F (u; ·) is sequentially continuous w.r.t. τ X and τ Y for all u ∈ B, 3. the penalty functional R : X → (−∞, ∞] is sequentially lower semi-continuous with respect to τ X , 4. the sets M R (c) := u ∈ X R (u) ≤ c are sequentially pre-compact with respect to τ X for all c ∈ R and 5. for each g obs the data misfit term S g obs ; · : Y → (−∞, ∞] is sequentially lower semi-continuous w.r.t. τ Y . Note that for our analysis we do not require that the solution to (4a) is unique or depends continuously on the data g obs even though these properties are desirable for other reasons.
Obviously, uniqueness is given if S is convex and R is strictly convex, and there are reasonable assumptions on S which guarantee continuous dependence, cf. [40].
All known convergence rate results for nonlinear ill-posed problems under weak source conditions assume some condition restricting the degree of nonlinearity of the operator F . Here we use a generalization of the tangential cone condition which was introduced in [22] and is frequently used for the analysis of regularization methods for nonlinear inverse problems. It must be said, however, that for many problems it is very difficult to show that this condition is satisfied (or not satisfied). Since S does not necessarily fulfill a triangle inequality we have to use a generalized formulation of the tangential cone condition, which follows from the standard formulation if S is given by the power of a norm (cf. Lemma 5.2). Assumption 4 (Generalized tangential cone condition): (A) There exist constants η (later assumed to be sufficiently small) and C tc ≥ 1 such that for all g obs ∈ Y obs (B) There exist constants η (later assumed to be sufficiently small) and C tc ≥ 1 such that This condition ensures that the nonlinearity of F fits together with the data misfit functionals S or T . Obviously, it is fulfilled with η = 0 and C tc = 1 if F is linear.
It is well-known that for ill-posed problems rates of convergence can only be obtained under an additional "smoothness condition" on the solution (see [16,Prop. 3.11]). In a Hilbert space setting such conditions are usually formulated as source conditions in the form ϕ is continuous and monotonically increasing with ϕ(0) = 0. Such general source conditions were systematically studied in [24,37]. The most common choices of ϕ are discussed in section 5.
To formulate similar source conditions in Banach spaces, we first have to introduce Bregman distances, which will also be used to measure the error of our approximate solutions (see [14]): Let u * ∈ ∂R u † be a subgradient (e.g.
Then the Bregman distance of R between u and u † is given by If X is a Hilbert space and R(u) for all u ∈ X (see e.g. [10]). In those cases, convergence rates w.r.t. the Bregman distance also imply rates w.r.t. the Banach space norm. Now we can formulate the following variational formulation of the source condition (10), which is a slight variation of the one proposed in [31]: Assumption 5A (Multiplicative variational source condition): There exists u * ∈ ∂R u † ⊂ X , β ≥ 0 and a concave index function ϕ : Moreover, we assume that As noted in [31] using Jensen's inequality, a Hilbert space source condition (10) for which ϕ 2 −1 is convex implies the variational inequality The tangential cone condition now shows that an inequality of type (12) is valid and hence, in a Hilbert space setup Assumption 5 is weaker than (10) at least for linear operators. As opposed to [31] we have omitted absolute values on the left hand side of (12) since they are not needed in the proofs, and this form may allow for better index functions ϕ if u † is on the boundary of B.
In many recent publications [12,17,26,43] variational source conditions in additive rather than multiplicative form have been used. Such conditions will be discussed in section 4.
Since we use a source condition with a general index function ϕ, we need to restrict the nonlinearity of F with the help of a tangential cone condition. Nevertheless, we want to mention that for ϕ (t) = t 1/2 in (12) our convergence analysis also works under a generalized Lipschitz assumption, but this lies beyond the aims of this paper. The cases ϕ (t) = t ν with ν > 1 2 where similar results are expected are not covered by Assumption 5, since for the motivation in the Hilbert space setup we needed to assume that ϕ 2 −1 is convex, which is not the case for ν > 1 2 . In our convergence analysis we will use the following two functions, which are both index functions as well as their inverses: We are now in a position to formulate our convergence result with a priori stopping rule: Theorem 2.3. Let Assumption 1, 2, 3, 4A or 4B and 5A hold true, and suppose that η, D u * R u 0 , u † and T g † ; F (u 0 ) are sufficiently small. Then the iterates u n defined by (4) with exact data g obs = g † fulfill as n → ∞. For noisy data define in case of Assumption 4A or under Assumption 4B, and choose the stopping index n * by with a sufficiently large parameter τ ≥ 1. Then (16) holds for n ≤ n * and the following convergence rates are valid:

Proof of Theorem 2.3
We will split the proof into to two main parts. For brevity we will denote Let us now start with the following in the case of 4B and in the case of 4A for all n ∈ N.
Proof. Due to (12) we have From the minimality condition (4a) with u = u † we obtain and putting (23) and (24) together we find that • In the case of 4B we use (8), which yields By (9b) with v = u n+1 , u = u n we obtain (22a).
• In the case of 4A we are able to apply (9a) with v = u † , u = u n and (9a) with v = u n+1 and u = u n to (25) to conclude .
Before we deduce the convergence rates from the recursive error estimates (22) respectively, we note some inequalities for the index functions defined in (15) and their inverses: for all t ≥ 0 and C > 0 if defined, where each inequality follows from two applications of the monotonicity assumption (13) (see [31,Remark 2]).
2. Since ϕ is concave, we have for all t sufficiently small and λ ≥ 1 (28) 3. (28) implies the following inequality for all t sufficiently small and λ ≥ 1: The following induction proof follows along the lines of a similar argument in the proof of [31, Theorem 1]: Let the assumptions of Theorem 2.3 hold. Then an estimate of the kind (22a) implies for all n ≤ n * in case of noisy data and for all n ∈ N in case of exact data where (due to η sufficiently small) Since (22b) is of the same form as (22a) (only the constants differ), (30) and (31) are (with slightly changed constants) also valid under (22b).
Proof. For n = 0 (30) and (31) are guaranteed by the assumption that d 0 and s 0 are small enough. For the induction step we observe that (22a) together with (18) and the induction hypothesis for n ≤ n * − 1 implies where C η,τ = ηC 2 (C err + 1/C err ) + 1/τ . Now we distinguish between two cases: In that case we find (28) and (29) implies The assertions now follow by 2C η,τ C dec ≤ C 1 and 2C tc C err C η,τ C 3 dec ≤ C 2 which is ensured by the definition of C 2 . Case 2: α n βd n+1 ϕ s n+1 In that case we find .
If d n+1 = 0, then this implies s n+1 = 0 and hence the assertion is trivial. By multiplying with √ s n+1 and dividing by d 2 n+1 we have Considering only the first term on the left hand side of (32) this is and by considering only the second term on the left hand side of (32) Hence, which by (29) and and hence by (33) where we used (26), C 2 ≥ 4β 2 due to C dec C tc C err ≥ 1 and Therefore, we have proven that (30) and (31) hold for all n ≤ n * (or in case of exact data for all n ∈ N).

A Lepskiȋ-type stopping rule and additive source conditions
In this section we will present a convergence rates result under the following variational source condition in additive form: Assumption 5B: There exists u * ∈ ∂R(u † ) ⊂ X , parameters β 1 ∈ [0, 1/2), β 2 > 0 (later assumed to be sufficiently small), and a strictly concave, differentiable index function ϕ satisfying ϕ (t) ∞ as t 0 such that A special case of condition (35), motivated by the benchmark condition u * = F u † * ω was first introduced in [25] to prove convergence rates of Tikhonov-type regularization in Banach spaces (see also [43]). Flemming [17] uses them to prove convergence rates for nonlinear Tikhonov regularization (6) with general S and R. Bot & Hofmann [12] prove convergence rates for general ϕ and introduce the use of Young's inequality which we will apply in the following. Finally, Hofmann & Yamamoto [26] prove equivalence in the Hilbert space case for ϕ (t) = √ t in (10) (35) in a way that one obtains order optimal rates (see [18]). Nevertheless, this can be seen much easier for multiplicative variational source conditions (see (14)). The additive structure of the variational inequality will facilitate our proof and the result will give us the possibility to apply a Lepskiȋ-type stopping rule. We remark that for s = 0 in Assumption 2 it is not clear how to formulate an implementable discrepancy principle.
Given ϕ in (35), we construct the following further index functions as in [12], which will be used in our convergence theorem: The definition (36c) ensures that √ Λ is concave, which by (4b) implies for all q ≥ 1 and n ∈ N. Since for linear problems Ψ (α n ) /α n is a bound on the approximation error (see [12]) and since for Tikhonov regularization the approximation error decays at most of the order O(α n ), we expect that t → Ψ (t)/t is "asymptotically concave" in the sense that lim t 0 Λ(t)t/Ψ (t) = 1, so we don't loose anything by replacing Ψ(t)/t by Λ(t). Indeed, it is easy to see that this is the case for logarithmic and Hölder type source conditions with ν ≤ 1, and in the latter case t → Ψ (t)/t itself is concave everywhere.
and C NL := max {2C err , C err + 1/C err }. Moreover, if η and β 2 are sufficiently small, the estimate holds true with Proof. Similar to the proof of Lemma 3.1 the assumptions imply the iterative estimate for all n ∈ N in case of of 4B and [23,Thm. 156]) with the index function ψ defined in (36a) applied to the secondlast term yields This shows that for all n ∈ N both in case 4A and in case 4B. Together with 1/(1 − β 1 ) ≤ 2 and for all n ≥ 0 which is by definition (38).
The definition of γ nl implies C 2 decγ (1 + γ nl ) ≤ γ nl and hence the assertion is shown.
Lemma 4.1 allows us to apply the Lepskiȋ balancing principle as developed in [5,6,36,37] as a posteriori stopping rule. Since the balancing principle requires a metric on X we assume that (11) holds true. As already mentioned, this is for example the case if X is a q-convex Banach space and R(u) = u q . Together with (11) and taking the q-th root it follows from Lemma 4.1 that Whereas Φ app and Φ nl are typically unknown, it is important to note that the error component Φ noi is known if an error bound err is available. Therefore, the following Lepskiȋ balancing principle can be implemented: Moreover, it is important to note that Φ noi is increasing and Φ app is decreasing. Therefore, the general theory developed in the references above can be applied, and we obtain the following convergence result: Theorem 4.2 (Convergence rates under Assumption 5B). Let the assumptions of Lemma 4.1 hold true and assume that D u * R u 0 , u † and S g † ; F (u 0 ) are sufficiently small.

exact data:
Then the iterates (u n ) defined by (4) with exact data g obs = g † fulfill 2. a priori stopping rule: For noisy data and the stopping rule n * := min n ∈ N Ψ (α n ) ≤ err with Ψ defined in (36b) we obtain the convergence rate 3. Lepskiȋ-type stopping rule: Assume that (11) holds true. Then the Lepskiȋ balancing principle (41b) with c = C 1 q bd 4 (1 + γ nl ) leads to the convergence rate Proof. By (38) and (39) we find d 2 n ≤ (1 + γ nl ) (Φ app (n) + Φ noi (n)) which implies part 1 and Using the definition of n * and (37) we have Using the definition of n * again we obtain α n * ≤ Ψ −1 (err). Putting these estimates together yields (43). To prove part 3 assume that err is sufficiently small in the following. We use again d 2 n ≤ (1 + γ nl ) (Φ app (n) + Φ noi (n)), which yields by (11) the estimate This is the case if N max is sufficiently large which holds true for sufficiently small err as assumed. Thus by (37) If we can show that n * ∈ {1, ..., N max } we obtain the assertion as in part 2. Since by definition α n * −1 > Ψ −1 (err), we have and hence n * ≤ N max if err is sufficiently small.

Relation to previous results
The most commonly used source conditions are Hölder-type and logarithmic source conditions, which correspond to respectively. For a number of inverse problems such source conditions have been shown to be equivalent to natural smoothness assumptions on the solution in terms of Sobolev space regularity (see [16,28]). We have restricted the range of Hölder indices to ν ∈ (0, 1/2] since for ν > 1/2 the monotonicity assumption (13) is violated. By computing the second derivative, one can easily see that the functionsφ p are concave on the interval [0, exp(−p − 1)], and condition (13) is trivial. If necessary, the functionsφ p can be extended to concave functions on [0, ∞) by suitable affine linear function on (exp(−p − 1), ∞). We note the explicit form of the abstract error estimates (19) for these classes of source conditions as a corollary:  (12) is of the form (44a) and n * := min n ∈ N α n ≤ τ err 1 1+2ν n with τ ≥ 1 sufficiently large, then
In the case of logarithmic source conditions we have Θ (t) = t ·φ 2p (t) . The function Θ −1 does not have an algebraic representation, but its asymptotic behavior at 0 can be computed: Note that the proposed stopping rulen * , which can be implemented without knowledge of the smoothness index p, deviates from the stopping rule n * := min n ∈ N α nφ2p (α n ) ≤ τ err n proposed in Theorem 2.3. Asymptotically we have n * >n * , and hence (16) holds for n =n * . Therefore, we still get the optimal rates since Recall from section 2 that we can choose err ≡ δ r if g obs − g † Y ≤ δ and S (g 2 ; g 1 ) = g 1 − g 2 r Y , T = S with r ∈ [1, ∞). In particular, if X and Y are Hilbert spaces, r = 2 and R = u − u 0 2 for some u 0 ∈ X , then (45a) and (46a) translate into the rates respectively, for δ → 0 (see, e.g., [32]), which are known to be optimal for linear inverse problems. It remains to discuss the relation of Assumption 4 to the standard tangential cone condition: withη ≥ 0 sufficiently small, then Assumptions 4A and 4B are satisfied.

Convergence analysis for Poisson data
In this section we discuss the application of our results to inverse problems with Poisson data. We first describe a natural continuous setting involving Poisson processes (see e.g. [1]). The relation to the finite dimensional setting discussed in the introduction is described at the end of this section. Recall that a Poisson process with intensity g ∈ L 1 (M) on some submanifold M ⊂ R d can be described as a random finite set of points {x 1 , . . . , x N } ⊂ M written as random measure G = N n=1 δ xn such that the following conditions are satisfied: Actually, the first condition can be replaced by the weaker assumption that EG(M ) = Poisson process G with intensity g and a measurable function ψ : M → R the following equalities hold true whenever the integrals on the right hand sides exist (see [33]): We also introduce an exposure time t > 0. Our convergence results will describe reconstruction errors in the limit t → ∞. Assume the dataG t are drawn from a Poisson process with intensity tg † and define G t := 1 tG t . The negative log-likelihood functional is given by We set ln 0 := −∞, so S (G t ; g) = ∞ if g(x n ) = 0 for some n = 1, . . . , N . Using (48) we obtain the following formulas for the mean and variance of S (G t ; g) if the integrals on the right hand side exist: The term s( and g † ≥ 0 as assumed below (see e.g. [46,Lemma 2.2]). Abbreviating the set {x ∈ M : g † (x) > 0} by {g † > 0} we set It can be shown that the integral is well-defined, possibly taking the value +∞, i.e. the negative part of −g † ln(g † /g) is integrable if g, g † ∈ L 1 (M) and g, g † ≥ 0 (see e.g. [46, Lemma 2.2]). We find that Assumption 2 holds true with C err = 1 and This motivates the following assumption: Assumption P: With the notation of Assumption 1 assume that 2. For a subsetỸ ⊂ Y specified later there exist constants ρ 0 , t 0 > 0 and a strictly monotonically decreasing function ζ : (ρ 0 , ∞) → [0, 1] fulfilling lim ρ→∞ ζ(ρ) = 0 such that for all ρ > ρ 0 and all t > t 0 .
It remains to discuss the concentration inequality (53). A general result of this type, which can be seen as an analog to Talagrand's inequalities for empirical processes, has been shown by Reynaud-Bouret [42,Corollary 2]. She proved that for a Poisson process G with intensity g ∈ L 1 (M) and a countable family of functions {f n } n∈N with values in [−b, b] the random variable Z := sup n∈N f n (dG − g dx ) satisfy the concentration inequality for all ρ, > 0 with v 0 := sup n∈N f 2 n g dx and κ( ) = 5/4 + 32/ . We can apply this result with G = tG t and g = tg † ifỸ is separable and ln(g) ∞ ≤ b for all g ∈Ỹ. Under additional regularity assumptions (e.g. M Lipschitz domain and sup{ ln(g) H s : g ∈Ỹ} < ∞ with s > dim(M)/2) it can be shown that E(Z) ≤ C/ √ t (see [48, sec. 4.1]). This yields a concentration inequality of the form (53) with ζ(ρ) := exp(−cρ) for some c > 0. An essential restriction of Reynaud-Bouret's concentration inequality in our context is the assumption ln(g) ∞ ≤ b for all g ∈Ỹ. This does not allow for zeros of F (u) even on sets of measure 0 if F (u) is continuous, which is a very restrictive assumption. Therefore, we introduce the following shifted version of the Kullback-Leibler divergence (3) involving an offset parameter σ ≥ 0 and a side-constraint g ≥ − σ 2 : Note that (51) and (55) coincide for σ = 0. Correspondingly, we choose as data misfit functional in (4a). Setting s(g † ) := M [g † − (g † + σ) ln(g † + σ)] dx, Assumption 2 is satisfied with Remark 6.1 (Assumptions 5A and 5B (source conditions)). Using the inequality (see [11, Lemma 2.2 (a)]), Assumption 5A/B with T (g 1 ; g 2 ) = g 1 − g 2 2 L 2 imply Assumption 5A/B with T (g 1 ; g 2 ) = KL (g 1 ; g 2 ) if F (B) is bounded in L ∞ (M). However, Assumptions 5A/B with T (g 1 ; g 2 ) = KL (g 1 ; g 2 ) may be fulfilled with a better index function ϕ if F (u † ) is close to 0 in parts of the domain.
Before we state our convergence result, we introduce the smallest concave function larger than the rate function in Theorem 4.2: From the case of Hölder-type source conditions we expect thatφ will typically coincide with Λ • Ψ −1 at least in a neighborhood of 0 (see e.g. [26,Prop. 4.3]).
Corollary 6.2. Let the Assumptions 1, 3 and 5B hold true. Moreover, assume that one of the following conditions is satisfied: • Assumptions 4A and P hold true with S and T given by (49) and (51) andỸ = F (B).
• Assumptions 4B and P hold true with T and S given by (55) and (56) and Suppose that β 2 is sufficiently small, B is bounded and R is chosen such that (11) holds true, and Lepskiȋ's balancing principle (41) is applied with c = C 1 q bd 4 (1 + γ nl ) and with a sufficiently large parameter τ (a lower will be given in the proof ). Then we obtain the following convergence rate in expectation: Proof. In the case of Assumption 4A and σ = 0, we find that Assumption 2 holds true with err defined by (52). Assumption P implies that the terms err n defined by (17a) in Theorem 2.3 satisfy for all ρ > ρ 0 and t > t 0 with τ := 1 + 2ηC tc + C tc due to C err = 1. To show the analogous estimate in the case of Assumption 4B, recall that Assumption 2 holds true with err defined by (57). From the variational characterization of u n+1 it follows that Moreover, from Assumption 4B we conclude that This yields the inequality (60) with τ := 2 also for err n defined by (17b) using Assumption P. By virtue of (60) the sets E ρ := sup n∈N 0 err n ≤ τ ρ Recall that ζ is monotonically decreasing and define ρ (t) := ζ −1 1/ √ t where we assume t to be sufficiently large. We have . (63) Now we can apply Theorem 4.2 to obtain the error bound with some constant C 1 > 0 for all sufficiently large t. In the last inequality we have used the concavity ofφ. Plugging this into (63) yields Sinceφ is concave, there exists C 2 > 0 such that s ≤ C 2φ (s) for all sufficiently small s > 0. Moreover, 1 √ t in the second term is bounded by 1 ρ 0 , and thus we obtain the assertion (59).
If ζ (ρ) = exp (−cρ) for some c > 0 as discussed above, then our convergence rates result (59) means that we have to pay a logarithmic factor for adaptation to unknown smoothness by the Lepskiȋ principle. It is known (see [45]) that in some cases such a logarithmic factor is inevitable. The most important issue is the verification of Assumption P. In case of Assumption 4A this follows from the results discussed above only under the restrictive assumption that F (u) is uniformly bounded away from 0 for all u ∈ B. On the other hand for the case of Assumption 4B we find that Assumption P is satisfied under the mild condition sup u,v∈B Binning. Let us discuss the relation between the discrete data model discussed in the introduction and the continuous model above. Consider a decomposition of the measurement manifold M into J measurable disjoint subdomains (bins) of positive measure The discrete data model above can be treated in the framework of our analysis by choosing S g obs ; g := S J 1 t g obs ; S J g , s(g † ) := S J S J g † ; S J g † , and T := KL ∞ . Then Assumption 2 holds true with if S J g ≥ 0, {j : (S J g) j = 0, (Sg † ) j + g obs j > 0} = ∅ and err(g) := ∞ else. To achieve convergence, the binning has to be refined as t → ∞. The binning should be chosen such that the second term on the right hand side of (64) (the discretization error) is dominated by the first term (the stochastic error) such that the reconstruction error is determined by the number of observed photons rather than discretization effects.

applications and computed examples
Solution of the convex subproblems. We first describe a simple strategy to minimize the convex functional (4a) with S as defined in (56) in each Newton step. For the moment we neglect the side condition g ≥ −σ/2 in (56). For simplicity we further assume that R is quadratic, e.g. R(u) = u − u 0 2 . We approximate S g obs ; g + h by the second order Taylor expansion S (2) [g obs ; g](h) := S g obs ; g + M 1 − g obs + σ g + σ h + 1 2 and define an inner iteration h n,l := argmin h S (2) g obs ; F (u n ) + F [u n ](u n,l − u n ); (h) + α n R(u n,l + h) for l = 0, 1, . . . with u n,0 := u n and u n,l+1 := u n,l + s n,l h n,l . Here the step-length parameter s n,l is chosen as the largest s ∈ [0, 1] for which sF [u n ] ≥ −ησ − F (u n ) with a tuning parameter η ∈ [0, 1) (typically η = 0.9). This choice of s n,l ensures that (65) is a reasonable approximation to (4a), and η = 1/2 ensures that u n,l+1 satisfies the side condition in (56). It follows from the first order optimality conditions, which are necessary and sufficient due to strict convexity here, that u n,l = u n,l+1 is the exact solution u n+1 of (4a) if h n,l = 0. Therefore, we stop the inner iteration if h n,l / h n,0 is sufficiently small. We also stop the inner iteration if s n,l is 0 or too small. Simplifying and omitting terms independent of h we can write (65) as a least squares problem with g n,l := F (u n ) + F [u n ](u n,l − u n ). (66) is solved by the CG method applied to the normal equation.
In the examples below we observed fast convergence of the inner iteration (65). In the phase retrieval problem we had problems with the convergence of the CG iteration when α n becomes too small. If the offset parameter σ becomes too small or if σ = 0 convergence deteriorates in general. This is not surprising since the iteration (65) cannot be expected to converge to the exact solution u n+1 of (4a) if the side condition F (u n ) + F (u n ; u n+1 − u n ) ≥ −σ/2 is active at u n+1 . The design of efficient algorithms for this case will be addressed in future research.
An inverse obstacle scattering problem without phase information. The scattering of polarized, transverse magnetic (TM) time harmonic electromagnetic waves by a perfect cylindrical conductor with smooth cross section D ⊂ R 2 is described by the Here D is compact, R 2 \ D is connected, n is the outer normal vector on ∂D, and u i = exp(ikx · d) is a plane incident wave with direction d ∈ {x ∈ R 2 : |x| = 1}. This is a classical obstacle scattering problems, and we refer to the monograph [15] for further details and references. The Sommerfeld radiation condition (67c) implies the asymptotic behavior as |x| → ∞, and u ∞ is called the far field pattern or scattering amplitude of u s . We consider the inverse problem to recover the shape of the obstacle D from photon counts of the scattered electromagnetic field far away from the obstacle. Since the photon density is proportional to the squared absolute value of the electric field, we have no immediate access to the phase of the electromagnetic field. Since at large distances the photon density is approximately proportional to |u ∞ | 2 , our inverse problem is described by the operator equation A similar problem is studied with different methods and noise models by Ivanyshyn & Kress [30]. Recall that |u ∞ | is invariant under translations of ∂D. Therefore, it is only possible to recover the shape, but not the location of D. For plottings we always shift the center of gravity of ∂D to the origin. We assume that D is star-shaped and represent ∂D by a periodic function q such that ∂D = {q(t)(cos t, sin t) : t ∈ [0, 2π]}. For details on the implementation of F , its derivative and adjoint we refer to [27] where the mapping q → u ∞ is considered as forward operator. Even in this situation where the phase of u ∞ is given in addition to its modulus, it has been shown in [27] that for Sobolev-type smoothness assumptions at most logarithmic rates of convergence can be expected. As a test example we choose the obstacle shown in Figure 1 described by q † (t) = 1 2 √ 3 cos 2 t + 1 with two incident waves from "South West" and from "East" with wave number k = 10 as shown in Figure 1. We used J = 200 equidistant bins. The initial guess for the Newton iteration is the unit circle described by q 0 ≡ 1, and we choose the Sobolev norm R (q) = q − q 0 2 H s with s = 1.6 as penalty functional. The regularization parameters are chosen as α n = 0.5 · (2/3) n . Moreover, we choose an initial offset parameter σ = 0.002, which is reduced by 4 5 in each iteration step. The inner iteration (65) is stopped when h n,l / h n,0 ≤ 0.1, which was usually the case after about 3 iterations (or about 5 iterations for h n,l / h n,0 ≤ 0.01). For comparison we take the usual IRGNM, i.e. (4) with S (ĝ; g) = g −ĝ 2 L 2 and R as   Table 1.
above as well as a weighted IRGNM where S is chosen to be Pearson's φ 2 -distance: Since in all our examples we have many zero counts, we actually used S g obs ; g = φ 2 g obs ; max{g, c}  69)) for different values of the expected total number of counts t with 100 experiments for each set of parameters. The error of the initial guess is q 0 −q † L 2 = 0.288. All parameters as in Figure 1.
with a cutoff-parameter c > 0. Error statistics of shape reconstructions from 100 experiments are shown in Table 1. The stopping index N is chosen a priori such that (the empirical version of) the expectation E q n −q † 2 L 2 is minimal for n = N , i.e. we compare both methods with an oracle stopping rule. Note that the mean square error is significantly smaller for the Kullback-Leibler divergence than for the L 2 -distance and also clearly smaller than for Pearson's distance. Moreover the distribution of the error is more concentrated for the Kullback-Leibler divergence. For Pearson's φ 2 distance it must be said that the results depend strongly on the cutoff parameter for the data. In our experiments c = 0.2 seemed to be a good choice in general.
A phase retrieval problem. A well-known class of inverse problems with numerous applications in optics consists in reconstructing a function f : R d → C from the modulus of its Fourier transform |Ff | and additional a priori information, or equivalently to reconstruct the phase Ff /|Ff | of Ff (see Hurt [29]). In the following we assume more specifically that f : R 2 → C is of the form f (x) = exp(iϕ(x)) with an unknown real-valued function ϕ with known compact support supp(ϕ). For a uniqueness result we refer to Klibanov [34], although not all assumptions of this theorem are satisfied in the example below. It turns out to be particularly helpful if ϕ has a jump of known magnitude at the boundary of its support. We will assume that supp ϕ = B ρ = {x ∈ R 2 : |x| ≤ ρ} and that ϕ ≈ χ Bρ close to the boundary ∂B ρ (here χ Bρ denotes the characteristic function of B ρ ). This leads to an inverse problem where Here H s (B ρ ) denotes a Sobolev space with index s ≥ 0 and M ⊂ R 2 is typically of the form M = [−κ, κ] 2 . The a priori information on ϕ can be incorporated in the form of an initial guess ϕ 0 ≡ 1. Note that the range of F consists of analytic functions. The problem above occurs in optical imaging: If f (x ) = exp(iϕ(x )) = u(x , 0) (x = (x 1 , x 2 )) denotes the values of a cartesian component u of an electric field in the plane {x ∈ R 3 : x 3 = 0} and u solves the Helmholtz equation ∆u + k 2 u = 0 and a radiation condition in the half-space {x ∈ R 3 : x 3 > 0}, then the intensity g(x ) = |u(x , ∆)| 2 of the electric field at a measurement plane {x ∈ R 3 : x 3 = ∆} in the limit ∆ → ∞ in the (f) log 10 of median data reconstruction tF (ϕN ) for our method (66) Figure 3: Median reconstructions for the phase retrieval problem with t = 10 6 expected counts.

1.5]
). If f is generated by a plane incident wave in x 3 direction passing through a non-absorbing, weakly scattering object of interest in the half-space {x 3 < 0} close to the plane {x 3 = 0} and if the wave length is small compared to the length scale of the object, then the projection approximation ϕ(x ) ≈ k 2 0 −∞ (n 2 (x , x 3 ) − 1) dx 3 is valid where n describes the refractive index of the object of interest (see e.g. [39, Sec. 2.1]). A priori information on ϕ concerning a jump at the boundary of its support can be obtained by placing a known transparent object before or behind the object or interest. The simulated test object in Figure 3 which represents two cells is taken from Giewekemeyer et al. [19]. We choose the initial guess ϕ 0 ≡ 1, the Sobolev index s = 1 2 , and the regularization parameters α n = 5 10 6 · (2/3) n . The photon density is approximated by J = 256 2 bins. The offset parameter σ is initially set to 2 · 10 −6 and reduced by a factor 4 5 in each iteration step. As for the scattering problem, we use an oracle stopping rule N := argmin n E ϕ n − ϕ † 2 L 2 . As already mentioned, we had difficulties to solve the quadratic minimization problems (66) by the CG method for small α n and had to stop the iterations before residuals were sufficiently small to guarantee a reliable solution. Nevertheless, comparing subplots (c) and (e) in Figure 3, the median KL-reconstruction (e) seems preferable (although more noisy) since the contours are sharper and details in the interior of the cells are more clearly separated.