Efficient, Quantitative Numerical Methods for Statistical Image Deconvolution and Denoising

We review the development of efficient numerical methods for statistical multi-resolution estimation of optical imaging experiments. In principle, this involves constrained linear deconvolution and denoising, and so these types of problems can be formulated as convex constrained, or even unconstrained, optimization. We address two main challenges: the first of these is to quantify convergence of iterative algorithms; the second challenge is to develop efficient methods for these large-scale problems without sacrificing the quantification of convergence. We review the state of the art for these challenges.


Introduction
In this chapter we review progress towards addressing two main challenges in scientific image processing. The first of these is to quantify convergence of iterative algorithms for image processing to solutions (as opposed to optimal values) of the underlying variational problem. The second challenge is to develop efficient methods for these large-scale problems without sacrificing the quantification of convergence.
The techniques surveyed here were studied in [1][2][3]. We present only the main results from these studies, in the context that hindsight provides.
Scientific images are often processed with software that accomplishes a number of tasks like registration, denoising and deblurring. Implicit in the processing is that some systematic error is being corrected to bring the image closer to the truth. This presumption is more complicated for denoising and deblurring. These are often accomplished by filtering or by solving some variational problem such as minimizing the variance of an image. For applications requiring speedy processing, such as audio and video communication, this is sufficient. But the recent development of nanoscale photonic imaging modalities such as STED and RESOLFT featured in Chaps. 1, 7 and 9 has shifted the focus of image denoising and deconvolution from qualitative to quantitative models.
Quantitative approaches to image processing are the subject of Chap. 11 where statistical multiscale estimation is discussed (see Sect. 11.3). Here, the recovered image comes with statistical statements about how far the processed image is, in some statistical sense, from the truth. The estimators are almost exclusively variational, that is, they can be characterized as the solution to an optimization problem. It is important to emphasize that the value of the optimization problem is meaningless. This stands in stark contrast to many conventional applications in economics and operations research, where the value of the optimal solution is related to profit or cost, and so is of principal interest.
The focus on optimal solutions rather than optimal values places heavy demands on the structure of model formulations and the algorithms for solving them. Unless the numerical method allows one to state how far a computed iterate is from the solution of the underlying variational problem, the scientific significance of the iterate is lost.
The leading computational approaches for solving imaging problems with multi-resolution statistical estimation criteria are based on iterated proximal operators. Most of the analysis for first-order iterative proximal methods is limited to statements about rates of convergence of function values, if rates are discussed at all (see for instance [4][5][6][7]). First-order methods converge slowly in the worst case. A common assumption to guarantee linear convergence of the iterates is strong convexity, but this is far more than is necessary, and in particular it is not satisfied for the Huber function (12.35). It was shown in [8] that metric subregularity is necessary for local linear convergence. Aspelmeier, Charitha and Luke [1] showed that the popular alternating directions method of multipliers (ADMM), applied to optimization problems with piecewise linear-quadratic objective functions (e.g. the Huber function) together with linear inequality constraints, generically satisfies metric subregularity at isolated critical points; hence linear convergence of the iterates for this algorithm can be expected without further ado. More recently, in [3] it was shown that the primal iterates of a modification of the PAPC algorithm (Algorithm 2) converge R-linearly for any quadratically supportable objective function (for instance, the Huber function). Conventional results without metric subregularity obtain a convergence rate of O(1/k) with respect to the function values. In settings like qualitative image processing or machine learning, such results are acceptable, but in the setting of statistical image processing these statements do not contain any scientific content. We present in this chapter efficient iterative first-order methods that offer some hope of quantitative guarantees about the distance of the iterates to optimal solutions.

Problem Formulation
We limit our scope to the real vector space R^n with the norm generated by the inner product. The closed unit ball centered on the point y ∈ R^n is denoted by B(y). The positive orthant (resp. negative orthant) in R^n is denoted by R^n_+ (resp. R^n_−). The domain of an extended real-valued function ϕ : R^n → (−∞, +∞] is dom ϕ := {y ∈ R^n | ϕ(y) < +∞}. The set of symmetric n × n positive (semi-)definite matrices is denoted by S^n_++ (S^n_+). The notation A ≻ 0 (A ⪰ 0) denotes a positive (semi-)definite matrix A. For any z ∈ R^n and any A ∈ S^n_+, we denote the semi-norm ‖z‖²_A := ⟨z, Az⟩. The operator norm is defined by ‖A‖ = max_{u∈R^n} {‖Au‖ : ‖u‖ = 1} and coincides with the spectral radius of A whenever A is symmetric. If A ≠ 0, σ_min(A) denotes its smallest nonzero singular value. For a sequence {z^k}_{k∈N} converging to z*, we say the convergence is Q-linear if there is a constant c ∈ (0, 1) such that ‖z^{k+1} − z*‖ ≤ c‖z^k − z*‖ for all k, and R-linear if ‖z^k − z*‖ is bounded above by a sequence converging Q-linearly to zero.

We limit our discussion to proper (nowhere equal to −∞ and finite at some point), lower semi-continuous (lsc), extended-valued (can take the value +∞) functions. We will, in fact, limit our discussion to convex functions, but convexity is not the central property governing quantitative convergence estimates. By the subdifferential of a function ϕ, denoted ∂ϕ, we mean the collection of all subgradients that can be written as limits of sequences of Fréchet subgradients at nearby points; a vector v is a (Fréchet) subgradient of ϕ at y if ϕ(x) ≥ ϕ(y) + ⟨v, x − y⟩ + o(‖x − y‖). The functions of interest for us are subdifferentially regular on their domains, that is, the epigraphs of the functions are Clarke regular at points where they are finite [10, Definition 7.25]. For our purposes it suffices to note that, for a function ϕ that is subdifferentially regular at a point y, the subdifferential is nonempty and all subgradients are Fréchet subgradients, that is, ∂ϕ(y) = ∂̂ϕ(y) ≠ ∅.
Convex functions, in particular, are subdifferentially regular on their domains, and the subdifferential has the particularly simple representation as the set of all vectors v satisfying ϕ(x) ≥ ϕ(y) + ⟨v, x − y⟩ for all x. A mapping Φ : R^n ⇒ R^n is said to be β-inverse strongly monotone [10, Corollary 12.55] if ⟨x − y, u − v⟩ ≥ β‖u − v‖² whenever u ∈ Φ(x) and v ∈ Φ(y). The mapping Φ is said to be polyhedral (or piecewise polyhedral [10]) if its graph is the union of finitely many sets that are polyhedral convex in R^n × R^n [11]. Polyhedral mappings are generated by the subdifferential of piecewise linear-quadratic functions (see Proposition 12.9).
A function f : R^n → (−∞, +∞] is called piecewise linear-quadratic (plq) if dom f can be represented as the union of finitely many polyhedral sets, relative to each of which f(x) is given by an expression of the form ½⟨x, Ax⟩ + ⟨a, x⟩ + α for some scalar α ∈ R, vector a ∈ R^n, and symmetric matrix A ∈ R^{n×n}.
Closely related to plq functions are quadratically supportable functions. Definition 12.2 (pointwise quadratically supportable (pqs)) A proper, extended-valued function ϕ : R^n → R ∪ {+∞} is said to be pointwise quadratically supportable at y if it is subdifferentially regular there and there exist a neighborhood V of y and a constant μ > 0 such that

ϕ(x) ≥ ϕ(y) + ⟨v, x − y⟩ + (μ/2)‖x − y‖²  for all x ∈ V and all v ∈ ∂ϕ(y).  (12.4)

If for each bounded neighborhood V of y there exists a constant μ > 0 such that (12.4) holds, then the function ϕ is said to be pointwise quadratically supportable at y on bounded sets. If (12.4) holds with one and the same constant μ > 0 on all neighborhoods V, then ϕ is said to be uniformly pointwise quadratically supportable at y.
For more on the relationship between pointwise quadratic supportability, coercivity, strong monotonicity and strong convexity see [3].
We denote the resolvent of Φ by J_Φ ≡ (Id + Φ)^{−1}, where Id denotes the identity mapping and the inverse is defined by Φ^{−1}(y) := {x ∈ R^n | y ∈ Φ(x)}. The corresponding reflector is defined by R_{ηΦ} ≡ 2J_{ηΦ} − Id. One of the more prevalent examples of resolvents is the proximal map. For ϕ : R^n → (−∞, ∞] a proper, lsc and convex function and for any u ∈ R^n and Q ∈ S^n_++, the proximal map associated with ϕ with respect to the weighted Euclidean norm is uniquely defined by

prox_{Q,ϕ}(u) := argmin_{x∈R^n} { ϕ(x) + ½‖x − u‖²_{Q^{−1}} }.  (12.5)

When Q = c^{−1} Id, c > 0, we simply use the notation prox_{c,ϕ}(u). We also recall the fundamental Moreau proximal identity [12], that is, for any z ∈ R^n,

z = prox_{Q,ϕ}(z) + Q prox_{Q^{−1},ϕ*}(Q^{−1}(z)),  (12.6)

where Q^{−1} is the inverse of Q ∈ S^n_++. Notions of continuity of set-valued mappings have been thoroughly developed over the last 40 years. Readers are referred to the monographs [10,11,13] for basic results. A key property of set-valued mappings that we will rely on is metric subregularity, which can be understood as the property corresponding to a Lipschitz-like continuity of the inverse mapping relative to a specific point. It is a weaker property than metric regularity which, in the case of an n × m matrix for instance, is equivalent to surjectivity. Our definition follows the characterization of this property given in [11, Exercise 3H.4].
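As a concrete check of the Moreau identity (12.6), the following sketch (ours, not from the chapter; the function names are illustrative) evaluates both sides for ϕ = ‖·‖₁ and a diagonal Q = diag(q) ∈ S^n_++, under the convention prox_{Q,ϕ}(u) = argmin_x {ϕ(x) + ½‖x − u‖²_{Q^{−1}}}: the weighted prox of the 1-norm is coordinatewise soft-thresholding at level q_i, and the prox of the conjugate (the indicator of the unit box) is the coordinatewise clip.

```python
import math

def prox_l1_weighted(u, q):
    # prox_{Q,phi}(u) for phi = ||.||_1 and Q = diag(q):
    # argmin_x ||x||_1 + (1/2)||x - u||^2_{Q^{-1}} = soft-thresholding at level q_i
    return [math.copysign(max(abs(ui) - qi, 0.0), ui) for ui, qi in zip(u, q)]

def prox_box(v):
    # prox of the conjugate phi* = indicator of [-1,1]^n (for any diagonal weight):
    # the projection onto the box, i.e. a coordinatewise clip
    return [min(1.0, max(-1.0, vi)) for vi in v]

def moreau_residual(z, q):
    # residual of the Moreau identity (12.6):
    # z - ( prox_{Q,phi}(z) + Q prox_{Q^{-1},phi*}(Q^{-1} z) )
    p = prox_l1_weighted(z, q)
    d = prox_box([zi / qi for zi, qi in zip(z, q)])
    return [zi - (pi + qi * di) for zi, pi, qi, di in zip(z, p, q, d)]

res = moreau_residual([2.5, -0.3, 0.7, -4.0], [1.0, 0.5, 2.0, 0.25])
```

The residual vanishes coordinatewise, regardless of whether each coordinate lands in the thresholded or the clipped regime.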

Definition 12.3 (metric subregularity)
The mapping Φ : R^n ⇒ R^m is called metrically subregular at x for y relative to W ⊂ R^n if (x, y) ∈ gph Φ and there exist a constant c > 0 and a neighborhood O of x such that

dist(x', Φ^{−1}(y)) ≤ c dist(y, Φ(x'))  for all x' ∈ O ∩ W.

The constant c measures the stability under perturbations of the inclusion y ∈ Φ(x). An important instance where metric subregularity comes for free is that of polyhedral mappings. A notion related to metric regularity is that of weak-sharp solutions. This will be used in the development of error bounds (Theorem 12.6).

Definition 12.5 (weak sharp minimum [14]) The solution set S_f := argmin { f(x) | x ∈ Ω }, for a nonempty closed convex set Ω, is weakly sharp if, for p = inf_Ω f, there exists a positive number α (sharpness constant) such that

f(x) ≥ p + α dist(x, S_f)  for all x ∈ Ω.

Similarly, the solution set S_f is weakly sharp of order ν > 0 if there exists a positive number α (sharpness constant) such that, for each x ∈ Ω,

f(x) ≥ p + α dist(x, S_f)^ν.

Abstract Problem
The generic problem in which we are interested is the constrained problem

(P_0)  minimize_{x∈R^n} f(x)  subject to  A_i x ∈ C_i, i = 1, 2, ..., M.

The following blanket assumptions on the problem's data hold throughout. Assumption 1(i) implies that the optimal value of (P_0) is finite. Assumption 1(ii) implies that the constraint structure is convex. Assumption 1(iii) implies that the mapping A : R^n → R^m is linear and full rank, where A := [A_1^T, ..., A_M^T]^T is the stacked matrix built from the individual linear mappings A_i. The challenge of statistical multi-resolution estimation lies in the feature that the dimension of the constraint structure, m, is much greater than the dimension of the unknowns, n, and grows superlinearly with respect to the number of unknowns.
The above constrained optimization problem is often formulated as an unconstrained-looking problem via the introduction of a (nonsmooth) penalty term enforcing the constraints:

(P)  minimize_{x∈R^n} f(x) + g(Ax),

where g is built from a penalty function θ. The requirements on the function θ align this penalty term with exact penalization [15], that is, a relaxation of the constraints where, for all penalty parameters ρ large enough, the constraints are exactly satisfied.
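The mechanism of exact penalization can be seen in a one-dimensional sketch (our toy instance, not from the chapter; the threshold ρ = 2 below is the Lagrange multiplier of this particular problem): minimizing x² subject to x ≥ 1 versus its penalized form x² + ρ·max(0, 1 − x). For ρ below the multiplier the penalized minimizer is infeasible; for ρ above it the constraint is exactly satisfied.

```python
def F(x, rho):
    # nonsmooth exact penalty for: minimize x^2 subject to x >= 1
    return x * x + rho * max(0.0, 1.0 - x)

def argmin_grid(rho, lo=-3.0, hi=3.0, n=60001):
    # brute-force minimization on a uniform grid of spacing 1e-4
    xs = (lo + (hi - lo) * k / (n - 1) for k in range(n))
    return min(xs, key=lambda x: F(x, rho))

x_small = argmin_grid(1.0)   # rho too small: minimizer 0.5 violates x >= 1
x_large = argmin_grid(4.0)   # rho large enough: constraint exactly satisfied at x = 1
```

The transition is not asymptotic: once ρ exceeds the finite threshold, the penalized and constrained solutions coincide exactly, which is what distinguishes exact penalties from quadratic ones.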
The following assumptions are used to guarantee the exact correspondence between solutions to (P 0 ) and (P). In (P 0 ) and (P) the function f is often smooth, but not prox friendly. In applications it is most often a smooth regularization or a fidelity term. For the ADMM/DR method reviewed in Sect. 12.3 smoothness is not required.

Assumption 2
It is assumed that the functions g_i (i = 1, 2, ..., M) are prox-friendly and that they enjoy some structure that makes g prox-friendly as well. For instance, if the constraints are separable, then the sum g(y) = Σ_{i=1}^M g_i(y_i) for y = (y_1, ..., y_M) is also prox-friendly. The functions g_i ∘ A_i can be regularizing functions (like total variation) or hard inequality constraints. For example, hard inequality constraints are modeled in (P_0) by taking g_i to be the indicator function ι_{C_i} of the constraint set C_i, where ι_C(y) := 0 if y ∈ C and +∞ otherwise.

Saddle Point and Dual Formulations
The saddle point formulation is derived by viewing the function g in (P) as the image of a function g* under Fenchel conjugation, that is, g = (g*)*. Writing this explicitly into (P) yields

(S)  minimize_{x∈R^n} sup_{y∈R^m} { f(x) + ⟨Ax, y⟩ − g*(y) }.

The bifunction in the saddle point formulation is L(x, y) := f(x) + ⟨Ax, y⟩ − g*(y). Contrast this with the Lagrangian for the extended problem

(P_L)  minimize_{(x,w)∈R^n×R^m} f(x) + g(w)  subject to  Ax = w.

The Lagrangian is

L_0(x, w, z) := f(x) + g(w) + ⟨z, Ax − w⟩,  (12.14)

and the augmented Lagrangian L_η for (P_L) is given by

L_η(x, w, z) := f(x) + g(w) + ⟨z, Ax − w⟩ + (η/2)‖Ax − w‖²,

where z ∈ R^m and η > 0 is a fixed penalty parameter. Assumption 1(i) guarantees that the mapping L(·, ·) has a saddle point (x̂, ŷ), that is,

L(x̂, y) ≤ L(x̂, ŷ) ≤ L(x, ŷ)  for all x ∈ R^n and y ∈ R^m.

The existence of a saddle point corresponds to zero duality gap for the induced optimization problems with objectives p(x) := sup_{y∈R^m} L(x, y) and q(y) := inf_{x∈R^n} L(x, y). By weak duality, we have inf_{x∈R^n} p(x) ≥ sup_{y∈R^m} q(y). This can be viewed as a partial dual to problem (P). The full dual problem involves the Fenchel conjugate of the entire objective function. For (P) the dual problem is

sup_{y∈R^m} { −f*(A^T y) − g*(−y) }.

Instead of working with this dual, it is more convenient to work with the following equivalent formulation via the change of variable y → −y:

(D)  inf_{y∈R^m} f*(−A^T y) + g*(y).

Under standard constraint qualifications (e.g., [16, Theorem 2.3.4]), (x̂, ŷ) is a saddle point of L if and only if x̂ is an optimal solution of the primal problem (P_0) and ŷ is an optimal solution of the dual problem (D). The following two inclusions characterize the solutions of the problems (P_0) and (D), respectively:

0 ∈ ∂f(x̂) + A^T ∂g(Ax̂)  and  0 ∈ −A ∂f*(−A^T ŷ) + ∂g*(ŷ).

In both cases, one has to solve an inclusion of the form

0 ∈ B(z) + D(z)  (12.16)

for general set-valued mappings B and D.
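The saddle point inequalities can be verified on a small instance (our toy example, not from the chapter): with f(x) = ½(x − 1)², g(z) = ½z² (so g*(y) = ½y²) and A = 2, the bifunction is L(x, y) = ½(x − 1)² + 2xy − ½y², whose stationary point (x̂, ŷ) = (0.2, 0.4) satisfies L(x̂, y) ≤ L(x̂, ŷ) ≤ L(x, ŷ), and x̂ also solves the primal problem min f(x) + g(Ax).

```python
def L(x, y):
    # L(x, y) = f(x) + <Ax, y> - g*(y) with f(x) = (x-1)^2/2, g*(y) = y^2/2, A = 2
    return 0.5 * (x - 1.0) ** 2 + 2.0 * x * y - 0.5 * y ** 2

xh, yh = 0.2, 0.4   # solves x - 1 + 2y = 0 and 2x - y = 0 simultaneously

def primal(x):
    # p(x) = sup_y L(x, y) = f(x) + g(Ax), attained at y = 2x
    return 0.5 * (x - 1.0) ** 2 + 0.5 * (2.0 * x) ** 2

grid = [i / 10.0 for i in range(-20, 21)]
max_err = max(L(xh, y) for y in grid) - L(xh, yh)   # <= 0: y-maximality at yh
min_err = L(xh, yh) - min(L(x, yh) for x in grid)   # <= 0: x-minimality at xh
```

Note that the saddle value L(x̂, ŷ) = 0.4 equals the primal optimal value, the zero-duality-gap phenomenon described above.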

Statistical Multi-resolution Estimation
Statistical multi-resolution estimation (SMRE) discussed in Sect. 11.2.7 of Chap. 11 is specialized here for the case of imaging systems with Gaussian noise.
The variational model for statistical multi-resolution estimation with Gaussian noise takes the form

(P_SMRE)  minimize_{x∈R^n} f(x)  subject to  |Σ_{j∈V_i} w_i(j) (Ax − b)_j| ≤ γ_i, i = 1, 2, ..., M.

Here f : R^n → R is a regularization functional, which incorporates a priori knowledge about the unknown signal x such as smoothness, w_i is a weighting function for the grid points in the subset V_i, b is the observed image, and A : R^n → R^n is the linear imaging operator that models the experiment. The constant γ_i has an interpretation in terms of the quantile of the estimator.
In the context of the general model (P_0), the mappings A_i are compositions of the imaging operator A with affine averaging operators that account for sampling at different resolutions of the image, and the sets C_i are the corresponding intervals [−γ_i, γ_i]. Note that the observation b need not be in the range of the imaging operator A; all that is assumed is that this mapping is injective, not surjective. This means that, in applications, practitioners need to be careful not to make the constraint bounds γ_i too small, otherwise the optimization problem might be infeasible. If the algorithms presented below appear to be diverging for a particular instance of (P_SMRE), it is because the problem is infeasible; increasing the constants γ_i should solve the problem.

Alternating Directions Method of Multipliers and Douglas-Rachford
In this section we survey the main results (without proofs) from [1]. For proofs of the statements, readers are referred to that article. A starting point for most of the main approaches to solving (P 0 ) is the alternating directions method of multipliers (ADMM) (primary sources include [17][18][19][20][21]). This method is one of many splitting methods which are the principal approach to handling the computational burden of large-scale, separable problems [22]. ADMM belongs to a class of augmented Lagrangian methods whose original motivation was to regularize Lagrangian formulations of constrained optimization problems. The ADMM algorithm for solving (P L ) follows. The penalty parameter η need not be a constant, and indeed evidence indicates that the choice of η can greatly impact the complexity of the algorithm. For simplicity we keep this parameter fixed.
We do not specify how the argmin in Algorithm 1 should be calculated, and indeed, the analysis that follows assumes that these can be computed exactly. One problem that should be immediately apparent is that this algorithm operates on a space of dimension n + 2m. Since one of the two challenges we address is high dimension, this expansion in the dimension of the problem formulation should be troubling. Nevertheless, we show with this algorithm how the first challenge, namely the quantification of convergence, is achieved.
The connection between the ADMM algorithm and the Douglas-Rachford algorithm introduced in Chap. 6, (6.30), was first discovered by Gabay [19] (see also the thesis of Eckstein [17]). For any η > 0, the Douglas-Rachford algorithm [23,24] for solving (12.16) is given by

u^k := J_{ηB}(z^k − η y^k),  z^{k+1} := J_{ηD}(u^k + η y^k),  y^{k+1} := y^k + (1/η)(u^k − z^{k+1}),  (12.18)

where J_{ηD} ≡ (Id + ηD)^{−1} and J_{ηB} ≡ (Id + ηB)^{−1} are the resolvents of ηD and ηB respectively. Given z^0 and y^0 ∈ Dz^0, following [25], define the new variable ξ^0 ≡ z^0 + η y^0, so that z^0 = J_{ηD} ξ^0. We thus arrive at an alternative formulation of the Douglas-Rachford algorithm (12.18): for k ∈ N,

ξ^{k+1} := T ξ^k,  where  T := ½(R_{ηB} R_{ηD} + Id),  (12.20)-(12.21)

and R_{ηD} and R_{ηB} are the reflectors of the respective resolvents. This is the form of Douglas-Rachford considered in [26]. Specializing this to our application yields

B := ∂(f* ∘ (−A^T))  and  D := ∂g*,  (12.22)

and so the resolvent mappings are the proximal mappings of the convex functions f* ∘ (−A^T) and g* respectively; hence the resolvent mappings and the corresponding fixed point operator T are single-valued [12]. For fixed η > 0, given any initial points ξ^0 and (z^0, y^0) with y^0 ∈ Dz^0 such that ξ^0 = z^0 + η y^0, the sequences (z^k)_{k∈N}, (ξ^k)_{k∈N} and (y^k)_{k∈N}, defined respectively by (12.18), (12.20) and y^k ≡ (1/η)(ξ^k − z^k), converge to points z̄ = J_{ηD} ξ̄, ξ̄ ∈ Fix T and ȳ ∈ D(z̄). The point z̄ = J_{ηD} ξ̄ is a solution to (D), and ȳ = (1/η)(ξ̄ − z̄) ∈ D(z̄). If, in addition, A has full column rank, then the sequence (y^k, z^k)_{k∈N} corresponds exactly to the sequence of points generated in Steps 2-3 of Algorithm 1, and the associated sequence of x-updates converges to a solution of (P_0).
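The fixed point form (12.20)-(12.21) is easy to exercise on a toy feasibility problem (our illustration, not from the chapter): in R², with the resolvents replaced by projectors onto two lines standing in for J_{ηB} and J_{ηD}, the shadow sequence J_{ηD}ξ^k converges to the intersection of the lines.

```python
def proj_axis(p):
    # projector onto the line y = 0 (plays the role of one resolvent)
    return (p[0], 0.0)

def proj_diag(p):
    # projector onto the line y = x (plays the role of the other resolvent)
    m = 0.5 * (p[0] + p[1])
    return (m, m)

def reflect(proj, p):
    # reflector R = 2J - Id as in the definition preceding (12.21)
    q = proj(p)
    return (2.0 * q[0] - p[0], 2.0 * q[1] - p[1])

def T(p):
    # Douglas-Rachford operator T = (R_B R_D + Id)/2 from (12.21)
    r = reflect(proj_axis, reflect(proj_diag, p))
    return (0.5 * (p[0] + r[0]), 0.5 * (p[1] + r[1]))

xi = (5.0, 3.0)
for _ in range(100):
    xi = T(xi)
shadow = proj_diag(xi)   # analogue of z^k = J_{eta D} xi^k; tends to (0, 0)
```

For these two lines the angle of intersection is 45°, and the iterates contract by roughly cos 45° ≈ 0.707 per step, a concrete instance of the linear convergence discussed next.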
The correspondence between Douglas-Rachford and ADMM in the proposition above means that if quantitative convergence can be established for one of the algorithms, it is automatically established for the other. Linear convergence of Douglas-Rachford under the assumption of strong convexity and Lipschitz continuity of f was already established by Lions and Mercier [26]. Recent published work in this direction includes [27,28]. Local linear convergence of the iterates to a solution was established in [29] for linear and quadratic programs using spectral analysis. In Theorem 12.8, two conditions are given that guarantee linear convergence of the ADMM iterates to a solution. The first condition is classical and follows Lions and Mercier [26]. The second condition, based on [30], is much more prevalent in applications and generalizes the results of [29].

Theorem 12.8 Let B := ∂(f* ∘ (−A^T)) and D := ∂g* as in (12.22), where A : R^n → R^m is an injective linear mapping. Let ξ̄ ∈ Fix T for T defined by (12.21). For fixed η > 0 and any given triplet of points (ξ^0, y^0, z^0) satisfying ξ^0 ≡ z^0 + η y^0 with y^0 ∈ Dz^0, generate the sequence (y^k, z^k)_{k∈N} by Steps 2-3 of Algorithm 1 and the sequence (ξ^k)_{k∈N} by (12.20).
(i) Suppose there is a neighborhood O of ξ̄ on which g is strongly convex with constant μ and ∂g is β-inverse strongly monotone for some β > 0. (ii) Alternatively, suppose that T − Id is metrically subregular at ξ̄ for 0 relative to an affine subspace W with T : W → W, and that Fix T ∩ W ∩ O = {ξ̄} for some neighborhood O of ξ̄. Then the sequences (ξ^k)_{k∈N} and (y^k, z^k)_{k∈N} converge linearly to the respective points ξ̄ ∈ Fix T ∩ W and (ȳ, z̄), with a rate constant smaller than one determined by η and the constants above. In either case, the limit point z̄ = J_{ηD} ξ̄ is a solution to (D), ȳ ∈ D(z̄), and the sequence (x^k)_{k∈N} of Step 1 of Algorithm 1 converges to x̄, a solution of (P_0). The strong convexity assumption (i) of Theorem 12.8 fails in many applications of interest, and in particular for feasibility problems (minimizing the sum of indicator functions). By [31, Theorem 2.2], case (ii) of Theorem 12.8, in contrast, holds in general for mappings T for which T − Id is metrically subregular and the fixed point sets are isolated points with respect to an affine subspace to which the iterates are confined. The restriction to the affine subspace W is a natural generalization for the Douglas-Rachford algorithm, where the iterates are known to stay confined to affine subspaces orthogonal to the fixed point set [32,33]. We show that metric subregularity with respect to this affine subspace holds in many applications.

Proposition 12.10 Suppose that the mappings B := ∂(f* ∘ (−A^T)) and D := ∂g* are polyhedral (as is the case when f and g are piecewise linear-quadratic), that T : W → W for W some affine subspace of R^m, and that Fix T ∩ W is an isolated point {ξ̄}. Then there is a neighborhood O of ξ̄ such that, for all starting points (ξ^0, y^0, z^0) with ξ^0 ≡ z^0 + η y^0 ∈ O ∩ W for y^0 ∈ D(z^0), so that J_{ηD} ξ^0 = z^0, the sequence (ξ^k)_{k∈N} generated by (12.20) converges Q-linearly to ξ̄.

ADMM for Statistical Multi-resolution Estimation of STED Images
The theoretical results above are demonstrated with an image b ∈ R^n (Fig. 12.1) generated from a Stimulated Emission Depletion (STED) microscopy experiment [34,35] conducted at the Laser-Laboratorium Göttingen examining tubulin, represented as the "object" x ∈ R^n. The imaging model is simple linear convolution. The measurement b, shown in Fig. 12.1, is noisy or otherwise inexact, and thus an exact solution is not desirable. Although the noise in such images is usually modeled by a Poisson distribution, a Gaussian model is a reasonable approximation here; the model used is (P) with quadratic regularization f given by (12.24) and exact penalty g(Ax) given by (12.12) with g_i given by (12.17).
For an image size of n = 64 × 64 with three resolution levels, the resulting number of constraints is m = 12161 (that is, 64² constraints at the finest resolution, 4·32² constraints at the next resolution and 9·21² constraints at the lowest resolution). The constant α = 0.01 in (12.24) is used to balance the contributions of the individual terms to make the most of limited numerical accuracy (double precision). The constants γ_i are chosen so that the model solution is no more than 3 standard deviations from the noisy data on each interval of each scale.
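The constraint count quoted above is quickly sanity-checked (arithmetic only; the per-scale window counts are taken from the text):

```python
n = 64 * 64                               # number of unknowns (pixels)
counts = [64**2, 4 * 32**2, 9 * 21**2]    # constraints at the three resolution levels
m = sum(counts)                           # total number of constraints
ratio = m / n                             # m exceeds n, here by roughly a factor of 3
```

This makes concrete the structural feature emphasized in the problem formulation: the number of constraints m exceeds the number of unknowns n, and adding finer scales only widens the gap.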
Since this is experimental data, there is no "truth" for comparison; the constraint, together with the error bounds on the numerical solution to the model solution, provides statistical guarantees on the numerical reconstruction [36]. In Fig. 12.2b the iteration is shown with the value ρ = 4096, for which the constraints are exactly satisfied (to within machine precision), indicating the correspondence of the computed solution of problem (P) to a solution to the exact model problem (P_0).
The only assumption from Proposition 12.10 that cannot be verified for this implementation is that the fixed point set of the algorithm is a singleton; all other assumptions are satisfied automatically by the problem structure. We observe, however, starting from around iteration 1500 in Fig. 12.2b, behavior that is consistent with (i.e. does not contradict) linear convergence. From this, the observed convergence rate is c = 0.9997, which yields an a posteriori upper estimate of the pixel-wise error of about 8.9062 × 10⁻⁴, or 3 digits of accuracy at each pixel.
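The way an observed contraction factor converts into an a posteriori error estimate can be sketched as follows (our illustration, using the standard geometric-series bound ‖x^k − x*‖ ≤ ‖x^{k+1} − x^k‖/(1 − c) for a sequence whose successive steps contract by c; the amplification factor 1/(1 − c) ≈ 3333 for c = 0.9997 is why such a slow rate still yields about three digits):

```python
def aposteriori_bound(step_norm, c):
    # if ||x^{j+1} - x^j|| <= c * ||x^j - x^{j-1}|| for all j >= k, summing the
    # geometric series gives ||x^k - x*|| <= step_norm / (1 - c)
    return step_norm / (1.0 - c)

# toy Q-linearly convergent scalar sequence x^{k+1} = c * x^k with limit x* = 0
c = 0.9997
x = 1.0
valid = True
for _ in range(2000):
    x_next = c * x
    step = abs(x_next - x)
    # the true error |x - 0| must respect the computable bound
    valid = valid and (abs(x) <= aposteriori_bound(step, c) * (1.0 + 1e-12))
    x = x_next
```

For this toy sequence the bound is tight; in general it is an upper estimate, which is exactly what a statistical guarantee on the reconstruction requires.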

Primal-Dual Methods
The ADMM method presented above suffers from the extreme computational cost of computing the prox operator in Step 1. The results of the previous section required several days of cpu time on a 2016-era laptop. In this section we survey the main results (without proofs) from [3], which describe a method that can achieve comparable results in about 30 s on the same computer architecture. There is one subtle difference in the present survey over [3] that has major implications for the application and implementation of the main Algorithm 2.
In this section we consider exclusively functions g in problem (P) of the form (12.11). The algorithm we revisit is the proximal alternating predictor-corrector (PAPC) algorithm proposed in [37] for solving (S). It consists of a predictor-corrector gradient step for handling the smooth part of L in (12.13) and a proximal step for handling the nonsmooth part.

Algorithm 2: Extended Proximal Alternating Predictor-Corrector (EPAPC) for (S).
Parameters: Set η > 0 and choose the parameters τ and σ to satisfy the step-size conditions of [3] (in particular, τ is tied to the Lipschitz constant of ∇f, and σ = 1/(τ‖AA^T‖)). At each iteration the algorithm computes one gradient and one prox-mapping corresponding to the nonsmooth function, both of which are assumed to be efficiently implementable. We suppose these can be evaluated exactly, though this does not take into account finite precision arithmetic. The dimension of the iterates of the EPAPC algorithm is of the same order of magnitude as with the ADMM/Douglas-Rachford method, but the individual steps can be run in parallel and, with the exception of the projection in Step 6, are much less computationally intensive to execute.
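The predictor-corrector structure of the underlying PAPC iteration [37] can be sketched on a one-dimensional instance (our toy problem: f(x) = ½(x − b)², g(z) = ½z², A = a; the step sizes and the closed-form prox of g* are chosen for this instance and are not the chapter's tuning):

```python
# toy instance of min_x f(x) + g(Ax): f(x) = (x - b)^2/2, g(z) = z^2/2, A = a
a, b = 2.0, 3.0
tau = 0.9                     # tau < 1/L with L = 1 the Lipschitz constant of f'
sigma = 0.5 / (tau * a * a)   # conservative dual step: tau * sigma * ||A A^T|| < 1

def grad_f(x):
    return x - b

def prox_gstar(v):
    # prox_{sigma, g*}(v) for g*(y) = y^2/2 has the closed form v / (1 + sigma)
    return v / (1.0 + sigma)

x, y = 0.0, 0.0
for _ in range(300):
    p = x - tau * (grad_f(x) + a * y)   # predictor: gradient step with the old dual
    y = prox_gstar(y + sigma * a * p)   # dual proximal step at the predictor
    x = x - tau * (grad_f(x) + a * y)   # corrector: gradient step with the new dual
x_star = b / (1.0 + a * a)              # closed-form primal solution, here 0.6
```

The corrector reuses the gradient at x^{k−1} but the freshly updated dual variable, which is the one-extra-multiplication trick that distinguishes PAPC from a plain Arrow-Hurwicz step.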
For quantitative convergence guarantees of primal-dual methods, additional assumptions are required.

Assumption 3
(i) The function f : R^n → R is convex and continuously differentiable with Lipschitz continuous gradient. (ii) The function f : R^n → R is pointwise quadratically supportable (Definition 12.2) at each x̂ in the solution set S*. (iii) There exists a σ > 0 for which g* satisfies the growth condition along ker A^T required in [3]. The assumption of Lipschitz continuous gradients, Assumption 3(i), is standard, but stronger than one might desire in general. The assumption is included mainly to guarantee boundedness of the iterates; Lipschitz continuity of the gradients is enough for our purposes. By the standing Assumption 1 the mapping A is injective. When m = n, A then also has full row rank and AA^T is invertible. When m > n, A is still injective but AA^T has a nontrivial kernel, and care must be taken that the conjugate function g* does not decrease too fast in the direction of the kernel of A^T. This is assured by Assumption 3(iii). This assumption comes into play in Lemma 12.1.
Step (3) of Algorithm 2 can be written more compactly when g(w) := g(w_1, ..., w_M) = Σ_{i=1}^M g_i(w_i) is separable. In this case, the convex conjugate of the separable sum is the sum of the individual conjugates: g*(w) = Σ_{i=1}^M g_i*(w_i). Defining the matrix S = σ^{−1} I_m, we immediately get that, for any point ζ = (ζ_1, ..., ζ_M) with ζ_i ∈ R^{m_i}, i = 1, ..., M, the prox-mapping prox_{S,g*}(ζ) is evaluated componentwise as (prox_{σ,g_1*}(ζ_1), ..., prox_{σ,g_M*}(ζ_M)).

Thus Step (3) of Algorithm 2 can be written in vector notation as w^k = prox_{S,g*}(y^{k−1} + σ A p^k). It is possible to use different proximal step constants σ_i, i = 1, ..., M; see the details in [37]. The choice σ_i = σ for i = 1, ..., M is purely for simplicity of exposition. The projection onto (ker A^T)^⊥ in Step (6) is carried out by applying the pseudo-inverse:

P_{(ker A^T)^⊥} = A(A^T A)^{−1} A^T.

When m ≤ n and A_i is full rank for all i = 1, 2, ..., M, then ker A^T = {0} and the above operation is not needed. But an unavoidable feature of multi-resolution analysis, our motivating application, is that m > n, so some thought must be given to efficient computation of (A^T A)^{−1}.
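For intuition, the projection P_{(ker A^T)^⊥} = A(A^T A)^{−1}A^T can be checked in the smallest nontrivial case (our example with n = 1 and m = 3, so that range(A) = (ker A^T)^⊥ is a line in R³): the projection is idempotent and its residual lies in ker A^T.

```python
# A : R^1 -> R^3 with single column a, so range(A) = (ker A^T)^perp = span{a}
a = (1.0, 1.0, 0.0)

def project_range(v):
    # P_{(ker A^T)^perp} v = A (A^T A)^{-1} A^T v, which here reduces to
    # (<a, v> / <a, a>) * a
    s = sum(ai * vi for ai, vi in zip(a, v)) / sum(ai * ai for ai in a)
    return tuple(s * ai for ai in a)

v = (3.0, 1.0, 5.0)
pv = project_range(v)        # the component of v in range(A)
ppv = project_range(pv)      # idempotence: projecting again changes nothing
resid_dot = sum(ai * (vi - pi) for ai, vi, pi in zip(a, v, pv))  # A^T (v - Pv) = 0
```

In the SMRE setting A is a tall stacked matrix rather than a single column, so forming (A^T A)^{−1} explicitly is the expensive part; the toy case only illustrates what the operator does.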
The next technical result, which is new, establishes a crucial upper bound on the growth of the Lagrangian with respect to the primal variables.

Lemma 12.1 Let Assumption 3 hold and let (p^k, y^k, x^k)_{k∈N} be the sequence generated by the EPAPC algorithm. Then for every k ∈ N and every (x, y) ∈ R^n × R^m, the Lagrangian growth bounds (12.27) hold (see [3] for the explicit inequalities). Note that for the choice of τ given in the parameter initialization of Algorithm 2, G ⪰ 0.
The constant μ in Proposition 12.11 depends on the choice of (x 0 , y 0 ) and so depends implicitly on the distance of the initial guess to the point in the set of saddle point solutions.
Convergence of the primal-dual sequence is with respect to a weighted norm on the primal-dual product space built on the matrix

G := σ^{−1} I_m − τ A A^T,  (12.28)

where, by the assumptions on the choice of τ and σ given in Algorithm 2, G ⪰ 0. We can then define an associated norm using the positive definite matrix H via

‖u‖²_H := (1/τ)‖x‖² + ‖y‖²_G  for u = (x, y).

We are now ready to state the main result and corollaries, whose proofs can be found in [3].

Theorem 12.12 Let (p^k, x^k, y^k)_{k∈N} be the sequence generated by the EPAPC algorithm, and let λ_min+(AA^T) denote the smallest nonzero eigenvalue of AA^T. If Assumptions 1 and 3 are satisfied, then there exists a saddle point solution for L(·, ·), the pair û = (x̂, ŷ) with ŷ ∈ (ker A^T)^⊥, and for any α > 1 and all k ≥ 1 the sequence (u^k)_{k∈N} = ((x^k, y^k))_{k∈N} satisfies the linear rate bound (12.31), in which the rate constant δ given in (12.32) is positive and μ > 0 is the constant of pointwise quadratic supportability of f at x̂, depending on the distance of the initial guess to the point (x̂, ŷ) in the solution set S*. In particular, (x^k, y^k)_{k∈N} is Q-linearly convergent with respect to the H-norm to a saddle-point solution.
An unexpected corollary of the result above is that the set of saddle points is a singleton. The above theorem yields the following estimate on the number of iterations required to achieve a specified distance to a saddle point.

Corollary 12.13
Under Assumptions 1 and 3, let ū = (x̄, ȳ) be the limit point of the sequence generated by the EPAPC algorithm. In order to obtain the accuracy (12.33), it suffices to compute k iterations, where k depends on the target accuracy through the rate constant δ given in (12.32) (the explicit bound is given in [3]).

EPAPC for Statistical Multi-resolution Estimation of STED Images
An efficient computational strategy for evaluating or at least approximating the projection P (ker A T ) ⊥ in Step 6 of Algorithm 2 has not yet been established. We report here preliminary computational results of Algorithm 2 without computing Step 6.
Our results show that the method is promising, though error bounds to the solution to (S) are not justified without computation of P_{(ker A^T)^⊥}. In our numerical experiments, the constraint penalty in (S) is an exact penalty for the interval constraints of (P_SMRE), and so solutions to (S) correspond to solutions to (P_0). Using Moreau's identity (12.6), the prox-mapping in Step (3) is evaluated explicitly for each constraint. The proximal parameter is a function of τ, given by σ = 1/(τ‖AA^T‖). More details can be found in [37, Sect. 4.1].
Here, we also consider a smooth approximation of the L1-norm as the qualitative objective. The L1-norm is nonsmooth at the origin; thus, in order to make derivative-based methods applicable, we consider a smoothed approximation known as the Huber function.
The Huber loss function is defined as follows:

φ_α(t) := t²/(2α) if |t| ≤ α,  and  φ_α(t) := |t| − α/2 if |t| > α,  (12.35)

where α > 0 is a small parameter defining the trade-off between quadratic regularization (for small values of |t|) and L1 regularization (for larger values). The function φ_α is smooth with (1/α)-Lipschitz continuous derivative, given by

φ'_α(t) = t/α if |t| ≤ α,  and  φ'_α(t) = sign(t) if |t| > α.  (12.36)

Pointwise quadratic supportability of this function at solutions is not unreasonable, but still must be assumed. We demonstrate our reconstruction of the image inset shown in Fig. 12.1 of size n = 64² with the same SMRE model as the demonstration in Sect. 12.3.1. The confidence level γ_i was set to 0.25·i at each resolution level (i = 1, 2, 3). Figure 12.3 (bottom) shows the step size of the primal-dual pair for each of these regularized problems as a function of iteration. The model with quadratic regularization achieves a better average rate of convergence, but for both objective functions the algorithm appears to exhibit R-linear convergence (not Q-linear). What is not evident from these experiments is the computational effort required per iteration. Without computation of the pseudo-inverse in Step 6, the EPAPC algorithm computes these results in about 30 s on a 2018-era laptop, compared to several days for the results shown in Fig. 12.2.
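The Huber function (12.35) and its derivative (12.36) translate directly into code (a straightforward transcription; the names huber and huber_prime are ours):

```python
def huber(t, alpha):
    # (12.35): quadratic near the origin, linear in the tails,
    # continuous at |t| = alpha where both branches equal alpha/2
    if abs(t) <= alpha:
        return t * t / (2.0 * alpha)
    return abs(t) - alpha / 2.0

def huber_prime(t, alpha):
    # (12.36): t/alpha near the origin, sign(t) in the tails;
    # Lipschitz continuous with constant 1/alpha
    if abs(t) <= alpha:
        return t / alpha
    return 1.0 if t > 0 else -1.0
```

The matching of values and slopes at |t| = α is exactly why the function is once, but not twice, continuously differentiable, which is consistent with it failing strong convexity while remaining a candidate for pointwise quadratic supportability.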

Randomized Block-Coordinate Primal-Dual Methods
The previous sections reviewed numerical strategies and structures that yield quantitative estimates of the distance of an iterate to the solution of the underlying variational problem. In this section we examine implementation strategies for dealing with high dimensional problems. These are implementation strategies because they do not involve changing the optimization model. Instead, we select at random a smaller subset of variables or constraints in the computation of an update in the full-dimensional iterative procedure. This is the principal strategy for handling problems that, due to their size, must be distributed across many processing and storage units (see for instance [26,[38][39][40] and references therein). We survey here a randomized primal-dual technique proposed and analyzed in [2]. The main theoretical question to resolve with such approaches is whether, and in what sense, iterates converge to a solution to the original problem. We can determine whether the iterates converge, but obtaining an estimate of the distance to the solution remains an open problem.
The algorithm below is a primal-dual method like the algorithms reviewed above, with the exception that it solves an extension of the dual problem (D): the main prox operation is computed on the dual objective in (D), that is, f*(x) + g*(y), with respect to the variables (x, y) ∈ R^n × R^m. The dimension of the basic operations is unchanged from the previous approaches, but the structure of the sum of functions allows for efficient evaluation of the prox mapping. Implicit in this is that the function f is prox-friendly. In the algorithm description below it is convenient to use the convention f ≡ g_0, A_0 ≡ Id. The algorithm is based in part on [39].
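To illustrate what prox-friendliness buys: for f = ‖·‖_1 the proximal map has the closed form of componentwise soft thresholding, and the prox of the Fenchel conjugate, which is what is needed when the prox operation is computed on the dual objective, follows from the Moreau identity prox_{τf*}(x) = x − τ·prox_{f/τ}(x/τ). A minimal sketch (the helper names are ours):

```python
import numpy as np

def prox_l1(x, tau):
    """Proximal map of tau*||.||_1: componentwise soft thresholding."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def prox_l1_conj(x, tau):
    """Prox of the Fenchel conjugate of the 1-norm via the Moreau identity.
    Since the conjugate of ||.||_1 is the indicator of the unit inf-ball,
    this reduces to componentwise clipping of x to [-1, 1]."""
    return x - tau * prox_l1(x / tau, 1.0 / tau)
```

Both maps cost O(n), which is what makes the per-iteration prox evaluation cheap relative to the matrix-vector products.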
Notice that each iteration of Algorithm 3 requires only two small matrix-vector multiplications: A_i(·) and A_i^T(·). The methods of the previous sections, in contrast, worked with the full matrix A = [A_1^T, ..., A_m^T]^T, meaning that all iterations involve full vector operations. For some applications this might not be feasible, at least on standard desktop computers, due to the size of the problems. Algorithm 3 uses only blocks A_i of A, so each iteration requires fewer floating-point operations, at the cost of having less information available for choosing the next step. This reduction in the effectiveness of the step is compensated for through larger blockwise steps. Computation of the step size is particularly simple; it follows the same hybrid step-length approach developed in [41] for the nonlinear problem of blind ptychography. In particular, we use step sizes σ_i adjusted to each block A_i.

Proposition 12.14 (Theorem 1 of [2]) Suppose Assumption 1 holds and let τσ_i‖A_i‖² < 1. Then the sequence (x^k, y^k) generated by Algorithm 3 converges to a solution to (D). In particular, the sequence (x^k) converges almost surely to a solution to (P).
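The blockwise step-size condition of Proposition 12.14 is easy to satisfy in practice: given τ, choose each σ_i just below 1/(τ‖A_i‖²), where ‖A_i‖ is the spectral norm of the block. A sketch with hypothetical random blocks (the values of τ and the safety factor are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical blocks A_i of the stacked operator A = [A_1^T, ..., A_m^T]^T.
blocks = [rng.standard_normal((8, 16)) for _ in range(3)]

tau = 0.5   # primal step size (an assumed value)
c = 0.99    # safety factor keeping the product strictly below 1

# Blockwise dual step sizes sigma_i chosen so that tau * sigma_i * ||A_i||^2 < 1,
# the condition of Proposition 12.14; ||A_i|| is the spectral norm of block i.
sigmas = [c / (tau * np.linalg.norm(A, 2) ** 2) for A in blocks]
```

Because each σ_i depends only on its own block, a poorly conditioned block does not force small steps on the others, which is the source of the "larger blockwise steps" mentioned above.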
The statement above concerns part (i) of Theorem 1 of [2]. No smoothness is required of the qualitative regularization f. Instead, it is assumed that this function is prox-friendly. This opens up the possibility of using the 1-norm as a regularizer, promoting, in some sense, sparsity in the image. No claim is made on the rate of convergence, though the numerical experiments below indicate that, for regular enough functions f, convergence might be locally linear. This remains to be proved.

RBPD for Statistical Multi-resolution Estimation of STED Images
Despite many open questions regarding convergence, random methods offer a way to handle extremely large problems. To make a comparison with the deterministic approaches above, cycles of the RBPD Algorithm 3 are counted in terms of epochs. An epoch is the number of passes through steps 1-6 of Algorithm 3 required before each block is chosen at least once. After k epochs, therefore, the i-th coordinate of x will be updated, on average, the same number of times for the randomized algorithm as for the deterministic methods. In other words, an epoch for a randomized block-wise method is comparable to an iteration of a deterministic method. As RBPD updates only one block per iteration, each iteration is less computationally intensive than its deterministic counterparts. However, in our case this efficient iteration still requires the evaluation of two (possibly) expensive convolution products (embedded in A_i x and A_i^T y). Thus, if these operations are relatively expensive, the efficiency gain will be marginal. Nevertheless, because of the ability to operate on smaller blocks, the randomized method requires, per epoch, approximately half the time required per iteration of the deterministic methods. Although the quantitative convergence analysis remains open, our numerical experiments indicate that the method achieves a comparable step-residual to the EPAPC Algorithm 2 after the same number of epochs/iterations.
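Since blocks are drawn at random, the length of an epoch is itself random: with uniform sampling over m blocks it is the classical coupon-collector quantity, with expectation m·H_m (H_m the m-th harmonic number). A small simulation of this bookkeeping (the function name is ours; m = 14 matches the 1 + 4 + 9 blocks used in the experiments below):

```python
import random

def epoch_length(m, rng):
    """Draw block indices uniformly at random until each of the m blocks
    has been selected at least once; return the number of draws (one epoch)."""
    seen, draws = set(), 0
    while len(seen) < m:
        seen.add(rng.randrange(m))
        draws += 1
    return draws
```

For m = 14 the expected epoch length is 14·H_14 ≈ 45.5 draws, so an epoch costs noticeably more than m single-block iterations; this overhead is part of the accounting when comparing against deterministic iterations.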
As with the experiments in the previous sections, we use three resolutions, which results in one block at the highest resolution, four blocks at the next resolution (four possible shifts of 2 × 2 pixels), and nine blocks at the third resolution (nine different shifts of 3 × 3 pixels). We applied Algorithm 3 with different regularizations f in (P_0): the 1-norm f(x) = ‖x‖_1, the Huber loss function f(x) = ‖x‖_{1,α} given by (12.35) (α = 0.25), and the squared Euclidean norm. As with the EPAPC experiments, the function g is given by (12.11) with g_i given by (12.17) for the parameter γ_i = 0.25·i, i = 1, 2, 3. All of these functions are prox-friendly and have closed-form Fenchel conjugates. The gain in efficiency over the deterministic EPAPC method proposed above (without computation of the pseudo-inverse) is a factor of 2. Figure 12.4a-c shows the reconstructions on the same 64 × 64 image data used in the previous sections. The numerical performance of the algorithm is shown in Fig. 12.4d. What the more efficient randomization strategy enables is processing of the full 976 × 976 pixel image. The result for regularization with the 1-norm is shown in Fig. 12.5.
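The block count described above follows directly from the shift structure: at resolution level r there are r × r possible shifts of an r × r pixel window. A minimal sketch of this enumeration (the function name is ours):

```python
def shift_blocks(resolutions=(1, 2, 3)):
    """For each resolution r, enumerate the r*r possible shifts (dy, dx)
    of an r x r pixel window: 1, 4 and 9 blocks for r = 1, 2, 3."""
    return {r: [(dy, dx) for dy in range(r) for dx in range(r)]
            for r in resolutions}
```

Summing over the three levels gives the 14 blocks (1 + 4 + 9) that the randomized algorithm samples from.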