Nanoscale Photonic Imaging, pp. 313–338

# Efficient, Quantitative Numerical Methods for Statistical Image Deconvolution and Denoising


## Abstract

We review the development of efficient numerical methods for statistical multi-resolution estimation in optical imaging experiments. In principle, this involves constrained linear deconvolution and denoising, and so these problems can be formulated as convex constrained, or even unconstrained, optimization. We address two main challenges: the first is to quantify the convergence of iterative algorithms; the second is to develop efficient methods for these large-scale problems without sacrificing the quantification of convergence. We review the state of the art for both challenges.

## Keywords

Augmented Lagrangian Primal-dual Saddle point Pointwise quadratic supportable Statistical multiscale analysis Linear convergence Randomized algorithm

## 2010 Mathematics Subject Classification

Primary 49J52 49M20 90C26; Secondary 15A29 47H09 65K05 65K10 94A08.

## 12.1 Introduction

In this chapter we review progress towards addressing two main challenges in scientific image processing. The first of these is to quantify convergence of iterative algorithms for image processing to solutions (as opposed to optimal values) to the underlying variational problem. The second challenge is to develop efficient methods for these large-scale problems without sacrificing the quantification of convergence. The techniques surveyed here were studied in [1, 2, 3]. We present only the main results from these studies, in the context that hindsight provides.

Scientific images are often processed with software that accomplishes a number of tasks like registration, denoising and deblurring. Implicit in the processing is that some systematic error is being corrected to bring the image closer to the truth. This presumption is more complicated for denoising and deblurring. These are often accomplished by filtering or by solving some variational problem such as minimizing the variance of an image. For applications requiring speedy processing, such as audio and video communication, this is sufficient. But the recent development of nanoscale photonic imaging modalities such as STED and RESOLFT featured in Chaps. 1, 7 and 9 has shifted the focus of image denoising and deconvolution from qualitative to quantitative models.

Quantitative approaches to image processing are the subject of Chap. 11 where statistical multiscale estimation is discussed (see Sect. 11.3). Here, the recovered image comes with statistical statements about how far the processed image is, in some statistical sense, from the truth. The estimators are almost exclusively variational, that is, they can be characterized as the solution to an optimization problem. It is important to emphasize that the value of the optimization problem is meaningless. This stands in stark contrast to many conventional applications in economics and operations research, where the value of the optimal solution is related to profit or cost, and so is of principal interest.

The focus on optimal solutions rather than optimal values places heavy demands on the structure of model formulations and the algorithms for solving them. Unless the numerical method allows one to state how far a computed iterate is from the solution of the underlying variational problem, the scientific significance of the iterate is lost.

The leading computational approaches for solving imaging problems with multi-resolution statistical estimation criteria are based on iterated proximal operators. Most of the analysis for first-order iterative proximal methods is limited to statements about rates of convergence of function values, if rates are discussed at all (see for instance [4, 5, 6, 7]). First-order methods have slow convergence in the worst case. A common assumption used to guarantee linear convergence of the iterates is strong convexity, but this is far more than is necessary and, in particular, is not satisfied by the Huber function (12.35). It was shown in [8] that metric subregularity is necessary for local linear convergence. Aspelmeier, Charitha and Luke [1] showed that the popular alternating directions method of multipliers algorithm (ADMM) applied to optimization problems with piecewise linear-quadratic objective functions (e.g. the Huber function), together with linear inequality constraints, generically satisfies metric subregularity at isolated critical points; hence linear convergence of the iterates for this algorithm can be expected without further ado. More recently, in [3] it was shown that the primal iterates of a modification of the PAPC algorithm (Algorithm 2) converge R-linearly for any *quadratically supportable* objective function (for instance, the Huber function). Conventional results without metric subregularity obtain a convergence rate of *O*(1/*k*) with respect to the function values. In settings like qualitative image processing or machine learning such results are acceptable, but in the setting of statistical image processing these statements carry no scientific content. We present in this chapter efficient iterative first-order methods that offer some hope of quantitative guarantees on the distance of the iterates to optimal solutions.

## 12.2 Problem Formulation

We limit our scope to the real vector space \({\mathbb {R}^n}\) with the norm generated from the inner product. The closed unit ball centered on the point \(y\in {\mathbb {R}^n}\) is denoted by \(\mathbb {B}(y)\). The positive orthant (resp. negative orthant) in \({\mathbb {R}^n}\) is denoted by \({\mathbb {R}^n}_{\!+}\) (resp. \({\mathbb {R}^n}_{\!-}\)). The domain of an extended real-valued function \(\varphi :{\mathbb {R}^n}\rightarrow \overline{\mathbb {R}}\equiv (-\infty , +\infty ]\) is \({\mathrm{dom}\,}(\varphi )\equiv \{z \in {\mathbb {R}^n}:\varphi (z)<+\infty \}\). The *Fenchel conjugate* of \(\varphi \) is denoted by \(\varphi ^*\) and is defined by \(\varphi ^*(u)\equiv \sup _{z\in {\mathbb {R}^n}}\{\langle {z,u}\rangle -\varphi (z)\}\). The set of symmetric \(n\times n\) positive (semi)-definite matrices is denoted by \(\mathbb {S}^n_{++}\) (\(\mathbb {S}^n_+\)). The notation \(A \succ 0\) \((A \succeq 0)\) denotes a positive (semi)definite matrix *A*. For any \(z \in {\mathbb {R}^n}\) and any \(A \in \mathbb {S}^n_+\), we denote the semi-norm \(\Vert z\Vert ^2_A :=\langle {z,Az}\rangle \). The operator norm is defined by \(\Vert A\Vert = \max _{u\in {\mathbb {R}^n}} \{ \Vert Au\Vert : \Vert u\Vert =1\}\) and coincides with the spectral radius of *A* whenever *A* is symmetric. If \(A \ne 0\), \(\sigma _{min} (A)\) denotes its smallest nonzero singular value. For a sequence \(\{z^{k}\}_{k\in \mathrm {\mathbb {N}}}\) converging to \(z^*\), we say the convergence is *Q*-linear if there exists \(c \in (0, 1)\) such that \(\frac{\Vert z^{k+1}-z^*\Vert }{\Vert z^{k}-z^*\Vert } \le c\;\) for all *k*; convergence is *R*-linear if there exists a sequence \(\eta _{k}\) such that \(\Vert z^{k}-z^*\Vert \le \eta _{k}\) and \(\eta _{k}\rightarrow 0\) *Q*-linearly [9, Chap. 9].
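As an illustration, *Q*-linear convergence of a sequence can be checked empirically by testing whether the ratios of successive errors stay bounded away from 1. The following sketch (the function name and the sample sequence are ours, purely illustrative) does this in Python:

```python
import numpy as np

def is_q_linear(seq, z_star, tol=1e-12):
    """Empirically check whether ||z^{k+1}-z*|| / ||z^k-z*|| is bounded by some c < 1."""
    errs = np.array([np.linalg.norm(z - z_star) for z in seq])
    errs = errs[errs > tol]            # discard terms already at machine precision
    ratios = errs[1:] / errs[:-1]
    c = ratios.max()
    return bool(c < 1.0), c

# A Q-linearly convergent sequence: z^{k+1} = 0.5 * z^k.
seq = [np.array([1.0, -2.0]) * 0.5**k for k in range(40)]
ok, c = is_q_linear(seq, np.zeros(2))
print(ok, round(c, 3))   # True 0.5
```

Note that *R*-linear convergence only requires a *Q*-linearly vanishing majorant, so the ratio test above can fail for sequences that are nevertheless *R*-linearly convergent.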

Our analysis applies to *convex* functions, but convexity is not the central property governing quantitative convergence estimates. By the *subdifferential* of a function \(\varphi \), denoted \(\partial \varphi \), we mean the collection of all *subgradients* that can be written as limits of sequences of *Fréchet subgradients* at nearby points; a vector *v* is a *(Fréchet) subgradient* of \(\varphi \) at *y*, written \(v\in \widehat{\partial } \varphi (y)\), if
$$\begin{aligned} \liminf _{x\rightarrow y,\,x\ne y}\frac{\varphi (x)-\varphi (y)-\langle {v,x-y}\rangle }{\Vert x-y\Vert }\ge 0. \end{aligned}$$
The functions we consider are *subdifferentially regular* on their domains, that is, the epigraphs of the functions are *Clarke regular* at points where they are finite [10, Definition 7.25]. For our purposes it suffices to note that, for a function \(\varphi \) that is subdifferentially regular at a point *y*, the subdifferential is nonempty and all subgradients are Fréchet subgradients, that is, \(\partial \varphi (y)=\widehat{\partial } \varphi (y)\ne \emptyset \). Convex functions, in particular, are subdifferentially regular on their domains and the subdifferential has the particularly simple representation as the set of all vectors *v* where
$$\begin{aligned} \varphi (x)\ge \varphi (y)+\langle {v,x-y}\rangle \quad \forall x\in {\mathbb {R}^n}. \end{aligned}$$

A mapping \(\varPhi :{\mathbb {R}^n}\rightrightarrows {\mathbb {R}^n}\) is called \(\beta \)*-inverse strongly monotone* [10, Corollary 12.55] if for all \(x, x' \in {\mathbb {R}^n}\), \(v\in \varPhi (x)\) and \(v'\in \varPhi (x')\),
$$\begin{aligned} \langle {v-v',x-x'}\rangle \ge \beta \Vert v-v'\Vert ^2. \end{aligned}$$

A set-valued mapping is called *polyhedral* (or piecewise polyhedral [10]) if its graph is the union of finitely many sets that are polyhedral convex in \({\mathbb {R}^n}\times {\mathbb {R}^n}\) [11]. Polyhedral mappings are generated by the subdifferentials of piecewise linear-quadratic functions (see Proposition 12.9).

### Definition 12.1

(*piecewise linear-quadratic (plq) functions*) A function \(f:{\mathbb {R}^n}\rightarrow [-\infty , +\infty ]\) is called *piecewise linear-quadratic* if \(\mathrm{dom}\,f\) can be represented as the union of finitely many polyhedral sets, relative to each of which *f*(*x*) is given by an expression of the form \(\frac{1}{2}\langle x, Ax\rangle +\langle a, x \rangle +\alpha \) for some scalar \(\alpha \in \mathbb {R}\), vector \(a\in {\mathbb {R}^n}\), and symmetric matrix \(A\in \mathbb {R}^{n\times n} \).
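Definition 12.1 is conveniently illustrated by the Huber function mentioned in the introduction. The sketch below uses one common parametrization (the chapter's (12.35) may be scaled differently) and evaluates the quadratic and affine pieces:

```python
import numpy as np

# The Huber function is piecewise linear-quadratic in the sense of Definition 12.1:
# dom f = R splits into the three polyhedral pieces {x <= -delta}, {-delta <= x <= delta},
# {x >= delta}, and on each piece f has the form 0.5*<x, A x> + <a, x> + alpha.
def huber(x, delta=1.0):
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) <= delta,
                    0.5 * x**2,                          # A = 1, a = 0, alpha = 0
                    delta * np.abs(x) - 0.5 * delta**2)  # A = 0, a = +/-delta, alpha = -delta^2/2

print(huber(np.array([-3.0, 0.5, 3.0])))
```

The pieces agree at the breakpoints \(|x|=\delta \), so the function is continuous (indeed continuously differentiable) although its Hessian jumps between the pieces.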

Closely related to plq functions are pointwise quadratically supportable functions.

### Definition 12.2

(*pointwise quadratically supportable (pqs)*) A proper, extended-valued function \(\varphi : {\mathbb {R}^n}\rightarrow \mathbb R\cup \{+\infty \}\) is said to be *pointwise quadratically supportable at y* if it is subdifferentially regular there and there exist a neighborhood *V* of *y* and a constant \(\mu >0\) such that
$$\begin{aligned} \varphi (x)\ge \varphi (y)+\langle {v,x-y}\rangle +\tfrac{\mu }{2}\Vert x-y\Vert ^2\quad \forall v\in \partial \varphi (y),~\forall x\in V. \end{aligned}$$(12.4)
If for every bounded neighborhood *V* of *y* there exists a constant \(\mu >0\) such that (12.4) holds, then the function \(\varphi \) is said to be *pointwise quadratically supportable at y on bounded sets*. If (12.4) holds with one and the same constant \(\mu >0\) on all neighborhoods *V*, then \(\varphi \) is said to be *uniformly* pointwise quadratically supportable at *y*.

For more on the relationship between pointwise quadratic supportability, coercivity, strong monotonicity and strong convexity see [3].

For a set-valued mapping \(\varPhi \) we define the *resolvent* of \(\varPhi \) by \({\mathcal J}_{\varPhi }\equiv \left( {\mathrm{Id}}+{\varPhi }\right) ^{-1}\), where \(\mathrm{Id}\) denotes the identity mapping and the inverse is defined as \(\varPhi ^{-1}(y)\equiv \{x\in {\mathbb {R}^n}:y\in \varPhi (x)\}\). The corresponding *reflector* is defined by \(R_{\eta \varPhi } \equiv 2{\mathcal J}_{\eta \varPhi }-\mathrm{Id}\). One of the more prevalent examples of resolvents is the *proximal map*. For \(\varphi :{\mathbb {R}^n}\rightarrow (-\infty ,\infty ]\) a proper, lsc and convex function and for any \(u \in {\mathbb {R}^n}\) and \(Q \in \mathbb {S}^n_{++}\), the proximal map associated with \(\varphi \) with respect to the weighted Euclidean norm is uniquely defined by
$$\begin{aligned} \mathrm{prox}^Q_{\varphi }(u)\equiv \mathop {\mathrm{argmin}}\limits _{x\in {\mathbb {R}^n}}\left\{ \varphi (x)+\tfrac{1}{2}\Vert x-u\Vert ^2_Q\right\} . \end{aligned}$$
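For a concrete instance of the proximal map, the following sketch evaluates it for \(\varphi =\lambda \Vert \cdot \Vert _1\) with the weight \(Q=q\,\mathrm{Id}\), in which case it reduces to coordinate-wise soft-thresholding at \(\lambda /q\) (a standard fact; the function name and parameters are illustrative):

```python
import numpy as np

def prox_l1(u, lam=1.0, q=1.0):
    """Proximal map of lam*||.||_1 in the weighted norm with Q = q*Id:
    argmin_x { lam*||x||_1 + (q/2)*||x - u||^2 }.
    The problem separates by coordinate, giving soft-thresholding at t = lam/q."""
    t = lam / q
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

print(prox_l1(np.array([2.0, -0.3, 1.0]), lam=0.5))
```

Each coordinate with magnitude below the threshold is set exactly to zero, which is why this map promotes sparsity when used inside the iterations below.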

Notions of *continuity* of set-valued mappings have been thoroughly developed over the last 40 years. Readers are referred to the monographs [10, 11, 13] for basic results. A key property of set-valued mappings that we will rely on is *metric subregularity*, which can be understood as the property corresponding to a Lipschitz-like continuity of the inverse mapping relative to a specific point. It is a weaker property than *metric regularity* which, in the case of an \(n\times m\) matrix for instance, is equivalent to surjectivity. Our definition follows the characterization of this property given in [11, Exercise 3H.4].

### Definition 12.3

(*metric subregularity*) The mapping \(\varPhi :{\mathbb {R}^n}\rightrightarrows {\mathbb {R}^m}\) is called *metrically subregular at* \(\overline{x}\) *for* \(\overline{y}\) *relative to* \(W\subset {\mathbb {R}^n}\) if \((\overline{x},\overline{y})\in {\mathrm {gph}}\varPhi \) and there is a constant \(c>0\) and a neighborhood \({\mathcal O}\) of \(\overline{x}\) such that
$$\begin{aligned} \mathrm {dist}\left( x,\varPhi ^{-1}(\overline{y})\cap W\right) \le c\,\mathrm {dist}\left( \overline{y},\varPhi (x)\right) \quad \forall x\in {\mathcal O}\cap W. \end{aligned}$$

The constant *c* measures the stability under perturbations of the inclusion \({\overline{y}}\in \varPhi ({\overline{x}})\). An important instance where metric subregularity comes for free is for polyhedral mappings.

### Proposition 12.4

(polyhedrality implies metric subregularity) Let \(W\subset V\) be an affine subspace and \(T:\,W\rightrightarrows W\,\). If *T* is polyhedral and \(\mathsf {Fix}\,T\cap W\) is an isolated point, \(\{{\overline{x}}\}\), then \(T-\mathrm{Id}:\,W\rightrightarrows (W-{\overline{x}})\,\) is metrically subregular at \({\overline{x}}\) for 0 relative to *W*.

### Proof

Polyhedrality and isolated fixed points in fact imply *strong* metric subregularity. See [11, Propositions 3I.1 and 3I.2].

A notion related to metric regularity is that of *weak-sharp solutions*. This will be used in the development of error bounds (Theorem 12.6).

### Definition 12.5

(*weak sharp minimum* [14]) The solution set \(\mathrm {argmin}\,\left\{ f(x)\,\left| \,x\in \varOmega \right. \right\} \), for a nonempty closed convex set \(\varOmega \), is weakly sharp if, for \({\overline{p}}=\inf _\varOmega f\), there exists a positive number \(\alpha \) (sharpness constant) such that
$$\begin{aligned} f(x)\ge {\overline{p}}+\alpha \,\mathrm {dist}\left( x,\mathop {\mathrm {argmin}}\nolimits _{\varOmega } f\right) \quad \forall x\in \varOmega . \end{aligned}$$

### 12.2.1 Abstract Problem

In the applications we have in mind, the number of constraints, *m*, is much greater than the dimension of the unknowns, *n*, and grows superlinearly with respect to the number of unknowns.

The function *g* is constructed as an *exact* penalization [15], that is, a relaxation of the constraints where, for all parameters \(\rho \) large enough, the constraints are exactly satisfied.

### Proposition 12.6

Suppose Assumption 2 holds. Then the set of solutions to (\({{\mathcal P}}\)) with *g* defined by (12.8) is bounded and for all \(\rho \) large enough, the solutions to (\({{\mathcal P}_0}\)) and (\({{\mathcal P}}\)) coincide.

### Proof

This is a distillation of [1, Theorem 3.4].
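The effect described in Proposition 12.6 can be seen already in one dimension. The toy sketch below (all numbers illustrative, not from the chapter) minimizes \(f(x)=\tfrac{1}{2}(x-1)^2\) subject to \(x\le 0\) by penalizing the constraint violation; once \(\rho \ge |f'(0)|=1\), the penalized minimizer satisfies the constraint exactly:

```python
import numpy as np

# Exact penalization in one dimension: minimize f(x) = 0.5*(x-1)^2 subject to x <= 0.
# The penalized problem min_x f(x) + rho*max(x, 0) has minimizer max(0, 1 - rho),
# so the constrained solution x = 0 is recovered *exactly* for every rho >= 1,
# not merely in the limit rho -> infinity as with a smooth quadratic penalty.
def penalized_min(rho, grid=np.linspace(-2, 2, 400001)):
    vals = 0.5 * (grid - 1.0)**2 + rho * np.maximum(grid, 0.0)
    return grid[np.argmin(vals)]

for rho in (0.5, 1.0, 4.0):
    print(rho, penalized_min(rho))
```

The nonsmoothness of the penalty \(\rho \max (x,0)\) at the boundary is essential: it allows the penalty's subdifferential to absorb the gradient of *f* at the constrained solution for finite \(\rho \).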

In (\({{\mathcal P}_0}\)) and (\({{\mathcal P}}\)) the function *f* is often smooth, but not prox-friendly. In applications it is most often a smooth regularization or a fidelity term. For the ADMM/DR method reviewed in Sect. 12.3, smoothness is not required.

It is convenient to have *g* also prox-friendly. For instance, if the constraints are *separable*, then the function *g* decomposes into a sum of simpler functions whose proximal maps can be evaluated independently, and in parallel.

### 12.2.2 Saddle Point and Dual Formulations

We represent the function *g* in (\({{\mathcal P}}\)) as the *image* of a function \(g^*\) under Fenchel conjugation, that is, \(g(x) = (g^*)^*(x)\). Writing this explicitly into (\({{\mathcal P}}\)) yields the saddle point problem (\({{\mathcal S}}\)), the *partial dual* to problem (\({{\mathcal P}}\)). The full dual problem involves the Fenchel conjugate of the entire objective function. For (\({{\mathcal P}}\)) the dual problem is (\({{\mathcal D}}\)).

A pair \(({\hat{x}},{\hat{y}})\) is a saddle point of the Lagrangian *L* if and only if \({\hat{x}}\) is an optimal solution of the primal problem (\({{\mathcal P}_0}\)), and \({\hat{y}}\) is an optimal solution of the dual problem (\({{\mathcal D}}\)). The following two inclusions characterize the solutions of the problems (\({{\mathcal P}_0}\)) and (\({{\mathcal D}}\)) respectively, in terms of the operators *B* and *D*.

### 12.2.3 Statistical Multi-resolution Estimation

Statistical multi-resolution estimation (SMRE) discussed in Sect. 11.2.7 of Chap. 11 is specialized here for the case of imaging systems with Gaussian noise. Let \({\mathcal G}=\{1,\dots , N\}\times \{1,\dots , N\}\) denote a grid of \(N^2=n\) points. Denote by \({\mathcal V}\) a collection of subsets of \({\mathcal G}\). Suppose there are *M* such subsets, that is \(M=|{\mathcal V}|\), each of these subsets \(V_i\) consisting of \(m_i\le n\) grid points with \(\sum _{i=1}^{M}m_i=m\ge n\).
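To fix ideas, the sketch below builds one hypothetical collection \({\mathcal V}\) of \(s\times s\) blocks and counts \(M\) and \(m\); the chapter's actual collection (which yields \(m=12161\) for \(N=64\) in Sect. 12.3.1) may be constructed differently, so this is only meant to illustrate how the counts are formed:

```python
# Hypothetical multiscale system V of non-overlapping s x s blocks on an N x N grid.
# Each element V_i of V is a list of m_i grid points; M = |V| and m = sum_i m_i,
# matching the notation of the text (the chapter's actual window system may differ).
def multiscale_blocks(N, scales=(1, 2)):
    V = []
    for s in scales:
        for i in range(0, N - s + 1, s):        # tile the grid with s x s blocks
            for j in range(0, N - s + 1, s):
                V.append([(i + di, j + dj) for di in range(s) for dj in range(s)])
    return V

V = multiscale_blocks(4, scales=(1, 2))
M = len(V)
m = sum(len(Vi) for Vi in V)
print(M, m)   # 20 32
```

Already for this tiny \(4\times 4\) grid the number of constraints *m* exceeds the number of unknowns \(n=16\), which is the regime described in Sect. 12.2.1.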

The constants \(\gamma _{i}\) are determined by the *quantile* of the estimator. The data *b* need not be in the range of the imaging operator *A*; all that is assumed is that this mapping is injective, not surjective. This means that, in applications, practitioners need to be careful not to make the constraint \(\gamma _{i}\) too small, otherwise the optimization problem might be *infeasible*. If the algorithms presented below appear to be diverging for a particular instance of (\({{\mathcal P}_{SMRE}}\)), it is because the problem is infeasible; increasing the constants \(\gamma _{i}\) should solve the problem.

## 12.3 Alternating Directions Method of Multipliers and Douglas–Rachford

ADMM and the Douglas–Rachford (DR) algorithm are *splitting methods*, which are the principal approach to handling the computational burden of large-scale, separable problems [22]. ADMM belongs to a class of *augmented Lagrangian methods* whose original motivation was to regularize Lagrangian formulations of constrained optimization problems. The ADMM algorithm for solving (\({{\mathcal P}_{\mathcal L}}\)) follows. The penalty parameter \(\eta \) need not be a constant, and indeed evidence indicates that the choice of \(\eta \) can greatly impact the complexity of the algorithm. For simplicity we keep this parameter fixed.

We do not specify how the argmin in Algorithm 1 should be calculated, and indeed the analysis that follows assumes that these can be computed exactly.^{1} One problem that should be immediately apparent is that this algorithm operates on a space of dimension \(n+2m\). Since one of the two challenges we address is high dimension, this expansion in the dimension of the problem formulation should be troubling. Nevertheless, we show with this algorithm how the first challenge, namely the *quantification* of convergence, is achieved.
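For readers unfamiliar with the structure of such iterations, the following toy sketch shows a standard scaled-form ADMM for a one-block problem with \({\mathcal A}=\mathrm{Id}\). It is a stand-in for Algorithm 1, whose precise steps are not reproduced here, and all data (*b*, \(\eta \), the box) are illustrative choices:

```python
import numpy as np

# Minimal scaled-form ADMM sketch for min_x 0.5*||x - b||^2 + g(x), where g is the
# indicator of a box and A = Id.  Both argmin steps are available in closed form here.
def admm_box(b, lo=-0.5, hi=0.5, eta=1.0, iters=200):
    z = np.zeros_like(b); u = np.zeros_like(b)
    for _ in range(iters):
        x = (b + eta * (z - u)) / (1.0 + eta)  # argmin of f plus the augmented term
        z = np.clip(x + u, lo, hi)             # argmin of g: projection onto the box
        u = u + x - z                          # scaled multiplier (dual) update
    return x

print(admm_box(np.array([2.0, 0.2])))   # ≈ the box projection of b: [0.5, 0.2]
```

Because *f* here is strongly convex and quadratic, the iterates converge linearly, which is the behavior Propositions 12.8 and 12.10 quantify in far greater generality.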

The fixed point operators underlying Algorithm 1 are built from the *resolvents* of \(\eta D\) and \(\eta B\) respectively. Since these mappings are maximally monotone, the resolvents, and hence the operator *T*, are single-valued [12].

### Proposition 12.7

(Proposition 2.2 [1]) Let \(f:{\mathbb {R}^n}\rightarrow \overline{\mathbb {R}}\) and \(g:{\mathbb {R}^m}\rightarrow \overline{\mathbb {R}}\) be proper, lsc and convex. Let \({\mathcal A}:\,{\mathbb {R}^n}\rightarrow {\mathbb {R}^m}\,\) be linear and suppose there exists a solution to \(0\in (B+D)(x)\) for *B* and *D* defined by (12.22). For fixed \(\eta >0\), given any initial points \(\xi ^0\) and \(\left( y^0, z^0\right) \in {\mathrm {gph}}D\) such that \(\xi ^0=y^0+\eta z^0\), the sequences \(\left( z^k\right) _{k\in \mathrm {\mathbb {N}}}\), \(\left( \xi ^k\right) _{k\in \mathrm {\mathbb {N}}}\) and \(\left( y^k\right) _{k\in \mathrm {\mathbb {N}}}\) defined respectively by (12.18), (12.20) and \(y^k\equiv \tfrac{1}{\eta }\left( \xi ^k-z^k\right) \) converge to points \({\overline{z}}\in \mathsf {Fix}\,T'\), \({\overline{\xi }}\in \mathsf {Fix}\,T\) and \({\overline{y}}\in D\left( \mathsf {Fix}\,T'\right) \). The point \({\overline{z}}= {\mathcal J}_{\eta D}{\overline{\xi }}\) is a solution to (\({{\mathcal D}}\)), and \({\overline{y}}=\frac{1}{\eta }\left( {\overline{\xi }}-{\overline{z}}\right) \in D{\overline{z}}\). If, in addition, \({\mathcal A}\) has full column rank, then the sequence \(\left( y^k, z^k\right) _{k\in \mathrm {\mathbb {N}}}\) corresponds exactly to the sequence of points generated in steps 2–3 of Algorithm 1 and the sequence \(\left( \xi ^{k+1}\right) _{k\in \mathrm {\mathbb {N}}}\) converges to \({\overline{\xi }}\), a solution to (\({{\mathcal P}_0}\)).

The correspondence between Douglas-Rachford and ADMM in the proposition above means that if quantitative convergence can be established for one of the algorithms, it is automatically established for the other. Linear convergence of Douglas-Rachford under the assumption of strong convexity and Lipschitz continuity of *f* was already established by Lions and Mercier [26]. Recent published work in this direction includes [27, 28]. Local linear convergence of the iterates to a *solution* was established in [29] for linear and quadratic programs using spectral analysis. In Proposition 12.8, two conditions are given that guarantee linear convergence of the ADMM iterates to a solution. The first condition is classical and follows Lions and Mercier [26]. The second condition, based on [30], is much more prevalent in applications and generalizes the results of [29].

### Proposition 12.8

(Theorem 2.3 of [1]) Let \(f:{\mathbb {R}^n}\rightarrow \overline{\mathbb {R}}\) and \(g:{\mathbb {R}^m}\rightarrow \overline{\mathbb {R}}\) be proper, lsc and convex. Suppose there exists a solution to \(0\in (B+D)(x)\) for *B* and *D* defined by (12.22) where \({\mathcal A}:\,{\mathbb {R}^n}\rightarrow {\mathbb {R}^m}\,\) is an injective linear mapping. Let \(\widehat{\xi }\in \mathsf {Fix}\,T\) for *T* defined by (12.21). For fixed \(\eta >0\) and any given triplet of points \(\left( \xi ^0,y^0, z^0\right) \) satisfying \(\xi ^0\equiv z^0+\eta y^0\), with \(y^0\in Dz^0\), generate the sequence \((y^k, z^k)_{k\in \mathrm {\mathbb {N}}}\) by Steps 2–3 of Algorithm 1 and the sequence \((\xi ^k)_{k\in \mathrm {\mathbb {N}}}\) by (12.20).

- (i) Let \({\mathcal O}\subset {\mathbb {R}^n}\) be a neighborhood of \(\widehat{\xi }\) on which *g* is strongly convex with constant \(\mu \) and \(\partial g\) is \(\beta \)-inverse strongly monotone for some \(\beta >0\). Then, for any \(\left( \xi ^0, y^0, z^0\right) \in {\mathcal O}\) satisfying \(\xi ^0\equiv z^0+\eta y^0\in {\mathcal O}\), the sequences \((\xi ^k)_{k\in \mathrm {\mathbb {N}}}\) and \((y^k, z^k)_{k\in \mathrm {\mathbb {N}}}\) converge linearly to the respective points \({\overline{\xi }}\in \mathsf {Fix}\,T\) and \(\left( {\overline{y}},{\overline{z}}\right) \) with rate at least \(K=\left( 1-\frac{2\eta \beta \mu ^2}{(\mu +\eta )^2}\right) ^{\tfrac{1}{2}}<1\).
- (ii) Suppose that \(T:\,W\rightarrow W\,\) for some affine subspace \(W\subset {\mathbb {R}^n}\) with \({\widehat{\xi }}\in W\). On the neighborhood \({\mathcal O}\) of \({\widehat{\xi }}\) relative to *W*, that is \({\mathcal O}\cap W\), suppose there is a constant \(\kappa >0\) such that
$$\begin{aligned} \Vert \xi -\xi ^+\Vert \ge \sqrt{\kappa }\, \mathrm {dist}(\xi ,\mathsf {Fix}\,T)\quad \forall \xi \in {\mathcal O}\cap W,~\forall \xi ^+\in T\xi . \end{aligned}$$(12.23)
Then the sequences \((\xi ^k)_{k\in \mathrm {\mathbb {N}}}\) and \((y^k, z^k)_{k\in \mathrm {\mathbb {N}}}\) converge linearly to the respective points \({\overline{\xi }}\in \mathsf {Fix}\,T\cap W\) and \(\left( {\overline{y}},{\overline{z}}\right) \) with rate bounded above by \(\sqrt{1-\kappa }\).

In either case, the limit point \({\overline{z}}={\mathcal J}_{\eta D}{\overline{\xi }}\) is a solution to (\({{\mathcal D}}\)), \({\overline{y}}\in D{\overline{z}}\) and the sequence \(\left( x^k\right) _{k\in \mathrm {\mathbb {N}}}\) of Step 1 of Algorithm 1 converges to \({\overline{x}}\), a solution of (\({{\mathcal P}_0}\)).

The strong convexity assumption (i) of Proposition 12.8 fails in many applications of interest, and in particular for feasibility problems (minimizing the sum of indicator functions). By [31, Theorem 2.2], case (ii) of Proposition 12.8, in contrast, holds in general for mappings *T* for which \(T-\mathrm{Id}\) is *metrically subregular* and the fixed point sets are *isolated points* with respect to an affine subspace to which the iterates are confined. The restriction to the affine subspace *W* is a natural generalization for the Douglas–Rachford algorithm, where the iterates are known to stay confined to affine subspaces orthogonal to the fixed point set [32, 33]. We show that metric subregularity with respect to this affine subspace holds in many applications.

### Proposition 12.9

(polyhedrality of the Douglas–Rachford operator) Let \(f:{\mathbb {R}^n}\rightarrow \overline{\mathbb {R}}\) and \(g:{\mathbb {R}^m}\rightarrow \overline{\mathbb {R}}\) be proper, lsc and convex. Suppose, in addition, that *f* and *g* are piecewise linear-quadratic. The operator \(T:\,{\mathbb {R}^m}\rightarrow {\mathbb {R}^m}\,\) defined by (12.21) with \(\eta >0\) fixed, is polyhedral for *B* and *D* given by (12.22) where \({\mathcal A}:\,{\mathbb {R}^n}\rightarrow {\mathbb {R}^m}\,\) is a linear mapping.

### Proof

This is Proposition 2.6 of [1].

### Proposition 12.10

(local linear convergence, Theorem 2.7 of [1]) Let \(f:{\mathbb {R}^n}\rightarrow \overline{\mathbb {R}}\) and \(g:{\mathbb {R}^m}\rightarrow \overline{\mathbb {R}}\) be proper, lsc, convex, piecewise linear-quadratic functions (see Definition 12.1). Define the operator \(T:\,{\mathbb {R}^m}\rightarrow {\mathbb {R}^m}\,\) by (12.21) with \(\eta >0\) fixed and *B* and *D* given by (12.22) where \({\mathcal A}:\,{\mathbb {R}^n}\rightarrow {\mathbb {R}^m}\,\) is a linear mapping. Suppose that there exists a solution to \(0\in (B+D)(x)\), that \(T:\,W\rightarrow W\,\) for *W* some affine subspace of \({\mathbb {R}^m}\) and that \(\mathsf {Fix}\,T\cap W\) is an isolated point \(\{{\overline{\xi }}\}\). Then there is a neighborhood \({\mathcal O}\) of \({\overline{\xi }}\) such that, for all starting points \((\xi ^0,y^0, z^0)\) with \(\xi ^0\equiv z^0+\eta y^0\in {\mathcal O}\cap W\) for \(y^0\in D(z^0)\) so that \({\mathcal J}_{\eta D}\xi ^0=z^0\), the sequence \((\xi ^k)_{k\in \mathrm {\mathbb {N}}}\) generated by (12.20) converges Q-linearly to \({\overline{\xi }}\) where \({\overline{z}}\equiv {\mathcal J}_{\eta D}\overline{\xi }\) is a solution to (\({{\mathcal D}}\)). The rate of linear convergence is bounded above by \(\sqrt{1-\kappa }\), where \(\kappa =c^{-2}>0\), for *c* a constant of metric subregularity of \(T-\mathrm{Id}\) at \({\overline{\xi }}\) for the neighborhood \({\mathcal O}\). Moreover, the sequence \(\left( y^k, z^k\right) _{k\in \mathrm {\mathbb {N}}}\) generated by steps 2–3 of Algorithm 1 converges linearly to \(\left( {\overline{y}},{\overline{z}}\right) \) with \({\overline{y}}=\tfrac{1}{\eta }\left( {\overline{x}}-{\overline{z}}\right) \), and the sequence \(\left( x^k\right) _{k\in \mathrm {\mathbb {N}}}\) defined by Step 1 of Algorithm 1 converges to a solution to (\({{\mathcal P}_0}\)).

### 12.3.1 ADMM for Statistical Multi-resolution Estimation of STED Images

The measured image *b*, shown in Fig. 12.1, is *noisy* or otherwise inexact, and thus an exact solution is not desirable. Although the noise in such images is usually modeled by Poisson noise, a Gaussian noise model with constant variance suffices as the photon counts are of the order of 100 per pixel and do not vary significantly across the image. We calculate the numerically reconstructed tubulin density \({\overline{x}}\) shown in Fig. 12.2a via Algorithm 1 for the problem (\({{\mathcal P}}\)) with the qualitative regularization

For an image size of \(n=64\times 64\) with three resolution levels the resulting number of constraints is \(m=12161\) (that is, \(64^2\) constraints at the finest resolution, \(4\times 32^2\) constraints at the next resolution and \(9\times 21^2\) constraints at the lowest resolution). The constant \(\alpha =0.01\) in (12.24) is used to balance the contributions of the individual terms to make the most of limited numerical accuracy (double precision). The constant \(\gamma _i\) is chosen so that the model solution would be no more than 3 standard deviations from the noisy data on each interval of each scale.

Since this is experimental data, there is no “truth” for comparison; the constraints, together with the error bounds relating the numerical solution to the model solution, provide statistical guarantees on the numerical reconstruction [36]. In Fig. 12.2b the iteration is shown with the value of \(\rho =4096\) for which the constraints are exactly satisfied (to within machine precision), indicating the correspondence of the computed solution of problem (\({{\mathcal P}}\)) to a solution to the exact model problem (\({{\mathcal P}_0}\)).

The only assumption from Proposition 12.10 that cannot be verified for this implementation is the assumption that the algorithm fixed point is a singleton; all other assumptions are satisfied *automatically* by the problem structure. We *observe*, however, starting from around iteration 1500 in Fig. 12.2b, behavior that is *consistent* with (i.e. does not contradict) linear convergence. From this, the observed convergence rate is \(c=0.9997\), which yields an a posteriori upper estimate of the pixel-wise error of about \(8.9062\times 10^{-4}\), or 3 digits of accuracy at each pixel.^{2}
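The a posteriori estimate quoted above follows the standard geometric-series argument: if the step lengths contract with factor \(c<1\), then \(\Vert x^k-{\overline{x}}\Vert \le \frac{c}{1-c}\Vert x^k-x^{k-1}\Vert \). A minimal sketch (the step-residual value is illustrative, not the experiment's actual residual):

```python
# A posteriori error estimate for a linearly convergent fixed-point iteration:
# if ||x^{k+1} - x^k|| <= c * ||x^k - x^{k-1}|| with c < 1, summing the geometric tail
# of step lengths gives ||x^k - xbar|| <= c/(1-c) * ||x^k - x^{k-1}||.
def a_posteriori_bound(step_norm, c):
    assert 0.0 < c < 1.0
    return c / (1.0 - c) * step_norm

# e.g. with the observed rate c = 0.9997 and an illustrative step residual of 2.7e-7:
print(a_posteriori_bound(2.7e-7, 0.9997))
```

Note how unforgiving the bound is when *c* is close to 1: the factor \(c/(1-c)\) here is above 3000, so a small observed step residual is still compatible with a much larger distance to the fixed point.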

## 12.4 Primal-Dual Methods

The ADMM method presented above suffers from the extreme computational cost of computing the prox-operator in Step 1. The results of the previous section required several days of CPU time on a 2016-era laptop. In this section we present a method studied in [3] that can achieve results in about 30 s on the same computer architecture. We survey the main results (without proofs) from [3]. There is one subtle difference between the present survey and [3] that has major implications for the application and implementation of the main Algorithm 2.

The method handles the function *g* in problem (\({{\mathcal P}}\)) in the form (12.11). The algorithm we revisit is the proximal alternating predictor-corrector (PAPC) algorithm proposed in [37] for solving (\({{\mathcal S}}\)). It consists of a predictor-corrector gradient step for handling the smooth part of the Lagrangian *L* in (12.13) and a proximal step for handling the nonsmooth part.

At each iteration the algorithm computes one gradient and one prox-mapping corresponding to the nonsmooth function, both of which are assumed to be efficiently implementable. We suppose these can be evaluated exactly, though this does not take into account finite precision arithmetic. The dimension of the iterates of the EPAPC algorithm is of the same order of magnitude as for the ADMM/Douglas-Rachford method, but the individual steps can be run in parallel and, with the exception of the projection in Step 6, are much less computationally intensive to execute.
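A toy sketch of a PAPC-style predictor-corrector iteration follows (modeled on the scheme of [37] as we understand it; the step sizes, operator and data are illustrative choices, not the chapter's). The dual prox is a projection because \(g^*\) is taken to be the indicator of the \(\ell _\infty \)-ball of radius \(\lambda \), the conjugate of \(\lambda \Vert \cdot \Vert _1\):

```python
import numpy as np

# Predictor-corrector sketch for min_x 0.5*||x - b||^2 + lam*||A x||_1 via the saddle
# problem L(x, y) = f(x) + <A x, y> - g*(y), with g* the indicator of {||y||_inf <= lam}.
def papc(b, lam=0.5, tau=0.5, sigma=1.0, iters=2000):
    A = np.eye(b.size)                               # toy linear operator
    x = np.zeros_like(b); y = np.zeros_like(b)
    grad = lambda x: x - b                           # gradient of the smooth part f
    for _ in range(iters):
        p = x - tau * (grad(x) + A.T @ y)            # predictor (gradient) step
        y = np.clip(y + sigma * (A @ p), -lam, lam)  # dual prox: project onto the ball
        x = x - tau * (grad(x) + A.T @ y)            # corrector step with the new dual
    return x

print(papc(np.array([2.0, 0.2])))   # ≈ soft-thresholding of b: [1.5, 0.0]
```

The step sizes are chosen so that \(\tau <2/L_f\) and \(\sigma \tau \Vert {\mathcal A}\Vert ^2\le 1\), which is the kind of condition under which such primal-dual schemes are analyzed; with *f* strongly convex the toy iteration contracts linearly, consistent with Theorem 12.4.1.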

The next technical result, which is new, establishes a crucial upper bound on the growth of the Lagrangian with respect to the primal variables.

### Lemma 12.1

Note that for the choice of \(\tau \) given in the parameter initialization of Algorithm 2, \(G\succ 0\).

### Proof

The proof of (12.26) follows exactly as in the proof of [37, Lemma 3.1(i)]. The second inequality (12.27) follows exactly as the proof of [37, Lemma 3.1(ii)] in the case that \(\ker {{\mathcal A}^T}=\{0\}\). The case where \(\ker {{\mathcal A}^T}\) is nontrivial requires more care. The proof of this is suppressed in the interest of brevity.

The next intermediate result establishes pointwise quadratic supportability (Definition 12.2) on bounded sets at all saddle points under Assumptions 1 and 3.

### Proposition 12.11

The constant \(\mu \) in Proposition 12.11 depends on the choice of \((x^0, y^0)\) and so depends implicitly on the distance of the initial guess to the point in the set of saddle point solutions.

The weighting matrix *G* in (12.28), together with the primal step size, defines the norm *H* on the product space: for \(u=(x,y)\), \(\left\| u\right\| ^2_H := \frac{1}{\tau }\left\| x\right\| ^2 + \left\| y\right\| ^2_G\).

We are now ready to state the main result and corollaries, whose proofs can be found in [3].

### Theorem 12.4.1

Under Assumptions 1 and 3 the iterates of Algorithm 2 converge linearly, with a constant determined by the pointwise quadratic supportability of *f* at \(\hat{x}\) depending on the distance of the initial guess to the point \(({\hat{x}},{\hat{y}})\) in the solution set \({\mathcal S}^*\). In particular, \(\left( (x^{k}, y^{k})\right) _{k\in \mathbb {N}}\) is *Q*-linearly convergent with respect to the *H*-norm to a saddle-point solution.

An unexpected corollary of the result above is that the set of saddle points is a singleton.

### Corollary 12.12

(Unique saddle point) Under Assumptions 1 and 3 the solution set \({\mathcal S}^*\) is a singleton.

The above theorem yields the following estimate on the number of iterations required to achieve a specified distance to a saddle point.

### Corollary 12.13

*k* iterations, with

### 12.4.1 EPAPC for Statistical Multi-resolution Estimation of STED Images

An efficient computational strategy for evaluating or at least approximating the projection \(P_{(\ker {{\mathcal A}^T})^\perp }\) in Step 6 of Algorithm 2 has not yet been established. We report here preliminary computational results of Algorithm 2 *without* computing Step 6. Our results show that the method is promising, though error bounds to the solution to (\({{\mathcal S}}\)) are not justified without computation of \(P_{(\ker {{\mathcal A}^T})^\perp }\).

Here, we also consider a smooth approximation of the \(L^1\)-norm as the qualitative objective. The \(L^1\)-norm is non-smooth at the origin, so in order to make derivative-based methods applicable we use a smoothed approximation, known as the Huber approximation.
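The sketch below shows one standard Huber smoothing of \(|x|\) (the chapter's exact form (12.35) may be scaled differently): it is quadratic on \([-\delta ,\delta ]\), differs from \(|x|\) by at most \(\delta /2\), and has the 1-Lipschitz gradient \(\mathrm{clip}(x/\delta ,-1,1)\):

```python
import numpy as np

# One standard Huber smoothing of the absolute value with assumed parameter delta:
# quadratic on [-delta, delta], equal to |x| - delta/2 outside that interval.
def huber(x, delta=0.1):
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) <= delta, x**2 / (2 * delta), np.abs(x) - delta / 2)

def huber_grad(x, delta=0.1):
    # The derivative of the two pieces glued at |x| = delta: x/delta, then sign(x).
    return np.clip(np.asarray(x, dtype=float) / delta, -1.0, 1.0)

xs = np.linspace(-2, 2, 1001)
print(np.max(np.abs(huber(xs) - np.abs(xs))))   # approximation error, at most delta/2
```

Smaller \(\delta \) gives a better approximation of the \(L^1\)-norm but a larger gradient Lipschitz constant \(1/\delta \), which in turn forces smaller step sizes in the gradient-based methods above; the choice of \(\delta \) trades these off.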

## 12.5 Randomized Block-Coordinate Primal-Dual Methods

The previous sections reviewed numerical strategies and structures that yield quantitative estimates of the distance of an iterate to the solution of the underlying variational problem. In this section we examine *implementation* strategies for dealing with high dimensional problems. These are implementation strategies because they do not involve changing the optimization model. Instead, we select at random a smaller subset of variables or constraints in the computation of an update in the full-dimensional iterative procedure. This is the principal strategy for handling problems that, due to their size, must be distributed across many processing and storage units (see for instance [26, 38, 39, 40] and references therein). We survey here a randomized primal-dual technique proposed and analyzed in [2]. The main theoretical question to resolve with such approaches is whether, and in what sense, iterates converge to a solution to the original problem. We can determine whether the iterates converge, but obtaining an estimate of the distance to the solution remains an open problem.

*f* is prox-friendly. In the algorithm description below it is convenient to use the convention \(f\equiv g_{0}\), \(A_{0}\equiv \mathrm{Id}\). The algorithm is based in part on [39].

### Proposition 12.14

(Theorem 1 of [2]) Suppose Assumption 1 holds and let \(\tau \sigma _i \Vert A_i \Vert ^2 <1\). Then \((x^k,y^k)\) generated by Algorithm 3 converges to a solution to \((\widetilde{{\mathcal D}})\). In particular, the sequence \((x^k)\) converges almost surely to a solution to (\({{\mathcal P}}\)).
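Proposition 12.14 requires the step sizes to satisfy \(\tau \sigma _i \Vert A_i \Vert ^2 <1\). When \(A_i\) is available only through matrix-vector products (here it embeds a convolution), \(\Vert A_i\Vert\) can be estimated by power iteration on \(A_i^T A_i\); the helper below is our own illustrative sketch, not code from [2]:

```python
import numpy as np

def spectral_norm(A, iters=50):
    """Estimate ||A|| (largest singular value) by power iteration
    on A^T A, using only matrix-vector products with A and A^T."""
    rng = np.random.default_rng(1)
    v = rng.standard_normal(A.shape[1])
    for _ in range(iters):
        v = A.T @ (A @ v)
        v /= np.linalg.norm(v)
    # v now approximates the top right singular vector; ||A^T A v|| ~ ||A||^2
    return float(np.sqrt(np.linalg.norm(A.T @ (A @ v))))
```

Given \(\tau\), one then chooses \(\sigma _i < 1/(\tau \Vert A_i\Vert ^2)\) to satisfy the hypothesis of the proposition.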

The statement above corresponds to part (i) of Theorem 1 of [2]. No smoothness is required of the qualitative regularization *f*; instead, this function is assumed to be prox-friendly. This opens up the possibility of using the 1-norm as a regularizer, promoting, in some sense, sparsity in the image. No claim is made about the *rate* of convergence, though the numerical experiments below indicate that, for regular enough functions *f*, convergence might be locally linear. This remains to be proved.
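Prox-friendliness of the 1-norm is concrete: its proximal map has the well-known closed form of componentwise soft-thresholding. A minimal sketch (function name ours):

```python
import numpy as np

def prox_l1(x, t):
    """Proximal map of t*||.||_1, i.e. argmin_z t*||z||_1 + 0.5*||z - x||^2:
    componentwise soft-thresholding, shrinking each entry toward zero by t."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)
```

Entries with magnitude below the threshold `t` are set exactly to zero, which is the mechanism by which the 1-norm promotes sparsity.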

### 12.5.1 RBPD for Statistical Multi-resolution Estimation of STED Images

Despite many open questions regarding convergence, randomized methods offer a way to handle extremely large problems. To make a comparison with the deterministic approaches above, cycles of the RBPD Algorithm 3 are counted in terms of *epochs*. An epoch comprises the number of passes through steps 1–6 of Algorithm 3 required before each block is chosen at least once. After *k* epochs, therefore, the *i*-th coordinate of *x* will be updated, on average, the same number of times for the randomized algorithm as for the deterministic methods. In other words, an epoch for a randomized block-wise method is comparable to an iteration of a deterministic method. As RBPD updates only one block per *iteration*, each iteration is less computationally intensive than its deterministic counterparts. However, in our case this efficient iteration still requires one to evaluate two (possibly) expensive convolution products (embedded in \(A_i x\) and \(A_i^T y\)). Thus, if these operations are relatively expensive, the efficiency gain will be marginal. Nevertheless, because of the ability to operate on smaller blocks, the randomized method requires, per epoch, approximately half the time required per iteration of the deterministic methods. Although the quantitative convergence analysis remains open, our numerical experiments indicate that the method achieves a comparable step-residual to the EPAPC Algorithm 2 after the same number of epochs/iterations.
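The epoch bookkeeping above can be made explicit with a small simulation (the function is ours, purely illustrative): draw blocks uniformly at random and count how many raw iterations pass before every block has been selected at least once.

```python
import numpy as np

rng = np.random.default_rng(0)

def iterations_per_epoch(n_blocks, n_epochs):
    """For each epoch, count the uniform random block draws needed
    until every one of the n_blocks blocks has been chosen at least once."""
    counts = []
    for _ in range(n_epochs):
        seen, iters = set(), 0
        while len(seen) < n_blocks:
            seen.add(int(rng.integers(n_blocks)))
            iters += 1
        counts.append(iters)
    return counts
```

By the coupon-collector argument, an epoch over \(n\) blocks takes roughly \(n\ln n\) draws on average, so the per-epoch cost comparison with a deterministic sweep is approximate rather than exact.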

As with the experiments in the previous sections, we use three resolutions, which results in one block at the highest resolution, four blocks at the next resolution (four possible shifts of \(2\times 2\) pixels), and nine blocks at the third resolution (nine different shifts of \(3\times 3\) pixels). We applied Algorithm 3 with different regularizers *f* in (\({{\mathcal P}_0}\)): the 1-norm \(f(x) = \Vert x \Vert _{1}\), the Huber loss function \(f(x) = \Vert x\Vert _{1, \alpha }\) given by (12.35) (\(\alpha =0.25\)), and the squared Euclidean norm. As with the EPAPC experiments, the function *g* is given by (12.11) with \(g_i\) given by (12.17) for the parameter \(\gamma _i=0.25\,i\), \(i=1,2,3\). All of these functions are prox-friendly and have closed-form Fenchel conjugates. The gain in efficiency over the deterministic EPAPC method proposed above (without computation of the pseudo-inverse) is a factor of 2.
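The block structure of the three-resolution setup can be sketched as follows. This is our own illustrative reading of the construction: at resolution \(r\) there are \(r^2\) shifted \(r\times r\) local-mean operators, giving \(1+4+9=14\) blocks in total; the exact operators used in the experiments may differ.

```python
import numpy as np

def multires_blocks(img, max_res=3):
    """Enumerate, for r = 1..max_res, the r*r shifted r-x-r local-mean
    operators applied to img: 1 block at r=1, 4 at r=2, 9 at r=3."""
    blocks = []
    for r in range(1, max_res + 1):
        for dx in range(r):
            for dy in range(r):
                sub = img[dx:, dy:]                      # shifted grid origin
                h, w = sub.shape[0] // r, sub.shape[1] // r
                patches = sub[:h * r, :w * r].reshape(h, r, w, r)
                blocks.append(patches.mean(axis=(1, 3)))  # r x r local means
    return blocks
```

At the finest resolution (\(r=1\)) the single block reproduces the image itself, so the multiscale constraints range from pixelwise fidelity to coarse local averages.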


## References

- 1. Aspelmeier, T., Charitha, C., Luke, D.R.: Local linear convergence of the ADMM/Douglas-Rachford algorithms without strong convexity and application to statistical imaging. SIAM J. Imaging Sci. **9**(2), 842–868 (2016)
- 2. Luke, D.R., Malitsky, Y.: Block-coordinate primal-dual method for the nonsmooth minimization over linear constraints. In: Giselsson, P., Rantzer, A. (eds.) Distributed and Large-Scale Optimization. Springer (2018)
- 3. Luke, D.R., Shefi, R.: A globally linearly convergent method for pointwise quadratically supportable convex-concave saddle point problems. J. Math. Anal. Appl. **457**(2), 1568–1590 (2018). https://doi.org/10.1016/j.jmaa.2017.02.068
- 4. Burger, M., Sawatzky, A., Steidl, G.: First order algorithms in variational image processing. In: Splitting Methods in Communication, Imaging, Science, and Engineering, pp. 345–407. Springer (2016)
- 5. Chambolle, A., Pock, T.: An introduction to continuous optimization for imaging. Acta Numerica **25**, 161–319 (2016)
- 6. Combettes, P.L., Pesquet, J.C.: Proximal splitting methods in signal processing. In: Fixed Point Algorithms for Inverse Problems in Science and Engineering, pp. 185–212 (2011)
- 7. Komodakis, N., Pesquet, J.C.: Playing with duality: an overview of recent primal-dual approaches for solving large-scale optimization problems. IEEE Signal Proc. Mag. **32**(6), 31–54 (2015)
- 8. Luke, D.R., Teboulle, M., Thao, N.H.: Necessary conditions for linear convergence of iterated expansive, set-valued mappings. Math. Program. A **180**, 1–31 (2020). https://doi.org/10.1007/s10107-018-1343-8
- 9. Ortega, J.M., Rheinboldt, W.C.: Iterative Solution of Nonlinear Equations in Several Variables. Academic Press, New York (1970)
- 10. Rockafellar, R.T., Wets, R.J.: Variational Analysis. Grundlehren Math. Wiss. Springer, Berlin (1998)
- 11. Dontchev, A.L., Rockafellar, R.T.: Implicit Functions and Solution Mappings, 2nd edn. Springer, New York (2014)
- 12. Moreau, J.J.: Proximité et dualité dans un espace hilbertien. Bull. de la Soc. Math. de France **93**(3), 273–299 (1965)
- 13. Aubin, J.P., Frankowska, H.: Set-valued Analysis. Birkhäuser, Boston (1990)
- 14. Burke, J.V., Ferris, M.C.: Weak sharp minima in mathematical programming. SIAM J. Control. Optim. **31**, 1340–1359 (1993)
- 15. Friedlander, M.P., Tseng, P.: Exact regularization of convex programs. SIAM J. Optim. **18**(4), 1326–1350 (2007). https://doi.org/10.1137/060675320
- 16. Borwein, J., Vanderwerff, J.: Convex Functions: Constructions, Characterizations and Counterexamples. Encyclopedias in Mathematics, vol. 109. Cambridge University Press, New York (2010)
- 17. Eckstein, J.: Splitting methods for monotone operators with applications to parallel optimization. Ph.D. thesis, MIT, Cambridge, MA (1989)
- 18. Eckstein, J., Bertsekas, D.P.: On the Douglas-Rachford splitting method and the proximal point algorithm for maximal monotone operators. Math. Program. **55**, 293–318 (1992)
- 19. Gabay, D.: Applications of the method of multipliers to variational inequalities. In: Augmented Lagrangian Methods: Applications to the Solution of Boundary-Value Problems, pp. 299–331. North-Holland, Amsterdam (1983)
- 20. Glowinski, R., Marroco, A.: Sur l'approximation, par éléments finis d'ordre un, et la résolution, par pénalisation-dualité, d'une classe de problèmes de Dirichlet non linéaires. Revue Française d'Automatique, Informatique et Recherche Opérationnelle **9**(R-2), 41–76 (1975)
- 21. Rockafellar, R.T.: Monotone operators and the proximal point algorithm. SIAM J. Control. Optim. **14**, 877–898 (1976)
- 22. Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J.: Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn. **3**(1), 1–122 (2011)
- 23. Douglas Jr., J., Rachford Jr., H.H.: On the numerical solution of heat conduction problems in two or three space variables. Trans. Am. Math. Soc. **82**, 421–439 (1956)
- 24. Lions, P.L., Mercier, B.: Splitting algorithms for the sum of two nonlinear operators. SIAM J. Numer. Anal. **16**, 964–979 (1979)
- 25. Svaiter, B.F.: On weak convergence of the Douglas-Rachford method. SIAM J. Control. Optim. **49**(1), 280–287 (2011)
- 26. Strohmer, T., Vershynin, R.: A randomized Kaczmarz algorithm with exponential convergence. J. Fourier Anal. Appl. **15**(2), 262–278 (2009)
- 27. He, B., Yuan, X.: On non-ergodic convergence rate of Douglas-Rachford alternating direction method of multipliers. Numerische Mathematik **130**(3), 567–577 (2015)
- 28. He, B., Yuan, X.: On the O(1/n) convergence rate of the Douglas-Rachford alternating direction method. SIAM J. Numer. Anal. **50**(2), 700–709 (2012)
- 29. Boley, D.: Local linear convergence of the alternating direction method of multipliers on quadratic or linear programs. SIAM J. Optim. **23**(4), 2183–2207 (2013). https://doi.org/10.1137/120878951
- 30. Hesse, R., Luke, D.R.: Nonconvex notions of regularity and convergence of fundamental algorithms for feasibility problems. SIAM J. Optim. **23**(4), 2397–2419 (2013)
- 31. Luke, D.R., Thao, N.H., Tam, M.K.: Quantitative convergence analysis of iterated expansive, set-valued mappings. Math. Oper. Res. **43**(4), 1143–1176 (2018). https://doi.org/10.1287/moor.2017.0898
- 32. Hesse, R., Luke, D.R., Neumann, P.: Alternating projections and Douglas-Rachford for sparse affine feasibility. IEEE Trans. Signal Process. **62**(18), 4868–4881 (2014). https://doi.org/10.1109/TSP.2014.2339801
- 33. Phan, H.: Linear convergence of the Douglas-Rachford method for two closed sets. Optimization **65**, 369–385 (2016)
- 34. Hell, S.W., Wichmann, J.: Breaking the diffraction resolution limit by stimulated emission: stimulated-emission-depletion fluorescence microscopy. Opt. Lett. **19**(11), 780–782 (1994). https://doi.org/10.1364/OL.19.000780
- 35. Klar, T.A., Jakobs, S., Dyba, M., Egner, A., Hell, S.W.: Fluorescence microscopy with diffraction resolution barrier broken by stimulated emission. Proc. Natl. Acad. Sci. **97**(15), 8206–8210 (2000). https://doi.org/10.1073/pnas.97.15.8206
- 36. Frick, K., Marnitz, P., Munk, A.: Statistical multiresolution Dantzig estimation in imaging: fundamental concepts and algorithmic framework. Electron. J. Stat. **6**, 231–268 (2012)
- 37. Drori, Y., Sabach, S., Teboulle, M.: A simple algorithm for a class of nonsmooth convex-concave saddle-point problems. Oper. Res. Lett. **43**(2), 209–214 (2015)
- 38. Defazio, A., Bach, F., Lacoste-Julien, S.: SAGA: a fast incremental gradient method with support for non-strongly convex composite objectives. In: Advances in Neural Information Processing Systems, pp. 1646–1654 (2014)
- 39. Fercoq, O., Richtárik, P.: Accelerated, parallel, and proximal coordinate descent. SIAM J. Optim. **25**(4), 1997–2023 (2015)
- 40. Johnson, R., Zhang, T.: Accelerating stochastic gradient descent using predictive variance reduction. In: Advances in Neural Information Processing Systems, pp. 315–323 (2013)
- 41. Hesse, R., Luke, D.R., Sabach, S., Tam, M.: The proximal heterogeneous block implicit-explicit method and application to blind ptychographic imaging. SIAM J. Imaging Sci. **8**(1), 426–457 (2015)

## Copyright information

**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.