1 Introduction

In this chapter we review progress towards addressing two main challenges in scientific image processing. The first is to quantify the convergence of iterative image-processing algorithms to solutions (as opposed to optimal values) of the underlying variational problem. The second is to develop efficient methods for these large-scale problems without sacrificing the quantification of convergence. The techniques surveyed here were studied in [1,2,3]. We present only the main results from these studies, in the context that hindsight provides.

Scientific images are often processed with software that accomplishes a number of tasks like registration, denoising and deblurring. Implicit in the processing is that some systematic error is being corrected to bring the image closer to the truth. This presumption is more complicated for denoising and deblurring. These are often accomplished by filtering or by solving some variational problem such as minimizing the variance of an image. For applications requiring speedy processing, such as audio and video communication, this is sufficient. But the recent development of nanoscale photonic imaging modalities such as STED and RESOLFT featured in Chaps. 1, 7 and 9 has shifted the focus of image denoising and deconvolution from qualitative to quantitative models.

Quantitative approaches to image processing are the subject of Chap. 11 where statistical multiscale estimation is discussed (see Sect. 11.3). Here, the recovered image comes with statistical statements about how far the processed image is, in some statistical sense, from the truth. The estimators are almost exclusively variational, that is, they can be characterized as the solution to an optimization problem. It is important to emphasize that the value of the optimization problem is meaningless. This stands in stark contrast to many conventional applications in economics and operations research, where the value of the optimal solution is related to profit or cost, and so is of principal interest.

The focus on optimal solutions rather than optimal values places heavy demands on the structure of model formulations and the algorithms for solving them. Unless the numerical method allows one to state how far a computed iterate is from the solution of the underlying variational problem, the scientific significance of the iterate is lost.

The leading computational approaches for solving imaging problems with multi-resolution statistical estimation criteria are based on iterated proximal operators. Most of the analysis for first-order iterative proximal methods is limited to statements about rates of convergence of function values, if rates are discussed at all (see for instance [4,5,6,7]). First-order methods have slow convergence in the worst-case scenario. A common assumption to guarantee linear convergence of the iterates is strong convexity, but this is far more than is necessary, and in particular it is not satisfied by the Huber function (12.35). It was shown in [8] that metric subregularity is necessary for local linear convergence. Aspelmeier, Charitha and Luke [1] showed that the popular alternating directions method of multipliers (ADMM), applied to optimization problems with piecewise linear-quadratic objective functions (e.g. the Huber function) together with linear inequality constraints, generically satisfies metric subregularity at isolated critical points; hence linear convergence of the iterates of this algorithm can be expected without further ado. More recently, in [3] it was shown that the primal iterates of a modification of the PAPC algorithm (Algorithm 2) converge R-linearly for any quadratically supportable objective function (for instance, the Huber function). Conventional results without metric subregularity obtain a convergence rate of O(1/k) with respect to the function values. In settings like qualitative image processing or machine learning such results are acceptable, but in the setting of statistical image processing these statements do not carry any scientific content. We present in this chapter efficient iterative first-order methods that offer some hope of quantitative guarantees about the distance of the iterates to optimal solutions.

2 Problem Formulation

We limit our scope to the real vector space \({\mathbb {R}^n}\) with the norm generated from the inner product. The closed unit ball centered on the point \(y\in {\mathbb {R}^n}\) is denoted by \(\mathbb {B}(y)\). The positive orthant (resp. negative orthant) in \({\mathbb {R}^n}\) is denoted by \({\mathbb {R}^n}_{\!+}\) (resp. \({\mathbb {R}^n}_{\!-}\)). The domain of an extended real-valued function \(\varphi :{\mathbb {R}^n}\rightarrow \overline{\mathbb {R}}\equiv (-\infty , +\infty ]\) is \({\mathrm{dom}\,}(\varphi )\equiv \{z \in {\mathbb {R}^n}:\varphi (z)<+\infty \}\). The Fenchel conjugate of \(\varphi \) is denoted by \(\varphi ^*\) and is defined by \(\varphi ^*(u)\equiv \sup _{z\in {\mathbb {R}^n}}\{\langle {z,u}\rangle -\varphi (z)\}\). The set of symmetric \(n\times n\) positive (semi)-definite matrices is denoted by \(\mathbb {S}^n_{++}\) (\(\mathbb {S}^n_+\)). The notation \(A \succ 0\) \((A \succeq 0)\) denotes a positive (semi)definite matrix A. For any \(z \in {\mathbb {R}^n}\) and any \(A \in \mathbb {S}^n_+\), we denote the semi-norm \(\Vert z\Vert ^2_A :=\langle {z,Az}\rangle \). The operator norm is defined by \(\Vert A\Vert = \max _{u\in {\mathbb {R}^n}} \{ \Vert Au\Vert : \Vert u\Vert =1\}\) and coincides with the spectral radius of A whenever A is symmetric. If \(A \ne 0\), \(\sigma _{min} (A)\) denotes its smallest nonzero singular value. For a sequence \(\{z^{k}\}_{k\in \mathrm {\mathbb {N}}}\) converging to \(z^*\), we say the convergence is Q-linear if there exists \(c \in (0, 1)\) such that \(\frac{\Vert z^{k+1}-z^*\Vert }{\Vert z^{k}-z^*\Vert } \le c\;\) for all k; convergence is R-linear if there exists a sequence \(\eta _{k}\) such that \(\Vert z^{k}-z^*\Vert \le \eta _{k}\) and \(\eta _{k}\rightarrow 0\)  Q-linearly [9, Chap. 9].

We limit our discussion to proper (nowhere equal to \(-\infty \) and finite at some point), lower semi-continuous (lsc), extended-valued (can take the value \(+\infty \)) functions. We will, in fact, limit our discussion to convex functions, but convexity is not the central property governing quantitative convergence estimates. By the subdifferential of a function \(\varphi \), denoted \(\partial \varphi \), we mean the collection of all subgradients that can be written as limits of sequences of Fréchet subgradients at nearby points; a vector v is a (Fréchet) subgradient of \(\varphi \) at y, written \(v\in \widehat{\partial } \varphi (y)\), if

$$\begin{aligned} \liminf _{x\rightarrow y,~x\ne y}\frac{\varphi (x)- \varphi (y) - \langle {v}, {x-y} \rangle }{\Vert x-y\Vert }\ge 0. \end{aligned}$$
(12.1)

The functions of interest for us are subdifferentially regular on their domains, that is, the epigraphs of the functions are Clarke regular at points where they are finite [10, Definition 7.25]. For our purposes it suffices to note that, for a function \(\varphi \) that is subdifferentially regular at a point y, the subdifferential is nonempty and all subgradients are Fréchet subgradients, that is, \(\partial \varphi (y)=\widehat{\partial } \varphi (y)\ne \emptyset \). Convex functions, in particular, are subdifferentially regular on their domains and the subdifferential has the particularly simple representation as the set of all vectors v where

$$\begin{aligned} \varphi (x)- \varphi (y) - \langle {v}, {x-y}\rangle \ge 0\quad \forall x. \end{aligned}$$
(12.2)

A mapping \(\varPhi :\,{\mathbb {R}^n}\rightrightarrows {\mathbb {R}^n}\,\) is said to be \(\beta \)-inverse strongly monotone [10, Corollary 12.55] if for all \(x, x' \in {\mathbb {R}^n}\)

$$\begin{aligned} \langle v-v', x-x'\rangle \ge \beta \Vert v-v'\Vert ^2, \quad \text{ whenever }\quad v\in \varPhi (x), v'\in \varPhi (x'). \end{aligned}$$
(12.3)

The mapping \(\varPhi \) is said to be polyhedral (or piecewise polyhedral [10]) if its graph is the union of finitely many sets that are polyhedral convex in \({\mathbb {R}^n}\times {\mathbb {R}^n}\) [11]. Polyhedral mappings are generated by the subdifferential of piecewise linear-quadratic functions (see Proposition 12.9).

Definition 12.1

(piecewise linear-quadratic (plq) functions)  A function \(f:{\mathbb {R}^n}\rightarrow [-\infty , +\infty ]\) is called piecewise linear-quadratic if \({\mathrm{dom}\,}f\) can be represented as the union of finitely many polyhedral sets, relative to each of which f(x) is given by an expression of the form \(\frac{1}{2}\langle x, Ax\rangle +\langle a, x \rangle +\alpha \) for some scalar \(\alpha \in \mathbb {R}\), vector \(a\in {\mathbb {R}^n}\), and symmetric matrix \(A\in \mathbb {R}^{n\times n} \).

Closely related to plq functions are pointwise quadratically supportable functions.

Definition 12.2

(pointwise quadratically supportable (pqs))   A proper, extended-valued function \(\varphi : {\mathbb {R}^n}\rightarrow \mathbb R\cup \{+\infty \}\) is said to be pointwise quadratically supportable at y if it is subdifferentially regular there and there exists a neighborhood V of y and a constant \(\mu >0\) such that

$$\begin{aligned} (\forall v\in \partial \varphi (y))\quad \varphi (x)\ge \varphi (y) + \langle {v, x-y}\rangle + \frac{\mu }{2} \left\| x-y\right\| ^2, \quad \forall x\in V. \end{aligned}$$
(12.4)

If for each bounded neighborhood V of y there exists a constant \(\mu >0\) such that (12.4) holds, then the function \(\varphi \) is said to be pointwise quadratically supportable at y on bounded sets. If (12.4) holds with one and the same constant \(\mu >0\) on all neighborhoods V, then \(\varphi \) is said to be uniformly pointwise quadratically supportable at y.

For more on the relationship between pointwise quadratic supportability, coercivity, strong monotonicity and strong convexity see [3].

We denote the resolvent of \(\varPhi \) by \({\mathcal J}_{\varPhi }\equiv \left( {\mathrm{Id}}+{\varPhi }\right) ^{-1}\) where \(\mathrm{Id}\) denotes the identity mapping and the inverse is defined as

$$\begin{aligned} \varPhi ^{-1}(y)\equiv \left\{ x\in {\mathbb {R}^n}\,\left| \,y\in \varPhi (x)\right. \right\} . \end{aligned}$$
(12.5)

The corresponding reflector is defined by \(R_{\eta \varPhi } \equiv 2{\mathcal J}_{\eta \varPhi }-\mathrm{Id}\). One of the more prevalent examples of resolvents is the proximal map. For \(\varphi :{\mathbb {R}^n}\rightarrow (-\infty ,\infty ]\) a proper, lsc and convex function and for any \(u \in {\mathbb {R}^n}\) and \(Q \in \mathbb {S}^n_{++}\), the proximal map associated with \(\varphi \) with respect to the weighted Euclidean norm is uniquely defined by:

$$ {\mathrm{prox}}_{Q,\varphi }(u) = \mathrm {argmin}\,_z \{ \varphi (z) + \frac{1}{2}\left\| z-u\right\| ^2_Q : z\in {\mathbb {R}^n}\}. $$

When \(Q = {c}^{-1}\mathrm{Id}, c > 0\), we simply use the notation \({\mathrm{prox}}_{c,\varphi }(u)\). We also recall the fundamental Moreau proximal identity [12], that is, for any \(z \in {\mathbb {R}^n}\)

$$\begin{aligned} z = {\mathrm{prox}}_{{Q},{\varphi }} (z) + Q^{-1} {\mathrm{prox}}_{{{Q}^{-1}},{\varphi ^*}}({Q}(z)), \end{aligned}$$
(12.6)

where \({Q}^{-1}\) is the inverse of \(Q\in \mathbb {S}^n_{++}\).
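
As a sanity check, the identity can be verified numerically. The sketch below (NumPy assumed; the helper names are ours) takes \(\varphi =\Vert \cdot \Vert _1\) and \(Q=c^{-1}\mathrm{Id}\), for which \({\mathrm{prox}}_{c,\varphi }\) is soft-thresholding and the prox of \(\varphi ^*\), the indicator of the unit sup-norm ball, is a componentwise clip independent of the weight.

```python
import numpy as np

def prox_l1(u, c):
    """prox_{c,||.||_1}(u) = argmin_z ||z||_1 + (1/(2c))||z - u||^2  (soft-thresholding)."""
    return np.sign(u) * np.maximum(np.abs(u) - c, 0.0)

def prox_l1_conj(u):
    """prox of (||.||_1)^* = indicator of the unit sup-norm ball: a componentwise clip."""
    return np.clip(u, -1.0, 1.0)

rng = np.random.default_rng(0)
z, c = rng.standard_normal(8), 0.7              # Q = c^{-1} Id in (12.6)
rhs = prox_l1(z, c) + c * prox_l1_conj(z / c)   # right-hand side of (12.6)
print(np.allclose(z, rhs))                      # True
```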

Notions of continuity of set-valued mappings have been thoroughly developed over the last 40 years. Readers are referred to the monographs [10, 11, 13] for basic results. A key property of set-valued mappings that we will rely on is metric subregularity, which can be understood as the property corresponding to a Lipschitz-like continuity of the inverse mapping relative to a specific point. It is a weaker property than metric regularity which, in the case of an \(n\times m\) matrix for instance, is equivalent to surjectivity. Our definition follows the characterization of this property given in [11, Exercise 3H.4].

Definition 12.3

(metric subregularity)   The mapping \(\varPhi :{\mathbb {R}^n}\rightrightarrows {\mathbb {R}^m}\) is called metrically subregular at \(\overline{x}\) for \(\overline{y}\) relative to \(W\subset {\mathbb {R}^n}\) if \((\overline{x},\overline{y})\in {\mathrm {gph}}\varPhi \) and there are a constant \(c>0\) and a neighborhood \({\mathcal O}\) of \(\overline{x}\) such that

$$\begin{aligned} \mathrm {dist}(x,\varPhi ^{-1}(\overline{y})\cap W)\le c\mathrm {dist}(\overline{y}, \varPhi (x))~ \forall ~ x\in {\mathcal O}\cap W. \end{aligned}$$
(12.7)

The constant c measures the stability under perturbations of the inclusion \({\overline{y}}\in \varPhi ({\overline{x}})\). An important instance where metric subregularity comes for free is that of polyhedral mappings.

Proposition 12.4

(polyhedrality implies metric subregularity) Let \(W\subset {\mathbb {R}^n}\) be an affine subspace and \(T:\,W\rightrightarrows W\,\). If T is polyhedral and \(\mathsf {Fix}\,T\cap W\) is an isolated point, \(\{{\overline{x}}\}\), then \(T-\mathrm{Id}:\,W\rightrightarrows (W-{\overline{x}})\,\) is metrically subregular at \({\overline{x}}\) for 0 relative to W.

Proof

Polyhedrality and isolated fixed points in fact imply strong metric subregularity. See [11, Propositions 3I.1 and 3I.2].

A notion related to metric regularity is that of weak-sharp solutions. This will be used in the development of error bounds (Theorem 12.6).

Definition 12.5

(weak sharp minimum [14]) The solution set \(S_f\equiv \mathrm {argmin}\,\left\{ f(x)\,\left| \,x\in \varOmega \right. \right\} \), for a nonempty closed convex set \(\varOmega \), is weakly sharp if, for \({\overline{p}}=\inf _\varOmega f\), there exists a positive number \(\alpha \) (sharpness constant) such that

$$\begin{aligned} f(x)\ge {\overline{p}}+\alpha ~\mathrm {dist}(x, S_f)~~~\forall x\in \varOmega . \end{aligned}$$

Similarly, the solution set \(S_f\) is weakly sharp of order \(\nu >0\) if there exists a positive number \(\alpha \) (sharpness constant) such that

$$\begin{aligned} f(x)\ge {\overline{p}}+\alpha ~\mathrm {dist}(x, S_f)^\nu ~~~\forall x\in \varOmega . \end{aligned}$$

2.1 Abstract Problem

The generic problem in which we are interested is

$$\begin{aligned} \begin{array}{ll}\underset{x\in {\mathbb {R}^n}}{\text{ minimize } }\qquad f(x)&{}\\ \text{ subject } \text{ to } \qquad g_i(A_ix)\le 0\quad (i=1,\dots ,M)&{}\end{array} {({\mathcal P}_0)} \end{aligned}$$

The following blanket assumptions on the problem’s data hold throughout:

(Assumption 1 is rendered as an image in the source and is not reproduced here.)

Assumption (i) implies that the optimal value of (\({{\mathcal P}_0}\)) is finite. Assumption (ii) implies that the constraint structure is convex. Assumption (iii) implies that the mapping \({\mathcal A}:{\mathbb {R}^n}\rightarrow {\mathbb {R}^m}\) is linear and full rank, where

$$ {\mathcal A}\equiv [ A_0^T,A_1^T, \dots , A_M^T]^T\in \mathbb {R}^{m\times n}$$

so that \({\mathcal A}x = y\) where \(y = (y_1, \ldots , y_M) \in {\mathbb {R}^m}\) for \(m =\sum _{i=1}^{M}m_i\). The challenge of statistical multi-resolution estimation lies in the feature that the dimension of the constraint structure, m, is much greater than the dimension of the unknowns, n, and grows superlinearly with respect to the number of unknowns.

The above constrained optimization problem is often formulated as an unconstrained-looking problem via the introduction of a (nonsmooth) penalty term enforcing the constraints:

$$\begin{aligned} \min _{x\in {\mathbb {R}^n}}\quad f(x) + g({\mathcal A}x) {({\mathcal P})} \end{aligned}$$

where

$$\begin{aligned} g:\,{\mathbb {R}^m}\rightarrow \overline{\mathbb {R}}\,,\qquad g\equiv \rho \theta \end{aligned}$$
(12.8)

for \(\rho \) a positive scalar and

$$\begin{aligned} \theta :\,{\mathbb {R}^m}\rightarrow [0,+\infty ]\, \text{ proper, } \text{ lsc } \text{ and } \text{ convex } \text{ with } \theta (y)=0 ~ \text{ if } \text{ and } \text{ only } \text{ if } y\in {\mathcal C}\end{aligned}$$
(12.9)

where

$$\begin{aligned} {\mathcal C}\equiv \left\{ y=\left( y_1,y_2, \dots ,y_M\right) \,\left| \,y\in \mathrm{range}\,{\mathcal A} \text{ and } g_i(y_i)\le 0, ~i=1,2,\dots ,M\right. \right\} . \end{aligned}$$
(12.10)

The requirements on the function \(\theta \) align this penalty term with exact penalization [15], that is, a relaxation of the constraints where, for all parameters \(\rho \) large enough, the constraints are exactly satisfied.

The following assumptions are used to guarantee the exact correspondence between solutions to (\({{\mathcal P}_0}\)) and (\({{\mathcal P}}\)).

(Assumption 2 is rendered as an image in the source and is not reproduced here.)

Proposition 12.6

Suppose Assumption 2 holds. Then the set of solutions to (\({{\mathcal P}}\)) with g defined by (12.8) is bounded and for all \(\rho \) large enough, the solutions to (\({{\mathcal P}_0}\)) and (\({{\mathcal P}}\)) coincide.

Proof

This is a distillation of [1, Theorem 3.4].

In (\({{\mathcal P}_0}\)) and (\({{\mathcal P}}\)) the function f is often smooth, but not prox friendly. In applications it is most often a smooth regularization or a fidelity term. For the ADMM/DR method reviewed in Sect. 12.3 smoothness is not required.

It is assumed that the functions \(g_i\) (\(i=1,2,\dots ,M\)) are prox friendly and that they enjoy some structure that makes g also prox friendly. For instance, if the constraints are separable, then the function

$$\begin{aligned} g({\mathcal A}x) \equiv \rho \sum _{i=1}^M g_i(A_ix) \end{aligned}$$
(12.11)

is also prox-friendly as is the function

$$\begin{aligned} g({\mathcal A}x)\equiv \rho \max \{g_1(A_1x), g_2(A_2x), \dots , g_M(A_Mx), 0\}. \end{aligned}$$
(12.12)

The functions \(g_i\circ A_i\) can be regularizing functions (like total variation) or hard inequality constraints. For example, hard inequality constraints are modeled by the use of indicator functions for \(g_i\) in (\({{\mathcal P}_0}\)): \(g_i(A_ix) = \iota _{\gamma _i{\mathbb {B}}}\left( A_ix - b_i\right) \) where, for a subset \(\varOmega \subset {\mathbb {R}^n}\),

$$ \iota _\varOmega (x)\equiv {\left\{ \begin{array}{ll}0&{}x\in \varOmega \\ +\infty &{} \text{ else }. \end{array}\right. } $$
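
For such hard constraints the corresponding proximal map reduces to a projection: for any prox parameter, the prox of \(y\mapsto \iota _{\gamma _i{\mathbb {B}}}\left( y - b_i\right) \) is simply the projection of y onto the ball of radius \(\gamma _i\) about \(b_i\). A minimal sketch (NumPy assumed; the function name is ours):

```python
import numpy as np

def prox_indicator_ball(u, center, radius):
    """prox of iota_{radius * B}(. - center): Euclidean projection onto the closed ball
    of the given radius about center (independent of the prox weight)."""
    d = u - center
    nrm = np.linalg.norm(d)
    return u if nrm <= radius else center + radius * d / nrm

b_i, gamma_i = np.array([1.0, 2.0]), 0.5
print(prox_indicator_ball(np.array([3.0, 2.0]), b_i, gamma_i))   # [1.5 2. ]
```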

2.2 Saddle Point and Dual Formulations

The saddle point formulation is derived by viewing the function g in (\({{\mathcal P}}\)) as the image of a function \(g^*\) under Fenchel conjugation, that is, \(g = (g^*)^*\). Writing this explicitly into (\({{\mathcal P}}\)) yields

$$\begin{aligned} \min _{x\in \mathbb R^n}\left\{ \max _{y\in {\mathbb {R}^m}}\quad f(x) + \langle {{\mathcal A}x,y}\rangle - g^*(y)\right\} . {({\mathcal S})} \end{aligned}$$

The bifunction in the saddle point formulation is

$$\begin{aligned} L(x,y) := f(x) + \langle {{\mathcal A}x,y}\rangle - g^*(y). \end{aligned}$$
(12.13)

Contrast this with the Lagrangian for the extended problem

$$\begin{aligned} \underset{(x,y)\in {\mathbb {R}^n}\times {\mathbb {R}^m}}{\text{ minimize }}~ f(x)+g(y),~~~ \text{ subject } \text{ to } {\mathcal A}x=y. {({\mathcal P}_{\mathcal L})} \end{aligned}$$

The Lagrangian is

$$\begin{aligned} {\mathcal L}(x,y,z)=f(x)+g(y)+\langle z, {\mathcal A}x-y\rangle , \end{aligned}$$
(12.14)

and the augmented Lagrangian \( \widetilde{{\mathcal L}}\) for (\({{\mathcal P}_{\mathcal L}}\)) is given by

$$\begin{aligned} \widetilde{{\mathcal L}}(x,y,z)=f(x)+g(y)+\langle z, {\mathcal A}x-y\rangle +\tfrac{\eta }{2}\Vert {\mathcal A}x-y\Vert ^2, \end{aligned}$$
(12.15)

where \(z\in {\mathbb {R}^m}\), and \(\eta >0\) is a fixed penalty parameter.

Assumption 1(i) guarantees that the mapping \(L(\cdot ,\cdot )\) has a saddle point, that is, there exists \(({\hat{x}},{\hat{y}}) \in {\mathbb {R}^n}\times {\mathbb {R}^m}\) such that

$$L({\hat{x}}, y) \le L({\hat{x}}, {\hat{y}}) \le L(x, {\hat{y}}) \qquad \forall x\in {\mathbb {R}^n}, y \in {\mathbb {R}^m}. $$

The existence of a saddle point corresponds to zero duality gap for the induced optimization problems

$$ p(x) = \sup _{y} \{L (x, y) : y\in {\mathbb {R}^m}\} \qquad q (y) = \inf _{x} \{ L (x, y) : x\in {\mathbb {R}^n}\}. $$

By weak duality, we have \(\inf _{x\in {\mathbb {R}^n}} p (x) \ge \sup _{y\in {\mathbb {R}^m}} q (y)\).

This can be viewed as a partial dual to problem (\({{\mathcal P}}\)). The full dual problem involves the Fenchel conjugate of the entire objective function. For (\({{\mathcal P}}\)) the dual problem is

$$\begin{aligned} \sup _{y\in {\mathbb {R}^m}}-f^*({\mathcal A}^{T}y)-g^*(-y). {({\mathcal D}')} \end{aligned}$$

Instead of working with this dual, it is more convenient to work with the following equivalent formulation via the change of variable \(y\rightarrow -y\):

$$\begin{aligned} \inf _{y\in {\mathbb {R}^m}}f^*(-{\mathcal A}^{T}y)+g^*(y). {({\mathcal D})} \end{aligned}$$

Under standard constraint qualifications (e.g., [16, Theorem 2.3.4]), \(({\hat{x}},{\hat{y}})\) is a saddle point of L if and only if \({\hat{x}}\) is an optimal solution of the primal problem (\({{\mathcal P}_0}\)), and \({\hat{y}}\) is an optimal solution of the dual problem (\({{\mathcal D}}\)). The following two inclusions characterize the solutions of the problems (\({{\mathcal P}_0}\)) and (\({{\mathcal D}}\)) respectively:

$$ 0 \in \partial f({\overline{x}})+\partial (g\circ {\mathcal A})({\overline{x}}); $$
$$ 0 \in \partial \left( f^*\circ (-{\mathcal A}^{T})\right) ({\overline{y}})+\partial g^* ({\overline{y}}). $$

In both cases, one has to solve an inclusion of the form

$$\begin{aligned} 0 \in (B+D)(x), \end{aligned}$$
(12.16)

for general set-valued mappings B and D.

2.3 Statistical Multi-resolution Estimation

Statistical multi-resolution estimation (SMRE) discussed in Sect. 11.2.7 of Chap. 11 is specialized here for the case of imaging systems with Gaussian noise. Let \({\mathcal G}=\{1,\dots , N\}\times \{1,\dots , N\}\) denote a grid of \(N^2=n\) points. Denote by \({\mathcal V}\) a collection of subsets of \({\mathcal G}\). Suppose there are M such subsets, that is \(M=|{\mathcal V}|\), each of these subsets \(V_i\) consisting of \(m_i\le n\) grid points with \(\sum _{i=1}^{M}m_i=m\ge n\).

The variational model for statistical multi-resolution estimation with Gaussian noise takes the form

$$\begin{aligned} \begin{array}{ll}\underset{x\in {\mathbb {R}^n}}{\text{ minimize } }\qquad f(x)&{}\\ \text{ subject } \text{ to } \qquad \left| \sum _{j\in V_i} w_{i}(j)\left( (Ax)_{j} - b_{j}\right) \right| \le \gamma _{i} \qquad \forall i=1,2,\dots ,M.&{}\end{array} {({\mathcal P}_{SMRE})} \end{aligned}$$

Here \(f: {\mathbb {R}^n}\rightarrow \mathbb {R}\) is a regularization functional, which incorporates a priori knowledge about the unknown signal \({\widehat{x}}\) such as smoothness, \(w_{i}\) is a weighting function for the grid points in the subset \(V_i\), and \(A : {\mathbb {R}^n}\rightarrow {\mathbb {R}^n}\) is the linear imaging operator that models the experiment. The constant \(\gamma _{i}\) has an interpretation in terms of the quantile of the estimator.

In the context of the general model (\({{\mathcal P}_0}\)),

$$\begin{aligned} g_{i}\left( A_{i}x\right) = \left| \sum _{j\in V_i} w_{i}(j)\left( (Ax)_{j} - b_{j}\right) \right| -\gamma _i, ~i=1,2,\dots ,M. \end{aligned}$$
(12.17)

Here the affine mapping \(A_{i}x\equiv \sum _{j\in V_i} w_{i}(j)\left( (Ax)_{j} - b_{j}\right) \) is an averaging operator that accounts for sampling at different resolutions of the image. Note that the observation b need not be in the range of the imaging operator A; all that is assumed is that this mapping is injective, not surjective. This means that, in applications, practitioners need to be careful not to make the constraint bounds \(\gamma _{i}\) too small, otherwise the optimization problem might be infeasible. If the algorithms presented below appear to be diverging for a particular instance of (\({{\mathcal P}_{SMRE}}\)), it is because the problem is infeasible; increasing the constants \(\gamma _{i}\) should solve the problem.
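
To make the constraint functions (12.17) concrete, the sketch below (NumPy assumed; the uniform weights, the \(2\times 2\) windows and the identity in place of the blur A are illustrative choices, not those of [1]) evaluates the residuals \(g_i(A_ix)\) on a toy image; an entry \(\le 0\) means the corresponding multi-resolution constraint is satisfied.

```python
import numpy as np

def smre_residuals(x, b, windows, weights, gammas, A=lambda v: v):
    """Evaluate g_i(A_i x) = |sum_{j in V_i} w_i(j)((Ax)_j - b_j)| - gamma_i per window."""
    Ax = A(x)
    res = []
    for V, w, gamma in zip(windows, weights, gammas):
        s = sum(wj * (Ax[j] - b[j]) for j, wj in zip(V, w))
        res.append(abs(s) - gamma)
    return np.array(res)

# toy 4x4 image, flattened row-major; windows are 2x2 boxes with uniform weights
N = 4
x = np.arange(N * N, dtype=float)
b = x + 0.1 * np.random.default_rng(1).standard_normal(N * N)
windows = [[i + N * j + d for d in (0, 1, N, N + 1)]
           for i in range(0, N, 2) for j in range(0, N, 2)]
weights = [[0.25] * 4 for _ in windows]
gammas = [0.3] * len(windows)
print(smre_residuals(x, b, windows, weights, gammas))
```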

3 Alternating Directions Method of Multipliers and Douglas Rachford

In this section we survey the main results (without proofs) from [1]. For proofs of the statements, readers are referred to that article. A starting point for most of the main approaches to solving (\({{\mathcal P}_0}\)) is the alternating directions method of multipliers (ADMM) (primary sources include [17,18,19,20,21]). This method is one of many splitting methods which are the principal approach to handling the computational burden of large-scale, separable problems [22]. ADMM belongs to a class of augmented Lagrangian methods whose original motivation was to regularize Lagrangian formulations of constrained optimization problems. The ADMM algorithm for solving (\({{\mathcal P}_{\mathcal L}}\)) follows. The penalty parameter \(\eta \) need not be a constant, and indeed evidence indicates that the choice of \(\eta \) can greatly impact the complexity of the algorithm. For simplicity we keep this parameter fixed.

(Algorithm 1, the ADMM iteration for (\({{\mathcal P}_{\mathcal L}}\)) with augmented Lagrangian (12.15), is rendered as an image in the source and is not reproduced here.)
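
As a reading aid, the following sketch records the standard ADMM iteration for (\({{\mathcal P}_{\mathcal L}}\)) that the surrounding text refers to (Step 1: x-update, Steps 2 and 3: y- and multiplier updates). It is a generic sketch under our reading of the text, not a transcription of Algorithm 1; the two argmin oracles are left abstract, exactly as in the discussion below, and NumPy is assumed.

```python
import numpy as np

def admm(argmin_x, prox_g, A, y0, z0, eta=1.0, iters=100):
    """Sketch of the ADMM iteration for (P_L): min f(x) + g(y) s.t. Ax = y.
    argmin_x(v, eta) returns argmin_x f(x) + (eta/2)||Ax - v||^2   (Step 1 oracle),
    prox_g(v, eta)   returns argmin_y g(y) + (eta/2)||y - v||^2    (Step 2 oracle)."""
    y, z = np.asarray(y0, dtype=float), np.asarray(z0, dtype=float)
    for _ in range(iters):
        x = argmin_x(y - z / eta, eta)        # Step 1: minimize augmented Lagrangian in x
        y = prox_g(A @ x + z / eta, eta)      # Step 2: minimize augmented Lagrangian in y
        z = z + eta * (A @ x - y)             # Step 3: multiplier update
    return x, y, z
```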

We do not specify how the argmins in Algorithm 1 should be calculated, and indeed, the analysis that follows assumes that they can be computed exactly.Footnote 1 One problem that should be immediately apparent is that this algorithm operates on a space of dimension \(n+2m\). Since one of the two challenges we address is high dimension, this expansion in the dimension of the problem formulation should be troubling. Nevertheless, we show with this algorithm how the first challenge, namely quantification of convergence, is achieved.

The connection between the ADMM algorithm and the Douglas–Rachford algorithm introduced in Chap. 6, (6.30) was first discovered by Gabay [19] (see also the thesis of Eckstein [17]). For any \(\eta >0\), the Douglas–Rachford algorithm [23, 24] for solving (12.16) is given by

$$\begin{aligned}&z^{k+1}&\in T'z^k\quad (k\in \mathrm {\mathbb {N}}),\end{aligned}$$
(12.18)
$$\begin{aligned} \text{ for }&T'&\equiv {\mathcal J}_{\eta D}\left( {\mathcal J}_{\eta B}(\mathrm{Id}-\eta D)+\eta D\right) , \end{aligned}$$
(12.19)

where \({\mathcal J}_{\eta D}\equiv \left( \mathrm{Id}+ \eta D\right) ^{-1}\) and \({\mathcal J}_{\eta B}\equiv \left( \mathrm{Id}+ \eta B\right) ^{-1}\) are the resolvents of \(\eta D\) and \(\eta B\) respectively.

Given \(z^0\) and \(y^0\in Dz^0\), following [25], define the new variable \(\xi ^0\equiv z^0+\eta y^0\) so that \(z^0= {\mathcal J}_{\eta D}\xi ^0\). We thus arrive at an alternative formulation of the Douglas–Rachford algorithm (12.18):

$$\begin{aligned}&\xi ^{k+1}&\in T\xi ^k \quad (k\in \mathrm {\mathbb {N}}),\end{aligned}$$
(12.20)
$$\begin{aligned} \text{ for }&T&\equiv \tfrac{1}{2}(R_{\eta B}R_{\eta D}+\mathrm{Id})= {\mathcal J}_{\eta B}(2{\mathcal J}_{\eta D}-\mathrm{Id})+(\mathrm{Id}-{\mathcal J}_{\eta D}), \end{aligned}$$
(12.21)

where \(R_{\eta D}\) and \(R_{\eta B}\) are the reflectors of the respective resolvents. This is the form of Douglas–Rachford considered in [26].
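
In the form (12.20)–(12.21) the iteration is only a few lines. The sketch below (plain Python, operating on NumPy arrays; the resolvents are passed in as generic callables) makes the fixed-point map T explicit and returns the monitored point \(z={\mathcal J}_{\eta D}\xi \), which at a fixed point solves the inclusion \(0\in (B+D)(z)\).

```python
def douglas_rachford(res_B, res_D, xi0, iters=100):
    """Sketch of the Douglas-Rachford iteration (12.20)-(12.21):
    xi^{k+1} = T xi^k with T = J_{eta B}(2 J_{eta D} - Id) + (Id - J_{eta D}).
    res_B, res_D are the resolvents J_{eta B}, J_{eta D} (prox maps in the convex setting)."""
    xi = xi0
    for _ in range(iters):
        z = res_D(xi)                        # shadow sequence z^k = J_{eta D} xi^k
        xi = res_B(2.0 * z - xi) + xi - z    # one application of T
    return res_D(xi)                         # J_{eta D} of a fixed point solves 0 in (B+D)(z)
```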

Specializing this to our application yields

$$\begin{aligned} B\equiv \partial \left( f^*\circ (-{\mathcal A}^{T})\right) \quad \text{ and } \quad D\equiv \partial g^*, \end{aligned}$$
(12.22)

and so the resolvent mappings are the proximal mappings of the convex functions \(\left( f^*\circ (-{\mathcal A}^{T})\right) \) and \(g^*\) respectively, and hence the resolvent mappings and corresponding fixed point operator T are single-valued [12].

Proposition 12.7

(Proposition 2.2 [1]) Let \(f:{\mathbb {R}^n}\rightarrow \overline{\mathbb {R}}\) and \(g:{\mathbb {R}^m}\rightarrow \overline{\mathbb {R}}\) be proper, lsc and convex. Let \({\mathcal A}:\,{\mathbb {R}^n}\rightarrow {\mathbb {R}^m}\,\) be linear and suppose there exists a solution to \(0\in (B+D)(x)\) for B and D defined by (12.22). For fixed \(\eta >0\), given any initial points \(\xi ^0\) and \(\left( y^0, z^0\right) \in {\mathrm {gph}}D\) such that \(\xi ^0=y^0+\eta z^0\), the sequences \(\left( z^k\right) _{k\in \mathrm {\mathbb {N}}}\), \(\left( \xi ^k\right) _{k\in \mathrm {\mathbb {N}}}\) and \(\left( y^k\right) _{k\in \mathrm {\mathbb {N}}}\) defined respectively by (12.18), (12.20) and \(y^k\equiv \tfrac{1}{\eta }\left( \xi ^k-z^k\right) \) converge to points \({\overline{z}}\in \mathsf {Fix}\,T'\), \({\overline{\xi }}\in \mathsf {Fix}\,T\) and \({\overline{y}}\in D\left( \mathsf {Fix}\,T'\right) \). The point \({\overline{z}}= {\mathcal J}_{\eta D}{\overline{\xi }}\) is a solution to (\({{\mathcal D}}\)), and \({\overline{y}}=\frac{1}{\eta }\left( {\overline{\xi }}-{\overline{z}}\right) \in D{\overline{z}}\). If, in addition, \({\mathcal A}\) has full column rank, then the sequence \(\left( y^k, z^k\right) _{k\in \mathrm {\mathbb {N}}}\) corresponds exactly to the sequence of points generated in steps 2–3 of Algorithm 1, and the sequence \(\left( x^{k}\right) _{k\in \mathrm {\mathbb {N}}}\) generated by Step 1 of Algorithm 1 converges to \({\overline{x}}\), a solution to (\({{\mathcal P}_0}\)).

The correspondence between Douglas-Rachford and ADMM in the proposition above means that if quantitative convergence can be established for one of the algorithms, it is automatically established for the other. Linear convergence of Douglas-Rachford under the assumption of strong convexity and Lipschitz continuity of f was already established by Lions and Mercier [26]. Recent published work in this direction includes [27, 28]. Local linear convergence of the iterates to a solution was established in [29] for linear and quadratic programs using spectral analysis. In Proposition 12.8, two conditions are given that guarantee linear convergence of the ADMM iterates to a solution. The first condition is classical and follows Lions and Mercier [26]. The second condition, based on [30], is much more prevalent in applications and generalizes the results of [29].

Proposition 12.8

(Theorem 2.3 of [1]) Let \(f:{\mathbb {R}^n}\rightarrow \overline{\mathbb {R}}\) and \(g:{\mathbb {R}^m}\rightarrow \overline{\mathbb {R}}\) be proper, lsc and convex. Suppose there exists a solution to \(0\in (B+D)(x)\) for B and D defined by (12.22) where \({\mathcal A}:\,{\mathbb {R}^n}\rightarrow {\mathbb {R}^m}\,\) is an injective linear mapping. Let \(\widehat{\xi }\in \mathsf {Fix}\,T\) for T defined by (12.21). For fixed \(\eta >0\) and any given triplet of points \(\left( \xi ^0,y^0, z^0\right) \) satisfying \(\xi ^0\equiv z^0+\eta y^0\), with \(y^0\in Dz^0\), generate the sequence \((y^k, z^k)_{k\in \mathrm {\mathbb {N}}}\) by Steps 2–3 of Algorithm 1 and the sequence \((\xi ^k)_{k\in \mathrm {\mathbb {N}}}\) by (12.20).

  1. (i)

    Let \({\mathcal O}\subset {\mathbb {R}^n}\) be a neighborhood of \(\widehat{\xi }\) on which g is strongly convex with constant \(\mu \) and \(\partial g\) is \(\beta \)-inverse strongly monotone for some \(\beta >0\). Then, for any \(\left( \xi ^0, y^0, z^0\right) \in {\mathcal O}\) satisfying \(\xi ^0\equiv z^0+\eta y^0\in {\mathcal O}\), the sequences \((\xi ^k)_{k\in \mathrm {\mathbb {N}}}\) and \((y^k, z^k)_{k\in \mathrm {\mathbb {N}}}\) converge linearly to the respective points \({\overline{\xi }}\in \mathsf {Fix}\,T\) and \(\left( {\overline{y}},{\overline{z}}\right) \) with rate at least \(K=\left( 1-\frac{2\eta \beta \mu ^2}{(\mu +\eta )^2}\right) ^{\tfrac{1}{2}}<1\).

  2. (ii)

    Suppose that \(T:\,W\rightarrow W\,\) for some affine subspace \(W\subset {\mathbb {R}^n}\) with \({\widehat{\xi }}\in W\). On the neighborhood \({\mathcal O}\) of \({\widehat{\xi }}\) relative to W, that is \({\mathcal O}\cap W\), suppose there is a constant \(\kappa >0\) such that

    $$\begin{aligned} \Vert \xi -\xi ^+\Vert \ge \sqrt{\kappa } \mathrm {dist}(\xi ,\mathsf {Fix}\,T)\quad \forall \xi \in {\mathcal O}\cap W,~\forall \xi ^+\in T\xi . \end{aligned}$$
    (12.23)

    Then the sequences \((\xi ^k)_{k\in \mathrm {\mathbb {N}}}\) and \((y^k, z^k)_{k\in \mathrm {\mathbb {N}}}\) converge linearly to the respective points \({\overline{\xi }}\in \mathsf {Fix}\,T\cap W\) and \(\left( {\overline{y}},{\overline{z}}\right) \) with rate bounded above by \(\sqrt{1-\kappa }\).

In either case, the limit point \({\overline{z}}={\mathcal J}_{\eta D}{\overline{\xi }}\) is a solution to (\({{\mathcal D}}\)), \({\overline{y}}\in D{\overline{z}}\) and the sequence \(\left( x^k\right) _{k\in \mathrm {\mathbb {N}}}\) of Step 1 of Algorithm 1 converges to \({\overline{x}}\), a solution of (\({{\mathcal P}_0}\)).

The strong convexity assumption (i) of Proposition 12.8 fails in many applications of interest, and in particular for feasibility problems (minimizing the sum of indicator functions). By [31, Theorem 2.2], case (ii) of Proposition 12.8, in contrast, holds in general for mappings T for which \(T-\mathrm{Id}\) is metrically subregular and the fixed point sets are isolated points with respect to an affine subspace to which the iterates are confined. The restriction to the affine subspace W is a natural generalization for the Douglas–Rachford algorithm, where the iterates are known to stay confined to affine subspaces orthogonal to the fixed point set [32, 33]. We show that metric subregularity with respect to this affine subspace holds in many applications.

Proposition 12.9

(polyhedrality of the Douglas–Rachford operator) Let \(f:{\mathbb {R}^n}\rightarrow \overline{\mathbb {R}}\) and \(g:{\mathbb {R}^m}\rightarrow \overline{\mathbb {R}}\) be proper, lsc and convex. Suppose, in addition, that f and g are piecewise linear-quadratic. The operator \(T:\,{\mathbb {R}^m}\rightarrow {\mathbb {R}^m}\,\) defined by (12.21) with \(\eta >0\) fixed, is polyhedral for B and D given by (12.22) where \({\mathcal A}:\,{\mathbb {R}^n}\rightarrow {\mathbb {R}^m}\,\) is a linear mapping.

Proof

This is Proposition 2.6 of [1].

Proposition 12.10

(local linear convergence, Theorem 2.7 of [1]) Let \(f:{\mathbb {R}^n}\rightarrow \overline{\mathbb {R}}\) and \(g:{\mathbb {R}^m}\rightarrow \overline{\mathbb {R}}\) be proper, lsc, convex, piecewise linear-quadratic functions (see Definition 12.1). Define the operator \(T:\,{\mathbb {R}^m}\rightarrow {\mathbb {R}^m}\,\) by (12.21) with \(\eta >0\) fixed and B and D given by (12.22) where \({\mathcal A}:\,{\mathbb {R}^n}\rightarrow {\mathbb {R}^m}\,\) is a linear mapping. Suppose that there exists a solution to \(0\in (B+D)(x)\), that \(T:\,W\rightarrow W\,\) for W some affine subspace of \({\mathbb {R}^m}\) and that \(\mathsf {Fix}\,T\cap W\) is an isolated point \(\{{\overline{\xi }}\}\). Then there is a neighborhood \({\mathcal O}\) of \({\overline{\xi }}\) such that, for all starting points \((\xi ^0,y^0, z^0)\) with \(\xi ^0\equiv z^0+\eta y^0\in {\mathcal O}\cap W\) for \(y^0\in D(z^0)\) so that \({\mathcal J}_{\eta D}\xi ^0=z^0\), the sequence \((\xi ^k)_{k\in \mathrm {\mathbb {N}}}\) generated by (12.20) converges Q-linearly to \({\overline{\xi }}\) where \({\overline{z}}\equiv {\mathcal J}_{\eta D}\overline{\xi }\) is a solution to (\({{\mathcal D}}\)). The rate of linear convergence is bounded above by \(\sqrt{1-\kappa }\), where \(\kappa =c^{-2}>0\), for c a constant of metric subregularity of \(T-\mathrm{Id}\) at \({\overline{\xi }}\) for the neighborhood \({\mathcal O}\). Moreover, the sequence \(\left( y^k, z^k\right) _{k\in \mathrm {\mathbb {N}}}\) generated by steps 2–3 of Algorithm 1 converges linearly to \(\left( {\overline{y}},{\overline{z}}\right) \) with \({\overline{y}}=\tfrac{1}{\eta }\left( {\overline{\xi }}-{\overline{z}}\right) \), and the sequence \(\left( x^k\right) _{k\in \mathrm {\mathbb {N}}}\) defined by Step 1 of Algorithm 1 converges to a solution to (\({{\mathcal P}_0}\)).

Fig. 12.1 Recorded STED image used in [1]. Inset: area to be processed (\(640\times 640\,\mathrm{nm}^2\)). The scale bar (white) is \(1\,\upmu \)m

3.1 ADMM for Statistical Multi-resolution Estimation of STED Images

The theoretical results above are demonstrated with an image \(b\in {\mathbb {R}^n}\) (Fig. 12.1) generated from a Stimulated Emission Depletion (STED) microscopy experiment [34, 35] conducted at the Laser-Laboratorium Göttingen examining tubulin, represented as the “object” \(x\in {\mathbb {R}^n}\). The imaging model is simple linear convolution. The measurement b, shown in Fig. 12.1, is noisy or otherwise inexact, and thus an exact solution is not desirable. Although the noise in such images is usually modeled by Poisson noise, a Gaussian noise model with constant variance suffices as the photon counts are of the order of 100 per pixel and do not vary significantly across the image. We calculate the numerically reconstructed tubulin density \({\overline{x}}\) shown in Fig. 12.2a via Algorithm 1 for the problem (\({{\mathcal P}}\)) with the qualitative regularization

$$\begin{aligned} f(x) \equiv \alpha || x||^2 \end{aligned}$$
(12.24)

and exact penalty \(g({\mathcal A}x)\) given by (12.12) with \(g_i\) given by (12.17).

Fig. 12.2 a Numerical reconstruction via Algorithm 1 from the imaging data shown in Fig. 12.1 for \(\rho =4096\). b Iterates of Algorithm 1: primal and dual step sizes, as well as constraint violation \(g({\mathcal A}x)\) given by (12.12) with \(\rho =4096\). Reproduced from [1]

For an image size of \(n=64\times 64\) with three resolution levels the resulting number of constraints is \(m=12161\) (that is, \(64^2\) constraints at the finest resolution, \(4*32^2\) constraints at the next resolution and \(9*21^2\) constraints at the lowest resolution). The constant \(\alpha =0.01\) in (12.24) is used to balance the contributions of the individual terms to make the most of limited numerical accuracy (double precision). The constant \(\gamma _i\) is chosen so that the model solution would be no more than 3 standard deviations from the noisy data on each interval of each scale.

Since this is experimental data, there is no "truth" for comparison; the constraint, together with the error bounds relating the numerical solution to the model solution, provides statistical guarantees on the numerical reconstruction [36]. In Fig. 12.2b the iteration is shown with the value of \(\rho =4096\) for which the constraints are exactly satisfied (to within machine precision), indicating the correspondence of the computed solution of problem (\({{\mathcal P}}\)) to a solution to the exact model problem (\({{\mathcal P}_0}\)).

The only assumption from Proposition 12.10 that cannot be verified for this implementation is the assumption that the algorithm fixed point is a singleton; all other assumptions are satisfied automatically by the problem structure. We observe, however, starting from around iteration 1500 in Fig. 12.2b, behavior that is consistent with (i.e. does not contradict) linear convergence. From this, the observed convergence rate is \(c=0.9997\), which yields an a posteriori upper estimate of the pixel-wise error of about \(8.9062\times 10^{-4}\), or 3 digits of accuracy at each pixel.Footnote 2

4 Primal-Dual Methods

The ADMM method presented above suffers from the extreme computational cost of computing the prox-operator in step 1. The results of the previous section required several days of cpu time on a 2016-era laptop. In this section we present a method studied in [3] that can achieve results in about 30 s on the same computer architecture; we survey the main results (without proofs) from that study. There is one subtle difference in the present survey relative to [3] that has major implications for the application and implementation of the main Algorithm 2.

In this section we consider exclusively functions g in problem (\({{\mathcal P}}\)) of the form (12.11). The algorithm we revisit is the proximal alternating predictor-corrector (PAPC) algorithm proposed in [37] for solving (\({{\mathcal S}}\)). It consists of a predictor-corrector gradient step for handling the smooth part of L in (12.13) and a proximal step for handling the nonsmooth part.

(Algorithm 2, the extended PAPC (EPAPC) method, is rendered as an image in the source and is not reproduced here.)
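
As a reading aid, the sketch below records a PAPC-type iteration in the spirit of [37] with the additional projection onto \((\ker {{\mathcal A}^T})^\perp \) described in the text. It is our reconstruction under stated assumptions, not a transcription of Algorithm 2; the step numbering may differ, and the step-size choices and the dense pseudo-inverse are illustrative.

```python
import numpy as np

def epapc(grad_f, prox_g_conj, A, x0, y0, L_f, iters=1000, project=True):
    """Sketch of a PAPC-type iteration for the saddle point problem (S):
    predictor gradient step in x, dual prox step in y (optionally projected onto
    (ker A^T)^perp), corrector gradient step in x.
    prox_g_conj(v, sigma) must evaluate prox_{sigma, g*}(v)."""
    tau = 0.9 / L_f                                   # tau < 1/L_f (illustrative choice)
    sigma = 1.0 / (tau * np.linalg.norm(A @ A.T, 2))  # sigma = 1/(tau ||A A^T||_2)
    P = A @ np.linalg.pinv(A.T @ A) @ A.T             # P_{(ker A^T)^perp} = A (A^T A)^{-1} A^T
    x, y = np.asarray(x0, dtype=float), np.asarray(y0, dtype=float)
    for _ in range(iters):
        p = x - tau * (grad_f(x) + A.T @ y)           # predictor step in x
        y = prox_g_conj(y + sigma * (A @ p), sigma)   # dual prox step
        if project:
            y = P @ y                                 # projection onto (ker A^T)^perp
        x = x - tau * (grad_f(x) + A.T @ y)           # corrector step in x
    return x, y
```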

At each iteration the algorithm computes one gradient and one prox-mapping corresponding to the nonsmooth function, both of which are assumed to be efficiently implementable. We suppose these can be evaluated exactly, though this does not take into account finite precision arithmetic. The dimension of the iterates of the EPAPC algorithm is of the same order of magnitude as for the ADMM/Douglas-Rachford method, but the individual steps can be run in parallel and, with the exception of the projection in Step 6, are much less computationally intensive to execute.

For quantitative convergence guarantees of primal-dual methods, additional assumptions are required.

(Assumption 3 is rendered as an image in the source and is not reproduced here.)

The assumption of Lipschitz continuous gradients, Assumption 3(i), is standard, but stronger than one might desire in general. The assumption is included mainly to guarantee boundedness of the iterates; Lipschitz continuity of the gradients is enough for our purposes. By the standing Assumption 1 the mapping \({\mathcal A}\) is injective; when \(m\le n\), \({\mathcal A}\) has full row rank and \({\mathcal A}{{\mathcal A}^T}\) is invertible. When \(m>n\), \({\mathcal A}\) is still injective but \({\mathcal A}{{\mathcal A}^T}\) has a nontrivial kernel and care must be taken that the conjugate function \(g^*\) does not decrease too fast in the direction of the kernel of \({\mathcal A}^T\). This is assured by Assumption 3(iii), which comes into play in Lemma 12.1.

Step (3) of Algorithm 2 can be written more compactly when \(g(w) := g(w_1, \ldots , w_M) = \sum _{i=1}^{M}g_i(w_i)\). In this case, the convex conjugate of a separable sum of functions is the sum of the individual conjugates: \(g^*(w) := \sum _{i=1}^{M}g^*_i(w_i)\). Defining the matrix \(S ={\sigma }^{-1}I_m\) we immediately get that for any point \(\zeta _i\in \mathbb {R}^{m_i}, \; i=1, \ldots , M\),

$$ \mathrm{prox}_{S,{g^*}} (\zeta ) = (\mathrm{prox}_{{\sigma },{g^*_1}}(\zeta _1), \mathrm{prox}_{{\sigma } ,{g^*_2}}(\zeta _2), \ldots , \mathrm{prox}_{{\sigma },{g^*_M}}(\zeta _M)). $$

Thus Step (3) of Algorithm 2 can be written in vector notation as \(w^{k}= \mathrm{prox}_{S,g^*} (y^{k-1}+ \sigma {\mathcal A}p^{k})\). It is possible to use different proximal step constants \(\sigma _i\), \(i = 1,\ldots ,M\); see [37] for details. The choice \(\sigma _i=\sigma \) for \(i=1, \ldots , M\) is purely for simplicity of exposition. The projection onto \((\ker {{\mathcal A}^T})^\perp \) in Step 6 is carried out by applying the pseudo inverse:

$$ P_{\left( \ker {{\mathcal A}^T}\right) ^\perp } = {\mathcal A}\left( {{\mathcal A}^T}{\mathcal A}\right) ^{-1}{{\mathcal A}^T}. $$

When \(m\le n\) and \(A_i\) is full rank for all \(i=1,2,\dots ,M\), then \(\ker {{\mathcal A}^T}=\{0\}\) and the above operation is not needed. But an unavoidable feature of multi-resolution analysis, our motivating application, is that \(m>n\), so some thought must be given to efficient computation of \(\left( {{\mathcal A}^T}{\mathcal A}\right) ^{-1}\).
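
In practice one would not form \(\left( {{\mathcal A}^T}{\mathcal A}\right) ^{-1}\) explicitly. The projection can instead be applied to a given vector by solving a least-squares problem; the sketch below assumes a dense NumPy matrix, whereas for the convolution-type operators of Sect. 12.2.3 an iterative solver (e.g. conjugate gradients on the normal equations) would be the natural choice.

```python
import numpy as np

def project_onto_range(A, y):
    """Apply P_{(ker A^T)^perp} y = A (A^T A)^{-1} A^T y without forming the inverse:
    the least-squares solution x of min_x ||A x - y|| gives A x, the projection of y
    onto range(A) = (ker A^T)^perp."""
    x, *_ = np.linalg.lstsq(A, y, rcond=None)
    return A @ x
```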

The next technical result, which is new, establishes crucial upper bounds on the growth of the Lagrangian with respect to the primal and dual variables.

Lemma 12.1

Let Assumption 3 hold and let \(\left( (p^{k}, y^{k}, x^{k})\right) _{k\in \mathrm{N}}\) be the sequence generated by the EPAPC algorithm. Then for every \(k\in \mathrm{N}\) and every \((x,y)\in {\mathbb {R}^n}\times {\mathbb {R}^m}\),

$$\begin{aligned} \begin{aligned} L(x^{k},y^{k}) - L(x,y^{k})&\le \frac{1}{2\tau }\left( \left\| x-x^{k-1}\right\| ^2 - \left\| x-x^{k}\right\| ^2\right) \\&- \frac{1}{2}\left( \frac{1}{\tau } - K_f\right) \left\| y^{k}-y^{k-1}\right\| ^2. \end{aligned} \end{aligned}$$
(12.26)

and

$$\begin{aligned} L(x^{k},y) - L(x^{k},y^{k}) \le \frac{1}{2}\left( \left\| y-y^{k-1}\right\| ^2_G - \left\| y-y^{k}\right\| ^2_G - \left\| y^{k}-y^{k-1}\right\| ^2_G\right) \end{aligned}$$
(12.27)

where

$$\begin{aligned} G:= {\sigma }^{-1}I_m - \tau {\mathcal A}{{\mathcal A}^T}. \end{aligned}$$
(12.28)

Note that for the choice of \(\tau \) given in the parameter initialization of Algorithm 2, \(G\succ 0\).

Proof

The proof of (12.26) follows exactly as in the proof of [37, Lemma 3.1(i)]. The second inequality (12.27) follows exactly as the proof of [37, Lemma 3.1(ii)] in the case that \(\ker {{\mathcal A}^T}=\{0\}\). The case where \(\ker {{\mathcal A}^T}\) is nontrivial requires more care. The proof of this is suppressed in the interest of brevity.

The next intermediate result establishes pointwise quadratic supportability (Definition 12.2) on bounded sets at all saddle points under Assumptions 1 and 3.

Proposition 12.11

([3]) Let \(\left( (p^{k}, y^{k}, x^{k})\right) _{k\in \mathrm{N}}\) be the sequence generated by the EPAPC algorithm. If Assumptions 1 and 3 are satisfied, then for any primal solution \({\hat{x}}\) to the saddle point problem (\({{\mathcal S}}\)), there exists a \(\mu >0\) such that

$$\begin{aligned} f(x^{k})\ge f({\hat{x}})+\left\langle \nabla f({\hat{x}}), x^{k}-{\hat{x}}\right\rangle + \tfrac{1}{2}\mu \Vert x^{k}-{\hat{x}}\Vert ^2 \quad \forall k. \end{aligned}$$
(12.29)

The constant \(\mu \) in Proposition 12.11 depends on the choice of \((x^0, y^0)\) and so depends implicitly on the distance of the initial guess to the point in the set of saddle point solutions.

Convergence of the primal-dual sequence is with respect to a weighted norm on the primal-dual product space built from G in (12.28). Define

$$\begin{aligned} u = \begin{pmatrix}x\\ y\end{pmatrix}, \qquad H = \begin{pmatrix} \tau ^{-1}I_n &{} 0 \\ 0 &{} G \end{pmatrix} \end{aligned}$$
(12.30)

where by the assumptions on the choice of \(\tau \) given in Algorithm 2, \(G\succ 0\). We can then define an associated norm using the positive definite matrix H, \(\left\| u\right\| ^2_H := \frac{1}{\tau }\left\| x\right\| ^2 + \left\| y\right\| ^2_G\).

We are now ready to state the main result and corollaries, whose proofs can be found in [3].

Theorem 12.4.1

Let \(\left( (p^{k}, x^{k}, y^{k})\right) _{k\in \mathrm{N}}\) be the sequence generated by the EPAPC algorithm. Let \(\lambda _{min+}({\mathcal A}{{\mathcal A}^T})\) denote the smallest nonzero eigenvalue of \({\mathcal A}{{\mathcal A}^T}\). If Assumptions 1 and 3 are satisfied, then there exists a saddle point solution for \(L(\cdot ,\cdot )\), the pair \({\hat{u}}=({\hat{x}},{\hat{y}})\), with \({\hat{y}}\in (\ker {{\mathcal A}^T})^\perp \), and for any \(\alpha > 1\) and for all \(k\ge 1\), the sequence \(\left( u^k = (x^{k},y^{k})\right) _{k\in \mathrm{N}}\) satisfies

$$\begin{aligned} \left\| u^k- {\hat{u}}\right\| ^2_H \le \frac{1}{1 + \delta }\left\| u^{k-1}- {\hat{u}}\right\| ^2_H, \end{aligned}$$
(12.31)

where

$$\begin{aligned} \delta = \min \left\{ \frac{(\alpha -1)\tau \sigma (1 -\tau L_f)\lambda _{min+}({\mathcal A}{{\mathcal A}^T})}{\alpha }, \frac{\mu \tau \sigma \lambda _{min+}({\mathcal A}{{\mathcal A}^T})}{\alpha \tau L_f^2 + \sigma \lambda _{min+}({\mathcal A}{{\mathcal A}^T})}\right\} \end{aligned}$$
(12.32)

is positive and \(\mu >0\) is the constant of pointwise quadratic supportability of f at \(\hat{x}\) depending on the distance of the initial guess to the point \(({\hat{x}},{\hat{y}})\) in the solution set \({\mathcal S}^*\). In particular, \(\left( (x^{k}, y^{k})\right) _{k\in \mathrm{N}}\) is Q-linearly convergent with respect to the H-norm to a saddle-point solution.

An unexpected corollary of the result above is that the set of saddle points is a singleton.

Corollary 12.12

(Unique saddle point) Under Assumptions 1 and 3 the solution set \({\mathcal S}^*\) is a singleton.

The above theorem yields the following estimate on the number of iterations required to achieve a specified distance to a saddle point.

Corollary 12.13

Under Assumptions 1 and 3, let \(\bar{u}=(\bar{x}, \bar{y})\) be the limit point of the sequence generated by the EPAPC algorithm. In order to obtain

$$\begin{aligned} \Vert x^{k}-\bar{x}\Vert \le \epsilon \qquad (\text {resp.} \quad \Vert y^{k}-\bar{y}\Vert _G \le \epsilon ), \end{aligned}$$
(12.33)

it suffices to compute k iterations, with

$$\begin{aligned} k \ge \frac{2\log \left( \frac{C}{L_f\epsilon }\right) }{\delta } \qquad \left( \text {resp.} \quad k \ge \frac{2\log \left( \frac{C}{\epsilon }\right) }{\delta }\right) , \end{aligned}$$
(12.34)

where \( C = \Vert u^0 - \bar{u}\Vert _H = \left( \frac{1}{\tau }\left\| x^0-\bar{x}\right\| ^2 + \left\| y^0-\bar{y}\right\| ^2_G\right) ^{1/2}\), and \(\delta \) is given in (12.32).

4.1 EPAPC for Statistical Multi-resolution Estimation of STED Images

An efficient computational strategy for evaluating or at least approximating the projection \(P_{(\ker {{\mathcal A}^T})^\perp }\) in Step 6 of Algorithm 2 has not yet been established. We report here preliminary computational results of Algorithm 2 without computing Step 6. Our results show that the method is promising, though error bounds to the solution to (\({{\mathcal S}}\)) are not justified without computation of \(P_{(\ker {{\mathcal A}^T})^\perp }\).

In our numerical experiments, the constraint penalty in (\({{\mathcal S}}\)) takes the form \( g(y) = \sum _i \iota _{{\mathcal C}_i}(y) \) where each \({\mathcal C}_i = \{ y: | \sum _{j\in V_i} \omega _i (y_j - b_j) | \le \gamma _i\}\). This is an exact penalty function, and so solutions to (\({{\mathcal S}}\)) correspond to solutions to (\({{\mathcal P}_0}\)). Using Moreau’s identity (12.6), the prox-mapping is evaluated explicitly in (6) for each constraint by

$$\begin{aligned} y^{k}_i= & {} \mathrm{prox}_{\sigma ,\iota ^*_{{\mathcal C}_i}}(y^{k-1}_i + \sigma A_ip^{k})\\= & {} y^{k-1}_i + \sigma A_ip^{k}- \sigma P_{{\mathcal C}_i}\left( \frac{y^{k-1}_i + \sigma A_ip^{k}}{\sigma }\right) , \qquad i=1, \ldots , M. \end{aligned}$$

The proximal parameter is a function of \(\tau \) and is given by \(\sigma = 1/ ( \tau \Vert {\mathcal A}{{\mathcal A}^T}\Vert _2)\); more details can be found in [37, Sect. 4.1].
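
If, as in (12.17), each constraint acts on the scalar \(t_i=A_ix\), so that the feasible set reduces to the interval \([-\gamma _i,\gamma _i]\) and \(P_{{\mathcal C}_i}\) is a clip, the displayed dual update has a one-line implementation. A sketch (NumPy assumed; the names are ours and the scalar reduction is our assumption):

```python
import numpy as np

def dual_update(y_prev, Aip, sigma, gamma):
    """One dual update for a scalar constraint |A_i x| <= gamma_i:
    prox_{sigma, iota*_{[-gamma, gamma]}}(u) = u - sigma * clip(u / sigma, -gamma, gamma),
    with u = y_i^{k-1} + sigma * A_i p^k (the Moreau-identity evaluation displayed above)."""
    u = y_prev + sigma * Aip
    return u - sigma * np.clip(u / sigma, -gamma, gamma)
```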

Here we also consider a smooth approximation of the \(L^1\)-norm as the qualitative objective. The \(L^1\)-norm is non-smooth at the origin; in order to make derivative-based methods applicable we therefore consider a smoothed approximation, known as the Huber approximation.

The Huber loss function is defined as follows:

$$\begin{aligned} \Vert x\Vert _{1, \alpha } = \sum _{i,j} \phi _{\alpha }(x_{i,j}), \qquad \phi _{\alpha }(t) = {\left\{ \begin{array}{ll} \frac{t^2}{2\alpha } &{} \text {if}\, |t| \le \alpha \\ |t| - \frac{\alpha }{2} &{} \text {if}\,|t| > \alpha , \end{array}\right. } \end{aligned}$$
(12.35)

where \(\alpha > 0\) is a small parameter defining the trade-off between quadratic regularization (for small values) and \(L^1\) regularization (for larger values). The function \(\phi _{\alpha }\) is smooth with \(\frac{1}{\alpha }\)-Lipschitz continuous derivative, given by

$$\begin{aligned} \phi '_{\alpha }(t) = {\left\{ \begin{array}{ll} \frac{t}{\alpha } &{} \text {if}\, |t| \le \alpha \\ \text {sgn}(t) &{} \text {if}\,|t| > \alpha . \end{array}\right. } \end{aligned}$$
(12.36)

Pointwise quadratic supportability of this function at solutions is not unreasonable, but it must still be assumed.
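
For reference, a direct transcription of (12.35)–(12.36) (NumPy assumed; vectorized over the pixel array):

```python
import numpy as np

def huber(x, alpha=0.25):
    """Huber approximation ||x||_{1,alpha} of the 1-norm, cf. (12.35)."""
    ax = np.abs(x)
    return np.sum(np.where(ax <= alpha, ax ** 2 / (2 * alpha), ax - alpha / 2))

def huber_grad(x, alpha=0.25):
    """Componentwise derivative (12.36); the gradient is (1/alpha)-Lipschitz."""
    return np.where(np.abs(x) <= alpha, x / alpha, np.sign(x))
```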

We demonstrate our reconstruction of the image inset shown in Fig. 12.1 of size \(n=64^2\) with the same SMRE model as the demonstration in Sect. 12.3.1. The confidence level \(\gamma _i\) was set to \(0.25*i\) at each resolution level (\(i=1,2,3\)). Figure 12.3(top left) shows the reconstruction with the \(L^2\) function \(f(x) = 0.01 \Vert x\Vert ^2\) (compare to Fig. 12.2). Figure 12.3(top right) shows the reconstruction with the Huber function \(\Vert x\Vert _{1,\alpha }\), where \(\alpha = 0.25\). Figure 12.3(bottom) shows the step size of the primal-dual pair for each of these regularized problems as a function of iteration. The model with quadratic regularization achieves a better average rate of convergence, but for both objective functions the algorithm appears to exhibit R-linear convergence (not Q-linear). What is not evident from these experiments is the computational effort required per iteration. Without computation of the pseudo-inverse in step 6, the EPAPC algorithm computes these results in about 30 s on a 2018-era laptop, compared to several days for the results shown in Fig. 12.2.

Fig. 12.3 Results of image denoising and deconvolution via Algorithm 2 for statistical multi-resolution estimation with three levels at iteration 4000 with a \(f(x)=0.01\Vert x\Vert ^2\) and b \(f(x) = \Vert x\Vert _{1,0.25}\), the Huber loss function. Frame c shows the step length \(\Vert u^{k+1} - u^k\Vert _H\), where \(u^k=(x^{k},y^{k})\), of the associated sequence as a function of the number of iterations

5 Randomized Block-Coordinate Primal-Dual Methods

The previous sections reviewed numerical strategies and structures that yield quantitative estimates of the distance of an iterate to the solution of the underlying variational problem. In this section we examine implementation strategies for dealing with high dimensional problems. These are implementation strategies because they do not involve changing the optimization model. Instead, we select at random a smaller subset of variables or constraints in the computation of an update in the full-dimensional iterative procedure. This is the principal strategy for handling problems that, due to their size, must be distributed across many processing and storage units (see for instance [26, 38,39,40] and references therein). We survey here a randomized primal-dual technique proposed and analyzed in [2]. The main theoretical question to resolve with such approaches is whether, and in what sense, iterates converge to a solution to the original problem. We can determine whether the iterates converge, but obtaining an estimate of the distance to the solution remains an open problem.

The algorithm below is a primal-dual method like the algorithms reviewed above, with the exception that it solves an extension of the dual problem (\({{\mathcal D}}\)):

$$\begin{aligned} \begin{array}{ll}\underset{(x,y)\in {\mathbb {R}^n}\times {\mathbb {R}^m}}{\text{ minimize } }\qquad \,\,\,\,f^*(x)+g^*(y)&{}\\ \text{ subject } \text{ to } \qquad x=-{{\mathcal A}^T}y&{}\end{array}. \qquad {(\widetilde{{\mathcal D}})} \end{aligned}$$

The main \(\mathrm{prox}\) operation is computed on the dual objective in (\({{\mathcal D}}\)), that is \(f^*(x)+g^*(y)\) with respect to the variables \((x, y)\in {\mathbb {R}^n}\times {\mathbb {R}^m}\). The dimension of the basic operations is unchanged from the previous approaches, but the structure of the sum of functions allows for efficient evaluation of the \(\mathrm{prox}\) mapping. Implicit in this is that the function f is prox friendly. In the algorithm description below it is convenient to use the convention \(f\equiv g_{0}\), \(A_{0}\equiv \mathrm{Id}\). The algorithm is based in part on [39].

(Algorithm 3, the randomized block-coordinate primal-dual (RBPD) method, is rendered as an image in the source and is not reproduced here.)

Notice that each iteration of Algorithm 3 requires only two small matrix-vector multiplications: \(A_i(\cdot )\) and \(A_i^T(\cdot )\). The methods of the previous sections, in contrast, worked with the full matrix \({\mathcal A}= [A_1^T,\dots , A_M^T]^T\). This means that all iterations involve full vector operations. For some applications this might not be feasible, at least on standard desktop computers, due to the size of the problems. Algorithm 3 uses only blocks \(A_i\) of \({\mathcal A}\), therefore each iteration requires fewer floating point operations, at the cost of having less information available for choosing the next step. This reduction in the effectiveness of the step is compensated for through larger block-wise steps. Computation of the step size is particularly simple. This follows the same hybrid step-length approach developed in [41] for the nonlinear problem of blind ptychography. In particular, we use step sizes adjusted to each block \(A_i\):

$$ \tau \sigma _i \lambda _{\max } (A_i^TA_i) <1 \quad \forall i=0,1,\dots , M. $$
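
For a given \(\tau \) the block step sizes \(\sigma _i\) can be set directly from this rule, using \(\Vert A_i\Vert ^2=\lambda _{\max }(A_i^TA_i)\). A sketch (NumPy assumed; the safety factor is an illustrative choice, not prescribed in [2]):

```python
import numpy as np

def block_step_sizes(blocks, tau, safety=0.99):
    """Return sigma_i with tau * sigma_i * lambda_max(A_i^T A_i) < 1 for each block A_i,
    using the spectral norm ||A_i||_2^2 = lambda_max(A_i^T A_i)."""
    return [safety / (tau * np.linalg.norm(A_i, 2) ** 2) for A_i in blocks]
```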

Proposition 12.14

(Theorem 1 of [2]) Suppose Assumption 1 holds and let \(\tau \sigma _i \Vert A_i \Vert ^2 <1\). Then \((x^k,y^k)\) generated by Algorithm 3 converges to a solution to \((\widetilde{{\mathcal D}})\). In particular, the sequence \((x^k)\) converges almost surely to a solution to (\({{\mathcal P}}\)).

The statement above concerns part (i) of Theorem 1 of [2]. No smoothness is required of the qualitative regularization f. Instead, it is assumed that this function is prox-friendly. This opens up the possibility of using the 1-norm as a regularizer, promoting, in some sense, sparsity in the image. No claim is made on the rate of convergence, though the numerical experiments below indicate that, for regular enough functions f, convergence might be locally linear. This remains to be proved.

5.1 RBPD for Statistical Multi-resolution Estimation of STED Images

Despite many open questions regarding convergence, random methods offer a way to handle extremely large problems. To make a comparison with the deterministic approaches above, cycles of the RBPD Algorithm 3 are counted in terms of epochs. An epoch contains the number of passes through steps 1–6 of Algorithm 3 required before each block is chosen at least once. After k epochs, therefore, the i-th coordinate of x will be updated, on average, the same number of times for the randomized algorithm as for the deterministic methods. In other words, an epoch for a randomized block-wise method is comparable to an iteration of a deterministic method. As the RBPD algorithm updates only one block per iteration, each iteration is less computationally intensive than its deterministic counterparts. However, in our case this efficient iteration still requires one to evaluate two (possibly) expensive convolution products (embedded in \(A_i x\) and \(A_i^T y\)). Thus, if these operations are relatively expensive, the efficiency gain will be marginal. Nevertheless, because of the ability to operate on smaller blocks, the randomized method requires, per epoch, approximately half the time required per iteration of the deterministic methods. Although the quantitative convergence analysis remains open, our numerical experiments indicate that the method achieves a comparable step-residual to the EPAPC Algorithm 2 after the same number of epochs/iterations.

As with the experiments in the previous sections, we use three resolutions, which results in one block at the highest resolution, four blocks at the next resolution (four possible shifts of \(2\times 2\) pixels), and nine blocks at the third resolution (nine different shifts of \(3\times 3\) pixels). We applied Algorithm 3 with different regularizations f in (\({{\mathcal P}_0}\)): the 1-norm \(f(x) = \Vert x \Vert _{1}\), the Huber loss function \(f(x) = \Vert x\Vert _{1, \alpha }\) given by (12.35) (\(\alpha =0.25\)) and the squared Euclidean norm. As with the EPAPC experiments, the function g is given by (12.11) with \(g_i\) given by (12.17) for the parameter \(\gamma _i=0.25*i\) for \(i=1,2,3\). All of these functions are prox-friendly and have closed-form Fenchel conjugates. The gain in efficiency over the deterministic EPAPC method proposed above (without computation of the pseudo-inverse) is a factor of 2.

Figure 12.4a–c shows the reconstructions on the same \(64\times 64\) image data used in the previous sections. The numerical performance of the algorithm is shown in Fig. 12.4d. What the more efficient randomization strategy enables is processing of the full \(976\times 976\) pixel image. The result for regularization with the 1-norm is shown in Fig. 12.5.

Fig. 12.4 Results of image denoising and deconvolution using RBPD (Algorithm 3) for statistical multi-resolution estimation of a \(64\times 64\) pixel image with three levels at epoch 4000 with a \(f(x)=0.01\Vert x\Vert ^2\), b \(f(x) = \Vert x\Vert _{1,0.25}\), the Huber loss function, and c \(f(x)=\Vert x\Vert _1\). d The step length \(\Vert u^{k+1} - u^k\Vert _H\), where \(u^k=(x^{k},y^{k})\), of the associated sequence as a function of the number of iterations

Fig. 12.5 Left: original image. Right: image denoising and deconvolution using RBPD (Algorithm 3) for statistical multi-resolution estimation with three levels at epoch 4000 with \(f(x)=\Vert x\Vert _1\)