1 Introduction

In many experimental settings the information \(z\in\mathbb{R}^{n}\) to be processed and analyzed computationally is obtained by measuring some real-world data \(x\in\mathbb{R}^{m}\). The act of measurement often introduces distortions or errors into the data; if the distortion \(A:\mathbb{R}^{m}\rightarrow\mathbb{R}^{n}\) is known, it may be inverted to recover the original data. A particularly common case (e.g. in image processing, dose computation, or convolution and deconvolution processes in general [1, 2]) occurs when this relation A between measurements and data is linear or easily linearizable, i.e. when \(A\in\mathbb{R}^{m\times n}\).

It is thus natural to consider the following optimization problem

$$ \min_{x\in\mathbb{R}^{m}}f(Ax), $$
(1.1)

where \(f:\mathbb{R}^{n}\rightarrow\mathbb{R}\) is a continuously differentiable function and A is a real \(m\times n\) matrix. Typical (first-order) approaches for solving (1.1) involve estimates of the gradient; see, for example, the classical works of Levitin and Polyak [3], Goldstein and Tretyakov [4], and more recent related results [5, 6]. Hence one needs to evaluate the term

$$ \nabla_{x}f(Ax)=A^{T}\cdot \nabla_{z}f(z), $$
(1.2)

where \(z=Ax\). When A is ill-conditioned, (1.2) carries little useful information and long run-times ensue; see also [7, 8].

The purpose of this paper is to introduce a new preconditioning process that alters the singular value spectrum of A and thereby transforms (1.1) into a more benign problem. The proposed algorithmic scheme can serve as a preconditioner for many optimization procedures; due to their simplicity and nice geometrical interpretation we focus here on projection methods. For related work using preconditioning in optimization with applications see [9, 10] and the many references therein.

The paper is organized as follows. In Section 2 we present some preliminaries and definitions that will be needed in the sequel. In Section 3 the new Singular Value Homogenization (SVH) transformation is presented and analyzed. In Section 4 we present numerical experiments on linear least squares and on dose deposition computation in IMRT; the experiments are benchmarked against LAPACK solvers and projection methods. Finally, we summarize our findings and put them into a larger context in Section 5.

2 Preliminaries

Throughout the paper we adhere to the following definitions and notation. We denote by \(\mathcal{C}^{1}(\mathbb{R}^{m}) \) the set of all continuously differentiable functions \(f:\mathbb{R}^{m}\rightarrow\mathbb{R}\).

Definition 2.1

Let \(a\in \mathbb{R} ^{n}\), \(a\neq0\), and \(\beta\in \mathbb{R} \). The set \(H_{-}(a,\beta)\), called a half-space, is defined as

$$ H_{-}(a,\beta):=\bigl\{ z\in \mathbb{R} ^{n}\mid \langle a,z \rangle\leq\beta\bigr\} . $$
(2.1)

When equality holds in (2.1), the set is called a hyper-plane and is denoted by \(H(a,\beta)\).

Definition 2.2

Let C be a non-empty, closed, and convex subset of \(\mathbb{R} ^{n}\). For any point \(x\in \mathbb{R} ^{n}\), there exists a unique point \(P_{C}(x)\) in C that is closest to x in the sense of the Euclidean norm; that is,

$$ \bigl\Vert x-P_{C} ( x ) \bigr\Vert \leq \Vert x-y\Vert \quad \text{for all }y\in C. $$
(2.2)

The mapping \(P_{C}:\mathbb{R} ^{n}\rightarrow C\) is called the orthogonal or metric projection of \(\mathbb{R} ^{n}\) onto C. The metric projection \(P_{C}\) is characterized [11], Section 3, by the following two properties:

$$ P_{C}(x)\in C $$
(2.3)

and

$$ \bigl\langle x-P_{C} ( x ) ,P_{C} ( x ) -y \bigr\rangle \geq0\quad\text{for all }x\in \mathbb{R} ^{n}, y\in C, $$
(2.4)

where equality holds in (2.4) if C is a hyper-plane.

A simple example in which the projection has a closed-form formula is the following.

Example 2.3

The orthogonal projection of a point \(x\in \mathbb{R} ^{n}\) onto \(H_{-}(a,\beta) \) is given by

$$ P_{H_{-}(a,\beta)}(x):=\textstyle\begin{cases} x-\frac{ \langle a,x \rangle-\beta}{ \Vert a\Vert ^{2}}a & \text{if } \langle a,x \rangle>\beta,\\ x & \text{if } \langle a,x \rangle\leq\beta. \end{cases} $$
(2.5)
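For concreteness, (2.5) is straightforward to implement. The following NumPy sketch is our own illustration (the function name project_halfspace and the use of NumPy are our choices, not part of the paper):

import numpy as np

def project_halfspace(x, a, beta):
    """Orthogonal projection of x onto H_-(a, beta) = {z : <a, z> <= beta}, cf. (2.5)."""
    s = float(np.dot(a, x))
    if s <= beta:
        return x                                     # x already lies in the half-space
    return x - (s - beta) / np.dot(a, a) * a         # move along the normal a onto H(a, beta)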

2.1 Projection methods

Projection methods (see, e.g., [12–14]) were first used to solve systems of linear equations in Euclidean spaces in the 1930s and were subsequently extended to systems of linear inequalities. The basic step in these early algorithms consists of a projection onto a hyper-plane or a half-space. Modern projection methods are more sophisticated and can solve the general Convex Feasibility Problem (CFP) in a Hilbert space, see, e.g., [15].

In general, projection methods are iterative algorithms that use projections onto sets while relying on the general principle that when a family of (usually closed and convex) sets is present, projections onto the given individual sets are easier to perform than projections onto other sets (intersections, image sets under some transformation, etc.) that are derived from the given individual sets. These methods have a nice geometrical interpretation; moreover, their main advantages are low computational effort and stability. This is the major reason for their success in real-world applications, see [16, 17].

As two prominent classical examples of projection methods, we recall the Kaczmarz [18] and Cimmino [19] algorithms for solving linear systems of the form \(Ax=b\) as above. Denote by \(a^{i}\) the ith row of A. In our presentation the algorithms are restricted to exact projections onto the corresponding hyper-planes, while in general relaxation is also permitted.

Algorithm 2.4

(Kaczmarz method)

Step 0::

Let \(x^{0}\) be an arbitrary initial point in \(\mathbb{R} ^{n}\) and set \(k=0\).

Step 1::

Given the current iterate \(x^{k}\), compute the next iterate by

$$ x^{k+1}=P_{H(a^{i},b_{i})}\bigl(x^{k}\bigr):=x^{k}+ \frac{b_{i}- \langle a^{i},x^{k} \rangle}{ \Vert a^{i}\Vert ^{2}}a^{i}, $$
(2.6)

where \(i=(k \bmod m)+1\) cycles through the rows of A.

Step 2::

Set \(k\leftarrow(k+1)\) and return to Step 1.
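A minimal NumPy sketch of Algorithm 2.4 may look as follows; the residual-based stopping rule and the zero starting point are our own additions for illustration:

import numpy as np

def kaczmarz(A, b, max_iter=10000, tol=1e-3):
    """Cyclic Kaczmarz iteration (2.6): one projection onto a hyper-plane H(a^i, b_i) per step."""
    m, n = A.shape
    x = np.zeros(n)
    for k in range(max_iter):
        i = k % m                                       # cyclic row control (rows indexed 0,...,m-1 here)
        ai = A[i]
        x = x + (b[i] - ai @ x) / (ai @ ai) * ai        # exact projection onto H(a^i, b_i)
        if np.linalg.norm(A @ x - b) <= tol:
            break
    return x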

Algorithm 2.5

(Cimmino method)

Step 0::

Let \(x^{0}\) be an arbitrary initial point in \(\mathbb{R} ^{n}\) and set \(k=0\).

Step 1::

Given the current iterate \(x^{k}\), compute the next iterate by

$$ x^{k+1}:=\frac{1}{m}\sum_{i=1}^{m} \biggl( x^{k}+2\frac{b_{i}- \langle a^{i},x^{k} \rangle}{ \Vert a^{i}\Vert ^{2}}a^{i} \biggr) . $$
(2.7)
Step 2::

Set \(k\leftarrow(k+1)\) and return to Step 1.
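Analogously, a sketch of the simultaneous Cimmino iteration (2.7); again the stopping rule and starting point are our own choices for illustration:

import numpy as np

def cimmino(A, b, max_iter=10000, tol=1e-3):
    """Cimmino iteration (2.7): average the reflected row projections in every step."""
    m, n = A.shape
    x = np.zeros(n)
    row_norms = np.sum(A * A, axis=1)                   # ||a^i||^2 for every row
    for _ in range(max_iter):
        residual = b - A @ x                            # all b_i - <a^i, x> at once
        x = x + (2.0 / m) * (A.T @ (residual / row_norms))
        if np.linalg.norm(A @ x - b) <= tol:
            break
    return x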

Moreover, in order to develop the process by which we improve a matrix's condition number, the following concepts are essential.

Definition 2.6

Let A be an \(m\times n\) real (complex) matrix of rank r. The singular value decomposition of A is a factorization of the form \(A=U\Sigma V^{\ast}\), where U is an \(m\times m\) real or complex unitary matrix, Σ is an \(m\times n\) rectangular diagonal matrix with non-negative real numbers on the diagonal, and \(V^{\ast}\) is the conjugate transpose of an \(n\times n\) real or complex unitary matrix V. The diagonal entries \(\sigma_{i}\) of Σ, which satisfy \(\sigma_{1}\geq\sigma_{2}\geq\cdots\geq\sigma_{r}>0=\sigma_{r+1}=\cdots =\sigma_{\min\{m,n\}}\), are known as the singular values of A. The m columns \(u_{1},\dots,u_{m}\) of U and the n columns \(v_{1},\dots,v_{n}\) of V are called the left-singular vectors and right-singular vectors of A, respectively.

Definition 2.7

The condition number \(\kappa(A)\) of an \(m\times n\) matrix A is given by

$$ \kappa(A)=\frac{\sigma_{1}}{\sigma_{r}} $$
(2.8)

and is a measure of its degeneracy. A is said to be well-conditioned if \(\kappa(A)\approx1\) and increasingly ill-conditioned the farther \(\kappa(A)\) is from unity.
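Numerically, κ(A) is read off the singular values; the following small snippet is our own illustration (the rank-detection tolerance rtol is an assumption, not part of Definition 2.7):

import numpy as np

def condition_number(A, rtol=1e-12):
    """kappa(A) = sigma_1 / sigma_r as in (2.8), with a relative tolerance to detect the rank r."""
    s = np.linalg.svd(A, compute_uv=False)              # singular values, sorted in descending order
    s = s[s > rtol * s[0]]                              # keep sigma_1 >= ... >= sigma_r > 0
    return s[0] / s[-1]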

3 Singular Value Homogenization

The ill-conditioning of a linear inverse problem \(Ax=z\) is directly visible in the singular value decomposition (SVD) \(A=U\Sigma V^{T}\) of its associated matrix, namely in the ratio \(\sigma_{\mathrm{max}}/\sigma_{\mathrm{min}}\). Changes in the data along the first and last right singular vectors (or, more generally, along any two right singular vectors whose ratio of corresponding singular values is large) are reflected essentially only in measurement changes along the major left singular vector, which poses challenges in achieving sufficient accuracy with respect to the minor singular vectors.

A new geometrical interpretation of the above can be given in the language of projection methods. This conflicting behavior along singular vectors corresponds to projections onto hyper-planes whose normal vectors are nearly identically aligned, i.e. for any two such unit normal vectors \(n_{1},n_{2}\in\mathbb{R}^{n}\) the dot product \(\langle n_{1},n_{2}\rangle\) is close to unity. A toy example for \(A\in\mathbb{R}^{3\times2}\), which will be used for visualization, is provided on the left in Figure 1.

Figure 1: Illustration of ill-conditioning (left) as a challenge to linear solvers, compared to the preconditioning step (SVH) described herein (right).

Such a high degree of alignment poses challenges to classical projection methods since the progress made in each iteration is small. A much more favorable situation arises when the normal vectors' directions are spread nearly evenly over the unit circle, which lowers the condition number of the problem. The system depicted on the right in Figure 1 is obtained from the ill-conditioned one on the left through the easily invertible Singular Value Homogenization (SVH) transformation described below and visibly enjoys a better condition. Also plotted is the progress made by the classical Kaczmarz projection method, which confirms the improved run-time (left: first 50 iterations without convergence; right: convergence after seven steps).

3.1 The transformation

To achieve a better condition number \(\kappa(A)\) of A, we directly manipulate its SVD by introducing the SVH matrix \(\Gamma =\operatorname{diag}(\gamma_{1},\gamma_{2},\dots,\gamma_{n})\in\mathbb{R}^{n\times n}\) (\(\gamma_{i}\neq0\)), which multiplies the singular values \((\sigma_{1},\sigma _{2},\dots,\sigma_{r})\), where \(r\leq\min\{n,m\}\) is the rank of A:

$$ \tilde{A}=U\Sigma\Gamma V^{T}. $$
(3.1)

By proper choice of \(\gamma_{1},\dots,\gamma_{r}\), the singular values \(\tilde{\sigma}_{1},\dots,\tilde{\sigma}_{r}\) of à can be set to arbitrary values. In particular, they may be chosen such that \(\kappa (\tilde{A})=1\). Consequently, solving the transformed problem

$$ \tilde{A}x=z $$
(3.2)

iteratively does not pose difficulties to most (projection) solvers. Assuming that (3.2) admits a solution \(\tilde{x}_{0}\), the question is whether we can (easily) recover a solution \(x_{0}\) satisfying

$$ Ax_{0}=z $$
(3.3)

that is, the original linear subproblem.

Since Γ leaves the range of A, \(\operatorname{ran}A\subset \mathbb{R}^{n}\), invariant, solutions to (3.2) exist if and only if (3.3) admits one. Moreover, setting

$$ x_{0}=V\Gamma V^{T}\tilde{x}_{0} $$
(3.4)

a solution to (3.3) is obtained:

$$ Ax_{0}=\bigl(U\Sigma V^{T}\bigr) \bigl(V\Gamma V^{T}\bigr)\tilde{x}_{0}=\bigl(U\Sigma\Gamma V^{T}\bigr)\tilde{x}_{0}=\tilde{A} \tilde{x}_{0}=z. $$
(3.5)

Thus, by a scaling of the components of \(\tilde{x}_{0}\) in the coordinate system of A's right singular vectors, which poses no computational difficulties, we can solve the original problem (3.3) by working out the solution to the simpler formulation (3.2).
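The identities (3.1)-(3.5) are easy to verify numerically. The following sketch is our own illustration; it assumes a full-column-rank A and chooses Γ such that \(\Sigma\Gamma=\sigma_{r}\cdot I\) (any other non-zero choice works analogously):

import numpy as np

rng = np.random.default_rng(0)
m, n = 100, 3
A = rng.standard_normal((m, n))                      # full column rank with probability one
x_true = np.ones(n)
z = A @ x_true

U, s, Vt = np.linalg.svd(A, full_matrices=False)
gamma = s[-1] / s                                    # Sigma * Gamma = sigma_r * I, hence kappa(A_tilde) = 1
A_tilde = U @ np.diag(s * gamma) @ Vt                # the SVH-transformed matrix (3.1)

x_tilde = np.linalg.lstsq(A_tilde, z, rcond=None)[0] # solve the benign problem (3.2)
x0 = Vt.T @ (gamma * (Vt @ x_tilde))                 # reconstruction (3.4)

print(np.allclose(A @ x0, z))                        # verifies (3.5): A x0 = z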

Example 3.1

For geometric intuition, the assignment in (3.4) can be rewritten as

$$\begin{aligned} x_{0} =& \bigl(V \bigl[(\Gamma- I) + I \bigr]V^{T} \bigr)\tilde{x}_{0} \\ =& \tilde{x}_{0} + \sum_{i=1}^{n} \tilde{\alpha}_{i} (\gamma_{i} - 1)\cdot v_{i} , \end{aligned}$$
(3.6)

where \(\tilde{\alpha}_{i}\) are the V-coordinates of \(\tilde{x}_{0}\) and \(v_{i}\) the right singular vectors.

Equation (3.6) illustrates that the Γ-transformation is in fact a translation of the solution set along these right singular vectors, proportional to the choice of \(\gamma_{i}\). The toy example from Figure 1 is used to demonstrate this effect in Figure 2.

Figure 2: Reconstruction of \(x_{0}\) (\(=x_{\mathrm{opt}}\)).

Here

$$ A= \begin{pmatrix} 1 & 0.8 \\ 1 & 1 \\ 1 & 1.2\end{pmatrix} ,\quad\quad z=A\cdot \begin{pmatrix}100\\100 \end{pmatrix},\quad\quad x_{0}= \begin{pmatrix}100\\100 \end{pmatrix} $$
(3.7)

with

$$ \sigma_{1}=2.46,\quad\quad \sigma_{2}=0.20,\quad\quad \kappa(A)=12.33 $$
(3.8)

and right-singular vectors visualized in red.

Applying the Γ-transformation with \(\gamma_{1}=1\) and \(\gamma_{2}=\sigma_{1}/\sigma_{2}\) (so that \(\Sigma\Gamma=\sigma_{1}\cdot I\)), the transformed inverse problem \(\tilde{A}x=z\) is optimally conditioned with \(\kappa(\tilde{A})=1\) and hence easily solvable, with solution

$$ \tilde{x}_{0}= ( 100.6,99.4 ) ^{T} $$
(3.9)

which is exactly the translation expected from (3.6).

3.2 Main result

The application of this preconditioning process to optimization problems with linear subproblems as in (1.1) is the natural next step.

Theorem 3.2

Given a convex function \(f\in\mathcal{C}^{1}(\mathbb{R}^{n})\), the minimization problem

$$ \min_{x\in\mathbb{R}^{m}}f(Ax) $$
(3.10)

admits the solution

$$ x_{0}=V\Gamma V^{T}\tilde{x}_{0}, $$
(3.11)

where Γ is a diagonal matrix with non-zero diagonal elements and \(\tilde{x}_{0}\) solves

$$ \min_{\tilde{x}\in\mathbb{R}^{m}}f(\tilde{A}\tilde{x}) $$
(3.12)

with à as in (3.1).

Proof

For the minimizer \(x_{0}\) we have

$$ \nabla_{x}f(Ax_{0})=A^{T}\cdot \nabla_{z}f(z) |_{Ax_{0}}=0 $$
(3.13)

and hence

$$ \nabla_{z}f(z) |_{Ax_{0}}\in\ker\bigl(A^{T}\bigr). $$
(3.14)

The statement of the theorem is thus equivalent to showing

$$ \nabla_{z}f(z) |_{\tilde{A}\tilde{x}_{0}}\in\ker\bigl(\tilde{A}^{T} \bigr)\quad\implies\quad\nabla_{z}f(z) |_{Ax_{0}}\in\ker \bigl(A^{T}\bigr) $$
(3.15)

which follows from our previous observation that the Γ-transformation leaves the kernel and range of A invariant, together with (3.5); indeed, \(\ker(\tilde{A}^{T})=(\operatorname{ran}\tilde{A})^{\perp}=(\operatorname{ran}A)^{\perp}=\ker(A^{T})\). □

3.3 The algorithmic scheme

The results of the previous two sections are straightforward to encode into a program usable for actual computation. What follows is a pseudo-code of the general scheme.

Algorithm 3.3

(Singular Value Homogenization)

Step 0::

Let f and A be given as in (1.1).

Step 1::

Compute the SVD \(A=U\Sigma V^{T}\) of A and choose \(\Gamma= \operatorname{diag} (\gamma_{1},\dots,\gamma_{n})\) such that

$$ \kappa \bigl(\tilde{A}=U\Sigma\Gamma V^{T} \bigr)\approx1. $$
(3.16)
Step 2::

Apply any optimization procedure to solve (3.12) and obtain a solution \(\tilde{x}_{0}\).

Step 3::

Reconstruct the original solution \(x_{0}\) of (3.10) via

$$ x_{0}=V\Gamma V^{T}\tilde{x}_{0}. $$
(3.17)

The optimal choice of Γ in Step 1 and the concrete solver used to find \(\tilde{x}_{0}\) in Step 2 are likely problem specific and are for now left as user parameters. A parameter exploration aiming at all-purpose configurations is included in the next section.
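A compact Python rendering of Algorithm 3.3 might look as follows. This is our own sketch under the assumption of a full-rank A; the inner solver is passed in as a callback, mirroring the fact that Steps 1 and 2 are user parameters:

import numpy as np

def svh(A, solve_transformed, target=None):
    """Singular Value Homogenization wrapper (Algorithm 3.3), full-rank A assumed.

    solve_transformed(A_tilde) must return a minimizer x_tilde of f(A_tilde x),
    e.g. any projection or gradient method applied to the well-conditioned problem (3.12).
    """
    # Step 1: SVD of A and a choice of Gamma such that kappa(A_tilde) is approximately 1.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    target = s[-1] if target is None else target     # homogenize all singular values to this level
    gamma = target / s
    A_tilde = U @ np.diag(s * gamma) @ Vt

    # Step 2: solve the transformed problem with the user-supplied routine.
    x_tilde = solve_transformed(A_tilde)

    # Step 3: reconstruct a solution of the original problem via (3.17).
    return Vt.T @ (gamma * (Vt @ x_tilde))

# Example use for the least-squares objective (4.1):
# x0 = svh(A, lambda At: np.linalg.lstsq(At, b, rcond=None)[0])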

Furthermore, due to the near-optimal conditioning achieved in Step 1, the time complexity of Algorithm 3.3 is \(\mathcal{O}(\min\{ mn^{2},m^{2}n\}) \), as it is dominated by the computation of the SVD of A.

This does not necessarily prevent the solution of large linear systems, since in many cases (e.g. in IMRT [20]) either the spectral gap of A is large or the large and small singular values cluster together, which allows for reliable truncated (k-)SVD schemes that can be computed in \(\mathcal{O}(mn\log k)\) time.
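For instance, a randomized truncated SVD can stand in for the full decomposition when only k singular triplets are needed. The snippet below uses scikit-learn's randomized_svd, which is our choice of library for illustration and not part of the original scheme:

import numpy as np
from sklearn.utils.extmath import randomized_svd

rng = np.random.default_rng(0)
A = rng.standard_normal((5000, 200))
k = 20                                               # number of singular triplets to retain
U_k, s_k, Vt_k = randomized_svd(A, n_components=k, random_state=0)
A_k = U_k @ np.diag(s_k) @ Vt_k                      # rank-k surrogate used in place of A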

4 Numerical experiments

All testing was done in both Matlab and Mathematica with negligible performance differences between the two (as both implement the same set of standard minimization algorithms).

4.1 Linear feasibility and linear least squares

The first series of experiments concerns the simplest and most often encountered formulation of (1.1) with

$$ f(Ax)=\Vert Ax-b\Vert_{2}^{2} $$
(4.1)

which corresponds to solving a linear system of equations exactly if a solution exists, or in the least-squares sense if the corresponding hyper-planes have empty intersection (here \(b\in\mathbb{R}^{n}\) is fixed).

As projection methods in general, and the Kaczmarz and Cimmino algorithms in particular, are known to perform well in such settings, we chose these two as benchmarks for Algorithm 3.3. Moreover, to isolate the effect of the Γ-transformation as clearly as possible, the same two algorithms are used as subroutines in Step 2.

Performance was measured on a set \(\{A_{i}\in\mathbb{R}^{100\times3}\} \) of 3,000 randomly generated matrices with \(\kappa(A_{i})\in [1,10^{5}]\) and data \(x_{0}=(1,1,1)^{T}\), on which the respective algorithms were run. The convergence threshold was set to \(10^{-3}\) in all cases and Γ was chosen such that \(\Sigma\Gamma=\sigma_{2}\cdot I\). The results are depicted in Figure 3 and Figure 4 (the two graphs for Algorithm 3.3 indicate whether Kaczmarz (green) or Cimmino (red) was used as the subroutine).
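Although the exact generator used for the matrices \(A_{i}\) is not spelled out here, test matrices with a prescribed condition number can be produced, for example, by fixing the singular value spectrum explicitly. The following construction is one possible (assumed) variant, written by us for illustration:

import numpy as np

def random_matrix_with_condition(m, n, kappa, rng):
    """Random m x n matrix whose singular values decay geometrically from kappa down to 1."""
    U, _ = np.linalg.qr(rng.standard_normal((m, n)))     # m x n with orthonormal columns
    V, _ = np.linalg.qr(rng.standard_normal((n, n)))     # n x n orthogonal
    s = kappa ** (1.0 - np.arange(n) / (n - 1))          # sigma_1 = kappa, ..., sigma_n = 1
    return U @ np.diag(s) @ V.T

rng = np.random.default_rng(42)
A = random_matrix_with_condition(100, 3, 1e4, rng)
print(np.linalg.cond(A))                                 # approximately 1e4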

Figure 3: Comparison of the number of iterations as a function of the condition number for the Kaczmarz, Cimmino, SVH-with-Kaczmarz, and SVH-with-Cimmino methods. The stopping criterion is \(\Vert x_{\mathrm{opt}}-x\Vert\leq10^{-3}\), where \(x_{\mathrm{opt}}=(1,1,1)^{T}\).

Figure 4: Time needed by Algorithm 3.3 (red, green), the Kaczmarz method (blue), and the Cimmino method (black).

As expected from the results obtained in the preceding sections, both projection solvers scale poorly (indeed exponentially) with the condition number of A, while Algorithm 3.3 requires essentially constant time (≈0.02 s and ≈0.06 s, respectively) and a constant number of iterations (≈10).

In addition, reducing the accuracy threshold (\(<10^{-4}\)) or constructing matrices of extreme condition (\(\kappa(A) \geq10^{6}\)), for which the Cimmino, Kaczmarz, and LAPACK solvers native to Matlab and Mathematica fail to converge, does not impair the performance of Algorithm 3.3. That is, through an appropriate Γ-transformation we were able, for the first time, to solve very ill-conditioned linear problems to \(10^{-5}\) accuracy within seconds.

4.2 \(L^{p}\) penalties and one-sided \(L^{p}\) penalties

In the biomedical field of cancer treatment planning, problems of the kind (1.1) often arise in calculating the optimal dose deposition in patient tissue. A typical formulation involves the linearized convolution A of radiation x into dose d and a reference dose \(r\in\mathbb{R}^{n}\) which is to be achieved under \(L^{p}\) penalties \(\Vert Ax-r\Vert_{p}\) or their one-sided variants \(\Vert\max\{0,Ax-r\}\Vert_{p}\) and \(\Vert \min \{0,Ax-r\}\Vert_{p}\).

We examined five cases \(\{A_{i},r_{i}\}\) collected from patient data under the penalties

$$\begin{aligned}& f_{1}(d) =\Vert d-r\Vert_{2}, \end{aligned}$$
(4.2)
$$\begin{aligned}& f_{2}(d) =\Vert d-r\Vert_{8}, \end{aligned}$$
(4.3)
$$\begin{aligned}& f_{3}(d) =\bigl\Vert \max(0,d-r)\bigr\Vert _{2}, \end{aligned}$$
(4.4)
$$\begin{aligned}& f_{4}(d) =\bigl\Vert \min(0,d-r)\bigr\Vert _{2}, \end{aligned}$$
(4.5)
$$\begin{aligned}& f_{5}(d) =f_{1}(d\cdot s_{1})+f_{2}(d \cdot s_{2})+f_{3}(d\cdot s_{3})+f_{4}(d \cdot s_{4}), \end{aligned}$$
(4.6)

where \(s_{i}\in\{0,1\}^{n}\) with \(\sum s_{i}=\boldsymbol{1}_{\mathbb{R} ^{n}} \) is a partition of the unit vector accounting for the varied sensitivity of distinct body tissues to radiation.
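For reference, the objectives (4.2)-(4.6) translate directly into code. The sketch below is our own, with the masks \(s_{i}\) supplied as boolean arrays; the interaction of the reference dose r with the masks in (4.6) is taken literally from the formula:

import numpy as np

def dose_penalties(d, r, masks):
    """Evaluate the objectives (4.2)-(4.6); masks is a list of four boolean arrays partitioning the voxels."""
    f1 = lambda x: np.linalg.norm(x - r, ord=2)                     # (4.2)
    f2 = lambda x: np.linalg.norm(x - r, ord=8)                     # (4.3)
    f3 = lambda x: np.linalg.norm(np.maximum(0.0, x - r), ord=2)    # (4.4)
    f4 = lambda x: np.linalg.norm(np.minimum(0.0, x - r), ord=2)    # (4.5)
    s1, s2, s3, s4 = masks
    f5 = f1(d * s1) + f2(d * s2) + f3(d * s3) + f4(d * s4)          # (4.6)
    return f1(d), f2(d), f3(d), f4(d), f5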

The performance of Algorithm 3.3 in comparison to the native Matlab and Mathematica methods is given in Table 1. \(t_{i}^{\bullet }\) is the average time in seconds until convergence required by Mathematica's NMinimize and Matlab's fminunc routines, whereas Algorithm 3.3 needs \(t_{i}^{\circ}\) seconds to converge. In the cases where neither Mathematica nor Matlab found a solution (nC for not converging), the accuracy of Algorithm 3.3's output \(x_{0}\) is tested through the parameter μ. This is done by randomly sampling a neighborhood of \(x_{0}\) and counting the instances that improve the objective; these hits are then sampled in the same way until no further such points can be detected. μ is the total number of neighborhoods checked in this way. In all cases, the improvement in f remained below \(10^{-4}\).

Table 1 Comparison for nonlinear objective function

The results parallel those seen in the linear feasibility formulation and encourage further exploration.

5 Conclusion

We were able to reduce the time needed to solve a general convex optimization problem with a linear subproblem for modestly sized matrices. The performance of the proposed algorithm was compared to classical LAPACK and projection methods, showing an improvement in run-times by a factor of up to 1,190. Additionally, in many cases where the LAPACK and projection solvers failed to converge, Singular Value Homogenization found \(10^{-4}\)-accurate solutions. These results are promising and encourage further exploration of SVH; especially its application to structured large matrices and constrained optimization, as well as in-depth parameter explorations, may well turn out to be worthwhile.