In this section, we consider the particular network architecture given by unrolled proximal gradient schemes, as shown in Sect. 3.2. We aim at embedding this approach into the classical regularization theory for inverse problems. For a strict mathematical analysis, we will introduce the notion of an analytic deep prior network, which then allows us to interpret the training of the deep prior network as the optimization of a Tikhonov functional. The main result of this section is Theorem 4.2, which states that analytic deep priors in combination with a suitable stopping rule are indeed order optimal regularization schemes. Numerical experiments in Sect. 4.2 demonstrate that such deep prior approaches lead to smaller reconstruction errors when compared with standard Tikhonov reconstructions. The superiority of this approach can be proved, however, only for the rather unrealistic case that the solution coincides with a singular function of A.
Unrolled Proximal Gradient Networks as Deep Priors for Inverse Problems
In this section, we consider linear operators A and aim at rephrasing DIP, i.e., the minimization of (1.1) with respect to \(\varTheta \), as a constrained optimization problem. This change of view, i.e., regarding deep inverse priors as the optimization of a simple but constrained functional rather than as the training of a network, opens the way for analytic investigations. We will use an unrolled proximal gradient architecture for the network \(\varphi _\varTheta (z)\) in (1.1). The starting point for our investigation is the common observation, see [11, 16] or “Appendix 1”, that an unrolled proximal gradient scheme as defined in Sect. 3.2 approximates a minimizer x(B) of (3.3). Assuming that a unique minimizer x(B) exists and neglecting the difference between x(B) and the approximation \(\varphi _\varTheta (z)\) produced by the unrolled proximal gradient scheme motivates the following definition of analytic deep priors.
Definition 4.1
Let us assume that measured data \(y^\delta \in Y\), a fixed \(\alpha >0\), a convex penalty functional \(R:X\rightarrow \mathbb {R}\) and a measurement operator \(A \in {\mathscr {L}}(X,Y)\) are given. We consider the minimization problem
$$\begin{aligned} \min _B F(B)= \min _B\frac{1}{2} \Vert A x(B) - y^\delta \Vert ^2, \end{aligned}$$
(4.1)
subject to the constraint
$$\begin{aligned}&x(B) = {{\,\mathrm{\mathrm{arg\,min}}\,}}_x J_B(x)\nonumber \\&\quad = {{\,\mathrm{\mathrm{arg\,min}}\,}}_x\frac{1}{2} \Vert B x - y^\delta \Vert ^2 + \alpha R(x). \end{aligned}$$
(4.2)
We assume that for every \(B \in {\mathscr {L}}(X,Y)\), there is a unique minimizer x(B). We call this constrained minimization problem an analytic deep prior and denote by x(B) the resulting solution to the inverse problem posed by A and \(y^\delta \).
We can also use this technical definition as the starting point of our considerations and retrieve the neural network architecture by considering the following approach for solving the minimization problem stated in the above definition. Assuming that R has a proximal operator, we can compute x(B), given B, via the proximal gradient method, that is, for a suitable choice of \(\lambda >0\) and an arbitrary \(x^0=z\in X\), via the convergent iteration
$$\begin{aligned} x^{k+1} = {{\,\mathrm{{\text {Prox}}}\,}}_{\lambda \alpha R} \left( x^k - \lambda B^*(Bx^k-y^\delta )\right) . \end{aligned}$$
(4.3)
Carrying out L steps of this iteration can be seen as the forward pass of a particular architecture of a fully connected feedforward network with L layers of identical size, as described in (3.1) and (3.2). The affine linear map given by \(\varTheta =(W,b)\) is the same for all layers. Moreover, the activation function of the network is given by the proximal mapping of \(\lambda \alpha R\), the matrix W is determined via \(I-W = \lambda B^* B\) (I denotes the identity operator), and the bias is given by \(b = \lambda B^* y^\delta \).
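For a concrete picture, consider \(R(x)=\frac{1}{2}\Vert x\Vert ^2\), whose proximal operator is \({{\,\mathrm{{\text {Prox}}}\,}}_{\lambda \alpha R}(v) = v/(1+\lambda \alpha )\). The following minimal NumPy sketch implements the forward pass of such an unrolled network in a discretized setting; the matrix B, the number of layers L, the step size lam and the value of alpha are illustrative placeholders, not quantities prescribed by the text.

```python
import numpy as np

def unrolled_prox_grad(B, y, alpha, lam, L, z=None):
    """Forward pass of the unrolled proximal gradient iteration (4.3)
    for R(x) = 0.5 * ||x||^2, whose proximal map is v -> v / (1 + lam * alpha).
    Every layer applies the shared affine map W x + b with W = I - lam * B^T B
    and b = lam * B^T y, followed by the proximal 'activation'."""
    n = B.shape[1]
    x = np.zeros(n) if z is None else z            # arbitrary network input z = x^0
    W = np.eye(n) - lam * B.T @ B                  # shared weights
    b = lam * B.T @ y                              # shared bias
    for _ in range(L):                             # L identical layers
        x = (W @ x + b) / (1.0 + lam * alpha)      # prox of lam * alpha * R
    return x
```

For a suitable step size \(\lambda \) and sufficiently many layers, the output approximates the Tikhonov-type minimizer x(B) of \(J_B\).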
From now on we will assume that the difference between \(x^L\) and x(B) is negligible, i.e.,
$$\begin{aligned} x^L = x(B). \end{aligned}$$
(4.4)
Remark 4.1
The task in the DIP approach is to find \(\varTheta \) (network parameters). Analogously, in the analytic deep prior, we try to find the operator B.
We now examine the analytic deep prior when the proximal gradient descent approach described above is used to compute x(B). To this end, we focus on the minimization of (4.1) with respect to B for given data \(y^\delta \) by means of gradient descent.
The stationary points are characterized by \(\partial F(B)=0\), and gradient descent iterations with stepsize \(\eta \) are given by
$$\begin{aligned} B^{\ell +1} = B^\ell - \eta \partial F (B^\ell ). \end{aligned}$$
(4.5)
Hence, we need to compute the derivative of F with respect to B.
Lemma 4.1
Consider an analytic deep prior with the proximal gradient descent approach as described above. We define
$$\begin{aligned} \psi (x,B) = {{\,\mathrm{{\text {Prox}}}\,}}_{\lambda \alpha R} \left( x - \lambda B^*(Bx-y^\delta )\right) - x. \end{aligned}$$
(4.6)
Then,
$$\begin{aligned} \partial F (B) = \partial x(B)^*A^*(Ax(B) - y^\delta ) \end{aligned}$$
(4.7)
with
$$\begin{aligned} \partial x(B) = - \psi _x(x(B), B)^{-1} \psi _B(x(B),\, B), \end{aligned}$$
(4.8)
which leads to the gradient descent
$$\begin{aligned} B^{\ell +1}= B^\ell - \eta \partial F (B^\ell ). \end{aligned}$$
(4.9)
This lemma allows us to obtain an explicit description of the gradient descent for B, which in turn leads to an iteration of functionals \(J_B\) and minimizers x(B). We will now exemplify this derivation for a rather academic example, which nevertheless highlights the differences between a classical Tikhonov minimizer, i.e.,
$$\begin{aligned} x(A) = {{\,\mathrm{\mathrm{arg\,min}}\,}}_x \frac{1}{2} \Vert A x - y^\delta \Vert ^2 + \frac{\alpha }{2} \Vert x \Vert ^2, \end{aligned}$$
and the solution of the DIP approach.
Example
In this example, we examine analytic deep priors for a linear inverse problem given by \(A:X \rightarrow Y\), i.e., \(A, B \in {\mathscr {L}}(X,Y)\), and
$$\begin{aligned} R(x)=\frac{1}{2}\Vert x\Vert ^2. \end{aligned}$$
(4.10)
The rather abstract characterization of the previous section can be made explicit for this setting. Since \(J_B(x)\) is the classical Tikhonov functional, whose minimizer is given by
$$\begin{aligned} x(B) = (B^*B+\alpha I)^{-1}B^*y^\delta , \end{aligned}$$
(4.11)
we can rewrite the analytic deep prior reconstruction as x(B), where B minimizes
$$\begin{aligned} F(B) = \frac{1}{2} \Vert A (B^*B+\alpha I)^{-1}B^*y^\delta - y^\delta \Vert ^2. \end{aligned}$$
(4.12)
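Both x(B) and F(B) can thus be evaluated in closed form, which is convenient for checking an unrolled-network implementation against the exact minimizer. A brief NumPy sketch under the same quadratic penalty (matrix sizes and the concrete operators are placeholders):

```python
import numpy as np

def x_of_B(B, y, alpha):
    """Tikhonov-type minimizer x(B) = (B^T B + alpha I)^(-1) B^T y, cf. (4.11)."""
    n = B.shape[1]
    return np.linalg.solve(B.T @ B + alpha * np.eye(n), B.T @ y)

def F(B, A, y, alpha):
    """Outer objective F(B) = 0.5 * ||A x(B) - y||^2, cf. (4.12)."""
    r = A @ x_of_B(B, y, alpha) - y
    return 0.5 * float(r @ r)
```

Evaluating F at B = A gives the data misfit of the classical Tikhonov reconstruction x(A), which is exactly the initialization \(B^0 = A\) used in the next lemma.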
Lemma 4.2
Following Lemma 4.1, setting \(B^0=A\) and computing one step of gradient descent for (4.12) with respect to B yields
$$\begin{aligned} B^1 = A - \eta \partial F(A) \end{aligned}$$
(4.13)
with
$$\begin{aligned} \partial F (A)&=\partial x (A)^*A^*(Ax(A) - y^\delta ) \nonumber \\&= \alpha AA^*y^\delta ({y^\delta })^* A {\left( A^*A + \alpha I \right) ^{-3} } \end{aligned}$$
(4.14)
$$\begin{aligned}&\quad +\alpha A { \left( A^*A + \alpha I \right) ^{-3} } A^* y^\delta ({y^\delta })^* A \nonumber \\&\quad -\alpha {y^\delta } ({y^\delta })^* A{ \left( A^*A + \alpha I \right) ^{-2} }. \end{aligned}$$
(4.15)
This expression nicely collapses if \({y^\delta } ({y^\delta })^* \) commutes with \(AA^*\). For illustration, we assume the rather unrealistic case that \(x^+=u\), where u is a singular function of A with singular value \(\sigma \). The dual singular function is denoted by v, i.e., \(Au=\sigma v\) and \(A^* v= \sigma u\), and we further assume that the measurement noise in \(y^\delta \) points in the direction of this singular function, i.e., \(y^\delta = (\sigma + \delta ) v\), as shown in Fig. 3. In this case, the problem is indeed one-dimensional, and we obtain an iteration restricted to the span of u, respectively the span of v.
Lemma 4.3
The setting described above yields the following gradient step for the functional in (4.12):
$$\begin{aligned} B^{\ell +1} = B^\ell - c_\ell v u^* \end{aligned}$$
(4.16)
with
$$\begin{aligned} c_\ell =c(\alpha , \delta , \sigma , \eta )=\eta \sigma (\sigma + \delta )^2 (\alpha + \beta _\ell ^2 -\sigma \beta _\ell ) \frac{\beta _\ell ^2 - \alpha }{(\beta _\ell ^2 + \alpha )^3}, \end{aligned}$$
where \(\beta _\ell \) denotes the singular value of \(B^\ell \) with respect to the singular functions u and v, i.e., \(B^\ell u = \beta _\ell v\) and \(\beta _0 = \sigma \). The iteration (4.16) in turn results in the sequence \(x(B^{\ell })\) with the unique attractive stationary point
$$\begin{aligned} x = {\left\{ \begin{array}{ll} \frac{1}{2\sqrt{\alpha }}(\sigma + \delta ) u, &{} \sigma < 2 \sqrt{\alpha }\\ \frac{1}{\sigma }(\sigma + \delta ) u, &{} \text {otherwise.} \end{array}\right. } \end{aligned}$$
(4.17)
For comparison, the classical Tikhonov regularization would yield \(\frac{\sigma }{\sigma ^2 + \alpha }(\sigma + \delta ) u\). This is depicted in Fig. 4.
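This one-dimensional picture can be reproduced numerically. The following NumPy sketch evaluates the coefficient of u in the stationary point (4.17) and in the Tikhonov reconstruction over a range of singular values (the values of \(\alpha \) and \(\delta \) are arbitrary illustrative choices, cf. Fig. 4):

```python
import numpy as np

alpha, delta = 0.01, 0.05                 # illustrative regularization and noise levels
sigma = np.logspace(-3, 0, 200)           # range of singular values of A

# coefficient of u in the stationary point (4.17) of the analytic deep prior
dip_coeff = np.where(sigma < 2 * np.sqrt(alpha),
                     (sigma + delta) / (2 * np.sqrt(alpha)),
                     (sigma + delta) / sigma)

# coefficient of u in the classical Tikhonov reconstruction
tik_coeff = sigma / (sigma ** 2 + alpha) * (sigma + delta)

# the exact solution u has coefficient 1
print(np.max(np.abs(dip_coeff - 1)), np.max(np.abs(tik_coeff - 1)))
```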
Constrained System of Singular Functions
In the previous example, we showed that if we perform gradient descent starting from \(B^0=A\) and assume the rather simple case \(y^\delta = (\sigma + \delta ) v\), we obtain the iteration \(B^{\ell +1} = B^\ell - c_\ell v u^*\), i.e., \(B^{\ell +1}\) has the same singular functions as A and only one of the singular values changes.
We now analyze the optimization from a different perspective. Namely, we focus on directly finding a minimizer of (4.1) for a general \(y^\delta \in Y\); however, we restrict B to be an operator such that \(B^*B\) commutes with \(A^*A\), i.e., A and B share a common system of singular functions. Hence, B has the representation
$$\begin{aligned} B=\sum _i \beta _i v_i u_i^*, \quad \ \beta _i \in \mathbb {R}_+\cup \{0\}, \end{aligned}$$
(4.18)
where \(\{u_i, \sigma _i, v_i\}\) is the singular value decomposition of A. That is, we restrict the problem to finding optimal singular values \(\beta _i\) for B. In this setting, we show that a global minimizer exists and that it has interesting properties.
Theorem 4.1
For any \(y^\delta \in Y\), there exists a global minimizer (in the constrained singular function setting) of (4.1) given by \(B_\alpha =\sum \beta _i^\alpha v_i u_i^*\) with
$$\begin{aligned} \beta _i^\alpha = {\left\{ \begin{array}{ll} \frac{\sigma _i}{2} + \sqrt{\frac{\sigma _i^2}{4} - \alpha } &{} \quad \sigma _i \ge 2\sqrt{\alpha }\\ \sqrt{\alpha } &{} \quad \sigma _i < 2\sqrt{\alpha } \\ \end{array}\right. }. \end{aligned}$$
(4.19)
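The effect of these optimal singular values is easy to check numerically: in \(x(B_\alpha )\), the data coefficient \(\langle y^\delta , v_i\rangle \) is multiplied by \(\beta _i^\alpha /((\beta _i^\alpha )^2 + \alpha )\), which equals \(1/\sigma _i\) for \(\sigma _i \ge 2\sqrt{\alpha }\) and \(1/(2\sqrt{\alpha })\) otherwise, i.e., large singular values are not damped at all. A small NumPy check (the value of \(\alpha \) is an arbitrary illustrative choice):

```python
import numpy as np

alpha = 0.01
sigma = np.logspace(-3, 0, 200)                        # singular values of A

# optimal singular values (4.19); the maximum() avoids NaN in the unused branch
beta = np.where(sigma >= 2 * np.sqrt(alpha),
                sigma / 2 + np.sqrt(np.maximum(sigma ** 2 / 4 - alpha, 0.0)),
                np.sqrt(alpha))

gain = beta / (beta ** 2 + alpha)                      # factor applied to <y, v_i> in x(B_alpha)
expected = np.where(sigma >= 2 * np.sqrt(alpha), 1 / sigma, 1 / (2 * np.sqrt(alpha)))
print(np.max(np.abs(gain - expected)))                 # numerically zero
```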
Remark 4.2
The singular values obtained in Theorem 4.1 match the ones obtained in the previous example for general B but simple \(y^\delta = (\sigma + \delta ) v\).
Remark 4.3
The minimizer from Theorem 4.1 does not depend on \(y^\delta \), i.e., for all \(y^\delta \in Y\) it holds that \(B_\alpha \) is a minimizer of (4.1). The solution to the inverse problem does still depend on \(y^\delta \), since
$$\begin{aligned} x(B_\alpha ) = {{\,\mathrm{\mathrm{arg\,min}}\,}}_x\frac{1}{2} \Vert B_\alpha x - y^\delta \Vert ^2 + \alpha R(x). \end{aligned}$$
(4.20)
Remark 4.4
In the original DIP approach, some of the parameters of the network may be similar for different \(y^\delta \), for example, the parameters of the first layers of the encoder part of the UNet. Other parameters may strongly depend on \(y^\delta \). In this particular case of the analytic deep prior (constrained system of singular functions), we have an explicit separation of which parameters (\(b = \lambda B^* y^\delta \)) depend on \(y^\delta \) and which do not (\(W =I - \lambda B^* B\)).
From now on, we use the notation \(x(B,\,y^\delta )\) to make the dependency of x(B) on \(y^\delta \) explicit. Following the classical filter theory for order optimal regularization schemes [13, 21, 28], we obtain the following theorem.
Theorem 4.2
The pseudoinverse \(K_\alpha : Y \rightarrow X\) defined as
$$\begin{aligned} K_\alpha (y^\delta ) := x(B_\alpha ,\, y^\delta ) \end{aligned}$$
(4.21)
is an order optimal regularization method given by the filter functions
$$\begin{aligned} F_\alpha (\sigma ) = {\left\{ \begin{array}{ll} 1 &{} \quad \sigma \ge 2\sqrt{\alpha }\\ \frac{\sigma }{2\sqrt{\alpha }} &{} \quad \sigma < 2\sqrt{\alpha } \\ \end{array}\right. }. \end{aligned}$$
(4.22)
The regularized pseudoinverse \(K_\alpha \) is quite similar to the truncated singular value decomposition (TSVD) but can be regarded as a softer version, since its filter function does not have a jump (see Fig. 5). We call this method Soft TSVD.
The disadvantage of Tikhonov regularization, in this case, is that it damps all singular values, whereas the disadvantage of TSVD is that it discards all information related to small singular values. The Soft TSVD, on the other hand, does not damp the larger singular values (similar to TSVD) and does not discard the information related to smaller singular values but merely damps it (similar to Tikhonov). For a comparison of the filter functions, see Table 1. Moreover, it is noteworthy how this method emerges from Definition 4.1, which is stated in terms of the Tikhonov pseudoinverse, and that the optimal singular values do not depend on \(y^\delta \).
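The comparison of the filters can be made concrete with a short NumPy sketch; the standard Tikhonov and TSVD filters are included for reference, and the TSVD truncation level is set to \(2\sqrt{\alpha }\) here purely to align the comparison (the order-optimality ranges of Table 1 are not reproduced).

```python
import numpy as np

def filter_soft_tsvd(sigma, alpha):
    """Filter function (4.22) of the Soft TSVD K_alpha."""
    return np.where(sigma >= 2 * np.sqrt(alpha), 1.0, sigma / (2 * np.sqrt(alpha)))

def filter_tikhonov(sigma, alpha):
    """Classical Tikhonov filter: every singular value is damped."""
    return sigma ** 2 / (sigma ** 2 + alpha)

def filter_tsvd(sigma, alpha):
    """TSVD filter, truncating at 2*sqrt(alpha) (illustrative choice)."""
    return np.where(sigma >= 2 * np.sqrt(alpha), 1.0, 0.0)

alpha = 0.01
sigma = np.array([0.05, 0.5])            # one small and one large singular value
# each method reconstructs x = sum_i F(sigma_i)/sigma_i * <y, v_i> u_i
for f in (filter_soft_tsvd, filter_tikhonov, filter_tsvd):
    print(f.__name__, f(sigma, alpha))
```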
Table 1 Values of \(\nu \) for which TSVD, Tikhonov and the Soft TSVD are order optimal

At this point, the relation to the original DIP approach becomes more abstract. We considered a simplified network architecture in which all layers share the same weights, and these weights stem from an iterative algorithm for solving inverse problems. That means we let the solution of the original inverse problem be the solution of another problem with a different operator B. The DIP approach is in this case transformed into finding an optimal B, which allows us to carry out the analysis in a functional analytic setting. What we learn from the previous results is that interesting connections can be established between the DIP approach and classical inverse problems theory. This is important because it shows that deep inverse priors can be used to solve genuinely ill-posed inverse problems.
Remark 4.5
In the original DIP, the input z to the network is chosen arbitrarily and is of minor importance. However, once the weights have been trained for a given \(y^\delta \), z cannot be changed because it would affect the output of the network, i.e., it would change the obtained reconstruction. In the analytic deep prior, the input to the unrolled proximal gradient method is completely irrelevant (assuming an infinite number of layers). After finding the “weights” B, a different input will still produce the same solution \(\hat{x} = x(B) = \varphi _\varTheta (z)\).
Remark 4.5 tells us that there is still a gap between the original DIP and the analytic one. This was expected because of the obvious trivialization of the network architecture but serves as motivation for further research.
Numerical Experiments
We now use the analytic deep inverse prior approach for solving an inverse problem with the following integration operator \(A:~L^2\left( \left[ 0,1\right] \right) ~\rightarrow ~L^2\left( \left[ 0,1\right] \right) \)
$$\begin{aligned} \left( Ax\right) (t) = \int _0^{t}x(s)\, \text {d}s. \end{aligned}$$
(4.23)
A is a linear and compact operator; hence, the inverse problem is ill-posed. Let \(A_n\in \mathbb {R}^{n \times n}\) be a discretization of A and let \(x^\dagger \in \mathbb {R}^n\) be one of its discretized singular vectors u. We set the noisy data \({y^\delta = A_n x^\dagger + \delta \tau }\) with \({\tau \sim \text{ Normal }(0,\mathbb {1}_n)}\), as shown in Fig. 6. A more general example, i.e., one where \(x^\dagger \) is not restricted to be a singular function, is also included (Fig. 7).
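A minimal NumPy sketch of this setup is given below; the discretization of A as a scaled lower triangular matrix, the grid size n, the index of the chosen singular vector, the noise level \(\delta \) and the random seed are illustrative assumptions and not the exact values behind Figs. 6-8.

```python
import numpy as np

n = 100
A_n = np.tril(np.ones((n, n))) / n                 # simple discretization of (4.23)

U, S, Vt = np.linalg.svd(A_n)                      # A_n = U diag(S) Vt
x_true = Vt[5]                                     # x^dagger: a discretized singular vector u
delta = 0.05
rng = np.random.default_rng(0)
y_delta = A_n @ x_true + delta * rng.standard_normal(n)   # noisy data y^delta
```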
We aim at recovering \(x^\dagger \) from \(y^\delta \) in the setting established in Definition 4.1 for \({R(\cdot )=\frac{1}{2}\Vert \cdot \Vert ^2}\). That means that the solution x is parametrized by the operator B. Solving the inverse problem is now equivalent to finding an operator B that minimizes the loss function (1.1) for the single data point \((z, y^\delta )\).
To find such a B, we go back to the DIP and the neural network approach. We write x(B) as the output of the network \(\varphi _\varTheta \) defined in (3.1) with some randomly initialized input z. We optimize with respect to B, which is a matrix in the discretized setting, and obtain a minimizer \(B_{\text {opt}}\) of (1.1). For more details, please refer to “Appendix 3.”
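A condensed PyTorch sketch of this procedure, continuing the NumPy setup above, reads as follows; the number of unrolled layers, the step size \(\lambda \), the value of \(\alpha \) and the number of iterations are illustrative choices, and only the learning rate of 0.05 is taken from the text (the actual implementation is described in “Appendix 3”).

```python
import torch

alpha, lam, L = 0.01, 1.0, 20                         # illustrative hyperparameters
A_t = torch.tensor(A_n, dtype=torch.float64)
y_t = torch.tensor(y_delta, dtype=torch.float64)

B = torch.tensor(A_n, dtype=torch.float64, requires_grad=True)   # initialize B^0 = A
opt = torch.optim.SGD([B], lr=0.05)                   # plain gradient descent

def network_output(B):
    """Output of the unrolled proximal gradient network for R = 0.5 * ||.||^2."""
    x = torch.zeros_like(y_t)                         # arbitrary input z
    for _ in range(L):
        x = (x - lam * B.T @ (B @ x - y_t)) / (1.0 + lam * alpha)
    return x

for _ in range(2000):
    opt.zero_grad()
    loss = 0.5 * torch.sum((A_t @ network_output(B) - y_t) ** 2)  # outer functional (4.1)
    loss.backward()
    opt.step()

B_opt = B.detach().numpy()                            # learned operator
x_rec = network_output(B).detach().numpy()            # reconstruction x(B_opt)
```

The resulting x_rec can then be compared with the classical Tikhonov reconstruction x(A), as in Fig. 8.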
In Fig. 8, we show some reconstruction results. The first plot of each row contains the true solution \(x^\dagger \), the standard Tikhonov solution x(A) and the reconstruction obtained with the analytic deep inverse approach \(x(B_{\text {opt}})\) after B has converged. For each case, we provide additional plots depicting:
The true error of the network’s output x(B) after each update of B in a logarithmic scale.
The squared Frobenius norm of \(B_k-B_{k+1}\) after each update of B.
The matrix \(B_{\text {opt}}\).
For all choices of \(\alpha \), the training of B converges to a matrix \(B_{\text {opt}}\) such that \(x(B_{\text {opt}})\) has a smaller true error than x(A). In the third plot of each row, one can check that B indeed converges to some matrix \(B_{\text {opt}}\), which is shown in the last plot. The networks were trained using gradient descent with a learning rate of 0.05.
The theoretical findings of the previous subsections allow us to compute either the exact update (4.16) for B in the rather unrealistic case that \(y^\delta = (\sigma + \delta ) v\), or the exact solution \(x(B_\alpha , y^\delta )\) if we restrict B to have the same system of singular functions as A (Theorem 4.1). In the numerical experiments, we do not impose any of these restrictions, and therefore, we cannot directly apply our theoretical results. Instead, we implement the network approach (see “Appendix 3”) to be able to find \(B_{\text {opt}}\) in a more general scenario. Nevertheless, as can be observed in the last plot of each row in Fig. 8, \(B_{\text {opt}}\) contains patterns that reflect, to some extent, that B keeps the same singular system but with different singular values. That is, B is updated in a way similar to (4.16). With the current implementation, we could also use more complex regularization functionals R in order to reduce the gap between our analytic approach and the original DIP. This is also a motivation for further research.