Introduction

Deep image priors (DIP) were recently introduced in deep learning for some tasks in image processing [19]. Usually, deep learning approaches to inverse problems proceed in two steps. In a first step (training), the parameters \(\varTheta \) of the deep neural network \(\varphi _\varTheta \) are optimized by minimizing a suitable loss function using large sets of training data. In a second step (application), new data are fed into the network for solving the desired task.

DIP approaches are radically different; they are based on unsupervised training using only a single data point \(y^\delta \). More precisely, in the context of inverse problems, where we aim at solving ill-posed operator equations \(Ax \sim y^\delta \), the task of DIP is to train a network \(\varphi _\varTheta (z)\) with parameters \(\varTheta \) by minimizing the simple loss function

$$\begin{aligned} \min _\varTheta \Vert A \varphi _\varTheta (z) - y^\delta \Vert ^2. \end{aligned}$$
(1.1)

The minimization is with respect to \(\varTheta \); the random input z is kept fixed. After training, the solution to the inverse problem is approximated directly by \({\hat{x}} = \varphi _\varTheta (z).\)

In image processing, common choices for A are the identity operator (denoising) or a projection operator to a subset of the image domain (inpainting). For these applications, it has been observed that minimizing the functional iteratively by gradient descent methods in combination with a suitable stopping criterion leads to amazing results [19].

Training with a single data point is the most striking property, which separates DIP from other neural network concepts. One might argue that the astonishing results [10, 19, 23, 32] are only possible if the network architecture is fine-tuned to the specific task. This is true for obtaining optimal performance; nevertheless, the presented numerical results perform well even with somewhat generic network architectures such as autoencoders.

We are interested in analyzing DIP approaches for solving ill-posed inverse problems. As a side remark, we note that the applications (denoising, inpainting) mentioned above are modeled by either identity or projection operators, which are not ill-posed in the functional analytical setting [13, 21, 28]. Typical examples of ill-posed inverse problems correspond to compact linear operators such as a large variety of tomographic measurement operators or parameter-to-state mappings for partial differential equations.

We aim at analyzing a specific network architecture \(\varphi _\varTheta \) and at interpreting the resulting DIP approach as a regularization technique in the functional analytical setting, and also at proving convergence properties for the minimizers of (1.1). In particular, we are interested in network architectures, which themselves can be interpreted as a minimization algorithm that solves a regularized inverse problem of the form

$$\begin{aligned} x(B) = {{\,\mathrm{\mathrm{arg\,min}}\,}}_x\frac{1}{2} \Vert B x - y^\delta \Vert ^2 + \alpha R(x), \end{aligned}$$
(1.2)

where R is a given convex function and B a learned operator.

In general, deep learning approaches for inverse problems have their own characteristics, and naive applications of neural networks can fail for even the most simple inverse problems, as shown in [22]. However, there is a growing number of compelling numerical experiments using suitable network designs for some of the toughest inverse problems such as photo-acoustic tomography [17] or X-ray tomography with very few measurements [2, 18]. Concerning networks based on deep prior approaches for inverse problems, first experimental investigations have been reported in [19, 23, 32]. As for the above-mentioned tasks in image processing, DIPs for inverse problems rely on two ingredients:

  1. A suitable network design, which leads to our phrase “regularization by architecture.”

  2. Training algorithms for iteratively minimizing (1.1) with respect to \(\varTheta \) in combination with a suitable stopping criterion.

In this paper, we present different mathematical interpretations of DIP approaches, and we analyze two network designs in the context of inverse problems in more detail. It is organized as follows: In Sect. 2, we discuss some relations to existing results and give a short survey of the related literature. In Sect. 3, we then state different interpretations of DIP approaches and the network architectures that we use, as a basis for the subsequent analysis. We start with a first mathematical result for a trivial network design, which yields a connection to Landweber iterations. We then consider a fully connected feedforward network with L identical layers, which generates a proximal gradient descent for a modified Tikhonov functional. In Sect. 4, we use this last connection to define the notion of analytic deep prior networks, whose regularization and convergence properties can be analyzed rigorously. The key to the theoretical findings is a change of view, which allows for the interpretation of DIP approaches as optimizing families of Tikhonov functionals. Finally, we exemplify our theoretical findings with numerical examples for the standard linear integration operator.

Deep Prior and Related Research

We start with a description of general deep prior concepts. Afterward, we address similarities and differences to other approaches, such as LISTA [16], in more detail.

The Deep Prior Approach

Present results on deep prior networks utilize feedforward architectures. In general, a feedforward neural network is an algorithm that starts with input \(x^0=z\), computes iteratively

$$\begin{aligned} x^{k+1} = \phi \left( W_k x^k + b_k \right) \end{aligned}$$

\(\text{ for }\ \ k=0,\ldots ,L-1\) and outputs

$$\begin{aligned} \varphi _\varTheta (z)= x^L \ . \end{aligned}$$

The parameters of this system are denoted by

$$\begin{aligned} \varTheta = \left\{ W_0,\ldots ,W_{L-1}, b_0,\ldots ,b_{L-1} \right\} \end{aligned}$$

and \(\phi \) denotes a nonlinear activation function.
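The recursion above can be sketched in a few lines of NumPy. This is only an illustrative sketch: the layer width, the number of layers, the random weights, and the choice of ReLU for \(\phi \) are assumptions made here and are not taken from the text.

```python
import numpy as np

def feedforward(z, weights, biases, phi):
    """Return x^L, starting from x^0 = z and applying x^{k+1} = phi(W_k x^k + b_k)."""
    x = z
    for W, b in zip(weights, biases):
        x = phi(W @ x + b)
    return x

def relu(v):
    """Example activation (assumed here for illustration)."""
    return np.maximum(v, 0.0)

# Hypothetical sizes and random parameters Theta = {W_0..W_{L-1}, b_0..b_{L-1}}.
rng = np.random.default_rng(0)
n, L = 4, 3
weights = [rng.standard_normal((n, n)) for _ in range(L)]
biases = [rng.standard_normal(n) for _ in range(L)]
z = rng.standard_normal(n)

x_L = feedforward(z, weights, biases, relu)  # network output phi_Theta(z)
```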

In order to highlight one of the unique features of deep image priors, let us first refer to classical generative networks that require training on large data sets.

In this classical setting, we are given an operator \(A: X \rightarrow Y\) between Hilbert spaces X and Y, as well as a set of training data \((x_i,y_i^\delta )\), where \(y_i^\delta \) is a noisy version of \(Ax_i\) satisfying \(\Vert y_i^\delta - Ax_i\Vert \le \delta \). Here, the usual deep learning approach is to use a network for direct inversion and the parameters \(\varTheta \) of the network are obtained by minimizing the loss function

$$\begin{aligned} \min _\varTheta \sum _{i=1}^N\ \Vert \varphi _\varTheta (y_i^\delta ) - x_i \Vert ^2\ . \end{aligned}$$
(2.1)

After training, \(\varTheta \) is fixed and the network is used to approximate the solution of the inverse problem with new data \(y^\delta \) by computing \(x=\varphi _\varTheta (y^\delta )\). For a recent survey on this approach and more general deep learning concepts for inverse problems see [5].

In general, this approach relies on the underlying assumption that complex distributions of suitable solutions x, e.g., the distribution of natural images, can be approximated by neural networks [6, 8, 33]. The parameters \(\varTheta \) are trained for the specific distribution of training data and are fixed after training. One then expects that choosing a new data set as input, i.e., \(z=y^\delta \) will generate a suitable solution to \(Ax \sim y^\delta \) [7]. Hence, after training, the distribution of solutions is parametrized by the inputs z.

In contrast, DIP is an unsupervised approach using only a single data point for training. That means, for given data \(y^\delta \) and fixed z, the parameters \(\varTheta \) of the network \(\varphi _\varTheta \) are obtained by minimizing the loss function (1.1). The solution to the inverse problem is then denoted by \({\hat{x}} = \varphi _\varTheta (z)\). Hence, deep image priors keep z fixed and aim at parameterizing the solution with \(\varTheta \). It has been observed in several works [10, 19, 23, 32] that this approach indeed leads to remarkable results for problems such as inpainting or denoising.

To some extent, the success of deep image priors is rooted in the careful design of network architectures. For example, [19] uses a U-Net-like “hourglass” architecture with skip connections, and the amazing results show that such an architecture implicitly captures some statistics of natural images. However, in general, the DIP learning process may converge toward noisy images or undesirable reconstructions. The whole success relies on a combination of the architecture with a suitable optimization method and stopping criterion. Nevertheless, the authors claim the architecture has a positive impact on the exploration of the solution space during the iterative optimization of \(\varTheta \). They show that the training process descends quickly to “natural-looking” images but requires many more steps to produce noisy images. This is also supported by the theoretical results of [29] and the observations of [35], which show that deep networks can fit noise very well but need more training time to do so. Another paper that hints in this direction is [4], which analyzes whether neural networks could have a bias toward approximating low frequencies.

There are already quite a few works that deal with deep prior approaches. In the following, we mention the ones most relevant to our work. The original deep image prior article [19] introduces the DIP concept and presents experimental evidence that today’s network architectures are in and of themselves conducive to image reconstruction. Another work [32] explores the applicability of DIP to problems in compressed sensing. Also, [23] discusses how to combine DIP with the regularization by denoising approach and [10] explores DIP in the context of stationary Gaussian processes. All of these introduce and discuss variants of DIP concepts; however, none of them addresses the intrinsic regularizing properties of the network concerning ill-posed inverse problems.

Deep Prior and Unrolled Proximal Gradient Architectures

A major part of this paper is devoted to analyzing the DIP approach in combination with an unrolled proximal gradient network \(\varphi _\varTheta \). Hence, there is a natural connection to the well-established analysis of LISTA schemes. Before we sketch the state of research in this field, we highlight the two major differences (loss function, training data) to the present approach. LISTA is based on a supervised training using multiple data points \((x_i, y_i^\delta )\), \(i=1,\ldots ,N\), where \(y_i^\delta \) is a noisy representation of \(Ax_i\). The loss function is (2.1). DIP, however, is based on unsupervised learning using the loss function (1.1) and a single data point \(y^\delta \). Hence, DIP with the unrolled proximal gradient network shares the architecture with LISTA, but its concept, as well as its analytic properties, is different. Nevertheless, the analysis we will present in Sect. 4 will exhibit structures similar to the ones appearing in the LISTA-related literature. Hence, we briefly review the major contributions in this field.

Similarities are most visible when considering algorithms and convergence analysis for sparse coding applications [20, 25, 30, 31, 34]. The field of sparse coding makes heavy use of proximal splitting algorithms and, since the advent of LISTA, of trained architectures inspired by truncated versions of these algorithms. In the broadest sense, all of these methods are expressions of “Learning to learn by gradient descent” [3]. Once more, we would like to emphasize that these results utilize multiple data points while DIP does not require any training data but only one measurement. Another key difference is that we approach the topic from an ill-posed inverse problem perspective, which (a) grounds our approach in the functional analytic realm and (b) considers ill-posed (not only ill-conditioned) problems in the Nashed sense, i.e., allows the treatment of unstable inverses [13]. These two points fundamentally differentiate the present approach from traditional compressed sensing considerations which usually deal with (a) finite dimensional formulations and (b) forward operators given by well-conditioned, carefully hand-crafted settings or dictionaries, which are optimized using large sets of training data [30].

Coming back to LISTA for sparse coding applications, there are many excellent papers [15, 24, 25] which are devoted to a strict mathematical analysis of different aspects of LISTA-like approaches. In [25], the authors show under which conditions sparse coding can benefit from LISTA-like trained structures and ask how good trained sparsity estimators can be given a computational budget. The article [15] deals with a similar trade-off proposing the quite exciting, “inexact proximal gradient descent.” The paper [9] proposes, based on theoretically founded considerations, a sibling architecture to LISTA. Moreover, [27] argues that deep learning architectures, in general, can be interpreted as multistage proximal splitting algorithms.

Finally, we want to point to publications that address deep learning with only a few data points for training; see, e.g., [14] and the references therein. However, they do not address the architectures relevant for our publication, and they do not refer to the specific complications of inverse problems.

Deep Prior Architectures and Interpretations

In this section, we discuss different perspectives on deep prior networks, which open the path to provable mathematical results. The first two subsections are devoted to special network architectures, and the last two subsections deal with more general points of view.

Fig. 1

A simple network with scalar input, a single layer and no activation function. For any arbitrary input z one obtains \(\varphi _\varTheta (z) = \varTheta \) (Color figure online)

A Trivial Architecture

We aim at solving ill-posed inverse problems. For a given operator A,  the general task in inverse problems is to recover an approximation for \(x^\dagger \) from measured noisy data

$$\begin{aligned} y^\delta = A x^\dagger + \tau , \end{aligned}$$

where \(\tau \), with \(\Vert \tau \Vert \le \delta ,\) describes the noise in the measurement.

The deep image prior approach to inverse problems asks to train a network \(\varphi _\varTheta (z)\) with parameters \(\varTheta \) and fixed input z by minimizing \(\Vert A \varphi _\varTheta (z) - y^\delta \Vert ^2\) with an optimization method such as gradient descent with early stopping. After training, a final run of the network computes \({\hat{x}} = \varphi _\varTheta (z)\) as an approximation to \(x^\dagger \).

We consider a trivial single-layer network without activation function, as shown in Fig. 1. This network simply outputs \(\varTheta ,\) i.e., \(\varphi _\varTheta (z)=\varTheta \). In this case, the network parameter \(\varTheta \) is a vector, which is chosen to have the same dimension as x. This means that training the network by gradient descent of \(\Vert A \varphi _\varTheta (z) - y^\delta \Vert ^2 = \Vert A \varTheta - y^\delta \Vert ^2\) with respect to \(\varTheta \) is equivalent to the classical Landweber iteration, which is a gradient descent method for \(\Vert A x - y^\delta \Vert ^2\) with respect to x.

Landweber iterations are slowly converging. However, in combination with a suitable stopping rule, they are optimal regularization schemes for diminishing noise level \(\delta \rightarrow 0\), [13, 21, 28]. Despite the apparent trivialization of the neural network approach, this shows that there is potential in training such networks with a single data point for solving ill-posed inverse problems.
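The equivalence with the Landweber iteration can be illustrated by the following minimal sketch. Several choices here are assumptions for illustration only: the discretized integration operator, the noise level, the step size \(1/\Vert A\Vert ^2\) (into which the factor 2 from the gradient of \(\Vert A\varTheta - y^\delta \Vert ^2\) is absorbed), and the discrepancy principle with \(\tau =1.1\) as the stopping rule.

```python
import numpy as np

def landweber(A, y_delta, delta, tau=1.1, max_iter=100_000):
    """Gradient descent on ||A theta - y_delta||^2 in theta (trivial network),
    stopped as soon as the residual norm drops to tau * delta."""
    omega = 1.0 / np.linalg.norm(A, 2) ** 2  # step size; spectral norm of A
    theta = np.zeros(A.shape[1])             # network parameter = reconstruction
    for k in range(max_iter):
        residual = A @ theta - y_delta
        if np.linalg.norm(residual) <= tau * delta:
            return theta, k                  # early stopping (discrepancy principle)
        theta = theta - omega * (A.T @ residual)
    return theta, max_iter

# Hypothetical test problem: discretized integration operator on [0, 1].
n = 50
h = 1.0 / n
A = h * np.tril(np.ones((n, n)))
t = (np.arange(n) + 0.5) * h
x_true = np.sin(np.pi * t)

rng = np.random.default_rng(0)
noise = rng.standard_normal(n)
delta = 0.05
y_delta = A @ x_true + delta * noise / np.linalg.norm(noise)  # ||noise|| = delta

theta_hat, k_stop = landweber(A, y_delta, delta)
```

The returned `theta_hat` plays the role of \({\hat{x}} = \varphi _\varTheta (z)=\varTheta \), and `k_stop` is the number of gradient steps taken before the stopping rule triggered.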

Fig. 2

Unrolled proximal gradient network with \(L=2\) (Color figure online)

Unrolled Proximal Gradient Architecture

In this section, we aim at rephrasing DIP, i.e., the minimization of (1.1) with respect to \(\varTheta \), as an approach for learning optimized Tikhonov functionals for inverse problems. This change of view, i.e., regarding deep inverse priors as optimization of functionals rather than networks, opens the way for analytic investigations in Sect. 4.

We use the particular architecture, which was introduced in [16], i.e., a fully connected feedforward network with L layers of identical size,

$$\begin{aligned} \varphi _\varTheta (z)= x^L, \end{aligned}$$
(3.1)

where

$$\begin{aligned} x^{k+1} = \phi \left( W x^k + b \right) \end{aligned}$$
(3.2)

The affine linear map \(\varTheta = (W,b)\) is the same for all layers. The matrix W is restricted to obey \(I-W = \lambda B^* B\) (I denotes the identity operator) for some B and the bias is determined via \(b = \lambda B^* y^\delta \), as shown in Fig. 2. If the activation function of the network is chosen as the proximal mapping of a regularizing functional \(\lambda \alpha R\), then \(\varphi _\varTheta (z)\) is identical to the Lth iterate of a proximal gradient descent method for minimizing

$$\begin{aligned} J_B(x)= \frac{1}{2} \Vert B x - y^\delta \Vert ^2 + \alpha R(x), \end{aligned}$$
(3.3)

see [12] or “Appendix 1”.
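The identification of the network with a proximal gradient scheme can be checked numerically. The sketch below assumes \(R(x)=\Vert x\Vert _1\), so that the activation is soft shrinkage with threshold \(\lambda \alpha \); the matrix B, the data, and all sizes are random placeholders chosen for illustration.

```python
import numpy as np

def soft_shrink(v, t):
    """Proximal mapping of t * ||.||_1 (soft shrinkage)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def unrolled_network(z, B, y_delta, lam, alpha, L):
    """L identical layers x^{k+1} = phi(W x^k + b) with
    W = I - lam * B^T B and b = lam * B^T y_delta."""
    W = np.eye(B.shape[1]) - lam * B.T @ B
    b = lam * B.T @ y_delta
    x = z
    for _ in range(L):
        x = soft_shrink(W @ x + b, lam * alpha)
    return x

def proximal_gradient(z, B, y_delta, lam, alpha, L):
    """L steps of proximal gradient descent on 0.5*||Bx - y_delta||^2 + alpha*||x||_1."""
    x = z
    for _ in range(L):
        x = soft_shrink(x - lam * B.T @ (B @ x - y_delta), lam * alpha)
    return x

# Hypothetical data for the comparison.
rng = np.random.default_rng(1)
B = rng.standard_normal((6, 4))
y_delta = rng.standard_normal(6)
z = rng.standard_normal(4)
lam = 1.0 / np.linalg.norm(B, 2) ** 2
alpha, L = 0.1, 5

x_net = unrolled_network(z, B, y_delta, lam, alpha, L)
x_pg = proximal_gradient(z, B, y_delta, lam, alpha, L)
```

Both functions produce the same vector up to rounding, since \(Wx+b = x - \lambda B^*(Bx-y^\delta )\).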

Remark 3.1

Restricting activation functions to be proximal mappings is not as severe as it might look at first glance. For example, ReLU is the proximal mapping for the indicator function of the nonnegative real numbers, and soft shrinkage is the proximal mapping for the modulus function.
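Both statements of the remark can be verified by brute force in one dimension. The sketch below evaluates the proximal objective \(\frac{1}{2}(x-v)^2+f(x)\) on a fine grid and compares the grid minimizer with ReLU and soft shrinkage; the grid, the threshold t, and the test points are arbitrary choices made here.

```python
import numpy as np

grid = np.linspace(-5.0, 5.0, 100_001)  # candidate minimizers, spacing 1e-4

def prox_on_grid(v, penalty):
    """Approximate argmin_x 0.5*(x - v)^2 + penalty(x) by a grid search."""
    objective = 0.5 * (grid - v) ** 2 + penalty(grid)
    return grid[np.argmin(objective)]

def indicator_nonneg(x):
    """Indicator function of [0, inf): zero on the set, +inf outside."""
    return np.where(x >= 0, 0.0, np.inf)

t = 0.7  # weight of the modulus function (assumed value)

def scaled_abs(x):
    return t * np.abs(x)

# Each entry: (grid prox of indicator, ReLU, grid prox of t|x|, soft shrinkage).
checks = []
for v in [-2.0, -0.3, 0.0, 0.5, 1.9]:
    relu_v = max(v, 0.0)
    soft_v = np.sign(v) * max(abs(v) - t, 0.0)
    checks.append((prox_on_grid(v, indicator_nonneg), relu_v,
                   prox_on_grid(v, scaled_abs), soft_v))
```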

This allows the interpretation that every weight update, i.e., every gradient step for minimizing (1.1) with respect to \(\varTheta \) or B, changes the functional \(J_B\). Hence, DIP can be regarded as optimizing a functional, which in turn is minimized by the network. This view is the starting point for investigating convergence properties in Sect. 4.

Two Perspectives Based on Regression

The following subsections address more general concepts, which open the way to further analytic investigations, which, however, are not considered further in this paper. The reader interested in the regularization properties for DIP approaches for inverse problems only may jump directly to Sect. 4.

In this subsection, we present two different perspectives on solving inverse problems with the DIP via the minimization of a functional as discussed in the subsection above. The first perspective is based on a reinterpretation of the minimization of the functional (1.1) in the finite-dimensional real setting, i.e., \(A\in \mathbb {R}^{m\times n}\). This setting allows us to write

$$\begin{aligned} \min _\varTheta \Vert A\varphi _\varTheta (z)-y^\delta \Vert ^2&= \min _{x\in {\mathscr {R}}(\varphi _\cdot (z))} \Vert Ax-y^\delta \Vert ^2 \end{aligned}$$
(3.4)
$$\begin{aligned}&= \min _{x\in {\mathscr {R}}(\varphi _\cdot (z))} \sum _{i=1}^m (x^*a_i-y^\delta _i)^2, \end{aligned}$$
(3.5)

where \({\mathscr {R}}(\varphi _\cdot (z))\) denotes the range of the network with regard to \(\varTheta \) for a fixed z and \(a_i\) the rows of the matrix A as well as \(y^\delta _i\) the entries of the vector \(y^\delta \). This setting allows for the interpretation that we are solving a linear regression, parameterized by x, which is constrained by a deep learning hypothesis space and given by data pairs of the form \((a_i, y^\delta _i)\).

The second perspective is based on the rewriting of the optimization problem via the method of Lagrange multipliers. We start by considering the constrained optimization problem

$$\begin{aligned} \min _{x\in X, \varTheta }\Vert Ax-y^\delta \Vert ^2 \text { s.t. } \Vert x-\varphi _\varTheta (z)\Vert ^2=0. \end{aligned}$$
(3.6)

If we now assume that \(\varphi \) has continuous first partial derivatives with regard to \(\varTheta \), the Lagrange functional

$$\begin{aligned} {\mathscr {L}}(\varTheta ,x,\lambda ) = \Vert Ax-y^\delta \Vert ^2 + \lambda \Vert x-\varphi _\varTheta (z)\Vert ^2, \end{aligned}$$
(3.7)

with the correct Lagrange multiplier \(\lambda =\lambda _0\), has a stationary point at each minimum of the original constrained optimization problem. This gives us a direct connection to unconstrained variational approaches like Tikhonov functionals.

The Bayesian Point of View

The Bayesian approach to inverse problems focuses on computing MAP (maximum a posteriori probability) estimators, i.e., one aims for

$$\begin{aligned} {\hat{x}} = {{\,\mathrm{\mathrm{arg\,max}}\,}}_{x\in X}p(x|y^\delta ), \end{aligned}$$
(3.8)

where \(p:X\times Y\rightarrow \mathbb {R}_+\cup \{0\}\) is a conditional PDF. From standard Bayesian theory, we obtain

$$\begin{aligned} {\hat{x}} = {{\,\mathrm{\mathrm{arg\,min}}\,}}_{x\in X} \left\{ -\log [p(y^\delta |x)]-\log [p(x)] \right\} \ . \end{aligned}$$
(3.9)

The setting for inverse problems, i.e., \(Ax+\tau = y^\delta \) with \(\tau \sim \text{ Normal }(0,\sigma ^2\mathbb {1}_Y)\), yields (\(\lambda =2\sigma ^2\))

$$\begin{aligned} {\hat{x}}&=: {{\,\mathrm{\mathrm{arg\,min}}\,}}_{x\in X} \Vert Ax-y^\delta \Vert ^2-\lambda \log [p(x)] \ . \end{aligned}$$

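We now decompose x into \(x_\perp := P_{{\mathscr {N}}(A)^\perp }(x)\) and \(x_{\mathscr {N}} := P_{{\mathscr {N}}(A)}(x)\), where \({\mathscr {N}}(A)\) denotes the nullspace of A and where \(P_{{\mathscr {N}}(A)}(x)\), resp. \(P_{{\mathscr {N}}(A)^\perp }(x)\), denotes the orthogonal projection onto \({\mathscr {N}}(A)\), resp. \({\mathscr {N}}(A)^\perp \). Setting \({\hat{x}} = (x_{\mathscr {N}}, x_\perp )\), using \(Ax = Ax_\perp \) and \(p(x) = p(x_\perp )\, p(x_{\mathscr {N}}|x_\perp )\), yields

$$\begin{aligned} {\hat{x}} = {{\,\mathrm{\mathrm{arg\,min}}\,}}_{x_{\mathscr {N}},\, x_\perp } \underbrace{\Vert Ax_\perp -y^\delta \Vert ^2-\lambda \log [p(x_\perp )]}_{\text {(I)}}\ \underbrace{-\lambda \log [p(x_{\mathscr {N}}|x_\perp )]}_{\text {(II)}} \ . \end{aligned}$$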

The data \(y^\delta \) only contain information about \(x_\perp \), which in classical regularization is exploited by restricting any reconstruction to \({\mathscr {N}}(A)^\perp \).

However, if available, \(p(x_{\mathscr {N}}|x_\perp )\) is a measure on how to extend \(x_\perp \) with an \(x_{\mathscr {N}} \in {\mathscr {N}}(A)\) to a suitable \(x = (x_{\mathscr {N}}, x_\perp )\). The classical regularization of inverse problems uses the trivial extension by zero, i.e., \(x = (0, x_\perp )\), which is not necessarily optimal. If we accept the interpretation that a network can be a meaningful parametrization of the set of suitable solutions x, then \(p(x) \equiv 0\) for all x not in the range of the network and optimizing the network will indeed yield a non-trivial completion \(x = (x_{\mathscr {N}}, x_\perp )\). More precisely, (I) can be interpreted to be a deep prior on the measurement and (II) to be a deep prior on the nullspace part of the problem.

Deep Priors and Tikhonov Functionals

In this section, we consider the particular network architecture given by unrolled proximal gradient schemes, as introduced in Sect. 3.2. We aim at embedding this approach into the classical regularization theory for inverse problems. For a rigorous mathematical analysis, we will introduce the notion of an analytic deep prior network, which then allows interpreting the training of the deep prior network as an optimization of a Tikhonov functional. The main result of this section is Theorem 4.2, which states that analytic deep priors in combination with a suitable stopping rule are indeed order optimal regularization schemes. Numerical experiments in Sect. 4.2 demonstrate that such deep prior approaches lead to smaller reconstruction errors when compared with standard Tikhonov reconstructions. The superiority of this approach can be proved, however, only for the rather unrealistic case that the solution coincides with a singular function of A.

Unrolled Proximal Gradient Networks as Deep Priors for Inverse Problems

In this section, we consider linear operators A and aim at rephrasing DIP, i.e., the minimization of (1.1) with respect to \(\varTheta \), as a constrained optimization problem. This change of view, i.e., regarding deep inverse priors as an optimization of a simple but constrained functional, rather than networks, opens the way for analytic investigations. We will use an unrolled proximal gradient architecture for the network \(\varphi _\varTheta (z)\) in (1.1). The starting point for our investigation is the common observation, as shown in [11, 16] or “Appendix 1”, that an unrolled proximal gradient scheme as defined in Sect. 3.2 approximates a minimizer x(B) of (3.3). Assuming that a unique minimizer x(B) exists as well as neglecting the difference between x(B) and the approximation \(\varphi _\varTheta (z)\) achieved by the unrolled proximal gradient motivates the following definition of analytic deep priors.

Definition 4.1

Let us assume that measured data \(y^\delta \in Y\), a fixed \(\alpha >0\), a convex penalty functional \(R:X\rightarrow \mathbb {R}\) and a measurement operator \(A \in {\mathscr {L}}(X,Y)\) are given. We consider the minimization problem

$$\begin{aligned} \min _B F(B)= \min _B\frac{1}{2} \Vert A x(B) - y^\delta \Vert ^2, \end{aligned}$$
(4.1)

subject to the constraint

$$\begin{aligned}&x(B) = {{\,\mathrm{\mathrm{arg\,min}}\,}}_x J_B(x)\nonumber \\&\quad = {{\,\mathrm{\mathrm{arg\,min}}\,}}_x\frac{1}{2} \Vert B x - y^\delta \Vert ^2 + \alpha R(x). \end{aligned}$$
(4.2)

We assume that for every \(B \in {\mathscr {L}}(X,Y)\), there is a unique minimizer x(B). We call this constrained minimization problem an analytic deep prior and denote by x(B) the resulting solution to the inverse problems posed by A and \(y^\delta \).

We can also use this technical definition as the starting point of our consideration and retrieve the neural network architecture by considering the following approach for solving the minimization problem stated in the above definition. Assuming that R has a proximal operator, we can compute x(B), given B, via the proximal gradient method, that is, via the following iteration, which converges for a suitable choice of \(\lambda >0\) and an arbitrary \(x^0=z\in X\):

$$\begin{aligned} x^{k+1} = {{\,\mathrm{{\text {Prox}}}\,}}_{\lambda \alpha R} \left( x^k - \lambda B^*(Bx^k-y^\delta )\right) . \end{aligned}$$
(4.3)

Following this iteration for L steps can be seen as the forward pass of a particular architecture of a fully connected feedforward network with L layers of identical size as described in (3.1) and (3.2). The affine linear map given by \(\varTheta =(W,b)\) is the same for all layers. Moreover, the activation function of the network is given by the proximal mapping of \(\lambda \alpha R\), the matrix W is given via \(I-W = \lambda B^* B\) (I denotes the identity operator), and the bias is determined by \(b = \lambda B^* y^\delta \).

From now on we will assume that the difference between \(x^L\) and x(B) is negligible, i.e.,

$$\begin{aligned} x^L = x(B). \end{aligned}$$
(4.4)

Remark 4.1

The task in the DIP approach is to find \(\varTheta \) (network parameters). Analogously, in the analytic deep prior, we try to find the operator B.

We now examine the analytic deep image prior utilizing the proximal gradient descent approach to compute x(B). Therefore, we will focus on the minimization of (4.1) with respect to B for given data \(y^\delta \) by means of gradient descent.

The stationary points are characterized by \(\partial F(B)=0\), and gradient descent iterations with stepsize \(\eta \) are given by

$$\begin{aligned} B^{\ell +1} = B^\ell - \eta \partial F (B^\ell ). \end{aligned}$$
(4.5)

Hence, we need to compute the derivative of F with respect to B.

Lemma 4.1

Consider an analytic deep prior with the proximal gradient descent approach as described above. We define

$$\begin{aligned} \psi (x,B) = {{\,\mathrm{{\text {Prox}}}\,}}_{\lambda \alpha R} \left( x - \lambda B^*(Bx-y^\delta )\right) - x. \end{aligned}$$
(4.6)

Then,

$$\begin{aligned} \partial F (B) = \partial x(B)^*A^*(Ax(B) - y^\delta ) \end{aligned}$$
(4.7)

with

$$\begin{aligned} \partial x(B) = - \psi _x(x(B), B)^{-1} \psi _B(x(B),\, B), \end{aligned}$$
(4.8)

which leads to the gradient descent

$$\begin{aligned} B^{\ell +1}= B^\ell - \eta \partial F (B^\ell ). \end{aligned}$$
(4.9)
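As a sanity check of the chain rule in (4.7) and (4.8), the following sketch treats the finite-dimensional case with \(R(x)=\frac{1}{2}\Vert x\Vert ^2\), which is an assumption made here for concreteness. In that case \({{\,\mathrm{{\text {Prox}}}\,}}_{\lambda \alpha R}(v)=v/(1+\lambda \alpha )\), the common factor \(\lambda /(1+\lambda \alpha )\) cancels in (4.8), and one obtains \(\partial x(B)[H] = -(B^*B+\alpha I)^{-1}\left( H^*(Bx-y^\delta ) + B^*Hx\right) \) with \(x=x(B)\). The resulting gradient is compared against a central finite difference in a random direction H; all matrices are random placeholders.

```python
import numpy as np

def x_of_B(B, y, alpha):
    """Minimizer of 0.5*||Bx - y||^2 + 0.5*alpha*||x||^2."""
    return np.linalg.solve(B.T @ B + alpha * np.eye(B.shape[1]), B.T @ y)

def F(B, A, y, alpha):
    """Outer loss 0.5*||A x(B) - y||^2."""
    return 0.5 * np.linalg.norm(A @ x_of_B(B, y, alpha) - y) ** 2

def grad_F(B, A, y, alpha):
    """Gradient of F with respect to B, obtained from the implicit
    derivative of x(B) contracted with A^T (A x(B) - y)."""
    x = x_of_B(B, y, alpha)
    M = B.T @ B + alpha * np.eye(B.shape[1])
    w = np.linalg.solve(M, A.T @ (A @ x - y))  # M^{-1} A^T (A x - y), M symmetric
    r = B @ x - y                              # inner residual
    return -(np.outer(r, w) + np.outer(B @ w, x))

# Hypothetical random problem and direction.
rng = np.random.default_rng(2)
m, n, alpha = 5, 4, 0.1
A = rng.standard_normal((m, n))
B = rng.standard_normal((m, n))
y = rng.standard_normal(m)
H = rng.standard_normal((m, n))

eps = 1e-6
fd = (F(B + eps * H, A, y, alpha) - F(B - eps * H, A, y, alpha)) / (2 * eps)
analytic = np.sum(grad_F(B, A, y, alpha) * H)  # directional derivative <grad F, H>
```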

Lemma 4.1 allows us to obtain an explicit description of the gradient descent for B, which in turn leads to an iteration of functionals \(J_B\) and minimizers x(B). We will now exemplify this derivation for a rather academic example, which however highlights in particular the differences between a classical Tikhonov minimizer, i.e.,

$$\begin{aligned} x(A) = {{\,\mathrm{\mathrm{arg\,min}}\,}}_x \frac{1}{2} \Vert A x - y^\delta \Vert ^2 + \frac{\alpha }{2} \Vert x \Vert ^2, \end{aligned}$$

and the solution of the DIP approach.

Example

In this example, we examine analytic deep priors for linear inverse problems \(A:X \rightarrow Y\), i.e., \(A, B \in {\mathscr {L}}(X,Y)\), and

$$\begin{aligned} R(x)=\frac{1}{2}\Vert x\Vert ^2. \end{aligned}$$
(4.10)

The rather abstract characterization of the previous section can be made explicit for this setting. Since \(J_B(x)\) is the classical Tikhonov regularization, which can be solved by

$$\begin{aligned} x(B) = (B^*B+\alpha I)^{-1}B^*y^\delta , \end{aligned}$$
(4.11)

we can rewrite the analytic deep prior reconstruction as x(B), where B is minimizing

$$\begin{aligned} F(B) = \frac{1}{2} \Vert A (B^*B+\alpha I)^{-1}B^*y^\delta - y^\delta \Vert ^2. \end{aligned}$$
(4.12)
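The closed form (4.11) and the loss (4.12) translate directly into code. The sketch below also runs the unrolled iteration (4.3) for the quadratic penalty, where the proximal mapping reduces to a division by \(1+\lambda \alpha \), and compares its output with (4.11) after many layers. The matrices, the value of \(\alpha \), the step size, and the number of layers are assumptions chosen for illustration.

```python
import numpy as np

def tikhonov_minimizer(B, y_delta, alpha):
    """x(B) = (B^T B + alpha I)^{-1} B^T y_delta, cf. (4.11)."""
    return np.linalg.solve(B.T @ B + alpha * np.eye(B.shape[1]), B.T @ y_delta)

def analytic_deep_prior_loss(B, A, y_delta, alpha):
    """F(B) = 0.5 * ||A x(B) - y_delta||^2, cf. (4.12)."""
    return 0.5 * np.linalg.norm(A @ tikhonov_minimizer(B, y_delta, alpha) - y_delta) ** 2

def unrolled_quadratic(B, y_delta, alpha, lam, L, z):
    """L steps of (4.3) with R(x) = 0.5*||x||^2, whose prox is v / (1 + lam*alpha)."""
    x = z
    for _ in range(L):
        x = (x - lam * B.T @ (B @ x - y_delta)) / (1.0 + lam * alpha)
    return x

# Hypothetical data.
rng = np.random.default_rng(3)
A = rng.standard_normal((6, 4))
B = rng.standard_normal((6, 4))
y_delta = rng.standard_normal(6)
alpha = 0.5
lam = 1.0 / np.linalg.norm(B, 2) ** 2

x_closed = tikhonov_minimizer(B, y_delta, alpha)
x_unrolled = unrolled_quadratic(B, y_delta, alpha, lam, L=2000, z=np.zeros(4))
loss = analytic_deep_prior_loss(B, A, y_delta, alpha)
```

With this step size each layer contracts the error by at least the factor \(1/(1+\lambda \alpha )\), so a deep enough network reproduces x(B) to numerical precision, which is the idealization expressed by (4.4).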

Lemma 4.2

Following Lemma 4.1, assuming \(B^0=A\) and computing one step of gradient descent to minimize the functional with respect to B, yields

$$\begin{aligned} B^1 = A - \eta \partial F(A) \end{aligned}$$
(4.13)

with

$$\begin{aligned} \partial F (A)&=\partial x (A)^*A^*(Ax(A) - y^\delta ) \nonumber \\&= \alpha AA^*y^\delta ({y^\delta })^* A {\left( A^*A + \alpha I \right) ^{-3} } \end{aligned}$$
(4.14)
$$\begin{aligned}&\quad +\alpha A { \left( A^*A + \alpha I \right) ^{-3} } A^* y^\delta ({y^\delta })^* A \nonumber \\&\quad -\alpha {y^\delta } ({y^\delta })^* A{ \left( A^*A + \alpha I \right) ^{-2} }. \end{aligned}$$
(4.15)

This expression nicely collapses if \({y^\delta } ({y^\delta })^* \) commutes with \(AA^*\). For illustration, we assume the rather unrealistic case that \(x^\dagger =u\), where u is a singular function for A with singular value \(\sigma \). The dual singular function is denoted by v, i.e., \(Au=\sigma v\) and \(A^* v= \sigma u\), and we further assume that the measurement noise in \(y^\delta \) is in the direction of this singular function, i.e., \(y^\delta = (\sigma + \delta ) v\), as shown in Fig. 3. In this case, the problem is indeed one-dimensional and we obtain an iteration restricted to the span of u, resp. the span of v.

Lemma 4.3

The setting described above yields the following gradient step for the functional in (4.12):

$$\begin{aligned} B^{\ell +1} = B^\ell - c_\ell v u^* \end{aligned}$$
(4.16)

with

$$\begin{aligned} c_\ell =c(\alpha , \delta , \sigma , \eta )=\eta \sigma (\sigma + \delta )^2 (\alpha + \beta _\ell ^2 -\sigma \beta _\ell ) \frac{\beta _\ell ^2 - \alpha }{(\beta _\ell ^2 + \alpha )^3}, \end{aligned}$$

where \(\beta _\ell \) denotes the singular value of \(B^\ell \) associated with the pair (u, v), and the iteration (4.16) in turn results in the sequence \(x(B^{\ell })\) with the unique attractive stationary point

$$\begin{aligned} x = {\left\{ \begin{array}{ll} \frac{1}{2\sqrt{\alpha }}(\sigma + \delta ) u, &{} \sigma < 2 \sqrt{\alpha }\\ \frac{1}{\sigma }(\sigma + \delta ) u, &{} \text {otherwise.} \end{array}\right. } \end{aligned}$$
(4.17)

For comparison, the classical Tikhonov regularization would yield \(\frac{\sigma }{\sigma ^2 + \alpha }(\sigma + \delta ) u\). This is depicted in Fig. 4.
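The one-dimensional iteration can be simulated directly. In the sketch below, `beta` stands for the singular value of \(B^\ell \) belonging to (u, v), initialized with \(\sigma \) because \(B^0=A\), and updated by \(\beta _{\ell +1}=\beta _\ell -c_\ell \). The coefficient of u in \(x(B^\ell )\) is \(\frac{\beta _\ell }{\beta _\ell ^2+\alpha }(\sigma +\delta )\), which follows from (4.11). The values of \(\sigma \), \(\delta \), \(\eta \), and the number of steps are assumptions made for illustration; \(\alpha =10^{-3}\) is the value used in Fig. 4.

```python
import numpy as np

def run_beta_iteration(sigma, delta, alpha, eta, steps):
    """Iterate beta_{l+1} = beta_l - c_l with c_l from Lemma 4.3, starting at beta_0 = sigma."""
    beta = sigma
    for _ in range(steps):
        c = (eta * sigma * (sigma + delta) ** 2
             * (alpha + beta ** 2 - sigma * beta)
             * (beta ** 2 - alpha) / (beta ** 2 + alpha) ** 3)
        beta -= c
    return beta

def reconstruction_coefficient(beta, sigma, delta, alpha):
    """Coefficient of u in x(B) for B with singular value beta along (u, v)."""
    return beta / (beta ** 2 + alpha) * (sigma + delta)

alpha, delta, eta, steps = 1e-3, 0.01, 0.5, 2000

# One singular value above and one below the threshold 2*sqrt(alpha).
results = {}
for sigma in (1.0, 0.02):
    beta = run_beta_iteration(sigma, delta, alpha, eta, steps)
    results[sigma] = {
        "deep_prior": reconstruction_coefficient(beta, sigma, delta, alpha),
        "tikhonov": sigma / (sigma ** 2 + alpha) * (sigma + delta),
        "limit_4_17": (sigma + delta) / sigma if sigma >= 2 * np.sqrt(alpha)
                      else (sigma + delta) / (2 * np.sqrt(alpha)),
    }
```

For both singular values, the simulated coefficient should agree with the corresponding branch of (4.17), while the Tikhonov coefficient differs from it.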

Fig. 3

Example of \(y^\delta = (\sigma + \delta ) v\) where v is a singular function of A (integral operator) (Color figure online)

Fig. 4

Comparison of the Tikhonov reconstruction (orange broken line), the result obtained in (4.17) (blue continuous line) and the direct inverse. In this example, we considered \(\alpha =10^{-3}\) (Color figure online)

Constrained System of Singular Functions

In the previous example, we showed that if we do gradient descent starting from \(B^0=A\) and assume the rather simple case \(y^\delta = (\sigma + \delta ) v\), we obtain the iteration \(B^{\ell +1} = B^\ell - c_\ell v u^*\), i.e., \(B^{\ell +1}\) has the same singular functions as A and only one of the singular values is different.

We now analyze the optimization from a different perspective. Namely, we focus on directly finding a minimizer of (4.1) for a general \(y^\delta \in Y\); however, we restrict B to be an operator such that \(B^*B\) commutes with \(A^*A\), i.e., A and B share a common system of singular functions. Hence, B has the following representation:

$$\begin{aligned} B=\sum _i \beta _i v_i u_i^*, \quad \ \beta _i \in \mathbb {R}_+\cup \{0\}, \end{aligned}$$
(4.18)

where \(\{u_i, \sigma _i, v_i\}\) is the singular value decomposition of A. That means, we restrict the problem to finding optimal singular values \(\beta _i\) for B. In this case, we show that a global minimizer exists and that it has interesting properties.

Theorem 4.1

For any \(y^\delta \in Y\), there exists a global minimizer (in the constrained singular functions setting) of (4.1) given by \(B_\alpha =\sum _i \beta _i^\alpha v_i u_i^*\) with

$$\begin{aligned} \beta _i^\alpha = {\left\{ \begin{array}{ll} \frac{\sigma _i}{2} + \sqrt{\frac{\sigma _i^2}{4} - \alpha } &{} \quad \sigma _i \ge 2\sqrt{\alpha }\\ \sqrt{\alpha } &{} \quad \sigma _i < 2\sqrt{\alpha } \\ \end{array}\right. }. \end{aligned}$$
(4.19)

Remark 4.2

The singular values obtained in Theorem 4.1 match the ones obtained in the previous section for general B but simple \(y^\delta = (\sigma + \delta ) v\).

Remark 4.3

The minimizer from Theorem 4.1 does not depend on \(y^\delta \), i.e., for all \(y^\delta \in Y\) it holds that \(B_\alpha \) is a minimizer of (4.1). The solution to the inverse problem does still depend on \(y^\delta \) since

$$\begin{aligned} x(B_\alpha ) = {{\,\mathrm{\mathrm{arg\,min}}\,}}_x\frac{1}{2} \Vert B_\alpha x - y^\delta \Vert ^2 + \alpha R(x). \end{aligned}$$
(4.20)

Remark 4.4

In the original DIP approach, some of the parameters of the network may be similar for different \(y^\delta \), for example, the parameters of the first layers of the encoder part of the UNet. Other parameters may strongly depend on \(y^\delta \). In this particular case of the analytic deep prior (constrained system of singular functions), we have an explicit separation of which parameters (\(b = \lambda B^* y^\delta \)) depend on \(y^\delta \) and which do not (\(W =I - \lambda B^* B\)).

From now on, we use the notation \(x(B,\,y^\delta )\) to make the dependency of x(B) on \(y^\delta \) explicit. Following the classical filter theory for order optimal regularization schemes [13, 21, 28], we obtain the following theorem.

Theorem 4.2

The pseudoinverse \(K_\alpha : Y \rightarrow X\) defined as

$$\begin{aligned} K_\alpha (y^\delta ) := x(B_\alpha ,\, y^\delta ) \end{aligned}$$
(4.21)

is an order optimal regularization method given by the filter functions

$$\begin{aligned} F_\alpha (\sigma ) = {\left\{ \begin{array}{ll} 1 &{} \quad \sigma \ge 2\sqrt{\alpha }\\ \frac{\sigma }{2\sqrt{\alpha }} &{} \quad \sigma < 2\sqrt{\alpha } \\ \end{array}\right. }. \end{aligned}$$
(4.22)
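A short computation, sketched here under the assumption of the Tikhonov form \(x(B,\,y^\delta ) = (B^*B + \alpha I)^{-1} B^* y^\delta \) (that is, \(R = \frac{1}{2}\Vert \cdot \Vert ^2\)), indicates where (4.22) comes from. Inserting \(B_\alpha \) from Theorem 4.1 and writing the result in the usual filter form gives

$$\begin{aligned} x(B_\alpha ,\, y^\delta ) = \sum _i \frac{\beta _i^\alpha }{(\beta _i^\alpha )^2 + \alpha } \langle y^\delta , v_i \rangle u_i = \sum _i F_\alpha (\sigma _i) \frac{1}{\sigma _i} \langle y^\delta , v_i \rangle u_i, \quad F_\alpha (\sigma _i) = \frac{\sigma _i \beta _i^\alpha }{(\beta _i^\alpha )^2 + \alpha }. \end{aligned}$$

For \(\sigma _i \ge 2\sqrt{\alpha }\), the value \(\beta _i^\alpha \) in (4.19) is a root of \(\beta ^2 - \sigma _i \beta + \alpha = 0\), hence \((\beta _i^\alpha )^2 + \alpha = \sigma _i \beta _i^\alpha \) and \(F_\alpha (\sigma _i) = 1\). For \(\sigma _i < 2\sqrt{\alpha }\), we have \(\beta _i^\alpha = \sqrt{\alpha }\) and therefore \(F_\alpha (\sigma _i) = \sigma _i \sqrt{\alpha } / (2\alpha ) = \sigma _i / (2\sqrt{\alpha })\).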

The regularized pseudoinverse \(K_\alpha \) is quite similar to the truncated singular value decomposition (TSVD) but is a softer version because it does not have a jump (see Fig. 5). We call this method Soft TSVD.

The disadvantage of Tikhonov, in this case, is that it damps all singular values, and the disadvantage of TSVD is that it throws away all the information related to small singular values. The Soft TSVD, on the other hand, does not damp the higher singular values (similar to TSVD) and does not throw away the information related to smaller singular values but does damp it (similar to Tikhonov). For a comparison of the filter functions, see Table 1. Moreover, it is interesting that this method arises from Definition 4.1, which is stated in terms of the Tikhonov pseudoinverse, and that the optimal singular values do not depend on \(y^\delta \).
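The three filters can be compared numerically with a few lines of Python. This is only an illustrative sketch: the Tikhonov filter \(\sigma ^2/(\sigma ^2+\alpha )\) and the Soft TSVD filter (4.22) follow the text, whereas the TSVD cut-off convention \(\sigma ^2 \ge \alpha \) and the sample values of \(\sigma \) and \(\alpha \) are assumptions made for the example.

```python
import math


def tikhonov_filter(sigma, alpha):
    """Classical Tikhonov filter sigma^2 / (sigma^2 + alpha)."""
    return sigma ** 2 / (sigma ** 2 + alpha)


def tsvd_filter(sigma, alpha):
    """TSVD filter; the cut-off sigma^2 >= alpha is an assumed convention."""
    return 1.0 if sigma ** 2 >= alpha else 0.0


def soft_tsvd_filter(sigma, alpha):
    """Soft TSVD filter from (4.22)."""
    threshold = 2.0 * math.sqrt(alpha)
    return 1.0 if sigma >= threshold else sigma / threshold


def optimal_beta(sigma, alpha):
    """Singular value of B_alpha from (4.19)."""
    if sigma >= 2.0 * math.sqrt(alpha):
        # max(..., 0.0) guards against rounding exactly at the threshold
        return sigma / 2.0 + math.sqrt(max(sigma ** 2 / 4.0 - alpha, 0.0))
    return math.sqrt(alpha)


alpha = 1e-3  # illustrative value
for sigma in (0.5, 0.1, 0.05, 0.01):
    beta = optimal_beta(sigma, alpha)
    # Filter implied by B_alpha; it should coincide with soft_tsvd_filter.
    implied = sigma * beta / (beta ** 2 + alpha)
    print(sigma, tsvd_filter(sigma, alpha), tikhonov_filter(sigma, alpha),
          soft_tsvd_filter(sigma, alpha), implied)
```

Since \(\sigma ^2 + \alpha \ge 2\sigma \sqrt{\alpha }\), the Tikhonov filter never exceeds the Soft TSVD filter, with equality only at \(\sigma = \sqrt{\alpha }\).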

Fig. 5 Filter response of TSVD, Tikhonov and the Soft TSVD (Color figure online)

Table 1 Values of \(\nu \) for which TSVD, Tikhonov and the Soft TSVD are order optimal

At this point, the relation to the original DIP approach becomes more abstract. We considered a simplified network architecture where all layers share the same weights that come from an iterative algorithm for solving inverse problems. That means, we let the solution to the original inverse problem be the solution of another problem with a different operator B. The DIP approach in this case is transformed to finding an optimal B, which allows us to do the analysis in the functional analysis setting. What we learn from the previous results is that we can establish interesting connections between the DIP approach and the classical inverse problems theory. This is important because it shows that deep inverse priors can be used to solve genuinely ill-posed inverse problems.

Remark 4.5

In the original DIP, the input z to the network is chosen arbitrarily and is of minor importance. However, once the weights have been trained for a given \(y^\delta \), z cannot be changed because it would affect the output of the network, i.e., it would change the obtained reconstruction. In the analytic deep prior, the input to the unrolled proximal gradient method is completely irrelevant (assuming an infinite number of layers). After finding the “weights” B, a different input will still produce the same solution \(\hat{x} = x(B) = \varphi _\varTheta (z)\).
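The independence from the input can be illustrated with a small NumPy sketch of an unrolled proximal gradient iteration. It assumes \(R = \frac{1}{2}\Vert \cdot \Vert ^2\), for which the proximal map is a plain rescaling, uses the step size \(\lambda = 1/\Vert B\Vert ^2\) as one admissible choice, and replaces the infinite number of layers by a large finite number. The function name and the random test problem are hypothetical.

```python
import numpy as np


def unrolled_prox_gradient(B, y, z, alpha, n_layers=10000):
    """Apply n_layers layers x <- prox(W x + b), starting from the input z.

    W = I - lam * B^T B and b = lam * B^T y, as in Remark 4.4.
    """
    lam = 1.0 / np.linalg.norm(B, 2) ** 2  # step size (assumed choice)
    W = np.eye(B.shape[1]) - lam * B.T @ B
    b = lam * B.T @ y
    x = np.array(z, dtype=float)
    for _ in range(n_layers):
        # prox of lam * alpha * R for R = 0.5 * ||.||^2 is a rescaling
        x = (W @ x + b) / (1.0 + lam * alpha)
    return x


# Hypothetical small test problem.
rng = np.random.default_rng(0)
B = rng.standard_normal((6, 4))
y = rng.standard_normal(6)
alpha = 0.1

x_from_zeros = unrolled_prox_gradient(B, y, np.zeros(4), alpha)
x_from_noise = unrolled_prox_gradient(B, y, rng.standard_normal(4), alpha)
x_closed_form = np.linalg.solve(B.T @ B + alpha * np.eye(4), B.T @ y)

print(np.linalg.norm(x_from_zeros - x_from_noise))
print(np.linalg.norm(x_from_zeros - x_closed_form))
```

Both printed distances should be close to zero: the two inputs lead to the same output, which coincides with the Tikhonov solution for the operator B.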

Remark 4.5 tells us that there is still a gap between the original DIP and the analytic one. This was expected because of the obvious trivialization of the network architecture but serves as motivation for further research.

Numerical Experiments

We now use the analytic deep inverse prior approach for solving an inverse problem with the following integration operator \(A:~L^2\left( \left[ 0,1\right] \right) ~\rightarrow ~L^2\left( \left[ 0,1\right] \right) \)

$$\begin{aligned} \left( Ax\right) (t) = \int _0^{t}x(s)\, \text {d}s. \end{aligned}$$
(4.23)

A is a linear and compact operator, hence the inverse problem is ill-posed. Let \(A_n\in \mathbb {R}^{n \times n}\) be a discretization of A and let \(x^\dagger \in \mathbb {R}^n\) be one of its discretized singular vectors u. We set the noisy data \({y^\delta = A_n x^\dagger + \delta \tau }\) with \({\tau \sim \text{ Normal }(0,\mathbb {1}_n)}\), as shown in Fig. 6. A more general example, i.e., where \(x^\dagger \) is not restricted to be a singular function, is also included (Fig. 7).
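A possible way to set up such a test problem is sketched below. The discretization is not specified in the text, so the rectangle-rule matrix, the grid size n, the noise level \(\delta \), the random seed and the SNR definition used here are all assumptions made for illustration.

```python
import numpy as np


def integration_matrix(n):
    """Rectangle-rule discretization of (Ax)(t) = int_0^t x(s) ds on [0, 1]."""
    return np.tril(np.ones((n, n))) / n


def noisy_data(A_n, x_true, delta, rng):
    """Return y^delta = A_n x + delta * tau with tau ~ Normal(0, I_n)."""
    tau = rng.standard_normal(A_n.shape[0])
    return A_n @ x_true + delta * tau


def snr_db(clean, noisy):
    """Signal-to-noise ratio in dB (one common definition, assumed here)."""
    noise = noisy - clean
    return 10.0 * np.log10(np.sum(clean ** 2) / np.sum(noise ** 2))


n = 200
A_n = integration_matrix(n)

# SVD convention A_n = U diag(s) Vt, so that A_n u_i = s_i v_i with
# u_i = Vt[i] (right singular vector) and v_i = U[:, i] (left singular vector).
U, s, Vt = np.linalg.svd(A_n)
x_true = Vt[0]

rng = np.random.default_rng(0)
y_delta = noisy_data(A_n, x_true, delta=0.005, rng=rng)

print("largest singular value:", s[0], "continuous operator:", 2 / np.pi)
print("SNR in dB:", snr_db(A_n @ x_true, y_delta))
```

The largest singular value of the continuous operator is \(2/\pi \), which the discretization should approximate for moderately large n.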

We aim at recovering \(x^\dagger \) from \(y^\delta \) considering the setting established in Definition 4.1 for \({R(\cdot )=\frac{1}{2}\Vert \cdot \Vert ^2}\). That means that the solution x is parametrized by the operator B. Solving the inverse problem is now equivalent to finding an optimal B that minimizes the loss function (1.1) for the single data point \((z, y^\delta )\).

To find such a B, we go back to the DIP and the neural network approach. We write x(B) as the output of the network \(\varphi _\varTheta \) defined in (3.1) with some randomly initialized input z. We optimize with respect to B, which is a matrix in the discretized setting, and obtain a minimizer \(B_{\text {opt}}\) of (1.1). For more details, please refer to “Appendix 3.”

Fig. 6 Example of \(y^\delta \) for \(x^\dagger = u\) (singular function) with a SNR of \(17.06\,\text {dB}\) (Color figure online)

Fig. 7 Example of a more general \(y^\delta \) with a SNR of \(18.97\,\text {dB}\) (Color figure online)

In Fig. 8, we show some reconstruction results. The first plot of each row contains the true solution \(x^\dagger ,\) the standard Tikhonov solution x(A) and the reconstruction obtained with the analytic deep inverse approach \(x(B_{\text {opt}})\) after B converged. For each case, we provide additional plots depicting:

  • The true error of the network’s output x(B) after each update of B on a logarithmic scale.

  • The squared Frobenius norm of \(B_k-B_{k+1}\) after each update of B.

  • The matrix \(B_{\text {opt}}\).

For all choices of \(\alpha \), the training of B converges to a matrix \(B_{\text {opt}}\), such that \(x(B_{\text {opt}})\) has a smaller true error than x(A). In the third plot of each row, one can check that B indeed converges to some matrix \(B_{\text {opt}}\), which is shown in the last plot. The networks were trained using gradient descent with a learning rate of 0.05.
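The training of B can be imitated in a few lines of NumPy. Note that this sketch is not the network implementation from “Appendix 3”: instead of unrolling the network and relying on automatic differentiation, it uses the closed form \(x(B) = (B^TB + \alpha I)^{-1}B^Ty^\delta \) (valid for \(R = \frac{1}{2}\Vert \cdot \Vert ^2\)), the loss \(\frac{1}{2}\Vert A x(B) - y^\delta \Vert ^2\) and a gradient obtained by implicit differentiation. The function names, the grid size, \(\alpha \), \(\delta \) and the number of steps are assumptions; only the learning rate 0.05 and the initialization \(B = A\) are taken from the text.

```python
import numpy as np


def solve_x(B, y, alpha):
    """x(B) = (B^T B + alpha I)^{-1} B^T y, i.e. R = 0.5 * ||.||^2."""
    n = B.shape[1]
    return np.linalg.solve(B.T @ B + alpha * np.eye(n), B.T @ y)


def loss(B, A, y, alpha):
    """0.5 * ||A x(B) - y||^2."""
    r = A @ solve_x(B, y, alpha) - y
    return 0.5 * float(r @ r)


def grad_B(B, A, y, alpha):
    """Gradient of the loss with respect to B (implicit differentiation)."""
    M = B.T @ B + alpha * np.eye(B.shape[1])
    x = np.linalg.solve(M, B.T @ y)
    g = np.linalg.solve(M, A.T @ (A @ x - y))
    return np.outer(y - B @ x, g) - np.outer(B @ g, x)


def train_B(A, y, alpha, lr=0.05, n_steps=500):
    """Plain gradient descent on B, initialized with B = A."""
    B = A.copy()
    for _ in range(n_steps):
        B -= lr * grad_B(B, A, y, alpha)
    return B


# Hypothetical toy setup: integration operator, x_true = first singular vector.
n, alpha, delta = 30, 1e-2, 1e-2
A = np.tril(np.ones((n, n))) / n
U, s, Vt = np.linalg.svd(A)
x_true = Vt[0]
rng = np.random.default_rng(0)
y = A @ x_true + delta * rng.standard_normal(n)

B_opt = train_B(A, y, alpha)
print("loss at B = A:    ", loss(A, A, y, alpha))
print("loss at B_opt:    ", loss(B_opt, A, y, alpha))
print("error of x(A):    ", np.linalg.norm(solve_x(A, y, alpha) - x_true))
print("error of x(B_opt):", np.linalg.norm(solve_x(B_opt, y, alpha) - x_true))
```

For \(y^\delta = (\sigma + \delta ) v\) and \(B = A\), the gradient computed by `grad_B` reduces to a multiple of \(v u^*\) whose coefficient equals \(c_\ell /\eta \) from Lemma 4.3, which gives a simple consistency check of the formula.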

The theoretical findings of the previous subsections allow us to compute either the exact update (4.16) for B in the rather unrealistic case that \(y^\delta = (\sigma + \delta ) v\), or the exact solution \(x(B_\alpha , y^\delta )\) if we restrict B to have the same system of singular functions as A (Theorem 4.1). In the numerical experiments, we do not impose any of these restrictions, and therefore, we cannot directly apply our theoretical results. Instead, we implement the network approach (see “Appendix 3”) to be able to find \(B_{\text {opt}}\) in a more general scenario. Nevertheless, as can be observed in the last plot of each row in Fig. 8, \(B_{\text {opt}}\) contains some patterns that reflect, to some extent, that B keeps the same singular system but with different singular values. Namely, B is updated in a similar way as in (4.16). With the current implementation, we could also use more complex regularization functionals R in order to reduce the gap between our analytic approach and the original DIP. This is also a motivation for further research.

Fig. 8 Reconstructions corresponding to \(y^\delta \) as in Fig. 6 (first and second row) and Fig. 7 (third and fourth row) for different values of \(\alpha \). The broken line in the second plot of each row indicates the true error of the standard Tikhonov solution x(A). The horizontal axis in the second and third plots indicates the number of weight updates (Color figure online)

Summary and Conclusion

In this paper, we investigated the concept of deep inverse priors/regularization by architecture. This approach neither requires massive amounts of ground truth/surrogate data, nor pretrained models/transfer learning. The method is based on a single measurement. We started by giving different qualitative interpretations of what regularization is and specifically how regularization by architecture fits into this context.

We followed up with the introduction of the analytic deep prior by explicitly showing how unrolled proximal gradient architectures allow for a somewhat transparent regularization by architecture. Specifically, we showed that their results can be interpreted as solutions of optimized Tikhonov functionals and proved precise equivalences to regularization techniques. We further investigated this point of view with an academic example, where we implemented the analytic deep inverse prior and tested its numerical applicability. The experiments were consistent with our theoretical findings and gave promising reconstructions.

As in deep learning in general, much work remains to be done in order to gain a good understanding of deep inverse priors, but we see much potential in the idea of using deep architectures to regularize inverse problems, especially since an enormous part of the deep learning community is already concerned with the understanding of deep architectures.