Regularization by architecture: A deep prior approach for inverse problems

The present paper studies so-called deep image prior (DIP) techniques in the context of ill-posed inverse problems. DIP networks have been recently introduced for applications in image processing; also first experimental results for applying DIP to inverse problems have been reported. This paper aims at discussing different interpretations of DIP and at obtaining analytic results for specific network designs and linear operators. The main contribution is to introduce the idea of viewing these approaches as the optimization of Tikhonov functionals rather than of networks. Besides theoretical results, we present numerical verifications.


Introduction
Deep image priors (DIP) were recently introduced in deep learning for some tasks in image processing [19]. Usually, deep learning approaches to inverse problems proceed in two steps. In a first step (training), the parameters Θ of the deep neural network ϕ_Θ are optimized by minimizing a suitable loss function using large sets of training data. In a second step (application), new data is fed into the network for solving the desired task.
DIP approaches are radically different; they are based on unsupervised training using only a single data point y^δ. More precisely, in the context of inverse problems, where we aim at solving ill-posed operator equations Ax ∼ y^δ, the task of DIP is to train a network ϕ_Θ(z) with parameters Θ by minimizing the simple loss function

min_Θ ‖Aϕ_Θ(z) − y^δ‖². (1.1)

The minimization is with respect to Θ; the random input z is kept fixed. After training, the solution to the inverse problem is approximated directly by x = ϕ_Θ(z).
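As an illustration, the following minimal sketch (not from the paper; the two-layer network, its sizes, the step size, and the integration-type forward matrix are all arbitrary assumptions) minimizes ‖Aϕ_Θ(z) − y^δ‖² over Θ with manually coded gradients and a fixed iteration budget as the stopping criterion:

```python
import numpy as np

rng = np.random.default_rng(0)
n, h = 20, 30                           # signal size and hidden width (arbitrary)
A = np.tril(np.ones((n, n))) / n        # an integration-type forward operator
x_true = np.sin(np.linspace(0, np.pi, n))
y_delta = A @ x_true + 0.01 * rng.standard_normal(n)   # noisy data y^delta

z = rng.standard_normal(n)              # fixed random input z
W1 = 0.1 * rng.standard_normal((h, n)); b1 = np.zeros(h)
W2 = 0.1 * rng.standard_normal((n, h))

lr, losses = 0.05, []
for _ in range(1000):                   # early stopping = fixed small budget
    h1 = np.tanh(W1 @ z + b1)           # forward pass of phi_Theta(z)
    x = W2 @ h1
    res = A @ x - y_delta
    losses.append(float(res @ res))     # ||A phi_Theta(z) - y^delta||^2
    g = 2.0 * A.T @ res                 # gradient of the loss w.r.t. x
    dpre = (W2.T @ g) * (1.0 - h1 ** 2) # backprop through tanh
    W2 -= lr * np.outer(g, h1)          # gradient steps on Theta = (W1, b1, W2)
    W1 -= lr * np.outer(dpre, z)
    b1 -= lr * dpre

x_dip = W2 @ np.tanh(W1 @ z + b1)       # the reconstruction x = phi_Theta(z)
```

Note that only the single pair (z, y^δ) enters the loop; no training set is used.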
In image processing, common choices for A are the identity operator (denoising) or a projection operator onto a subset of the image domain (inpainting). For these applications, it has been observed that minimizing the functional iteratively by gradient descent methods in combination with a suitable stopping criterion leads to amazing results [19].
Training with a single data point is the most striking property, which separates DIP from other neural network concepts. One might argue that the astonishing results [10,19,23,32] are only possible if the network architecture is fine-tuned to the specific task. This is true for obtaining optimal performance; nevertheless, the reported numerical results perform well even with somewhat generic network architectures such as autoencoders.
We are interested in analyzing DIP approaches for solving ill-posed inverse problems. As a side remark, we note that the applications (denoising, inpainting) mentioned above are modeled by either identity or projection operators, which are not ill-posed in the functional analytical setting [13,21,28]. Typical examples of ill-posed inverse problems correspond to compact linear operators, such as a large variety of tomographic measurement operators or parameter-to-state mappings for partial differential equations.
We aim at analyzing a specific network architecture ϕ_Θ, at interpreting the resulting DIP approach as a regularization technique in the functional analytical setting, and at proving convergence properties for the minimizers of (1.1). In particular, we are interested in network architectures which themselves can be interpreted as a minimization algorithm that solves a regularized inverse problem of the form ‖Bx − y^δ‖² + αR(x), where R is a given convex functional and B a learned operator.
In general, deep learning approaches for inverse problems have their own characteristics, and naive applications of neural networks can fail for even the most simple inverse problems, see [22]. However, there is a growing number of compelling numerical experiments using suitable network designs for some of the toughest inverse problems, such as photo-acoustic tomography [17] or X-ray tomography with very few measurements [2,18]. Concerning networks based on deep prior approaches for inverse problems, first experimental investigations have been reported, see [19,23,32]. Similar as for the above-mentioned tasks in image processing, DIPs for inverse problems rely on two ingredients:
1. A suitable network design, which leads to our phrase "regularization by architecture".
2. Training algorithms for iteratively minimizing (1.1) with respect to Θ in combination with a suitable stopping criterion.
In this paper, we present different mathematical interpretations of DIP approaches, and we analyze two network designs in the context of inverse problems in more detail. The paper is organized as follows: In Section 2, we discuss some relations to existing results and give a short survey of the related literature. In Section 3, we then state different interpretations of DIP approaches and the network architectures that we use, as a basis for the subsequent analysis. We start with a first mathematical result for a trivial network design, which yields a connection to Landweber iterations. We then consider a fully connected feedforward network with L identical layers, which generates a proximal gradient descent for a modified Tikhonov functional. In Section 4, we use this last connection to define the notion of analytic deep prior networks, for which one can strictly analyze regularization and convergence properties. The key to the theoretical findings is a change of view, which allows for the interpretation of DIP approaches as optimizing families of Tikhonov functionals. Finally, we exemplify our theoretical findings with numerical examples for the standard linear integration operator.

Deep prior and related research
We start with a description of general deep prior concepts. Afterwards, we address similarities and differences to other approaches, such as LISTA [16], in more detail.

The deep prior approach
Present results on deep prior networks utilize feedforward architectures. In general, a feedforward neural network is an algorithm that starts with input x_0 = z and computes iteratively

x_{k+1} = φ(W_k x_k + b_k),  k = 0, ..., L − 1.

The parameters of this system are denoted by Θ = (W_k, b_k)_{k=0,...,L−1}, and φ denotes a non-linear activation function.
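Such a forward pass, written here for untied layer-wise parameters Θ = (W_k, b_k)_k and φ = tanh (both illustrative choices), might look as follows:

```python
import numpy as np

def feedforward(z, weights, biases, phi=np.tanh):
    """Compute x_L by iterating x_{k+1} = phi(W_k x_k + b_k) from x_0 = z."""
    x = z
    for W, b in zip(weights, biases):
        x = phi(W @ x + b)
    return x

# tiny usage example with random parameters Theta = ((W_k, b_k))_k
rng = np.random.default_rng(0)
weights = [rng.standard_normal((4, 3)), rng.standard_normal((2, 4))]
biases = [np.zeros(4), np.zeros(2)]
out = feedforward(rng.standard_normal(3), weights, biases)
```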
In order to highlight one of the unique features of deep image priors, let us first refer to classical generative networks that require training on large data sets.
In this classical setting we are given an operator A: X → Y between Hilbert spaces X, Y, as well as a set of training data (x_i, y_i^δ), where y_i^δ is a noisy version of Ax_i satisfying ‖y_i^δ − Ax_i‖ ≤ δ. Here the usual deep learning approach is to use a network for direct inversion, and the parameters Θ of the network are obtained by minimizing the loss function

min_Θ Σ_i ‖ϕ_Θ(y_i^δ) − x_i‖². (2.1)

After training, Θ is fixed and the network is used to approximate the solution of the inverse problem with new data y^δ by computing x = ϕ_Θ(y^δ). For a recent survey on this approach and more general deep learning concepts for inverse problems, see [5].
In general, this approach relies on the underlying assumption that complex distributions of suitable solutions x, e.g., the distribution of natural images, can be approximated by neural networks [6,8,33]. The parameters Θ are trained for the specific distribution of training data and are fixed after training. One then expects that choosing a new data set as input, i.e., z = y^δ, will generate a suitable solution to Ax ∼ y^δ [7]. Hence, after training, the distribution of solutions is parametrized by the inputs z.
In contrast, DIP is an unsupervised approach using only a single data point for training. That means, for given data y^δ and fixed z, the parameters Θ of the network ϕ_Θ are obtained by minimizing the loss function (1.1). The solution to the inverse problem is then denoted by x = ϕ_Θ(z). Hence, deep image priors keep z fixed and aim at parametrizing the solution with Θ. It has been observed in several works [10,19,23,32] that this approach indeed leads to remarkable results for problems such as inpainting or denoising.
To some extent, the success of deep image priors is rooted in the careful design of network architectures. For example, [19] uses a U-Net-like hourglass architecture with skip connections, and the amazing results show that such an architecture implicitly captures some statistics of natural images. However, in general, the DIP learning process may converge towards noisy images or undesirable reconstructions. The whole success relies on a combination of the architecture with a suitable optimization method and stopping criterion. Nevertheless, the authors claim that the architecture has a positive impact on the exploration of the solution space during the iterative optimization of Θ. They show that the training process descends quickly to "natural-looking" images but requires many more steps to produce noisy images. This is also supported by the theoretical results of [29] and the observations of [35], which show that deep networks can fit noise very well but need more training time to do so. Another paper that hints in this direction is [4], which analyzes whether neural networks have a bias towards approximating low frequencies.
There are already quite a few works that deal with deep prior approaches. In the following, we mention the ones most relevant to our work. The original deep image prior article [19] introduces the DIP concept and presents experimental evidence that today's network architectures are in and of themselves conducive to image reconstruction. Another work [32] explores the applicability of DIP to problems in compressed sensing. Also, [23] discusses how to combine DIP with the regularization by denoising approach, and [10] explores DIP in the context of stationary Gaussian processes. All of these introduce and discuss variants of DIP concepts; however, none of them addresses the intrinsic regularizing properties of the network concerning ill-posed inverse problems.

Deep prior and unrolled proximal gradient architectures
A major part of this paper is devoted to analyzing the DIP approach in combination with an unrolled proximal gradient network ϕ_Θ. Hence, there is a natural connection to the well-established analysis of LISTA schemes. Before we sketch the state of research in this field, we highlight the two major differences (loss function, training data) to the present approach. LISTA is based on supervised training using multiple data points (x_i, y_i^δ), i = 1, ..., N, where y_i^δ is a noisy representation of Ax_i. The loss function is (2.1). DIP, however, is based on unsupervised learning using the loss function (1.1) and a single data point y^δ. Hence, DIP with the unrolled proximal gradient network shares its architecture with LISTA, but its concept, as well as its analytic properties, are different. Nevertheless, the analysis we present in Section 4 will exhibit structures similar to the ones appearing in the LISTA-related literature. Hence, we shortly review the major contributions in this field.
Similarities are most visible when considering algorithms and convergence analysis for sparse coding applications [20,25,30,31,34]. The field of sparse coding makes heavy use of proximal splitting algorithms and, since the advent of LISTA, of trained architectures inspired by truncated versions of these algorithms. In the broadest sense, all of these methods are expressions of "Learning to learn by gradient descent" [3]. Once more, we would like to emphasize that these results utilize multiple data points, while DIP does not require any training data but only one measurement. Another key difference is that we approach the topic from an ill-posed inverse problem perspective, which (a) grounds our approach in the functional analytic realm and (b) considers ill-posed (not only ill-conditioned) problems in the Nashed sense, i.e., allows the treatment of unstable inverses [13]. These two points fundamentally differentiate the present approach from traditional compressed sensing considerations, which usually deal with (a) finite dimensional formulations and (b) forward operators given by well-conditioned, carefully hand-crafted settings or dictionaries, which are optimized using large sets of training data [30].
Coming back to LISTA for sparse coding applications, there are many excellent papers [15,24,25] devoted to a strict mathematical analysis of different aspects of LISTA-like approaches. In [25], the authors show under which conditions sparse coding can benefit from LISTA-like trained structures and ask how good trained sparsity estimators can be, given a computational budget. The article [15] deals with a similar trade-off, proposing the quite exciting "inexact proximal gradient descent". The paper [9] proposes, based on theoretically founded considerations, a sibling architecture to LISTA. Moreover, [27] argues that deep learning architectures, in general, can be interpreted as multi-stage proximal splitting algorithms.
Finally, we want to point to publications which address deep learning with only a few data points for training, see, e.g., [14] and the references therein. However, they do not address the architectures relevant for our publication, and they do not refer to the specific complications of inverse problems.

Interpretations of deep prior approaches
In this section, we discuss different perspectives on deep prior networks, which open the path to provable mathematical results. The first two subsections are devoted to special network architectures, and the last two subsections deal with more general points of view.

A trivial architecture
We aim at solving ill-posed inverse problems. For a given operator A, the general task in inverse problems is to recover an approximation of x† from measured noisy data

y^δ = Ax† + τ,

where τ, with ‖τ‖ ≤ δ, describes the noise in the measurement.
The deep image prior approach to inverse problems asks to train a network ϕ_Θ(z) with parameters Θ and fixed input z by minimizing ‖Aϕ_Θ(z) − y^δ‖² with an optimization method such as gradient descent with early stopping. After training, a final run of the network computes x = ϕ_Θ(z) as an approximation to x†.
We consider a trivial single-layer network without activation function, see Figure 3.1. This network simply outputs Θ, i.e., ϕ_Θ(z) = Θ. In this case, the network parameter Θ is a vector, which is chosen to have the same dimension as x. That means that training the network by gradient descent of ‖Aϕ_Θ(z) − y^δ‖² = ‖AΘ − y^δ‖² with respect to Θ is equivalent to the classical Landweber iteration, which is a gradient descent method for ‖Ax − y^δ‖² with respect to x.
Landweber iterations converge slowly. However, in combination with a suitable stopping rule, they are optimal regularization schemes for diminishing noise level δ → 0 [13,21,28]. Despite the apparent trivialization of the neural network approach, this shows that there is potential in training such networks with a single data point for solving ill-posed inverse problems.
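The equivalence can be checked directly: gradient descent on ‖AΘ − y^δ‖² over the parameter vector Θ and the classical Landweber iteration on x apply the identical update and hence produce identical iterates. A small numerical sketch (forward matrix, data, and iteration count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 30
A = np.tril(np.ones((n, n))) / n                 # forward operator (arbitrary choice)
x_true = np.linspace(0.0, 1.0, n) ** 2
y_delta = A @ x_true + 1e-3 * rng.standard_normal(n)

eta = 1.0 / np.linalg.norm(A, 2) ** 2            # step size within the stable range

theta = np.zeros(n)   # trivial network: phi_Theta(z) = Theta, the output IS the parameter
x_lw = np.zeros(n)    # classical Landweber iterate
for _ in range(200):
    theta -= eta * A.T @ (A @ theta - y_delta)   # gradient step on ||A Theta - y||^2 (factor 2 absorbed in eta)
    x_lw -= eta * A.T @ (A @ x_lw - y_delta)     # Landweber step: identical formula
```

With early stopping, the number of iterations (here 200) plays the role of the regularization parameter.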

Unrolled proximal gradient architecture
In this section, we aim at rephrasing DIP, i.e., the minimization of (1.1) with respect to Θ , as an approach for learning optimized Tikhonov functionals for inverse problems.This change of view, i.e., regarding deep inverse priors as optimization of functionals rather than networks, opens the way for analytic investigations in Section 4.
We use the particular architecture introduced in [16], i.e., a fully connected feedforward network with L layers of identical size, where

x_0 = z,  x_{k+1} = φ(W x_k + b),  k = 0, ..., L − 1, (3.1)

and the output is ϕ_Θ(z) = x_L. (3.2)

The affine linear map Θ = (W, b) is the same for all layers.
The matrix W is restricted to obey I − W = λB*B (I denotes the identity operator) for some B, and the bias is determined via b = λB*y^δ, see Figure 3.2. If the activation function of the network is chosen as the proximal mapping of the regularizing functional λαR, then ϕ_Θ(z) is identical to the L-th iterate of a proximal gradient descent method for minimizing

J_B(x) = ‖Bx − y^δ‖² + αR(x), (3.3)

see [12] or Appendix I.
Remark 3.1 Restricting activation functions to be proximal mappings is not as severe as it might look at first glance. E.g., ReLU is the proximal mapping of the indicator function of the nonnegative real numbers, and soft shrinkage is the proximal mapping of the modulus function.
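Both identities can be verified numerically by brute-force minimization of the prox objective ½(x − v)² + g(x) on a grid (a sketch; the grid range and the threshold λ = 0.7 are arbitrary choices):

```python
import numpy as np

def prox_numeric(g, v, grid):
    """Brute-force prox_g(v): minimize 0.5*(x - v)^2 + g(x) over a grid."""
    return grid[np.argmin(0.5 * (grid - v) ** 2 + g(grid))]

grid = np.linspace(-5.0, 5.0, 100001)   # grid step 1e-4
lam = 0.7                               # shrinkage threshold (arbitrary)

def indicator_nonneg(x):                # 0 on [0, inf), "infinite" penalty elsewhere
    return np.where(x >= 0, 0.0, 1e12)

def relu(v):
    return max(v, 0.0)

def soft_shrinkage(v):                  # prox of lam * |.|
    return np.sign(v) * max(abs(v) - lam, 0.0)
```

For any test point v, `prox_numeric(indicator_nonneg, v, grid)` agrees with `relu(v)`, and `prox_numeric(lambda x: lam * np.abs(x), v, grid)` agrees with `soft_shrinkage(v)`, up to the grid resolution.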
This allows the interpretation that every weight update, i.e., every gradient step for minimizing (1.1) with respect to Θ or B, changes the functional J_B. Hence, DIP can be regarded as optimizing a functional, which in turn is minimized by the network. This view is the starting point for investigating convergence properties in Section 4.

Two perspectives based on regression
The following subsections address more general concepts, which open the way to further analytic investigations that, however, are not pursued in this paper. Readers interested only in the regularization properties of DIP approaches for inverse problems may jump directly to Section 4.
In this subsection we present two different perspectives on solving inverse problems with the DIP via the minimization of a functional, as discussed in the subsection above.

The first perspective is based on a reinterpretation of the minimization of the functional (1.1) in the finite, real setting, i.e., A ∈ R^{m×n}. This setting allows us to write

min_{x ∈ R(ϕ_•(z))} Σ_i (⟨a_i, x⟩ − y_i^δ)²,

where R(ϕ_•(z)) denotes the range of the network with regard to Θ for a fixed z, a_i denotes the rows of the matrix A, and y_i^δ the entries of the vector y^δ. This allows for the interpretation that we are solving a linear regression, parametrized by x, which is constrained by a deep learning hypothesis space and given by data pairs of the form (a_i, y_i^δ).

The second perspective is based on rewriting the optimization problem via the method of Lagrange multipliers. We start by considering the constrained optimization problem

min_Θ ‖Aϕ_Θ(z) − y^δ‖²  subject to  R(ϕ_Θ(z)) ≤ τ.

If we now assume that ϕ has continuous first partial derivatives with regard to Θ, the Lagrange functional, with the correct Lagrange multiplier λ = λ_0, has a stationary point at each minimum of the original constrained optimization problem. This gives us a direct connection to unconstrained variational approaches like Tikhonov functionals.

The Bayesian point of view
The Bayesian approach to inverse problems focuses on computing MAP (maximum a posteriori probability) estimators, i.e., one aims for

x_MAP = argmax_x p(x | y^δ),

where p denotes the respective probability densities. From standard Bayesian theory we obtain

p(x | y^δ) ∝ p(y^δ | x) (I) · p(x) (II).
The data y^δ only contains information about x_⊥, which in classical regularization is exploited by restricting any reconstruction to N(A)^⊥. However, if available, p(x_N | x_⊥) is a measure of how to extend x_⊥ ∈ N(A)^⊥ with an x_N ∈ N(A) to a suitable x = (x_N, x_⊥). The classical regularization of inverse problems uses the trivial extension by zero, i.e., x = (0, x_⊥), which is not necessarily optimal. If we accept the interpretation that a network can be a meaningful parametrization of the set of suitable solutions x, then p(x) ≡ 0 for all x not in the range of the network, and optimizing the network will indeed yield a nontrivial completion x = (x_N, x_⊥). More precisely, (I) can be interpreted as a deep prior on the measurement and (II) as a deep prior on the nullspace part of the problem.

Deep priors and Tikhonov functionals
In this section, we consider the particular network architecture given by unrolled proximal gradient schemes, see Section 3.2. We aim at embedding this approach into the classical regularization theory for inverse problems. For a strict mathematical analysis, we will introduce the notion of an analytic deep prior network, which then allows interpreting the training of the deep prior network as an optimization of a Tikhonov functional. The main result of this section is Theorem 4.2, which states that analytic deep priors in combination with a suitable stopping rule are indeed order optimal regularization schemes. Numerical experiments in Section 4.2 demonstrate that such deep prior approaches lead to smaller reconstruction errors when compared with standard Tikhonov reconstructions. The superiority of this approach can be proved, however, only for the rather unrealistic case that the solution coincides with a singular function of A.

Unrolled proximal gradient networks as deep priors for inverse problems
In this section, we consider linear operators A and aim at rephrasing DIP, i.e., the minimization of (1.1) with respect to Θ, as a constrained optimization problem. This change of view, i.e., regarding deep inverse priors as an optimization of a simple but constrained functional rather than of networks, opens the way for analytic investigations. We will use an unrolled proximal gradient architecture for the network ϕ_Θ(z) in (1.1). The starting point for our investigation is the common observation, see [11,16] or Appendix I, that an unrolled proximal gradient scheme as defined in Section 3.2 approximates a minimizer x(B) of (3.3). Assuming that a unique minimizer x(B) exists, and neglecting the difference between x(B) and the approximation ϕ_Θ(z) achieved by the unrolled proximal gradient, motivates the following definition of analytic deep priors.

Definition 4.1 Assume that for every B ∈ L(X,Y) there is a unique minimizer x(B) of J_B, and consider

min_B ‖A x(B) − y^δ‖²  subject to  x(B) = argmin_x ‖Bx − y^δ‖² + αR(x). (4.1)

We call this constrained minimization problem an analytic deep prior and denote by x(B) the resulting solution to the inverse problem posed by A and y^δ.
We can also use this technical definition as the starting point of our considerations and retrieve the neural network architecture by considering the following approach for solving the minimization problem stated in the above definition. Assuming that R has a proximal operator, we can compute x(B), given B, via the proximal gradient method, i.e., via the (for a suitable choice of λ > 0 and an arbitrary x^0 = z ∈ X) converging iteration

x^{k+1} = Prox_{λαR}(x^k − λB*(Bx^k − y^δ)).

Following this iteration for L steps can be seen as the forward pass of a particular architecture of a fully connected feed-forward network with L layers of identical size, as described in (3.1) and (3.2). The affine linear map given by Θ = (W, b) is the same for all layers. Moreover, the activation function of the network is given by the proximal mapping of λαR, the matrix W is given via I − W = λB*B (I denotes the identity operator), and the bias is determined by b = λB*y^δ. From now on we will assume that the difference between x^L and x(B) is negligible, i.e.,

x^L = x(B). (4.4)

Remark 4.1 The task in the DIP approach is to find Θ (the network parameters). Analogously, in the analytic deep prior, we try to find the operator B.
We now examine the analytic deep prior utilizing the proximal gradient descent approach to compute x(B). Therefore, we will focus on the minimization of (4.1) with respect to B for given data y^δ by means of gradient descent.
The stationary points are characterized by ∂F(B) = 0, and gradient descent iterations with stepsize η are given by B^{ℓ+1} = B^ℓ − η ∂F(B^ℓ). Hence we need to compute the derivative of F with respect to B.
Lemma 4.1 Consider an analytic deep prior with the proximal gradient descent approach as described above. The derivative of F with respect to B can be computed explicitly, which leads to the gradient descent B^{ℓ+1} = B^ℓ − η ∂F(B^ℓ).

This lemma allows us to obtain an explicit description of the gradient descent for B, which in turn leads to an iteration of functionals J_{B^ℓ} and minimizers x(B^ℓ). We will now exemplify this derivation for a rather academic example, which, however, highlights in particular the differences between the classical Tikhonov minimizer, i.e., x_α = (A*A + αI)^{−1} A*y^δ, and the solution of the DIP approach.

Example
In this example we examine analytic deep priors for linear inverse problems A: X → Y, i.e., A, B ∈ L(X,Y), with R(x) = ‖x‖². The rather abstract characterization of the previous section can be made explicit for this setting. Since J_B(x) is the classical Tikhonov functional, whose minimizer is given by

x(B) = (B*B + αI)^{−1} B*y^δ,

we can rewrite the analytic deep prior reconstruction as x(B), where B minimizes F(B) = ‖A x(B) − y^δ‖².

Lemma 4.2 Following Lemma 4.1, assuming B_0 = A and computing one step of gradient descent to minimize the functional with respect to B yields an explicit update for B.

This expression nicely collapses if y^δ(y^δ)* commutes with AA*. For illustration we assume the rather unrealistic case that x† = u, where u is a singular function of A with singular value σ. The dual singular function is denoted by v, i.e., Au = σv and A*v = σu, and we further assume that the measurement noise in y^δ is in the direction of this singular function, i.e., y^δ = (σ + δ)v, see Figure 4.1. In this case, the problem is indeed one-dimensional and we obtain an iteration restricted to the span of u, resp. the span of v.

Lemma 4.3 The setting described above yields an explicit gradient step for the functional in (4.12), and the resulting iteration (4.16) in turn produces a sequence x(B_ℓ) with a unique attractive stationary point.

For comparison, the classical Tikhonov regularization would yield σ/(σ² + α) · (σ + δ)u. This is depicted in Figure 4.2.
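Even without the explicit formula of Lemma 4.2, one gradient step on F(B) = ‖Ax(B) − y^δ‖² starting from B_0 = A can be mimicked numerically, using the closed-form Tikhonov minimizer and a finite-difference gradient (a sketch; the sizes, α, and the step size are illustrative assumptions):

```python
import numpy as np

def x_of_B(B, y_delta, alpha):
    """Closed-form minimizer of J_B(x) = ||Bx - y||^2 + alpha * ||x||^2."""
    n = B.shape[1]
    return np.linalg.solve(B.T @ B + alpha * np.eye(n), B.T @ y_delta)

def F(B, A, y_delta, alpha):
    """Outer objective F(B) = ||A x(B) - y||^2."""
    r = A @ x_of_B(B, y_delta, alpha) - y_delta
    return float(r @ r)

def grad_F(B, A, y_delta, alpha, eps=1e-6):
    """Central finite-difference gradient of F with respect to the entries of B."""
    G = np.zeros_like(B)
    for i in range(B.shape[0]):
        for j in range(B.shape[1]):
            E = np.zeros_like(B)
            E[i, j] = eps
            G[i, j] = (F(B + E, A, y_delta, alpha)
                       - F(B - E, A, y_delta, alpha)) / (2 * eps)
    return G

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
y_delta = rng.standard_normal(5)
alpha = 0.1
B0 = A.copy()                                    # start the descent at B_0 = A
B1 = B0 - 1e-4 * grad_F(B0, A, y_delta, alpha)   # one small gradient step
```

A single small step already decreases the outer objective, i.e., x(B_1) fits the data through A better than the plain Tikhonov solution x(A).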

Fig. 4.2 Comparison of the Tikhonov reconstruction (orange dashed line), the result obtained in (4.17) (blue continuous line), and the direct inverse. In this example we considered α = 10⁻³.

Constrained system of singular functions
In the previous example we showed that if we perform gradient descent starting from B_0 = A and assume the rather simple case y^δ = (σ + δ)v, we obtain the iteration B_{ℓ+1} = B_ℓ − c_ℓ vu*, i.e., B_{ℓ+1} has the same singular functions as A and only one of the singular values is different.
We now analyze the optimization from a different perspective. Namely, we focus on directly finding a minimizer of (4.1) for a general y^δ ∈ Y; however, we restrict B to be an operator such that B*B commutes with A*A, i.e., A and B share a common system of singular functions. Hence, B has the representation

B = Σ_i β_i v_i u_i*,

where {(u_i, σ_i, v_i)} is the singular value decomposition of A.
That means we restrict the problem to finding optimal singular values β_i for B. In this case we show that a global minimizer exists and that it has interesting properties.
Theorem 4.1 For any y^δ ∈ Y there exists a global minimizer (in the constrained singular functions setting) of (4.1), given by B_α = Σ_i β_i^α v_i u_i* with

β_i^α = (σ_i + (σ_i² − 4α)^{1/2})/2 for σ_i ≥ 2√α,  and  β_i^α = √α otherwise.

Remark 4.3 The minimizer from Theorem 4.1 does not depend on y^δ, i.e., for all y^δ ∈ Y, B_α is a minimizer of (4.1). The solution to the inverse problem does still depend on y^δ, since x(B_α) = (B_α*B_α + αI)^{−1} B_α* y^δ.

Remark 4.4 In the original DIP approach, some of the parameters of the network may be similar for different y^δ, for example, the parameters of the first layers of the encoder part of the U-Net. Other parameters may strongly depend on y^δ. In this particular case of the analytic deep prior (constrained system of singular functions) we have an explicit separation of which parameters depend on y^δ (b = λB*y^δ) and which do not (W = I − λB*B).
From now on we consider the notation x(B, y^δ) to incorporate the dependency of x(B) on y^δ. Following the classical filter theory for order optimal regularization schemes [13,21,28], we obtain the following theorem.
Theorem 4.2 The mapping K_α: y^δ ↦ x(B_α, y^δ) is an order optimal regularization method, given by the filter functions

F_α(σ) = 1/σ for σ ≥ 2√α,  and  F_α(σ) = 1/(2√α) otherwise.

The regularized pseudoinverse K_α is quite similar to the truncated singular value decomposition (TSVD) but is a softer version because it does not have a jump (see Fig. 4.3). We call this method the Soft TSVD.
The disadvantage of Tikhonov, in this case, is that it damps all singular values, and the disadvantage of TSVD is that it throws away all the information related to small singular values. The Soft TSVD, on the other hand, does not damp the larger singular values (similar to TSVD) and does not discard the information related to smaller singular values but merely damps it (similar to Tikhonov). For a comparison of the filter functions, see Table 4.1. Moreover, it is interesting how this method emerges from Def. 4.1, which is stated in terms of the Tikhonov pseudoinverse, and that the optimal singular values do not depend on y^δ.

At this point the relation to the original DIP approach becomes more abstract. We considered a simplified network architecture, derived from an iterative algorithm for solving inverse problems, in which all layers share the same weights. That means, we let the solution of the original inverse problem be the solution of another problem with a different operator B. The DIP approach in this case is transformed into finding an optimal B, which allows us to carry out the analysis in the functional analytic setting. What we learn from the previous results is that we can establish interesting connections between the DIP approach and classical inverse problems theory. This is important because it shows that deep inverse priors can be used to solve genuinely ill-posed inverse problems.
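The three filters can be compared directly in code; the thresholds below (√α as the TSVD cutoff, 2√α as the Soft TSVD kink) are illustrative choices:

```python
import numpy as np

alpha = 1e-2                      # regularization parameter (illustrative)

def f_tikhonov(s):
    return s / (s ** 2 + alpha)   # damps every singular value

def f_tsvd(s, cut=np.sqrt(alpha)):
    # hard cutoff: keep 1/s above the threshold, discard everything below
    return np.where(s >= cut, 1.0 / s, 0.0)

def f_soft_tsvd(s, cut=2.0 * np.sqrt(alpha)):
    # no damping above the threshold, a constant cap below it -- continuous, no jump
    return np.where(s >= cut, 1.0 / s, 1.0 / cut)
```

Above its threshold the Soft TSVD filter equals 1/σ exactly (no damping), while Tikhonov always satisfies σ·F_α(σ) < 1; at the threshold the Soft TSVD filter is continuous, whereas TSVD jumps from 0 to 1/σ.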

Fig. 4.3 Filter functions of the Soft TSVD, TSVD, and Tikhonov methods.
Remark 4.5 In the original DIP, the input z to the network is chosen arbitrarily and is of minor importance. However, once the weights have been trained for a given y^δ, z cannot be changed, because that would affect the output of the network, i.e., it would change the obtained reconstruction. In the analytic deep prior, the input to the unrolled proximal gradient method is completely irrelevant (assuming an infinite number of layers). After finding the "weights" B, a different input will still produce the same solution x = x(B) = ϕ_Θ(z).
Remark 4.5 tells us that there is still a gap between the original DIP and the analytic one. This was expected, given the obvious trivialization of the network architecture, but it serves as motivation for further research.

Numerical experiments
We now use the analytic deep inverse prior approach for solving an inverse problem with the integration operator A: L²([0,1]) → L²([0,1]), (Ax)(t) = ∫₀ᵗ x(s) ds. A is a linear and compact operator, hence the inverse problem is ill-posed. Let A_n ∈ R^{n×n} be a discretization of A and x† ∈ R^n one of its discretized singular vectors u. We set the noisy data y^δ = A_n x† + δτ with τ ∼ Normal(0, 1/n), see Figure 4.4. A more general example, i.e., where x† is not restricted to be a singular function, is also included (Fig. 4.5). We aim at recovering x† from y^δ in the setting established in Def. 4.1, where the solution x is parametrized by the operator B. Solving the inverse problem is now equivalent to finding an optimal B that minimizes the loss function (1.1) for the single data point (z, y^δ).
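A minimal discretization of this setup might look as follows (the grid size, noise level, and choice of singular vector are arbitrary):

```python
import numpy as np

def integration_matrix(n):
    """Left Riemann sum discretization of (Ax)(t) = integral_0^t x(s) ds."""
    return np.tril(np.ones((n, n))) / n

rng = np.random.default_rng(7)
n = 64
A_n = integration_matrix(n)

# ground truth: one of the (right) singular vectors of A_n, as in the experiment
U, S, Vt = np.linalg.svd(A_n)
x_true = Vt[3]

delta = 1e-3
tau = rng.standard_normal(n) / np.sqrt(n)   # tau ~ Normal(0, 1/n) per component
y_delta = A_n @ x_true + delta * tau
```

The decaying singular values of A_n reflect the ill-posedness of the continuous problem: the condition number grows with the grid size n.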
To find such a B, we go back to the DIP and the neural network approach. We write x(B) as the output of the network ϕ_Θ defined in (3.1) with some randomly initialized input z. We optimize with respect to B, which is a matrix in the discretized setting, and obtain a minimizer B_opt of (1.1). For more details, please refer to Appendix III. For all choices of α, the training of B converges to a matrix B_opt such that x(B_opt) has a smaller true error than x(A). In the third plot of each row, one can check that B indeed converges to some matrix B_opt, which is shown in the last plot. The networks were trained using gradient descent with learning rate 0.05.

The theoretical findings of the previous subsections allow us to compute either the exact update (4.16) for B, in the rather unrealistic case that y^δ = (σ + δ)v, or the exact solution x(B_α, y^δ) if we restrict B to have the same system of singular functions as A (Theorem 4.1). In the numerical experiments we do not impose any of these restrictions, and therefore we cannot directly apply our theoretical results. Instead, we implement the network approach (see Appendix III) to find B_opt in a more general scenario. Nevertheless, as can be observed in the last plot of each row in Figure 4.6, B_opt contains some patterns reflecting, to some extent, that B keeps the same singular system but with different singular values. Namely, B is updated in a similar way as in (4.16). With the current implementation we could also use more complex regularization functionals R, in order to reduce the gap between our analytic approach and the original DIP. This is also a motivation for further research.

Summary and conclusion
In this paper, we investigated the concept of deep inverse priors / regularization by architecture. This approach requires neither massive amounts of ground truth / surrogate data nor pretrained models / transfer learning. The method is based on a single measurement. We started by giving different qualitative interpretations of what regularization is and, specifically, of how regularization by architecture fits into this context.
We followed up with the introduction of the analytic deep prior, explicitly showing how unrolled proximal gradient architectures allow for a somewhat transparent regularization by architecture. Specifically, we showed that their results can be interpreted as solutions of optimized Tikhonov functionals and proved precise equivalences to classical regularization techniques. We further investigated this point of view with an academic example, for which we implemented the analytic deep inverse prior and tested its numerical applicability. The experiments confirmed our theoretical findings and showed promising results.
There is obviously, as in deep learning in general, much work to be done in order to gain a thorough understanding of deep inverse priors, but we see great potential in the idea of using deep architectures to regularize inverse problems, especially since a large part of the deep learning community is already concerned with the understanding of deep architectures.

Appendix I: A reminder on minimization of Tikhonov functionals and the LISTA approach

In this section we consider only linear operators A and review the well-known theory for the Iterative Soft Shrinkage Algorithm (ISTA) as well as the slightly more general proximal gradient (PG) method [11,26] for minimizing Tikhonov functionals of the type

    J(x) = ‖Ax − y^δ‖² + αR(x).   (5.1)

We recapitulate the main steps in deriving ISTA and PG, as far as we need them for our motivation. The necessary first-order condition for a minimizer is given by

    0 ∈ A*(Ax − y^δ) + α∂R(x).   (5.2)

Multiplying with an arbitrary real positive number λ, adding x and rearranging yields

    x − λA*(Ax − y^δ) ∈ x + λα∂R(x).   (5.3)

For convex R, the term on the right-hand side is inverted by the (single-valued) proximal mapping of λαR, which yields the fixed point condition

    x = Prox_{λαR}(x − λA*(Ax − y^δ)).

Hence this is a necessary condition for all minimizers of J. Turning the fixed point condition into an iteration scheme yields the PG method

    x^{k+1} = Prox_{λαR}(x^k − λA*(Ax^k − y^δ)).   (5.5)

This structure is also the motivation for LISTA [16] approaches, where fully connected networks with L internal layers of identical size are used. Moreover, in some versions of LISTA, the affine maps between the layers are assumed to be identical. Denoting the values at the k-th layer by x^k, the iteration (5.5) reads

    x^{k+1} = Prox_{λαR}((I − λA*A)x^k + λA*y^δ).

This derivation can be rephrased as follows.
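As an illustration, the PG/ISTA iteration derived above can be implemented in a few lines. The following is an illustrative NumPy sketch (not the paper's code) using the soft-shrinkage proximal mapping of R(x) = ‖x‖₁ and hypothetical example values. With the scaling of the first-order condition above, the fixed point condition for A = I gives the explicit minimizer soft_threshold(y, α), which the sanity check reproduces.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal mapping of t * ||.||_1: componentwise soft shrinkage."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(A, y, alpha, lam, n_iter=1000, x0=None):
    """PG iteration x <- Prox_{lam*alpha*R}(x - lam * A^T (A x - y))."""
    x = np.zeros(A.shape[1]) if x0 is None else x0.copy()
    for _ in range(n_iter):
        x = soft_threshold(x - lam * A.T @ (A @ x - y), lam * alpha)
    return x

# Sanity check: for A = I the first-order condition 0 in (x - y) + alpha*d||x||_1
# is solved explicitly by x = soft_threshold(y, alpha).
y = np.array([1.0, -0.3, 0.05])
x = ista(np.eye(3), y, alpha=0.2, lam=0.5)
```

The step size λ must satisfy λ‖A‖² < 2 for the iteration to converge; for A = I, λ = 0.5 makes the update a contraction.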
Lemma 5.1 Let ϕ_Θ, Θ = (W, b), denote a fully connected network with input x^0 and L internal layers. Further assume that the activation function is identical to the proximal mapping of a convex functional λαR : X → ℝ. Assume W is restricted such that I − W is positive definite, i.e., there exists an operator B such that

    W = I − λB*B.   (5.9)

Furthermore, we assume that the bias term is fixed as b = λB*y^δ. Then ϕ_Θ(z) is the L-th iterate of an ISTA scheme with starting value x^0 = z for minimizing

    ‖Bx − y^δ‖² + αR(x).   (5.10)

Proof This follows directly from equation (5.5).
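Lemma 5.1 can be checked numerically. Assuming, as in the ISTA derivation, weights W = I − λB*B and bias b = λB*y^δ, and taking soft shrinkage as activation (the proximal mapping of λα‖·‖₁), L network layers reproduce L ISTA steps exactly; the sizes and data below are hypothetical.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal mapping of t * ||.||_1 (soft shrinkage), used as activation."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

rng = np.random.default_rng(1)
n, L = 6, 10
B = rng.normal(size=(n, n)) / n       # small operator, so lam*||B||^2 < 2
y = rng.normal(size=n)
alpha, lam = 0.1, 0.5
z = rng.normal(size=n)                # random network input, as in DIP

# Network of Lemma 5.1: W = I - lam*B^T B, bias b = lam*B^T y.
W = np.eye(n) - lam * B.T @ B
b = lam * B.T @ y
x_net = z.copy()
for _ in range(L):
    x_net = soft_threshold(W @ x_net + b, lam * alpha)

# L steps of ISTA for ||Bx - y||^2 + alpha*||x||_1 starting at x^0 = z.
x_ista = z.copy()
for _ in range(L):
    x_ista = soft_threshold(x_ista - lam * B.T @ (B @ x_ista - y), lam * alpha)
```

The two loops perform the same affine map followed by the same activation, so the outputs agree up to floating-point rounding.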

Appendix II: Proofs
Proof of Lemma 4.1 F is a functional which maps operators B to real numbers; hence, by the chain rule its derivative is given by

    F'(B) = x'(B)* A*(Ax(B) − y^δ),

which follows from classical variational calculus, see, e.g., [13]. The derivative of x(B) with respect to B can be computed using the fixed point condition for a minimizer of J_B, namely

    x(B) = Prox_{λαR}(x(B) − λB*(Bx(B) − y^δ)).   (5.14)

The adjoint operator x'(B)* is a mapping from X to L(X,Y), which can be derived from the defining relation ⟨x'(B)H, w⟩_X = ⟨H, x'(B)*w⟩_{L(X,Y)}.

Analyzing the stability of the fixed points of the resulting scalar iteration leads to a case distinction: the relevant derivative is positive for σ < 2√α and non-positive otherwise. This yields the single attractive fixed point β^(1) for σ < 2√α and the two attractive fixed points β^(3) and β^(4) otherwise. Since

    x(β^(3)) = x(β^(4)),   (5.26)

we therefore have a unique reconstruction.

Proof of Theorem 4.1 Let B = Σ_i β_i v_i u_i*. We want to find {β_i} minimizing

    F(B) = ‖Ax(B, y^δ) − y^δ‖².   (5.28)

The Tikhonov solution is given by

    x(B) = Σ_i β_i/(β_i² + α) ⟨y^δ, v_i⟩ u_i,   (5.30)

the result of applying the operator A to x(B) is

    Ax(B) = Σ_i σ_i β_i/(β_i² + α) ⟨y^δ, v_i⟩ v_i,   (5.31)

and y^δ = Σ_i ⟨y^δ, v_i⟩ v_i. Inserting (5.30) and (5.31) in (5.28) yields

    F(B) = Σ_i (σ_i β_i/(β_i² + α) − 1)² ⟨y^δ, v_i⟩².   (5.32)

In order to minimize F(B), note that σ_i β_i/(β_i² + α) < 1, so the optimal choice is to maximize the factor β_i/(β_i² + α), whose maximum is attained at β_i = √α. Therefore we set β_i = √α, which minimizes every term in the sum (5.32); hence we have found singular values {β_i} that minimize F(B).
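The last step, maximizing β/(β² + α), can be verified by elementary calculus or numerically on a grid; the α below is hypothetical. The maximum value 1/(2√α) is attained at β = √α:

```python
import numpy as np

alpha = 0.3                            # hypothetical regularization parameter
beta = np.linspace(1e-3, 5.0, 200001)  # fine grid over candidate singular values
g = beta / (beta**2 + alpha)           # the factor multiplying sigma_i in (5.32)
beta_star = beta[np.argmax(g)]         # numerical maximizer, approx sqrt(alpha)
g_max = g.max()                        # approx 1 / (2 * sqrt(alpha))
```

Setting g'(β) = (α − β²)/(β² + α)² = 0 gives β = √α analytically, matching the grid search.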

Fig. 3.1 A simple network with scalar input, a single layer, and no activation function. For any arbitrary input z one obtains ϕ_Θ(z) = Θ

Definition 4.1 Let us assume that measured data y^δ ∈ Y, a fixed α > 0, a convex penalty functional R : X → ℝ, and a measurement operator A ∈ L(X,Y) are given. We consider the minimization problem

    min_B F(B) = min_B ½ ‖Ax(B) − y^δ‖²,   (4.1)

subject to the constraint

    x(B) = argmin_x J_B(x) = argmin_x ½ ‖Bx − y^δ‖² + αR(x).   (4.2)

Fig. 4.4 Example of y^δ for x† = u (singular function) with an SNR of 17.06 dB.

Fig. 4.5 Example of a more general y^δ with an SNR of 18.97 dB.

Fig. 4.6 Reconstructions corresponding to y^δ as in Figure 4.4 (first and second rows) and Figure 4.5 (third and fourth rows) for different values of α. The broken line in the second plot of each row indicates the true error of the standard Tikhonov solution x(A). The horizontal axis in the second and third plots indicates the number of weight updates.

Table 4.1 Values of ν for which TSVD, Tikhonov, and the Soft TSVD are order optimal. For more details see the proof of Theorem 4.2 in Appendix II.