1 Introduction

The task of image restoration aims at recovering a clean and sharp unknown image \(\varvec{u}\in {\mathbb {R}}^{n}\) given a blurry and/or noisy measurement \(\varvec{g}\in {\mathbb {R}}^{m}\).

Mathematically, the restoration process can be modelled as a linear inverse problem:

$$\begin{aligned} \text {find} \quad \varvec{u}\in {\mathbb {R}}^{n} \quad s.t. \quad \varvec{H}\varvec{u}+ \varvec{\eta } = \varvec{g}, \end{aligned}$$
(1)

where \(\varvec{H}\in {\mathbb {R}}^{m \times n}\) is a known forward operator and \(\varvec{\eta } \in {\mathbb {R}}^{m}\) is the noise corrupting the data. In this work, we consider a zero-mean Additive White Gaussian Noise (AWGN) component with standard deviation \(\sigma _{\varvec{\eta }}\).

Linear inverse problems are well-known to be ill-posed [3]; therefore, recovering \(\varvec{u}\) from (1) by simply inverting \(\varvec{H}\) is not viable due to the lack of stability and/or uniqueness. The task is usually reformulated as that of computing, through a well-posed problem, an estimate \(\varvec{u}^*\) of the desired \(\varvec{u}\) which is as accurate as possible. In recent decades, several approaches have been proposed, ranging from classical variational regularization methods to deep learning based approaches [20, 23, 30, 37].

Variational regularization methods compute \(\varvec{u}^*\) as the solution of the following regularized optimization problem:

$$\begin{aligned} \varvec{u}^{*} \in \mathop {\mathrm {\text {argmin}}}\limits _{\varvec{u}\in {\mathbb {R}}^{n}}\ \dfrac{1}{2}\Vert \varvec{H}\varvec{u}- \varvec{g}\Vert _{2}^{2} + \lambda R(\varvec{u}), \end{aligned}$$
(2)

where the first and the second terms are referred to as data fidelity and regularization, respectively. The hyperparameter \(\lambda \) is a positive scalar typically called the regularization parameter. More generally, the data fidelity term measures how well a given image adheres to the model (1). Its definition usually depends on the type of noise affecting the acquired \(\varvec{g}\) and, under the AWGN assumption, it is frequently defined as an \(\ell _{2}\)-norm functional. The regularization term \(R : {\mathbb {R}}^{n} \rightarrow {\mathbb {R}}\) reflects prior information on the desired solution, such as its regularity and/or sparsity [21], whereas the hyperparameter \(\lambda \) weights the strength of the regularization.

Very recently, supervised deep learning based methods have shown state-of-the-art performance in the field of imaging inverse problems [32], thanks to their capability to learn the correlation between degraded images and their clean counterparts by exploiting highly expressive models, such as deep neural network architectures, and an external training set of degraded-clean example pairs. However, these supervised approaches have several issues in general, including a lack of generalization when not trained with enough data. Moreover, in many real applications, such as medical imaging, it is practically impossible to build a labeled dataset with both ground truth and degraded data [41].

All these reasons have motivated researchers to investigate unsupervised deep learning approaches, which avoid the use of training sets [12, 13, 25, 26, 38]. Deep Image Prior (DIP) [38] is among the most promising methods belonging to this class. The DIP framework leverages the fact that the architecture of a deep Convolutional Neural Network (CNN) generator reproduces natural images more easily than random noise, thus inducing implicit regularization. Given a CNN generator \(f: {\mathbb {R}}^{s}\times {\mathbb {R}}^{N} \rightarrow {\mathbb {R}}^{n}\) whose weights are denoted by \(\varvec{\theta }\in {\mathbb {R}}^{s}\) and a random input vector \(\varvec{z}\in {\mathbb {R}}^{N}\) sampled from a uniform distribution, the DIP approach [38] looks for a set of weights \(\varvec{\theta }^{*}\) by combining the following minimization problem

$$\begin{aligned} \underset{\varvec{\theta }\in {\mathbb {R}}^{s}}{\text {argmin}}&\ \dfrac{1}{2}\Vert \varvec{H}f(\varvec{\theta },\varvec{z}) - \varvec{g}\Vert _{2}^{2} \end{aligned}$$
(3)

with an early stopping procedure. More specifically, the weights \(\varvec{\theta }^{*}\) are obtained by applying standard gradient-based iterative algorithms to problem (3) and stopping the iterative process early, before it overfits the degraded image \(\varvec{g}\). The restored image \(\varvec{u}^{*}\) is then computed as \(f(\varvec{\theta }^{*},\varvec{z})\).
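
For concreteness, a minimal sketch of this procedure in PyTorch could look as follows; the generator f, the forward operator H (passed as a callable) and the fixed iteration budget playing the role of the early stopping are illustrative assumptions of this sketch, not the exact implementation of [38].

```python
import torch

# Minimal sketch of the plain DIP scheme (3), assuming a CNN generator `f`,
# a linear forward operator `H` given as a callable, a degraded image `g`
# and a fixed random input `z`; the fixed iteration budget plays the role
# of the early stopping and must be tuned by hand.
def dip_restore(f, H, g, z, n_iters=3000, lr=1e-3):
    opt = torch.optim.Adam(f.parameters(), lr=lr)
    for _ in range(n_iters):
        opt.zero_grad()
        loss = 0.5 * (H(f(z)) - g).pow(2).sum()   # 1/2 ||H f(theta, z) - g||^2
        loss.backward()
        opt.step()                                # gradient step on theta only
    with torch.no_grad():
        return f(z)                               # restored image u* = f(theta*, z)
```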

Up to now, researchers have mostly worked on a theoretical analysis of DIP [1, 10, 11] as well as on boosting its performance. Inspired by standard variational regularization methods, in [2, 7, 8, 29, 31, 39] the authors improved the DIP performance by adding an explicit penalization term R to the objective in (3). Hence, the optimization problem (3) is replaced by the following regularized one:

$$\begin{aligned} \underset{\varvec{\theta }\in {\mathbb {R}}^{s}}{\text {argmin}}&\ \dfrac{1}{2}\Vert \varvec{H}f(\varvec{\theta },\varvec{z}) - \varvec{g}\Vert _{2}^{2} + \lambda R(f(\varvec{\theta },\varvec{z})). \end{aligned}$$
(4)

As an example, in [2, 29, 39] R is set as the standard Total Variation (TV) [35], whereas in [31] the authors consider the RED regularizer [34]. In more detail, the definition of TV comes from the assumption that natural images often admit very sparse approximations in the gradient domain. Hence, given a vectorized image \(\varvec{u}\in {\mathbb {R}}^{n}\), the TV regularizer is defined as follows:

$$\begin{aligned} \text {TV}(\varvec{u}):= \Vert \varvec{D}\varvec{u}\Vert _{1,2} := \sum _{i=1}^{n} \left( |(\varvec{D}_{h} \varvec{u})_i|^2 + |(\varvec{D}_{v} \varvec{u})_i|^2\right) ^{1/2}, \end{aligned}$$
(5)

where by \(\varvec{D}= (\varvec{D}_{h};\varvec{D}_{v}) \in {\mathbb {R}}^{2n \times n}\) we denote the discrete gradient, with \(\varvec{D}_{h} \in {\mathbb {R}}^{n \times n}\) and \(\varvec{D}_{v} \in {\mathbb {R}}^{n \times n}\) the first order finite difference operators along the horizontal and vertical axes, respectively. On the other hand, the RED regularizer [34] is based on the so-called regularization by denoising principle, i.e. the capability of denoisers to induce regularization. It is defined as follows:

$$\begin{aligned} R(\varvec{u}) = \dfrac{1}{2} \varvec{u}^{T}(\varvec{u}- {\mathsf {D}}(\varvec{u})), \end{aligned}$$
(6)

where \({\mathsf {D}}(\cdot )\) is chosen as any off-the-shelf denoiser. In [34], by assuming the differentiability, local homogeneity, Jacobian symmetry and filter passivity of \({\mathsf {D}}(\cdot )\), the authors prove that R is convex, differentiable and, moreover, \(\nabla R(\varvec{u}) = \varvec{u}- {\mathsf {D}}(\varvec{u})\). Hereafter we denote by DeepRED the method proposed in [31] to solve problem (4) when R is set as the RED regularizer.
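
As a reference for the reader, a minimal sketch of both penalties for a 2D image is reported below; the boundary handling of the finite differences and the generic `denoise` callable are assumptions of this sketch.

```python
import torch

# Sketch of the TV (5) and RED (6) penalties for a 2D image `u` (H x W tensor).
# Boundary handling (replicating the last row/column, so the corresponding
# difference is zero) is an assumption; `denoise` is any off-the-shelf denoiser.
def tv_iso(u):
    dh = torch.diff(u, dim=1, append=u[:, -1:])     # (D_h u)_i, horizontal differences
    dv = torch.diff(u, dim=0, append=u[-1:, :])     # (D_v u)_i, vertical differences
    return torch.sqrt(dh ** 2 + dv ** 2).sum()      # sum_i ||(D u)_{I_i}||_2

def red_value(u, denoise):
    # R(u) = 1/2 <u, u - D(u)>; under the assumptions of [34],
    # grad R(u) = u - D(u), so the denoiser never needs to be differentiated
    with torch.no_grad():
        residual = u - denoise(u)
    return 0.5 * (u * residual).sum()
```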

The selection of the regularization parameter \(\lambda \) in (4) is an essential issue that this approach inherits from the class of variational regularization methods [37, 43]. A wise choice of the regularization parameter is crucial for obtaining useful approximate solutions to ill-posed problems. Indeed, replacing (3) with (4) induces better regularized solutions, provided a suitable value of \(\lambda \), which depends both on the level of degradation of the acquired image and on the considered problem. In the literature there exist various strategies for choosing the parameter \(\lambda \), such as Morozov's discrepancy principle, the generalized cross-validation (GCV) [14], the L-curve method [17], and the unbiased predictive risk estimator [28]. However, it is well-known that such strategies can present different limitations: they are not easy to apply for every regularizer; they can provide either over- or under-smoothed solutions; they often require solving (4) many times for different values of \(\lambda \), making the overall procedure computationally expensive. For these reasons, manually tuning the regularization parameter by trial-and-error procedures is common in the regularized DIP framework [2, 7, 29, 31, 39], leading to a highly demanding workload.

Contributions In this work, we provide two different DIP-based optimization models which share the property of automatically balancing the effect of the regularization. First, we consider an unconstrained model as the one in (4) where the regularization term is additively separable. The strength of the regularization is pixelwise weighted by a set of local regularization parameters (one for each pixel) whose definition is based on local patterns. Following the idea of estimating the regularization parameter iteratively suggested in [16, 40], we automatically estimate the set of local regularization parameters according to the Uniform PENalty (UPEN) principle [5]. Furthermore, we propose to reformulate the standard regularized unconstrained DIP optimization problem (4) as a constrained one, whose constraint imposes that the residual \(\Vert \varvec{H}f(\varvec{\theta }^{*},\varvec{z})-\varvec{g}\Vert _2\) is almost equal to the standard deviation of the noise affecting the acquired data, in accordance with the discrepancy principle. Clearly, this approach strictly depends on an estimate of the noise level in the corrupted image. However, in real applications, choosing a reasonable value of the noise level is usually much easier than finding a suitable value of the regularization parameter \(\lambda \). Indeed, many efficient algorithms to estimate the noise level are known in the literature [19, 24] and successfully exploited in many fields [15, 36]. Automatically regularized DIP-based optimization models are an interesting development in the DIP framework since, to the best of our knowledge, this aspect has not been addressed so far in this context. Both the unconstrained and constrained models are solved via a modified and more efficient version of the proximal gradient descent-ascent (PGDA) method in which the computation of the gradient step is split into two blocks. Finally, we show that, under suitable assumptions [6], some convergence results for the arising iterative schemes can be provided.

Organization of the paper In Sect. 2, we introduce both the unconstrained and the constrained models and we illustrate the resulting PGDA schemes. In Sect. 3, we present several numerical experiments on synthetic as well as real blurred and noisy natural and medical images and we compare the results with the standard DIP [38] and DeepRED [31].

2 Novel automatically regularized DIP-based optimization models

In this section, we introduce the unconstrained and the constrained optimization models to face the regularized DIP problem and we show how they can be treated within the PGDA framework.

2.1 Unconstrained model

The approaches described in [2, 29, 39] consider the unconstrained model (4) with the regularizer set as the handcrafted Total Variation and a single regularization parameter, which does not allow adapting the regularization to local image patterns. Conversely, we consider a flexible space-variant regularizer and a set of local regularization parameters \(\lambda _{i}\), \(i=1 \dots n\), weighting the strength of the regularization for each pixel. The resulting unconstrained model reads:

$$\begin{aligned} \underset{\varvec{\theta }\in {\mathbb {R}}^{s}}{\text {argmin}}&\ \dfrac{1}{2}\Vert \varvec{H}f(\varvec{\theta },\varvec{z}) - \varvec{g}\Vert _{2}^{2} + \sum _{i=1}^{n} \lambda _{i} R_{i}(({\mathcal {A}}f(\varvec{\theta },\varvec{z}))_{I_i}), \end{aligned}$$
(7)

where \(I_i\subset \{1,\ldots ,l\}=I\) are such that \(I_i\cap I_j=\emptyset \) for every \(i,j = 1 \dots n\) with \(i\ne j\) and \(\bigcup _{i=1}^{n} I_{i}=I\), the \(R_{i}\) are real-valued functions representing the local components of the regularizer, \({\mathcal {A}}: {\mathbb {R}}^{n} \rightarrow {\mathbb {R}}^{l}\) is a generic operator and l is a positive integer such that \(l \ge n\). The functions \(R_{i}\), weighted by the local parameters \(\lambda _{i}\), usually represent local energies defined on a neighbourhood of the i-th pixel, thus enforcing prior information based on local patterns. Practically, these local parameters are automatically chosen along the iterations as explained in Remark 1. Given a vector \({\varvec{v}}\) in \({\mathbb {R}}^{l}\), for every \(i=1,\ldots ,n\) we denote by \({\varvec{v}}_{I_i}\in {\mathbb {R}}^{|I_i|}\) the vector specified by the components of \({\varvec{v}}\) whose indexes are in \(I_i\). Examples of regularization terms belonging to this class are the Tikhonov-like and the Total Variation ones. For instance, in the Tikhonov-based regularizers, \({\mathcal {A}}\) is usually chosen as the identity or the Laplacian operator, whereas \(R_{i}: {\mathbb {R}} \rightarrow {\mathbb {R}}\) is chosen as the square function. Concerning the isotropic Total Variation, \({\mathcal {A}}\) represents the discrete gradient and \(R_{i}: {\mathbb {R}}^{2} \rightarrow {\mathbb {R}}\) is chosen as the \(\ell _{2}\)-norm function.

By adding an auxiliary variable \({\varvec{v}}:= {\mathcal {A}}f(\varvec{\theta },\varvec{z})\), the optimization problem (7) is equivalent to the following formulation:

$$\begin{aligned} \mathop {\mathrm {\text {argmin}}}\limits _{\varvec{\theta }\in {\mathbb {R}}^s,{\varvec{v}}\in {\mathbb {R}}^l} \&\dfrac{1}{2} \Vert \varvec{H}f(\varvec{\theta },\varvec{z}) - \varvec{g}\Vert ^{2}_{2} + \sum _{i=1}^{n} \lambda _{i} R_{i}({\varvec{v}}_{I_i})\nonumber \\ \text {s.t.} \quad&{\mathcal {A}}f(\varvec{\theta },\varvec{z})={\varvec{v}}. \end{aligned}$$
(8)

In order to solve problem (8), we introduce the corresponding augmented Lagrangian function defined as

$$\begin{aligned} \begin{aligned} L(\varvec{\theta },{\varvec{v}}, \varvec{\mu }_{{\varvec{v}}})&= \dfrac{1}{2} \Vert \varvec{H}f(\varvec{\theta },\varvec{z}) - \varvec{g}\Vert _{2}^{2} + \sum _{i=1}^{n} \lambda _{i} R_{i}({\varvec{v}}_{I_i}) \\&\quad + \dfrac{\beta _{{\varvec{v}}}}{2}\Vert {\mathcal {A}}f(\varvec{\theta },\varvec{z}) - {\varvec{v}} \Vert ^{2}_{2} + \langle \varvec{\mu }_{{\varvec{v}}}, {\mathcal {A}}f(\varvec{\theta },\varvec{z}) - {\varvec{v}} \rangle , \end{aligned} \end{aligned}$$
(9)

where \(\beta _{{\varvec{v}}}\) is a positive scalar, called the penalty parameter, and \(\varvec{\mu }_{{\varvec{v}}}\) is the Lagrangian parameter associated with the constraint \({\mathcal {A}}f(\varvec{\theta },\varvec{z}) = {\varvec{v}}\). Some papers [8, 31] address the minimization of the regularized DIP optimization problem (4) by seeking the saddle points of the related augmented Lagrangian function through the ADMM algorithm. However, a highly inexact version of ADMM is implemented in practice, since the updating step for the weights \(\varvec{\theta }\) is, in general, solved inexactly by applying only one iteration of a gradient-based method. For this reason, instead of ADMM, we consider another class of methods tailored for minimax problems. In more detail, by denoting \(\varvec{x}\equiv [\varvec{\theta };{\varvec{v}}]\), we handle the saddle point problem

$$\begin{aligned} \min _{\varvec{x}\in {\mathbb {R}}^{s+l}}\max _{\varvec{\mu }_{{\varvec{v}}}\in {\mathbb {R}}^l} L(\varvec{x},\varvec{\mu }_{{\varvec{v}}}) \end{aligned}$$
(10)

by means of the class of alternating proximal gradient descent-ascent (PGDA) methods [6, 9, 27] (see Appendix 1 for a survey of these algorithms). By introducing the notation \({\mathcal {R}}(\varvec{x}) = \sum _{i=1}^{n} \lambda _{i} R_{i}({\varvec{v}}_{I_i})\) and defining

$$\begin{aligned} K(\varvec{x},\varvec{\mu }_{{\varvec{v}}}) = \frac{1}{2}\Vert \varvec{H}f(\varvec{\theta },\varvec{z})-\varvec{g}\Vert _{2}^{2} + \frac{\beta _{{\varvec{v}}}}{2}\Vert {\mathcal {A}}f(\varvec{\theta },\varvec{z})-{\varvec{v}}\Vert _{2}^{2} +\langle \varvec{\mu }_{{\varvec{v}}},{\mathcal {A}}f(\varvec{\theta },\varvec{z})-{\varvec{v}}\rangle , \end{aligned}$$

upon suitable initialization of the involved variables, the k-th iteration of the alternating PGDA iterative algorithm described in [6] to solve (10) reads as follows:

$$\begin{aligned} \left\{ \begin{aligned} \varvec{x}^{k+1}&= \text {prox}_{\alpha _{\varvec{x}}{\mathcal {R}}} (\varvec{x}^{k}-\alpha _{\varvec{x}}\nabla _{\varvec{x}} K(\varvec{x}^k,\varvec{\mu }_{{\varvec{v}}}^{k}))\\ \varvec{\mu }_{{\varvec{v}}}^{k+1}&=\varvec{\mu }_{{\varvec{v}}}^{k}+\alpha _{\varvec{\mu }_{{\varvec{v}}}}\nabla _{\varvec{\mu }_{{\varvec{v}}}} K(\varvec{x}^{k+1},\varvec{\mu }_{{\varvec{v}}}^{k}) \end{aligned} \right. \end{aligned}$$
(11)

where \(\alpha _{\varvec{x}}\) and \(\alpha _{\varvec{\mu }_{{\varvec{v}}}}\) are proper positive learning rates. By definition of proximal operator, the vector \(\varvec{x}^{k+1}\) in the first step of (11) can be written as in the following:

$$\begin{aligned} \varvec{x}^{k+1} = \mathop {\mathrm {\text {argmin}}}\limits _{\varvec{x}} \ \alpha _{\varvec{x}}{\mathcal {R}}(\varvec{x}) + \frac{1}{2}\Vert \varvec{x}-(\varvec{x}^k-\alpha _{\varvec{x}}\nabla _{\varvec{x}} K(\varvec{x}^k,\varvec{\mu }_{{\varvec{v}}}^k))\Vert _2^2. \end{aligned}$$

Hence, in view of the notation introduced above,

$$\begin{aligned} \begin{aligned} \varvec{x}^{k+1}&= \mathop {\mathrm {\text {argmin}}}\limits _{{\varvec{\theta }\in {\mathbb {R}}^s,{\varvec{v}}\in {\mathbb {R}}^l}} \ \alpha _{\varvec{x}} {\mathcal {R}}({\varvec{x}}) + \frac{1}{2}\left\Vert \left[ \begin{array}{c} \varvec{\theta }\\ {\varvec{v}} \\ \end{array}\right] - \left[ \begin{array}{c} \varvec{\theta }^k-\alpha _{\varvec{x}}\nabla _{\varvec{\theta }}K(\varvec{x}^k,\ \varvec{\mu }_{\varvec{v}}^k) \\ {\varvec{v}}^k-\alpha _{\varvec{x}} \nabla _{{\varvec{v}}}K(\varvec{x}^k,\ \varvec{\mu }_{\varvec{v}}^k) \\ \end{array}\right] \right\Vert _2^2\\ &= \mathop {\mathrm {\text {argmin}}}\limits _{\varvec{\theta }\in {\mathbb {R}}^s,{\varvec{v}}\in {\mathbb {R}}^l} \ \alpha _{\varvec{x}} {\mathcal {R}}(\varvec{x}) + \frac{1}{2}\Vert \varvec{\theta }- (\varvec{\theta }^k-\alpha _{\varvec{x}}\nabla _{\varvec{\theta }}K(\varvec{x}^k,\varvec{\mu }_{\varvec{v}}^k))\Vert _2^2\\ &\quad + \frac{1}{2}\Vert {\varvec{v}}-({\varvec{v}}^k-\alpha _{\varvec{x}}\nabla _{{\varvec{v}}}K(\varvec{x}^k,\varvec{\mu }_{\varvec{v}}^k))\Vert _2^2. \end{aligned} \end{aligned}$$
(12)

Due to the separability of the objective in (12) with respect to the variables \(\varvec{\theta }\) and \({\varvec{v}}\) and by assuming convexity of \({\mathcal {R}}\) and \(\alpha _{\varvec{x}} = \frac{1}{\beta _{\varvec{v}}}\), iteration (11) can be rewritten as

$$\begin{aligned} \left\{ \begin{aligned} \varvec{\theta }^{k+1}&= \varvec{\theta }^k-\alpha _{\varvec{x}}\nabla _{\varvec{\theta }}K(\varvec{x}^k,\varvec{\mu }_{\varvec{v}}^k)\\ {\varvec{v}}^{k+1}&= \mathop {\mathrm {\text {argmin}}}\limits _{{\varvec{v}}\in {\mathbb {R}}^l} \ \alpha _{\varvec{x}} {\mathcal {R}}(\varvec{x})+ \frac{1}{2}\Vert {\varvec{v}}-({\varvec{v}}^k-\alpha _{\varvec{x}}\nabla _{{\varvec{v}}}K(\varvec{x}^k,\varvec{\mu }_{\varvec{v}}^k))\Vert _2^2 =\\&= \mathop {\mathrm {\text {argmin}}}\limits _{{\varvec{v}}\in {\mathbb {R}}^l} \sum _{i=1}^{n} \lambda _{i} R_{i}({\varvec{v}}_{I_i}) + \frac{1}{2\alpha _{\varvec{x}}}\left\Vert {\varvec{v}} -\left( {\mathcal {A}}f(\varvec{\theta }^k,\varvec{z})+\frac{\varvec{\mu }_{\varvec{v}}^k}{\beta _{{\varvec{v}}}}\right) \right\Vert _2^2\\ \varvec{\mu }_{{\varvec{v}}}^{k+1}&= \varvec{\mu }_{{\varvec{v}}}^{k} + \alpha _{\varvec{\mu }_{\varvec{v}}}( {\mathcal {A}} f(\varvec{\theta }^{k+1},\varvec{z}) - {\varvec{v}}^{k+1}). \end{aligned} \right. \end{aligned}$$
(13)

Concerning the optimization problem in the second step of (13), if the proximal map of \(R_{i}\) can be easily computed for all \(i=1 \dots n\), then the problem can be efficiently solved in closed form by applying the proximity operator of \(R_{i}\) to the n sub-vectors, indexed by \(I_i\), of \({\mathcal {A}}f(\varvec{\theta }^{k},\varvec{z}) + \frac{\varvec{\mu }_{{\varvec{v}}}^{k}}{\beta _{{\varvec{v}}}}\). Such hypotheses on \(R_{i}\) are not too restrictive. For example, both the Tikhonov-like and the isotropic Total Variation regularizers satisfy these assumptions since the \(R_{i}\) are set as the square or \(\ell _{2}\)-norm functions.
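
For instance, in the isotropic Total Variation case the second step of (13) reduces to a pixelwise group soft-thresholding; a possible sketch (names and shapes are illustrative assumptions) is the following.

```python
import torch

# Closed-form v-update of (13) when R_i = ||.||_2 (isotropic TV case): for each
# pixel i, prox_{alpha*lam_i*||.||_2}(w_{I_i}) = max(1 - alpha*lam_i/||w_{I_i}||, 0) * w_{I_i},
# where w = A f(theta^{k+1}, z) + mu_v^k / beta_v.
def prox_group_l2(w_h, w_v, lam, alpha):
    # w_h, w_v: horizontal/vertical components of w (one value per pixel)
    # lam: local weights lambda_i; alpha: step size alpha_x
    norm = torch.sqrt(w_h ** 2 + w_v ** 2).clamp_min(1e-12)
    shrink = (1.0 - alpha * lam / norm).clamp_min(0.0)
    return shrink * w_h, shrink * w_v
```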

Remark 1

In our implementation, we chose to vary the set of local regularization parameters \(\lambda _{i}\) along the iterations. In particular, their formulation is inspired by [5] and reads:

$$\begin{aligned} \lambda _{i}^{k}= \dfrac{1}{2n} \dfrac{\Vert \varvec{H}f(\varvec{\theta }^{k+1},\varvec{z}) - \varvec{g}\Vert _{2}^{2}}{R_{i} \left( ({\mathcal {A}}f(\varvec{\theta }^{k+1},\varvec{z}))_{I_i}\right) }. \end{aligned}$$
(14)

This entails that the smaller the value of the local component function, the stronger the regularization provided at pixel i. We remark that, in the experiments, we set these parameters to a fixed value whenever the denominator falls below a given threshold.
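
A possible implementation of the update (14), including the safeguard on small denominators mentioned above, is sketched below; the threshold and the fallback value are illustrative assumptions.

```python
import torch

# UPEN-style update (14) of the local weights lambda_i^k; `residual_sq` is the
# scalar ||H f(theta^{k+1}, z) - g||_2^2 and `local_R[i]` stores
# R_i((A f(theta^{k+1}, z))_{I_i}) for every pixel. The threshold and the
# fallback value used when the denominator is too small are illustrative.
def update_local_lambdas(residual_sq, local_R, thresh=1e-4, lam_fallback=1e-2):
    n = local_R.numel()
    lam = residual_sq / (2.0 * n * local_R.clamp_min(1e-12))
    return torch.where(local_R > thresh, lam, torch.full_like(lam, lam_fallback))
```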

2.1.1 Practical and theoretical details of algorithm (13)

We point out that, taking into account the separable nature of problem (12) and for the sake of optimization efficiency, in the practical implementation we exploit the already computed value of \(\varvec{\theta }^{k+1}\) in the update of \({\varvec{v}}^{k+1}\). In Sect. 3.6 we compare the behaviour of the standard alternating PGDA method (13) and that of the implemented version, which employs \(\varvec{\theta }^{k+1}\) for the computation of \({\varvec{v}}^{k+1}\), on one of the problems under analysis. The two versions of the alternating PGDA are comparable, even if the standard one has higher memory requirements.
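
A compact sketch of one iteration of this modified scheme in the isotropic Total Variation setting is reported below; the operator names, the tensor shapes and the use of a generic optimizer for the \(\varvec{\theta }\)-step (Adam in our experiments) are illustrative assumptions.

```python
import torch

# One iteration of the modified alternating PGDA scheme (13), isotropic TV case.
# `f` is the generator, `H` and `A` are callables (forward operator and discrete
# gradient returning a tensor of shape (2, H, W)), `opt` is an optimizer on the
# weights. The v-update reuses the freshly computed theta^{k+1}.
def pgda_unconstrained_step(f, H, A, g, z, v, mu, lam, opt, beta_v, alpha_mu):
    alpha_x = 1.0 / beta_v
    # theta-step: one back-propagation step on K(theta, v^k, mu^k)
    opt.zero_grad()
    out = f(z)
    Au = A(out)
    K = 0.5 * (H(out) - g).pow(2).sum() \
        + 0.5 * beta_v * (Au - v).pow(2).sum() \
        + (mu * (Au - v)).sum()
    K.backward()
    opt.step()
    # v-step and mu-step, reusing theta^{k+1}
    with torch.no_grad():
        Au = A(f(z))
        w = Au + mu / beta_v                        # A f(theta^{k+1}, z) + mu^k / beta_v
        norm = torch.sqrt(w[0] ** 2 + w[1] ** 2).clamp_min(1e-12)
        shrink = (1.0 - alpha_x * lam / norm).clamp_min(0.0)
        v = shrink * w                              # pixelwise group soft-thresholding
        mu = mu + alpha_mu * (Au - v)               # dual ascent step
    return v, mu
```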

Remark 2

Assume that the function \(K(\varvec{x},\varvec{y})\) is \(\rho \)-weakly convex and L-Lipschitz in the first component, uniformly in the second one, and that the regularizer \({\mathcal {R}}\) is proper, convex, lower semicontinuous and \(L_{{\mathcal {R}}}\)-Lipschitz continuous on its domain. Since, in addition, \(K(\varvec{x},\varvec{y})\) is concave and has a Lipschitz continuous gradient in the second component, uniformly in the first one, the convergence result [6, Theorem 3.7] can be invoked. This theorem ensures that an \(\varepsilon \)-stationary point [6, Definition (6)] of (9) is visited in a finite number of iterations depending on \(\varepsilon \). Both the invoked theorem and definition are recalled in Appendix 1. We point out that the learning rates \(\alpha _{\varvec{x}}\) and \(\alpha _{\varvec{\mu }_{\varvec{v}}}\) are required to be bounded by proper constants which are not known in practice. For this reason, we decide to fix \(\alpha _{\varvec{x}} = \frac{1}{\alpha _{\varvec{\mu }_{\varvec{v}}}} = \beta _{\varvec{v}}\). The choice of \(\beta _{\varvec{v}}\) is discussed in the following remark.

Remark 3

The value of the penalty parameter \(\beta _{{\varvec{v}}}\) is hand-tuned. However, in the experimental part (Sect. 3) we empirically show that the choice of this hyperparameter does not affect the performance of the method as much as the choice of the regularization parameter when dealing with model (4). In detail, we empirically show that the performance of the proposed model is not sensitive to the penalty parameter \(\beta _{{\varvec{v}}}\) for all considered test problems, provided it is chosen in a reasonable range. Moreover, none of the considered values for \(\beta _{\varvec{v}}\) makes the algorithm diverge.

2.2 Constrained model

The starting point of this approach is again the regularized DIP optimization problem (4). Unlike the unconstrained model described in Sect. 2.1, here we assume that \(R:{\mathbb {R}}^{n} \rightarrow {\mathbb {R}}\) is a generic regularizer. The constrained model we refer to in the following reads:

$$\begin{aligned} \underset{\varvec{\theta }\in {\mathbb {R}}^s}{\text {argmin}} \ R(f(\varvec{\theta },\varvec{z})) \quad \text {s.t.}\quad f(\varvec{\theta },\varvec{z}) \in D_{\sigma _{\varvec{\eta }}}, \end{aligned}$$
(15)

where \(D_{\sigma _{\varvec{\eta }}}\) is defined as:

$$\begin{aligned} D_{\sigma _{\varvec{\eta }}}:= \lbrace f(\varvec{\theta },\varvec{z}) \in {\mathbb {R}}^{n} \ |\ \Vert \varvec{H}f(\varvec{\theta },\varvec{z}) - \varvec{g}\Vert _{2}^2 \le \tau \sigma _{\varvec{\eta }}^2 m \rbrace , \end{aligned}$$
(16)

with \(\tau \) a positive scalar and \(\sigma _{\varvec{\eta }}\) the standard deviation of the noise affecting \(\varvec{g}\). The constrained model (15) exploits Morozov's discrepancy principle, simply extending [18, 33]. If \(R \circ f\) is convex, this model is equivalent to (4) for a suitable \(\lambda \ge 0\). In particular, by the KKT complementarity condition, the discrepancy principle seeks a \(\lambda >0\) such that the minimizer of (15) lies on the boundary of \(D_{\sigma _{\varvec{\eta }}}\). However, this convexity hypothesis appears too restrictive for the DIP framework. Nevertheless, under milder assumptions, the KKT conditions ensure that a local optimum of (15) is a stationary point of (4) for a particular \(\lambda \ge 0\). Therefore, we focus on problem (15) since it allows us to avoid the dependence on the choice of the regularization parameter \(\lambda \). Finally, we cannot theoretically guarantee that the solution of (15) satisfies the discrepancy principle (namely, lies on the boundary of \(D_{\sigma _{\varvec{\eta }}}\)), but in Sect. 3 we empirically verify that our approaches implicitly enforce it. We stress that our general approach (15) largely differs from model (4) proposed in the literature, since it overcomes the problem of tuning the regularization parameter, provided the noise standard deviation \(\sigma _{\varvec{\eta }}\) is available. In practice, it is sufficient to consider a good estimate of \(\sigma _{\varvec{\eta }}\), which can be computed by applying the efficient algorithms described in [19, 24].
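
Any robust noise estimator can be plugged in at this stage; as a simple stand-in for the algorithms of [19, 24], which we do not reproduce here, the classical median-absolute-deviation rule applied to the finest diagonal wavelet details of \(\varvec{g}\) already provides a reasonable estimate under the AWGN assumption.

```python
import torch

# Simple AWGN standard-deviation estimate from the degraded image `g` (H x W):
# median absolute deviation of the finest-scale Haar diagonal details, whose
# distribution is dominated by the noise. This is a generic stand-in, not the
# specific estimator of [19] or [24].
def estimate_noise_std(g):
    g = g[: g.shape[0] // 2 * 2, : g.shape[1] // 2 * 2]   # crop to even sizes
    d = (g[0::2, 0::2] - g[0::2, 1::2] - g[1::2, 0::2] + g[1::2, 1::2]) / 2.0
    return 1.4826 * d.abs().median()    # MAD-to-std conversion for Gaussian noise
```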

In order to solve (15), we propose to consider the alternating PGDA method. By introducing two auxiliary variables \(\varvec{t}:=f(\varvec{\theta },\varvec{z})\) and \(\varvec{r}:=\varvec{H}f(\varvec{\theta },\varvec{z})-\varvec{g}\), and two positive penalty parameters \(\beta _{\varvec{t}}\) and \(\beta _{\varvec{r}}\), the augmented Lagrangian function is defined as:

$$\begin{aligned} \begin{aligned} L(\varvec{\theta },\varvec{t},\varvec{r};\varvec{\mu }_{\varvec{t}},\varvec{\mu }_{\varvec{r}})&= R(\varvec{t})+i_{B_\delta }(\varvec{r})+\langle \varvec{\mu }_{\varvec{t}},f(\varvec{\theta },\varvec{z}) - \varvec{t}\rangle + \frac{\beta _{\varvec{t}}}{2}\Vert \varvec{t}-f(\varvec{\theta },\varvec{z})\Vert ^2 \\ &\quad+ \langle \varvec{\mu }_{\varvec{r}},\varvec{H}f(\varvec{\theta },\varvec{z})-\varvec{g}- \varvec{r}\rangle +\frac{\beta _{\varvec{r}}}{2}\Vert \varvec{r}-(\varvec{H}f(\varvec{\theta },\varvec{z})-\varvec{g}) \Vert ^2, \end{aligned} \end{aligned}$$
(17)

where \(i_{B_{\delta }}\) is the indicator function of the ball \(B_{\delta } \subset {\mathbb {R}}^{m}\), centered in \(\varvec{0} \in {\mathbb {R}}^{m}\), of radius \(\delta := \sqrt{\tau \sigma _{\varvec{\eta }}^2 m}\), and \(\varvec{\mu }_{\varvec{t}},\ \varvec{\mu }_{\varvec{r}}\) are the Lagrangian parameters related to the auxiliary variables. Given the notation \(\varvec{x}=[\varvec{\theta },\varvec{t},\varvec{r}]\) and \(\varvec{y}=[\varvec{\mu }_{\varvec{t}},\varvec{\mu }_{\varvec{r}}]\), and by setting

$$\begin{aligned} \begin{aligned} K(\varvec{x},\varvec{y})&= \frac{\beta _t}{2}\Vert \varvec{t}- f(\varvec{\theta },\varvec{z})\Vert ^2 + \frac{\beta _{\varvec{r}}}{2}\Vert \varvec{r}-(\varvec{H}f(\varvec{\theta },\varvec{z})-\varvec{g}) \Vert ^2\\ &\quad +\langle \varvec{\mu }_{\varvec{t}},f(\varvec{\theta },\varvec{z}) - \varvec{t}\rangle + \langle \varvec{\mu }_{\varvec{r}},\varvec{H}f(\varvec{\theta },\varvec{z})-\varvec{g}- \varvec{r}\rangle \end{aligned} \end{aligned}$$

and

$$\begin{aligned} {\mathcal {R}}(\varvec{x}) = R(\varvec{t})+i_{B_\delta }(\varvec{r}), \end{aligned}$$

the augmented Lagrangian function (17) has the form:

$$\begin{aligned} L(\varvec{\theta },\varvec{t},\varvec{r};\varvec{\mu }_{\varvec{t}},\varvec{\mu }_{\varvec{r}}) \equiv L(\varvec{x},\varvec{y}) \equiv K(\varvec{x},\varvec{y})+{\mathcal {R}}(\varvec{x}). \end{aligned}$$

The general iteration of the alternating PGDA iterative algorithm to solve

$$\begin{aligned} \min _{\varvec{x}\in {\mathbb {R}}^{s+n+m}} \max _{\varvec{y}\in {\mathbb {R}}^{n+m}} L(\varvec{x},\varvec{y}) \end{aligned}$$
(18)

can be written as:

$$\begin{aligned} \left\{ \begin{aligned} \varvec{x}^{k+1}&= \text {prox}_{\alpha _{\varvec{x}}{\mathcal {R}}}(\varvec{x}^k-\alpha _{\varvec{x}}\nabla _{\varvec{x}} K(\varvec{x}^k,\varvec{y}^k))\\ \varvec{y}^{k+1}&=\varvec{y}^k+\alpha _{\varvec{y}}\nabla _{\varvec{y}} K(\varvec{x}^{k+1},\varvec{y}^k) \end{aligned} \right. \end{aligned}$$
(19)

where \(\alpha _{\varvec{x}}\) and \(\alpha _{\varvec{y}}\) are proper positive learning rates. By following the approach employed in Sect. 2.1 for the unconstrained case, the vector \(\varvec{x}^{k+1}\) in the first step of (19) can be rewritten as

$$\begin{aligned} \begin{aligned} \varvec{x}^{k+1}&= \text {argmin}_{\varvec{x}} \ \alpha _{\varvec{x}} {\mathcal {R}}(\varvec{x}) + \frac{1}{2}\left\Vert \left[ \begin{array}{c} \varvec{\theta }\\ \varvec{t}\\ \varvec{r}\\ \end{array}\right] - \left[ \begin{array}{c} \varvec{\theta }^k-\alpha _{\varvec{x}}\nabla _{\varvec{\theta }}K(\varvec{x}^k,\varvec{y}^k) \\ \varvec{t}^k-\alpha _{\varvec{x}} \nabla _{\varvec{t}}K(\varvec{x}^k,\varvec{y}^k) \\ \varvec{r}^k-\alpha _{\varvec{x}} \nabla _{\varvec{r}}K(\varvec{x}^k,\varvec{y}^k) \\ \end{array}\right] \right\Vert _2^2\\&= \text {argmin}_{\varvec{x}} \ \alpha _{\varvec{x}} {R}(\varvec{t}) + i_{B_\delta }(\varvec{r})+ \frac{1}{2}\Vert \varvec{\theta }-(\varvec{\theta }^k-\alpha _{\varvec{x}}\nabla _{\varvec{\theta }}K(\varvec{x}^k,\varvec{y}^k))\Vert _2^2\\& \quad + \frac{1}{2}\Vert \varvec{t}-(\varvec{t}^k-\alpha _{\varvec{x}}\nabla _{\varvec{t}}K(\varvec{x}^k,\varvec{y}^k))\Vert _2^2 \\&\quad + \frac{1}{2}\Vert \varvec{r}-(\varvec{r}^k-\alpha _{\varvec{x}}\nabla _{\varvec{r}}K(\varvec{x}^k,\varvec{y}^k))\Vert _2^2. \end{aligned} \end{aligned}$$
(20)

As a consequence, by selecting \(\displaystyle \beta _{\varvec{t}} = \beta _{\varvec{r}} = \frac{1}{\alpha _{\varvec{x}}}\),

$$\begin{aligned} \left\{ \begin{aligned} \varvec{\theta }^{k+1}&= \varvec{\theta }^k-\alpha _{\varvec{x}}\nabla _{\varvec{\theta }}K(\varvec{x}^k,\varvec{y}^k)\\ \varvec{t}^{k+1}&= \mathop {\mathrm {\text {argmin}}}\limits _{\varvec{t}\in {\mathbb {R}}^n} \ \alpha _{\varvec{x}} {R}(\varvec{t})+ \frac{1}{2}\Vert \varvec{t}-(\varvec{t}^k-\alpha _{\varvec{x}}\nabla _{\varvec{t}}K(\varvec{x}^k,\varvec{y}^k))\Vert _2^2\\&= \mathop {\mathrm {\text {argmin}}}\limits _{\varvec{t}\in {\mathbb {R}}^n} \ \alpha _{\varvec{x}}{R}(\varvec{t})+ \frac{1}{2}\left\Vert \varvec{t}-\left( f(\varvec{\theta }^{k},\varvec{z})+\frac{\varvec{\mu }_{\varvec{t}}^{k}}{\beta _{\varvec{t}}}\right) \right\Vert _2^2\\ \varvec{r}^{k+1}&= \mathop {\mathrm {\text {argmin}}}\limits _{\varvec{r}\in {\mathbb {R}}^m} \ \alpha _{\varvec{x}} i_{B_\delta }(\varvec{r})+ \frac{1}{2}\Vert \varvec{r}-(\varvec{r}^k-\alpha _{\varvec{x}}\nabla _{\varvec{r}}K(\varvec{x}^k,\varvec{y}^k))\Vert _2^2\\&=\mathop {\mathrm {\text {argmin}}}\limits _{\varvec{r}\in {\mathbb {R}}^m} \ \alpha _{\varvec{x}} i_{B_{\delta }}(\varvec{r})+ \frac{1}{2\alpha _{\varvec{x}}}\left\Vert \varvec{r}-(\varvec{H}f(\varvec{\theta }^k,\varvec{z})-\varvec{g})-\frac{\varvec{\mu }_{\varvec{r}}^{k}}{\beta _{\varvec{r}}}\right\Vert _2^2\\ \varvec{\mu }_{\varvec{t}}^{k+1}&= \varvec{\mu }_{\varvec{t}}^{k} + \alpha _{\varvec{y}}(f(\varvec{\theta }^{k+1},\varvec{z}) - \varvec{t}^{k+1})\\ \varvec{\mu }_{\varvec{r}}^{k+1}&= \varvec{\mu }_{\varvec{r}}^{k} + \alpha _{\varvec{y}}(\left( \varvec{H}f(\varvec{\theta }^{k+1},\varvec{z})-\varvec{g}\right) - \varvec{r}^{k+1}). \end{aligned} \right. \end{aligned}$$
(21)

Similarly to the standard DIP framework solving (3), the first step in (21) updates the network's weights by performing one back-propagation step. The update of \(\varvec{t}\), provided by the second step in (21), strictly depends on the choice of the regularizer. However, the minimization problem defining \(\varvec{t}^{k+1}\) is mathematically equivalent to the proximal map of \( \alpha _{\varvec{x}} R\) at \(f(\varvec{\theta }^{k},\varvec{z}) + \dfrac{\varvec{\mu }_{\varvec{t}}^{k}}{\beta _{\varvec{t}}}\); therefore it can admit a closed-form solution or it can be solved through either fixed-point or gradient descent strategies as in [31]. The update of \(\varvec{r}\) is a simple projection of \(\varvec{H}f(\varvec{\theta }^{k},\varvec{z}) - \varvec{g}+ \frac{\varvec{\mu }_{\varvec{r}}^{k}}{\beta _{\varvec{r}}}\) onto the ball \(B_{\delta }\). From the practical point of view, in the updating steps for \(\varvec{t}^{k+1}\) and \(\varvec{r}^{k+1}\) in (21), we exploit the already computed vector \(\varvec{\theta }^{k+1}\) to improve the optimization efficiency, as already discussed in Sect. 2.1.1. As for the convergence properties of the scheme (21), conclusions analogous to those of Remark 2 hold in this case as well. Finally, we point out that the penalty parameters \(\beta _{\varvec{t}}\) and \(\beta _{\varvec{r}}\) are hand-tuned. However, the considerations highlighted for the penalty \(\beta _{{\varvec{v}}}\) in Remark 3 also apply to these two hyperparameters.
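
For completeness, a sketch of the \(\varvec{t}\)- and \(\varvec{r}\)-updates of (21) is reported below; the names and the callable prox_R are illustrative assumptions, and for the RED regularizer the prox can be replaced by a few fixed-point or gradient steps as in [31].

```python
import torch

# Sketch of the t- and r-updates of scheme (21), reusing theta^{k+1}. `f_out`
# and `Hf_out` denote f(theta^{k+1}, z) and H f(theta^{k+1}, z); `prox_R(x, a)`
# is an assumed callable computing prox_{a R}(x).
def project_ball(c, delta):
    norm = c.norm(p=2)
    return c if norm <= delta else c * (delta / norm)

def constrained_updates(f_out, Hf_out, g, mu_t, mu_r, beta_t, beta_r, delta, prox_R):
    alpha_x = 1.0 / beta_t                            # (21) assumes beta_t = beta_r = 1/alpha_x
    t_new = prox_R(f_out + mu_t / beta_t, alpha_x)    # prox of alpha_x * R
    r_new = project_ball(Hf_out - g + mu_r / beta_r, delta)
    return t_new, r_new
```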

3 Results

In this section, we show the results of numerical experiments carried out to highlight the main benefits of the proposed unconstrained and constrained models and to evaluate their effectiveness in solving image deblurring and denoising tasks on synthetic natural images as well as real medical ones. We perform several tests by varying the level of degradation and evaluate the performance both qualitatively, through visual comparisons, and quantitatively, by means of the PSNR and SSIM metrics. Finally, we discuss the effectiveness of our implementation with respect to the standard PGDA.

3.1 Implementation and evaluation details

Implementation details Regarding the choice of the regularizer, both models allow a certain freedom. For the unconstrained model, we consider the handcrafted space-variant Total Variation and in the following we refer to this approach as DIP-WTV. We stress that in this case we consider model (7) by assuming that all \(R_{i}\), \(i = 1 \dots n\), are set equal to the 2D \(\ell _{2}\)-norm and that \({\mathcal {A}}\) is the discrete gradient, hence \(l=2n\) and \(I_i=\{i,n+i\}\) for each \(i=1,\ldots ,n\). Under these assumptions the set of regularization parameters is defined according to formula (14). Concerning the constrained optimization model (15), we set the regularizer R equal to the RED regularizer [34]. We refer to this approach as cDIP-RED in the following. The cDIP-RED approach requires knowledge of the standard deviation of the noise affecting the acquired image. As already mentioned, we point out that we estimate the noise standard deviation by applying the algorithm described in [19], even for the simulated tests where it is known. The parameter \(\tau \) in (15) is set equal to 1 for all the experiments. For both models and for all the experiments performed we stop the related PGDA iterative process after 5000 iterations. We remark that both the DIP-WTV and the cDIP-RED approaches are based on a modified version of the alternating PGDA scheme, as clarified in Sect. 2.1.1. In Sect. 3.6 we discuss this choice on one of the problems under analysis. As deep neural network architecture f we consider a generative CNN Encoder-Decoder architecture with skip connections by concatenation, as suggested in [38], whereas the input \(\varvec{z}\) is a random tensor sampled from a uniform distribution. The input \(\varvec{z}\) is a 3D tensor with the same spatial dimensions as the unknown image and 32 channels; the number of weights \(\varvec{\theta }\) is about 2 million.

In the experiments, we follow the common practice in the DIP framework [31, 38] and use the Adam [22] algorithm implemented in PyTorch with the default parameters to update \(\varvec{\theta }\) in (13) and (21). As typically done, we also perturb the input \(\varvec{z}\) at each iteration by a component sampled from a zero-mean Gaussian distribution with standard deviation \(\frac{1}{30}\), and we compute the final output as the average of all iterates.
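
A minimal sketch of these two practices is reported below, where step_fn(z) is an assumed callable performing one optimization step on the weights and returning the current output.

```python
import torch

# Input perturbation and iterate averaging as described above: the fixed input
# z0 is perturbed by zero-mean Gaussian noise of standard deviation 1/30 at
# every iteration, and the restoration is the running mean of all outputs.
def run_with_perturbation_and_averaging(step_fn, z0, n_iters=5000):
    avg, count = None, 0
    for _ in range(n_iters):
        z = z0 + (1.0 / 30.0) * torch.randn_like(z0)       # perturb the input
        out = step_fn(z).detach()
        count += 1
        avg = out if avg is None else avg + (out - avg) / count   # running mean
    return avg
```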

Competitors We compare the proposed approaches DIP-WTV and cDIP-RED with the standard DIP [38] and the DeepRED [31] algorithms. We point out that in [31] the authors show that DeepRED outperforms several other approaches on the deblurring and denoising tasks. Moreover, we underline that in their implementation, to enforce the regularization, the authors adopt a strategy that increases the magnitude of the regularization parameter \(\lambda \) along the iterations when the computed solution starts overfitting the corrupted image. More precisely, when the PSNR value between the restored image and the degraded image is greater than a given threshold \(\gamma \), the regularization parameter is increased by adding a constant.

Test set In Fig. 1, we depict the images used in the numerical simulations. We consider a test set of five red-green-blue (RGB) natural images belonging to the Set5 dataset [4], two black-and-white (BW) natural images and one chest CT image of a patient affected by COVID-19, already post-processed into a 2D image after the acquisition. We treat all the images belonging to Set5, as well as the watercastle BW image, as ground truths. In our experiments, the simulated acquired images are created by applying the image formation model (1) to the related ground truths. In particular, to simulate blurred data we assume that \(\varvec{H}\) represents the discretization of a convolution with a Gaussian kernel of standard deviation \(\sigma _{\varvec{H}}\). We remark that the level of degradation of the simulated acquisition is specified by the magnitudes of \(\sigma _{\varvec{\eta }}\) and \(\sigma _{\varvec{H}}\). Finally, we stress that the skyscraper BW image and the real chest CT image are affected by artifacts. Since no ground truths are available for these images, the comparisons among the methods are carried out through visual inspection. The codes and the images used for these numerical experiments are available online.
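
For reference, a simulated acquisition can be generated along the following lines; the kernel size and the [0, 255] intensity range are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

# Simulated acquisition following model (1): Gaussian blur of standard deviation
# sigma_H followed by AWGN of standard deviation sigma_eta. `u` is a C x H x W
# ground truth with intensities in [0, 255]; the kernel size is an assumption.
def simulate_acquisition(u, sigma_H=2.0, sigma_eta=10.0, ksize=9):
    ax = torch.arange(ksize, dtype=torch.float32) - (ksize - 1) / 2.0
    k1d = torch.exp(-0.5 * (ax / sigma_H) ** 2)
    kernel = torch.outer(k1d, k1d)
    kernel = (kernel / kernel.sum()).reshape(1, 1, ksize, ksize).repeat(u.shape[0], 1, 1, 1)
    blurred = F.conv2d(u.unsqueeze(0), kernel, padding=ksize // 2, groups=u.shape[0])
    return blurred.squeeze(0) + sigma_eta * torch.randn_like(u)
```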

Fig. 1

The test images employed in the numerical experiments. Butterfly: RGB image \(256 \times 256\) pixels, Bird: RGB image \(288 \times 288\) pixels, Head: RGB image \(256 \times 256\) pixels, Woman: RGB \(224 \times 320\) pixels, Baby: RGB \(512 \times 512\) pixels, Watercastle: BW \(320 \times 480\) pixels, Skyscraper: BW \(256 \times 256\) pixels, chest CT: BW \(512 \times 512\) pixels

Fig. 2

The PSNR values achieved by DIP, DIP-WTV and cDIP-RED along the iterations. In (a)-(b)-(c), DIP, DIP-WTV and cDIP-RED, respectively, are tested on three different RGB images degraded with \(\sigma _{\varvec{\eta }}=35\). In (d)-(e)-(f), DIP, DIP-WTV and cDIP-RED, respectively, are tested on the butterfly RGB image corrupted with different noise levels

3.2 Stability w.r.t. hyperparameters and empirical convergence

In this section, we describe the advantages that the proposed models bring over the considered competitors. In the first part, we empirically show that the proposed approaches avoid the typical noise overfitting of DIP. Then, we underline how the suggested methods are more robust than DeepRED with respect to the choice of the hyperparameters. Finally, we empirically show that the solutions of the proposed approaches satisfy Morozov's discrepancy principle.

No overfitting In the first test, we highlight the sensitivity of the standard DIP algorithm with respect to the choice of the optimal number of iterations to be performed and we compare it with DIP-WTV and cDIP-RED. For all the experiments in this section we set the penalty parameters \(\beta _{{\varvec{v}}}=1\), \(\beta _{\varvec{t}}=0.5\) and \(\beta _{\varvec{r}}=1\). We consider the woman, bird, and baby images and we simulate the noisy acquisitions by corrupting the ground truths with an AWGN component of standard deviation \(\sigma _{\varvec{\eta }}=35\). Then, we apply DIP, DIP-WTV, and cDIP-RED and in the upper panel of Fig. 2 we depict the behaviour of the PSNR metric along the iterations. In order to analyze the relation between the noise level of the simulated acquisition and the optimal number of iterations to be performed by DIP, DIP-WTV and cDIP-RED, in the lower panel of Fig. 2 we report the behaviour of the PSNR metric along the iterations as the level of corruption changes. In particular, we consider the butterfly test image and we corrupt it by AWGN with \(\sigma _{\varvec{\eta }}= 25,35,50\). As a general comment, Fig. 2 shows that standard DIP starts overfitting the corrupted image along the iterative process. Moreover, for the DIP approach, this test highlights that the number of iterations needed to reach the best PSNR strongly depends on the image considered (Fig. 2a) and on the level of corruption (Fig. 2d). Conversely, the DIP-WTV (Fig. 2b and e) and cDIP-RED (Fig. 2c and f) schemes do not overfit the corrupted data and their PSNR does not decrease.

Fig. 3

The PSNR values achieved by DIP, DIP-WTV and cDIP-RED along the iterations for the butterfly RGB image with \(\sigma _{\varvec{\eta }}=35\)

No regularization parameter is required The DeepRED algorithm overcomes the problem of overfitting by adding the RED regularizer to the objective minimized by the standard DIP, provided a proper value for the regularization parameter \(\lambda \). In this section, we highlight the sensitivity of the DeepRED algorithm with respect to the choice of the hyperparameters defining the sequence of regularization parameters, namely the threshold \(\gamma \) and the starting value \(\lambda _{0}\) of the regularization parameter. In Fig. 3a, we show the behaviour of the PSNR for different values of the threshold \(\gamma \). The degraded image is obtained by corrupting the butterfly image with an AWGN component with \(\sigma _{\varvec{\eta }}=35\). The parameter \(\lambda _{0}\) is fixed equal to 0.005 while the increasing factor equals 0.03. For high values of the threshold the regularization contribution is too weak, thus the PSNR starts decreasing, which means that DeepRED starts overfitting the degraded image. For low values of the threshold, the regularization parameter becomes large, thus too much regularization is enforced. The best compromise for this butterfly test problem is \(\gamma =22\). However, in our experiments we observe that this value depends once again on both the image and the noise level considered and cannot be fixed a priori. In Fig. 3b, we report the PSNR behaviour obtained by fixing \(\gamma =22\) and by changing the starting value \(\lambda _{0}\) of the regularization parameter. We observe that the output of DeepRED after 5000 iterations largely depends on the choice of this hyperparameter. We stress that DeepRED is implemented in the ADMM framework, which requires the tuning of the penalty parameter. In our experiments, the DeepRED penalty parameter is set equal to 0.5, as suggested by the authors in [31].

The main feature of DIP-WTV and cDIP-RED is that the introduced regularization has no parameters to be estimated. In the case of DIP-WTV, the space-variant regularization parameters are automatically estimated along the iterations, whereas the constrained formulation of cDIP-RED allows the strength of the regularization to be set automatically via Morozov's discrepancy principle.

Stability w.r.t. penalty parameters \(\beta _{{\varvec{v}}}\), \(\beta _{\varvec{t}}\) and \(\beta _{\varvec{r}}\) We remark that for the DIP-WTV and cDIP-RED approaches we only need to fix the penalties \(\beta _{{\varvec{v}}}\), \(\beta _{\varvec{t}}\) and \(\beta _{\varvec{r}}\). In order to show the stability of these methodologies with respect to the choice of these parameters, in Fig. 3c and d we depict the PSNR behaviour provided by DIP-WTV and cDIP-RED on the previous test image for different values of \(\beta _{{\varvec{v}}}\), \(\beta _{\varvec{t}}\) and \(\beta _{\varvec{r}}\). We stress that the range of the penalties for DIP-WTV and cDIP-RED has been deduced from the values suggested in [31] for their ADMM implementation.

From these figures we can conclude that for the DIP-WTV and cDIP-RED methods the penalty parameters affect the convergence speed, but the PSNR behaviour of both approaches is stable along the iterations and no noise overfitting is present for any of the configurations considered. Moreover, we also observed that these different configurations provide comparable restorations in terms of visual quality in 5000 iterations. We observe that to maximize the performance of the cDIP-RED method we should set \(\beta _{\varvec{t}} < \beta _{\varvec{r}}\). This is due to the fact that a larger value of \(\beta _{\varvec{r}}\) enforces more consistency with the observed data. Finally, we stress that the PSNR behaviours reported in Fig. 3c and d for the particular butterfly test problem are common to all the other tests performed.

All these considerations allow us to state that DIP-WTV and cDIP-RED are more robust than DIP and DeepRED with respect to the choice of the hyperparameter values. Moreover, independently of the penalty parameter setting, and unlike the standard DIP, we can stop DIP-WTV and cDIP-RED being confident that these methods do not overfit the noise.

Satisfying Morozov's discrepancy principle In Fig. 4, we consider once again the denoising test on the butterfly image described previously. We analyze the behaviour of the constraint ratio \(\Vert f(\varvec{\theta }^{(k)},\varvec{z})-\varvec{g}\Vert /\delta \) as a function of the iteration number. We remark that a constraint ratio equal to 1 entails that the corresponding iterate is almost at the boundary of \(D_{\sigma _{\varvec{\eta }}}\) defined in (16). We observe that DIP and DeepRED (with \(\gamma =24\)) slowly overfit the simulated noisy acquisition and converge to an interior point of \(D_{\sigma _{\varvec{\eta }}}\). On the other hand, DeepRED (with \(\gamma =22\)), DIP-WTV and cDIP-RED converge to a solution which lies on the boundary of \(D_{\sigma _{\varvec{\eta }}}\) and hence implicitly satisfy the discrepancy principle. We stress that we empirically observe the same behaviour for all the other experiments performed. As a general comment, this test confirms once again how the performance of DeepRED largely depends on the choice of the hyperparameter \(\gamma \) defining the strength of the regularization. Moreover, we empirically observe a more robust convergence behaviour of DIP-WTV and cDIP-RED, which avoid costly parameter tuning.

Fig. 4

The constraint ratio’s trend along the iterations obtained by applying DIP, DeepRED (with \(\gamma = 22\) and \(\gamma =24\)), DIP-WTV and cDIP-RED to the butterfly RGB image corrupted by AWGN with \(\sigma _{\varvec{\eta }}=35\)

3.3 Denoising task

We validate DIP-WTV and cDIP-RED by comparing them with DIP and DeepRED on the Set5 [4] dataset for the denoising task. The starting noisy images are created by corrupting the ground truth images with an AWGN component of standard deviation equal to 25 and 50. We remark once again that for the cDIP-RED approach we estimate the noise standard deviation even if we know its value. The performances are evaluated by means of the PSNR metric and, in addition, by a visual comparison. In particular, Figs. 5 and 6 report the restored baby and butterfly images starting from the data with the highest level of corruption considered. In Table 1 we report the mean values of the PSNR metric on Set5. For the DIP algorithm we have selected for each image the number of iterations maximizing the PSNR value. For DeepRED we set the ADMM penalty equal to 0.5, whereas we have selected the threshold \(\gamma \) and the starting regularization parameter \(\lambda _{0}\) in order to maximize the PSNR for each image. For DIP-WTV and cDIP-RED we set, for all the images, \(\beta _{{\varvec{v}}}=1\), and \(\beta _{\varvec{t}}=0.5\) and \(\beta _{\varvec{r}}=1\), respectively. For DeepRED, DIP-WTV and cDIP-RED the restored images have been obtained by performing 5000 iterations.

The results reported in Table 1 show that cDIP-RED outperforms DIP and provides slightly better performance than DeepRED in terms of the PSNR metric. We remark that cDIP-RED does not require any hand-tuning of the regularization parameter. Concerning DIP-WTV, we observe that it provides better performance than DIP. Moreover, we stress that it has shown more robustness to the choice of the hyperparameters than DeepRED and it has the lowest number of hyperparameters to be set. Unfortunately, the handcrafted Total Variation regularizer is not as effective as the RED regularization for natural images, which manifests in lower PSNR scores for DIP-WTV. In Figs. 5 and 6, we report the simulated noisy acquisitions of the baby and butterfly images with \(\sigma _{\varvec{\eta }}=50\) and the restored images obtained by DIP, DeepRED, DIP-WTV and cDIP-RED. Moreover, in the captions, we report the PSNR values. As a general comment, the DIP algorithm struggles to recover the image texture. The cDIP-RED restorations look sharper and more faithful to the ground truth than the ones obtained by DeepRED and DIP-WTV, as underlined by the close-ups.

Table 1 PSNR mean values for Set5 for two levels of noise. In blue we highlight the best PSNR value
Fig. 5

Restored images for the baby test problem setting \(\sigma _{\varvec{\eta }}= 50\). The PSNR values are: Noisy: 19.84 dB, DIP: 27.85 dB, DeepRED: 28.32 dB, DIP-WTV: 28.26 dB, cDIP-RED: 28.43 dB

Fig. 6

Restored images for the butterfly test problem setting \(\sigma _{\varvec{\eta }}= 50\). The PSNR values are: Noisy: 19.88 dB, DIP: 27.81 dB, DeepRED: 28.13 dB, DIP-WTV: 28.01 dB, cDIP-RED: 28.69 dB

3.4 Deblurring task

In this section, we compare DIP-WTV and cDIP-RED with DIP and DeepRED on the Set5 [4] dataset for the deblurring task. The starting degraded images are constructed by setting the standard deviation of the noise \(\sigma _{\varvec{\eta }}=10\) and the standard deviation of the Gaussian blur \(\sigma _{\varvec{H}}=2\). The performances have been evaluated by means of the PSNR and SSIM metrics. In Table 2, we report the mean values of the PSNR and SSIM metrics. Moreover, we consider the watercastle and the skyscraper images and we add blur and noise by setting \(\sigma _{\varvec{\eta }}=5\) and \(\sigma _{\varvec{H}}=1.6\) for the first image, and \(\sigma _{\varvec{\eta }}=10\) and \(\sigma _{\varvec{H}}=0.8\) for the second. The simulated degraded acquisitions are shown in Figs. 7b and 8b, respectively. In Figs. 7 and 8, we report the results obtained by applying DIP, DeepRED, DIP-WTV and cDIP-RED and in the captions we report the PSNR and SSIM metrics. For DIP and DeepRED we set all the hyperparameters in order to maximize the PSNR. For DIP-WTV and cDIP-RED we set, for all the tests, \(\beta _{{\varvec{v}}}= 1.5\), and \(\beta _{\varvec{t}}= 1.5\) and \(\beta _{\varvec{r}}= 2\), respectively. For DeepRED, DIP-WTV, and cDIP-RED the restored images have been obtained by performing 5000 iterations. From Table 2 we observe again that DeepRED and cDIP-RED reach comparable performances on Set5. However, we stress that, differently from DeepRED, the cDIP-RED scheme does not require fixing the regularization parameter. Moreover, DIP-WTV outperforms the standard DIP. For the watercastle image, DeepRED and cDIP-RED reach similar performances in terms of the PSNR and SSIM metrics; however the DeepRED restoration looks noisier than the one provided by cDIP-RED. Finally, DIP-WTV always performs better than the standard DIP.

Fig. 7

Restored images for the watercastle test problem with noise level 5 and blur 1.6. The PSNR and SSIM values are: Noisy: 22.87 dB–0.76, DIP: 25.81 dB–0.87, DeepRED: 26.23 dB– 0.89, DIP-WTV: 25.87 dB–0.88, cDIP-RED: 26.28 dB–0.89

Concerning the skyscraper, we do not have a ground truth available, therefore we can compare the results only through visual inspection. Indeed, it is clear from Fig. 8a that the skyscraper image is affected by jpeg-compression artifacts. In order to simulate a more realistic acquisition, we further corrupt this compressed image with blur and noise (Fig. 8b). The close-ups in Fig. 8 highlight that the output of cDIP-RED suppresses the artifacts and outperforms the restorations provided by DIP, DeepRED and DIP-WTV in terms of visual quality. In particular, cDIP-RED better retrieves the details while removing the artifacts and the noise.

Fig. 8

Restored images for the skyscraper test problem with noise level 10 and blur 0.8

Table 2 PSNR and SSIM mean values for Set5 considering Gaussian blur with \(\sigma _{\varvec{H}}=2\) and noise level \(\sigma _{\varvec{\eta }}=10\). In blue we highlight the best PSNR and SSIM values

3.5 Artifact removal for a chest CT image

Finally, we show how our methods can be effective for restoring a real chest CT image of a patient affected by COVID-19 [42]. In Fig. 9a we report the acquired data together with close-ups of two details (inflammation zones) in the posterior part of the lungs, where the effects of the interstitial pneumonia caused by the COVID-19 disease are visible. In these panels the standard artifacts related to the discrete angular sampling typical of CT acquisitions are clearly visible. In Fig. 9b and c, we show the restored images provided by our DIP-WTV and cDIP-RED approaches, respectively. Generally, all finer structures, such as the inflammation details, alveoli and bronchioles, are sufficiently well retrieved, as highlighted by the close-ups. Finally, it is evident that the restoration provided by cDIP-RED looks sharper than the one provided by DIP-WTV.

Fig. 9

Reconstructed images for the real CT problem. In (a) we report the acquired data; in (b) and (c) we report the restored images obtained by DIP-WTV and cDIP-RED, respectively

3.6 Standard and modified versions of the alternating PGDA method

As already mentioned, both the unconstrained and the constrained models developed in this work have been solved by means of a modified version of the alternating PGDA algorithm. In this section we compare the performance of the standard PGDA and that of the employed modified variant on a denoising problem, by way of example. In particular, in this analysis we focus only on the cDIP-RED approach, since analogous considerations can be deduced for DIP-WTV. In the comparison we consider cDIP-RED, which exploits the modified PGDA scheme, and its counterpart using the standard PGDA algorithm; in this section, the latter approach is simply denoted by PGDA. We consider the same parameters for cDIP-RED and PGDA. In Fig. 10a we report the PSNR values obtained along the iterations by applying PGDA and cDIP-RED to the denoising problem related to the woman image corrupted with AWGN of standard deviation \(\sigma _{\varvec{\eta }}=35\). It is evident that the achieved values are comparable and the differences between the two methods are very limited. The same considerations can be derived from the behaviour of the constraint ratio generated by the two approaches, depicted in Fig. 10b: the two curves are almost indistinguishable. Finally, we remark that exploiting the already computed value of \(\varvec{\theta }^{k+1}\) to perform the successive variable updates avoids storing intermediate values and, for this reason, lowers the memory requirement.

Fig. 10

The PSNR values and the constraint ratio’s trend achieved by PGDA and cDIP-RED along the iterations for the woman RGB image with \(\sigma _{\varvec{\eta }}=35\)

4 Conclusion

In this paper, we propose a constrained and an unconstrained DIP optimization model which automatically estimate the strength of the regularization. The unconstrained one uses a space-variant handcrafted regularizer whose local regularization parameters are adaptively defined along the optimization process, whereas the constrained model is tailored for a generic regularizer and implicitly enforces solutions satisfying the discrepancy principle. In particular, we used the space-variant Total Variation and the RED regularizer in the implementation of the unconstrained and the constrained models, respectively. The main strengths of the developed frameworks are threefold: it is not required to set proper values for the regularization parameter, the implemented schemes are more robust with respect to the selection of the hyperparameters than other state-of-the-art DIP-based methods, and both schemes avoid the typical overfitting behaviour of the DIP framework. The numerical experiments on image denoising and deblurring show that the developed approaches are comparable with state-of-the-art strategies, with the great advantage of avoiding costly parameter tuning. Finally, since in the literature highly inexact versions of ADMM have been used to solve the regularized DIP models, framing the problem in the PGDA setting opens new possibilities for the theoretical convergence analysis and a more faithful match between theory and practical implementation.