Variational Networks: An Optimal Control Approach to Early Stopping Variational Methods for Image Restoration

We investigate a well-known phenomenon of variational approaches in image processing, where typically the best image quality is achieved when the gradient flow process is stopped before converging to a stationary point. This paradox originates from a tradeoff between optimization and modeling errors of the underlying variational model and holds true even if deep learning methods are used to learn highly expressive regularizers from data. In this paper, we take advantage of this paradox and introduce an optimal stopping time into the gradient flow process, which in turn is learned from data by means of an optimal control approach. After a time discretization, we obtain variational networks, which can be interpreted as a particular type of recurrent neural network. The learned variational networks achieve competitive results for image denoising and image deblurring on a standard benchmark data set. One of the key theoretical results is the development of first- and second-order conditions to verify the optimality of the stopping time. A nonlinear spectral analysis of the gradient of the learned regularizer gives enlightening insights into the different regularization properties.


Introduction
Institute of Mathematics and Scientific Computing, University of Graz, Graz, Austria

Throughout the past years, numerous image restoration tasks in computer vision such as denoising [38] or super-resolution [39] have benefited from a variety of pioneering and novel variational methods. In general, variational methods [9] aim at the minimization of an energy functional designed for a specific image reconstruction problem, where the energy minimizer defines the restored output image. For the considered image restoration tasks, the observed corrupted image is generated by a degradation process applied to the corresponding ground truth image, i.e., the real uncorrupted image.
In this paper, the energy functional is composed of an a priori known, task-dependent, and quadratic data fidelity term and a Field-of-Experts-type regularizer [36], whose building blocks are learned kernels and learned activation functions. This regularizer generalizes the prominent total variation regularization functional and is capable of accounting for higher-order image statistics. A classical approach to minimize the energy functional is a continuous-time gradient flow, which defines a trajectory emanating from a fixed initial image. Typically, the regularizer is adapted such that the end point image of the trajectory lies in the proximity of the corresponding ground truth image. However, even the general class of Field-of-Experts-type regularizers is not able to capture the entirety of the complex structure of natural images, which is why the end point image may substantially differ from the ground truth image. To address this modeling insufficiency, we advocate an optimal control problem using the gradient flow differential equation as the state equation and a cost functional that quantifies the distance between the ground truth image and the gradient flow trajectory evaluated at the stopping time T. Besides the parameters of the regularizer, the stopping time is an additional control parameter learned from data.
The main contribution of this paper is the derivation of criteria to automate the calculation of the optimal stopping time T for the aforementioned optimal control problem. In particular, we observe in the numerical experiments that the learned stopping time is always finite even if the learning algorithm has the freedom to choose a larger stopping time.
For the numerical optimization, we discretize the state equation by means of the explicit Euler and Heun schemes. This results in an iterative scheme which can be interpreted as a static variational network [11,19,24], a subclass of deep learning models [26]. Here, the prefix "static" refers to regularizers that are constant with respect to time. In several experiments, we demonstrate the superiority of learned static variational networks terminated at the optimal stopping time over classical variational methods for image restoration tasks. Consequently, the early stopped gradient flow approach is better suited for image restoration problems and computationally more efficient than the classical variational approach.
A well-known major drawback of mainstream deep learning approaches is the lack of interpretability of the learned networks. In contrast, following [16], the variational structure of the proposed model allows us to analyze the learned regularizers by means of a nonlinear spectral analysis. The computed eigenpairs reveal insightful properties of the learned regularizers.
There have been several approaches to cast deep learning models as dynamical systems in the literature, in which the model parameters can be seen as control parameters of an optimal control problem. E [12] clarified that deep neural networks such as residual networks [21] arise from a discretization of a suitable dynamical system. In this context, the training process can be interpreted as the computation of the controls in the corresponding optimal control problem. In [27,28], Pontryagin's maximum principle is exploited to derive necessary optimality conditions for the optimal control problem in continuous time, which results in a rigorous discrete-time optimization. Certain classes of deep learning networks are examined as mean-field optimal control problems in [13], where optimality conditions of the Hamilton-Jacobi-Bellman type and the Pontryagin type are derived. The effect of several discretization schemes for classification tasks has been studied under the viewpoint of stability in [4,7,17], which leads to a variety of different network architectures that are empirically proven to be more stable.
The benefit of early stopping for iterative algorithms is examined in the literature from several perspectives. In the context of ill-posed inverse problems, early stopping of iterative algorithms is frequently considered and analyzed as a regularization technique. There is a variety of literature on the topic, and we therefore only mention the selected monographs [14,23,34,41]. Frequently, early stopping rules for inverse problems are discussed in the context of the Landweber iteration [25] or its continuous analogue commonly referred to as Showalter's method [44] and are based on criteria such as the discrepancy or the balancing principle.
In what follows, we provide an overview of recent advances related to early stopping. Raskutti et al. [37] exploit early stopping for nonparametric regression problems in reproducing kernel Hilbert spaces (RKHS) to prevent overfitting and derive a data-dependent stopping rule. Yao et al. [42] discuss early stopping criteria for gradient descent algorithms for RKHS and relate these results to the Landweber iteration. Quantitative properties of the early stopping condition for the Landweber iteration are presented in Binder et al. [5]. Zhang and Yu [45] prove convergence and consistency results for early stopping in the context of boosting. Prechelt [33] introduces several heuristic criteria for optimal early stopping based on the performance of the training and validation error. Rosasco and Villa [35] investigate early stopping in the context of incremental iterative regularization and prove sample bounds in a stochastic environment. Matet et al. [30] exploit an early stopping method to regularize (strongly) convex functionals. In contrast to these approaches, we propose early stopping on the basis of finding a local minimum with respect to the time horizon of a properly defined energy.
To illustrate the necessity of early stopping for iterative algorithms, we revisit the established TV-L² denoising functional [38], which amounts to minimizing the variational problem
$E[u] = \|u - g\|_{L^2(\Omega)}^2 + \nu |Du|(\Omega)$
among all functions $u \in BV(\Omega)$, where $\Omega \subset \mathbb{R}^n$ denotes a bounded domain, $\nu > 0$ is the regularization parameter and $g \in L^\infty(\Omega)$ refers to a corrupted input image. An elementary, yet very inefficient optimization algorithm relies on a gradient descent using a finite difference discretization of the regularized functional ($\varepsilon > 0$), where $\Omega_h$ denotes a lattice, $u_h, g_h : \Omega_h \to \mathbb{R}$ are discrete functions and $(Du_h)_{i,j}$ is a finite difference gradient operator with Neumann boundary constraint (for details see [8, Section 3]). For a comprehensive list of state-of-the-art methods to efficiently solve TV-based variational problems, we refer the reader to [9].

Figure 1 shows a contour plot of the peak signal-to-noise ratio depending on the number of iterations and the regularization parameter $\nu$ for TV-L² denoising (1) using a step size of $10^{-4}$ and $\varepsilon = 10^{-6}$, where the input image $g \in L^\infty(\Omega_h, [0,1])$ with a resolution of 512 × 512 is corrupted by additive Gaussian noise with standard deviation 0.1. As a result, for each regularization parameter $\nu$ there exists a unique optimal number of iterations, at which the signal-to-noise ratio peaks. Beyond this point, the quality of the resulting image is deteriorated by staircasing artifacts, and fine texture patterns are smoothed out. The global maximum (26, 0.0474) is marked with a red cross in Fig. 1; the associated image sequence is shown in Fig. 2 (left to right: input image, noisy image, restored images after 13, 26, 39, 52 iterations).
If the gradient descent is considered as a discretization of a time-continuous evolution process governed by a differential equation, then the optimal number of iterations translates into an optimal stopping time.
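To make this phenomenon concrete, the following NumPy sketch runs an explicit scheme for a smoothed total variation flow on a synthetic noisy image. It is a simplified stand-in for the TV-L² descent above: the data fidelity term is omitted and all parameters (image size, noise level, smoothing $\varepsilon$, step size) are illustrative choices, not the settings of the experiment in Fig. 1. The PSNR curve rises above the noisy input, peaks at an intermediate iterate, and deteriorates afterward.

```python
import numpy as np

rng = np.random.default_rng(0)

def psnr(x, ref):
    # Peak signal-to-noise ratio for intensities in [0, 1].
    return 10.0 * np.log10(1.0 / np.mean((x - ref) ** 2))

# Piecewise-constant ground truth: a bright square on a dark background.
ref = np.zeros((64, 64))
ref[20:44, 20:44] = 1.0
g = ref + 0.1 * rng.standard_normal(ref.shape)  # noisy observation

def tv_flow_step(u, eps=1e-2, tau=0.01):
    # One explicit step of the smoothed total variation flow
    # (forward differences with Neumann boundary conditions).
    ux = np.zeros_like(u); ux[:, :-1] = u[:, 1:] - u[:, :-1]
    uy = np.zeros_like(u); uy[:-1, :] = u[1:, :] - u[:-1, :]
    mag = np.sqrt(ux ** 2 + uy ** 2 + eps)
    px, py = ux / mag, uy / mag
    div = px.copy(); div[:, 1:] -= px[:, :-1]  # adjoint: backward differences
    div[1:, :] += py[1:, :] - py[:-1, :]
    div[0, :] += py[0, :]
    return u + tau * div

u, curve = g.copy(), []
for _ in range(4000):
    u = tv_flow_step(u)
    curve.append(psnr(u, ref))
best = int(np.argmax(curve))
# PSNR peaks at an intermediate iterate and then decays as the flow
# keeps flattening the image toward a constant.
```

Stopping at `best` rather than running the flow to (near) stationarity is exactly the early stopping discussed above.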
In this paper, we refer to the standard inner product in the Euclidean space $\mathbb{R}^n$ by $\langle \cdot, \cdot \rangle$. Let $\Omega \subset \mathbb{R}^n$ be a domain. We denote the space of continuous functions by $C^0(\Omega)$, the space of k-times continuously differentiable functions by $C^k(\Omega)$ for $k \geq 1$, the Lebesgue space by $L^p(\Omega)$, $p \in [1, \infty)$, and the Sobolev space by $H^m(\Omega) = W^{m,2}(\Omega)$, $m \in \mathbb{N}$, where the latter space is endowed with the Sobolev (semi-)norm. With a slight abuse of notation, we frequently set $C^k(\Omega) = C^0(\Omega) \cap C^k(\Omega)$. The identity matrix in $\mathbb{R}^n$ is denoted by Id, and $\mathbb{1} = (1, \ldots, 1) \in \mathbb{R}^n$ is the one vector. (Test image by Nichollas Harrison, CC BY-SA 3.0.)

This paper is organized as follows: In Sect. 2, we argue that certain classes of image restoration problems can be perceived as optimal control problems, in which the state equation coincides with the evolution equation of static variational networks, and we prove the existence of solutions under quite general assumptions. Moreover, we derive a first-order necessary as well as a second-order sufficient condition for the optimal stopping time in this optimal control problem. A Runge-Kutta time discretization of the state equation results in the update scheme of static variational networks, which is discussed in detail in Sect. 3. In addition, we visualize the effect of the optimality conditions in a simple numerical example in $\mathbb{R}^2$ and discuss alternative approaches for the derivation of static variational networks. Finally, we demonstrate the applicability of the optimality conditions to two prototype image restoration problems in Sect. 4: denoising and deblurring.

Optimal Control Approach to Early Stopping
In this section, we derive a time continuous analog of static variational networks as gradient flows of an energy functional E composed of a data fidelity term D and a Field-of-Experts-type regularizer R. The resulting ordinary differential equation is used as the state equation of an optimal control problem, in which the cost functional incorporates the squared $L^2$-distance of the state evaluated at the optimal stopping time to the ground truth as well as (box) constraints on the norms of the stopping time, the kernels and the activation functions. We prove the existence of minimizers of this optimal control problem under quite general assumptions. Finally, we derive first- and second-order optimality conditions for the optimal stopping time using a Lagrangian approach.

Let $u \in \mathbb{R}^n$ be a data vector, which is either a signal of length n in 1D, an image of size $n = n_1 \times n_2$ in 2D or spatial data of size $n = n_1 \times n_2 \times n_3$ in 3D. Since we are primarily interested in two-dimensional image restoration, we focus on this task in the rest of this paper and merely remark that all results can be generalized to the remaining cases. For convenience, we restrict to grayscale images; the generalization to color or multi-channel images is straightforward. In what follows, we analyze an energy functional of the form
$E(u) = D(u) + R(u),$
which is composed of a data fidelity term D and a regularizer R specified below. We incorporate the Field-of-Experts regularizer [36], which is a common generalization of the discrete total variation regularizer and is given by
$R(u) = \sum_{k=1}^{N_K} \sum_{i=1}^{m} \rho_k\big((K_k u)_i\big)$
with kernels $K_k \in \mathbb{R}^{m \times n}$ and associated nonlinear functions $\rho_k : \mathbb{R} \to \mathbb{R}$ for $k = 1, \ldots, N_K$. Throughout this paper, we consider the specific data fidelity term
$D(u) = \tfrac{1}{2} \|A u - b\|_2^2$
for fixed $A \in \mathbb{R}^{l \times n}$ and fixed $b \in \mathbb{R}^l$. We remark that various image restoration tasks can be cast in exactly this form for suitable choices of A and b [9].
The gradient flow [1] associated with the energy E for a time $t \in (0, T)$ reads as
$\dot{\tilde{x}}(t) = -\nabla E(\tilde{x}(t)) = -\Big( \sum_{k=1}^{N_K} K_k^\top \Phi_k(K_k \tilde{x}(t)) + A^\top (A \tilde{x}(t) - b) \Big),$
where $\tilde{x} \in C^1([0,T], \mathbb{R}^n)$ denotes the flow of E with $T \in \mathbb{R}$, and the function $\Phi_k \in V_s$ is given by the componentwise application of the activation function $\phi_k = \rho_k'$, i.e., $\Phi_k(z) = (\phi_k(z_1), \ldots, \phi_k(z_m))^\top$. For a fixed $s \geq 0$ and an a priori bounded open interval $I \subset \mathbb{R}$, we consider $C^s(\mathbb{R},\mathbb{R})$-conforming basis functions $\psi_1, \ldots, \psi_{N_w}$ with compact support in I for $N_w \geq 1$. The vectorial function space $V_s$ for the activation functions is composed of m identical component functions $\phi \in C^s(\mathbb{R},\mathbb{R})$, which are given as the linear combination of $(\psi_j)_{j=1}^{N_w}$ with weight vector $w \in \mathbb{R}^{N_w}$, i.e., $\phi = \sum_{j=1}^{N_w} w_j \psi_j$.
We remark that in contrast to [4,17], inverse problems for image restoration rather than image classification are examined. Thus, we incorporate in (3) the classical gradient flow with respect to the full energy functional in order to promote data consistency, whereas in the classification tasks, only the gradient flow with respect to the regularizer is considered.
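The gradient flow can be sketched numerically. The following NumPy example assumes the generic Field-of-Experts form $f(x) = -\big(K^\top \Phi(Kx) + A^\top(Ax - b)\big)$ with a single kernel; the forward-difference kernel, the penalty $\rho(z) = \tfrac{1}{2}\log(1+z^2)$ (so $\phi = \rho'$), and the data are illustrative stand-ins rather than learned quantities. It verifies that explicit Euler steps along the flow decrease the energy monotonically for a sufficiently small step size.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 32, 31  # signal length and number of filter responses

# Illustrative single forward-difference kernel and denoising data term A = Id.
K = np.zeros((m, n))
for i in range(m):
    K[i, i], K[i, i + 1] = -1.0, 1.0
A = np.eye(n)
b = 0.1 * rng.standard_normal(n) + np.sin(np.linspace(0, 3 * np.pi, n))

rho = lambda z: 0.5 * np.log1p(z ** 2)
phi = lambda z: z / (1.0 + z ** 2)  # phi = rho'

def energy(x):
    # E(x) = R(x) + D(x) with Field-of-Experts-type R and quadratic D.
    return np.sum(rho(K @ x)) + 0.5 * np.sum((A @ x - b) ** 2)

def f(x):
    # Right-hand side of the gradient flow: f(x) = -grad E(x).
    return -(K.T @ phi(K @ x) + A.T @ (A @ x - b))

# Explicit Euler steps along the flow decrease the energy monotonically.
x = np.zeros(n)
energies = [energy(x)]
for _ in range(200):
    x = x + 0.1 * f(x)
    energies.append(energy(x))
```

The step size 0.1 is safely below the smoothness threshold here, since the gradient of this energy is Lipschitz with constant at most $\|K\|_2^2 + 1 \le 5$.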
In what follows, we analyze an optimal control problem, for which the state equation (3) and initial condition (4) will arise as equality constraints. The cost functional J incorporates the $L^2$-distance of the flow $\tilde{x}$ evaluated at time T and the ground truth state $x_g \in \mathbb{R}^n$ and is given by
$J = \tfrac{1}{2} \|\tilde{x}(T) - x_g\|_2^2.$
We assume that the controls T, $K_k$ and $\Phi_k$ satisfy the box constraints
$T \in [0, T_{\max}], \quad \alpha(K_k) \leq C_\alpha, \quad \beta(\Phi_k) \leq C_\beta,$
as well as the zero mean condition
$K_k \mathbb{1} = 0.$
Here, we have $k = 1, \ldots, N_K$, and we choose a fixed parameter $T_{\max} > 0$. Further, $\alpha : \mathbb{R}^{m \times n} \to \mathbb{R}_0^+$ and $\beta : V_s \to \mathbb{R}_0^+$ are continuously differentiable functions with non-vanishing gradient such that $\alpha(K) \to \infty$ as $\|K\|_2 \to \infty$ and $\beta(\Phi) \to \infty$ as $\|\Phi\|_{L^\infty} \to \infty$. We include the condition (7) to reduce the dimensionality of the kernel space. Moreover, this condition ensures an invariance with respect to gray-value shifts of image intensities.
The particular choice of the cost functional originates from the observation that a visually appealing image restoration is obtained as the closest point on the trajectory of the flow $\tilde{x}$ (reflected by the $L^2$-distance) to $x_g$, subject to a moderate flow regularization as quantified by the box constraints. Figure 3 illustrates this optimization task by a schematic drawing of the optimal trajectory (black curve) as well as suboptimal trajectories (gray dashed curves) emanating from $x_0$, with ground truth $x_g$, optimal restored image $\tilde{x}(T)$, sink/stable node $\tilde{x}_\infty$ and energy isolines (red concentric circles). Among all trajectories of the ordinary differential equation (3) emanating from a constant initial value $x_0$, one seeks the trajectory that is closest to the ground truth $x_g$ in terms of the squared Euclidean distance, as visualized by the energy isolines. Note that each trajectory is uniquely determined by $(K_k, \Phi_k)_{k=1}^{N_K}$. The sink/stable node $\tilde{x}_\infty$ is an equilibrium point of the ordinary differential equation, at which all eigenvalues of the Jacobian of the right-hand side of (3) have negative real parts [40].
The constraint on the stopping time is solely required for the existence theory. For the image restoration problems, a finite stopping time can always be observed even without this constraint. Hence, the optimal control problem reads as
$\min_{T, (K_k, \Phi_k)_{k=1}^{N_K}} J$
subject to the constraints (6) and (7) as well as the nonlinear autonomous initial value problem (Cauchy problem) representing the state equation
$\dot{\tilde{x}}(t) = f(\tilde{x}(t)) := -\nabla E(\tilde{x}(t))$
for $t \in (0, T)$ and $\tilde{x}(0) = x_0$. We refer to the minimizing time T in (8) as the optimal early stopping time. To better handle this optimal control problem, we employ the reparametrization $x(t) = \tilde{x}(tT)$, which results in the equivalent optimal control problem
$\min_{T, (K_k, \Phi_k)_{k=1}^{N_K}} \tfrac{1}{2} \|x(1) - x_g\|_2^2$
subject to (6), (7) and the transformed state equation
$\dot{x}(t) = T f(x(t))$
for $t \in (0, 1)$ with $x(0) = x_0$.

Remark 1 The long-term dynamics of the nonlinear autonomous state equation (11) is determined by the set of fixed points $\{y \in \mathbb{R}^n : f(y) = 0\}$ of f. A system is asymptotically stable at a fixed point y if all real parts of the eigenvalues of $Df(y)$ are strictly negative [18,40]. In the case of convex potential functions $\rho$ with unbounded support and a full rank matrix A, the autonomous differential equation (3) is asymptotically stable.
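The reparametrization $x(t) = \tilde{x}(tT)$ can be checked numerically: integrating $\dot{\tilde{x}} = f(\tilde{x})$ up to time T is equivalent to integrating $\dot{x} = T f(x)$ up to time 1. The sketch below uses an illustrative symmetric linear contraction $f(y) = -My$ (not the Field-of-Experts flow), so the exact end point $e^{-TM} x_0$ is available via an eigendecomposition for comparison.

```python
import numpy as np

# Illustrative linear contraction f(y) = -M @ y with symmetric M.
M = np.array([[2.0, 0.5], [0.5, 1.0]])
f = lambda y: -M @ y
x0 = np.array([1.0, -1.0])
T, steps = 1.7, 20000

xt = x0.copy()
for _ in range(steps):          # dx_tilde/dt = f(x_tilde) on [0, T]
    xt = xt + (T / steps) * f(xt)

x = x0.copy()
for _ in range(steps):          # dx/dt = T * f(x) on [0, 1]
    x = x + (1.0 / steps) * (T * f(x))

# Exact end point x_tilde(T) = exp(-T*M) @ x0 via the eigendecomposition.
lam, V = np.linalg.eigh(M)
exact = V @ (np.exp(-T * lam) * (V.T @ x0))
```

Both Euler iterations perform the same update per step, so the two end points coincide up to rounding, and both approximate $\tilde{x}(T)$ to first order in the step size.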
In the next theorem, we apply the direct method in the calculus of variations to prove the existence of minimizers for the optimal control problem.

Theorem 1 (Existence of solutions) Let $s \geq 0$. Then the minimum in (10) is attained.
Proof Without restriction, we solely consider the case $N_K = 1$ and omit the subscript. Let $(T_i, K_i, \Phi_i, x_i)_{i \in \mathbb{N}}$ be a minimizing sequence such that (6), (7) and (11) hold true (the existence of $x_i$ is verified below). The coercivity of $\alpha$ and $\beta$ implies $\|K_i\|_2 \leq C_\alpha$ and $\|\Phi_i\|_{L^\infty} \leq C_\beta$ for fixed constants $C_\alpha, C_\beta > 0$. Due to the finite dimensionality of $V_s$ and the boundedness of $\|\Phi_i\|_{L^\infty} \leq C_\beta$, we can deduce the existence of a subsequence (not relabeled) such that $\Phi_i \to \Phi \in V_s$. In addition, using the bounds $T_i \in [0, T_{\max}]$ and $\|K_i\|_2 \leq C_\alpha$, we can pass to further subsequences if necessary to deduce $T_i \to T \in [0, T_{\max}]$ and $K_i \to K$. The uniform boundedness of the controls already guarantees that [0, 1] is contained in the maximum domain of existence of the state equation due to the linear growth of the right-hand side in $x_i$ [40, Theorem 2.17]. Moreover, Gronwall's inequality [18,40] ensures the uniform boundedness of $\|x_i(t)\|_2$ for all $t \in [0, 1]$ and all $i \in \mathbb{N}$, which in combination with the above estimate already implies the uniform boundedness of $\|\dot{x}_i(t)\|_2$. Thus, by passing to a subsequence (again not relabeled), we infer that $x \in H^1((0,1), \mathbb{R}^n)$ exists such that $x_i \rightharpoonup x$ in $H^1((0,1), \mathbb{R}^n)$ and $x(0) = x_0$ (the pointwise evaluation is possible due to the Sobolev embedding theorem). In addition, we obtain that x satisfies the state equation in the sense of Carathéodory [18]. However, due to the continuity of the right-hand side, we can even conclude that x is a classical solution. Finally, the theorem follows from the continuity of J along this minimizing sequence.
In the next theorem, a first-order necessary condition for the optimal stopping time is derived.
Theorem 2 (First-order necessary condition for optimal stopping time) Let $s \geq 1$. Then for each stationary point $(T, (K_k, \Phi_k)_{k=1}^{N_K}, x)$ of the optimal control problem, for which (6), (7) and (11) are valid, the equation
$\int_0^1 \langle p(t), f(x(t)) \rangle \,\mathrm{d}t = 0$
holds true, where p denotes the adjoint state of x, which is given as the solution to the ordinary differential equation
$\dot{p}(t) = -T\, Df(x(t))^\top p(t)$
with terminal condition
$p(1) = x(1) - x_g.$

Proof Again, without loss of generality, we restrict to the case $N_K = 1$ and omit the subscript. Let $z = (x, T, K, \Phi) \in Z := H^1((0,1), \mathbb{R}^n) \times [0, T_{\max}] \times \mathbb{R}^{m \times n} \times V_s$ be a stationary point of J, which exists due to Theorem 1. The constraints (11), (6) and (7) can be written as an operator equation $G(z) = 0$ with values in a suitable space. For multipliers in the space P, we consider the associated Lagrange functional $L : Z \times P \to \mathbb{R}$ to minimize J incorporating the aforementioned constraints, i.e., for $z = (x, T, K, \Phi) \in Z$ and $p \in P$ we obtain the Lagrangian (15). Following [22,43], the Lagrange multiplier p exists if J is Fréchet differentiable at z, G is continuously Fréchet differentiable at z, and z is regular, i.e., the surjectivity condition (16) holds.
The (continuous) Fréchet differentiability of J and G at z can be proven in a straightforward manner. To show (16), we first prove the surjectivity of $DG_1(z)$, the linearization of the state equation constraint. The surjectivity of $DG_1(z)$ with initial condition given by $x(0) = x_0$ follows from the linear growth in x, which implies that the maximum domain of existence coincides with $\mathbb{R}$. This solution is in general only a solution in the sense of Carathéodory [18,40]. Since $\alpha$ and $\beta$ have non-vanishing derivatives, the validity of (16) and thus the existence of the Lagrange multiplier follow. The first-order optimality conditions with test functions $\overline{x} \in H^1((0,1), \mathbb{R}^n)$, $\overline{K} \in \mathbb{R}^{m \times n}$, $\overline{\Phi} \in V_s$ and $\overline{p} \in P$ are given by (17)-(19). The fundamental lemma of calculus of variations yields, in combination with (17) and (19), the equations (20) and (21) for $t \in (0, 1)$ in a distributional sense. Since the right-hand sides of (20) and (21) are continuous [18,40], (21) even holds in the classical sense. Finally, (18) and (19) imply the stated optimality condition, which proves (12) if T > 0 (the case T = 0 is trivial).
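The adjoint characterization can be validated numerically: for $J(T) = \tfrac{1}{2}\|x(1) - x_g\|^2$ with $\dot{x} = T f(x)$, the derivative with respect to the stopping time equals $\int_0^1 \langle p(t), f(x(t))\rangle\,\mathrm{d}t$, where p solves $\dot{p} = -T\,Df(x)^\top p$ with $p(1) = x(1) - x_g$. The sketch below checks this against a finite-difference quotient for an illustrative cubic contraction (not the Field-of-Experts flow); the data $x_0$, $x_g$ are hypothetical.

```python
import numpy as np

# Illustrative componentwise cubic contraction and its (diagonal) Jacobian.
f = lambda y: -(y + 0.3 * y ** 3)
Df = lambda y: np.diag(-(1.0 + 0.9 * y ** 2))

x0 = np.array([1.5, -0.8])
xg = np.array([0.4, -0.2])
N = 4000  # Euler steps on [0, 1]

def forward(T):
    # Explicit Euler for dx/dt = T * f(x), storing the trajectory.
    xs = [x0]
    for _ in range(N):
        xs.append(xs[-1] + (T / N) * f(xs[-1]))
    return xs

def J(T):
    return 0.5 * np.sum((forward(T)[-1] - xg) ** 2)

def dJ_dT(T):
    # dJ/dT = int_0^1 <p(t), f(x(t))> dt via the adjoint equation,
    # integrated backward in time with p(1) = x(1) - x_g.
    xs = forward(T)
    p = xs[-1] - xg
    grad = 0.0
    for s in range(N, 0, -1):
        grad += (1.0 / N) * (p @ f(xs[s]))
        p = p + (T / N) * (Df(xs[s]).T @ p)
    return grad

T = 0.9
fd = (J(T + 1e-4) - J(T - 1e-4)) / 2e-4  # finite-difference reference
```

Agreement of `dJ_dT(T)` with `fd` up to the discretization error confirms that the stationarity condition (12) is exactly "the derivative of J with respect to T vanishes".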
The preceding theorem can easily be adapted for fixed kernels and activation functions leading to a reduced optimization problem with respect to the stopping time only: Corollary 1 (First-order necessary condition for subproblem) Let K k ∈ R m×n and k ∈ V s for k = 1, . . . , N K be fixed with s ≥ 1 satisfying (6) and (7). We denote by p the adjoint state (13). Then, for each stationary point T of the subproblem in which the associated state x satisfies (11), the first-order optimality condition (12) holds true.

Remark 2
Under the assumptions of Corollary 1, a rescaling argument reveals that the state and the adjoint state associated with a stopping time T are time-rescaled versions of the state and the adjoint state of the untransformed flow for $t \in (0, 1)$.

We conclude this section with a second-order sufficient condition for the partial optimization problem (23):

Theorem 3 (Second-order sufficient conditions for subproblem) Let $s \geq 2$. Under the assumptions of Corollary 1, a stationary point T > 0 of (23) is a strict local minimizer if the second-order condition (25) holds.

Proof As before, we restrict to the case $N_K = 1$ and omit subscripts. Let us denote by L the version of the Lagrange functional (15) with fixed kernels and fixed activation functions to minimize J subject to the side conditions given by the state equation and the initial condition. Following [43, Theorem 43.D], J has a strict local minimum at z if the first-order optimality condition discussed in Corollary 1 holds true and a constant C > 0 exists such that the second variation of L at z is bounded from below by $C \|\overline{z}\|^2$ for all $\overline{z} = (\overline{x}, \overline{T}) \in Z$ satisfying $DG_1(z)(\overline{z}) = 0$ and $DG_2(z)(\overline{z}) = \overline{x}(0) = 0$. The theorem follows from the homogeneity of order 2 in $\overline{T}$ in (26), which results in the modified condition (25).

Remark 3 All aforementioned statements remain valid when
replacing the function space V s and the norm of the activation functions by suitable Sobolev spaces and Sobolev norms, respectively. Moreover, all statements only require minor modifications if instead of the box constraints (6) nonnegative, coercive and differentiable functions of the norms of T , K k , and k are added in the cost functional J .

Time Discretization
The optimal control problem with state equation originating from the gradient flow for the energy functional E was analyzed in Sect. 2. In this section, we prove that static variational networks can be derived from a time discretization of the state equation incorporating Euler's or Heun's method [2,6].
To illustrate the concepts, we discuss the optimal control problem in $\mathbb{R}^2$ using fixed kernels and activation functions in Sect. 3.1. Finally, a literature overview of alternative ways to derive variational networks as well as relations to other approaches are presented in Sect. 3.2.

Let $S \geq 2$ be a fixed depth. For a stopping time $T \in \mathbb{R}$ we define the node points $t_s = \frac{s}{S}$ for $s = 0, \ldots, S$. Consequently, Euler's explicit method for the transformed state equation (11) with fixed kernels and fixed activation functions $\Theta = (K_k, \Phi_k)_{k=1}^{N_K}$ reads as
$x_{s+1} = x_s + \tfrac{T}{S} f(x_s)$
for $s = 0, \ldots, S-1$ with $x_0 = x(0)$. The discretized ordinary differential equation (27) defines the evolution of the static variational network. We stress that this time discretization is closely related to residual neural networks with constant parameters in each layer. Here, $x_s$ is an approximation of $x(t_s)$, and the associated global error $\|x(t_s) - x_s\|$ is bounded from above by $C \tfrac{T}{S}\big(e^{L_f T} - 1\big)$ for a constant C > 0, where $L_f$ denotes the Lipschitz constant of f [2, Theorem 6.3]. In general, this global error bound has a tendency to overestimate the actual global error. Improved error bounds can either be derived by performing a more refined local error analysis, which solely results in a better constant C, or by using higher-order Runge-Kutta methods. One prominent example of an explicit Runge-Kutta scheme with a quadratic order of convergence is Heun's method [6], which is defined as
$x_{s+1} = x_s + \tfrac{T}{2S}\Big( f(x_s) + f\big(x_s + \tfrac{T}{S} f(x_s)\big) \Big).$
We abbreviate the right-hand side of (13) as $g(x, p) := -T\, Df(x)^\top p$. The corresponding update schemes for the adjoint states, integrated backward in time, are given by
$p_s = p_{s+1} - \tfrac{1}{S}\, g(x_{s+1}, p_{s+1})$
in the case of Euler's method and
$p_s = p_{s+1} - \tfrac{1}{2S}\Big( g(x_{s+1}, p_{s+1}) + g\big(x_s,\, p_{s+1} - \tfrac{1}{S} g(x_{s+1}, p_{s+1})\big) \Big)$
in the case of Heun's method. We remark that in general implicit Runge-Kutta schemes are not efficient due to the complex structure of the Field-of-Experts regularizer. In all cases, we have to choose the step size $\tfrac{T}{S}$ such that the explicit Euler scheme is stable [6], i.e.,
$\max_{i=1,\ldots,n} \big| 1 + \tfrac{T}{S} \lambda_i \big| \leq 1$
for all $s = 0, \ldots, S$, where $\lambda_i$ denotes the ith eigenvalue of the Jacobian of either f or g.
Note that this condition already implies the stability of Heun's method. Thus, in the numerical experiments, we need to ensure a constant ratio of the stopping time T and the depth S to satisfy (31).
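The different accuracies of the two schemes can be observed on a scalar test problem. The following sketch integrates $\dot{x} = T \lambda x$ (with illustrative values of $\lambda$, T and $x_0$) against the known solution $x_0 e^{T\lambda}$: halving the step size roughly halves Euler's global error but quarters Heun's, reflecting their first- and second-order convergence.

```python
import numpy as np

lam, T, x0 = -2.0, 1.5, 1.0        # illustrative linear test problem
exact = x0 * np.exp(T * lam)       # x(1) for dx/dt = T*lam*x on [0, 1]
f = lambda y: lam * y

def euler(S):
    x, h = x0, T / S
    for _ in range(S):
        x = x + h * f(x)
    return x

def heun(S):
    x, h = x0, T / S
    for _ in range(S):
        x_pred = x + h * f(x)                  # explicit Euler predictor
        x = x + 0.5 * h * (f(x) + f(x_pred))   # trapezoidal corrector
    return x

# Global errors for successively halved step sizes.
e1 = [abs(euler(S) - exact) for S in (40, 80, 160)]
e2 = [abs(heun(S) - exact) for S in (40, 80, 160)]
```

Both step sizes here satisfy the stability condition $|1 + \tfrac{T}{S}\lambda| \le 1$, so the observed error ratios isolate the discretization order rather than instability effects.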

Optimal Control Problem in $\mathbb{R}^2$
In this subsection, we apply the first- and second-order criteria for the partial optimal control problem (see Corollary 1) to a simple, yet illuminative example in $\mathbb{R}^2$ with a single kernel, i.e., $l = m = n = 2$ and $N_K = 1$. More general applications of the early stopping criterion to image restoration problems are discussed in Sect. 4. Below, we consider a regularized data fitting problem composed of a squared $L^2$-data term and a nonlinear regularizer incorporating a forward finite difference matrix operator with respect to the x-direction and a fixed smooth activation function $\phi$. To compute the solutions of the state (11) and the adjoint (13) differential equations, we use Euler's explicit method with 100 equidistant steps. All integrals are approximated using a Gaussian quadrature of order 21. Furthermore, we optimize the stopping time T in the discrete set $0.05 \cdot \mathbb{N} \cap [\tfrac{1}{10}, 3]$.
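A grid search of this kind can be sketched as follows. The example below uses the forward-difference "kernel" $K = [-1, 1]$ as in the text, but the penalty $\rho(z) = \tfrac{1}{2}\log(1+z^2)$ (so $\phi = \rho'$) and the values of $b$, $x_0$ and $x_g$ are hypothetical choices made so that the optimal stopping time lies in the interior of the grid. For each grid value of T, it computes both $J(T)$ and the first-order quantity $\int_0^1 \langle p, f(x)\rangle\,\mathrm{d}t$; the sign change of the latter singles out the grid minimizer of the former.

```python
import numpy as np

K = np.array([[-1.0, 1.0]])                    # forward difference in R^2
phi = lambda z: z / (1.0 + z ** 2)             # phi = rho' for rho = 0.5*log(1+z^2)
dphi = lambda z: (1.0 - z ** 2) / (1.0 + z ** 2) ** 2
b = np.array([1.0, -0.5])                      # hypothetical data
x0, xg = b.copy(), np.array([0.8, -0.25])      # hypothetical ground truth
N = 2000                                       # Euler steps on [0, 1]

def f(x):
    # f(x) = -grad E(x) with E(x) = 0.5*||x - b||^2 + rho(K x).
    return -((x - b) + K.T @ phi(K @ x))

def Df(x):
    return -(np.eye(2) + dphi((K @ x)[0]) * (K.T @ K))

def J_and_grad(T):
    xs = [x0]
    for _ in range(N):                         # state: dx/dt = T*f(x)
        xs.append(xs[-1] + (T / N) * f(xs[-1]))
    p = xs[-1] - xg                            # adjoint terminal condition
    grad = 0.0
    for s in range(N, 0, -1):                  # adjoint: dp/dt = -T*Df(x)^T p
        grad += (1.0 / N) * (p @ f(xs[s]))
        p = p + (T / N) * (Df(xs[s]).T @ p)
    return 0.5 * np.sum((xs[-1] - xg) ** 2), grad

Ts = np.arange(0.1, 3.0001, 0.05)
vals = [J_and_grad(T) for T in Ts]
Js = np.array([v[0] for v in vals])
grads = np.array([v[1] for v in vals])
k = int(np.argmin(Js))                         # grid-optimal stopping time Ts[k]
```

The derivative is negative before the optimal stopping time and positive after it, so the zero crossing of `grads` coincides with the argmin of `Js` up to the grid resolution.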

Alternative Derivations of Variational Networks
We conclude this section with a brief review of alternative derivations of the defining equation (27). This iterative scheme has also been introduced as a trainable nonlinear reaction-diffusion model with an a priori fixed number of iterations, where the regularizer R and the data term D give rise to the diffusion and the reaction term, respectively, which coincide with the two expressions in (11). By exploiting proximal mappings, this scheme can also be used for non-differentiable data terms D. In the same spirit, Kobler et al. [24] related variational networks to incremental proximal and incremental gradient methods. Following [19], variational networks result from a Landweber iteration [25] of the energy functional (2) using the Field-of-Experts regularizer. Structural similarities of variational networks and residual neural networks [21] are analyzed in [24]. In particular, residual neural networks (and thus also variational networks) are known to be less prone to the degradation problem, which is characterized by a simultaneous increase of the training/test error and the model complexity. Note that in most of these approaches time varying kernels and activation functions are examined. In most of the aforementioned papers, the benefit of early stopping has been observed.

Numerical Results for Image Restoration
We examine the advantageousness of early stopping for image denoising and image deblurring using static variational networks in this section. In particular, we show that the first-order optimality condition identifies the optimal stopping time. We do not verify the second-order sufficient condition discussed in Theorem 3 since in all experiments the first-order condition indicates an energy minimizing solution and thus this verification is not required.

Image Reconstruction Problems
In the case of image denoising, we perturb a ground truth image x g ∈ R n by additive Gaussian noise n ∼ N (0, σ 2 Id) for a certain noise level σ resulting in the noisy input image g = x g + n. Consequently, the linear operator is given by the identity matrix and the corrupted image as well as the initial condition coincide with the noisy image, i.e., A = Id and b = x 0 = g.
For image deblurring, we consider an input image $g = x_0 = A x_g + n \in \mathbb{R}^n$ that is corrupted by a Gaussian blur of the ground truth image $x_g \in \mathbb{R}^n$ and Gaussian noise n with $\sigma = 0.01$. Here, $A \in \mathbb{R}^{n \times n}$ refers to the matrix representation of the normalized 9 × 9 convolution filter, with blur strength $\tau > 0$, obtained from the Gaussian function $(y_1, y_2) \mapsto \exp\big(-\tfrac{y_1^2 + y_2^2}{2\tau^2}\big)$.
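The assumed blur model can be sketched as follows: a 9 × 9 kernel obtained by sampling a Gaussian with standard deviation $\tau$, normalized to sum to one, applied by zero-padded convolution. The exact sampling and normalization used in the paper may differ in detail; this is an illustrative reconstruction.

```python
import numpy as np

def gaussian_kernel(tau, size=9):
    # Sample exp(-(y1^2 + y2^2) / (2*tau^2)) on a size x size grid
    # centered at the origin, then normalize to sum to one.
    r = size // 2
    xx, yy = np.meshgrid(np.arange(-r, r + 1), np.arange(-r, r + 1))
    k = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * tau ** 2))
    return k / k.sum()

def blur(img, k):
    # Zero-padded direct convolution; the kernel is symmetric, so
    # convolution and correlation coincide.
    r = k.shape[0] // 2
    padded = np.pad(img, r)
    out = np.zeros_like(img)
    for i in range(k.shape[0]):
        for j in range(k.shape[1]):
            out += k[i, j] * padded[i:i + img.shape[0], j:j + img.shape[1]]
    return out

kernel = gaussian_kernel(tau=1.5)
```

Since the kernel is normalized, constant image regions away from the boundary are left unchanged, which is the usual convention for blur operators of this kind.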

Numerical Optimization
For all image reconstruction tasks, we use the BSDS 500 data set [29] with grayscale images in [0, 1] 341×421 . We train all models on 200 train and 200 test images from the BSDS 500 data set and evaluate the performance on 68 validation images as specified by [36].
In all experiments, the activation functions (5) are parametrized using $N_w = 63$ quadratic B-spline basis functions $\psi_j \in C^1(\mathbb{R})$ with equidistant centers in the interval $[-1, 1]$. Let $\xi \in \mathbb{R}^{n_1 \times n_2}$ be the two-dimensional image of a corresponding data vector $u \in \mathbb{R}^n$, $n = n_1 \cdot n_2$. Then, the convolution $\kappa * \xi$ of the image $\xi$ with a filter $\kappa$ is modeled by applying the corresponding kernel matrix $K \in \mathbb{R}^{m \times n}$ to the data vector u. We only use kernels $\kappa$ of size 7 × 7. Motivated by the relation between the spectral norm and the Frobenius norm, we use $\alpha(K) = \|K\|_F^2$. Additionally, we use $\beta(\Phi) = \beta(w) = \|w\|_2^2$ for a weight vector w associated with $\Phi$. Since all numerical experiments yield a finite optimal stopping time T, we omit the constraint $T \leq T_{\max}$.
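The B-spline parametrization of the activation functions can be sketched directly; the random weights below are illustrative stand-ins for the learned values, and the centered quadratic B-spline is the standard $C^1$ bump supported on $[-1.5, 1.5]$ (scaled by the center spacing).

```python
import numpy as np

def quad_bspline(t):
    # Centered quadratic B-spline, C^1, supported on [-1.5, 1.5].
    t = np.abs(t)
    out = np.where(t < 0.5, 0.75 - t ** 2, 0.0)
    mask = (t >= 0.5) & (t < 1.5)
    return np.where(mask, 0.5 * (1.5 - t) ** 2, out)

def activation(y, weights, centers, h):
    # phi(y) = sum_j w_j * psi_j(y) with scaled/shifted B-spline bases.
    return sum(w * quad_bspline((y - c) / h) for w, c in zip(weights, centers))

N_w = 63
centers = np.linspace(-1.0, 1.0, N_w)
h = centers[1] - centers[0]
rng = np.random.default_rng(2)
weights = 0.1 * rng.standard_normal(N_w)  # stand-ins for learned weights
y = np.linspace(-0.9, 0.9, 101)
phi_y = activation(y, weights, centers, h)
```

A useful sanity check is the partition-of-unity property of uniform quadratic B-splines: with all weights equal to one, the activation is identically one away from the interval boundary, so the parametrization can represent constants and, by extension, smooth functions on the interior of $[-1, 1]$.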
For a given training set consisting of pairs of corrupted images $x_0^i \in \mathbb{R}^n$ and corresponding ground truth images $x_g^i \in \mathbb{R}^n$, we denote the associated index set by I. To train the model, we consider the discrete energy functional
$\frac{1}{|B|} \sum_{i \in B} \frac{1}{2} \|x_S^i - x_g^i\|_2^2$
for a subset $B \subset I$, where $x_S^i$ denotes the terminal value of the Euler/Heun iteration scheme for the corrupted image $x_0^i$. In all numerical experiments, we use the iPALM algorithm [32] described in Algorithm 1 to optimize all parameters with respect to a randomly selected batch B. Each batch consists of 64 image patches of size 96 × 96 that are uniformly drawn from the training data set.
For an optimization parameter q representing either T, $K_k$ or $w_k$, we use in the lth iteration step the over-relaxation
$\overline{q} = q^l + \gamma_l (q^l - q^{l-1})$
with an inertial parameter $\gamma_l \in [0, 1)$, followed by the projected gradient step
$q^{l+1} = \mathrm{proj}_Q\big(\overline{q} - \tfrac{1}{L_q} \nabla_q J(\overline{q})\big).$
We denote by $L_q$ the Lipschitz constant that is determined by backtracking and by $\mathrm{proj}_Q$ the orthogonal projection onto the corresponding constraint set Q.
Here, the constraint sets K and W are given by the kernels satisfying the zero mean condition (7) and the admissible weight vectors of the activation functions, respectively. Each component of the initial kernels $K_k$ in the case of image denoising is independently drawn from a Gaussian random variable with mean 0 and variance 1 such that $K_k \in K$.

Algorithm 1 (iPALM algorithm for stochastic training of image restoration tasks for L steps):
for l = 1 to L do
    randomly select a batch B ⊂ I;
    update $x^i$ and $p^i$ for i ∈ B using either (27)/(29) or (28)/(30);
    for k = 1 to $N_K$ do
        update T, $K_k$ and $w_k$ by the over-relaxed projected gradient steps described above;
    end
end

The learned optimal kernels of the denoising task are incorporated for the initialization of the kernels for deblurring. The weights $w_k$ of the activation functions are initialized such that $\phi_k(y) \approx 0.1 y$ around 0 for both reconstruction tasks.
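One inertial (over-relaxed) projected gradient step of the kind iPALM performs per parameter block can be sketched as follows. The inertial weight, the quadratic stand-in objective, and the reading of the zero mean condition (7) as row-wise zero mean are illustrative assumptions, not the paper's exact choices.

```python
import numpy as np

def project_zero_mean(K):
    # One plausible reading of condition (7): each row (filter) has zero mean.
    return K - K.mean(axis=1, keepdims=True)

def ipalm_step(q, q_prev, grad, L_q, l):
    beta = (l - 1.0) / (l + 2.0)      # illustrative inertial weight choice
    y = q + beta * (q - q_prev)       # over-relaxation
    return project_zero_mean(y - grad(y) / L_q)

# Quadratic stand-in objective 0.5*||K - G||_F^2 with gradient K - G.
rng = np.random.default_rng(3)
G = rng.standard_normal((5, 9))
grad = lambda K: K - G
K_prev = K = np.zeros((5, 9))
for l in range(1, 200):
    K_prev, K = K, ipalm_step(K, K_prev, grad, L_q=2.0, l=l)
# K converges to the projection of G onto the zero-mean constraint set.
```

Because the projection is linear here, the fixed point of the iteration is exactly the constrained minimizer, which makes the behavior of the inertial scheme easy to verify.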

Results
In the first numerical experiment, we train models for denoising and deblurring with $N_K = 48$ kernels, a depth S = 10 and L = 5000 training steps. Afterward, we use the calculated parameters $(K_k, \Phi_k)_{k=1}^{N_K}$ and T as an initialization and train models for various depths $S = 2, 4, \ldots, 50$ with L = 500. Figure 5 depicts the average PSNR value $\frac{1}{|\tilde{I}|} \sum_{i \in \tilde{I}} \mathrm{PSNR}(x_S^i, x_g^i)$, with $\tilde{I}$ denoting the index set of the test images, and the learned stopping time T as a function of the depth S for denoising (first two plots) and deblurring (last two plots). As a result, we observe that all plots converge for large S, where the PSNR curve monotonically increases. Moreover, the optimal stopping time T is finite in all these cases, which empirically validates that early stopping is beneficial. Thus, we can conclude that beyond a certain depth S the performance increase in terms of PSNR is negligible, whereas a proper choice of the optimal stopping time is significant. The asymptotic value of T for increasing S in the case of image deblurring is approximately 20 times larger compared to image denoising due to the structure of the deblurring operator A. Figure 6 depicts the average $\ell^2$-difference of consecutive convolution kernels and activation functions for denoising and deblurring as a function of the depth S. We observe that the differences decrease with larger values of S, which is consistent with the convergence of the optimal stopping time T for increasing S. Both time discretization schemes perform similarly, and thus in the following experiments, we solely present results calculated with Euler's method due to its advantages in computation time. We next demonstrate the applicability of the first-order condition for the energy minimization in static variational networks using Euler's discretization scheme with S = 20.
Figure 7 depicts band plots, along with the average curves over all training/test images, of the first-order condition as a function of the stopping time for denoising (first two plots) and deblurring (last two plots); the involved integral is approximated numerically. We deduce that the first-order condition indicates the energy minimizing stopping time for each test image. Note that all image-dependent stopping times are distributed around the average optimal stopping time learned during training, which is highlighted by the black vertical line. Figure 8 depicts, for two input images, the ground truth x_g (first column), the corrupted image g (second column), and the denoised images for the stopping times T/2, T, 3T/2, and 100T (third to sixth column), where T denotes the learned optimal stopping time. The maximum PSNR values, obtained at T, are 29.68 and 29.52, respectively. To ensure a sufficiently fine time discretization, we keep the ratio S/T constant, where we set S = 20 for the learned stopping time T. Likewise, Fig. 9 contains the corresponding results for the deblurring task; again, we enforce a constant ratio of S and T. The PSNR value peaks around the optimal stopping time, and the corresponding values are 29.52 and 27.80, respectively. We observed an average computation time of 5.694 ms for the denoising and 8.687 ms for the deblurring task using an RTX 2080 Ti graphics card and the PyTorch machine learning framework.
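The constant-ratio protocol S/T = const used for Figs. 8 and 9 simply fixes the Euler step size τ = T/S while the stopping time varies. A small sketch, where T_opt = 1.0 is only a placeholder for the learned optimal stopping time:

```python
# Fix the Euler step size tau = T/S while sweeping the stopping time T,
# so longer trajectories are discretized with proportionally more steps.
# T_opt = 1.0 is a placeholder for the learned optimal stopping time.
T_opt, S_opt = 1.0, 20
tau = T_opt / S_opt

schedules = []
for factor in (0.5, 1.0, 1.5, 100.0):   # T/2, T, 3T/2, 100T as in Fig. 8
    T = factor * T_opt
    S = int(round(T / tau))              # number of forward Euler steps
    schedules.append((T, S))
```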
As desired, T indicates the energy minimizing time, and the average curves for the training and test sets nearly coincide, which shows that the model generalizes to unseen test images. Although the gradient of the average energy curve (33) is rather flat near the learned optimal stopping time, the proper choice of T is indeed crucial, as shown by Figs. 8 and 9. In the case of denoising, for too small T we still observe noisy images, whereas for too large T, local image patterns are smoothed out. For image deblurring, images computed with too small values of T remain blurry, while for too large T ringing artifacts are generated, whose intensity increases with larger T. For a corrupted image, the associated adjoint state requires the knowledge of the ground truth for the terminal condition (14), which is in general not available. However, Fig. 7 shows that the learned average optimal stopping time T yields the smallest expected error. Thus, for arbitrary corrupted images, T is used as the stopping time. Figure 10 illustrates the plots of the energies (blue) and the first-order conditions (red) as a function of the stopping time for all test images for denoising (left) and deblurring (right), degraded by noise levels σ ∈ {0.075, 0.1, 0.125, 0.15} and blur strengths τ ∈ {1.25, 1.5, 1.75, 2.0}. Note that in each plot the curves of three prototypic images are visualized. To ensure a proper balancing of the data fidelity term and the regularization energy for the denoising task, we weight the data term by the factor 1/σ², as typically motivated by Bayesian inference. For all noise levels σ and blur strengths τ, the same fixed pairs of kernels and activation functions trained with σ = 0.1/τ = 1.5 and depth S = 20 are used. Again, the first-order conditions indicate the degradation-dependent energy minimizing stopping times.
The optimal stopping time increases with the noise level and blur strength, which results from a larger distance of x_0 and x_g and thus requires longer trajectories. Table 1 presents pairs of average PSNR values and optimal stopping times T on the test set for denoising (top) and deblurring (bottom) for different noise levels σ ∈ {0.075, 0.1, 0.125, 0.15} and blur strengths τ ∈ {1.25, 1.5, 1.75, 2.0}. All results in the table are obtained using N_K = 48 kernels and a depth S = 20. Both first rows present the results incorporating an optimization of all control parameters (K_k, w_k)_{k=1}^{N_K} and T. In contrast, both second rows show the resulting PSNR values and optimal stopping times for an optimization of the stopping time only, with kernels and activation functions (K_k, w_k)_{k=1}^{N_K} pretrained for σ = 0.1/τ = 1.5. Further, both third rows present the PSNR values and stopping times obtained by using these reference models without any further optimization. Finally, the last rows present the results obtained by using the FISTA algorithm [3] for the TV-L2 variational model [38] for denoising and the IRcgls algorithm of the "IR Tools" package [15] for deblurring, both of which are not data-driven and thus do not require any training. In detail, the weight parameter of the data term as well as the early stopping time of the TV-L2 model are optimized using a simple grid search.
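Such a grid search can be sketched generically; in the sketch below, `restore` stands for any solver (e.g., FISTA for the TV-L2 model), and the parameter grids are placeholders rather than the values used in the experiments.

```python
import itertools
import numpy as np

def psnr(x, ref, data_range=1.0):
    mse = np.mean((x - ref) ** 2)
    return np.inf if mse == 0 else 10.0 * np.log10(data_range ** 2 / mse)

def grid_search(restore, g, x_true, lambdas, stop_times):
    """Exhaustively try every (data weight, stopping time) pair and keep
    the one maximizing PSNR; `restore` is any solver, e.g. FISTA for TV-L2."""
    best_psnr, best_params = -np.inf, None
    for lam, T in itertools.product(lambdas, stop_times):
        p = psnr(restore(g, lam, T), x_true)
        if p > best_psnr:
            best_psnr, best_params = p, (lam, T)
    return best_params, best_psnr
```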
We use the IRcgls algorithm to iteratively minimize ‖Ax − b‖_2^2 with the conjugate gradient method until an early stopping rule based on the relative noise level of the residual is satisfied. Note that IRcgls is a Krylov subspace method, which is designed as a general-purpose method for large-scale inverse problems (for further details, see [15], [20, Chapter 6] and the references therein).

Fig. 11
Corrupted input image (first/third row, left), restored images using the FISTA/IRcgls algorithm (first/third row, right), and the proposed method (second/fourth row, left) as well as the ground truth image (second/fourth row, right) for denoising (upper quadruple) and deblurring (lower quadruple). The zoom factor of the magnifying lens is 3.

Figure 11 shows a comparison of the results of the FISTA algorithm for the TV-L2 model/the IRcgls algorithm and our method (with the optimal early stopping time) for σ = 0.1/τ = 1.5. In Table 1, the resulting PSNR values of the first and second rows are almost identical for image denoising despite varying optimal stopping times. Consequently, a model that was pretrained for a specific noise level can easily be adapted to other noise levels by modifying only the optimal stopping time. Neglecting the optimization of the stopping time leads to inferior results, as presented in the third rows, where the reference model was used for all degradation levels without any change. However, in the case of image deblurring, the model benefits from a full optimization of all controls, which is caused by the dependency of A on the blur strength. For the noise level 0.1, we observe an average PSNR value of 28.72, which is on par with the corresponding results of [10, Table II]. We emphasize that in their work a costly full minimization of an energy functional is performed, whereas we solely require a depth of S = 20 to compute comparable results.
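The residual-based early stopping used by IRcgls can be illustrated with a generic CGLS iteration that monitors the relative residual. This is a simplified sketch assuming the relative noise level is known, not the actual "IR Tools" implementation:

```python
import numpy as np

def cgls_early_stopped(A, b, rel_noise_level, max_iter=100):
    """CGLS: conjugate gradients on the normal equations A^T A x = A^T b,
    stopped once the relative residual ||Ax - b|| / ||b|| falls below the
    (assumed known) relative noise level, in the spirit of IRcgls."""
    x = np.zeros(A.shape[1])
    r = b.copy()                  # residual b - A x for x = 0
    s = A.T @ r                   # residual of the normal equations
    p = s.copy()
    gamma = s @ s
    for k in range(max_iter):
        if np.linalg.norm(r) <= rel_noise_level * np.linalg.norm(b):
            return x, k           # early stopping: do not fit the noise
        q = A @ p
        alpha = gamma / (q @ q)
        x = x + alpha * p
        r = r - alpha * q
        s = A.T @ r
        gamma_new = s @ s
        p = s + (gamma_new / gamma) * p
        gamma = gamma_new
    return x, max_iter
```

For ill-posed deblurring operators, the iteration number itself acts as a regularization parameter, which is exactly why stopping early at the noise level is preferable to full convergence.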
For the sake of completeness, we present in Fig. 12 (denoising) and Fig. 13 (deblurring) the resulting triplets of kernels (top), potential functions (middle), and activation functions (bottom) for a depth S = 20. The scaling of the axes is identical among all potential functions and activation functions, respectively. Note that the potential functions are computed by numerical integration of the learned activation functions, where we choose the integration constant such that every potential function is bounded from below by 0. As a result, we observe a large variety of kernel structures, including bipolar forward operators in different orientations (e.g., 5th kernel in the first row, 8th kernel in the third row) and pattern kernels representing prototypic image textures (e.g., kernels in the first column). Likewise, the learned potential functions can be assigned to several representative classes of common regularization functions such as, for instance, truncated total variation (8th function in the second row of Fig. 12), truncated concave (4th function in the third row of Fig. 12), double-well potential (10th function in the first row of Fig. 12), or "negative Mexican hat" (8th function in the third row of Fig. 12). Note that the associated kernels of both tasks nearly coincide, whereas the potential and activation functions differ significantly. We observe that the activation functions in the case of denoising tend to generate higher amplitudes than for deblurring, which results in a relatively stronger weighting of the regularizer in the case of denoising.
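The numerical integration of an activation function into its potential function can be sketched with a cumulative trapezoidal rule, shifting the integration constant so that the minimum is 0; the interval and grid size below are illustrative.

```python
import numpy as np

def potential_from_activation(phi, y_min=-1.0, y_max=1.0, n=1001):
    """Recover a potential rho with rho' = phi by cumulative trapezoidal
    integration; the integration constant is chosen so that min rho = 0.
    phi must accept a NumPy array (vectorized)."""
    y = np.linspace(y_min, y_max, n)
    vals = phi(y)
    increments = 0.5 * (vals[1:] + vals[:-1]) * np.diff(y)
    rho = np.concatenate(([0.0], np.cumsum(increments)))
    return y, rho - rho.min()
```

For instance, the identity activation φ(y) = y yields the quadratic potential ρ(y) = y²/2 after the shift.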

Spectral Analysis of the Learned Regularizers
Finally, in order to gain intuition for the learned regularizer, we perform a nonlinear eigenvalue analysis [16] of the gradient of the Field-of-Experts regularizer learned for S = 20 and the optimal stopping time T. To this end, we compute several generalized eigenpairs (v_j, λ_j) of the gradient ∇R of the regularizer, i.e., ∇R(v_j) = λ_j v_j. Note that by omitting the data term, the forward Euler scheme (27) applied to a generalized eigenfunction v_j reduces to

v_j − (T/S) ∇R(v_j) = (1 − λ_j T/S) v_j,    (34)

where the contrast factor (1 − λ_j T/S) determines the global intensity change of the eigenfunction. We point out that, due to the nonlinearity of the eigenvalue problem, such a formula only holds locally for each iteration of the scheme.
We compute N_v = 64 generalized eigenpairs of size 127 × 127 by solving

min_{v_j} ‖∇R(v_j) − λ(v_j) v_j‖_2^2    (35)

for all j = 1, …, N_v, where λ(v) = ⟨∇R(v), v⟩ / ‖v‖_2^2 denotes the generalized Rayleigh quotient, which is derived by minimizing (35) with respect to λ(v). The eigenfunctions are computed using an accelerated gradient descent with step size control [32]. All eigenfunctions are initialized with randomly chosen image patches of the test image data set, from which we subtract the mean. Moreover, in order to minimize the influence of the image boundary, we scale the image intensity values with a spatial Gaussian kernel. We run the algorithm for 10^4 iterations, which is sufficient to reach a residual of approximately 10^{-5} for each eigenpair. Figure 14 depicts the resulting pairs of eigenfunctions and eigenvalues for image denoising. We observe that eigenfunctions corresponding to smaller eigenvalues in general represent more complex and smoother image structures. In particular, the first eigenfunctions can be interpreted as cartoon-like image structures with clearly separable interfaces. Most of the eigenfunctions associated with larger eigenvalues exhibit texture-like patterns of progressively higher frequency. Finally, wave and noise structures are present in the eigenfunctions with the highest eigenvalues.
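A drastically simplified version of this eigenpair computation might look as follows. It uses plain instead of accelerated gradient descent and assumes a symmetric, linear ∇R, so the Jacobian-transpose action coincides with ∇R itself; the learned regularizer gradient is of course nonlinear, so this is only a toy sketch of the residual minimization (35).

```python
import numpy as np

def rayleigh(grad_R, v):
    # generalized Rayleigh quotient: the lambda minimizing ||grad_R(v) - lambda v||^2
    return (grad_R(v) @ v) / (v @ v)

def eigenpair(grad_R, v0, eta=0.1, iters=1000):
    """Gradient descent on v -> 0.5 * ||grad_R(v) - lambda(v) v||^2 with lambda
    frozen within each step and v renormalized. Replacing the Jacobian-transpose
    action by grad_R is exact only for symmetric linear operators -- a toy
    stand-in for the accelerated scheme with step size control used in the paper."""
    v = v0 / np.linalg.norm(v0)
    for _ in range(iters):
        lam = rayleigh(grad_R, v)
        r = grad_R(v) - lam * v          # eigenpair residual
        v = v - eta * (grad_R(r) - lam * r)
        v = v / np.linalg.norm(v)
    return v, rayleigh(grad_R, v)
```

Started from a generic initial vector, the iteration converges to the eigenpair whose eigenvalue is closest to the initial Rayleigh quotient.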
We remark that all eigenvalues lie in the interval [0.025, 11.696]. Since T/S ≈ 0.054, the contrast factors (1 − λ_j T/S) in (34) lie in the interval [0.368, 0.999], which shows that the regularizer has a tendency to decrease the contrast. Formula (34) also reveals that eigenfunctions with contrast factors close to 1 are preserved over several iterations. In summary, the learned regularizer tends to reduce the contrast of high-frequency noise patterns, but preserves the contrast of texture- and structure-like patterns. Figure 15 shows the eigenpairs for the deblurring task. All eigenvalues are relatively small and distributed around 0, which means that the corresponding contrast factors lie in the interval [0.992, 1.030]. Therefore, the learned regularizer can both decrease and increase the contrast. Moreover, most eigenfunctions are composed of smooth structures with a distinct overshooting behavior in the proximity of image boundaries. This implies that the learned regularizer has a tendency to perform image sharpening.
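The quoted contrast factors follow directly from (34); a quick check using the extreme eigenvalues reported in the text:

```python
# Contrast factors (1 - lambda_j * T/S) from (34) for the reported
# extreme eigenvalues of the denoising regularizer, with T/S ~= 0.054.
tau = 0.054
lam_min, lam_max = 0.025, 11.696
c_max = 1.0 - lam_min * tau   # contrast of low-eigenvalue (structure) modes
c_min = 1.0 - lam_max * tau   # contrast of high-eigenvalue (noise) modes
```

A single Euler step thus scales structure-like eigenfunctions by roughly 0.999 but noise-like ones by roughly 0.368; locally, k such steps scale an eigenfunction by (1 − λ_j T/S)^k, so the gap widens along the trajectory.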

Conclusion
Starting from a parametric and autonomous gradient flow perspective of variational methods, we explicitly modeled the stopping time as a control parameter in an optimal control problem. Using a Lagrangian approach, we derived a first-order condition suited to automate the calculation of the energy minimizing optimal stopping time. A forward Euler discretization of the gradient flow led to static variational networks. Numerical experiments confirmed that a proper choice of the stopping time is of vital importance for the considered image restoration tasks in terms of the PSNR value. We performed a nonlinear eigenvalue analysis of the gradient of the learned Field-of-Experts regularizer, which revealed interesting properties of its local regularization behavior. A comprehensive long-term spectral analysis in continuous time is left for future research.
A further future research direction would be the extension to dynamic variational networks, in which the kernels and activation functions evolve in time. However, a major issue related to this extension emerges from the continuation of the stopping time beyond its optimal point.