1 Introduction

In recent years, numerous image restoration tasks in computer vision such as denoising [38] or super-resolution [39] have benefited from a variety of pioneering variational methods. In general, variational methods [9] aim at the minimization of an energy functional designed for a specific image reconstruction problem, where the energy minimizer defines the restored output image. For the image restoration tasks considered here, the observed corrupted image is generated by a degradation process applied to the corresponding ground truth image, i.e., the real uncorrupted image.

In this paper, the energy functional is composed of an a priori known, task-dependent, quadratic data fidelity term and a Field-of-Experts-type regularizer [36], whose building blocks are learned kernels and learned activation functions. This regularizer generalizes the prominent total variation regularization functional and is capable of accounting for higher-order image statistics. A classical approach to minimize the energy functional is a continuous-time gradient flow, which defines a trajectory emanating from a fixed initial image. Typically, the regularizer is adapted such that the end point image of the trajectory lies in the proximity of the corresponding ground truth image. However, even the general class of Field-of-Experts-type regularizers is not able to capture the entirety of the complex structure of natural images, which is why the end point image may substantially differ from the ground truth image. To address this insufficient modeling, we advocate an optimal control problem using the gradient flow differential equation as the state equation and a cost functional that quantifies the distance between the ground truth image and the gradient flow trajectory evaluated at the stopping time T. Besides the parameters of the regularizer, the stopping time is an additional control parameter learned from data.

The main contribution of this paper is the derivation of criteria to automate the calculation of the optimal stopping time T for the aforementioned optimal control problem. In particular, we observe in the numerical experiments that the learned stopping time is always finite, even if the learning algorithm has the freedom to choose a larger stopping time.

For the numerical optimization, we discretize the state equation by means of the explicit Euler and Heun schemes. This results in an iterative scheme that can be interpreted as a static variational network [11, 19, 24], a subclass of deep learning models [26]. Here, the prefix “static” refers to regularizers that are constant with respect to time. In several experiments, we demonstrate that learned static variational networks terminated at the optimal stopping time outperform classical variational methods for image restoration tasks. Consequently, the early stopped gradient flow approach is both better suited for image restoration problems and computationally more efficient than the classical variational approach.

A well-known major drawback of mainstream deep learning approaches is the lack of interpretability of the learned networks. In contrast, following [16], the variational structure of the proposed model allows us to analyze the learned regularizers by means of a nonlinear spectral analysis. The computed eigenpairs reveal insightful properties of the learned regularizers.

Several approaches in the literature cast deep learning models as dynamical systems, in which the model parameters can be seen as the control parameters of an optimal control problem. E [12] clarified that deep neural networks such as residual networks [21] arise from a discretization of a suitable dynamical system. In this context, the training process can be interpreted as the computation of the controls in the corresponding optimal control problem. In [27, 28], Pontryagin’s maximum principle is exploited to derive necessary optimality conditions for the optimal control problem in continuous time, which results in a rigorous discrete-time optimization. Certain classes of deep learning networks are examined as mean-field optimal control problems in [13], where optimality conditions of the Hamilton–Jacobi–Bellman type and the Pontryagin type are derived. The effect of several discretization schemes for classification tasks has been studied from the viewpoint of stability in [4, 7, 17], which leads to a variety of different network architectures that are empirically shown to be more stable.

The benefit of early stopping for iterative algorithms is examined in the literature from several perspectives. In the context of ill-posed inverse problems, early stopping of iterative algorithms is frequently considered and analyzed as a regularization technique. There is a variety of literature on the topic, and we therefore only mention the selected monographs [14, 23, 34, 41]. Frequently, early stopping rules for inverse problems are discussed in the context of the Landweber iteration [25] or its continuous analogue commonly referred to as Showalter’s method [44] and are based on criteria such as the discrepancy or the balancing principle.

In what follows, we provide an overview of recent advances related to early stopping. Raskutti et al. [37] exploit early stopping for nonparametric regression problems in reproducing kernel Hilbert spaces (RKHS) to prevent overfitting and derive a data-dependent stopping rule. Yao et al. [42] discuss early stopping criteria for gradient descent algorithms for RKHS and relate these results to the Landweber iteration. Quantitative properties of the early stopping condition for the Landweber iteration are presented in Binder et al. [5]. Zhang and Yu [45] prove convergence and consistency results for early stopping in the context of boosting. Prechelt [33] introduces several heuristic criteria for optimal early stopping based on the performance of the training and validation error. Rosasco and Villa [35] investigate early stopping in the context of incremental iterative regularization and prove sample bounds in a stochastic environment. Matet et al. [30] exploit an early stopping method to regularize (strongly) convex functionals. In contrast to these approaches, we propose early stopping on the basis of finding a local minimum with respect to the time horizon of a properly defined energy.

To illustrate the necessity of early stopping for iterative algorithms, we revisit the established TV-\(L^2\) denoising functional [38], which amounts to minimizing \(E[u]=\Vert u-g\Vert _{L^2(\Omega )}^2+\nu |Du|(\Omega )\) among all functions \(u\in \mathrm{BV}(\Omega )\), where \(\Omega \subset {\mathbb {R}}^n\) denotes a bounded domain, \(\nu >0\) is the regularization parameter and \(g\in L^\infty (\Omega )\) refers to a corrupted input image. An elementary, yet very inefficient, optimization algorithm relies on a gradient descent using a finite difference discretization of the regularized functional (\(\epsilon >0\))

$$\begin{aligned} E_\epsilon [u_h]&=\Vert u_h-g_h\Vert _{L^2(\Omega _h)}^2\nonumber \\&+\nu \sum _{(i,j)\in \Omega _h}\sqrt{|( D u_h)_{i,j}|^2+\epsilon ^2}, \end{aligned}$$
(1)

where \(\Omega _h\) denotes a lattice, \(u_h,g_h:\Omega _h\rightarrow {\mathbb {R}}\) are discrete functions and \(( D u_h)_{i,j}\) is a finite difference gradient operator with Neumann boundary constraint (for details see [8, Section 3]). For a comprehensive list of state-of-the-art methods to efficiently solve TV-based variational problems, we refer the reader to [9].

Fig. 1

Contour plot of the peak signal-to-noise ratio depending on the number of iterations and the regularization parameter \(\nu \) for TV-\(L^2\) denoising. The global maximum is marked with a red cross (Color figure online)

Figure 1 depicts the dependency of the peak signal-to-noise ratio (PSNR) on the number of iterations and the regularization parameter \(\nu \) for the TV-\(L^2\) problem (1) using a step size of \(10^{-4}\) and \(\epsilon =10^{-6}\), where the input image \(g\in L^\infty (\Omega _h,[0,1])\) with a resolution of \(512\times 512\) is corrupted by additive Gaussian noise with standard deviation 0.1. As a result, for each regularization parameter \(\nu \) there exists a unique optimal number of iterations at which the signal-to-noise ratio peaks. Beyond this point, the quality of the resulting image deteriorates: staircasing artifacts appear and fine texture patterns are smoothed out. The global maximum (26, 0.0474) is marked with a red cross; the associated image sequence is shown in Fig. 2 (left to right: input image, noisy image, restored images after 13, 26, 39, 52 iterations). If the gradient descent is considered as a discretization of a time-continuous evolution process governed by a differential equation, then the optimal number of iterations translates to an optimal stopping time.
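The experiment behind Figs. 1 and 2 can be reproduced in miniature. The following sketch runs gradient descent on the smoothed functional (1) and records the iterate with the best PSNR; for brevity it assumes periodic instead of Neumann boundary handling, a small toy image, and a larger step size than the \(10^{-4}\) used above:

```python
import numpy as np

def grad_tv_l2(u, g, nu, eps):
    """Gradient of the smoothed TV-L2 energy (1): 2(u - g) minus nu times the
    divergence of the normalised image gradient (periodic boundary for brevity,
    whereas the text uses Neumann boundary conditions)."""
    dx = np.roll(u, -1, axis=1) - u          # forward differences
    dy = np.roll(u, -1, axis=0) - u
    mag = np.sqrt(dx ** 2 + dy ** 2 + eps ** 2)
    px, py = dx / mag, dy / mag
    div = (px - np.roll(px, 1, axis=1)) + (py - np.roll(py, 1, axis=0))
    return 2.0 * (u - g) - nu * div

def psnr(u, ref):
    """PSNR in dB for images with intensity range [0, 1]."""
    return 10.0 * np.log10(1.0 / np.mean((u - ref) ** 2))

def denoise_early_stop(g, ref, nu, tau, eps=1e-6, max_iter=100):
    """Gradient descent on (1) that records the iterate with the best PSNR,
    i.e., the empirically optimal early stopping point."""
    u, best_u, best_psnr, best_it = g.copy(), g.copy(), psnr(g, ref), 0
    for it in range(1, max_iter + 1):
        u = u - tau * grad_tv_l2(u, g, nu, eps)
        p = psnr(u, ref)
        if p > best_psnr:
            best_psnr, best_u, best_it = p, u.copy(), it
    return best_u, best_it, best_psnr
```

Selecting the best iterate requires the ground truth, exactly as the PSNR surface in Fig. 1 does; the optimal control problem below replaces this oracle selection by a learned stopping time.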

Fig. 2

Image sequence with globally best PSNR value. Left to right: input image, noisy image, restored images after 13, 26, 39, 52 iterations

In this paper, we denote the standard inner product in the Euclidean space \({\mathbb {R}}^n\) by \(\langle \cdot ,\cdot \rangle \). Let \(\Omega \subset {\mathbb {R}}^n\) be a domain. We denote the space of continuous functions by \(C^0(\Omega )\), the space of k-times continuously differentiable functions by \(C^k(\Omega )\) for \(k\ge 1\), the Lebesgue space by \(L^p(\Omega )\), \(p\in [1,\infty )\), and the Sobolev space by \(H^m(\Omega )=W^{m,2}(\Omega )\), \(m\in {\mathbb {N}}\), where the latter space is endowed with the Sobolev (semi-)norm for \(f\in H^m(\Omega )\) defined as \(|f|_{H^m(\Omega )}=\Vert D^m f\Vert _{L^2(\Omega )}\) and \(\Vert f\Vert _{H^m(\Omega )}=(\sum _{j=0}^m|f|_{H^j(\Omega )}^2)^\frac{1}{2}\). With a slight abuse of notation, we frequently set \(C^k({\overline{\Omega }})=C^0({\overline{\Omega }})\cap C^k(\Omega )\). The identity matrix in \({\mathbb {R}}^n\) is denoted by \(\mathrm {Id}\). \({\mathbf {1}}=(1,\ldots ,1)^\top \in {\mathbb {R}}^n\) is the one vector.

This paper is organized as follows: In Sect. 2, we argue that certain classes of image restoration problems can be perceived as optimal control problems, in which the state equation coincides with the evolution equation of static variational networks, and we prove the existence of solutions under quite general assumptions. Moreover, we derive a first-order necessary as well as a second-order sufficient condition for the optimal stopping time in this optimal control problem. A Runge–Kutta time discretization of the state equation results in the update scheme of static variational networks, which is discussed in detail in Sect. 3. In addition, we visualize the effect of the optimality conditions in a simple numerical example in \({\mathbb {R}}^2\) and discuss alternative approaches for the derivation of static variational networks. Finally, we demonstrate the applicability of the optimality conditions to two prototype image restoration problems in Sect. 4: denoising and deblurring.

2 Optimal Control Approach to Early Stopping

In this section, we derive a time continuous analog of static variational networks as gradient flows of an energy functional \({\mathcal {E}}\) composed of a data fidelity term \({\mathcal {D}}\) and a Field-of-Experts-type regularizer \({\mathcal {R}}\). The resulting ordinary differential equation is used as the state equation of an optimal control problem, in which the cost functional incorporates the squared \(L^2\)-distance of the state evaluated at the optimal stopping time to the ground truth as well as (box) constraints of the norms of the stopping time, the kernels and the activation functions. We prove the existence of minimizers of this optimal control problem under quite general assumptions. Finally, we derive first- and second-order optimality conditions for the optimal stopping time using a Lagrangian approach.

Let \(u\in {\mathbb {R}}^n\) be a data vector, which is either a signal of length n in 1D, an image of size \(n=n_1\times n_2\) in 2D or spatial data of size \(n=n_1\times n_2\times n_3\) in 3D. Since we are primarily interested in two-dimensional image restoration, we focus on this task in the rest of this paper and merely remark that all results can be generalized to the remaining cases. For convenience, we restrict ourselves to grayscale images; the generalization to color or multi-channel images is straightforward. In what follows, we analyze an energy functional of the form

$$\begin{aligned} {\mathcal {E}}[u]={\mathcal {D}}[u]+{\mathcal {R}}[u] \end{aligned}$$
(2)

that is composed of a data fidelity term \({\mathcal {D}}\) and a regularizer \({\mathcal {R}}\) specified below. We incorporate the Field-of-Experts regularizer [36], which is a common generalization of the discrete total variation regularizer and is given by

$$\begin{aligned} {\mathcal {R}}[u]=\sum _{k=1}^{N_K}\sum _{i=1}^m\rho _k((K_ku)_i) \end{aligned}$$

with kernels \(K_k\in {\mathbb {R}}^{m\times n}\) and associated nonlinear functions \(\rho _k:{\mathbb {R}}\rightarrow {\mathbb {R}}\) for \(k=1,\ldots ,N_K\).
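Once the kernels are realised as matrices acting on the vectorised image, the regularizer can be evaluated directly. A minimal numpy sketch, in which the one-dimensional forward-difference kernel and \(\rho =|\cdot |\) are illustrative choices recovering a discrete (anisotropic) total variation:

```python
import numpy as np

def foe_regularizer(u, kernels, rhos):
    """Field-of-Experts energy R[u] = sum_k sum_i rho_k((K_k u)_i),
    with each K_k a dense m x n matrix acting on the vectorised signal u."""
    return sum(rho(K @ u).sum() for K, rho in zip(kernels, rhos))

# Illustrative special case: forward differences with rho = |.| give the
# discrete (anisotropic, 1D) total variation; each row of D sums to zero.
n = 5
D = (np.eye(n, k=1) - np.eye(n))[:-1]        # m = n - 1 difference rows
```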

Throughout this paper, we consider the specific data fidelity term

$$\begin{aligned} {\mathcal {D}}[u]=\frac{1}{2}\Vert Au-b\Vert _2^2 \end{aligned}$$

for fixed \(A\in {\mathbb {R}}^{l\times n}\) and fixed \(b\in {\mathbb {R}}^l\). We remark that various image restoration tasks can be cast in exactly this form for suitable choices of A and b [9].

The gradient flow [1] associated with the energy \({\mathcal {E}}\) for a time \(t\in (0,T)\) reads as

$$\begin{aligned} \dot{{\tilde{x}}}(t)&=-D{\mathcal {E}}[{\tilde{x}}(t)]\nonumber \\&=-A^\top (A{\tilde{x}}(t)-b)-\sum _{k=1}^{N_K}K_k^\top \Phi _k(K_k{\tilde{x}}(t)), \end{aligned}$$
(3)
$$\begin{aligned} {\tilde{x}}(0)&=x_0, \end{aligned}$$
(4)

where \({\tilde{x}}\in C^1([0,T],{\mathbb {R}}^n)\) denotes the flow of \({\mathcal {E}}\) with \(T\in {\mathbb {R}}\), and the function \(\Phi _k\in {\mathcal {V}}^s\) is given by

$$\begin{aligned} (y_1,\ldots ,y_m)^\top \mapsto (\rho _k'(y_1),\ldots ,\rho _k'(y_m))^\top . \end{aligned}$$
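The right-hand side of the gradient flow (3) translates directly into code. A minimal sketch, where each phi is assumed to act componentwise (playing the role of \(\rho _k'\)):

```python
import numpy as np

def gradient_flow_rhs(x, A, b, kernels, phis):
    """Right-hand side of the gradient flow (3):
    -A^T (A x - b) - sum_k K_k^T Phi_k(K_k x),
    where each phi acts componentwise on the filter responses K_k x."""
    out = -A.T @ (A @ x - b)
    for K, phi in zip(kernels, phis):
        out = out - K.T @ phi(K @ x)
    return out
```

Without a regularizer the flow reduces to the classical Landweber-type dynamics driven by the data term alone, vanishing exactly where \(Ax=b\).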

For a fixed \(s\ge 0\) and a fixed bounded open interval \(I\subset {\mathbb {R}}\), we consider basis functions \(\psi _1,\ldots ,\psi _{N_w}\in C^s({\mathbb {R}},{\mathbb {R}})\) with compact support in \({\overline{I}}\) for \(N_w\ge 1\). The vectorial function space \({\mathcal {V}}^s\) for the activation functions is composed of m identical component functions \(\phi \in C^s({\mathbb {R}},{\mathbb {R}})\), which are given as linear combinations of \((\psi _j)_{j=1}^{N_w}\) with weight vector \(w\in {\mathbb {R}}^{N_w}\), i.e.

$$\begin{aligned} {\mathcal {V}}^s:=\left\{ \Phi =(\phi ,\ldots ,\phi ):{\mathbb {R}}^m\rightarrow {\mathbb {R}}^m\Bigg |\phi =\sum _{j=1}^{N_w}w_j\psi _j\right\} .\nonumber \\ \end{aligned}$$
(5)
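A concrete instance of \({\mathcal {V}}^s\): piecewise linear hat functions are continuous and compactly supported, hence admissible basis functions for \(s=0\). The following sketch (the hat basis and the grid of centers are illustrative assumptions, not a choice made in the text) builds the scalar function \(\phi =\sum _{j}w_j\psi _j\) and applies it componentwise:

```python
import numpy as np

def make_activation(weights, centers, width):
    """phi = sum_j w_j psi_j with hat basis functions psi_j supported on
    [c_j - width, c_j + width]; phi is C^0, i.e., admissible for s = 0.
    Phi = (phi, ..., phi) acts componentwise via numpy broadcasting."""
    centers = np.asarray(centers, dtype=float)
    weights = np.asarray(weights, dtype=float)
    def phi(y):
        y = np.asarray(y, dtype=float)[..., None]   # broadcast over basis
        hats = np.maximum(1.0 - np.abs(y - centers) / width, 0.0)
        return hats @ weights
    return phi
```

Choosing the weights equal to the node values makes \(\phi \) interpolate the identity at the centers, a convenient initialization before learning.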

We remark that in contrast to [4, 17], inverse problems for image restoration rather than image classification are examined. Thus, we incorporate in (3) the classical gradient flow with respect to the full energy functional in order to promote data consistency, whereas in the classification tasks, only the gradient flow with respect to the regularizer is considered.

In what follows, we analyze an optimal control problem, for which the state equation (3) and the initial condition (4) arise as equality constraints. The cost functional \({\widetilde{J}}\) incorporates the \(L^2\)-distance between the flow \({\tilde{x}}\) evaluated at time T and the ground truth state \(x_g\in {\mathbb {R}}^n\) and is given by

$$\begin{aligned} {\widetilde{J}}(T,(K_k,\Phi _k)_{k=1}^{N_K}):=\frac{1}{2}\Vert {\tilde{x}}(T)-x_g\Vert _2^2. \end{aligned}$$

We assume that the controls T, \(K_k\) and \(\Phi _k\) satisfy the box constraints

$$\begin{aligned} 0\le T\le T_\mathrm {max},\quad \alpha (K_k)\le 1,\quad \beta (\Phi _k)\le 1, \end{aligned}$$
(6)

as well as the zero mean condition

$$\begin{aligned} K_k {\mathbf {1}}=0\in {\mathbb {R}}^m. \end{aligned}$$
(7)

Here, we have \(k=1,\ldots ,N_K\) and we choose a fixed parameter \(T_\mathrm {max}>0\). Further, \(\alpha :{\mathbb {R}}^{m\times n}\rightarrow {\mathbb {R}}_0^+\) and \(\beta :{\mathcal {V}}^s\rightarrow {\mathbb {R}}_0^+\) are continuously differentiable functions with non-vanishing gradient such that \(\alpha (K)\rightarrow \infty \) and \(\beta (\Phi )\rightarrow \infty \) as \(\Vert K\Vert _2\rightarrow \infty \) and \(\Vert \Phi \Vert _{L^\infty }\rightarrow \infty \). We include the condition (7) to reduce the dimensionality of the kernel space. Moreover, this condition ensures an invariance with respect to gray-value shifts of image intensities.
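In a learning loop, the zero mean condition (7) and a norm bound as in (6) are conveniently enforced by projection after each parameter update. A minimal sketch, assuming the concrete choice \(\alpha (K)=\Vert K\Vert _F\):

```python
import numpy as np

def project_kernel(K, radius=1.0):
    """Enforce the zero mean condition K @ 1 = 0 by subtracting the row means,
    then the norm bound alpha(K) <= 1 with the assumed choice
    alpha(K) = ||K||_F / radius (rescale if the bound is violated)."""
    K = K - K.mean(axis=1, keepdims=True)      # rows sum to zero
    nrm = np.linalg.norm(K)
    if nrm > radius:
        K = K * (radius / nrm)
    return K
```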

Fig. 3

Schematic drawing of optimal trajectory (black curve) as well as suboptimal trajectories (gray dashed curves) emanating from \(x_0\) with ground truth \(x_g\), optimal restored image \({\tilde{x}}({\overline{T}})\), sink/stable node \({\tilde{x}}_\infty \) and energy isolines (red concentric circles) (Color figure online)

The particular choice of the cost functional originates from the observation that a visually appealing image restoration is obtained as the point on the trajectory of the flow \({\tilde{x}}\) closest to \(x_g\) (as measured by the \(L^2\)-distance), subject to a moderate flow regularization as quantified by the box constraints. Figure 3 illustrates this optimization task for the optimal control problem. Among all trajectories of the ordinary differential equation (3) emanating from a constant initial value \(x_0\), one seeks the trajectory that is closest to the ground truth \(x_g\) in terms of the squared Euclidean distance as visualized by the energy isolines. Note that each trajectory is uniquely determined by \((K_k,\Phi _k)_{k=1}^{N_K}\). The sink/stable node \({\tilde{x}}_\infty \) is an equilibrium point of the ordinary differential equation, in which all eigenvalues of the Jacobian of the right-hand side of (3) have negative real parts [40].

The constraint on the stopping time is solely required for the existence theory; in the image restoration problems considered here, a finite stopping time is always observed even without this constraint. Hence, the optimal control problem reads as

$$\begin{aligned} \min _{T\in {\mathbb {R}},K_k\in {\mathbb {R}}^{m\times n},\Phi _k\in {\mathcal {V}}^s}{\widetilde{J}}(T,(K_k,\Phi _k)_{k=1}^{N_K}) \end{aligned}$$
(8)

subject to the constraints (6) and (7) as well as the nonlinear autonomous initial value problem (Cauchy problem) representing the state equation

$$\begin{aligned} \dot{{\tilde{x}}}(t)&=f({\tilde{x}}(t),(K_k,\Phi _k)_{k=1}^{N_K})\nonumber \\&:=-A^\top (A{\tilde{x}}(t)-b)-\sum _{k=1}^{N_K}K_k^\top \Phi _k(K_k{\tilde{x}}(t)) \end{aligned}$$
(9)

for \(t\in (0,T)\) and \({\tilde{x}}(0)=x_0\). We refer to the minimizing time T in (8) as the optimal early stopping time. To better handle this optimal control problem, we employ the reparametrization \(x(t)={\tilde{x}}(tT)\), which results in the equivalent optimal control problem

$$\begin{aligned} \min _{T\in {\mathbb {R}},K_k\in {\mathbb {R}}^{m\times n},\Phi _k\in {\mathcal {V}}^s}J(T,(K_k,\Phi _k)_{k=1}^{N_K}) \end{aligned}$$
(10)

subject to (6), (7) and the transformed state equation

$$\begin{aligned} {\dot{x}}(t)=Tf(x(t),(K_k,\Phi _k)_{k=1}^{N_K}),\qquad x(0)=x_0 \end{aligned}$$
(11)

for \(t\in (0,1)\), where

$$\begin{aligned} J(T,(K_k,\Phi _k)_{k=1}^{N_K}):=\frac{1}{2}\Vert x(1)-x_g\Vert _2^2. \end{aligned}$$
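Discretizing the rescaled state equation (11) with S explicit Euler steps of size 1/S yields the layer-wise update of a static variational network (a Heun discretization would be analogous). A minimal sketch:

```python
import numpy as np

def variational_network(x0, T, f, steps):
    """Explicit Euler discretisation of (11): x_{s+1} = x_s + (T/steps) f(x_s).
    Each step corresponds to one network layer; f is the gradient-flow
    right-hand side, shared across layers ("static" regularizer)."""
    x = x0.copy()
    h = T / steps
    for _ in range(steps):
        x = x + h * f(x)
    return x
```

With \(f(x)=-(x-b)\), i.e., \(A=\mathrm {Id}\) and no regularizer, the iterates approximate the exact flow \(b+(x_0-b)e^{-Tt}\).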

Remark 1

The long-term dynamics of the nonlinear autonomous state equation (11) is determined by the set of fixed points \(\{y\in {\mathbb {R}}^n:f(y)=0\}\) of f. A system is asymptotically stable at a fixed point y if all real parts of the eigenvalues of Df(y) are strictly negative [18, 40]. In the case of convex potential functions \(\rho \) with unbounded support and a full rank matrix A, the autonomous differential equation (3) is asymptotically stable.
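The stability criterion of Remark 1 is straightforward to check numerically. A sketch for the illustrative case of quadratic potentials \(\rho _k(y)=y^2/2\), where the Jacobian of the right-hand side of (3) is the constant matrix \(-(A^\top A+\sum _k K_k^\top K_k)\):

```python
import numpy as np

def is_asymptotically_stable(jac):
    """Asymptotic stability at a fixed point y: all eigenvalues of the
    Jacobian Df(y) must have strictly negative real part."""
    return bool(np.all(np.linalg.eigvals(jac).real < 0.0))
```

For a full-rank A the matrix \(-(A^\top A+\sum _k K_k^\top K_k)\) is negative definite, matching the assertion of the remark.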

In the next theorem, we apply the direct method in the calculus of variations to prove the existence of minimizers for the optimal control problem.

Theorem 1

(Existence of solutions) Let \(s\ge 0\). Then the minimum in (10) is attained.

Proof

Without restriction, we solely consider the case \(N_K=1\) and omit the subscript.

Let \((T^i,K^i,\Phi ^i)\in {\mathbb {R}}\times {\mathbb {R}}^{m\times n}\times {\mathcal {V}}^s\) be a minimizing sequence for J with an associated state \(x^i\in C^1([0,1],{\mathbb {R}}^n)\) such that (6), (7) and (11) hold true (the existence of \(x^i\) is verified below). The coercivity of \(\alpha \) and \(\beta \) implies \(\Vert K^i\Vert _2\le C_\alpha \) and \(\Vert \Phi ^i\Vert _{L^\infty }\le C_\beta \) for fixed constants \(C_\alpha ,C_\beta >0\). Due to the finite dimensionality of \({\mathcal {V}}^s\) and the boundedness of \(\Vert \Phi ^i\Vert _{L^\infty }\le C_\beta \), we can deduce the existence of a subsequence (not relabeled) such that \(\Phi ^i\rightarrow \Phi \in {\mathcal {V}}^s\). In addition, using the bounds \(T^i\in [0,T_\mathrm {max}]\) and \(\Vert K^i\Vert _2\le C_\alpha \), we can pass to further subsequences if necessary to deduce \((T^i,K^i)\rightarrow (T,K)\) for suitable \((T,K)\in [0,T_\mathrm {max}]\times {\mathbb {R}}^{m\times n}\) such that \(\Vert K\Vert _2\le C_\alpha \). The state equation (11) implies

$$\begin{aligned}&\Vert {\dot{x}}^i(t)\Vert _2\\&\quad \le T^i\Vert K^i\Vert _2\Vert \Phi ^i\Vert _{L^\infty }+T^i\Vert A\Vert _2(\Vert A\Vert _2\Vert x^i(t)\Vert _2+\Vert b\Vert _2)\\&\quad \le T_\mathrm {max}C_\alpha C_\beta +T_\mathrm {max}\Vert A\Vert _2(\Vert A\Vert _2\Vert x^i(t)\Vert _2+\Vert b\Vert _2). \end{aligned}$$

This estimate already guarantees that [0, 1] is contained in the maximum domain of existence of the state equation due to the linear growth of the right-hand side in \(x^i\) [40, Theorem 2.17]. Moreover, Gronwall’s inequality [18, 40] ensures the uniform boundedness of \(\Vert x^i(t)\Vert _2\) for all \(t\in [0,1]\) and all \(i\in {\mathbb {N}}\), which in combination with the above estimate already implies the uniform boundedness of \(\Vert {\dot{x}}^i(t)\Vert _2\). Thus, by passing to a subsequence (again not relabeled), we infer that \(x\in H^1((0,1),{\mathbb {R}}^n)\) exists such that \(x(0)=x_0\) (the pointwise evaluation is possible due to the Sobolev embedding theorem), \(x^i\rightharpoonup x\) in \(H^1((0,1),{\mathbb {R}}^n)\) and \(x^i\rightarrow x\) in \(C^0([0,1],{\mathbb {R}}^n)\). In addition, we obtain

$$\begin{aligned} \Vert T^i(K^i)^\top \Phi ^i(K^i x^i(t))-TK^\top \Phi (Kx(t))\Vert _{C^0([0,1])}\rightarrow 0 \end{aligned}$$

as \(i\rightarrow \infty \) and \({\dot{x}}(t)=-TK^\top \Phi (Kx(t))-TA^\top (Ax(t)-b)\) holds true in a weak sense [18]. However, due to the continuity of the right-hand side, we can even conclude \(x\in C^1([0,1],{\mathbb {R}}^n)\) [18, Chapter I]. Finally, the theorem follows from the continuity of J along this minimizing sequence. \(\square \)

In the next theorem, a first-order necessary condition for the optimal stopping time is derived.

Theorem 2

(First-order necessary condition for optimal stopping time) Let \(s\ge 1\). Then for each stationary point \(({\overline{T}},({\overline{K}}_k,{\overline{\Phi }}_k)_{k=1}^{N_K})\) of J with associated state \({\overline{x}}\) such that (6), (7) and (11) are valid the equation

$$\begin{aligned} \int _0^1\langle {\overline{p}}(t),\dot{{\overline{x}}}(t)\rangle \,\mathrm {d}t=0 \end{aligned}$$
(12)

holds true. Here, \({\overline{p}}\in C^1([0,1],{\mathbb {R}}^n)\) denotes the adjoint state of \({\overline{x}}\), which is given as the solution to the ordinary differential equation

$$\begin{aligned} \dot{{\overline{p}}}(t)=\sum _{k=1}^{N_K}{\overline{T}}\,{\overline{K}}_k^\top D{\overline{\Phi }}_k({\overline{K}}_k{\overline{x}}(t)){\overline{K}}_k{\overline{p}}(t)+{\overline{T}}A^\top A{\overline{p}}(t) \end{aligned}$$
(13)

with terminal condition

$$\begin{aligned} {\overline{p}}(1)=x_g-{\overline{x}}(1). \end{aligned}$$
(14)

Proof

Again, without loss of generality, we restrict to the case \(N_K=1\) and omit the subscript. Let \({\overline{z}}=({\overline{x}},{\overline{T}},{\overline{K}},{\overline{\Phi }})\in {\mathcal {Z}}:=H^1((0,1))\times [0,T_\mathrm {max}]\times {\mathbb {R}}^{m\times n}\times {\mathcal {V}}^s\) be a stationary point of J, which exists due to Theorem 1. The constraints (11), (6) and (7) can be written as

$$\begin{aligned} G(x,T,K,\Phi )\in {\mathcal {C}}:=\{0\}\times \{0\} \times {\mathbb {R}}_0^-\times {\mathbb {R}}_0^-\times \{0\}, \end{aligned}$$

where \(G:{\mathcal {Z}}\rightarrow {\mathcal {P}}:=L^2((0,1),{\mathbb {R}}^n)\times {\mathbb {R}}^n\times {\mathbb {R}}\times {\mathbb {R}}\times {\mathbb {R}}^m\) (note that \(L^2((0,1),{\mathbb {R}}^n)^*\cong L^2((0,1),{\mathbb {R}}^n)\)) is given by

$$\begin{aligned} G(x,T,K,\Phi )= \begin{pmatrix} {\dot{x}}+TK^\top \Phi (Kx)+TA^\top (Ax-b)\\ x(0)-x_0\\ \alpha (K)-1\\ \beta (\Phi )-1\\ K{\mathbf {1}} \end{pmatrix}. \end{aligned}$$

For multipliers in the space \({\mathcal {P}}\), we consider the associated Lagrange functional \(L:{\mathcal {Z}}\times {\mathcal {P}}\rightarrow {\mathbb {R}}\) to minimize J incorporating the aforementioned constraints, i.e., for \(z=(x,T,K,\Phi )\in {\mathcal {Z}}\) and \(p\in {\mathcal {P}}\) we have

$$\begin{aligned} L(z,p)&=J(T,K,\Phi )+\int _0^1\langle p_1,G_1(z)\rangle \,\mathrm {d}t\nonumber \\&+\sum _{i=2}^5\langle p_i,G_i(z)\rangle . \end{aligned}$$
(15)

Following [22, 43], the Lagrange multiplier \({\overline{p}}\) exists if J is Fréchet differentiable at \({\overline{z}}\), G is continuously Fréchet differentiable at \({\overline{z}}\) and \({\overline{z}}\) is regular, i.e.

$$\begin{aligned} 0\in {\text {int}}\left\{ DG({\overline{z}})({\mathcal {Z}} -{\overline{z}})+G({\overline{z}})-{\mathcal {C}}\right\} . \end{aligned}$$
(16)

The (continuous) Fréchet differentiability of J and G at \({\overline{z}}\) can be proven in a straightforward manner. To show (16), we first prove the surjectivity of \(DG_1({\overline{z}})\). For any \(z=(x,T,K,\Phi )\in {\mathcal {Z}}\), we have

$$\begin{aligned} DG_1({\overline{z}})(z)&={\dot{x}}+{\overline{T}}\,{\overline{K}}^\top D{\overline{\Phi }}({\overline{K}}{\overline{x}}){\overline{K}}x+{\overline{T}}A^\top Ax\\&+T{\overline{K}}^\top {\overline{\Phi }}({\overline{K}}{\overline{x}}) +TA^\top (A{\overline{x}}-b)\\&+{\overline{T}}K^\top {\overline{\Phi }}({\overline{K}}{\overline{x}})+{\overline{T}}\,{\overline{K}}^\top D{\overline{\Phi }}({\overline{K}}{\overline{x}})K{\overline{x}}\\&+{\overline{T}}\,{\overline{K}}^\top \Phi ({\overline{K}}{\overline{x}}). \end{aligned}$$

The surjectivity of \(DG_1({\overline{z}})\) with initial condition given by \({\overline{x}}(0)=x_0\) follows from the linear growth in x, which implies that the maximum domain of existence coincides with \({\mathbb {R}}\). This solution is in general only a solution in the sense of Carathéodory [18, 40]. Since \(\alpha \) and \(\beta \) have non-vanishing derivatives, the validity of (16) and thus the existence of the Lagrange multiplier follow.

The first-order optimality conditions with test functions \(x\in H^1((0,1),{\mathbb {R}}^n)\), \(K\in {\mathbb {R}}^{m\times n}\), \(\Phi \in {\mathcal {V}}^s\) and \(p\in {\mathcal {P}}\) read as

$$\begin{aligned}&D_{x}L({\overline{x}},{\overline{T}},{\overline{K}},{\overline{\Phi }}, {\overline{p}})(x)\nonumber \\&\quad =\langle {\overline{x}}(1)-x_g,x(1)\rangle +\langle {\overline{p}}_2,x(0)\rangle \nonumber \\&\qquad +\int _0^1\langle {\overline{p}}_1,{\dot{x}}+{\overline{T}}\,{\overline{K}}^\top D{\overline{\Phi }}({\overline{K}}{\overline{x}}){\overline{K}}x+{\overline{T}}A^\top Ax\rangle \,\mathrm {d}t=0,\end{aligned}$$
(17)
$$\begin{aligned}&\frac{\,\mathrm {d}}{\,\mathrm {d}T}L({\overline{x}},{\overline{T}},{\overline{K}}, {\overline{\Phi }},{\overline{p}})\nonumber \\&\quad =\int _0^1\langle {\overline{p}}_1, {\overline{K}}^\top {\overline{\Phi }}({\overline{K}}{\overline{x}}) +A^\top (A{\overline{x}}-b)\rangle \,\mathrm {d}t=0, \end{aligned}$$
(18)
$$\begin{aligned}&D_{K}L({\overline{x}},{\overline{T}},{\overline{K}}, {\overline{\Phi }},{\overline{p}})(K)\nonumber \\&\quad =\int _0^1\langle {\overline{p}}_1,{\overline{T}}\, K^\top {\overline{\Phi }}({\overline{K}}{\overline{x}}) +{\overline{T}}\,{\overline{K}}^\top D{\overline{\Phi }} ({\overline{K}}{\overline{x}})K{\overline{x}}\rangle \,\mathrm {d}t\nonumber \\&\qquad +\langle {\overline{p}}_3,D\alpha ({\overline{K}})(K)\rangle +\langle {\overline{p}}_5,K{\mathbf {1}}\rangle =0,\nonumber \\&D_{\Phi }L({\overline{x}},{\overline{T}},{\overline{K}}, {\overline{\Phi }},{\overline{p}})(\Phi )\nonumber \\&\quad =\int _0^1\langle {\overline{p}}_1,{\overline{T}}\, {\overline{K}}^\top \Phi ({\overline{K}}{\overline{x}})\rangle \,\mathrm {d}t+\langle {\overline{p}}_4,D\beta ({\overline{\Phi }})(\Phi )\rangle =0,\nonumber \\&D_{p}L({\overline{x}},{\overline{T}},{\overline{K}},{\overline{\Phi }}, {\overline{p}})(p)\nonumber \\&\quad =\int _0^1\langle p_1,G_1({\overline{z}})\rangle \,\mathrm {d}t+\sum _{i=2}^5\langle p_i,G_i({\overline{z}})\rangle =0. \end{aligned}$$
(19)

The fundamental lemma of the calculus of variations, in combination with (17) and (19), yields for \(t\in (0,1)\)

$$\begin{aligned} \dot{{\overline{x}}}(t)&=-{\overline{T}}\,{\overline{K}}^\top {\overline{\Phi }} ({\overline{K}}{\overline{x}}(t)) -{\overline{T}}A^\top (A{\overline{x}}(t)-b),\nonumber \\ {\overline{x}}(0)&=x_0, \end{aligned}$$
(20)
$$\begin{aligned} \dot{{\overline{p}}}_1(t)&={\overline{T}}\,{\overline{K}}^\top D{\overline{\Phi }}({\overline{K}}{\overline{x}}(t)){\overline{K}} {\overline{p}}_1(t)+{\overline{T}}A^\top A{\overline{p}}_1(t),\nonumber \\ {\overline{p}}_1(1)&=x_g-{\overline{x}}(1) \end{aligned}$$
(21)

in a distributional sense. Since the right-hand sides of (20) and (21) are continuous if \(s\ge 1\), we can conclude \({\overline{x}},{\overline{p}}\in C^1([0,1],{\mathbb {R}}^n)\) [18, 40] and hence (21) holds in the classical sense. Finally, (18) and (19) imply

$$\begin{aligned} \frac{\,\mathrm {d}}{\,\mathrm {d}T}L({\overline{x}},{\overline{T}},{\overline{K}},{\overline{\Phi }}, {\overline{p}})=-\frac{1}{{\overline{T}}}\int _0^1\langle {\overline{p}}_1, \dot{{\overline{x}}}\rangle \,\mathrm {d}t=0, \end{aligned}$$
(22)

which proves (12) if \({\overline{T}}>0\) (the case \({\overline{T}}=0\) is trivial). \(\square \)
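The optimality condition (12) can be verified numerically in the linear special case without regularizer, where the state and adjoint equations reduce to \(\dot{x}=-TA^\top (Ax-b)\) and \(\dot{p}=TA^\top Ap\) with \(p(1)=x_g-x(1)\). The following sketch integrates both with explicit Euler and returns J(T) together with \(\frac{\mathrm {d}}{\mathrm {d}T}J=-\int _0^1\langle p,f(x)\rangle \,\mathrm {d}t\), which by (22) equals \(-\frac{1}{T}\int _0^1\langle p,{\dot{x}}\rangle \,\mathrm {d}t\):

```python
import numpy as np

def stopping_time_gradient(T, A, b, x0, xg, steps=2000):
    """Forward Euler for the state x' = -T A^T (A x - b), backward Euler for
    the adjoint p' = T A^T A p with terminal value p(1) = xg - x(1), cf. (14);
    returns J(T) and a quadrature of dJ/dT = -int_0^1 <p, f(x)> dt."""
    h = 1.0 / steps
    M, Ab = A.T @ A, A.T @ b
    xs = [x0.copy()]
    x = x0.copy()
    for _ in range(steps):
        x = x + h * T * (Ab - M @ x)
        xs.append(x)
    J = 0.5 * np.sum((x - xg) ** 2)
    p = xg - x                       # terminal condition (14)
    grad = 0.0
    for s in range(steps, 0, -1):
        grad -= h * (p @ (Ab - M @ xs[s]))
        p = p - h * T * (M @ p)      # step the adjoint backward in time
    return J, grad
```

In this example the exact flow is \(x(t)=b+(x_0-b)e^{-Tt}\), so the gradient vanishes at a finite stopping time whenever the ground truth lies strictly between \(x_0\) and the equilibrium b, illustrating the finite optimal stopping time observed in the experiments.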

The preceding theorem can easily be adapted for fixed kernels and activation functions leading to a reduced optimization problem with respect to the stopping time only:

Corollary 1

(First-order necessary condition for subproblem) Let \({\overline{K}}_k\in {\mathbb {R}}^{m\times n}\) and \({\overline{\Phi }}_k\in {\mathcal {V}}^s\) for \(k=1,\ldots ,N_K\) be fixed with \(s\ge 1\) satisfying (6) and (7). We denote by \({\overline{p}}\) the adjoint state defined by (13) and (14). Then, for each stationary point \({\overline{T}}\) of the subproblem

$$\begin{aligned} T\mapsto J(T,({\overline{K}}_k,{\overline{\Phi }}_k)_{k=1}^{N_K}), \end{aligned}$$
(23)

in which the associated state \({\overline{x}}\) satisfies (11), the first-order optimality condition (12) holds true.

Remark 2

Under the assumptions of Corollary 1, a rescaling argument reveals the identities for \(t\in (0,1)\)

$$\begin{aligned} \frac{\,\mathrm {d}}{\,\mathrm {d}T}{\overline{x}}(t)&=tf({\overline{x}}(t), ({\overline{K}}_k,{\overline{\Phi }}_k)_{k=1}^{N_K}),\\ \frac{\,\mathrm {d}}{\,\mathrm {d}T}{\overline{p}}(t)&=\sum _{k=1}^{N_K}t{\overline{K}}_k^\top D{\overline{\Phi }}_k({\overline{K}}_k{\overline{x}}(t)) {\overline{K}}_k{\overline{p}}(t)+tA^\top A{\overline{p}}(t). \end{aligned}$$

We conclude this section with a second-order sufficient condition for the partial optimization problem (23):

Theorem 3

(Second-order sufficient conditions for subproblem) Let \(s\ge 2\). Under the assumptions of Corollary 1, \({\overline{T}}\in (0,T_\mathrm {max})\) with associated state \({\overline{x}}\) is a strict local minimum of \(T\mapsto J(T,({\overline{K}}_k,{\overline{\Phi }}_k)_{k=1}^{N_K})\) if a constant \(C>0\) exists such that

$$\begin{aligned}&\int _0^1\sum _{k=1}^{N_K}\langle {\overline{p}}, {\overline{T}}\,{\overline{K}}_k^\top D^2{\overline{\Phi }}_k({\overline{K}}_k{\overline{x}})({\overline{K}}_kx,{\overline{K}}_kx)\nonumber \\&\quad \qquad +\,2{\overline{K}}_k^\top D{\overline{\Phi }}_k({\overline{K}}_k{\overline{x}}){\overline{K}}_kx+2A^\top Ax\rangle \,\mathrm {d}t+\langle x(1),x(1)\rangle \nonumber \\&\qquad \ge C(1+\Vert x\Vert _{H^1((0,1),{\mathbb {R}}^n)}^2) \end{aligned}$$
(24)

for all \(x\in C^1((0,1),{\mathbb {R}}^n)\) satisfying \(x(0)=0\) and

$$\begin{aligned} {\dot{x}}&=\sum _{k=1}^{N_K}\left( -{\overline{T}}\,{\overline{K}}_k^\top D{\overline{\Phi }}_k({\overline{K}}_k{\overline{x}}){\overline{K}}_kx-{\overline{K}}_k^\top {\overline{\Phi }}_k({\overline{K}}_k{\overline{x}})\right) \nonumber \\&-{\overline{T}}A^\top Ax-A^\top (A{\overline{x}}-b). \end{aligned}$$
(25)

Proof

As before, we restrict to the case \(N_K=1\) and omit subscripts. Let us denote by L the version of the Lagrange functional (15) with fixed kernels and fixed activation functions to minimize J subject to the side conditions \(G_1(x,T)=G_2(x,T)=0\) as specified in Corollary 1. Let \({\overline{z}}=({\overline{x}},{\overline{T}})\in {\mathcal {Z}}:=H^1((0,1),{\mathbb {R}}^n)\times (0,T_\mathrm {max})\) be a local minimum of J. Furthermore, we consider arbitrary test functions \(z_1=(x_1,T_1), z_2=(x_2,T_2)\in {\mathcal {Z}}\), where we endow the Banach space \({\mathcal {Z}}\) with the norm \(\Vert z\Vert _{\mathcal {Z}}^2:=\Vert x\Vert _{H^1((0,1),{\mathbb {R}}^n)}^2+|T|^2\) for \(z=(x,T)\in {\mathcal {Z}}\). Then,

$$\begin{aligned} D^2J({\overline{T}})(z_1,z_2)&=\langle x_1(1),x_2(1)\rangle ,\\ D^2G_1({\overline{z}})(z_1,z_2)&={\overline{T}}\,{\overline{K}}^\top D^2{\overline{\Phi }}({\overline{K}}{\overline{x}})({\overline{K}}x_1,{\overline{K}}x_2)\\&\quad +T_2\,{\overline{K}}^\top D{\overline{\Phi }}({\overline{K}}{\overline{x}}){\overline{K}}x_1+T_1\,{\overline{K}}^\top D{\overline{\Phi }}({\overline{K}}{\overline{x}}){\overline{K}}x_2\\&\quad +T_2A^\top Ax_1+T_1A^\top Ax_2,\\ D^2G_2({\overline{x}})(x_1,x_2)&=0. \end{aligned}$$

Following [43, Theorem 43.D], J has a strict local minimum at \({\overline{z}}\) if the first-order optimality condition discussed in Corollary 1 holds true and a constant \(C>0\) exists such that

$$\begin{aligned}&D^2J({\overline{T}})(z,z)+\int _0^1\langle {\overline{p}}_1,D^2G_1({\overline{z}})(z,z)\rangle \,\mathrm {d}t\nonumber \\&\quad = \langle x(1),x(1)\rangle +\int _0^1\langle {\overline{p}}_1, {\overline{T}}\,{\overline{K}}^\top D^2{\overline{\Phi }}({\overline{K}}{\overline{x}})({\overline{K}}x,{\overline{K}}x)\nonumber \\&\qquad +2T\,{\overline{K}}^\top D{\overline{\Phi }}({\overline{K}}{\overline{x}}){\overline{K}}x+2TA^\top Ax \rangle \,\mathrm {d}t \ge C\Vert z\Vert _{\mathcal {Z}}^2 \end{aligned}$$
(26)

for all \(z=(x,T)\in {\mathcal {Z}}\) satisfying

$$\begin{aligned} DG_1({\overline{z}})(z)&={\dot{x}}+{\overline{T}}\,{\overline{K}}^\top D{\overline{\Phi }}({\overline{K}}{\overline{x}}) {\overline{K}}x+T\,{\overline{K}}^\top {\overline{\Phi }}({\overline{K}}{\overline{x}})\\&+{\overline{T}}A^\top Ax+T A^\top (A{\overline{x}}-b)=0 \end{aligned}$$

and \(DG_2({\overline{z}})(z)=x(0)=0\). The theorem follows from the homogeneity of order 2 in T in (26), which results in the modified condition (25). \(\square \)

Remark 3

All aforementioned statements remain valid when the function space \({\mathcal {V}}^s\) and the norm of the activation functions are replaced by suitable Sobolev spaces and Sobolev norms, respectively. Moreover, all statements require only minor modifications if, instead of the box constraints (6), nonnegative, coercive, and differentiable functions of the norms of T, \(K_k\), and \(\Phi _k\) are added to the cost functional J.

3 Time Discretization

The optimal control problem with state equation originating from the gradient flow for the energy functional \({\mathcal {E}}\) was analyzed in Sect. 2. In this section, we prove that static variational networks can be derived from a time discretization of the state equation incorporating Euler’s or Heun’s method [2, 6]. To illustrate the concepts, we discuss the optimal control problem in \({\mathbb {R}}^2\) using fixed kernels and activation functions in Sect. 3.1. Finally, a literature overview of alternative ways to derive variational networks as well as relations to other approaches are presented in Sect. 3.2.

Let \(S\ge 2\) be a fixed depth. For a stopping time \(T\in {\mathbb {R}}\) we define the node points \(t_s=\frac{s}{S}\) for \(s=0,\ldots ,S\). Consequently, Euler’s explicit method for the transformed state equation (11) with fixed kernels and fixed activation functions \({\overline{\Theta }}=(({\overline{K}}_k,{\overline{\Phi }}_k)_{k=1}^{N_K})\) reads as

$$\begin{aligned} x_{s+1}=x_s+\frac{T}{S}f(x_s,{\overline{\Theta }}) \end{aligned}$$
(27)

for \(s=0,\ldots ,S-1\) with \(x_0=x(0)\). The discretized ordinary differential equation (27) defines the evolution of the static variational network. We stress that this time discretization is closely related to residual neural networks with constant parameters in each layer. Here, \(x_s\) is an approximation of \(x(t_s)\); the associated global error \(x(t_s)-x_s\) is bounded from above by

$$\begin{aligned} \max _{s=0,\ldots ,S}\Vert x(t_s)-x_s\Vert _2\le \frac{CT}{S} \end{aligned}$$

with \(C:=\frac{\left( e^{L_f}-1\right) \Vert f''\Vert _{C^0}}{2L_f}\), where \(L_f\) denotes the Lipschitz constant of f [2, Theorem 6.3]. In general, this global error bound has a tendency to overestimate the actual global error. Improved error bounds can either be derived by performing a more refined local error analysis, which solely results in a better constant C, or by using higher-order Runge–Kutta methods. One prominent example of an explicit Runge–Kutta scheme with a quadratic order of convergence is Heun’s method [6], which is defined as

$$\begin{aligned} x_{s+1}=x_s+\frac{T}{2S}\left( f(x_s,{\overline{\Theta }})+f\left( x_s +\frac{T}{S}f(x_s,{\overline{\Theta }}),{\overline{\Theta }}\right) \right) . \end{aligned}$$
(28)

We abbreviate the right-hand side of (13) as follows:

$$\begin{aligned} g(x,p,(K_k,\Phi _k)_{k=1}^{N_K}) =\sum _{k=1}^{N_K}K_k^\top D\Phi _k(K_kx)K_kp+A^\top Ap. \end{aligned}$$

The corresponding update schemes for the adjoint states are given by

$$\begin{aligned} p_s=p_{s+1}-\frac{T}{S}g(x_{s+1},p_{s+1},{\overline{\Theta }}) \end{aligned}$$
(29)

in the case of Euler’s method and

$$\begin{aligned} p_s&=p_{s+1}-\frac{T}{2S}\bigg (g(x_{s+1},p_{s+1},{\overline{\Theta }})\nonumber \\&+g\bigg (x_s,p_{s+1}-\frac{T}{S}g(x_{s+1},p_{s+1},{\overline{\Theta }}), {\overline{\Theta }}\bigg )\bigg ) \end{aligned}$$
(30)

in the case of Heun’s method. We remark that in general implicit Runge–Kutta schemes are not efficient due to the complex structure of the Field-of-Experts regularizer.
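The backward sweep implementing the adjoint update (29) can be sketched as follows (NumPy; the function name is ours, and `g` stands for the abbreviation defined above with kernels and activation functions held fixed):

```python
import numpy as np

def adjoint_sweep_euler(xs, p_T, g, T):
    # Backward sweep (29) for the adjoint state:
    # p_s = p_{s+1} - (T/S) g(x_{s+1}, p_{s+1}),
    # starting from the terminal value p_S = p_T and marching back to p_0.
    S = len(xs) - 1
    ps = [None] * (S + 1)
    ps[S] = np.asarray(p_T, dtype=float)
    for s in range(S - 1, -1, -1):
        ps[s] = ps[s + 1] - (T / S) * g(xs[s + 1], ps[s + 1])
    return ps
```

The sweep consumes the forward states \(x_s\) already computed by (27), so one forward pass and one backward pass suffice to evaluate both trajectories.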

In all cases, we have to choose the step size \(\frac{T}{S}\) such that the explicit Euler scheme is stable [6], i.e.,

$$\begin{aligned} \max _{i=1,\ldots ,n}\left| 1+\frac{T}{S}\lambda _i\right| \le 1 \end{aligned}$$
(31)

for all \(s=0,\ldots ,S\), where \(\lambda _i\) denotes the ith eigenvalue of the Jacobian of either f or g. Note that this condition already implies the stability of Heun’s method. Thus, in the numerical experiments, we need to ensure a constant ratio of the stopping time T and the depth S to satisfy (31).
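Given the eigenvalues of the Jacobian, condition (31) is a one-line check (a NumPy sketch; the function name is ours):

```python
import numpy as np

def euler_stable(eigvals, T, S, tol=1e-12):
    # Stability condition (31) for the explicit Euler scheme:
    # max_i |1 + (T/S) * lambda_i| <= 1
    factors = np.abs(1.0 + (T / S) * np.asarray(eigvals, dtype=complex))
    return bool(np.max(factors) <= 1.0 + tol)
```

For real eigenvalues \(\lambda _i\in [-L,0]\), this reduces to the familiar step-size cap \(\frac{T}{S}\le \frac{2}{L}\), which is exactly why a constant ratio of T and S keeps the scheme stable as both grow.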

3.1 Optimal Control Problem in \({\mathbb {R}}^2\)

In this subsection, we apply the first- and second-order criteria for the partial optimal control problem (see Corollary 1) to the simple, yet illuminative example in \({\mathbb {R}}^2\) with a single kernel, i.e., \(l=m=n=2\) and \(N_K=1\). More general applications of the early stopping criterion to image restoration problems are discussed in Sect. 4. Below, we consider a regularized data fitting problem composed of a squared \(L^2\)-data term and a nonlinear regularizer incorporating a forward finite difference matrix operator with respect to the x-direction. In detail, we choose \({\overline{\phi }}(x)=\frac{x}{\sqrt{x^2 + 1}}\) and

$$\begin{aligned} x_0&=\begin{pmatrix} 1 \\ 2 \end{pmatrix},&x_g&=\begin{pmatrix} \frac{3}{2}\\ \frac{1}{2} \end{pmatrix},&b&=\begin{pmatrix} 1\\ \frac{1}{2} \end{pmatrix},\\ A&=\begin{pmatrix} 1 & 0\\ 0 & 1 \end{pmatrix},&{\overline{K}}&=\begin{pmatrix} 1 & -1 \\ 0 & 0 \end{pmatrix}. \end{aligned}$$

To compute the solutions of the state (11) and the adjoint (13) differential equation, we use Euler’s explicit method with 100 equidistant steps. All integrals are approximated using a Gaussian quadrature of order 21. Furthermore, we optimize the stopping time \({\overline{T}}\) in the discrete set \({\mathcal {T}}=0.05\cdot {\mathbb {N}}\cap [\frac{1}{10},3]\).
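The toy flow can be reproduced with a short NumPy sketch (helper names are ours). We assume the right-hand side f of the state equation (11) has the form \(-{\overline{K}}^\top {\overline{\Phi }}({\overline{K}}x)-A^\top (Ax-b)\), consistent with the linearization (25), with \({\overline{\Phi }}\) applied componentwise:

```python
import numpy as np

# Toy problem data from the text (single kernel, n = 2)
x0 = np.array([1.0, 2.0])
b = np.array([1.0, 0.5])
A = np.eye(2)
K = np.array([[1.0, -1.0], [0.0, 0.0]])

def phi(y):
    # Fixed activation from the text: phi(y) = y / sqrt(y^2 + 1)
    return y / np.sqrt(y ** 2 + 1.0)

def f(x):
    # Assumed right-hand side of the gradient flow:
    # f(x) = -K^T phi(Kx) - A^T (Ax - b)
    return -K.T @ phi(K @ x) - A.T @ (A @ x - b)

def trajectory(T, steps=100):
    # Explicit Euler discretization (27) of the rescaled state equation
    # on (0, 1) with 100 equidistant steps, as in the text
    xs = [x0]
    for _ in range(steps):
        xs.append(xs[-1] + (T / steps) * f(xs[-1]))
    return np.array(xs)
```

Since f is the negative gradient of the energy \({\mathcal {E}}(x)=\sum _i(\sqrt{(Kx)_i^2+1}-1)+\frac{1}{2}\Vert Ax-b\Vert _2^2\), the energy decreases monotonically along each trajectory, and larger stopping times T drive the end point further toward the stable node \(x_\infty \).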

Figure 4 (left) depicts all trajectories for \(T\in {\mathcal {T}}\) (black curves) of the state equation emanating from \(x_0\) with sink/stable node \(x_\infty \). The end points of the optimal trajectory and the ground truth state are marked by red points. Moreover, the gray line indicates the trajectory of \({\overline{x}}_T+{\overline{p}}_T\) (the subscript denotes the solutions calculated with stopping time T) associated with the optimal stopping time. The dependency of the energy (red curve)

$$\begin{aligned} T\mapsto J(T,{\overline{K}},{\overline{\Phi }}) \end{aligned}$$

and of the first-order condition (12) (blue curve)

$$\begin{aligned} T\mapsto -\frac{1}{T}\int _0^1\langle p_T,{\dot{x}}_T\rangle \,\mathrm {d}t \end{aligned}$$

on the stopping time T is visualized in the right plot in Figure 4. Note that the black vertical line indicating the optimal stopping time \({\overline{T}}\) given by (12) crosses the energy plot at the minimum point. The function value of the second-order condition (24) in Theorem 3 is 0.071, which confirms that \({\overline{T}}\) is indeed a strict local minimum of the energy.

Fig. 4

Left: Trajectories of the state equation for T varying in \({\mathcal {T}}\) (black curves), initial value \(x_0\) (blue point), sink/stable node \(x_\infty \) (blue point), ground truth state \(x_g\) and end point of optimal trajectory (red points). Right: function plots of the energy (red plot) and the first-order condition (blue plot) (Color figure online)

3.2 Alternative Derivations of Variational Networks

We conclude this section with a brief review of alternative derivations of the defining equation (27) of variational networks. Inspired by the classical nonlinear anisotropic diffusion model of Perona and Malik [31], Chen and Pock [11] derive variational networks as discretized nonlinear reaction–diffusion models of the form \(\frac{x_{s+1}-x_s}{h}=-{\mathcal {R}}[x_s]-{\mathcal {D}}[x_s]\) with an a priori fixed number of iterations, where \({\mathcal {R}}\) and \({\mathcal {D}}\) denote the reaction and diffusion terms, respectively, which coincide with the first and second expression in (11). By exploiting proximal mappings, this scheme can also be used for non-differentiable data terms \({\mathcal {D}}\). In the same spirit, Kobler et al. [24] related variational networks to incremental proximal and incremental gradient methods. Following [19], variational networks result from a Landweber iteration [25] applied to the energy functional (2) with the Field-of-Experts regularizer. Structural similarities of variational networks and residual neural networks [21] are analyzed in [24]. In particular, residual neural networks (and thus also variational networks) are known to be less prone to the degradation problem, in which the training/test error increases as the model complexity grows. Note that most of these approaches examine time-varying kernels and activation functions, and most of the aforementioned papers observe the benefit of early stopping.

4 Numerical Results for Image Restoration

In this section, we examine the benefit of early stopping for image denoising and image deblurring using static variational networks. In particular, we show that the first-order optimality condition yields the optimal stopping time. We do not verify the second-order sufficient condition of Theorem 3 since in all experiments the first-order condition already indicates an energy-minimizing solution, so this verification is not required.

4.1 Image Reconstruction Problems

In the case of image denoising, we perturb a ground truth image \(x_g\in {\mathbb {R}}^n\) by additive Gaussian noise

$$\begin{aligned} n\sim {\mathcal {N}}(0,\sigma ^2\mathrm {Id}) \end{aligned}$$

for a certain noise level \(\sigma \) resulting in the noisy input image \(g =x_g+n\). Consequently, the linear operator is given by the identity matrix and the corrupted image as well as the initial condition coincide with the noisy image, i.e., \(A=\mathrm {Id}\) and \(b=x_0=g\).

For image deblurring, we consider an input image \(g=x_0=Ax_g+n\in {\mathbb {R}}^n\) given by a Gaussian blur of the ground truth image \(x_g\in {\mathbb {R}}^n\) corrupted by Gaussian noise n with \(\sigma =0.01\). Here, \(A\in {\mathbb {R}}^{n\times n}\) denotes the matrix representation of the normalized \(9\times 9\) convolution filter with blur strength \(\tau >0\), obtained by sampling the function

$$\begin{aligned} (x,y)\mapsto \frac{1}{\sqrt{2\pi \tau ^2}}\exp \left( -\frac{x^2+y^2}{2\tau ^2}\right) . \end{aligned}$$
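The blur filter can be assembled as follows (a NumPy sketch; the function name is ours). Note that the constant prefactor of the Gaussian cancels under the normalization, so only the exponential matters:

```python
import numpy as np

def gaussian_kernel(tau, size=9):
    # Sample the Gaussian on a size x size grid centered at the origin and
    # normalize so the filter weights sum to 1 (the prefactor cancels here).
    r = np.arange(size) - (size - 1) / 2
    xx, yy = np.meshgrid(r, r)
    k = np.exp(-(xx ** 2 + yy ** 2) / (2 * tau ** 2))
    return k / k.sum()
```

The resulting filter is symmetric with its maximum at the center pixel; convolving with it (with suitable boundary handling) yields the matrix operator A.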

4.2 Numerical Optimization

For all image reconstruction tasks, we use the BSDS 500 data set [29] with grayscale images in \([0,1]^{341\times 421}\). We train all models on 200 train and 200 test images from the BSDS 500 data set and evaluate the performance on 68 validation images as specified by [36].

In all experiments, the activation functions (5) are parametrized using \(N_w=63\) quadratic B-spline basis functions \(\psi _j\in C^1({\mathbb {R}})\) with equidistant centers in the interval \([-1,1]\). Let \(\xi \in {\mathbb {R}}^{n_1\times n_2}\) be the two-dimensional image of a corresponding data vector \(u\in {\mathbb {R}}^n\), \(n=n_1\cdot n_2\). Then, the convolution \(\kappa *\xi \) of the image \(\xi \) with a filter \(\kappa \) is modeled by applying the corresponding kernel matrix \(K\in {\mathbb {R}}^{m\times n}\) to the data vector u. We only use kernels \(\kappa \) of size \(7\times 7\). Motivated by the relation \(\Vert K\Vert _F=m\Vert \kappa \Vert _F\), we choose \(\alpha (K)=\frac{1}{m^2}\Vert K\Vert _F^2\). Additionally, we use \(\beta (\Phi )=\beta (w)=\Vert w\Vert _2^2\) for a weight vector w associated with \(\Phi \). Since all numerical experiments yield a finite optimal stopping time T, we omit the constraint \(T\le T_\mathrm {max}\).

For a given training set consisting of pairs of corrupted images \(x_0^i\in {\mathbb {R}}^n\) and corresponding ground truth images \(x_g^i\in {\mathbb {R}}^n\), we denote the associated index set by \({\mathcal {I}}\). To train the model, we consider the discrete energy functional

$$\begin{aligned} J_{\mathcal {B}}(T,(K_k,w_k)_{k=1}^{N_K}):=\frac{1}{|{\mathcal {B}}|} \sum _{i\in {\mathcal {B}}}\frac{1}{2}\Vert x_S^i-x_g^i\Vert _2^2 \end{aligned}$$
(32)

for a subset \({\mathcal {B}}\subset {\mathcal {I}}\), where \(x_S^i\) denotes the terminal value of the Euler/Heun iteration scheme for the corrupted image \(x_0^i\). In all numerical experiments, we use the iPALM algorithm [32] described in Algorithm 1 to optimize all parameters with respect to a randomly selected batch \({\mathcal {B}}\). Each batch consists of 64 image patches of size \(96\times 96\) that are uniformly drawn from the training data set.

For an optimization parameter q representing either T, \(K_k\) or \(w_k\), we use in the lth iteration step the over-relaxation

$$\begin{aligned} {\widetilde{q}}^{[l]}=q^{[l]}+\frac{1}{\sqrt{2}}(q^{[l]}-q^{[l-1]}). \end{aligned}$$

We denote by \(L_q\) the Lipschitz constant that is determined by backtracking and by \({\text {proj}}_{\mathcal {Q}}\) the orthogonal projection onto the corresponding set denoted by \({\mathcal {Q}}\).
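A single parameter update of this form can be sketched as follows (a simplified NumPy illustration with hypothetical names; the backtracking determination of \(L_q\) from the full iPALM scheme of [32] is omitted):

```python
import numpy as np

def ipalm_step(q_curr, q_prev, grad, L_q, proj):
    # Over-relaxation with inertia 1/sqrt(2), followed by a projected
    # gradient step with step size 1/L_q (iPALM-type update; a sketch,
    # not the full backtracking scheme of [32]).
    q_tilde = q_curr + (1.0 / np.sqrt(2.0)) * (q_curr - q_prev)
    return proj(q_tilde - grad(q_tilde) / L_q)
```

Iterating this step with the projection onto the relevant constraint set (for instance, an \(\ell ^2\)-ball for the weights w) drives the parameter toward a constrained stationary point.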

Algorithm 1: iPALM update scheme (see [32])

Here, the constraint sets \({\mathcal {K}}\) and \({\mathcal {W}}\) are given by

$$\begin{aligned} {\mathcal {K}}&=\left\{ K\in {\mathbb {R}}^{m\times n}:\alpha (K)\le 1,K{\mathbf {1}}=0\right\} ,\\ {\mathcal {W}}&=\left\{ w\in {\mathbb {R}}^{N_w}:\beta (w)\le 1\right\} . \end{aligned}$$
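With \(\alpha (K)=\frac{1}{m^2}\Vert K\Vert _F^2\) and \(\beta (w)=\Vert w\Vert _2^2\) as above, the projections onto \({\mathcal {K}}\) and \({\mathcal {W}}\) can be sketched as follows (NumPy; function names are ours). Since the zero-mean constraint \(K{\mathbf {1}}=0\) defines a linear subspace through the origin, projecting onto it first and then rescaling into the Frobenius ball of radius m yields the projection onto the intersection:

```python
import numpy as np

def proj_K(K, m):
    # Project onto {K : ||K||_F^2 / m^2 <= 1, K 1 = 0}: remove the row
    # means (zero-mean constraint), then rescale into the Frobenius ball
    # of radius m if necessary.
    K = K - K.mean(axis=1, keepdims=True)
    norm = np.linalg.norm(K)
    if norm > m:
        K = K * (m / norm)
    return K

def proj_W(w):
    # Project onto {w : ||w||_2^2 <= 1}
    norm = np.linalg.norm(w)
    return w / norm if norm > 1.0 else w
```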

Each component of the initial kernels \(K_k\) in the case of image denoising is independently drawn from a Gaussian random variable with mean 0 and variance 1 such that \(K_k\in {\mathcal {K}}\). The learned optimal kernels of the denoising task are incorporated for the initialization of the kernels for deblurring. The weights \(w_k\) of the activation functions are initialized such that \(\phi _k(y)\approx 0.1y\) around 0 for both reconstruction tasks.

4.3 Results

In the first numerical experiment, we train models for denoising and deblurring with \(N_K=48\) kernels, a depth \(S=10\) and \(L=5000\) training steps. Afterward, we use the calculated parameters \((K_k,\Phi _k)_{k=1}^{N_K}\) and T as an initialization and train models for various depths \(S=2,4,\ldots ,50\) with \(L=500\). Figure 5 depicts the average PSNR value \(\overline{\mathrm {PSNR}(x_S^i,x_g^i)}_{i\in \widehat{{\mathcal {I}}}}\), with \(\widehat{{\mathcal {I}}}\) denoting the index set of the test images, and the learned stopping time T as functions of the depth S for denoising (first two plots) and deblurring (last two plots). We observe that all curves converge for large S, with the PSNR curves increasing monotonically. Moreover, the optimal stopping time T is finite in all cases, which empirically validates that early stopping is beneficial. We conclude that beyond a certain depth S the performance gain in terms of PSNR is negligible, whereas a proper choice of the optimal stopping time is crucial. The asymptotic value of T for increasing S is approximately 20 times larger for image deblurring than for image denoising due to the structure of the deblurring operator A. Figure 6 depicts the average \(\ell ^2\)-difference of consecutive convolution kernels and activation functions for denoising and deblurring as a function of the depth S. The differences decrease for larger values of S, which is consistent with the convergence of the optimal stopping time T for increasing S. Both time discretization schemes perform similarly; thus, in the following experiments, we solely present results calculated with Euler's method due to its lower computation time.

Fig. 5

Plots of the average PSNR value across the test set (first and third plot) as well as the learned optimal stopping time T (second and fourth plot) as a function of the depth S for denoising (top) and deblurring (bottom). All plots show the results for the explicit Euler and explicit Heun schemes

Fig. 6

Average change of consecutive convolution kernels (solid blue) and activation functions (dotted orange) for denoising (left) and deblurring (right) in terms of the \(\ell ^2\)-norm (Color figure online)

Fig. 7

Plots of the energies (first and third plot) and first-order conditions (second and fourth plot) for training and test set with \(\sigma =0.1\) for denoising (first pair of plots) and \(\tau =1.5\)/\(\sigma =0.01\) for deblurring (last pair of plots). The average value across the training/test sets are indicated by the dotted red/solid green curves. The area between the minimal and maximal function value for each T across the training/test set are indicated by the red and green area, respectively (Color figure online)

Fig. 8

From left to right: ground truth image, noisy input image (\(\sigma =0.1\)), restored images for \(T=\frac{{\overline{T}}}{2},{\overline{T}},\frac{3{\overline{T}}}{2},50{\overline{T}}\) with \({\overline{T}}=1.08\) for image denoising. The zoom factor of the magnifying lens is 3

Fig. 9

From left to right: ground truth image, blurry input image (\(\tau =1.5\), \(\sigma =0.01\)), restored images for \(T=\frac{{\overline{T}}}{2},{\overline{T}},\frac{3{\overline{T}}}{2},50{\overline{T}}\) with \({\overline{T}}=20.71\) for image deblurring. The zoom factor of the magnifying lens is 3

We next demonstrate the applicability of the first-order condition for the energy minimization in static variational networks using Euler’s discretization scheme with \(S=20\). Figure 7 depicts band plots along with the average curves among all training/test images of the functions

$$\begin{aligned} T&\mapsto J_{\{i\}}(T,({\overline{K}}_k, {\overline{w}}_k)_{k=1}^{N_K})\text { and}\\ T&\mapsto -\frac{1}{T}\int _0^1 \langle p_T^i, {\dot{x}}_T^i \rangle \,\mathrm {d}t\nonumber \end{aligned}$$
(33)

for all training and test images for denoising (first two plots) and deblurring (last two plots). We approximate the integral in the first-order condition (12) via

$$\begin{aligned} \int _0^1\langle p(t),{\dot{x}}(t)\rangle \,\mathrm {d}t\approx \frac{1}{S+1}\sum _{s=0}^S\langle p_s,Tf(x_s,(K_k,\Phi _k)_{k=1}^{N_K})\rangle . \end{aligned}$$

We deduce that the first-order condition for each test image indicates the energy minimizing stopping time. Note that all image dependent stopping times are distributed around the average optimal stopping time that is highlighted by the black vertical line and learned during training.

Figure 8 depicts two ground truth images \(x_g\) (first column), the corrupted images g (second column) and the denoised images for \(T=\frac{{\overline{T}}}{2},{\overline{T}},\frac{3{\overline{T}}}{2},100{\overline{T}}\) (third to sixth column). The maximum PSNR values obtained for \(T={\overline{T}}\) are 29.68 and 29.52, respectively. To ensure a sufficiently fine time discretization, we enforce \(\frac{S}{T}=\mathrm {const}\), where for \(T={\overline{T}}\) we set \(S=20\). Likewise, Fig. 9 contains the corresponding results for the deblurring task. Again, we enforce a constant ratio of S and T. The PSNR value peaks around the optimal stopping time, and the corresponding values are 29.52 and 27.80, respectively. We observed an average computation time of \(5.694\,\mathrm {ms}\) for the denoising and \(8.687\,\mathrm {ms}\) for the deblurring task using an RTX 2080 Ti graphics card and the PyTorch machine learning framework.

As desired, \({\overline{T}}\) indicates the energy minimizing time, and the average curves for the training and test sets nearly coincide, which shows that the model generalizes to unseen test images. Although the gradient of the average energy curve (33) is rather flat near the learned optimal stopping time, the proper choice of \({\overline{T}}\) is indeed crucial, as shown by the qualitative results in Figs. 8 and 9. In the case of denoising, for \(T<{\overline{T}}\), we still observe noisy images, whereas for too large T, local image patterns are smoothed out. For image deblurring, images computed with too small values of T remain blurry, while for \(T>{\overline{T}}\) ringing artifacts are generated whose intensity increases with larger T. For a corrupted image, the associated adjoint state requires knowledge of the ground truth for the terminal condition (14), which is in general not available. However, Fig. 7 shows that the learned average optimal stopping time \({\overline{T}}\) yields the smallest expected error. Thus, for arbitrary corrupted images, \({\overline{T}}\) is used as the stopping time.

Figure 10 illustrates the plots of the energies (blue plots) and the first-order conditions (red plots) as a function of the stopping time T for all test images for denoising (left) and deblurring (right), which are degraded by noise levels \(\sigma \in \{0.075,0.1,0.125,0.15\}\) and blur strengths \(\tau \in \{1.25,1.5,1.75,2.0\}\). In each plot, the associated curves of three prototypic images are visualized. To ensure a proper balancing of the data fidelity term and the regularization energy for the denoising task, we add the factor \(\frac{1}{\sigma ^2}\) to the data term, as typically motivated by Bayesian inference. For all noise levels \(\sigma \) and blur strengths \(\tau \), the same fixed pairs of kernels and activation functions trained with \(\sigma =0.1\)/\(\tau =1.5\) and depth \(S=20\) are used. Again, the first-order conditions indicate the degradation-dependent energy-minimizing stopping times. The optimal stopping time increases with the noise level and blur strength, which results from a larger distance between \(x_0\) and \(x_g\) and thus requires longer trajectories.

Fig. 10

Band plots of the energies (blue plots) and first-order conditions (red plots) for image denoising (left) and image deblurring (right) and various degradation levels. In each plot, the curves of three prototypic images are shown (Color figure online)

Table 1 Average PSNR value of the test set for image denoising/deblurring with different degradation levels along with the optimal stopping time \({\overline{T}}\)

Table 1 presents pairs of average PSNR values and optimal stopping times \({\overline{T}}\) on the test set for denoising (top) and deblurring (bottom) with different noise levels \(\sigma \in \{0.075,0.1,0.125,0.15\}\) and blur strengths \(\tau \in \{1.25,1.5,1.75,2.0\}\). All results in the table are obtained using \(N_K=48\) kernels and a depth \(S=20\). The first row of each block presents the results of a joint optimization of all control parameters \((K_k, w_k)_{k=1}^{N_K}\) and T. In contrast, the second row of each block shows the resulting PSNR values and optimal stopping times for a partial optimization of the stopping time only, using kernels and activation functions \((K_k, w_k)_{k=1}^{N_K}\) pretrained for \(\sigma =0.1\)/\(\tau =1.5\). Further, the third row of each block presents the PSNR values and stopping times obtained by using the reference models without any further optimization. Finally, the last rows present the results obtained by the FISTA algorithm [3] for the TV-\(L^2\) variational model [38] for denoising and the IRcgls algorithm of the "IR Tools" package [15] for deblurring, both of which are not data-driven and thus do not require any training. In detail, the weight parameter of the data term as well as the early stopping time of the TV-\(L^2\) model are optimized using a simple grid search. We exploit the IRcgls algorithm to iteratively minimize \(\Vert Ax-b\Vert _2^2\) using the conjugate gradient method until an early stopping rule based on the relative noise level of the residuum is satisfied. Note that IRcgls is a Krylov subspace method designed as a general-purpose method for large-scale inverse problems (for further details, see [15], [20, Chapter 6] and the references therein). Figure 11 shows a comparison of the results of the FISTA algorithm for the TV-\(L^2\) model/the IRcgls algorithm and our method (with the optimal early stopping time) for \(\sigma =0.1\)/\(\tau =1.5\).

Fig. 11

Corrupted input image (first/third row, left), restored images using the FISTA/IRcgls algorithm (first/third row, right), and the proposed method (second/fourth row, left) as well as the ground truth image (second/fourth row, right) for denoising (upper quadruple) and deblurring (lower quadruple). The zoom factor of the magnifying lens is 3

In Table 1, the resulting PSNR values of the first and second rows are almost identical for image denoising despite varying optimal stopping times. Consequently, a model pretrained for a specific noise level can easily be adapted to different noise levels by modifying only the optimal stopping time. Neglecting the optimization of the stopping time leads to inferior results, as presented in the third rows, where the reference model was used for all degradation levels without any change. However, in the case of image deblurring, the model benefits from a full optimization of all controls, which is caused by the dependency of A on the blur strength. For the noise level 0.1, we observe an average PSNR value of 28.72, which is on par with the corresponding results of [10, Table II]. We emphasize that their work performs a costly full minimization of an energy functional, whereas we solely require a depth \(S=20\) to compute comparable results.

For the sake of completeness, we present in Fig. 12 (denoising) and Fig. 13 (deblurring) the resulting triplets of kernels (top), potential functions (middle) and activation functions (bottom) for a depth \(S=20\). The scaling of the axes is identical among all potential functions and activation functions, respectively. Note that the potential functions are computed by numerical integration of the learned activation functions and we choose the integration constant such that every potential function is bounded from below by 0. As a result, we observe a large variety of different kernel structures, including bipolar forward operators in different orientations (e.g., 5th kernel in first row, 8th kernel in third row) or pattern kernels representing prototypic image textures (e.g. kernels in first column). Likewise, the learned potential functions can be assigned to several representative classes of common regularization functions like, for instance, truncated total variation (8th function in second row of Fig. 12), truncated concave (4th function in third row of Fig. 12), double-well potential (10th function in first row of Fig. 12) or “negative Mexican hat” (8th function in third row of Fig. 12). Note that the associated kernels in both tasks nearly coincide, whereas the potential and activation functions significantly differ. We observe that the activation functions in the case of denoising have a tendency to generate higher amplitudes compared to deblurring, which results in a higher relative balancing of the regularizer in the case of denoising.

Fig. 12

Triplets of \(7\times 7\)-kernels (top), potential functions \(\rho \) (middle) and activation functions \(\phi \) (bottom) learned for image denoising

Fig. 13

Triplets of \(7\times 7\)-kernels (top), potential functions \(\rho \) (middle) and activation functions \(\phi \) (bottom) learned for image deblurring

Fig. 14

\(N_v=64\) eigenpairs for image denoising, where all eigenfunctions have the resolution \(127\times 127\) and the intensity of each eigenfunction is adjusted to [0, 1]

Fig. 15

\(N_v=64\) eigenpairs for image deblurring, where all eigenfunctions have the resolution \(127\times 127\) and the intensity of each eigenfunction is adjusted to [0, 1]

4.4 Spectral Analysis of the Learned Regularizers

Finally, in order to gain intuition of the learned regularizer, we perform a nonlinear eigenvalue analysis [16] for the gradient of the Field-of-Experts regularizer learned for \(S=20\) and \(T={\overline{T}}\). For this reason, we compute several generalized eigenpairs \((\lambda _j,v_j)\in {\mathbb {R}}\times {\mathbb {R}}^{n}\) satisfying

$$\begin{aligned} \sum _{k=1}^{N_K}K_k^\top \Phi _k(K_k v_j)=\lambda _j v_j \end{aligned}$$

for \(j=1,\ldots ,N_v\). Note that by omitting the data term, the forward Euler scheme (27) applied to the generalized eigenfunctions \(v_j\) reduces to

$$\begin{aligned} v_j-\frac{T}{S}\sum _{k=1}^{N_K}K_k^\top \Phi _k(K_k v_j)=\left( 1-\frac{\lambda _j T}{S}\right) v_j, \end{aligned}$$
(34)

where the contrast factor \((1-\frac{\lambda _j T}{S})\) determines the global intensity change of the eigenfunction. We point out that due to the nonlinearity of the eigenvalue problem such a formula only holds locally for each iteration of the scheme.

We compute \(N_v=64\) generalized eigenpairs of size \(127\times 127\) by solving

$$\begin{aligned} \min _{v_j}\left\| \sum _{k=1}^{N_K}K_k^\top \Phi _k(K_k v_j)-\Lambda (v_j)v_j\right\| _2^2 \end{aligned}$$
(35)

for all \(j=1,\ldots ,N_v\), where

$$\begin{aligned} \Lambda (v)=\frac{\left\langle \sum _{k=1}^{N_K}K_k^\top \Phi _k(K_k v),v\right\rangle }{\Vert v\Vert _2^2} \end{aligned}$$

denotes the generalized Rayleigh quotient, which is obtained by minimizing (35) with respect to \(\Lambda (v)\). The eigenfunctions are computed using an accelerated gradient descent with step size control [32]. All eigenfunctions are initialized with randomly chosen image patches from the test image data set, from which we subtract the mean. Moreover, to minimize the influence of the image boundary, we scale the image intensity values with a spatial Gaussian kernel. We run the algorithm for \(10^4\) iterations, which suffices to reach a residual of approximately \(10^{-5}\) for each eigenpair.
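The residual minimization (35) can be sketched in a simplified setting. The snippet below substitutes a small symmetric matrix \(A\) with known spectrum for the nonlinear regularizer gradient (an assumption for this illustration only) and uses plain projected gradient descent instead of the accelerated scheme with step size control of [32]; for symmetric \(A\) the term involving \(\nabla \Lambda \) vanishes because \(\langle Av-\Lambda (v)v,v\rangle =0\) by definition of the Rayleigh quotient:

```python
import numpy as np

# Minimal sketch of solving (35) for one eigenpair by gradient descent on
# the residual norm, with a linear surrogate A for the regularizer
# gradient so the result can be checked against a known spectrum.
rng = np.random.default_rng(1)
Q, _ = np.linalg.qr(rng.standard_normal((5, 5)))
spectrum = np.array([0.1, 0.5, 1.0, 1.5, 2.0])
A = Q @ np.diag(spectrum) @ Q.T          # symmetric matrix with known eigenvalues

def rayleigh(v):
    # generalized Rayleigh quotient Lambda(v) = <A v, v> / ||v||^2
    return (A @ v) @ v / (v @ v)

v = rng.standard_normal(5)
v /= np.linalg.norm(v)
for _ in range(50_000):
    lam = rayleigh(v)
    r = A @ v - lam * v                  # residual of the eigenpair equation
    v -= 0.02 * 2.0 * (A @ r - lam * r)  # gradient of ||r||^2; <r, v> = 0 here
    v /= np.linalg.norm(v)               # keep the iterate on the unit sphere

lam = rayleigh(v)
assert np.linalg.norm(A @ v - lam * v) < 1e-4
assert np.min(np.abs(lam - spectrum)) < 1e-3
```

The full method of the paper operates on the nonlinear operator and on \(127\times 127\) images, but the structure of the iteration is the same: form the residual with the current Rayleigh quotient, take a descent step, and repeat until the residual is small.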

Figure 14 depicts the resulting pairs of eigenfunctions and eigenvalues for image denoising. We observe that eigenfunctions corresponding to smaller eigenvalues in general represent smoother and more complex image structures. In particular, the first eigenfunctions can be interpreted as cartoon-like image structures with clearly separable interfaces. Most of the eigenfunctions associated with larger eigenvalues exhibit texture-like patterns of progressively increasing frequency. Finally, wave and noise structures are present in the eigenfunctions with the highest eigenvalues.

We remark that all eigenvalues are in the interval \([0.025,11.696]\). Since \(\frac{T}{S}\approx 0.054\), the contrast factors \((1-\frac{\lambda _j T}{S})\) in (34) are in the interval [0.368, 0.999], which shows that the regularizer has a tendency to decrease the contrast. Formula (34) also reveals that eigenfunctions corresponding to contrast factors close to 1 are preserved over several iterations. In summary, the learned regularizer has a tendency to reduce the contrast of high-frequency noise patterns, but preserves the contrast of texture- and structure-like patterns.
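The quoted interval of contrast factors follows directly from formula (34) and the reported numbers, as this small arithmetic check (using only values stated in the text) confirms:

```python
# Contrast factors (1 - lambda_j * T/S) implied by the reported eigenvalue
# range [0.025, 11.696] and the ratio T/S ~ 0.054 from the text.
T_over_S = 0.054
factor_max = 1.0 - 0.025 * T_over_S    # ~ 0.999: contrast nearly preserved
factor_min = 1.0 - 11.696 * T_over_S   # ~ 0.368: strong contrast reduction
```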

Figure 15 shows the eigenpairs for the deblurring task. All eigenvalues are relatively small and distributed around 0, which means that the corresponding contrast factors lie in the interval [0.992, 1.030]. Therefore, the learned regularizer can both decrease and increase the contrast. Moreover, most eigenfunctions are composed of smooth structures with a distinct overshooting behavior in the proximity of image boundaries. This implies that the learned regularizer has a tendency to perform image sharpening.

5 Conclusion

Starting from a parametric and autonomous gradient flow perspective of variational methods, we explicitly modeled the stopping time as a control parameter in an optimal control problem. Using a Lagrangian approach, we derived a first-order condition suited to automate the computation of the optimal stopping time. A forward Euler discretization of the gradient flow led to static variational networks. Numerical experiments confirmed that a proper choice of the stopping time is of vital importance for the considered image restoration tasks in terms of the PSNR value. We performed a nonlinear eigenvalue analysis of the gradient of the learned Field-of-Experts regularizer, which revealed interesting properties of the local regularization behavior. A comprehensive long-term spectral analysis in continuous time is left for future research.

Another direction for future research is the extension to dynamic variational networks, in which the kernels and activation functions evolve in time. However, a major issue related to this extension arises from continuing the stopping time beyond its optimal point.