1 Introduction

Wavelet and frame shrinkage operators have become very popular in recent years. A starting point was the iterative shrinkage-thresholding algorithm (ISTA) in [16], which was interpreted as a special case of the forward-backward algorithm in [14]. For relations with other algorithms see also [8, 40]. Let \(T \in \mathbb R^{n \times d}\), \(n\ge d\), have full column rank. Then, the problem

$$\begin{aligned} {\mathop {\hbox {argmin}}\limits _{y \in \mathbb R^d}} \bigl \{ \tfrac{1}{2} \Vert x -y \Vert _2^2 + \lambda \Vert Ty\Vert _{1} \bigr \}, \quad \lambda >0, \end{aligned}$$
(1)

is known as the analysis point of view. For orthogonal \(T \in {\mathbb R}^{d \times d}\), the solution of (1) is given by the frame soft shrinkage operator \(T^\dagger \, S_\lambda \, T = T^* \, S_\lambda \, T\), see Example 2.3. If \(T \in \mathbb R^{n \times d}\) with \(n \le d\) and \(T T^* = I_n\), the solution of problem (1) is given by \(I_d - T^* T + T^* S_\lambda T\), see [6, Theorem 6.15]. For arbitrary \(T \in {\mathbb R}^{n \times d}\), \(n \ge d\), there are no analytic expressions for the solution of (1) in the literature.
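For the orthogonal case, the closed-form solution can be checked numerically against the first-order optimality condition of (1). The following numpy sketch is ours (the helper soft and the chosen values of d and \(\lambda \) are merely illustrative) and verifies that \(T^* S_\lambda (Tx)\) satisfies this condition.

```python
import numpy as np

rng = np.random.default_rng(0)
d, lam = 8, 0.5
T, _ = np.linalg.qr(rng.standard_normal((d, d)))   # random orthogonal T
x = rng.standard_normal(d)

def soft(z, lam):
    # componentwise soft shrinkage S_lambda
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

y = T.T @ soft(T @ x, lam)        # claimed minimizer of (1) for orthogonal T

# optimality condition: T(x - y) must lie in lam * subdifferential of ||.||_1 at Ty
v = T @ (x - y)
z = soft(T @ x, lam)              # equals T @ y since T is orthogonal
on = np.abs(z) > 1e-12
assert np.allclose(v[on], lam * np.sign(z[on]))
assert np.all(np.abs(v[~on]) <= lam + 1e-12)
print("optimality condition of (1) verified for orthogonal T")
```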

The question whether the frame shrinkage operator can itself be seen as a proximity operator has been recently studied in [20]. The authors showed that the set-valued operator \((T^\dagger S_\lambda T)^{-1} - I_d\) is maximally cyclically monotone, which implies that \(T^\dagger S_\lambda T\) is a proximity operator with respect to some norm in \(\mathbb R^d\). In this paper, we prove that for any operator \(T\in \mathcal {B} (\mathcal {H},\mathcal {K})\) with closed range, \(b \in \mathcal {K}\) and any proximity operator \(\mathrm {Prox}:\mathcal {K}\rightarrow \mathcal {K}\) the new operator \(T^\dagger \, \mathrm {Prox}\, ( T \cdot + b) :\mathcal {H}\rightarrow \mathcal {H}\) is also a proximity operator on the linear space \(\mathcal {H}\), but equipped with another inner product. The above-mentioned finite dimensional setting is included as a special case. In contrast to [20], we directly approach the problem using a classical result of Moreau [34]. Moreover, we provide the function for the definition of the proximity operator. Here, we would like to mention that this function can also be deduced from Proposition 3.9 in [12]. However, since this deduction appears to be more space-consuming than the direct proof of our Theorem 3.4, we prefer to give a direct approach. Note that different norms in the definition of the proximity operator were successfully used in variable metric algorithms, see [10].

Recently, it was shown that many activation functions appearing in neural networks are indeed proximity operators [13]. Based on these observations and our previous findings, we consider neural networks that are concatenations of proximity operators and call them proximal neural networks (PNNs). PNNs can be considered within the framework of the variational networks proposed in [29]. For stability reasons, PNNs related to linear operators from Stiefel manifolds are of special interest. They form so-called averaged operators and are consequently nonexpansive. Orthogonal matrices have already shown advantages in training recurrent neural networks (RNNs) [3, 4, 28, 31, 47, 49]. Using orthogonal matrices, vanishing or exploding gradients in training RNNs can be avoided [17]. The more general setting of learning rectangular matrices from a Stiefel manifold was proposed, e.g., in [25], but with a different focus than in this paper. The most relevant paper with respect to our setting is [26], where the authors considered the so-called optimization over multiple dependent Stiefel manifolds (OMDSM). We will see that the NNs in [26] are special cases of our PNNs so that our analysis ensures that they are averaged operators.

Using matrices from Stiefel manifolds results in 1-Lipschitz neural networks. Consequently, our approach is naturally related to other methods for controlling the Lipschitz constant of neural networks, which provably increases robustness against adversarial attacks [46]. In [23], the constant is controlled by projecting back every weight matrix in the network whose \(\Vert \cdot \Vert _p\) norm, \(p \in [1,\infty ]\), violates a pre-defined threshold. The authors in [39] characterize the singular values of the linear map associated with convolutional layers and use this for projecting a convolutional layer onto an operator-norm ball. Another closely related approach is spectral normalization as proposed in [33], where the spectral norm of every weight matrix is enforced to be one. Compared to our approach, this only restricts the largest singular value of the linear operators arising in the neural network. Limitations of the expressiveness of networks with restricted Lipschitz constants in every layer were discussed in [2, 27]. Note that our approach does not restrict the Lipschitz constants in every individual layer. Further, none of the above approaches is able to impose more structure on the network such as being an averaged operator.

Our results may be of interest in so-called Plug-and-Play algorithms [9, 42, 45]. In these algorithms a well-behaved operator, e.g., a proximity operator, is replaced by an efficient denoiser such as a neural network. However, training a denoising framework without structure can lead to a divergent algorithm, see [41]. In contrast, it was shown in [44] that a particular version of a Plug-and-Play algorithm converges if the network is averaged.

Our paper is organized as follows: We begin with preliminaries on convex analysis in Hilbert spaces in Sect. 2. In Sect. 3, we prove our general results on the interplay between proximity and certain affine operators. As a special case we emphasize that the frame soft shrinkage operator is itself a proximity operator in Sect. 4. In Sect. 5, we use our findings to set up neural networks as a concatenation of proximity operators on \(\mathbb R^d\) equipped with different norms related to linear operators. If these operators are related to tight frames, our proposed network is actually an averaged operator. In the case of Parseval frames, the involved matrices lie in Stiefel manifolds and we end up with Parseval proximal neural networks (PPNNs). Sect. 6 deals with the training of PPNNs via stochastic gradient descent on Stiefel manifolds. In Sect. 7, we provide first numerical examples. Finally, Sect. 8 contains conclusions and addresses further research questions.

2 Preliminaries

Let \(\mathcal {H}\) be a real Hilbert space with inner product \(\langle \cdot ,\cdot \rangle \) and norm \(\Vert \cdot \Vert \). By \(\Gamma _0(\mathcal {H})\) we denote the set of proper, convex, lower semi-continuous functions on \(\mathcal {H}\) mapping into \((-\infty , \infty ]\). For \(f \in \Gamma _0(\mathcal {H})\) and \(\lambda > 0\), the proximity operator \(\mathrm {prox}_{\lambda f}:\mathcal {H}\rightarrow \mathcal {H}\) and its Moreau envelope \(M_{\lambda f}:\mathcal {H}\rightarrow \mathbb {R}\) are defined by

$$\begin{aligned} \mathrm {prox}_{\lambda f} (x)&:= {\mathop {\hbox {argmin}}\limits _{y \in \mathcal {H}}} \bigl \{ \tfrac{1}{2} \Vert x-y\Vert ^2 + \lambda f(y) \bigr \}, \\ M_{\lambda f} (x)&:= \min _{y \in \mathcal {H}} \bigl \{ \tfrac{1}{2} \Vert x-y\Vert ^2 + \lambda f(y) \bigr \}. \end{aligned}$$

Clearly, the proximity operator and its Moreau envelope depend on the underlying space \(\mathcal {H}\), in particular on the chosen inner product. Recall that an operator \(A:\mathcal {H}\rightarrow \mathcal {H}\) is called firmly nonexpansive if for all \(x,y \in \mathcal {H}\) the following relation is fulfilled

$$\begin{aligned} \Vert Ax -Ay\Vert ^2 \le \langle x-y,Ax-Ay \rangle . \end{aligned}$$
(2)

Obviously, firmly nonexpansive operators are nonexpansive.

For a Fréchet differentiable function \(\Phi :\mathcal {H}\rightarrow \mathbb {R}\), the gradient \(\nabla \Phi (x)\) at \(x \in \mathcal {H}\) is defined as the vector satisfying for all \(h \in \mathcal {H}\),

$$\begin{aligned} \langle \nabla \Phi (x), h \rangle = D\Phi (x) h, \end{aligned}$$

where \(D\Phi :\mathcal {H}\rightarrow \mathcal {B} (\mathcal {H},\mathbb {R})\) denotes the Fréchet derivative of \(\Phi \), i.e., for all \(x,h \in \mathcal {H}\),

$$\begin{aligned} \Phi (x+h) - \Phi (x) = D\Phi (x) h + o(\Vert h\Vert ). \end{aligned}$$

Note that the gradient crucially depends on the chosen inner product in \(\mathcal {H}\). The following results can be found, e.g., in [5, Propositions 12.27, 12.29].

Theorem 2.1

Let \(f \in \Gamma _0(\mathcal {H})\). Then, the following relations hold true:

(i) The operator \(\mathrm {prox}_{\lambda f} :\mathcal {H}\rightarrow \mathcal {H}\) is firmly nonexpansive.

(ii) The function \(M_{\lambda f}\) is (Fréchet) differentiable with Lipschitz-continuous gradient given by

$$\begin{aligned} \nabla M_{\lambda f}(x) = x - \mathrm {prox}_{\lambda f}(x). \end{aligned}$$

Clearly, (ii) implies that

$$\begin{aligned} \mathrm {prox}_{\lambda f} (x) = \nabla \bigl ( \tfrac{1}{2} \Vert x \Vert ^2 - M_{\lambda f}(x) \bigr ) = \nabla \Phi (x), \end{aligned}$$
(3)

where \(\Phi := \frac{1}{2} \Vert \cdot \Vert ^2 - M_{\lambda f}\) is convex as \(\mathrm {prox}_{\lambda f}\) is nonexpansive [5, Proposition 17.10]. Further, it was shown by Moreau that the following (reverse) statement also holds true [34, Corollary 10c].

Theorem 2.2

The operator \(\mathrm {Prox}:\mathcal {H}\rightarrow \mathcal {H}\) is a proximity operator if and only if it is nonexpansive and there exists a function \(\Psi \in \Gamma _0(\mathcal {H})\) with \(\mathrm {Prox}(x) \in \partial \Psi (x)\) for any \(x \in \mathcal {H}\), where \(\partial \Psi \) denotes the subdifferential of \(\Psi \).

Thanks to (3), we conclude that \(\mathrm {Prox}:\mathcal {H}\rightarrow \mathcal {H}\) is a proximity operator if and only if it is nonexpansive and the gradient of a convex, differentiable function \(\Phi :\mathcal {H}\rightarrow \mathbb {R}\). Recently, the characterization of Bregman proximity operators in a more general setting was discussed in [24]. In the following example, we recall the Moreau envelope and the proximity operator related to the soft thresholding operator.

Example 2.3

Let \(\mathcal {H}= \mathbb {R}\) with usual norm \(|\cdot |\) and \(f(x) := |x|\). Then, \(\mathrm {prox}_{\lambda f}\) is the soft shrinkage operator \(S_\lambda \) defined by

$$\begin{aligned} S_\lambda (x):= \left\{ \begin{array}{cl} x - \lambda &{} \mathrm {for} \; x > \lambda ,\\ 0 &{} \mathrm {for} \; x \in [-\lambda ,\lambda ],\\ x + \lambda &{} \mathrm {for} \; x < -\lambda , \end{array} \right. \end{aligned}$$

and the Moreau envelope is the Huber function

$$\begin{aligned} m_{\lambda | \cdot |} (x) = \left\{ \begin{array}{cl} \lambda x - \frac{\lambda ^2}{2} &{} \mathrm {for} \; x > \lambda ,\\ \frac{1}{2} x^2 &{} \mathrm {for} \; x \in [-\lambda ,\lambda ],\\ - \lambda x - \frac{\lambda ^2}{2}&{} \mathrm {for} \; x < -\lambda . \end{array} \right. \end{aligned}$$

Hence, \(\mathrm {prox}_{\lambda f} = \nabla \varphi \), where \(\varphi (x) = \frac{x^{2}}{2} - m_{\lambda | \cdot |}(x)\), i.e.,

$$\begin{aligned} \varphi (x) = \left\{ \begin{array}{cl} \tfrac{1}{2}(x-\lambda )^2&{} \mathrm {for} \; x > \lambda ,\\ 0 &{}\mathrm {for} \; x \in [-\lambda ,\lambda ],\\ \tfrac{1}{2} (x + \lambda )^2&{} \mathrm {for} \; x < -\lambda . \end{array} \right. \end{aligned}$$

For \(\mathcal {H}= \mathbb R^d\) and \(f(x) := \Vert x\Vert _1\), we can use a componentwise approach. Then, \(S_\lambda \) is defined componentwise, the Moreau envelope reads as \(M_{\lambda \Vert \cdot \Vert _1} (x) = \sum _{i=1}^d m_{\lambda | \cdot |}(x_i)\) and the potential of \(\mathrm {prox}_{\lambda \Vert \cdot \Vert _1}\) is \(\Phi (x) = \sum _{i=1}^d \varphi (x_i)\).
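The quantities of this example are easily implemented; the following numpy sketch (with our own illustrative names soft, huber and phi) checks numerically that the soft shrinkage operator is indeed the derivative of the potential \(\varphi \).

```python
import numpy as np

lam = 0.7

def soft(x):
    # soft shrinkage S_lambda
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def huber(x):
    # Moreau envelope m_{lambda |.|}
    return np.where(np.abs(x) <= lam, 0.5 * x**2, lam * np.abs(x) - 0.5 * lam**2)

def phi(x):
    # potential varphi with prox_{lambda |.|} = varphi'
    return 0.5 * x**2 - huber(x)

x = np.linspace(-3, 3, 2001)
h = 1e-5
dphi = (phi(x + h) - phi(x - h)) / (2 * h)      # central finite differences
print(np.max(np.abs(dphi - soft(x))))           # tiny, since varphi is C^1
```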

3 Interplay Between Proximity and Linear Operators

Let \(\mathcal {H}\) and \(\mathcal {K}\) be real Hilbert spaces with inner products \(\langle \cdot ,\cdot \rangle _\mathcal {H}\) and \(\langle \cdot ,\cdot \rangle _\mathcal {K}\) and corresponding norms \(\Vert \cdot \Vert _\mathcal {H}\) and \(\Vert \cdot \Vert _\mathcal {K}\), respectively. By \(\mathcal {B}(\mathcal {H},\mathcal {K})\) we denote the space of bounded, linear operators from \(\mathcal {H}\) to \(\mathcal {K}\). The kernel and the range of \(T\in \mathcal {B}(\mathcal {H},\mathcal {K})\) are denoted by \(\mathcal N(T)\) and \(\mathcal R(T)\), respectively. In this section, we show that for any nontrivial operator \(T \in \mathcal {B}(\mathcal {H},\mathcal {K})\) with closed range \(\mathcal {R}(T)\), \(b \in \mathcal {K}\) and proximity operator \(\mathrm {Prox}:\mathcal {K}\rightarrow \mathcal {K}\), the operator \(T^\dagger \, \mathrm {Prox}( T \cdot + b):\mathcal {H}\rightarrow \mathcal {H}\) is itself a proximity operator on the linear space \(\mathcal {H}\) equipped with a suitable (equivalent) norm \(\Vert \cdot \Vert _{{\mathcal {H}_{T}}}\), i.e., there exists a function \(f \in \Gamma _0(\mathcal {H})\) such that

$$\begin{aligned} T^\dagger \, \mathrm {Prox}( Tx + b) = {\mathop {\hbox {argmin}}\limits _{y \in \mathcal {H}}} \bigl \{ \tfrac{1}{2} \Vert x-y\Vert _{\mathcal {H}_{T}}^2 + f(y) \bigr \}. \end{aligned}$$

Throughout this section, let \(T \in \mathcal {B}(\mathcal {H},\mathcal {K})\) have closed range. Then, the same holds true for its adjoint \(T^*:\mathcal {K}\rightarrow \mathcal {H}\) and the following (orthogonal) decompositions hold

$$\begin{aligned} \mathcal {K}= \mathcal {R}(T) \oplus \mathcal {N}(T^*), \qquad \mathcal {H}= \mathcal {R}(T^*) \oplus \mathcal {N}(T). \end{aligned}$$
(4)

The Moore–Penrose inverse (generalized inverse, pseudo-inverse) \(T^\dagger \in \mathcal {B}(\mathcal {K},\mathcal {H})\) is given point-wise by

$$\begin{aligned} \{T^{\dagger }y\} = \{x \in \mathcal {H}: T^*\,T x = T^*y\} \cap \mathcal {R}(T^*), \end{aligned}$$

see [5]. Further, it satisfies \(\mathcal {R} (T^\dagger ) = \mathcal {R} (T^*)\) and

$$\begin{aligned} T^{\dagger }\, T = P_{\mathcal {R}(T^*)}, \quad T\, T^{\dagger } = P_{\mathcal {R}(T)}, \end{aligned}$$
(5)

where \(P_{C}\) is the orthogonal projection onto the closed, convex set C, see [5, Proposition 3.28]. Then, it follows

$$\begin{aligned} T^\dagger \, T\, T^* = P_{\mathcal {R}(T^*)}\,T^* = T^* \quad \mathrm {and} \quad T^\dagger \, P_{\mathcal {R}(T)} = T^\dagger \, T\, T^\dagger = T^\dagger . \end{aligned}$$
(6)

If T is injective, then \(T^\dagger = (T^*\,T)^{-1} T^*\) and if T is surjective, we have \(T^\dagger = T^*(T\,T^*)^{-1}\).

Every \(T\in \mathcal {B}(\mathcal {H},\mathcal {K})\) gives rise to an inner product in \(\mathcal {H}\) via

$$\begin{aligned} \langle x, y \rangle _{\mathcal {H}_{T}}= \langle Tx, Ty \rangle _\mathcal {K}/\Vert T\Vert ^2_{\mathcal {B}(\mathcal {H},\mathcal {K})} + \langle x, P_{\mathcal {N}(T)} y \rangle _\mathcal {H}\end{aligned}$$
(7)

with corresponding norm

$$\begin{aligned} \Vert x\Vert _{\mathcal {H}_{T}}= \bigl (\Vert Tx\Vert ^2_\mathcal {K}/\Vert T\Vert ^2_{\mathcal {B}(\mathcal {H},\mathcal {K})} + \Vert P_{\mathcal {N}(T)} x\Vert ^2_\mathcal {H}\bigr )^{\frac{1}{2}}. \end{aligned}$$

If T is injective, the second summand vanishes. In general, this norm only induces a pre-Hilbert structure. Since \(T \in \mathcal {B}(\mathcal {H},\mathcal {K})\) has closed range, the norms \(\Vert \cdot \Vert _\mathcal {H}\) and \(\Vert \cdot \Vert _{\mathcal {H}_{T}}\) are equivalent on \(\mathcal {H}\) due to

$$\begin{aligned} \Vert x\Vert _{\mathcal {H}_{T}}^2 = \Vert T x \Vert _\mathcal {K}^2/\Vert T\Vert ^2_{\mathcal {B}(\mathcal {H},\mathcal {K})} + \Vert P_{\mathcal {N}(T)} x\Vert _\mathcal {H}^2 \le 2 \Vert x \Vert _\mathcal {H}^2 \end{aligned}$$
(8)

and

$$\begin{aligned} \Vert x \Vert _\mathcal {H}^2 = \Vert T^\dagger \,T x \Vert _\mathcal {H}^2 +\Vert P_{\mathcal {N}(T)} x \Vert _\mathcal {H}^2 \le \bigl (\Vert T^\dagger \Vert ^2_{\mathcal {B}(\mathcal {K},\mathcal {H})}\Vert T\Vert ^2_{\mathcal {B}(\mathcal {H},\mathcal {K})} + 1\bigr ) \Vert x \Vert _{\mathcal {H}_{T}}^2 \end{aligned}$$

for all \(x \in \mathcal {H}\). The norm equivalence also ensures the completeness of \(\mathcal {H}\) equipped with the new norm. To emphasize that we consider the linear space \(\mathcal {H}\) with this norm, we write \({\mathcal {H}_{T}}\). For special \(T \in \mathcal {B}(\mathcal {H},\mathcal {K})\), the inner product (7) coincides with the one in \(\mathcal {H}\).
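In the finite dimensional setting, the inner product (7) and the corresponding norm can be evaluated directly via the Moore-Penrose inverse. The following numpy sketch (sizes and names are illustrative) is a possible implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 6, 3
T = rng.standard_normal((n, d))             # full column rank almost surely
P_null = np.eye(d) - np.linalg.pinv(T) @ T  # projection onto N(T), here ~ 0
c = np.linalg.norm(T, 2) ** 2               # ||T||^2 (squared spectral norm)

def inner_T(x, y):
    # inner product (7)
    return (T @ x) @ (T @ y) / c + x @ (P_null @ y)

def norm_T(x):
    return np.sqrt(inner_T(x, x))

x = rng.standard_normal(d)
print(norm_T(x), np.linalg.norm(x))         # equivalent norms, cf. (8)
```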

Lemma 3.1

Let \(T \in \mathcal {B}(\mathcal {H},\mathcal {K})\) fulfill \(T^* T = \Vert T\Vert ^2_{\mathcal {B}(\mathcal {H},\mathcal {K})} \mathrm {Id}_\mathcal {H}\) or \(T T^* = \Vert T\Vert ^2_{\mathcal {B}(\mathcal {H},\mathcal {K})} \mathrm {Id}_\mathcal {K}\), where \(\mathrm {Id}_\mathcal {H}\) and \(\mathrm {Id}_\mathcal {K}\) denote the identity operator on \(\mathcal {H}\) and \(\mathcal {K}\), respectively. Then, the inner product (7) coincides with the one in \(\mathcal {H}\) and consequently \(\mathcal {H}= \mathcal {H}_T\).

Proof

If \(T^* T = \Vert T\Vert ^2_{\mathcal {B}(\mathcal {H},\mathcal {K})}\mathrm {Id}_\mathcal {H}\), then T is injective such that (7) implies

$$\begin{aligned} \langle x, y \rangle _{\mathcal {H}_{T}}= \langle Tx, Ty \rangle _\mathcal {K}/\Vert T\Vert ^2_{\mathcal {B}(\mathcal {H},\mathcal {K})} = \langle x, T^* Ty \rangle _\mathcal {H}/\Vert T\Vert ^2_{\mathcal {B}(\mathcal {H},\mathcal {K})} = \langle x, y \rangle _\mathcal {H}. \end{aligned}$$

If \(T T^* = \Vert T\Vert ^2_{\mathcal {B}(\mathcal {H},\mathcal {K})}\mathrm {Id}_\mathcal {K}\), then (6) implies that \(P_{\mathcal {N}(T)} = \mathrm {Id}_\mathcal {H}- T^\dagger T = \mathrm {Id}_\mathcal {H}- T^* T/\Vert T\Vert ^2_{\mathcal {B}(\mathcal {H},\mathcal {K})}\) and

$$\begin{aligned} \langle x, y \rangle _{\mathcal {H}_{T}}&= \langle Tx, Ty \rangle _\mathcal {K}/\Vert T\Vert ^2_{\mathcal {B}(\mathcal {H},\mathcal {K})} + \langle x, P_{\mathcal {N}(T)} y \rangle _\mathcal {H}\\&= \langle x,T^*Ty \rangle _\mathcal {H}/\Vert T\Vert ^2_{\mathcal {B}(\mathcal {H},\mathcal {K})} + \langle x,y\rangle _\mathcal {H}- \langle x,T^* T y\rangle _\mathcal {H}/\Vert T\Vert ^2_{\mathcal {B}(\mathcal {H},\mathcal {K})}\\&= \langle x,y\rangle _\mathcal {H}. \end{aligned}$$

   \(\square \)

To apply the characterization of proximal mappings in \(\mathcal {H}_T\) by Moreau, see Theorem 2.2, we have to compute gradients in \(\mathcal {H}_T\). Here, the following result is crucial.

Lemma 3.2

Let \(\mathcal {H}\) and \(\mathcal {K}\) be real Hilbert spaces with inner products \(\langle \cdot ,\cdot \rangle _\mathcal {H}\) and \(\langle \cdot ,\cdot \rangle _\mathcal {K}\), respectively. For an operator \(T \in \mathcal {B}(\mathcal {H},\mathcal {K})\) with closed range, let \(\mathcal {H}_T\) be the Hilbert space with inner product (7). For (Fréchet) differentiable \(\Phi :\mathcal {H}\rightarrow \mathbb R\), the gradients \(\nabla _\mathcal {H}\Phi \) and \(\nabla _{\mathcal {H}_T} \Phi \) with respect to the different inner products are related by

$$\begin{aligned} \bigl (T^* T/\Vert T\Vert ^2_{\mathcal {B}(\mathcal {H},\mathcal {K})} + P_{\mathcal {N}(T)}\bigr )\nabla _{\mathcal {H}_{T}}\, \Phi (x) = \nabla _\mathcal {H}\Phi (x). \end{aligned}$$

Proof

The gradient \(\nabla _{\mathcal {H}_{T}}\Phi (x)\) at \(x \in \mathcal {H}\) in the space \({\mathcal {H}_{T}}\) is given by the vector satisfying

$$\begin{aligned} \langle \nabla _{\mathcal {H}_{T}}\Phi (x), h \rangle _{\mathcal {H}_{T}}= D\Phi (x) h = \langle \nabla _\mathcal {H}\Phi (x), h \rangle _\mathcal {H}\end{aligned}$$

for all \(h \in \mathcal {H}\). Since

$$\begin{aligned} \langle \nabla _{\mathcal {H}_{T}}\Phi (x), h \rangle _{\mathcal {H}_{T}}&= \langle T \nabla _{\mathcal {H}_{T}}\Phi (x), Th \rangle _\mathcal {K}/\Vert T\Vert ^2_{\mathcal {B}(\mathcal {H},\mathcal {K})} + \langle \nabla _{\mathcal {H}_{T}}\Phi (x),P_{\mathcal {N}(T)} h \rangle _\mathcal {H}\\&= \langle T^* T \nabla _{\mathcal {H}_{T}}\Phi (x), h \rangle _\mathcal {H}/\Vert T\Vert ^2_{\mathcal {B}(\mathcal {H},\mathcal {K})} + \langle P_{\mathcal {N}(T)} \nabla _{\mathcal {H}_{T}}\Phi (x), h \rangle _\mathcal {H}, \end{aligned}$$

the gradient depends on the chosen inner product through

$$\begin{aligned} \bigl (T^* T/\Vert T\Vert ^2_{\mathcal {B}(\mathcal {H},\mathcal {K})} + P_{\mathcal {N}(T)}\bigr )\nabla _{\mathcal {H}_{T}}\, \Phi (x) = \nabla _\mathcal {H}\Phi (x). \end{aligned}$$

   \(\square \)

Now, the desired result follows from the next theorem.

Theorem 3.3

Let \(b \in \mathcal {K}\), \(T \in \mathcal {B}(\mathcal {H},\mathcal {K})\) have closed range and \(\mathrm {Prox}:\mathcal {K}\rightarrow \mathcal {K}\) be a proximity operator on \(\mathcal {K}\). Then, the operator \(A := T^\dagger \, \mathrm {Prox}\, (T \cdot + b) :{\mathcal {H}_{T}}\rightarrow {\mathcal {H}_{T}}\) is a proximity operator.

Proof

In view of Theorems 2.1 and 2.2, it suffices to show that A is nonexpansive and that there exists a convex function \(\Psi :{\mathcal {H}_{T}}\rightarrow \mathbb {R}\) with \(A = \nabla _{\mathcal {H}_{T}}\Psi \).

1. First, we show that A is firmly nonexpansive, and thus nonexpansive. By (4), we see that

$$\begin{aligned} P_{\mathcal {N}(T)} T^\dagger = 0. \end{aligned}$$
(9)

Using this and (5), it follows

$$\begin{aligned} \Vert Ax - Ay \Vert _{\mathcal {H}_{T}}^2&= \frac{\Vert T T^\dagger \left( \, \mathrm {Prox}\, (T x + b) - \mathrm {Prox}\, (T y + b) \right) \Vert _\mathcal {K}^2}{\Vert T\Vert ^2_{\mathcal {B}(\mathcal {H},\mathcal {K})}} + \Vert P_{\mathcal {N}(T)} (Ax -Ay)\Vert ^2_\mathcal {H}\nonumber \\&\le \frac{\Vert \mathrm {Prox}\, (T x + b) - \mathrm {Prox}\, (T y + b) \Vert _\mathcal {K}^2}{\Vert T\Vert ^2_{\mathcal {B}(\mathcal {H},\mathcal {K})}}. \end{aligned}$$
(10)

By (9) and (5), we obtain

$$\begin{aligned} \Vert T\Vert ^2_{\mathcal {B}(\mathcal {H},\mathcal {K})}\bigl \langle A x - A y, x-y \bigr \rangle _{\mathcal {H}_{T}}&= \bigl \langle T T^\dagger \bigl ( \mathrm {Prox}\, (T x +b) - \mathrm {Prox}\, (T y + b) \bigr ), Tx - Ty \bigr \rangle _\mathcal {K}\\&= \bigl \langle P_{\mathcal {R}(T)} \bigl ( \mathrm {Prox}\, (T x + b) - \mathrm {Prox}\, (T y + b) \bigr ), Tx - Ty \bigr \rangle _\mathcal {K}\\&= \bigl \langle \mathrm {Prox}\, (T x + b) - \mathrm {Prox}\, (T y + b), Tx - Ty \bigr \rangle _\mathcal {K}\\&= \bigl \langle \mathrm {Prox}\, (T x + b) - \mathrm {Prox}\, (T y + b), Tx + b - (Ty + b) \bigr \rangle _\mathcal {K}, \end{aligned}$$

and since \(\mathrm {Prox}\) is firmly nonexpansive with respect to \(\Vert \cdot \Vert _{\mathcal {K}}\), see (2), the estimate (10) further implies that A is firmly nonexpansive

$$\begin{aligned} \bigl \langle Ax - Ay , x-y \bigr \rangle _{\mathcal {H}_{T}}&\ge \frac{\Vert \mathrm {Prox}\, (T x + b) - \mathrm {Prox}\, (T y + b) \Vert ^2_\mathcal {K}}{\Vert T\Vert ^2_{\mathcal {B}(\mathcal {H},\mathcal {K})}} \ge \Vert Ax - Ay \Vert _{\mathcal {H}_{T}}^2. \end{aligned}$$

2. It remains to prove that there exists a convex function \(\Psi :{\mathcal {H}_{T}}\rightarrow \mathbb {R}\) with \(\nabla _{\mathcal {H}_{T}}\Psi = A\). Since \(\mathrm {Prox}\) is a proximity operator, there exists a convex function \(\Phi :\mathcal {K}\rightarrow \mathbb {R}\) with \(\mathrm {Prox}= \nabla _\mathcal {K}\Phi \). Then, a natural candidate is given by \(\Psi =\Phi \, (T \cdot + b)/\Vert T\Vert ^2_{\mathcal {B}(\mathcal {H},\mathcal {K})}\). Using the definition of the gradient and the chain rule, it holds for all \(x,h\in \mathcal {H}\) that

$$\begin{aligned} \langle \nabla _\mathcal {H}\Psi (x) , h \rangle _\mathcal {H}&= D\Psi (x)h = \frac{D\Phi (Tx + b)\,Th}{\Vert T\Vert ^2_{\mathcal {B}(\mathcal {H},\mathcal {K})}}= \frac{\langle \nabla _\mathcal {K}\Phi (Tx + b) , Th \rangle _\mathcal {K}}{\Vert T\Vert ^2_{\mathcal {B}(\mathcal {H},\mathcal {K})}}\\&= \frac{\langle T^* \mathrm {Prox}\, (Tx + b) , h \rangle _\mathcal {H}}{\Vert T\Vert ^2_{\mathcal {B}(\mathcal {H},\mathcal {K})}}. \end{aligned}$$

Incorporating Lemma 3.2, we conclude

$$\begin{aligned} (T^* T/\Vert T\Vert ^2_{\mathcal {B}(\mathcal {H},\mathcal {K})} + P_{\mathcal {N}(T)}) \nabla _{\mathcal {H}_{T}}\Psi (x) = \nabla _\mathcal {H}\Psi (x) = T^* \, \mathrm {Prox}\, (Tx+ b)/\Vert T\Vert ^2_{\mathcal {B}(\mathcal {H},\mathcal {K})}, \end{aligned}$$

which implies \(T^*\, T\, \nabla _{\mathcal {H}_{T}}\Psi (x) = T^* \, \mathrm {Prox}\, (Tx + b)\) and \(\nabla _{\mathcal {H}_{T}}\Psi (x) \in \mathcal {R}(T^*)\). By definition of \(T^\dagger \), we obtain \(\nabla _{\mathcal {H}_{T}}\Psi = A\). Finally, \(\Psi \) is convex as it is the concatenation of a convex function with an affine function.    \(\square \)
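The firm nonexpansiveness shown in the first part of the proof can also be checked numerically. The following sketch does so for \(\mathrm {Prox}= S_\lambda \) (componentwise soft shrinkage) and randomly chosen T, b and sample points; all names and sizes are our own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, lam = 7, 4, 0.3
T = rng.standard_normal((n, d))
b = rng.standard_normal(n)
T_pinv = np.linalg.pinv(T)
P_null = np.eye(d) - T_pinv @ T
c = np.linalg.norm(T, 2) ** 2

def inner_T(u, v):
    # inner product (7)
    return (T @ u) @ (T @ v) / c + u @ (P_null @ v)

def soft(z):
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def A(x):
    # A = T^dagger Prox(T . + b) with Prox = soft shrinkage
    return T_pinv @ soft(T @ x + b)

for _ in range(1000):
    x, y = rng.standard_normal(d), rng.standard_normal(d)
    # firm nonexpansiveness (2) with respect to the H_T inner product
    assert inner_T(A(x) - A(y), A(x) - A(y)) <= inner_T(x - y, A(x) - A(y)) + 1e-12
print("A is firmly nonexpansive on H_T")
```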

Let

$$\begin{aligned} (f\square g)(x) := \inf _{y\in \mathcal {H}} f(y) + g(x-y) \end{aligned}$$

denote the infimal convolution of \(f,g \in \Gamma _0(\mathcal {H})\) and \(x \mapsto \iota _S(x)\) the indicator function of the set S taking the value 0 if \(x \in S\) and \(+\infty \) otherwise.

For \(\mathrm {Prox}:= \mathrm {prox}_{g}\) with \(g \in \Gamma _0(\mathcal {H})\), we are actually able to explicitly compute \(f \in \Gamma _0(\mathcal {H})\) such that \(T^\dagger \, \mathrm {Prox}\, (T \cdot + b) = \mathrm {prox}_f\) on \({\mathcal {H}_{T}}\). Clearly, this also gives an alternative proof for Theorem 3.3.

Theorem 3.4

Let \(b \in \mathcal {K}\), \(T \in \mathcal {B}(\mathcal {H},\mathcal {K})\) with closed range and \(\mathrm {Prox}:= \mathrm {prox}_{g}\) for some \(g \in \Gamma _0(\mathcal {K})\). Then, \(T^{\dagger } \, \mathrm {prox}_{g} \, (T \cdot + b) :{\mathcal {H}_{T}}\rightarrow {\mathcal {H}_{T}}\) is the proximity operator on \({\mathcal {H}_{T}}\) of \(f \in \Gamma _0(\mathcal {H})\) given by

$$\begin{aligned} f(x) := g \square \bigl ( \tfrac{1}{2} \Vert \cdot \Vert _{\mathcal {K}}^2 + \iota _{\mathcal {N}(T^*)} \bigr ) (Tx + b)/\Vert T\Vert ^2_{\mathcal {B}(\mathcal {H},\mathcal {K})} + \iota _{\mathcal {R}(T^*)}(x). \end{aligned}$$
(11)

This expression simplifies to

$$\begin{aligned} f(x)&= g \square \bigl ( \tfrac{1}{2} \Vert \cdot \Vert _{\mathcal {K}}^2 + \iota _{\mathcal {N}(T^*)} \bigr ) (Tx + b)/\Vert T\Vert ^2_{\mathcal {B}(\mathcal {H},\mathcal {K})} \quad \text{ if } \text{ T } \text{ is } \text{ injective },\\ f(x)&= g(Tx + b)/\Vert T\Vert ^2_{\mathcal {B}(\mathcal {H},\mathcal {K})} + \iota _{\mathcal {R}(T^*)}(x) \quad \text{ if } \text{ T } \text{ is } \text{ surjective },\\ f(x)&= g(Tx + b)/\Vert T\Vert ^2_{\mathcal {B}(\mathcal {H},\mathcal {K})} \quad \text{ if } \text{ T } \text{ is } \text{ bijective }. \end{aligned}$$

Proof

By (6) and (4), we obtain

$$\begin{aligned}&T^\dagger \, \mathrm {prox}_{g} \, (T x + b)\\ =&T^\dagger {\mathop {\hbox {argmin}}\limits _{z \in \mathcal {K}}} \bigl \{ \tfrac{1}{2} \Vert z - Tx - b\Vert _{\mathcal {K}}^2 + g(z)\bigr \}\\ =&T^\dagger P_{\mathcal {R}(T)} {\mathop {\hbox {argmin}}\limits _{z_1 \in \mathcal {R}(T), z_2 \in \mathcal {N}(T^*)}} \bigl \{ \tfrac{1}{2} \Vert z_1 + z_2 - Tx\Vert _{\mathcal {K}}^2 + g(z_1 + z_2 + b) \bigr \}\\ =&T^\dagger {\mathop {\hbox {argmin}}\limits _{z_1 \in \mathcal {R}(T)}} \inf _{z_2 \in \mathcal {N}(T^*)} \bigl \{ \tfrac{1}{2} \Vert z_1 - Tx\Vert _{\mathcal {K}}^2 + \tfrac{1}{2} \Vert z_2\Vert _{\mathcal {K}}^2 + g(z_1 + z_2 + b) \bigr \}\\ =&T^\dagger {\mathop {\hbox {argmin}}\limits _{z_1 \in \mathcal {R}(T)}} \Bigl \{ \tfrac{1}{2} \Vert z_1 - Tx\Vert _{\mathcal {K}}^2 + \inf _{z_2 \in \mathcal {N}(T^*)} \bigl \{ \tfrac{1}{2}\Vert z_2\Vert _{\mathcal {K}}^2 + g(z_1 + z_2 + b) \bigr \} \Bigr \}\\ =&T^\dagger T {\mathop {\hbox {argmin}}\limits _{y \in \mathcal {R}(T^*)}} \Bigl \{ \tfrac{1}{2} \Vert Ty - Tx\Vert _{\mathcal {K}}^2 + \inf _{z_2 \in \mathcal {N}(T^*)} \bigl \{ \tfrac{1}{2}\Vert z_2\Vert _{\mathcal {K}}^2 + g(Ty + z_2 + b) \bigr \} \Bigr \} \end{aligned}$$

and by (5) further

$$\begin{aligned}&T^\dagger \, \mathrm {prox}_{g} \, (T x + b)\nonumber \\ =&{\mathop {\hbox {argmin}}\limits _{y \in \mathcal {R}(T^*)}} \Bigl \{ \tfrac{1}{2} \Vert Ty - Tx\Vert _{\mathcal {K}}^2 + \inf _{z_2 \in \mathcal {N}(T^*)} \bigl \{ \tfrac{1}{2}\Vert z_2\Vert _{\mathcal {K}}^2 + g(Ty + z_2 + b) \bigr \} \Bigr \}\nonumber \\ =&{\mathop {\hbox {argmin}}\limits _{y \in \mathcal {R}(T^*)}} \Bigl \{ \tfrac{1}{2} \Vert y - x\Vert _{\mathcal {H}_{T}}^2 + \inf _{z_2 \in \mathcal {N}(T^*)} \bigl \{ \tfrac{1}{2}\Vert z_2\Vert _{\mathcal {K}}^2 + g(Ty + z_2 + b) \bigr \}/\Vert T\Vert ^2_{\mathcal {B}(\mathcal {H},\mathcal {K})} \Bigr \} \\ =&{\mathop {\hbox {argmin}}\limits _{y \in \mathcal {H}}} \Bigl \{ \tfrac{1}{2} \Vert y - x\Vert _{\mathcal {H}_{T}}^2 + g \square \bigl ( \tfrac{1}{2} \Vert \cdot \Vert _{\mathcal {K}}^2 + \iota _{\mathcal {N}(T^*)} \bigr ) (Ty + b)/\Vert T\Vert ^2_{\mathcal {B}(\mathcal {H},\mathcal {K})} + \iota _{\mathcal {R}(T^*)}(y)\Bigr \}.\nonumber \end{aligned}$$
(12)

Hence, we conclude that \(T^{\dagger } \, \mathrm {prox}_{g} \, (T \cdot + b)\) is the proximity operator on \({\mathcal {H}_{T}}\) of f in (11).    \(\square \)

Note that for injective T and \(b=0\), the function f is in general a weaker regularizer than g. This is necessary since using \(g(T \cdot )/\Vert T\Vert ^2_{\mathcal {B}(\mathcal {H},\mathcal {K})}\) in place of f in (12) would lead to

$$\begin{aligned} {\mathop {\hbox {argmin}}\limits _{y \in \mathcal {R}(T^*)}} \bigl \{\tfrac{1}{2} \Vert x-y\Vert _{\mathcal {H}_T}^2 + g(Ty)/\Vert T\Vert ^2_{\mathcal {B}(\mathcal {H},\mathcal {K})} \bigr \}&= T^\dagger {\mathop {\hbox {argmin}}\limits _{z \in \mathcal {K}}} \bigl \{ \tfrac{1}{2} \Vert z - Tx\Vert _\mathcal {K}^2 + g(z) + \iota _{\mathcal{R}(T)}(z)\bigr \}\\ {}&\ne T^\dagger \mathrm {prox}_{g} (T x). \end{aligned}$$

4 Frame Soft Shrinkage as Proximity Operator

In this section, we investigate the frame soft shrinkage as a special proximity operator. Let \(\mathcal {K}= \ell _2\) be the Hilbert space of square summable sequences \(c = \{c_k\}_{k \in \mathbb N}\) with norm \(\Vert c \Vert _{\ell _2} := ( \sum _{k \in \mathbb N} |c_k|^2)^{\frac{1}{2}}\) and assume that \(\mathcal {H}\) is separable. A sequence \(\{x_k\}_{k\in \mathbb {N}}\), \(x_k \in \mathcal {H}\), is called a frame of \(\mathcal {H}\), if constants \(0< A \le B < \infty \) exist such that for all \(x \in \mathcal {H}\),

$$\begin{aligned} A \Vert x\Vert _\mathcal {H}^2 \le \sum _{k\in \mathbb {N}} |\langle x,x_k \rangle _\mathcal {H}|^2 \le B \Vert x\Vert _\mathcal {H}^2. \end{aligned}$$

Given a frame \(\{x_k\}_{k\in \mathbb {N}}\) of \(\mathcal {H}\), the corresponding analysis operator \(T :\mathcal {H}\rightarrow \ell _2\) is defined as

$$\begin{aligned} Tx=\bigl \{ \langle x,x_k \rangle _\mathcal {H}\bigr \}_{k\in \mathbb {N}}, \quad x\in \mathcal {H}. \end{aligned}$$

Its adjoint \(T^*:\ell _2 \rightarrow \mathcal {H}\) is the synthesis operator given by

$$\begin{aligned} T^*\{c_k\}_{k\in \mathbb {N}} = \sum _{k\in \mathbb {N}} c_k x_k, \quad \{c_k\}_{k\in \mathbb {N}} \in \ell _2. \end{aligned}$$

By composing T and \(T^*\), we obtain the frame operator

$$\begin{aligned} T^*Tx = \sum _{k\in \mathbb {N}} \langle x , x_k \rangle _\mathcal {H}x_k, \quad x\in \mathcal {H}, \end{aligned}$$

which is invertible on \(\mathcal {H}\), see [11], such that

$$\begin{aligned} x= \sum _{k\in \mathbb {N}} \langle x , x_k \rangle _{\mathcal H} (T^*T)^{-1} x_k, \quad x\in \mathcal {H}. \end{aligned}$$

The sequence \(\{ (T^*T)^{-1}x_k\}_{k\in \mathbb {N}}\) is called the canonical dual frame of \(\{ x_k\}_{k\in \mathbb {N}}\). If

$$\begin{aligned} T^*T = \Vert T^* T\Vert \mathrm {Id}_\mathcal {H}= \Vert T\Vert ^2 \mathrm {Id}_\mathcal {H}, \end{aligned}$$

then \(\{x_k\}_{k \in \mathbb N}\) is called a tight frame, and for \(T^*T = \mathrm {Id}_\mathcal {H}\) a Parseval frame. Here, Lemma 3.1 comes into play. Note that \(T^\dagger \) is indeed the synthesis operator for the canonical dual frame of \(\{ x_k\}_{k\in \mathbb {N}}\). The relation between linear, bounded, injective operators with closed range and frame analysis operators is given in the next proposition.

Proposition 4.1

  1. (i)

    An operator \(T \in \mathcal {B}(\mathcal {H},\ell _2)\) is injective and has closed range if and only if it is the analysis operator of some frame of \(\mathcal {H}\).

  2. (ii)

    An operator \(T \in \mathcal {B}(\ell _2,\mathcal {H})\) is surjective if and only if it is the synthesis operator of some frame of \(\mathcal {H}\).

Proof

(i) If T is the analysis operator for a frame \(\{x_k\}_{k\in \mathbb {N}}\), then T is bounded, injective and has closed range, see [11]. Conversely, assume that \(T \in \mathcal {B}(\mathcal {H},\ell _2)\) is injective and that \(\mathcal {R}(T)\) is closed. By (4), it holds \(\mathcal {R}(T^*) = \mathcal H\). Let \(\{\delta _k\}_{k\in \mathbb {N}}\) be the canonical basis of \(\ell _2\) and set \(\{x_k \}_{k\in \mathbb {N}}:= \{T^{*} \delta _k\}_{k\in \mathbb {N}}\). Since \(\sum _{k\in \mathbb {N}} |\langle x,x_k \rangle _\mathcal {H}|^2 = \Vert Tx \Vert _{\ell _2}^2\), we conclude that \(\{x_k \}_{k\in \mathbb {N}}\) is a frame of \(\mathcal {H}\) and that T is the corresponding analysis operator.

(ii) Let \(\{x_k\}_{k\in \mathbb {N}} = \{T\delta _k\}_{k\in \mathbb {N}}\). Then, the result follows from [11, Theorem 5.5.1].    \(\square \)

The soft shrinkage operator \(S_\lambda \) on \(\ell _2\) (applied componentwise) is the proximity operator corresponding to the function \(g := \lambda \Vert \cdot \Vert _1\), \(\lambda >0\). As an immediate consequence of Theorem 3.4, we obtain the following corollary.

Corollary 4.2

Assume that \(T:\mathcal {H}\rightarrow \ell _2\) is an analysis operator for some frame of \(\mathcal {H}\) and \(\mathrm {Prox}:\ell _2 \rightarrow \ell _2\) is an arbitrary proximity operator. Then, \(T^{\dagger } \, \mathrm {Prox}\, T \) is itself a proximity operator on \(\mathcal {H}\) equipped with the norm \(\Vert \cdot \Vert _{\mathcal {H}_{T}}\). In particular, if \(\mathrm {Prox}:= S_\lambda \) with \(\lambda >0\), then

$$\begin{aligned}&T^{\dagger } \, S_\lambda \, (T x) = {\mathop {\hbox {argmin}}\limits _{y \in \mathcal {H}}} \bigl \{ \tfrac{1}{2} \Vert x-y\Vert _{\mathcal {H}_{T}}^2 + f(y)\bigr \},\\&f(y) := \lambda \Vert \cdot \Vert _1 \square \bigl ( \tfrac{1}{2} \Vert \cdot \Vert _{\ell _2}^2 + \iota _{\mathcal {N}(T^*)} \bigr ) (Ty)/\Vert T\Vert ^2_{\mathcal {B}(\mathcal {H},\mathcal {K})}. \end{aligned}$$

Finally, let us have a look at the finite dimensional setting with \(\mathcal {H}:= \mathbb {R}^d\), \(\mathcal {K}:= \mathbb {R}^n\), \(n\ge d\). Then, we have for any \(T \in \mathbb {R}^{n,d}\) with full rank d and the proximity operator \(S_\lambda \) with \(\lambda >0\) on \(\mathbb {R}^n\) that

$$\begin{aligned}&T^{\dagger } \, S_\lambda \, (T x) = {\mathop {\hbox {argmin}}\limits _{y \in \mathbb {R}^d}} \bigl \{ \tfrac{1}{2} \Vert x-y\Vert _{\mathcal {H}_{T}}^2 + f(y) \bigr \},\nonumber \\&f(y) := \lambda \Vert \cdot \Vert _1 \square \bigl ( \tfrac{1}{2} \Vert \cdot \Vert _{2}^2 + \iota _{\mathcal {N}(T^*)} \bigr ) (Ty)/\Vert T\Vert ^2_{\mathcal {B}(\mathcal {H},\mathcal {K})}. \end{aligned}$$
(13)

Example 4.3

We want to compute f for the matrix \(T:\mathbb {R}^{1} \rightarrow \mathbb {R}^{2}\) given by \(T = ( 1 , 2)^\mathrm {T}\) and the soft shrinkage operator \(S_\lambda \) on \(\mathbb {R}^2\) with \(\lambda >0\). Note that this example was also considered in [20]. By (13) and since \(x = (x_1, x_2)^\mathrm {T}\in \mathcal {N}(T^*)\) if and only if \(x_1 = -2 x_2\), we obtain

$$\begin{aligned} f(y) \Vert T\Vert ^2_{\mathcal {B}(\mathcal {H},\mathcal {K})}&= \lambda \Vert \cdot \Vert _1 \square \bigl ( \tfrac{1}{2} \Vert \cdot \Vert _2^2 + \iota _{\mathcal {N}(T^*)} (\cdot ) \bigr ) (Ty)\\&= \min _{Ty = z+x} \left\{ \lambda \Vert z\Vert _1 + \tfrac{1}{2} \Vert x \Vert _{2}^2 + \iota _{\mathcal {N}(T^*)}(x) \right\} \\&= \min _{x \in \mathbb {R}^2} \left\{ \lambda \Vert Ty-x\Vert _1 + \tfrac{1}{2} \Vert x \Vert _{2}^2+ \iota _{\mathcal {N}(T^*)}(x) \right\} \\&= \min _{x \in \mathbb {R}^2} \bigl \{\lambda \bigl \Vert (y , 2y )^\mathrm {T}- ( x_1, x_2)^\mathrm {T}\bigr \Vert _1 + \tfrac{1}{2} \Vert x \Vert _2^2 + \iota _{\mathcal {N}(T^*)}(x) \bigr \}\\&= \min _{x_2 \in \mathbb {R}} \bigl \{ \lambda \vert y +2x_2 \vert + \lambda \vert 2y-x_2 \vert + \tfrac{5}{2} x_2^2 \bigr \}. \end{aligned}$$

Consider the strictly convex function \(g_y(x_2) = \lambda \vert y +2x_2 \vert + \lambda \vert 2y-x_2 \vert + \frac{5}{2} x_2^2\). For \(\vert y \vert \le \frac{2}{5} \lambda \), it holds

$$\begin{aligned} 0 \in \partial _{x_2} g_y \left( -\tfrac{y}{2} \right) = [-2\lambda ,2\lambda ] - \lambda \mathrm {sgn}(y) -\tfrac{5}{2} y. \end{aligned}$$

Hence, by Fermat’s theorem, the unique minimizer of \(g_y(x_2)\) is given by \(-\frac{y}{2}\). Consequently, we have for \(\vert y \vert \le \frac{2}{5}\lambda \) that

$$\begin{aligned} f(y) = \tfrac{\lambda }{2} \vert y \vert + \tfrac{1}{8} y^2. \end{aligned}$$

For \(\vert y \vert > \frac{2\lambda }{5}\), the function \(g_y\) is differentiable at \(-\frac{\lambda }{5} \mathrm {sgn}(y)\) and it holds

$$\begin{aligned} \partial _{x_2} g_y\bigl (-\tfrac{\lambda }{5} \mathrm {sgn}(y)\bigr ) = 2 \lambda \mathrm {sgn}(y) - \lambda \mathrm {sgn}(y) - \lambda \mathrm {sgn}(y) = 0. \end{aligned}$$

Therefore, for \(\vert y \vert > \frac{2\lambda }{5}\), the minimizer of \(g_y\) is \(-\frac{\lambda }{5} \mathrm {sgn}(y)\) and

$$\begin{aligned} f(y) = \tfrac{3\lambda }{5}\vert y \vert - \tfrac{\lambda ^{2}}{50}. \end{aligned}$$

Choosing, e.g., \(\lambda = \frac{1}{3}\) we obtain

$$\begin{aligned} f (y) = \left\{ \begin{array}{ll} \frac{1}{6} \vert y \vert + \frac{1}{8} y^{2} &{} \vert y \vert \le \frac{2}{15} \\ \tfrac{1}{5}\vert y \vert - \frac{1}{450} &{} \vert y \vert > \frac{2}{15} \end{array} \right. , \end{aligned}$$

which is a good approximation of \(\tfrac{1}{5} \vert y \vert \).
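The computations of this example can be verified numerically. Since \(T^*T = \Vert T\Vert ^2 = 5\), Lemma 3.1 yields \(\Vert \cdot \Vert _{\mathcal {H}_T} = \vert \cdot \vert \), so the proximity operator of f can be approximated by a simple grid search over \(\mathbb R\). The following sketch (grid size and tolerance are ad hoc choices) compares it with \(T^\dagger S_\lambda (T\cdot )\).

```python
import numpy as np

lam = 1.0 / 3.0
T = np.array([[1.0], [2.0]])                 # T : R -> R^2
T_pinv = np.linalg.pinv(T)                   # = (1/5) (1, 2)

def soft(z):
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def f(y):
    # closed form of f derived above (lambda = 1/3)
    ay = np.abs(y)
    return np.where(ay <= 2.0 / 15.0, ay / 6.0 + y**2 / 8.0, ay / 5.0 - 1.0 / 450.0)

ys = np.linspace(-3.0, 3.0, 200001)          # grid for the one-dimensional prox
for x in np.linspace(-2.0, 2.0, 21):
    lhs = float(T_pinv @ soft(T.flatten() * x))          # T^dagger S_lambda(Tx)
    rhs = ys[np.argmin(0.5 * (x - ys) ** 2 + f(ys))]     # prox_f(x) by grid search
    assert abs(lhs - rhs) < 1e-3
print("T^dagger S_lambda(T .) coincides with prox_f on the test points")
```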

5 Proximal Neural Networks

In this section, we consider neural networks (NNs) consisting of \(K \in {\mathbb N}\) layers with dimensions \(n_{1}, \ldots , n_{K}\) defined by mappings \(\Phi = \Phi (\cdot \,; u):{\mathbb R}^{d} \rightarrow {\mathbb R}^{n_{K}}\) of the form

$$\begin{aligned} \Phi \left( x;u\right) := A_K \sigma \circ A_{K-1}\sigma \circ \cdots \sigma \circ A_1(x). \end{aligned}$$
(14)

Such NNs are composed of affine functions \(A_{k}:{\mathbb R}^{n_{k-1}} \rightarrow {\mathbb R}^{n_{k}}\) given by

$$\begin{aligned} A_{k}(x) := L_{k} x + b_{k}, \qquad k =1,\ldots , K, \end{aligned}$$
(15)

with weight matrices \(L_{k} \in {\mathbb R}^{n_{k}, n_{k-1}}\), \(n_{0}=d\), bias vectors \(b_{k} \in {\mathbb R}^{n_{k}}\) as well as a non-linear activation \(\sigma :\mathbb {R} \rightarrow \mathbb {R}\) acting at each component, i.e., for \({x}= (x_{j})_{j=1}^{n}\) we have \(\sigma (x) = (\sigma (x_{j}))_{j=1}^{n}\). The parameter set \(u:=\left( L_k,b_k\right) _{k=1}^{K}\) of such a NN has the overall dimension \(D:= n_0 n_1 + n_1 n_2 + \dots + n_{K-1}n_K + n_1 + \dots + n_K\). For an illustration see Fig. 1.

Fig. 1: Model of a NN with three hidden layers, i.e., \(d=4\), \(K=4\), \(n_1=n_2=n_3=5\), \(n_4=1\)

In [13], the notion of stable activation functions was introduced. An activation function \(\sigma :\mathbb R \rightarrow \mathbb R\) is called stable if it is monotone increasing, 1-Lipschitz continuous and satisfies \(\sigma (0) = 0\). The following result was shown in [13].

Lemma 5.1

A function \(\sigma :\mathbb {R} \rightarrow \mathbb {R}\) is a stable activation function if and only if there exists \(g \in \Gamma _0(\mathbb R)\) having 0 as a minimizer such that \(\sigma = \mathrm {prox}_{g}\).

Various common activation functions \(\sigma \) and corresponding functions \(g \in \Gamma _0(\mathbb R)\) are listed in Table 3 in the appendix. For \(T_k \in \mathbb R^{n_k,d}\), we consider the norm (8) and denote it by

$$\begin{aligned} \Vert x\Vert _{T_{k}} := \bigl ( \Vert T_{k} x\Vert _2^2/\Vert T_{k} \Vert _2^2 + \Vert (I - T_{k}^\dagger T_{k})x\Vert _2^2 \bigr )^\frac{1}{2}, \qquad x \in \mathbb R^d. \end{aligned}$$
(16)

In the previous sections, we have considered two different kinds of proximity operators, namely \(\mathrm {prox}_g\) with respect to the Euclidean norm

$$\begin{aligned} \mathrm {prox}_{g}(x) = {\mathop {\hbox {argmin}}\limits _{y \in \mathbb R^d}} \bigl \{ \tfrac{1}{2} \Vert x-y\Vert _{2}^2 + g(y) \bigr \}, \end{aligned}$$
(17)

and \(\mathrm {prox}_{T_k,g}\) with respect to the norm (16)

$$\begin{aligned} \mathrm {prox}_{T_k,g}(x) = {\mathop {\hbox {argmin}}\limits _{y \in \mathbb R^d}} \bigl \{ \tfrac{1}{2} \Vert x-y\Vert _{T_k}^2 + g(y) \bigr \}. \end{aligned}$$

Further, we derived a function \(f_k\) depending on g, \(T_k\) and \(b_k\), see Theorem 3.4, such that

$$\begin{aligned} \mathrm {prox}_{T_k,f_k}(x) = {\mathop {\hbox {argmin}}\limits _{y \in \mathbb R^d}} \left\{ \tfrac{1}{2} \Vert x-y\Vert _{T_k}^2 + f_k(y) \right\} = T_k^\dagger \mathrm {prox}_g (T_k x + b_k). \end{aligned}$$

Based on our observations in the previous sections, we consider the following special NNs. We choose a stable activation function \(\sigma = \mathrm {prox}_g\) for some \(g \in \Gamma _0(\mathbb R)\) and matrices \(T_{k} \in {\mathbb R}^{n_{k} , d}\), as well as bias vectors \(b_{k} \in {\mathbb R}^{n_{k}}\), \(k=1, \ldots , K\), and construct according to (15), with the convention \(T_{0} := I_{d}\), the affine mappings

$$\begin{aligned} A_{k}(x) := \underbrace{T_{k} T_{k-1}^{\dagger }}_{L_k} x + b_{k} , \qquad k=1,\ldots ,K. \end{aligned}$$
(18)

Then, the NN \(\Phi :\mathbb {R}^d\rightarrow \mathbb {R}^{n_K}\) in (14) with \(A_{k}\), \(b_{k}\) in (18) can be rewritten as

$$\begin{aligned} \Phi \left( x;u\right)&= T_K \, T_{K-1}^\dagger \sigma \bigl ( T_{K-1} \ldots T_2^\dagger \sigma \bigl ( T_2 T_{1}^\dagger \sigma (T_1 x + b_1) + b_2 \bigr ) \ldots \bigr )+ b_K\nonumber \\&= T_K\, \mathrm {prox}_{T_{K-1},f_{K-1}} \circ \cdots \circ \mathrm {prox}_{T_{1},f_1} (x) + b_K. \end{aligned}$$
(19)

We call \(\Phi \) a proximal neural network (PNN) with network parameters \(u := (T_k,b_k)_{k=1}^{K}\).
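A direct implementation of the forward pass (19) only needs the matrices \(T_k\), the biases \(b_k\) and a stable activation function. The following numpy sketch (function and variable names are ours, not the paper's) illustrates this for \(K=3\) with Stiefel matrices and soft shrinkage.

```python
import numpy as np

def pnn_forward(x, Ts, bs, sigma):
    """Forward pass of the PNN (19) with Ts = [T_1, ..., T_K], bs = [b_1, ..., b_K]."""
    z = sigma(Ts[0] @ x + bs[0])                          # T_1 x + b_1
    for k in range(1, len(Ts) - 1):
        z = sigma(Ts[k] @ (np.linalg.pinv(Ts[k - 1]) @ z) + bs[k])
    return Ts[-1] @ (np.linalg.pinv(Ts[-2]) @ z) + bs[-1]

rng = np.random.default_rng(3)
d, n1, n2, lam = 5, 8, 8, 0.1
T1, _ = np.linalg.qr(rng.standard_normal((n1, d)))        # T_1^* T_1 = I_d
T2, _ = np.linalg.qr(rng.standard_normal((n2, d)))        # T_2^* T_2 = I_d
Ts = [T1, T2, np.eye(d)]                                  # T_K = I_d
bs = [rng.standard_normal(n1), rng.standard_normal(n2), np.zeros(d)]
sigma = lambda z: np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)
print(pnn_forward(rng.standard_normal(d), Ts, bs, sigma))
```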

Next, we investigate stability properties of such networks. Recall that an operator \(\Psi :\mathcal {H}\rightarrow \mathcal {H}\) on a Hilbert space \(\mathcal {H}\) is \(\alpha \)-averaged, \(\alpha \in (0,1)\), if there exists a nonexpansive operator \(R:\mathcal {H}\rightarrow \mathcal {H}\) such that

$$\begin{aligned} \Psi = \alpha R + (1-\alpha ) I_{\mathcal {H}}. \end{aligned}$$

The following theorem summarizes properties of \(\alpha \)-averaged operators, cf. [5] and [38] for the third statement.

Theorem 5.2

Let \(\mathcal {H}\) be a separable real Hilbert space. Then the following holds true:

  1. (i)

    An operator on \(\mathcal {H}\) is firmly nonexpansive if and only if it is \(\frac{1}{2}\)-averaged.

  2. (ii)

    The concatenation of K operators which are \(\alpha _k\)-averaged with respect to the same norm is \(\alpha \)-averaged with \( \alpha = \frac{K}{K-1 + 1/\max _k \alpha _k} \).

  3. (iii)

    For an \(\alpha \)-averaged operator \(\Psi :\mathcal {H}\rightarrow \mathcal {H}\) with a nonempty fixed point set, the sequence generated by the iteration

    $$\begin{aligned} x^{(r+1)} = \Psi \bigl (x^{(r)} \bigr ) \end{aligned}$$

    converges weakly for every starting point \(x^{(0)} \in \mathcal {H}\) to a fixed point of \(\Psi \).

In the following, we study special PNNs, which are \(\alpha \)-averaged operators such that \(x^{(r+1)} = \Phi (x^{(r)};u)\) converges to a fixed point of \(\Phi \) if such a point exists.

Lemma 5.3

  1. (i)

    Let \(T_k\in \mathbb R^{n_k,d}\) fulfill \(T_k^* T_k = \Vert T_k \Vert ^2 I_{d}\) or \(T_k T_k^* = \Vert T_k \Vert ^2 I_{n_{k}}\) for all \(k=1,\ldots , K-1\) and let \(T_K = I_d\). Then \(\Phi \) in (19) is \(\alpha \)-averaged with \(\alpha = \frac{K-1}{K}\).

  2. (ii)

Let \(T_1 \in \mathbb R^{n_1,d}\) with full column rank fulfill \(\Vert T_1\Vert ^2 T_k^* T_k =\Vert T_k\Vert ^2 T_1^* T_1\) for \(k=1,\ldots ,K-1\) and \(T_K = I_d\). Then \(\Phi \) in (19) is \(\alpha \)-averaged with \(\alpha = \frac{K-1}{K}\).

Proof

(i) By Lemma 3.1, we know that \(\Vert \cdot \Vert _{T_k} = \Vert \cdot \Vert _2\) so that \(\Phi \) is the concatenation of \(K-1\) proximity operators on \(\mathbb R^d\) with respect to the Euclidean norm. More precisely,

$$\begin{aligned} \Phi (x;u) = T_K\,\mathrm {prox}_{f_{K-1}} \circ \cdots \circ \mathrm {prox}_{f_1} (x) + b_K \end{aligned}$$

with \(f_k\) as in Theorem 3.4. Now, the assertion follows from Theorem 5.2.

(ii) By assumption, \(T_k^* T_k = \Vert T_k\Vert ^2 T_1^* T_1/\Vert T_1\Vert ^2\) is positive definite, so that every \(T_k\) is injective and \(\Vert x\Vert _{T_k}^2 = x^* T_k^* T_k x /\Vert T_k\Vert ^2 = x^* T_1^* T_1 x/\Vert T_1\Vert ^2 = \Vert x\Vert _{T_1}^2\). Hence, \(\Phi \) becomes the concatenation of \(K-1\) proximity operators on \(\mathbb R^d\) all with respect to the \(\Vert \cdot \Vert _{T_1}\) norm. Again, the assertion follows from Theorem 5.2.    \(\square \)

Remark 5.4

Lemma 5.3(i) can be generalized to the case where \(T_K\in \mathbb {R}^{d,d}\) is a symmetric positive semi-definite matrix with norm not larger than 1. In this case, \(T_K\) can be written in the form \(T_K=Q^*Q\) for some \(Q\in \mathbb {R}^{d,d}\) with \(\Vert Q\Vert ^2=\Vert Q^*Q\Vert =\Vert T_K\Vert \le 1\). Thus, for every \(x,y\in \mathbb {R}^d\),

$$\begin{aligned} \Vert T_Kx +b_K -(T_Ky+b_K)\Vert ^2&= \Vert Q^*Q(x-y)\Vert ^2 \le \Vert Q(x-y)\Vert ^2 \\&=\langle Q(x-y), Q(x-y)\rangle \\&= \langle x-y, T_Kx+b_K - (T_Ky+b_K) \rangle . \end{aligned}$$

This shows that \(T_K\cdot +b_K\) is firmly nonexpansive and therefore \(\frac{1}{2}\)-averaged. Consequently, \(\Phi \) in (19) is the concatenation of K \(\frac{1}{2}\)-averaged operators with respect to the Euclidean norm. Hence, \(\Phi \) is itself \(\alpha \)-averaged with \(\alpha = \frac{K}{K+1}\).

Remark 5.5

In [13], the following NN structure was studied: Let \(\mathcal {H}_0, \ldots ,\mathcal {H}_K\) be a sequence of real Hilbert spaces and \(\mathcal {H}_0 = \mathcal {H}_K = \mathcal {H}\). Further, let \(W_k \in {\mathcal B} (\mathcal {H}_{k-1},\mathcal {H}_k)\) and \(P_k:\mathcal {H}_k \rightarrow \mathcal {H}_k\), \(k=1,\ldots ,K\) be firmly nonexpansive operators. For this case, Combettes and Pesquet [13] have posed conditions on \(W_k\) such that

$$\begin{aligned} \Psi := W_K \circ P_{K-1} \circ W_{K-1} \circ \cdots \circ W_2 \circ P_1 \circ W_1 \end{aligned}$$
(20)

is \(\alpha \)-averaged for some \(\alpha \in (1/2,1)\). For \(\mathcal {H}= \mathbb R^d\) equipped with the Euclidean norm, \(\mathcal {H}_k = \mathbb R^d\) equipped with the norm (16) and \(T_K = I_d\), our PNN \(\Phi \) has exactly the form (20) with \(P_k := \mathrm {prox}_{T_k,f_k}:\mathcal {H}_k \rightarrow \mathcal {H}_k\) and the embedding operators \(W_k:\mathcal {H}_{k-1} \hookrightarrow \mathcal {H}_k\), \(k=1,\ldots ,K\). For the special PNNs in Lemma 5.3 it holds \(W_k = I_d\), such that the conditions in [13] are fulfilled.

In the rest of this paper, we restrict our attention to matrices \(T_k:\mathbb R^{d} \rightarrow \mathbb R^{n_k}\) fulfilling

$$\begin{aligned} T_k^* T_k = I_{d} \quad \mathrm {or} \quad T_k T_k^* = I_{n_{k} }, \quad k=1,\ldots , K-1, \end{aligned}$$

and \(T_{K} = I_{d}\), i.e., the rows, resp. columns of \(T_k\) form a Parseval frame. Then, the PNN in (19) has the form

$$\begin{aligned} \mathrm {(PPNN)} \qquad \qquad \Phi \left( x;u\right) = T_K \circ \mathrm {prox}_{f_{K-1}} \circ \cdots \circ \mathrm {prox}_{f_1} (x) + b_K, \end{aligned}$$

with the “usual” proximity operator, cf. (17), and

$$\begin{aligned} f_k(x)&= g \square \bigl ( \tfrac{1}{2} \Vert \cdot \Vert _2^2 + \iota _{\mathcal {N}(T_k^*)} \bigr ) (T_k x + b_k) \quad \text{ if } \; d \le n_k,\\ f_k(x)&= g(T_k x + b_k) + \iota _{\mathcal {R}(T_k^*)}(x) \quad \text{ if } \; d \ge n_k. \end{aligned}$$

Due to the use of Parseval frames, we call these networks Parseval (frame) proximal neural networks (PPNNs). By our previous considerations, see Lemma 5.3, PPNNs are averaged operators.

Remark 5.6

An interesting result follows from convergence considerations of the cyclic proximal point algorithm, see [7]. Let \(\{\lambda _r\}_{r \in \mathbb N} \in \ell _2 \setminus \ell _1\). Then, for every \(x^{(0)} \in \mathbb R^d\), the sequence generated by

$$\begin{aligned} x^{(r+1)} := \mathrm {prox}_{\lambda _r f_{K-1}} \circ \cdots \circ \mathrm {prox}_{\lambda _r f_1} \bigl (x^{(r)}\bigr ) \end{aligned}$$
(21)

converges to a minimizer of \(f_1 + \cdots + f_{K-1}\). In particular, Theorem 3.4 implies that for orthogonal matrices \(T_k\), \(k=1,\ldots ,K-1\) and \(T_K = I_d\), \(b_K = 0\), the sequence \(\{x^{(r)}\}_{r \in \mathbb N}\) in (21) converges to

$$\begin{aligned} \hat{x} \in {\mathop {\hbox {argmin}}\limits _{x}} \sum _{k=1}^{K-1} g(T_k x - b_k). \end{aligned}$$

6 Training PPNNs on Stiefel Manifolds

In this section, we show how to train PPNNs.

Remark 6.1

According to Lemma 5.3(i), we could add more flexibility to our model by allowing tight frames instead. Then, we must train an additional scaling constant, which does not introduce difficulties in the training process and may be useful for special applications, see [2]. In our numerical experiments, we omitted the additional scaling constant as we do not want to focus on this particular issue.

In PPNNs, we assume that either \(T_k\) or \(T_k^*\), \(k=1,\ldots ,K-1\), is an element of the Stiefel manifold

$$\begin{aligned} \mathrm {St}\big ( \min (n_k,d ), \max (n_k,d)\big ), \quad k=1,\ldots , K-1. \end{aligned}$$

The following facts on Stiefel manifolds can be found, e.g., in [1]. For \(d\le n\), the (compact) Stiefel manifold is defined as

$$\begin{aligned} \mathrm {St}(d,n) := \bigl \{ T \in \mathbb R ^{n,d} : T^* T = I_d\bigr \}. \end{aligned}$$

For \(d=1\) this reduces to the sphere \(\mathbb S^{n-1}\), and for \(d=n\) we obtain the orthogonal group \(\mathrm {O}(n)\). In general, \(\mathrm {St}(d,n)\) is a manifold of dimension \(nd -\frac{1}{2} d(d+1)\) with tangent space at \(T \in \mathrm {St}(d,n)\) given by

$$\begin{aligned} {\mathcal T}_T \mathrm {St}(d,n) =\bigl \{T U + T_\perp V: U^* = -U, V\in \mathbb {R}^{n-d,d}\bigr \}, \end{aligned}$$

where the columns of \(T_{\perp }\in \mathbb {R}^{n,n-d}\) form an orthonormal basis of the orthogonal complement of \(\mathcal {R}(T)\), i.e., \(T^*_{\perp } T_{\perp } = I_{n-d}\) and \(T^*T_{\perp }=0\). The Riemannian gradient of a function on \(\mathrm {St}(d,n)\) can be obtained by the orthogonal projection of its gradient in \(\mathbb R^{n,d}\) onto the tangent space \({\mathcal T}_T \mathrm {St}(d,n)\). The orthogonal projection of \(X \in \mathbb {R}^{n,d}\) onto \({\mathcal T}_T \mathrm {St}(d,n)\) is given by

$$\begin{aligned} P_T X&= (I_n - T T^*) X + \tfrac{1}{2} T(T^* X - X^* T),\end{aligned}$$
(22)
$$\begin{aligned}&= WT, \qquad W := \hat{W}-\hat{W}^*,\quad \hat{W} := XT^*-\tfrac{1}{2} T(T^*XT^*). \end{aligned}$$
(23)

To emphasize that for fixed T the matrix W depends on X, we will also write \(W_X\). A retraction \(\mathcal {R}\) on the manifold \(\mathrm {St}(d,n)\) is a smooth mapping from the tangent bundle of \(\mathrm {St}(d,n)\) to the manifold fulfilling \({\mathcal R}_{T}(0) = T\), where 0 is the zero element in \({\mathcal T}_T \mathrm {St}(d,n)\), and with the identification \({\mathcal T}_0({\mathcal T}_{T} \mathrm {St}(d,n)) \cong {\mathcal T}_T \mathrm {St}(d,n)\) the local rigidity condition \( D {\mathcal R}_{T} (0) = \mathrm {Id}_{ {\mathcal T}_T \mathrm {St}(d,n)} \) holds true. A well-known retraction on \(\mathrm {St}(d,n)\) is

$$\begin{aligned} \tilde{\mathcal R}_T(X)=\mathrm {qf}(T+X), \quad X \in {\mathcal T}_T\mathrm {St}(d,n), \end{aligned}$$
(24)

where \(\mathrm {qf}(A)\) denotes the Q factor of the decomposition of a matrix \(A\in \mathbb {R}^{n,d}\) with linearly independent columns as \(A=QR\) with \(Q\in \mathrm {St}(d,n)\) and R an upper triangular matrix of size \(d \times d\) with strictly positive diagonal elements. The complexity of the QR decomposition using the Householder algorithm is \(2d^2(n-d/3)\), see [21]. Since the computation of the QR decomposition appears to be time consuming on a GPU, we prefer to apply another retraction, based on the Cayley transform of skew-symmetric matrices W in (23), namely

$$\begin{aligned} {\mathcal R}_T(X)=(I_n-\tfrac{1}{2} W)^{-1}(I_n +\tfrac{1}{2} W)T, \quad X \in {\mathcal T}_T\mathrm {St}(d,n), \end{aligned}$$
(25)

see [36, 48]. By straightforward computation it can be seen that \(W_X\) and \(W_{P_T X}\) coincide, so that the retraction (25) extended to the whole space \(\mathbb R^{n,d}\) fulfills

$$\begin{aligned} {\mathcal R}_T (X) = {\mathcal R}_T (P_T X), \qquad X \in \mathbb R^{n,d}. \end{aligned}$$
(26)

Remark 6.2

The retraction (25) has the drawback that it contains a matrix inversion. In our numerical algorithm, the following simple fixed point iteration is used for computing the matrix \(R = {\mathcal R}_T(X)\) with fixed T and X. By definition, R fulfills the fixed point equation

$$\begin{aligned} R = \tfrac{1}{2} W R + (I_n + \tfrac{1}{2} W) T. \end{aligned}$$
(27)

Starting with an arbitrary \(R^{(0)} \in \mathrm {St}(d,n)\), we apply the iteration

$$\begin{aligned} R^{(r+1)} := \tfrac{1}{2} W R^{(r)} + (I_n + \tfrac{1}{2} W) T, \end{aligned}$$

which converges by Banach’s fixed point theorem to the fixed point of (27) if \(\frac{1}{2} \rho (W) < 1\), where \(\rho (W)\) denotes the spectral radius of W.
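Both the retraction (25) and the fixed point iteration can be realized in a few lines; the following sketch (our own names, an ad hoc number of iterations) checks that the result lies on the Stiefel manifold and that the two computations agree.

```python
import numpy as np

def cayley_retraction(T, X, iters=None):
    """Retraction (25) at T for X, via direct solve or the fixed point iteration (27)."""
    n = T.shape[0]
    W_hat = X @ T.T - 0.5 * T @ (T.T @ X @ T.T)
    W = W_hat - W_hat.T
    if iters is None:
        return np.linalg.solve(np.eye(n) - 0.5 * W, (np.eye(n) + 0.5 * W) @ T)
    R = T.copy()
    for _ in range(iters):                      # converges if rho(W)/2 < 1
        R = 0.5 * W @ R + (np.eye(n) + 0.5 * W) @ T
    return R

rng = np.random.default_rng(5)
n, d = 7, 3
T, _ = np.linalg.qr(rng.standard_normal((n, d)))
X = 0.1 * rng.standard_normal((n, d))           # small direction, small rho(W)
R1 = cayley_retraction(T, X)
R2 = cayley_retraction(T, X, iters=50)
print(np.allclose(R1.T @ R1, np.eye(d)), np.linalg.norm(R1 - R2))
```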

We want to train a PPNN by minimizing

$$\begin{aligned} \mathcal J(u) := \sum _{i=1}^N {\ell } \bigl (\Phi (x_i;u); y_i\bigr ), \end{aligned}$$
(28)

where \({\ell }:\mathbb R^d \times \mathbb R^d \rightarrow \mathbb R\) is a loss function which is differentiable with respect to its first d variables.

Example 6.3

Let us specify two special cases of PPNNs with one layer.

  1. (i)

    For one layer without bias and componentwise soft shrinkage \(\sigma \) as activation function, i.e., summands

    $$\begin{aligned} \sum _{i=1}^N {\ell } \bigl (T_1^* \sigma (T_1 x_i);y_i \bigr ), \quad T_1 \in \mathrm {St}(d,n_1), \end{aligned}$$

    we learn Parseval frames, e.g., for denoising tasks with \(x_i\) as a noisy version of \(y_i\). Here, we want to mention the significant amount of work on dictionary learning, see [18], which starts with the same goal.

  2. (ii)

    For \(x_i = y_i\), \(i=1,\ldots ,N\), the above network could be used as a so-called auto-encoder. Again, for one layer without activation function, \(b_1 = 0\) and \({\ell }(x;y) = h(\Vert x-y\Vert )\) with some norm \(\Vert \cdot \Vert \) on \(\mathbb R^d\), we get

    $$\begin{aligned} \sum _{i=1}^N {\ell } \left( T_1 ^* T_1 x_i ;x_i \right) = \sum _{i=1}^N h\bigl ( \Vert (I_d - T_1 ^* T_1) x_i \Vert \bigr ), \quad T_1^* \in \mathrm {St}(n_1,d). \end{aligned}$$

    For the Euclidean norm and \(h(x) = x^2\) we get the classical PCA approach and for \(h(x) = x\) the robust rotationally invariant \(L_1\)-norm PCA, recently discussed in [30, 35].

The following remark points out that special cases of our PPNNs were already considered in the literature.

Remark 6.4

In [26], NNs with weight matrices \(L_k \in \mathbb R^{n_k,n_{k-1}}\), \(k \in \{1,\ldots ,K-1\}\), (or their transpose) lying in a Stiefel manifold were examined. The authors called this approach optimization over multiple dependent Stiefel manifolds (OMDSM). Indeed, for the following reasons, these NNs are special cases of our PPNNs if \(n_k \le d\) for all \(k=1,\ldots ,K-1\). In particular, this implies that the NNs considered in [26] (with appropriately chosen last layer) are averaged operators.

(i) Case \(n_{k} \le n_{k-1}\): Let \(L_k^* \in \mathrm {St}(n_k, n_{k-1})\), i.e., \(L_k L_k^* = I_{n_k}\). Choosing an arbitrary fixed \(T_{k-1} \in \mathbb R^{n_{k-1},d}\) with \(T_{k-1} T_{k-1}^* = I_{n_{k-1}}\), we want to find \(T_{k} \in \mathbb R^{n_{k},d}\) such that

$$\begin{aligned} T_{k} T_{k}^* = I_{n_k} \quad \mathrm {and} \quad L_k = T_k T_{k-1}^*. \end{aligned}$$
(29)

It is straightforward to verify that \(T_k := L_k T_{k-1}\) has the desired properties.

Note that if the transposes of \(T_k\) and \(T_{k-1}\) are in a Stiefel manifold, this does not necessarily hold for the transpose of \(T_k T_{k-1}^*\). Therefore, our PPNNs are more general.

(ii) Case \(n_{k-1}< n_k\): Let \(L_k \in \mathrm {St}(n_{k-1},n_k)\), i.e., \(L_k^* L_k = I_{n_{k-1}}\). For an arbitrary fixed \(T_{k-1} \in \mathbb R^{n_{k-1},d}\) with \(T_{k-1} T_{k-1}^* = I_{n_{k-1}}\), we want to find \(T_{k} \in \mathbb R^{n_{k},d}\) fulfilling (29). To this end, we complete \(L_k\) to an orthogonal matrix \(\tilde{L}_k \in \mathbb {R}^{n_k,n_k}\) and \(T_{k-1}\) to a matrix \(\tilde{T}_{k-1}\in \mathbb {R}^{n_k,d}\) with orthogonal rows. By straightforward computation we verify that \(T_k := \tilde{L}_k \tilde{T}_{k-1}\) satisfies (29) such that this case also fits into our PPNN framework.

We apply a stochastic gradient descent algorithm on the Stiefel manifold to find a minimizer of (28). To this end, we compute the Euclidean gradient with respect to one layer and apply the usual backpropagation for multiple layers.

Lemma 6.5

Let

$$\begin{aligned} J(T,b) := \ell \bigl (T^* \sigma (Tx+b);y\bigr ), \end{aligned}$$

where T or \(T^*\) is in \(\mathrm {St}(d,n)\), and \(\ell \) and \(\sigma \) are differentiable. Set

$$\begin{aligned} r := \sigma (Tx+b),\quad s := T^*\sigma (Tx+b), \quad t := \nabla \ell \bigl (T^* \sigma (Tx+b);y\bigr ), \quad \Sigma := \mathrm {diag}\bigl (\sigma '(Tx + b) \bigr ), \end{aligned}$$

where the gradient of \(\ell \) is taken with respect to the first d variables. Then, the Euclidean gradients are given by

$$\begin{aligned} \nabla _T J(T,b)&= -T (t s^* + s t^*) + r t^* + \Sigma \, T t x^*,\qquad \nabla _b J(T,b) = \Sigma \, T t. \end{aligned}$$

The proof follows by straightforward computations, which are carried out in the Appendix.
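As an illustration, the following minimal NumPy sketch evaluates the gradient formulas of Lemma 6.5 for the componentwise soft shrinkage activation and the quadratic loss used in Section 7; the function names and this particular choice of \(\ell\) are our own.

```python
import numpy as np

def soft(x, lam):
    """Componentwise soft shrinkage S_lambda."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def soft_prime(x, lam):
    """Derivative of S_lambda (defined almost everywhere)."""
    return (np.abs(x) > lam).astype(float)

def gradients(T, b, x, y, lam=0.1):
    """Euclidean gradients of J(T, b) = l(T^* sigma(Tx + b); y) as stated in Lemma 6.5,
    here with sigma = S_lambda and the quadratic loss l(s; y) = ||s - y||_2^2,
    so that t = nabla l(s; y) = 2 (s - y)."""
    z = T @ x + b
    r = soft(z, lam)                      # r = sigma(Tx + b)
    s = T.T @ r                           # s = T^* sigma(Tx + b)
    t = 2.0 * (s - y)                     # t = nabla l(s; y)
    Sigma = np.diag(soft_prime(z, lam))   # Sigma = diag(sigma'(Tx + b))
    grad_T = (-T @ (np.outer(t, s) + np.outer(s, t))
              + np.outer(r, t) + Sigma @ np.outer(T @ t, x))
    grad_b = Sigma @ (T @ t)
    return grad_T, grad_b
```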

Now, we can formulate the stochastic gradient descent (SGD) for \(\mathcal J\) as in Algorithm 1. This algorithm works for an arbitrary retraction, in particular for the retraction in (24). In our numerical computations, we use the special retraction (25) in connection with the iteration scheme. Then, by (26), the projection in step 3 of the algorithm can be skipped and the retraction can be applied directly to the Euclidean gradient.

[Algorithm 1]
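To make the structure of Algorithm 1 concrete, the following minimal NumPy sketch implements the per-batch update for a single layer. All names are ours, and the QR-based retraction is only one possible choice; the paper uses the retraction (25), for which the projection step can be skipped.

```python
import numpy as np

def qr_retraction(T, X):
    """One possible retraction: the Q factor of T + X (an assumption, not necessarily
    the retraction (25) used in the paper)."""
    Q, R = np.linalg.qr(T + X)
    return Q * np.where(np.diag(R) < 0.0, -1.0, 1.0)

def sgd_stiefel(T, b, data, grad_fn, retract, lr, batch_size, epochs, rng):
    """Sketch of the SGD scheme of Algorithm 1 for a single layer (T, b): average the
    Euclidean gradients of Lemma 6.5 over a batch, take a gradient step, and map the
    weight matrix back to the Stiefel manifold with a retraction."""
    N = len(data)
    for _ in range(epochs):
        perm = rng.permutation(N)
        for start in range(0, N, batch_size):
            batch = [data[i] for i in perm[start:start + batch_size]]
            GT, Gb = np.zeros_like(T), np.zeros_like(b)
            for x, y in batch:
                gT, gb = grad_fn(T, b, x, y)
                GT += gT
                Gb += gb
            T = retract(T, -lr * GT / len(batch))  # retraction of the gradient step
            b = b - lr * Gb / len(batch)           # plain Euclidean update for the bias
    return T, b
```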

7 Some Numerical Results

In this section, we present simple numerical results to get a first impression of the performance of PPNNs for denoising and classification. More sophisticated examples, which include the full repertoire of fine-tuning of NNs, will follow in an experimental paper, see also our conclusions. Throughout this section, we use the quadratic loss function \(\ell (x;y) := \Vert x-y\Vert _2^2\). For training, we apply a stochastic gradient descent algorithm. We initialize the matrices \(T_k\in \mathrm {St}(d,n)\) randomly using the orthogonal initializer from TensorFlow. That is, we generate a matrix \(\tilde{T}_k\in \mathbb {R}^{n,d}\) with independent random entries following the standard normal distribution and use the initialization \(T_k=\mathrm {qf}(\tilde{T}_k)\). The batch size and the learning rate are specified separately for each example.
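In code, this initialization amounts to the following minimal sketch; the sign convention of qf (nonnegative diagonal of the triangular factor) is our assumption, and the sizes match the frame experiment below.

```python
import numpy as np

def qf(A):
    """Q factor of the reduced QR decomposition; we assume the usual sign convention
    that the diagonal of the triangular factor is nonnegative."""
    Q, R = np.linalg.qr(A)
    return Q * np.where(np.diag(R) < 0.0, -1.0, 1.0)

rng = np.random.default_rng(0)
T1 = qf(rng.standard_normal((1024, 128)))    # T_1 = qf(T~_1), sizes as in the frame experiment
assert np.allclose(T1.T @ T1, np.eye(128))   # T_1^T T_1 = I_128, i.e., T_1 in St(128, 1024)
```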

Denoising In this experiment, we compare PPNNs with Haar wavelet thresholding, both for the discrete Haar basis and for Haar frames arising from the undecimated (translation invariant) version of the Haar transform. In particular, the experiment is linked to the starting point of our considerations, namely wavelet and frame shrinkage. For further details on the corresponding filter banks we refer to [15, 37]. As a quality measure for our experiments, we choose the average peak signal-to-noise ratio (PSNR) over the test set. Recall that for a prediction \(x\in \mathbb {R}^{n}\) and ground truth \(y\in \mathbb {R}^{n}\), the PSNR is defined by

$$\begin{aligned} \mathrm {PSNR}(x,y)=10\log _{10}\Big (\frac{(\max y - \min y)^2}{\sum _{i=1}^{n} (x_i-y_i)^2}\Big ). \end{aligned}$$
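In code, this quality measure reads as follows (a direct transcription of the formula above, with x and y as NumPy arrays).

```python
import numpy as np

def psnr(x, y):
    """PSNR of a prediction x with respect to the ground truth y, as defined above."""
    return 10.0 * np.log10((y.max() - y.min()) ** 2 / np.sum((x - y) ** 2))
```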

Since we focus on the Haar filter, we restrict our attention to piecewise constant signals with mean 0. By \((x_i,y_i)\in \mathbb {R}^{d}\times \mathbb {R}^{d}\), \(i=1,\ldots ,N\), we denote pairs consisting of piecewise constant signals \(y_i\) of length \(d=2^7=128\) and their noisy versions \(x_i=y_i+\epsilon _i\), where \(\epsilon _i\) is white noise with standard deviation \(\sigma = 0.1\). For the signal generation, we choose

  • The number of constant parts of \(y_i\) as \(\max \{2,t_i\}\), where \(t_i\) is a realization of a random variable following the Poisson distribution with mean 5;

  • The positions of the discontinuities of \(y_i\) as realizations of a uniform distribution;

  • The signal intensity of \(y_i\) on each constant part as a realization of the standard normal distribution, where we finally subtract the mean of the signal.

Using this procedure, we generate training data \((x_i,y_i)_{i=1}^N\) and test data \((x_i,y_i)_{i=N+1}^{N+N_\text {test}}\) with \(N=500{,}000\) and \(N_\text {test}=1000\). The average PSNR of the noisy signals in the test set is 25.22. We use PPNNs with \(K-1 \in \{1, 2, 3\}\) layers and set \(T_K=I_d\) and \(b_K=0\). In all examples, a batch size of 32 and a learning rate of 0.5 are used.
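The following minimal NumPy sketch illustrates the data generation just described; implementation details such as drawing the discontinuity positions on the integer grid and merging coinciding breakpoints are our own choices.

```python
import numpy as np

def make_pair(rng, d=128, noise_std=0.1):
    """One pair (x, y): a piecewise constant signal y of length d with mean 0 and its
    noisy version x = y + eps."""
    parts = max(2, rng.poisson(5))                        # number of constant parts
    cuts = np.sort(rng.integers(1, d, size=parts - 1))    # uniformly drawn discontinuities
    values = rng.standard_normal(parts)                   # one intensity per constant part
    y = np.repeat(values, np.diff(np.r_[0, cuts, d]))     # piecewise constant signal
    y -= y.mean()                                         # subtract the mean
    x = y + noise_std * rng.standard_normal(d)            # white noise with sigma = 0.1
    return x, y

rng = np.random.default_rng(0)
train = [make_pair(rng) for _ in range(1000)]             # 500,000 pairs are used in the paper
```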

We are interested in two different settings:

1. Learned orthogonal matrices versus Haar basis. First, we consider PPNNs with 128 neurons in each hidden layer and the componentwise soft shrinkage \(S_\lambda \) as activation function. In particular, all matrices \(T_k\) have to be orthogonal. The denoising results of our learned PPNNs are compared with soft wavelet shrinkage with respect to the discrete orthogonal Haar basis in \(\mathbb R^{128}\), i.e., the signal is decomposed over all 6 scales by

$$\begin{aligned} \Psi (x) = H^* S_{\lambda }(H x), \end{aligned}$$
(30)

where \( H := H_2 \, \cdots \, H_7 \) with matrices

$$\begin{aligned} H_{j} := \begin{pmatrix} \tilde{H}_j & 0\\ 0 & I_{2^7 - 2^{j}} \end{pmatrix}, \quad \tilde{H}_j := \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & 1 & 0 & 0 & \ldots & 0 & 0 & 0 & 0\\ 0 & 0 & 1 & 1 & \ldots & 0 & 0 & 0 & 0\\ & & & & \vdots & & & & \\ 0 & 0 & 0 & 0 & \ldots & 1 & 1 & 0 & 0\\ 0 & 0 & 0 & 0 & \ldots & 0 & 0 & 1 & 1\\ 1 & -1 & 0 & 0 & \ldots & 0 & 0 & 0 & 0\\ 0 & 0 & 1 & -1 & \ldots & 0 & 0 & 0 & 0\\ & & & & \vdots & & & & \\ 0 & 0 & 0 & 0 & \ldots & 1 & -1 & 0 & 0\\ 0 & 0 & 0 & 0 & \ldots & 0 & 0 & 1 & -1 \end{pmatrix} \in \mathbb R^{2^j,2^j}. \end{aligned}$$
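For illustration, a minimal NumPy sketch that builds \(H = H_2 \cdots H_7\) from the blocks \(\tilde H_j\) and applies the shrinkage (30); the threshold value and the placeholder signal are only examples.

```python
import numpy as np

def haar_block(j):
    """The 2^j x 2^j matrix H~_j: pairwise averages on top, pairwise differences below."""
    m = 2 ** (j - 1)
    A = np.zeros((2 * m, 2 * m))
    for k in range(m):
        A[k, 2 * k] = A[k, 2 * k + 1] = 1.0     # averaging rows
        A[m + k, 2 * k] = 1.0                   # difference rows
        A[m + k, 2 * k + 1] = -1.0
    return A / np.sqrt(2.0)

def haar_matrix(d=128):
    """H = H_2 ... H_7 with H_j = blockdiag(H~_j, I_{d - 2^j}); H_7 is applied first."""
    H = np.eye(d)
    for j in range(7, 1, -1):
        Hj = np.eye(d)
        Hj[:2 ** j, :2 ** j] = haar_block(j)
        H = Hj @ H
    return H

def soft(x, lam):
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

H = haar_matrix()
x = np.random.default_rng(0).standard_normal(128)   # placeholder signal
x_denoised = H.T @ soft(H @ x, 0.15)                 # Psi(x) = H^* S_lambda(H x), cf. (30)
```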

The average PSNRs on the test data are given in Table 1. For determining the optimal threshold in \(S_\lambda \), we implemented two different methods. The first one is 5-fold cross validation (CV): the training data is divided into 5 subsets, and each subset is used once as a test set with the remaining samples as training set; the test loss for a given \(\lambda \) is averaged over all 5 trials to judge the quality of the model. The tested parameters \(\lambda \) are chosen in [0.05, 0.3] with steps of 0.05 for a NN with one layer; for NNs with two and three layers, the tested parameters are divided by 2 and 3, respectively. The second method is to treat \(\lambda \) as a trainable variable of the neural network and to optimize it via stochastic gradient descent (SGD) during the training process. It appears that with only one hidden layer, Haar wavelet shrinkage is still better than the learned orthogonal matrix. If we increase the number of layers, then PPNNs lead to a better average PSNR.

Two exemplary noisy signals and their denoised versions are shown in Fig. 2. Since we have learned the orthogonal matrices \(T_k\) with respect to the quadratic loss function, the visual quality of the PPNN-denoised signals is clearly not satisfactory even though the PSNR improves. The visual impression of the signals denoised by Haar wavelet shrinkage can be improved (smoother signals) by increasing the threshold to \(\lambda = 0.3\), at the price of a worse PSNR. To achieve a similar behavior with orthogonal matrices learned by PPNNs, we would have to choose a different loss function.

Table 1 PSNRs (average on test data) for denoising piecewise constant signals
Fig. 2 Two denoising examples using the Haar basis and learned orthogonal matrices. The signals denoised by PPNNs look better for an increasing number of layers. For Haar wavelet shrinkage, a smoother denoised signal can be attained by increasing the threshold to \(\lambda = 0.3\), although this signal has a smaller PSNR

2. Learned Stiefel matrices versus Haar frame. Haar wavelet shrinkage can be improved by using Haar wavelet frames within a so-called “algorithme à trous”, see [32]. We apply a method similar to (30), but with a rectangular matrix H whose rows form a Haar frame. More precisely, the Haar filter is used without subsampling. This results, in contrast to the original Haar transform, in a translation invariant multiscale transform. Instead of the matrix \(H_7\), the nonsubsampled (convolution) matrix

$$\begin{aligned} \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & 1 & 0 & 0 & \ldots & 0 & 0 & 0 & 0\\ 0 & 1 & 1 & 0 & \ldots & 0 & 0 & 0 & 0\\ & & & & \vdots & & & & \\ 0 & 0 & 0 & 0 & \ldots & 0 & 1 & 1 & 0\\ 0 & 0 & 0 & 0 & \ldots & 0 & 0 & 1 & 1\\ 1 & 0 & 0 & 0 & \ldots & 0 & 0 & 0 & 1\\ 1 & -1 & 0 & 0 & \ldots & 0 & 0 & 0 & 0\\ 0 & 1 & -1 & 0 & \ldots & 0 & 0 & 0 & 0\\ & & & & \vdots & & & & \\ 0 & 0 & 0 & 0 & \ldots & 0 & 1 & -1 & 0\\ 0 & 0 & 0 & 0 & \ldots & 0 & 0 & 1 & -1\\ -1 & 0 & 0 & 0 & \ldots & 0 & 0 & 0 & 1 \end{pmatrix} \in \mathbb R^{256,128} \end{aligned}$$

is applied to the original signal. Note that we assume a periodic continuation of the signal. In each of the following steps \(j=1, \ldots ,6\), we keep the lower part and transform the upper smoothed part again by essentially the same matrix, where \(2^{j}-1\) zeros are inserted between the filter coefficients. The output signal has size \(8\cdot 128\), where the last part of the output is just the averaged signal, which equals zero for our mean-free signals. Overall, the original signal is multiplied by \(H\in \mathbb {R}^{1024, 128}\), where for \(j\in \{0,\ldots ,6\}\) and \(i\in \{0,\ldots ,128-2j\}\) the \((i+128 j)\)-th row is given by one element of a Haar wavelet frame

$$\begin{aligned} \bigl (\, \underbrace{0\;\ldots \;0}_{i \text { times } 0} \;\; \underbrace{1\;\ldots \;1}_{j \text { times } 1} \;\; \underbrace{-1\;\ldots \;-1}_{j \text { times } -1} \;\; \underbrace{0\;\ldots \;0}_{128-i-2j \text { times } 0} \,\bigr ) \end{aligned}$$

and the last 128 rows of \(H\) are given by \((1 \;\ldots \;1)\). It is well known that for the above translation invariant Haar frame transform, a scale-dependent shrinkage has to be applied: starting with the threshold \(\lambda \), the scales should be thresholded by

$$\begin{aligned} \frac{1}{\sqrt{2}^{j}}\lambda , \quad j=0,\ldots ,6. \end{aligned}$$
(31)

For an explanation of this statement we refer to [43]. In summary, we obtain

$$\begin{aligned} \hat{\Psi }(x)=H ^*\tilde{S}_\lambda (H x), \end{aligned}$$

where \(\tilde{S}_\lambda \) denotes the scale-wise adapted thresholding.
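As an illustration, a minimal sketch of \(\tilde S_\lambda\), assuming that the coefficient vector \(Hx\) is organized in blocks of 128 entries per scale \(j = 0,\ldots,6\) as described above, followed by the 128 averaging coefficients, which we leave untouched since they vanish for mean-free signals.

```python
import numpy as np

def scale_adapted_shrinkage(c, lam, d=128):
    """S~_lambda: soft shrinkage with threshold lam / sqrt(2)^j on the j-th block of d
    coefficients, j = 0,...,6, cf. (31); the last d (averaging) coefficients are kept."""
    out = c.astype(float).copy()
    for j in range(7):
        lam_j = lam / np.sqrt(2.0) ** j
        block = out[j * d:(j + 1) * d]
        out[j * d:(j + 1) * d] = np.sign(block) * np.maximum(np.abs(block) - lam_j, 0.0)
    return out

# hat Psi(x) = H^T S~_lambda(H x), where H is the 1024 x 128 frame matrix described above
# (H_frame is a hypothetical name for that matrix):
# x_denoised = H_frame.T @ scale_adapted_shrinkage(H_frame @ x, lam=0.05)
```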

Now, we compare this scale-dependent Haar frame soft thresholding method with a learned PPNN with 1024 neurons in the hidden layers and the componentwise soft shrinkage \(S_\lambda \) as activation function. The optimal threshold in \(S_\lambda \) is again determined either by threefold cross validation, where the tested parameters are chosen in [0.01, 0.1] with steps of 0.01, or by treating \(\lambda \) as a trainable variable of the neural network and optimizing it via stochastic gradient descent. We emphasize that, in contrast to the Haar frame shrinkage procedure with (31), the same threshold is used for every component in the activation function of our PPNN. The resulting PSNRs are given in Table 2. As expected, using the same threshold on all scales in the classical Haar frame shrinkage performs worse than the scale-adapted Haar frame shrinkage. PPNNs with learned Stiefel matrices perform better for an increasing number of layers.

Finally, Fig. 3 shows the denoised versions of the signals from Fig. 2. The results are visually better than in the previous figure, although still not satisfactory due to the loss function used.

Table 2 PSNRs (average on test data) for denoising piecewise constant signals
Fig. 3 Denoising for the signals in Fig. 2. The undecimated Haar frame with scale-adapted shrinkage and learned Stiefel matrices of the same size as the Haar frame are compared for \(\lambda \) from Table 2. The PPNN-denoised signals are visually nicer than those obtained with scale-adapted Haar frame shrinkage

Classification In this example, we train a PPNN for classifying the MNIST data set. The length of the input signals is \(d=28^2\). We consider a PPNN with \(K-1 =5\) layers with \(n_1=n_2=784\), \(n_3=n_4=400\) and \(n_5=200\) neurons, and the componentwise applied ReLU activation function \(\sigma (x) = \max (0,x)\). To get 10 output elements (probabilities) in (0, 1), we use an additional sixth layer

$$\begin{aligned} g (T_K x + b_K), \quad T_K \in \mathbb R^{10,d}, \, b_K \in \mathbb R^{10} \end{aligned}$$

with another activation function \(g(x) := \frac{1}{1 + \exp (-x)}\), the sigmoid function. For training, we use a batch size of 1024 and a learning rate of 5. After 1000 epochs, we reach an accuracy of 0.9855 on the test set. One epoch takes about one second on an NVIDIA Quadro M5000 GPU. In Fig. 4, the training and test loss of our PPNN during training are plotted.
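A minimal NumPy sketch of the forward pass of this classification network follows; the random initialization is for illustration only (the paper trains the weights with SGD on the Stiefel manifold), and all names are ours.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ppnn_classifier(x, Ts, bs, T_K, b_K):
    """Forward pass: five layers x -> T_k^T relu(T_k x + b_k) with T_k T_k^T = I_{n_k},
    followed by the final layer g(T_K x + b_K) with the sigmoid g."""
    for T, b in zip(Ts, bs):
        x = T.T @ relu(T @ x + b)      # every PPNN layer maps R^d back to R^d
    return sigmoid(T_K @ x + b_K)      # ten outputs in (0, 1)

rng = np.random.default_rng(0)
d, ns = 784, [784, 784, 400, 400, 200]
Ts = [np.linalg.qr(rng.standard_normal((d, n)))[0].T for n in ns]   # random Stiefel matrices
bs = [np.zeros(n) for n in ns]
probs = ppnn_classifier(rng.standard_normal(d), Ts, bs,
                        0.01 * rng.standard_normal((10, d)), np.zeros(10))
```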

Remark 7.1

As already mentioned in Remark 6.4, NNs with Stiefel matrices were also applied in [26]. The authors of [26] reported that the training process using Riemannian optimization on the Stiefel manifold could be unstable or divergent. We do not observe such instabilities in our setting.

Fig. 4 Training loss (solid) and test loss (dashed) of a PPNN on the MNIST data set. The x-axis corresponds to the number of epochs and the y-axis to the associated loss value

Adversarial Attacks Neural networks with bounded Lipschitz constants have been successfully applied to defend against adversarial attacks, see [22, 46]. In this example, we demonstrate that a PPNN is more robust under adversarial attacks than a standard neural network. Assume that we have a neural network \(f=(f_1,\ldots ,f_{10}):\mathbb {R}^{28^2}\rightarrow (0,1)^{10}\) for classifying MNIST, e.g., the PPNN described in the previous example, and let an input \(x\in \mathbb {R}^{28^2}\) be given. Then, we perform an adversarial attack in the following way (a code sketch follows the list):

  • Set \(\nu :={\mathop {\hbox {argmax}}\nolimits _{i\in \{1,\ldots ,10\}}} f_i(x)\), and \(g := \nabla _x \tfrac{f_\nu (x)}{\Vert f(x)\Vert _1}\).

  • Initialize \(\epsilon = 10^{-2}\) and while \(\nu ={\mathop {\hbox {argmax}}\nolimits _{i\in \{1,\ldots ,10\}}} f_i(x-\epsilon g)\) update \(\epsilon =2\epsilon \).
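A minimal sketch of this attack loop; f and grad_f stand for the network and for the gradient of \(f_\nu(x)/\Vert f(x)\Vert_1\) with respect to \(x\), which in practice would be obtained by automatic differentiation, and the iteration cap is our own safeguard.

```python
import numpy as np

def attack(f, grad_f, x, eps0=1e-2, max_doublings=30):
    """Adversarial attack as described above: move against the normalized score gradient
    and double epsilon until the predicted class changes."""
    nu = int(np.argmax(f(x)))
    g = grad_f(x, nu)                                # gradient of f_nu(x) / ||f(x)||_1
    eps = eps0
    for _ in range(max_doublings):
        if int(np.argmax(f(x - eps * g))) != nu:
            break                                    # prediction has changed
        eps *= 2.0
    return eps * g, float(np.linalg.norm(eps * g))   # perturbation and the recorded norm
```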

This procedure is applied to two neural networks. More precisely, the first one is the PPNN from the previous example and the second one is a neural network with the same structure as the PPNN, but without the orthogonality constraint. We train the standard neural network using the Adam optimizer with a learning rate of \(10^{-4}\) and end up with an accuracy of 0.9863. Then, we perform an adversarial attack on both of these networks and record the norm \(\Vert \epsilon g\Vert _2\) of the noise that changes the prediction. We do this for each input \(x_k\in [0,255]^{28^2}\), \(k=1,\ldots ,10{,}000\), in the test set and compute the mean, standard deviation and median of these norms. For the PPNN, we record an average norm of \(38.28\pm 24.51\) and a median of 33.71. For the standard neural network, we record an average norm of \(30.48\pm 15.72\) and a median of 28.75. Overall, the PPNN appears to be more stable against such adversarial attacks.

8 Conclusions

In this paper, we have shown that for real Hilbert spaces \(\mathcal {H}\) and \(\mathcal {K}\), a proximity operator \(\mathrm {Prox}:\mathcal {K}\rightarrow \mathcal {K}\) and a bounded linear operator \(T:\mathcal {H}\rightarrow \mathcal {K}\) with closed range, the operator \(T^\dagger \, \mathrm {Prox}(T \cdot + b)\) with \(b \in \mathcal {K}\) is a proximity operator on \(\mathcal {H}\). As a consequence, the well-known frame soft shrinkage operator can be seen as a proximity operator. Using these new relations, we have discussed special neural networks arising from Parseval frames and stable activation functions. Our networks are Lipschitz networks, which are moreover averaged operators. They include recently proposed networks containing matrices whose transposes lie in a Stiefel manifold and interpret them from another, more general point of view.

In our future work, we want to explore for which learning tasks the higher flexibility of our PPNNs is advantageous. Taking more general operators T into account may also be useful. In particular, we will apply our PNNs within Plug-and-Play algorithms. Another question that we want to address is how to constrain our Stiefel matrices further, e.g., towards convolutional structures or sparsity constraints. Depending on the application, we have to design appropriate loss functions as well as incorporate regularizing terms.

Table 3 Stable activation functions and their corresponding proximal mappings, see [13]

For our experiments, the stochastic gradient algorithm on Stiefel manifolds worked well. However, other minimization methods could be taken into account. In [26], for example, the authors proposed an orthogonal weight normalization algorithm that was inspired by the fact that the eigenvalue decomposition is differentiable. Finally, we would like to mention that a proximal backpropagation algorithm, which takes implicit instead of explicit gradient steps to update the network parameters during training, was proposed in [19].

A better understanding of the convergence of the cyclic proximal point algorithm, see Remark 5.6, as well as suitable early stopping criteria when the network \(\Phi \) is applied iteratively, may help to design NNs and to understand their success.