1 Introduction

Wavelet and frame shrinkage operators have become very popular in recent years. A starting point was the iterative shrinkage-thresholding algorithm (ISTA) in [16], which was interpreted as a special case of the forward-backward algorithm in [14]. For relations with other algorithms see also [8, 40]. Let \(T \in \mathbb R^{n \times d}\), \(n\ge d\), have full column rank. Then, the problem

$$\begin{aligned} {\mathop {\hbox {argmin}}\limits _{y \in \mathbb R^d}} \bigl \{ \tfrac{1}{2} \Vert x -y \Vert _2^2 + \lambda \Vert Ty\Vert _{1} \bigr \}, \quad \lambda >0, \end{aligned}$$
(1)

is known as the analysis point of view. For orthogonal \(T \in {\mathbb R}^{d \times d}\), the solution of (1) is given by the frame soft shrinkage operator \(T^\dagger \, S_\lambda \, T = T^* \, S_\lambda \, T\), see Example 2.3. If \(T \in \mathbb R^{n \times d}\) with \(n \le d\) and \(T T^* = I_n\), the solution of problem (1) is given by \(I_d - T^* T + T^* S_\lambda T\), see [6, Theorem 6.15]. For arbitrary \(T \in {\mathbb R}^{n \times d}\), \(n \ge d\), there are no analytic expressions for the solution of (1) in the literature.
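For the orthogonal case, the closed-form solution can be checked numerically against the first-order optimality condition of (1). The following numpy sketch is ours (the helper soft and the chosen values of d and \(\lambda \) are merely illustrative) and verifies that \(T^* S_\lambda (Tx)\) satisfies this condition.

```python
import numpy as np

rng = np.random.default_rng(0)
d, lam = 8, 0.5
T, _ = np.linalg.qr(rng.standard_normal((d, d)))   # random orthogonal T
x = rng.standard_normal(d)

def soft(z, lam):
    # componentwise soft shrinkage S_lambda
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

y = T.T @ soft(T @ x, lam)        # claimed minimizer of (1) for orthogonal T

# optimality condition: T(x - y) must lie in lam * subdifferential of ||.||_1 at Ty
v = T @ (x - y)
z = soft(T @ x, lam)              # equals T @ y since T is orthogonal
on = np.abs(z) > 1e-12
assert np.allclose(v[on], lam * np.sign(z[on]))
assert np.all(np.abs(v[~on]) <= lam + 1e-12)
print("optimality condition of (1) verified for orthogonal T")
```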

The question whether the frame shrinkage operator can itself be seen as a proximity operator has been recently studied in [20]. The authors showed that the set-valued operator \((T^\dagger S_\lambda T)^{-1} - I_d\) is maximally cyclically monotone, which implies that \(T^\dagger S_\lambda T\) is a proximity operator with respect to some norm in \(\mathbb R^d\). In this paper, we prove that for any operator \(T\in \mathcal {B} (\mathcal {H},\mathcal {K})\) with closed range, \(b \in \mathcal {K}\) and any proximity operator \(\mathrm {Prox}:\mathcal {K}\rightarrow \mathcal {K}\) the new operator \(T^\dagger \, \mathrm {Prox}\, ( T \cdot + b) :\mathcal {H}\rightarrow \mathcal {H}\) is also a proximity operator on the linear space \(\mathcal {H}\), but equipped with another inner product. The above-mentioned finite dimensional setting is included as a special case. In contrast to [20], we directly approach the problem using a classical result of Moreau [34]. Moreover, we provide the function for the definition of the proximity operator. Here, we would like to mention that this function can also be deduced from Proposition 3.9 in [12]. However, since this deduction appears to be more space-consuming than the direct proof of our Theorem 3.4, we prefer to give a direct approach. Note that different norms in the definition of the proximity operator were successfully used in variable metric algorithms, see [10].

Recently, it was shown that many activation functions appearing in neural networks are indeed proximity operators [13]. Based on these observations and our previous findings, we consider neural networks that are concatenations of proximity operators and call them proximal neural networks (PNNs). PNNs can be considered within the framework of the variational networks proposed in [29]. For stability reasons, PNNs related to linear operators from Stiefel manifolds are of special interest. They form so-called averaged operators and are consequently nonexpansive. Orthogonal matrices have already shown advantages in training recurrent neural networks (RNNs) [3, 4, 28, 31, 47, 49]. Using orthogonal matrices, vanishing or exploding gradients in training RNNs can be avoided [17]. The more general setting of learning rectangular matrices from a Stiefel manifold was proposed, e.g., in [25], but with a different focus than in this paper. The most relevant paper with respect to our setting is [26], where the authors considered the so-called optimization over multiple dependent Stiefel manifolds (OMDSM). We will see that the NNs in [26] are special cases of our PNNs so that our analysis ensures that they are averaged operators.

Using matrices from Stiefel manifolds results in 1-Lipschitz neural networks. Consequently, our approach is naturally related to other methods for controlling the Lipschitz constant of neural networks, which provably increases robustness against adversarial attacks [46]. In [23], the constant is controlled by projecting back every weight matrix in the network whose \(\Vert \cdot \Vert _p\) norm, \(p \in [1,\infty ]\), violates a pre-defined threshold. The authors in [39] characterize the singular values of the linear map associated with convolutional layers and use this for projecting a convolutional layer onto an operator-norm ball. Another closely related approach is spectral normalization as proposed in [33], where the spectral norm of every weight matrix is enforced to be one. Compared to our approach, this only restricts the largest singular value of the linear operators arising in the neural network. Limitations of the expressiveness of networks with restricted Lipschitz constants in every layer were discussed in [2, 27]. Note that our approach does not restrict the Lipschitz constants in every individual layer. Further, none of the above approaches is able to impose more structure on the network such as being an averaged operator.

Our results may be of interest in so-called Plug-and-Play algorithms [9, 42, 45]. In these algorithms a well-behaved operator, e.g., a proximity operator, is replaced by an efficient denoiser such as a neural network. However, training a denoising framework without structure can lead to a divergent algorithm, see [41]. In contrast, it was shown in [44] that a particular version of a Plug-and-Play algorithm converges if the network is averaged.

Our paper is organized as follows: We begin with preliminaries on convex analysis in Hilbert spaces in Sect. 2. In Sect. 3, we prove our general results on the interplay between proximity and certain affine operators. As a special case we emphasize that the frame soft shrinkage operator is itself a proximity operator in Sect. 4. In Sect. 5, we use our findings to set up neural networks as a concatenation of proximity operators on \(\mathbb R^d\) equipped with different norms related to linear operators. If these operators are related to tight frames, our proposed network is actually an averaged operator. In the case of Parseval frames, the involved matrices lie in Stiefel manifolds and we end up with Parseval proximal neural networks (PPNNs). Sect. 6 deals with the training of PPNNs via stochastic gradient descent on Stiefel manifolds. In Sect. 7, we provide first numerical examples. Finally, Sect. 8 contains conclusions and addresses further research questions.

2 Preliminaries

Let \(\mathcal {H}\) be a real Hilbert space with inner product \(\langle \cdot ,\cdot \rangle \) and norm \(\Vert \cdot \Vert \). By \(\Gamma _0(\mathcal {H})\) we denote the set of proper, convex, lower semi-continuous functions on \(\mathcal {H}\) mapping into \((-\infty , \infty ]\). For \(f \in \Gamma _0(\mathcal {H})\) and \(\lambda > 0\), the proximity operator \(\mathrm {prox}_{\lambda f}:\mathcal {H}\rightarrow \mathcal {H}\) and its Moreau envelope \(M_{\lambda f}:\mathcal {H}\rightarrow \mathbb {R}\) are defined by

$$\begin{aligned} \mathrm {prox}_{\lambda f} (x)&:= {\mathop {\hbox {argmin}}\limits _{y \in \mathcal {H}}} \bigl \{ \tfrac{1}{2} \Vert x-y\Vert ^2 + \lambda f(y) \bigr \}, \\ M_{\lambda f} (x)&:= \min _{y \in \mathcal {H}} \bigl \{ \tfrac{1}{2} \Vert x-y\Vert ^2 + \lambda f(y) \bigr \}. \end{aligned}$$

Clearly, the proximity operator and its Moreau envelope depend on the underlying space \(\mathcal {H}\), in particular on the chosen inner product. Recall that an operator \(A:\mathcal {H}\rightarrow \mathcal {H}\) is called firmly nonexpansive if for all \(x,y \in \mathcal {H}\) the following relation is fulfilled

$$\begin{aligned} \Vert Ax -Ay\Vert ^2 \le \langle x-y,Ax-Ay \rangle . \end{aligned}$$
(2)

Obviously, firmly nonexpansive operators are nonexpansive.

For a Fréchet differentiable function \(\Phi :\mathcal {H}\rightarrow \mathbb {R}\), the gradient \(\nabla \Phi (x)\) at \(x \in \mathcal {H}\) is defined as the vector satisfying for all \(h \in \mathcal {H}\),

$$\begin{aligned} \langle \nabla \Phi (x), h \rangle = D\Phi (x) h, \end{aligned}$$

where \(D\Phi :\mathcal {H}\rightarrow \mathcal {B} (\mathcal {H},\mathbb {R})\) denotes the Fréchet derivative of \(\Phi \), i.e., for all \(x,h \in \mathcal {H}\),

$$\begin{aligned} \Phi (x+h) - \Phi (x) = D\Phi (x) h + o(\Vert h\Vert ). \end{aligned}$$

Note that the gradient crucially depends on the chosen inner product in \(\mathcal {H}\). The following results can be found, e.g., in [5, Propositions 12.27, 12.29].

Theorem 2.1

Let \(f \in \Gamma _0(\mathcal {H})\). Then, the following relations hold true:

(i) The operator \(\mathrm {prox}_{\lambda f} :\mathcal {H}\rightarrow \mathcal {H}\) is firmly nonexpansive.

(ii) The function \(M_{\lambda f}\) is (Fréchet) differentiable with Lipschitz-continuous gradient given by

$$\begin{aligned} \nabla M_{\lambda f}(x) = x - \mathrm {prox}_{\lambda f}(x). \end{aligned}$$

Clearly, (ii) implies that

$$\begin{aligned} \mathrm {prox}_{\lambda f} (x) = \nabla \bigl ( \tfrac{1}{2} \Vert x \Vert ^2 - M_{\lambda f}(x) \bigr ) = \nabla \Phi (x), \end{aligned}$$
(3)

where \(\Phi := \frac{1}{2} \Vert \cdot \Vert ^2 - M_{\lambda f}\) is convex as \(\mathrm {prox}_{\lambda f}\) is nonexpansive [5, Proposition 17.10]. Further, it was shown by Moreau that the following (reverse) statement also holds true [34, Corollary 10c].

Theorem 2.2

The operator \(\mathrm {Prox}:\mathcal {H}\rightarrow \mathcal {H}\) is a proximity operator if and only if it is nonexpansive and there exists a function \(\Psi \in \Gamma _0(\mathcal {H})\) with \(\mathrm {Prox}(x) \in \partial \Psi (x)\) for any \(x \in \mathcal {H}\), where \(\partial \Psi \) denotes the subdifferential of \(\Psi \).

Thanks to (3), we conclude that \(\mathrm {Prox}:\mathcal {H}\rightarrow \mathcal {H}\) is a proximity operator if and only if it is nonexpansive and the gradient of a convex, differentiable function \(\Phi :\mathcal {H}\rightarrow \mathbb {R}\). Recently, the characterization of Bregman proximity operators in a more general setting was discussed in [24]. In the following example, we recall the Moreau envelope and the proximity operator related to the soft thresholding operator.

Example 2.3

Let \(\mathcal {H}= \mathbb {R}\) with usual norm \(|\cdot |\) and \(f(x) := |x|\). Then, \(\mathrm {prox}_{\lambda f}\) is the soft shrinkage operator \(S_\lambda \) defined by

$$\begin{aligned} S_\lambda (x):= \left\{ \begin{array}{cl} x - \lambda &{} \mathrm {for} \; x > \lambda ,\\ 0 &{} \mathrm {for} \; x \in [-\lambda ,\lambda ],\\ x + \lambda &{} \mathrm {for} \; x < -\lambda , \end{array} \right. \end{aligned}$$

and the Moreau envelope is the Huber function

$$\begin{aligned} m_{\lambda | \cdot |} (x) = \left\{ \begin{array}{cl} \lambda x - \frac{\lambda ^2}{2} &{} \mathrm {for} \; x > \lambda ,\\ \frac{1}{2} x^2 &{} \mathrm {for} \; x \in [-\lambda ,\lambda ],\\ - \lambda x - \frac{\lambda ^2}{2}&{} \mathrm {for} \; x < -\lambda . \end{array} \right. \end{aligned}$$

Hence, \(\mathrm {prox}_{\lambda f} = \nabla \varphi \), where \(\varphi (x) = \frac{x^{2}}{2} - m_{\lambda | \cdot |}(x)\), i.e.,

$$\begin{aligned} \varphi (x) = \left\{ \begin{array}{cl} \tfrac{1}{2}(x-\lambda )^2&{} \mathrm {for} \; x > \lambda ,\\ 0 &{}\mathrm {for} \; x \in [-\lambda ,\lambda ],\\ \tfrac{1}{2} (x + \lambda )^2&{} \mathrm {for} \; x < -\lambda . \end{array} \right. \end{aligned}$$

For \(\mathcal {H}= \mathbb R^d\) and \(f(x) := \Vert x\Vert _1\), we can use a componentwise approach. Then, \(S_\lambda \) is defined componentwise, the Moreau envelope reads as \(M_{\lambda \Vert \cdot \Vert _1} (x) = \sum _{i=1}^d m_{\lambda | \cdot |}(x_i)\) and the potential of \(\mathrm {prox}_{\lambda \Vert \cdot \Vert _1}\) is \(\Phi (x) = \sum _{i=1}^d \varphi (x_i)\).
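The quantities of this example are easily implemented; the following numpy sketch (with our own illustrative names soft, huber and phi) checks numerically that the soft shrinkage operator is indeed the derivative of the potential \(\varphi \).

```python
import numpy as np

lam = 0.7

def soft(x):
    # soft shrinkage S_lambda
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def huber(x):
    # Moreau envelope m_{lambda |.|}
    return np.where(np.abs(x) <= lam, 0.5 * x**2, lam * np.abs(x) - 0.5 * lam**2)

def phi(x):
    # potential varphi with prox_{lambda |.|} = varphi'
    return 0.5 * x**2 - huber(x)

x = np.linspace(-3, 3, 2001)
h = 1e-5
dphi = (phi(x + h) - phi(x - h)) / (2 * h)      # central finite differences
print(np.max(np.abs(dphi - soft(x))))           # tiny, since varphi is C^1
```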

3 Interplay Between Proximity and Linear Operators

Let \(\mathcal {H}\) and \(\mathcal {K}\) be real Hilbert spaces with inner products \(\langle \cdot ,\cdot \rangle _\mathcal {H}\) and \(\langle \cdot ,\cdot \rangle _\mathcal {K}\) and corresponding norms \(\Vert \cdot \Vert _\mathcal {H}\) and \(\Vert \cdot \Vert _\mathcal {K}\), respectively. By \(\mathcal {B}(\mathcal {H},\mathcal {K})\) we denote the space of bounded, linear operators from \(\mathcal {H}\) to \(\mathcal {K}\). The kernel and the range of \(T\in \mathcal {B}(\mathcal {H},\mathcal {K})\) are denoted by \(\mathcal N(T)\) and \(\mathcal R(T)\), respectively. In this section, we show that for any nontrivial operator \(T \in \mathcal {B}(\mathcal {H},\mathcal {K})\) with closed range \(\mathcal {R}(T)\), \(b \in \mathcal {K}\) and proximity operator \(\mathrm {Prox}:\mathcal {K}\rightarrow \mathcal {K}\), the operator \(T^\dagger \, \mathrm {Prox}( T \cdot + b):\mathcal {H}\rightarrow \mathcal {H}\) is itself a proximity operator on the linear space \(\mathcal {H}\) equipped with a suitable (equivalent) norm \(\Vert \cdot \Vert _{{\mathcal {H}_{T}}}\), i.e., there exists a function \(f \in \Gamma _0(\mathcal {H})\) such that

$$\begin{aligned} T^\dagger \, \mathrm {Prox}( Tx + b) = {\mathop {\hbox {argmin}}\limits _{y \in \mathcal {H}}} \bigl \{ \tfrac{1}{2} \Vert x-y\Vert _{\mathcal {H}_{T}}^2 + f(y) \bigr \}. \end{aligned}$$

Throughout this section, let \(T \in \mathcal {B}(\mathcal {H},\mathcal {K})\) have closed range. Then, the same holds true for its adjoint \(T^*:\mathcal {K}\rightarrow \mathcal {H}\) and the following (orthogonal) decompositions hold

$$\begin{aligned} \mathcal {K}= \mathcal {R}(T) \oplus \mathcal {N}(T^*), \qquad \mathcal {H}= \mathcal {R}(T^*) \oplus \mathcal {N}(T). \end{aligned}$$
(4)

The Moore–Penrose inverse (generalized inverse, pseudo-inverse) \(T^\dagger \in \mathcal {B}(\mathcal {K},\mathcal {H})\) is given point-wise by

$$\begin{aligned} \{T^{\dagger }y\} = \{x \in \mathcal {H}: T^*\,T x = T^*y\} \cap \mathcal {R}(T^*), \end{aligned}$$

see [5]. Further, it satisfies \(\mathcal {R} (T^\dagger ) = \mathcal {R} (T^*)\) and

$$\begin{aligned} T^{\dagger }\, T = P_{\mathcal {R}(T^*)}, \quad T\, T^{\dagger } = P_{\mathcal {R}(T)}, \end{aligned}$$
(5)

where \(P_{C}\) is the orthogonal projection onto the closed, convex set C, see [5, Proposition 3.28]. Then, it follows

$$\begin{aligned} T^\dagger \, T\, T^* = P_{\mathcal {R}(T^*)}\,T^* = T^* \quad \mathrm {and} \quad T^\dagger \, P_{\mathcal {R}(T)} = T^\dagger \, T\, T^\dagger = T^\dagger . \end{aligned}$$
(6)

If T is injective, then \(T^\dagger = (T^*\,T)^{-1} T^*\) and if T is surjective, we have \(T^\dagger = T^*(T\,T^*)^{-1}\).

Every \(T\in \mathcal {B}(\mathcal {H},\mathcal {K})\) gives rise to an inner product in \(\mathcal {H}\) via

$$\begin{aligned} \langle x, y \rangle _{\mathcal {H}_{T}}= \langle Tx, Ty \rangle _\mathcal {K}/\Vert T\Vert ^2_{\mathcal {B}(\mathcal {H},\mathcal {K})} + \langle x, P_{\mathcal {N}(T)} y \rangle _\mathcal {H}\end{aligned}$$
(7)

with corresponding norm

$$\begin{aligned} \Vert x\Vert _{\mathcal {H}_{T}}= \bigl (\Vert Tx\Vert ^2_\mathcal {K}/\Vert T\Vert ^2_{\mathcal {B}(\mathcal {H},\mathcal {K})} + \Vert P_{\mathcal {N}(T)} x\Vert ^2_\mathcal {H}\bigr )^{\frac{1}{2}}. \end{aligned}$$

If T is injective, the second summand vanishes. In general, this norm only induces a pre-Hilbert structure. Since \(T \in \mathcal {B}(\mathcal {H},\mathcal {K})\) has closed range, the norms \(\Vert \cdot \Vert _\mathcal {H}\) and \(\Vert \cdot \Vert _{\mathcal {H}_{T}}\) are equivalent on \(\mathcal {H}\) due to

$$\begin{aligned} \Vert x\Vert _{\mathcal {H}_{T}}^2 = \Vert T x \Vert _\mathcal {K}^2/\Vert T\Vert ^2_{\mathcal {B}(\mathcal {H},\mathcal {K})} + \Vert P_{\mathcal {N}(T)} x\Vert _\mathcal {H}^2 \le 2 \Vert x \Vert _\mathcal {H}^2 \end{aligned}$$
(8)

and

$$\begin{aligned} \Vert x \Vert _\mathcal {H}^2 = \Vert T^\dagger \,T x \Vert _\mathcal {H}^2 +\Vert P_{\mathcal {N}(T)} x \Vert _\mathcal {H}^2 \le \bigl (\Vert T^\dagger \Vert ^2_{\mathcal {B}(\mathcal {K},\mathcal {H})}\Vert T\Vert ^2_{\mathcal {B}(\mathcal {H},\mathcal {K})} + 1\bigr ) \Vert x \Vert _{\mathcal {H}_{T}}^2 \end{aligned}$$

for all \(x \in \mathcal {H}\). The norm equivalence also ensures the completeness of \(\mathcal {H}\) equipped with the new norm. To emphasize that we consider the linear space \(\mathcal {H}\) with this norm, we write \({\mathcal {H}_{T}}\). For special \(T \in \mathcal {B}(\mathcal {H},\mathcal {K})\), the inner product (7) coincides with the one in \(\mathcal {H}\).
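In the finite dimensional setting, the inner product (7) and the corresponding norm can be evaluated directly via the Moore-Penrose inverse. The following numpy sketch (sizes and names are illustrative) is a possible implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 6, 3
T = rng.standard_normal((n, d))             # full column rank almost surely
P_null = np.eye(d) - np.linalg.pinv(T) @ T  # projection onto N(T), here ~ 0
c = np.linalg.norm(T, 2) ** 2               # ||T||^2 (squared spectral norm)

def inner_T(x, y):
    # inner product (7)
    return (T @ x) @ (T @ y) / c + x @ (P_null @ y)

def norm_T(x):
    return np.sqrt(inner_T(x, x))

x = rng.standard_normal(d)
print(norm_T(x), np.linalg.norm(x))         # equivalent norms, cf. (8)
```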

Lemma 3.1

Let \(T \in \mathcal {B}(\mathcal {H},\mathcal {K})\) fulfill \(T^* T = \Vert T\Vert ^2_{\mathcal {B}(\mathcal {H},\mathcal {K})} \mathrm {Id}_\mathcal {H}\) or \(T T^* = \Vert T\Vert ^2_{\mathcal {B}(\mathcal {H},\mathcal {K})} \mathrm {Id}_\mathcal {K}\), where \(\mathrm {Id}_\mathcal {H}\) and \(\mathrm {Id}_\mathcal {K}\) denote the identity operator on \(\mathcal {H}\) and \(\mathcal {K}\), respectively. Then, the inner product (7) coincides with the one in \(\mathcal {H}\) and consequently \(\mathcal {H}= \mathcal {H}_T\).

Proof

If \(T^* T = \Vert T\Vert ^2_{\mathcal {B}(\mathcal {H},\mathcal {K})}\mathrm {Id}_\mathcal {H}\), then T is injective such that (7) implies

$$\begin{aligned} \langle x, y \rangle _{\mathcal {H}_{T}}= \langle Tx, Ty \rangle _\mathcal {K}/\Vert T\Vert ^2_{\mathcal {B}(\mathcal {H},\mathcal {K})} = \langle x, T^* Ty \rangle _\mathcal {H}/\Vert T\Vert ^2_{\mathcal {B}(\mathcal {H},\mathcal {K})} = \langle x, y \rangle _\mathcal {H}. \end{aligned}$$

If \(T T^* = \Vert T\Vert ^2_{\mathcal {B}(\mathcal {H},\mathcal {K})}\mathrm {Id}_\mathcal {K}\), then (6) implies that \(P_{\mathcal {N}(T)} = \mathrm {Id}_\mathcal {H}- T^\dagger T = \mathrm {Id}_\mathcal {H}- T^* T/\Vert T\Vert ^2_{\mathcal {B}(\mathcal {H},\mathcal {K})}\) and

$$\begin{aligned} \langle x, y \rangle _{\mathcal {H}_{T}}&= \langle Tx, Ty \rangle _\mathcal {K}/\Vert T\Vert ^2_{\mathcal {B}(\mathcal {H},\mathcal {K})} + \langle x, P_{\mathcal {N}(T)} y \rangle _\mathcal {H}\\&= \langle x,T^*Ty \rangle _\mathcal {H}/\Vert T\Vert ^2_{\mathcal {B}(\mathcal {H},\mathcal {K})} + \langle x,y\rangle _\mathcal {H}- \langle x,T^* T y\rangle _\mathcal {H}/\Vert T\Vert ^2_{\mathcal {B}(\mathcal {H},\mathcal {K})}\\&= \langle x,y\rangle _\mathcal {H}. \end{aligned}$$

   \(\square \)

To apply the characterization of proximal mappings in \(\mathcal {H}_T\) by Moreau, see Theorem 2.2, we have to compute gradients in \(\mathcal {H}_T\). Here, the following result is crucial.

Lemma 3.2

Let \(\mathcal {H}\) and \(\mathcal {K}\) be real Hilbert spaces with inner products \(\langle \cdot ,\cdot \rangle _\mathcal {H}\) and \(\langle \cdot ,\cdot \rangle _\mathcal {K}\), respectively. For an operator \(T \in \mathcal {B}(\mathcal {H},\mathcal {K})\) with closed range, let \(\mathcal {H}_T\) be the Hilbert space with inner product (7). For (Fréchet) differentiable \(\Phi :\mathcal {H}\rightarrow \mathbb R\), the gradients \(\nabla _\mathcal {H}\Phi \) and \(\nabla _{\mathcal {H}_T} \Phi \) with respect to the different inner products are related by

$$\begin{aligned} \bigl (T^* T/\Vert T\Vert ^2_{\mathcal {B}(\mathcal {H},\mathcal {K})} + P_{\mathcal {N}(T)}\bigr )\nabla _{\mathcal {H}_{T}}\, \Phi (x) = \nabla _\mathcal {H}\Phi (x). \end{aligned}$$

Proof

The gradient \(\nabla _{\mathcal {H}_{T}}\Phi (x)\) at \(x \in \mathcal {H}\) in the space \({\mathcal {H}_{T}}\) is given by the vector satisfying

$$\begin{aligned} \langle \nabla _{\mathcal {H}_{T}}\Phi (x), h \rangle _{\mathcal {H}_{T}}= D\Phi (x) h = \langle \nabla _\mathcal {H}\Phi (x), h \rangle _\mathcal {H}\end{aligned}$$

for all \(h \in \mathcal {H}\). Since

$$\begin{aligned} \langle \nabla _{\mathcal {H}_{T}}\Phi (x), h \rangle _{\mathcal {H}_{T}}&= \langle T \nabla _{\mathcal {H}_{T}}\Phi (x), Th \rangle _\mathcal {K}/\Vert T\Vert ^2_{\mathcal {B}(\mathcal {H},\mathcal {K})} + \langle \nabla _{\mathcal {H}_{T}}\Phi (x),P_{\mathcal {N}(T)} h \rangle _\mathcal {H}\\&= \langle T^* T \nabla _{\mathcal {H}_{T}}\Phi (x), h \rangle _\mathcal {H}/\Vert T\Vert ^2_{\mathcal {B}(\mathcal {H},\mathcal {K})} + \langle P_{\mathcal {N}(T)} \nabla _{\mathcal {H}_{T}}\Phi (x), h \rangle _\mathcal {H}, \end{aligned}$$

the gradient depends on the chosen inner product through

$$\begin{aligned} \bigl (T^* T/\Vert T\Vert ^2_{\mathcal {B}(\mathcal {H},\mathcal {K})} + P_{\mathcal {N}(T)}\bigr )\nabla _{\mathcal {H}_{T}}\, \Phi (x) = \nabla _\mathcal {H}\Phi (x). \end{aligned}$$

   \(\square \)

Now, the desired result follows from the next theorem.

Theorem 3.3

Let \(b \in \mathcal {K}\), \(T \in \mathcal {B}(\mathcal {H},\mathcal {K})\) have closed range and \(\mathrm {Prox}:\mathcal {K}\rightarrow \mathcal {K}\) be a proximity operator on \(\mathcal {K}\). Then, the operator \(A := T^\dagger \, \mathrm {Prox}\, (T \cdot + b) :{\mathcal {H}_{T}}\rightarrow {\mathcal {H}_{T}}\) is a proximity operator.

Proof

In view of Theorems 2.1 and 2.2, it suffices to show that A is nonexpansive and that there exists a convex function \(\Psi :{\mathcal {H}_{T}}\rightarrow \mathbb {R}\) with \(A = \nabla _{\mathcal {H}_{T}}\Psi \).

1. First, we show that A is firmly nonexpansive, and thus nonexpansive. By (4), we see that

$$\begin{aligned} P_{\mathcal {N}(T)} T^\dagger = 0. \end{aligned}$$
(9)

Using this and (5), it follows

$$\begin{aligned} \Vert Ax - Ay \Vert _{\mathcal {H}_{T}}^2&= \frac{\Vert T T^\dagger \left( \, \mathrm {Prox}\, (T x + b) - \mathrm {Prox}\, (T y + b) \right) \Vert _\mathcal {K}^2}{\Vert T\Vert ^2_{\mathcal {B}(\mathcal {H},\mathcal {K})}} + \Vert P_{\mathcal {N}(T)} (Ax -Ay)\Vert ^2_\mathcal {H}\nonumber \\&\le \frac{\Vert \mathrm {Prox}\, (T x + b) - \mathrm {Prox}\, (T y + b) \Vert _\mathcal {K}^2}{\Vert T\Vert ^2_{\mathcal {B}(\mathcal {H},\mathcal {K})}}. \end{aligned}$$
(10)

By (9) and (5), we obtain

$$\begin{aligned} \Vert T\Vert ^2_{\mathcal {B}(\mathcal {H},\mathcal {K})}\bigl \langle A x - A y, x-y \bigr \rangle _{\mathcal {H}_{T}}&= \bigl \langle T T^\dagger \bigl ( \mathrm {Prox}\, (T x +b) - \mathrm {Prox}\, (T y + b) \bigr ), Tx - Ty \bigr \rangle _\mathcal {K}\\&= \bigl \langle P_{\mathcal {R}(T)} \bigl ( \mathrm {Prox}\, (T x + b) - \mathrm {Prox}\, (T y + b) \bigr ), Tx - Ty \bigr \rangle _\mathcal {K}\\&= \bigl \langle \mathrm {Prox}\, (T x + b) - \mathrm {Prox}\, (T y + b), Tx - Ty \bigr \rangle _\mathcal {K}\\&= \bigl \langle \mathrm {Prox}\, (T x + b) - \mathrm {Prox}\, (T y + b), Tx + b - (Ty + b) \bigr \rangle _\mathcal {K}, \end{aligned}$$

and since \(\mathrm {Prox}\) is firmly nonexpansive with respect to \(\Vert \cdot \Vert _{\mathcal {K}}\), see (2), the estimate (10) further implies that A is firmly nonexpansive

$$\begin{aligned} \bigl \langle Ax - Ay , x-y \bigr \rangle _{\mathcal {H}_{T}}&\ge \frac{\Vert \mathrm {Prox}\, (T x + b) - \mathrm {Prox}\, (T y + b) \Vert ^2_\mathcal {K}}{\Vert T\Vert ^2_{\mathcal {B}(\mathcal {H},\mathcal {K})}} \ge \Vert Ax - Ay \Vert _{\mathcal {H}_{T}}^2. \end{aligned}$$

2. It remains to prove that there exists a convex function \(\Psi :{\mathcal {H}_{T}}\rightarrow \mathbb {R}\) with \(\nabla _{\mathcal {H}_{T}}\Psi = A\). Since \(\mathrm {Prox}\) is a proximity operator, there exists a convex function \(\Phi :\mathcal {K}\rightarrow \mathbb {R}\) with \(\mathrm {Prox}= \nabla _\mathcal {K}\Phi \). Then, a natural candidate is given by \(\Psi =\Phi \, (T \cdot + b)/\Vert T\Vert ^2_{\mathcal {B}(\mathcal {H},\mathcal {K})}\). Using the definition of the gradient and the chain rule, it holds for all \(x,h\in \mathcal {H}\) that

$$\begin{aligned} \langle \nabla _\mathcal {H}\Psi (x) , h \rangle _\mathcal {H}&= D\Psi (x)h = \frac{D\Phi (Tx + b)\,Th}{\Vert T\Vert ^2_{\mathcal {B}(\mathcal {H},\mathcal {K})}}= \frac{\langle \nabla _\mathcal {K}\Phi (Tx + b) , Th \rangle _\mathcal {K}}{\Vert T\Vert ^2_{\mathcal {B}(\mathcal {H},\mathcal {K})}}\\&= \frac{\langle T^* \mathrm {Prox}\, (Tx + b) , h \rangle _\mathcal {H}}{\Vert T\Vert ^2_{\mathcal {B}(\mathcal {H},\mathcal {K})}}. \end{aligned}$$

Incorporating Lemma 3.2, we conclude

$$\begin{aligned} (T^* T/\Vert T\Vert ^2_{\mathcal {B}(\mathcal {H},\mathcal {K})} + P_{\mathcal {N}(T)}) \nabla _{\mathcal {H}_{T}}\Psi (x) = \nabla _\mathcal {H}\Psi (x) = T^* \, \mathrm {Prox}\, (Tx+ b)/\Vert T\Vert ^2_{\mathcal {B}(\mathcal {H},\mathcal {K})}, \end{aligned}$$

which implies \(T^*\, T\, \nabla _{\mathcal {H}_{T}}\Psi (x) = T^* \, \mathrm {Prox}\, (Tx + b)\) and \(\nabla _{\mathcal {H}_{T}}\Psi (x) \in \mathcal {R}(T^*)\). By definition of \(T^\dagger \), we obtain \(\nabla _{\mathcal {H}_{T}}\Psi = A\). Finally, \(\Psi \) is convex as it is the concatenation of a convex function with an affine function.    \(\square \)
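The firm nonexpansiveness shown in the first part of the proof can also be checked numerically. The following sketch does so for \(\mathrm {Prox}= S_\lambda \) (componentwise soft shrinkage) and randomly chosen T, b and sample points; all names and sizes are our own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, lam = 7, 4, 0.3
T = rng.standard_normal((n, d))
b = rng.standard_normal(n)
T_pinv = np.linalg.pinv(T)
P_null = np.eye(d) - T_pinv @ T
c = np.linalg.norm(T, 2) ** 2

def inner_T(u, v):
    # inner product (7)
    return (T @ u) @ (T @ v) / c + u @ (P_null @ v)

def soft(z):
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def A(x):
    # A = T^dagger Prox(T . + b) with Prox = soft shrinkage
    return T_pinv @ soft(T @ x + b)

for _ in range(1000):
    x, y = rng.standard_normal(d), rng.standard_normal(d)
    # firm nonexpansiveness (2) with respect to the H_T inner product
    assert inner_T(A(x) - A(y), A(x) - A(y)) <= inner_T(x - y, A(x) - A(y)) + 1e-12
print("A is firmly nonexpansive on H_T")
```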

Let

$$\begin{aligned} (f\square g)(x) := \inf _{y\in \mathcal {H}} f(y) + g(x-y) \end{aligned}$$

denote the infimal convolution of \(f,g \in \Gamma _0(\mathcal {H})\) and \(x \mapsto \iota _S(x)\) the indicator function of the set S taking the value 0 if \(x \in S\) and \(+\infty \) otherwise.

For \(\mathrm {Prox}:= \mathrm {prox}_{g}\) with \(g \in \Gamma _0(\mathcal {H})\), we are actually able to explicitly compute \(f \in \Gamma _0(\mathcal {H})\) such that \(T^\dagger \, \mathrm {Prox}\, (T \cdot + b) = \mathrm {prox}_f\) on \({\mathcal {H}_{T}}\). Clearly, this also gives an alternative proof for Theorem 3.3.

Theorem 3.4

Let \(b \in \mathcal {K}\), \(T \in \mathcal {B}(\mathcal {H},\mathcal {K})\) with closed range and \(\mathrm {Prox}:= \mathrm {prox}_{g}\) for some \(g \in \Gamma _0(\mathcal {K})\). Then, \(T^{\dagger } \, \mathrm {prox}_{g} \, (T \cdot + b) :{\mathcal {H}_{T}}\rightarrow {\mathcal {H}_{T}}\) is the proximity operator on \({\mathcal {H}_{T}}\) of \(f \in \Gamma _0(\mathcal {H})\) given by

$$\begin{aligned} f(x) := g \square \bigl ( \tfrac{1}{2} \Vert \cdot \Vert _{\mathcal {K}}^2 + \iota _{\mathcal {N}(T^*)} \bigr ) (Tx + b)/\Vert T\Vert ^2_{\mathcal {B}(\mathcal {H},\mathcal {K})} + \iota _{\mathcal {R}(T^*)}(x). \end{aligned}$$
(11)

This expression simplifies to

$$\begin{aligned} f(x)&= g \square \bigl ( \tfrac{1}{2} \Vert \cdot \Vert _{\mathcal {K}}^2 + \iota _{\mathcal {N}(T^*)} \bigr ) (Tx + b)/\Vert T\Vert ^2_{\mathcal {B}(\mathcal {H},\mathcal {K})} \quad \text{ if } \text{ T } \text{ is } \text{ injective },\\ f(x)&= g(Tx + b)/\Vert T\Vert ^2_{\mathcal {B}(\mathcal {H},\mathcal {K})} + \iota _{\mathcal {R}(T^*)}(x) \quad \text{ if } \text{ T } \text{ is } \text{ surjective },\\ f(x)&= g(Tx + b)/\Vert T\Vert ^2_{\mathcal {B}(\mathcal {H},\mathcal {K})} \quad \text{ if } \text{ T } \text{ is } \text{ bijective }. \end{aligned}$$

Proof

By (6) and (4), we obtain

$$\begin{aligned}&T^\dagger \, \mathrm {prox}_{g} \, (T x + b)\\ =&T^\dagger {\mathop {\hbox {argmin}}\limits _{z \in \mathcal {K}}} \bigl \{ \tfrac{1}{2} \Vert z - Tx - b\Vert _{\mathcal {K}}^2 + g(z)\bigr \}\\ =&T^\dagger P_{\mathcal {R}(T)} {\mathop {\hbox {argmin}}\limits _{z_1 \in \mathcal {R}(T), z_2 \in \mathcal {N}(T^*)}} \bigl \{ \tfrac{1}{2} \Vert z_1 + z_2 - Tx\Vert _{\mathcal {K}}^2 + g(z_1 + z_2 + b) \bigr \}\\ =&T^\dagger {\mathop {\hbox {argmin}}\limits _{z_1 \in \mathcal {R}(T)}} \inf _{z_2 \in \mathcal {N}(T^*)} \bigl \{ \tfrac{1}{2} \Vert z_1 - Tx\Vert _{\mathcal {K}}^2 + \tfrac{1}{2} \Vert z_2\Vert _{\mathcal {K}}^2 + g(z_1 + z_2 + b) \bigr \}\\ =&T^\dagger {\mathop {\hbox {argmin}}\limits _{z_1 \in \mathcal {R}(T)}} \Bigl \{ \tfrac{1}{2} \Vert z_1 - Tx\Vert _{\mathcal {K}}^2 + \inf _{z_2 \in \mathcal {N}(T^*)} \bigl \{ \tfrac{1}{2}\Vert z_2\Vert _{\mathcal {K}}^2 + g(z_1 + z_2 + b) \bigr \} \Bigr \}\\ =&T^\dagger T {\mathop {\hbox {argmin}}\limits _{y \in \mathcal {R}(T^*)}} \Bigl \{ \tfrac{1}{2} \Vert Ty - Tx\Vert _{\mathcal {K}}^2 + \inf _{z_2 \in \mathcal {N}(T^*)} \bigl \{ \tfrac{1}{2}\Vert z_2\Vert _{\mathcal {K}}^2 + g(Ty + z_2 + b) \bigr \} \Bigr \} \end{aligned}$$

and by (5) further

$$\begin{aligned}&T^\dagger \, \mathrm {prox}_{g} \, (T x + b)\nonumber \\ =&{\mathop {\hbox {argmin}}\limits _{y \in \mathcal {R}(T^*)}} \Bigl \{ \tfrac{1}{2} \Vert Ty - Tx\Vert _{\mathcal {K}}^2 + \inf _{z_2 \in \mathcal {N}(T^*)} \bigl \{ \tfrac{1}{2}\Vert z_2\Vert _{\mathcal {K}}^2 + g(Ty + z_2 + b) \bigr \} \Bigr \}\nonumber \\ =&{\mathop {\hbox {argmin}}\limits _{y \in \mathcal {R}(T^*)}} \Bigl \{ \tfrac{1}{2} \Vert y - x\Vert _{\mathcal {H}_{T}}^2 + \inf _{z_2 \in \mathcal {N}(T^*)} \bigl \{ \tfrac{1}{2}\Vert z_2\Vert _{\mathcal {K}}^2 + g(Ty + z_2 + b) \bigr \}/\Vert T\Vert ^2_{\mathcal {B}(\mathcal {H},\mathcal {K})} \Bigr \} \\ =&{\mathop {\hbox {argmin}}\limits _{y \in \mathcal {H}}} \Bigl \{ \tfrac{1}{2} \Vert y - x\Vert _{\mathcal {H}_{T}}^2 + g \square \bigl ( \tfrac{1}{2} \Vert \cdot \Vert _{\mathcal {K}}^2 + \iota _{\mathcal {N}(T^*)} \bigr ) (Ty + b)/\Vert T\Vert ^2_{\mathcal {B}(\mathcal {H},\mathcal {K})} + \iota _{\mathcal {R}(T^*)}(y)\Bigr \}.\nonumber \end{aligned}$$
(12)

Hence, we conclude that \(T^{\dagger } \, \mathrm {prox}_{g} \, (T \cdot + b)\) is the proximity operator on \({\mathcal {H}_{T}}\) of f in (11).    \(\square \)

Note that for injective T and \(b=0\), the function f is in general a weaker regularizer than g. This is necessary since using \(g(T \cdot )/\Vert T\Vert ^2_{\mathcal {B}(\mathcal {H},\mathcal {K})}\) in place of f in (12) would lead to

$$\begin{aligned} {\mathop {\hbox {argmin}}\limits _{y \in \mathcal {R}(T^*)}} \bigl \{\tfrac{1}{2} \Vert x-y\Vert _{\mathcal {H}_T}^2 + g(Ty)/\Vert T\Vert ^2_{\mathcal {B}(\mathcal {H},\mathcal {K})} \bigr \}&= T^\dagger {\mathop {\hbox {argmin}}\limits _{z \in \mathcal {K}}} \bigl \{ \tfrac{1}{2} \Vert z - Tx\Vert _\mathcal {K}^2 + g(z) + \iota _{\mathcal{R}(T)}(z)\bigr \}\\ {}&\ne T^\dagger \mathrm {prox}_{g} (T x). \end{aligned}$$

4 Frame Soft Shrinkage as Proximity Operator

In this section, we investigate the frame soft shrinkage as a special proximity operator. Let \(\mathcal {K}= \ell _2\) be the Hilbert space of square summable sequences \(c = \{c_k\}_{k \in \mathbb N}\) with norm \(\Vert c \Vert _{\ell _2} := ( \sum _{k \in \mathbb N} |c_k|^2)^{\frac{1}{2}}\) and assume that \(\mathcal {H}\) is separable. A sequence \(\{x_k\}_{k\in \mathbb {N}}\), \(x_k \in \mathcal {H}\), is called a frame of \(\mathcal {H}\), if constants \(0< A \le B < \infty \) exist such that for all \(x \in \mathcal {H}\),

$$\begin{aligned} A \Vert x\Vert _\mathcal {H}^2 \le \sum _{k\in \mathbb {N}} |\langle x,x_k \rangle _\mathcal {H}|^2 \le B \Vert x\Vert _\mathcal {H}^2. \end{aligned}$$

Given a frame \(\{x_k\}_{k\in \mathbb {N}}\) of \(\mathcal {H}\), the corresponding analysis operator \(T :\mathcal {H}\rightarrow \ell _2\) is defined as

$$\begin{aligned} Tx=\bigl \{ \langle x,x_k \rangle _\mathcal {H}\bigr \}_{k\in \mathbb {N}}, \quad x\in \mathcal {H}. \end{aligned}$$

Its adjoint \(T^*:\ell _2 \rightarrow \mathcal {H}\) is the synthesis operator given by

$$\begin{aligned} T^*\{c_k\}_{k\in \mathbb {N}} = \sum _{k\in \mathbb {N}} c_k x_k, \quad \{c_k\}_{k\in \mathbb {N}} \in \ell _2. \end{aligned}$$

By composing T and \(T^*\), we obtain the frame operator

$$\begin{aligned} T^*Tx = \sum _{k\in \mathbb {N}} \langle x , x_k \rangle _\mathcal {H}x_k, \quad x\in \mathcal {H}, \end{aligned}$$

which is invertible on \(\mathcal {H}\), see [11], such that

$$\begin{aligned} x= \sum _{k\in \mathbb {N}} \langle x , x_k \rangle _{\mathcal H} (T^*T)^{-1} x_k, \quad x\in \mathcal {H}. \end{aligned}$$

The sequence \(\{ (T^*T)^{-1}x_k\}_{k\in \mathbb {N}}\) is called the canonical dual frame of \(\{ x_k\}_{k\in \mathbb {N}}\). If

$$\begin{aligned} T^*T = \Vert T^* T\Vert \mathrm {Id}_\mathcal {H}= \Vert T\Vert ^2 \mathrm {Id}_\mathcal {H}, \end{aligned}$$

then \(\{x_k\}_{k \in \mathbb N}\) is called a tight frame, and for \(T^*T = \mathrm {Id}_\mathcal {H}\) a Parseval frame. Here, Lemma 3.1 comes into play. Note that \(T^\dagger \) is indeed the synthesis operator for the canonical dual frame of \(\{ x_k\}_{k\in \mathbb {N}}\). The relation between linear, bounded, injective operators with closed range and frame analysis operators is given in the next proposition.

Proposition 4.1

  1. (i)

    An operator \(T \in \mathcal {B}(\mathcal {H},\ell _2)\) is injective and has closed range if and only if it is the analysis operator of some frame of \(\mathcal {H}\).

  2. (ii)

    An operator \(T \in \mathcal {B}(\ell _2,\mathcal {H})\) is surjective if and only if it is the synthesis operator of some frame of \(\mathcal {H}\).

Proof

(i) If T is the analysis operator for a frame \(\{x_k\}_{k\in \mathbb {N}}\), then T is bounded, injective and has closed range, see [11]. Conversely, assume that \(T \in \mathcal {B}(\mathcal {H},\ell _2)\) is injective and that \(\mathcal {R}(T)\) is closed. By (4), it holds \(\mathcal {R}(T^*) = \mathcal H\). Let \(\{\delta _k\}_{k\in \mathbb {N}}\) be the canonical basis of \(\ell _2\) and set \(\{x_k \}_{k\in \mathbb {N}}:= \{T^{*} \delta _k\}_{k\in \mathbb {N}}\). Since \(\sum _{k\in \mathbb {N}} |\langle x,x_k \rangle _\mathcal {H}|^2 = \Vert Tx \Vert _{\ell _2}^2\), we conclude that \(\{x_k \}_{k\in \mathbb {N}}\) is a frame of \(\mathcal {H}\) and that T is the corresponding analysis operator.

(ii) Let \(\{x_k\}_{k\in \mathbb {N}} = \{T\delta _k\}_{k\in \mathbb {N}}\). Then, the result follows from [11, Theorem 5.5.1].    \(\square \)

The soft shrinkage operator \(S_\lambda \) on \(\ell _2\) (applied componentwise) is the proximity operator corresponding to the function \(g := \lambda \Vert \cdot \Vert _1\), \(\lambda >0\). As an immediate consequence of Theorem 3.4, we obtain the following corollary.

Corollary 4.2

Assume that \(T:\mathcal {H}\rightarrow \ell _2\) is an analysis operator for some frame of \(\mathcal {H}\) and \(\mathrm {Prox}:\ell _2 \rightarrow \ell _2\) is an arbitrary proximity operator. Then, \(T^{\dagger } \, \mathrm {Prox}\, T \) is itself a proximity operator on \(\mathcal {H}\) equipped with the norm \(\Vert \cdot \Vert _{\mathcal {H}_{T}}\). In particular, if \(\mathrm {Prox}:= S_\lambda \) with \(\lambda >0\), then

$$\begin{aligned}&T^{\dagger } \, S_\lambda \, (T x) = {\mathop {\hbox {argmin}}\limits _{y \in \mathcal {H}}} \bigl \{ \tfrac{1}{2} \Vert x-y\Vert _{\mathcal {H}_{T}}^2 + f(y)\bigr \},\\&f(y) := \lambda \Vert \cdot \Vert _1 \square \bigl ( \tfrac{1}{2} \Vert \cdot \Vert _{\ell _2}^2 + \iota _{\mathcal {N}(T^*)} \bigr ) (Ty)/\Vert T\Vert ^2_{\mathcal {B}(\mathcal {H},\mathcal {K})}. \end{aligned}$$

Finally, let us have a look at the finite dimensional setting with \(\mathcal {H}:= \mathbb {R}^d\), \(\mathcal {K}:= \mathbb {R}^n\), \(n\ge d\). Then, we have for any \(T \in \mathbb {R}^{n,d}\) with full rank d and the proximity operator \(S_\lambda \) with \(\lambda >0\) on \(\mathbb {R}^n\) that

$$\begin{aligned}&T^{\dagger } \, S_\lambda \, (T x) = {\mathop {\hbox {argmin}}\limits _{y \in \mathbb {R}^d}} \bigl \{ \tfrac{1}{2} \Vert x-y\Vert _{\mathcal {H}_{T}}^2 + f(y) \bigr \},\nonumber \\&f(y) := \lambda \Vert \cdot \Vert _1 \square \bigl ( \tfrac{1}{2} \Vert \cdot \Vert _{2}^2 + \iota _{\mathcal {N}(T^*)} \bigr ) (Ty)/\Vert T\Vert ^2_{\mathcal {B}(\mathcal {H},\mathcal {K})}. \end{aligned}$$
(13)

Example 4.3

We want to compute f for the matrix \(T:\mathbb {R}^{1} \rightarrow \mathbb {R}^{2}\) given by \(T = ( 1 , 2)^\mathrm {T}\) and the soft shrinkage operator \(S_\lambda \) on \(\mathbb {R}^2\) with \(\lambda >0\). Note that this example was also considered in [20]. By (13) and since \(x = (x_1, x_2)^\mathrm {T}\in \mathcal {N}(T^*)\) if and only if \(x_1 = -2 x_2\), we obtain

$$\begin{aligned} f(y) \Vert T\Vert ^2_{\mathcal {B}(\mathcal {H},\mathcal {K})}&= \lambda \Vert \cdot \Vert _1 \square \bigl ( \tfrac{1}{2} \Vert \cdot \Vert _2^2 + \iota _{\mathcal {N}(T^*)} (\cdot ) \bigr ) (Ty)\\&= \min _{Ty = z+x} \left\{ \lambda \Vert z\Vert _1 + \tfrac{1}{2} \Vert x \Vert _{2}^2 + \iota _{\mathcal {N}(T^*)}(x) \right\} \\&= \min _{x \in \mathbb {R}^2} \left\{ \lambda \Vert Ty-x\Vert _1 + \tfrac{1}{2} \Vert x \Vert _{2}^2+ \iota _{\mathcal {N}(T^*)}(x) \right\} \\&= \min _{x \in \mathbb {R}^2} \bigl \{\lambda \bigl \Vert (y , 2y )^\mathrm {T}- ( x_1, x_2)^\mathrm {T}\bigr \Vert _1 + \tfrac{1}{2} \Vert x \Vert _2^2 + \iota _{\mathcal {N}(T^*)}(x) \bigr \}\\&= \min _{x_2 \in \mathbb {R}} \bigl \{ \lambda \vert y +2x_2 \vert + \lambda \vert 2y-x_2 \vert + \tfrac{5}{2} x_2^2 \bigr \}. \end{aligned}$$

Consider the strictly convex function \(g_y(x_2) = \lambda \vert y +2x_2 \vert + \lambda \vert 2y-x_2 \vert + \frac{5}{2} x_2^2\). For \(\vert y \vert \le \frac{2}{5} \lambda \), it holds

$$\begin{aligned} 0 \in \partial _{x_2} g_y \left( -\tfrac{y}{2} \right) = [-2\lambda ,2\lambda ] - \lambda \mathrm {sgn}(y) -\tfrac{5}{2} y. \end{aligned}$$

Hence, by Fermat’s theorem, the unique minimizer of \(g_y(x_2)\) is given by \(-\frac{y}{2}\). Consequently, we have for \(\vert y \vert \le \frac{2}{5}\lambda \) that

$$\begin{aligned} f(y) = \tfrac{\lambda }{2} \vert y \vert + \tfrac{1}{8} y^2. \end{aligned}$$

For \(\vert y \vert > \frac{2\lambda }{5}\), the function \(g_y\) is differentiable at \(-\frac{\lambda }{5} \mathrm {sgn}(y)\) and it holds

$$\begin{aligned} \partial _{x_2} g_y\bigl (-\tfrac{\lambda }{5} \mathrm {sgn}(y)\bigr ) = 2 \lambda \mathrm {sgn}(y) - \lambda \mathrm {sgn}(y) - \lambda \mathrm {sgn}(y) = 0. \end{aligned}$$

Therefore, for \(\vert y \vert > \frac{2\lambda }{5}\), the minimizer of \(g_y\) is \(-\frac{\lambda }{5} \mathrm {sgn}(y)\) and

$$\begin{aligned} f(y) = \tfrac{3\lambda }{5}\vert y \vert - \tfrac{\lambda ^{2}}{50}. \end{aligned}$$

Choosing, e.g., \(\lambda = \frac{1}{3}\) we obtain

$$\begin{aligned} f (y) = \left\{ \begin{array}{ll} \frac{1}{6} \vert y \vert + \frac{1}{8} y^{2} &{} \vert y \vert \le \frac{2}{15} \\ \tfrac{1}{5}\vert y \vert - \frac{1}{450} &{} \vert y \vert > \frac{2}{15} \end{array} \right. , \end{aligned}$$

which is a good approximation of \(\tfrac{1}{5} \vert y \vert \).
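The computations of this example can be verified numerically. Since \(T^*T = \Vert T\Vert ^2 = 5\), Lemma 3.1 yields \(\Vert \cdot \Vert _{\mathcal {H}_T} = \vert \cdot \vert \), so the proximity operator of f can be approximated by a simple grid search over \(\mathbb R\). The following sketch (grid size and tolerance are ad hoc choices) compares it with \(T^\dagger S_\lambda (T\cdot )\).

```python
import numpy as np

lam = 1.0 / 3.0
T = np.array([[1.0], [2.0]])                 # T : R -> R^2
T_pinv = np.linalg.pinv(T)                   # = (1/5) (1, 2)

def soft(z):
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def f(y):
    # closed form of f derived above (lambda = 1/3)
    ay = np.abs(y)
    return np.where(ay <= 2.0 / 15.0, ay / 6.0 + y**2 / 8.0, ay / 5.0 - 1.0 / 450.0)

ys = np.linspace(-3.0, 3.0, 200001)          # grid for the one-dimensional prox
for x in np.linspace(-2.0, 2.0, 21):
    lhs = float(T_pinv @ soft(T.flatten() * x))          # T^dagger S_lambda(Tx)
    rhs = ys[np.argmin(0.5 * (x - ys) ** 2 + f(ys))]     # prox_f(x) by grid search
    assert abs(lhs - rhs) < 1e-3
print("T^dagger S_lambda(T .) coincides with prox_f on the test points")
```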

5 Proximal Neural Networks

In this section, we consider neural networks (NNs) consisting of \(K \in {\mathbb N}\) layers with dimensions \(n_{1}, \ldots , n_{K}\) defined by mappings \(\Phi = \Phi (\cdot \,; u):{\mathbb R}^{d} \rightarrow {\mathbb R}^{n_{K}}\) of the form

$$\begin{aligned} \Phi \left( x;u\right) := A_K \sigma \circ A_{K-1}\sigma \circ \cdots \sigma \circ A_1(x). \end{aligned}$$
(14)

Such NNs are composed of affine functions \(A_{k}:{\mathbb R}^{n_{k-1}} \rightarrow {\mathbb R}^{n_{k}}\) given by

$$\begin{aligned} A_{k}(x) := L_{k} x + b_{k}, \qquad k =1,\ldots , K, \end{aligned}$$
(15)

with weight matrices \(L_{k} \in {\mathbb R}^{n_{k}, n_{k-1}}\), \(n_{0}=d\), bias vectors \(b_{k} \in {\mathbb R}^{n_{k}}\) as well as a non-linear activation \(\sigma :\mathbb {R} \rightarrow \mathbb {R}\) acting at each component, i.e., for \({x}= (x_{j})_{j=1}^{n}\) we have \(\sigma (x) = (\sigma (x_{j}))_{j=1}^{n}\). The parameter set \(u:=\left( L_k,b_k\right) _{k=1}^{K}\) of such a NN has the overall dimension \(D:= n_0 n_1 + n_1 n_2 + \dots + n_{K-1}n_K + n_1 + \dots + n_K\). For an illustration see Fig. 1.

Fig. 1: Model of a NN with three hidden layers, i.e., \(d=4\), \(K=4\), \(n_1=n_2=n_3=5\), \(n_4=1\)

In [13], the notion of stable activation functions was introduced. An activation function \(\sigma :\mathbb R \rightarrow \mathbb R\) is called stable if it is monotone increasing, 1-Lipschitz continuous and satisfies \(\sigma (0) = 0\). The following result was shown in [13].

Lemma 5.1

A function \(\sigma :\mathbb {R} \rightarrow \mathbb {R}\) is a stable activation function if and only if there exists \(g \in \Gamma _0(\mathbb R)\) having 0 as a minimizer such that \(\sigma = \mathrm {prox}_{g}\).

Various common activation functions \(\sigma \) and corresponding functions \(g \in \Gamma _0(\mathbb R)\) are listed in Table 3 in the appendix. For \(T_k \in \mathbb R^{n_k,d}\), we consider the norm (8) and denote it by

$$\begin{aligned} \Vert x\Vert _{T_{k}} := \bigl ( \Vert T_{k} x\Vert _2^2/\Vert T_{k} \Vert _2^2 + \Vert (I - T_{k}^\dagger T_{k})x\Vert _2^2 \bigr )^\frac{1}{2}, \qquad x \in \mathbb R^d. \end{aligned}$$
(16)

In the previous sections, we have considered two different kinds of proximity operators, namely \(\mathrm {prox}_g\) with respect to the Euclidean norm

$$\begin{aligned} \mathrm {prox}_{g}(x) = {\mathop {\hbox {argmin}}\limits _{y \in \mathbb R^d}} \bigl \{ \tfrac{1}{2} \Vert x-y\Vert _{2}^2 + g(y) \bigr \}, \end{aligned}$$
(17)

and \(\mathrm {prox}_{T_k,g}\) with respect to the norm (16)

$$\begin{aligned} \mathrm {prox}_{T_k,g}(x) = {\mathop {\hbox {argmin}}\limits _{y \in \mathbb R^d}} \bigl \{ \tfrac{1}{2} \Vert x-y\Vert _{T_k}^2 + g(y) \bigr \}. \end{aligned}$$

Further, we derived a function \(f_k\) depending on g, \(T_k\) and \(b_k\), see Theorem 3.4, such that

$$\begin{aligned} \mathrm {prox}_{T_k,f_k}(x) = {\mathop {\hbox {argmin}}\limits _{y \in \mathbb R^d}} \left\{ \tfrac{1}{2} \Vert x-y\Vert _{T_k}^2 + f_k(y) \right\} = T_k^\dagger \mathrm {prox}_g (T_k x + b_k). \end{aligned}$$

Based on our observations in the previous sections, we consider the following special NNs. We choose a stable activation function \(\sigma = \mathrm {prox}_g\) for some \(g \in \Gamma _0(\mathbb R)\) and matrices \(T_{k} \in {\mathbb R}^{n_{k} , d}\), as well as bias vectors \(b_{k} \in {\mathbb R}^{n_{k}}\), \(k=1, \ldots , K\), and construct according to (15), with the convention \(T_{0} := I_{d}\), the affine mappings

$$\begin{aligned} A_{k}(x) := \underbrace{T_{k} T_{k-1}^{\dagger }}_{L_k} x + b_{k} , \qquad k=1,\ldots ,K. \end{aligned}$$
(18)

Then, the NN \(\Phi :\mathbb {R}^d\rightarrow \mathbb {R}^{n_K}\) in (14) with \(A_{k}\), \(b_{k}\) in (18) can be rewritten as

$$\begin{aligned} \Phi \left( x;u\right)&= T_K \, T_{K-1}^\dagger \sigma \bigl ( T_{K-1} \ldots T_2^\dagger \sigma \bigl ( T_2 T_{1}^\dagger \sigma (T_1 x + b_1) + b_2 \bigr ) \ldots \bigr )+ b_K\nonumber \\&= T_K\, \mathrm {prox}_{T_{K-1},f_{K-1}} \circ \cdots \circ \mathrm {prox}_{T_{1},f_1} (x) + b_K. \end{aligned}$$
(19)

We call \(\Phi \) a proximal neural network (PNN) with network parameters \(u := (T_k,b_k)_{k=1}^{K}\).
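A direct implementation of the forward pass (19) only needs the matrices \(T_k\), the biases \(b_k\) and a stable activation function. The following numpy sketch (function and variable names are ours, not the paper's) illustrates this for \(K=3\) with Stiefel matrices and soft shrinkage.

```python
import numpy as np

def pnn_forward(x, Ts, bs, sigma):
    """Forward pass of the PNN (19) with Ts = [T_1, ..., T_K], bs = [b_1, ..., b_K]."""
    z = sigma(Ts[0] @ x + bs[0])                          # T_1 x + b_1
    for k in range(1, len(Ts) - 1):
        z = sigma(Ts[k] @ (np.linalg.pinv(Ts[k - 1]) @ z) + bs[k])
    return Ts[-1] @ (np.linalg.pinv(Ts[-2]) @ z) + bs[-1]

rng = np.random.default_rng(3)
d, n1, n2, lam = 5, 8, 8, 0.1
T1, _ = np.linalg.qr(rng.standard_normal((n1, d)))        # T_1^* T_1 = I_d
T2, _ = np.linalg.qr(rng.standard_normal((n2, d)))        # T_2^* T_2 = I_d
Ts = [T1, T2, np.eye(d)]                                  # T_K = I_d
bs = [rng.standard_normal(n1), rng.standard_normal(n2), np.zeros(d)]
sigma = lambda z: np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)
print(pnn_forward(rng.standard_normal(d), Ts, bs, sigma))
```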

Next, we investigate stability properties of such networks. Recall that an operator \(\Psi :\mathcal {H}\rightarrow \mathcal {H}\) on a Hilbert space \(\mathcal {H}\) is \(\alpha \)-averaged, \(\alpha \in (0,1)\), if there exists a nonexpansive operator \(R:\mathcal {H}\rightarrow \mathcal {H}\) such that

$$\begin{aligned} \Psi = \alpha R + (1-\alpha ) I_{\mathcal {H}}. \end{aligned}$$

The following theorem summarizes properties of \(\alpha \)-averaged operators, cf. [5] and [38] for the third statement.

Theorem 5.2

Let \(\mathcal {H}\) be a separable real Hilbert space. Then the following holds true:

  1. (i)

    An operator on \(\mathcal {H}\) is firmly nonexpansive if and only if it is \(\frac{1}{2}\)-averaged.

  2. (ii)

    The concatenation of K operators which are \(\alpha _k\)-averaged with respect to the same norm is \(\alpha \)-averaged with \( \alpha = \frac{K}{K-1 + 1/\max _k \alpha _k} \).

  3. (iii)

    For an \(\alpha \)-averaged operator \(\Psi :\mathcal {H}\rightarrow \mathcal {H}\) with a nonempty fixed point set, the sequence generated by the iteration

    $$\begin{aligned} x^{(r+1)} = \Psi \bigl (x^{(r)} \bigr ) \end{aligned}$$

    converges weakly for every starting point \(x^{(0)} \in \mathcal {H}\) to a fixed point of \(\Psi \).

In the following, we study special PNNs, which are \(\alpha \)-averaged operators such that \(x^{(r+1)} = \Phi (x^{(r)};u)\) converges to a fixed point of \(\Phi \) if such a point exists.

Lemma 5.3

  1. (i)

    Let \(T_k\in \mathbb R^{n_k,d}\) fulfill \(T_k^* T_k = \Vert T_k \Vert ^2 I_{d}\) or \(T_k T_k^* = \Vert T_k \Vert ^2 I_{n_{k}}\) for all \(k=1,\ldots , K-1\) and let \(T_K = I_d\). Then \(\Phi \) in (19) is \(\alpha \)-averaged with \(\alpha = \frac{K-1}{K}\).

  2. (ii)

Let \(T_1 \in \mathbb R^{n_1,d}\) with full column rank fulfill \(\Vert T_1\Vert ^2 T_k^* T_k =\Vert T_k\Vert ^2 T_1^* T_1\) for \(k=1,\ldots ,K-1\) and \(T_K = I_d\). Then \(\Phi \) in (19) is \(\alpha \)-averaged with \(\alpha = \frac{K-1}{K}\).

Proof

(i) By Lemma 3.1, we know that \(\Vert \cdot \Vert _{T_k} = \Vert \cdot \Vert _2\) so that \(\Phi \) is the concatenation of \(K-1\) proximity operators on \(\mathbb R^d\) with respect to the Euclidean norm. More precisely,

$$\begin{aligned} \Phi (x;u) = T_K\,\mathrm {prox}_{f_{K-1}} \circ \cdots \circ \mathrm {prox}_{f_1} (x) + b_K \end{aligned}$$

with \(f_k\) as in Theorem 3.4. Now, the assertion follows from Theorem 5.2.

(ii) By assumption, \(T_k^* T_k = \Vert T_k\Vert ^2 T_1^* T_1/\Vert T_1\Vert ^2\) is positive definite, so that every \(T_k\) is injective and \(\Vert x\Vert _{T_k}^2 = x^* T_k^* T_k x /\Vert T_k\Vert ^2 = x^* T_1^* T_1 x/\Vert T_1\Vert ^2 = \Vert x\Vert _{T_1}^2\). Hence, \(\Phi \) becomes the concatenation of \(K-1\) proximity operators on \(\mathbb R^d\) all with respect to the \(\Vert \cdot \Vert _{T_1}\) norm. Again, the assertion follows from Theorem 5.2.    \(\square \)

Remark 5.4

Lemma 5.3(i) can be generalized to the case where \(T_K\in \mathbb {R}^{d,d}\) is a symmetric positive semi-definite matrix with norm not larger than 1. In this case, \(T_K\) can be written in the form \(T_K=Q^*Q\) for some \(Q\in \mathbb {R}^{d,d}\) with \(\Vert Q\Vert ^2=\Vert Q^*Q\Vert =\Vert T_K\Vert \le 1\). Thus, for every \(x,y\in \mathbb {R}^d\),

$$\begin{aligned} \Vert T_Kx +b_K -(T_Ky+b_K)\Vert ^2&= \Vert Q^*Q(x-y)\Vert ^2 \le \Vert Q(x-y)\Vert ^2 \\&=\langle Q(x-y), Q(x-y)\rangle \\&= \langle x-y, T_Kx+b_K - (T_Ky+b_K) \rangle . \end{aligned}$$

This shows that \(T_K\cdot +b_K\) is firmly nonexpansive and therefore \(\frac{1}{2}\)-averaged. Consequently, \(\Phi \) in (19) is the concatenation of K \(\frac{1}{2}\)-averaged operators with respect to the Euclidean norm. Hence, \(\Phi \) is itself \(\alpha \)-averaged with \(\alpha = \frac{K}{K+1}\).

Remark 5.5

In [13], the following NN structure was studied: Let \(\mathcal {H}_0, \ldots ,\mathcal {H}_K\) be a sequence of real Hilbert spaces and \(\mathcal {H}_0 = \mathcal {H}_K = \mathcal {H}\). Further, let \(W_k \in {\mathcal B} (\mathcal {H}_{k-1},\mathcal {H}_k)\) and \(P_k:\mathcal {H}_k \rightarrow \mathcal {H}_k\), \(k=1,\ldots ,K\) be firmly nonexpansive operators. For this case, Combettes and Pesquet [13] have posed conditions on \(W_k\) such that

$$\begin{aligned} \Psi := W_K \circ P_{K-1} \circ W_{K-1} \circ \cdots \circ W_2 \circ P_1 \circ W_1 \end{aligned}$$
(20)

is \(\alpha \)-averaged for some \(\alpha \in (1/2,1)\). For \(\mathcal {H}= \mathbb R^d\) equipped with the Euclidean norm, \(\mathcal {H}_k = \mathbb R^d\) equipped with the norm (16) and \(T_K = I_d\), our PNN \(\Phi \) has exactly the form (20) with \(P_k := \mathrm {prox}_{T_k,f_k}:\mathcal {H}_k \rightarrow \mathcal {H}_k\) and the embedding operators \(W_k:\mathcal {H}_{k-1} \hookrightarrow \mathcal {H}_k\), \(k=1,\ldots ,K\). For the special PNNs in Lemma 5.3 it holds \(W_k = I_d\), such that the conditions in [13] are fulfilled.

In the rest of this paper, we restrict our attention to matrices \(T_k:\mathbb R^{d} \rightarrow \mathbb R^{n_k}\) fulfilling

$$\begin{aligned} T_k^* T_k = I_{d} \quad \mathrm {or} \quad T_k T_k^* = I_{n_{k} }, \quad k=1,\ldots , K-1, \end{aligned}$$

and \(T_{K} = I_{d}\), i.e., the rows, resp. columns of \(T_k\) form a Parseval frame. Then, the PNN in (19) has the form

$$\begin{aligned} \mathrm {(PPNN)} \qquad \qquad \Phi \left( x;u\right) = T_K \circ \mathrm {prox}_{f_{K-1}} \circ \cdots \circ \mathrm {prox}_{f_1} (x) + b_K, \end{aligned}$$

with the “usual” proximity operator, cf. (17), and

$$\begin{aligned} f_k(x)&= g \square \bigl ( \tfrac{1}{2} \Vert \cdot \Vert _2^2 + \iota _{\mathcal {N}(T_k^*)} \bigr ) (T_k x + b_k) \quad \text{ if } \; d \le n_k,\\ f_k(x)&= g(T_k x + b_k) + \iota _{\mathcal {R}(T_k^*)}(x) \quad \text{ if } \; d \ge n_k. \end{aligned}$$

Due to the use of Parseval frames, we call these networks Parseval (frame) proximal neural networks (PPNNs). By our previous considerations, see Lemma 5.3, PPNNs are averaged operators.

Remark 5.6

An interesting result follows from convergence considerations of the cyclic proximal point algorithm, see [7]. Let \(\{\lambda _r\}_{r \in \mathbb N} \in \ell _2 \setminus \ell _1\). Then, for every \(x^{(0)} \in \mathbb R^d\), the sequence generated by

$$\begin{aligned} x^{(r+1)} := \mathrm {prox}_{\lambda _r f_{K-1}} \circ \cdots \circ \mathrm {prox}_{\lambda _r f_1} \bigl (x^{(r)}\bigr ) \end{aligned}$$
(21)

converges to a minimizer of \(f_1 + \cdots + f_{K-1}\). In particular, Theorem 3.4 implies that for orthogonal matrices \(T_k\), \(k=1,\ldots ,K-1\) and \(T_K = I_d\), \(b_K = 0\), the sequence \(\{x^{(r)}\}_{r \in \mathbb N}\) in (21) converges to

$$\begin{aligned} \hat{x} \in {\mathop {\hbox {argmin}}\limits _{x}} \sum _{k=1}^{K-1} g(T_k x - b_k). \end{aligned}$$

6 Training PPNNs on Stiefel Manifolds

In this section, we show how to train PPNNs.

Remark 6.1

According to Lemma 5.3(i), we could add more flexibility to our model by allowing tight frames instead. Then, we must train an additional scaling constant, which does not introduce difficulties in the training process and may be useful for special applications, see [2]. In our numerical experiments, we omitted the additional scaling constant as we do not want to focus on this particular issue.

In PPNNs, we assume that either \(T_k\) or \(T_k^*\), \(k=1,\ldots ,K-1\), is an element of the Stiefel manifold

$$\begin{aligned} \mathrm {St}\big ( \min (n_k,d ), \max (n_k,d)\big ), \quad k=1,\ldots , K-1. \end{aligned}$$

The following facts on Stiefel manifolds can be found, e.g., in [1]. For \(d\le n\), the (compact) Stiefel manifold is defined as

$$\begin{aligned} \mathrm {St}(d,n) := \bigl \{ T \in \mathbb R ^{n,d} : T^* T = I_d\bigr \}. \end{aligned}$$

For \(d=1\) this reduces to the sphere \(\mathbb S^{n-1}\), and for \(d=n\) we obtain the orthogonal group \(\mathrm {O}(n)\). In general, \(\mathrm {St}(d,n)\) is a manifold of dimension \(nd -\frac{1}{2} d(d+1)\) with tangent space at \(T \in \mathrm {St}(d,n)\) given by

$$\begin{aligned} {\mathcal T}_T \mathrm {St}(d,n) =\bigl \{T U + T_\perp V: U^* = -U, V\in \mathbb {R}^{n-d,d}\bigr \}, \end{aligned}$$

where the columns of \(T_{\perp }\in \mathbb {R}^{n,n-d}\) form an orthonormal basis of the orthogonal complement of \(\mathcal {R}(T)\), i.e., \(T^*_{\perp } T_{\perp } = I_{n-d}\) and \(T^*T_{\perp }=0\). The Riemannian gradient of a function on \(\mathrm {St}(d,n)\) can be obtained by the orthogonal projection of its gradient in \(\mathbb R^{n,d}\) onto the tangent space \({\mathcal T}_T \mathrm {St}(d,n)\). The orthogonal projection of \(X \in \mathbb {R}^{n,d}\) onto \({\mathcal T}_T \mathrm {St}(d,n)\) is given by

$$\begin{aligned} P_T X&= (I_n - T T^*) X + \tfrac{1}{2} T(T^* X - X^* T),\end{aligned}$$
(22)
$$\begin{aligned}&= WT, \qquad W := \hat{W}-\hat{W}^*,\quad \hat{W} := XT^*-\tfrac{1}{2} T(T^*XT^*). \end{aligned}$$
(23)

To emphasize that for fixed T the matrix W depends on X, we will also write \(W_X\). A retraction \(\mathcal {R}\) on the manifold \(\mathrm {St}(d,n)\) is a smooth mapping from the tangent bundle of \(\mathrm {St}(d,n)\) to the manifold fulfilling \({\mathcal R}_{T}(0) = T\), where 0 is the zero element in \({\mathcal T}_T \mathrm {St}(d,n)\), and with the identification \({\mathcal T}_0({\mathcal T}_{T} \mathrm {St}(d,n)) \cong {\mathcal T}_T \mathrm {St}(d,n)\) the local rigidity condition \( D {\mathcal R}_{T} (0) = \mathrm {Id}_{ {\mathcal T}_T \mathrm {St}(d,n)} \) holds true. A well-known retraction on \(\mathrm {St}(d,n)\) is

$$\begin{aligned} \tilde{\mathcal R}_T(X)=\mathrm {qf}(T+X), \quad X \in {\mathcal T}_T\mathrm {St}(d,n), \end{aligned}$$
(24)

where \(\mathrm {qf}(A)\) denotes the Q factor of the decomposition of a matrix \(A\in \mathbb {R}^{n,d}\) with linearly independent columns as \(A=QR\) with \(Q\in \mathrm {St}(d,n)\) and R an upper triangular matrix of size \(d \times d\) with strictly positive diagonal elements. The complexity of the QR decomposition using the Householder algorithm is \(2d^2(n-d/3)\), see [21]. Since the computation of the QR decomposition appears to be time consuming on a GPU, we prefer to apply another retraction, based on the Cayley transform of skew-symmetric matrices W in (23), namely

$$\begin{aligned} {\mathcal R}_T(X)=(I_n-\tfrac{1}{2} W)^{-1}(I_n +\tfrac{1}{2} W)T, \quad X \in {\mathcal T}_T\mathrm {St}(d,n), \end{aligned}$$
(25)

see [36, 48]. By straightforward computation it can be seen that \(W_X\) and \(W_{P_T X}\) coincide, so that the retraction (25) extended to the whole space \(\mathbb R^{n,d}\) fulfills

$$\begin{aligned} {\mathcal R}_T (X) = {\mathcal R}_T (P_T X), \qquad X \in \mathbb R^{n,d}. \end{aligned}$$
(26)

Remark 6.2

The retraction (25) has the drawback that it contains a matrix inversion. In our numerical algorithm, the following simple fixed point iteration is used for computing the matrix \(R = {\mathcal R}_T(X)\) with fixed T and X. By definition, R fulfills the fixed point equation

$$\begin{aligned} R = \tfrac{1}{2} W R + (I_n + \tfrac{1}{2} W) T. \end{aligned}$$
(27)

Starting with an arbitrary \(R^{(0)} \in \mathrm {St}(d,n)\), we apply the iteration

$$\begin{aligned} R^{(r+1)} := \tfrac{1}{2} W R^{(r)} + (I_n + \tfrac{1}{2} W) T, \end{aligned}$$

which converges by Banach’s fixed point theorem to the fixed point of (27) if \(\frac{1}{2} \rho (W) < 1\), where \(\rho (W)\) denotes the spectral radius of W.
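Both the retraction (25) and the fixed point iteration can be realized in a few lines; the following sketch (our own names, an ad hoc number of iterations) checks that the result lies on the Stiefel manifold and that the two computations agree.

```python
import numpy as np

def cayley_retraction(T, X, iters=None):
    """Retraction (25) at T for X, via direct solve or the fixed point iteration (27)."""
    n = T.shape[0]
    W_hat = X @ T.T - 0.5 * T @ (T.T @ X @ T.T)
    W = W_hat - W_hat.T
    if iters is None:
        return np.linalg.solve(np.eye(n) - 0.5 * W, (np.eye(n) + 0.5 * W) @ T)
    R = T.copy()
    for _ in range(iters):                      # converges if rho(W)/2 < 1
        R = 0.5 * W @ R + (np.eye(n) + 0.5 * W) @ T
    return R

rng = np.random.default_rng(5)
n, d = 7, 3
T, _ = np.linalg.qr(rng.standard_normal((n, d)))
X = 0.1 * rng.standard_normal((n, d))           # small direction, small rho(W)
R1 = cayley_retraction(T, X)
R2 = cayley_retraction(T, X, iters=50)
print(np.allclose(R1.T @ R1, np.eye(d)), np.linalg.norm(R1 - R2))
```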

We want to train a PPNN by minimizing

$$\begin{aligned} \mathcal J(u) := \sum _{i=1}^N {\ell } \bigl (\Phi (x_i;u); y_i\bigr ), \end{aligned}$$
(28)

where \({\ell }:\mathbb R^d \times \mathbb R^d \rightarrow \mathbb R\) is a loss function which is differentiable with respect to its first d variables.

Example 6.3

Let us specify two special cases of PPNNs with one layer.

  1. (i)

    For one layer without bias and componentwise soft shrinkage \(\sigma \) as activation function, i.e., summands

    $$\begin{aligned} \sum _{i=1}^N {\ell } \bigl (T_1^* \sigma (T_1 x_i);y_i \bigr ), \quad T_1 \in \mathrm {St}(d,n_1), \end{aligned}$$

    we learn Parseval frames, e.g., for denoising tasks with \(x_i\) as a noisy version of \(y_i\). Here, we want to mention the significant amount of work on dictionary learning, see [18], which starts with the same goal.

  2. (ii)

    For \(x_i = y_i\), \(i=1,\ldots ,N\), the above network could be used as a so-called auto-encoder. Again, for one layer without activation function, \(b_1 = 0\) and \({\ell }(x;y) = h(\Vert x-y\Vert )\) with some norm \(\Vert \cdot \Vert \) on \(\mathbb R^d\), we get

    $$\begin{aligned} \sum _{i=1}^N {\ell } \left( T_1 ^* T_1 x_i ;x_i \right) = \sum _{i=1}^N h\bigl ( \Vert (I_d - T_1 ^* T_1) x_i \Vert \bigr ), \quad T_1^* \in \mathrm {St}(n_1,d). \end{aligned}$$

    For the Euclidean norm and \(h(x) = x^2\) we get the classical PCA approach and for \(h(x) = x\) the robust rotationally invariant \(L_1\)-norm PCA, recently discussed in [30, 35].

The following remark points out that special cases of our PPNNs were already considered in the literature.

Remark 6.4

In [26], NNs with weight matrices \(L_k \in \mathbb R^{n_k,n_{k-1}}\), \(k \in \{1,\ldots ,K-1\}\), (or their transpose) lying in a Stiefel manifold were examined. The authors called this approach optimization over multiple dependent Stiefel manifolds (OMDSM). Indeed, for the following reasons, these NNs are special cases of our PPNNs if \(n_k \le d\) for all \(k=1,\ldots ,K-1\). In particular, this implies that the NNs considered in [26] (with appropriately chosen last layer) are averaged operators.

(i) Case \(n_{k} \le n_{k-1}\): Let \(L_k^* \in \mathrm {St}(n_k, n_{k-1})\), i.e., \(L_k L_k^* = I_{n_k}\). Choosing an arbitrary fixed \(T_{k-1} \in \mathbb R^{n_{k-1},d}\) with \(T_{k-1} T_{k-1}^* = I_{n_{k-1}}\), we want to find \(T_{k} \in \mathbb R^{n_{k},d}\) such that

$$\begin{aligned} T_{k} T_{k}^* = I_{n_k} \quad \mathrm {and} \quad L_k = T_k T_{k-1}^*. \end{aligned}$$
(29)

It is straightforward to verify that \(T_k := L_k T_{k-1}\) has the desired properties.

Note that if the transposes of \(T_k\) and \(T_{k-1}\) are in a Stiefel manifold, this does not necessarily hold for the transpose of \(T_k T_{k-1}^*\). Therefore, our PPNNs are more general.

(ii) Case \(n_{k-1}< n_k\): Let \(L_k \in \mathrm {St}(n_{k-1},n_k)\), i.e., \(L_k^* L_k = I_{n_{k-1}}\). For an arbitrary fixed \(T_{k-1} \in \mathbb R^{n_{k-1},d}\) with \(T_{k-1} T_{k-1}^* = I_{n_{k-1}}\), we want to find \(T_{k} \in \mathbb R^{n_{k},d}\) fulfilling (29). To this end, we complete \(L_k\) to an orthogonal matrix \(\tilde{L}_k \in \mathbb {R}^{n_k,n_k}\) and \(T_{k-1}\) to a matrix \(\tilde{T}_{k-1}\in \mathbb {R}^{n_k,d}\) with orthogonal rows. By straightforward computation we verify that \(T_k := \tilde{L}_k \tilde{T}_{k-1}\) satisfies (29) such that this case also fits into our PPNN framework.

We apply a stochastic gradient descent algorithm on the Stiefel manifold to find a minimizer of (28). To this end, we compute the Euclidean gradient with respect to one layer and apply the usual backpropagation for multiple layers.

Lemma 6.5

Let

$$\begin{aligned} J(T,b) := \ell \bigl (T^* \sigma (Tx+b);y\bigr ), \end{aligned}$$

where T or \(T^*\) is in \(\mathrm {St}(d,n)\), and \(\ell \) and \(\sigma \) are differentiable. Set

$$\begin{aligned} r := \sigma (Tx+b),\quad s := T^*\sigma (Tx+b), \quad t := \nabla \ell \bigl (T^* \sigma (Tx+b);y\bigr ), \quad \Sigma := \mathrm {diag}\bigl (\sigma '(Tx + b) \bigr ), \end{aligned}$$

where the gradient of \(\ell \) is taken with respect to the first d variables. Then, the Euclidean gradients are given by

$$\begin{aligned} \nabla _T J(T,b)&= -T (t s^* + s t^*) + r t^* + \Sigma \, T t x^*,\qquad \nabla _b J(T,b) = \Sigma \, T t. \end{aligned}$$

The proof follows by straightforward computations, which are carried out in the Appendix.
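As an illustration, the following minimal NumPy sketch evaluates the gradient formulas of Lemma 6.5 for the componentwise soft shrinkage activation and the quadratic loss used in Section 7; the function names and this particular choice of \(\ell\) are our own.

```python
import numpy as np

def soft(x, lam):
    """Componentwise soft shrinkage S_lambda."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def soft_prime(x, lam):
    """Derivative of S_lambda (defined almost everywhere)."""
    return (np.abs(x) > lam).astype(float)

def gradients(T, b, x, y, lam=0.1):
    """Euclidean gradients of J(T, b) = l(T^* sigma(Tx + b); y) as stated in Lemma 6.5,
    here with sigma = S_lambda and the quadratic loss l(s; y) = ||s - y||_2^2,
    so that t = nabla l(s; y) = 2 (s - y)."""
    z = T @ x + b
    r = soft(z, lam)                      # r = sigma(Tx + b)
    s = T.T @ r                           # s = T^* sigma(Tx + b)
    t = 2.0 * (s - y)                     # t = nabla l(s; y)
    Sigma = np.diag(soft_prime(z, lam))   # Sigma = diag(sigma'(Tx + b))
    grad_T = (-T @ (np.outer(t, s) + np.outer(s, t))
              + np.outer(r, t) + Sigma @ np.outer(T @ t, x))
    grad_b = Sigma @ (T @ t)
    return grad_T, grad_b
```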

Now, we can formulate the stochastic gradient descent (SGD) for \(\mathcal J\) as in Algorithm 1. This algorithm works for an arbitrary retraction, in particular for the retraction in (24). In our numerical computations, we use the special retraction (25) in connection with the iteration scheme. Then, by (26), the projection in step 3 of the algorithm can be skipped and the retraction can be applied directly to the Euclidean gradient.

[Algorithm 1]
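To make the structure of Algorithm 1 concrete, the following minimal NumPy sketch implements the per-batch update for a single layer. All names are ours, and the QR-based retraction is only one possible choice; the paper uses the retraction (25), for which the projection step can be skipped.

```python
import numpy as np

def qr_retraction(T, X):
    """One possible retraction: the Q factor of T + X (an assumption, not necessarily
    the retraction (25) used in the paper)."""
    Q, R = np.linalg.qr(T + X)
    return Q * np.where(np.diag(R) < 0.0, -1.0, 1.0)

def sgd_stiefel(T, b, data, grad_fn, retract, lr, batch_size, epochs, rng):
    """Sketch of the SGD scheme of Algorithm 1 for a single layer (T, b): average the
    Euclidean gradients of Lemma 6.5 over a batch, take a gradient step, and map the
    weight matrix back to the Stiefel manifold with a retraction."""
    N = len(data)
    for _ in range(epochs):
        perm = rng.permutation(N)
        for start in range(0, N, batch_size):
            batch = [data[i] for i in perm[start:start + batch_size]]
            GT, Gb = np.zeros_like(T), np.zeros_like(b)
            for x, y in batch:
                gT, gb = grad_fn(T, b, x, y)
                GT += gT
                Gb += gb
            T = retract(T, -lr * GT / len(batch))  # retraction of the gradient step
            b = b - lr * Gb / len(batch)           # plain Euclidean update for the bias
    return T, b
```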

7 Some Numerical Results

In this section, we present simple numerical results to get a first impression of the performance of PPNNs for denoising and classification. More sophisticated examples, which include the full repertoire of fine-tuning of NNs, will follow in an experimental paper, see also our conclusions. Throughout this section, we use the quadratic loss function \(\ell (x;y) := \Vert x-y\Vert _2^2\). For training, we apply a stochastic gradient descent algorithm. We initialize the matrices \(T_k\in \mathrm {St}(d,n)\) randomly using the orthogonal initializer from TensorFlow. That is, we generate a matrix \(\tilde{T}_k\in \mathbb {R}^{n,d}\) with independent random entries following the standard normal distribution and use the initialization \(T_k=\mathrm {qf}(\tilde{T}_k)\). The batch size and the learning rate are specified separately for each example.
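In code, this initialization amounts to the following minimal sketch; the sign convention of qf (nonnegative diagonal of the triangular factor) is our assumption, and the sizes match the frame experiment below.

```python
import numpy as np

def qf(A):
    """Q factor of the reduced QR decomposition; we assume the usual sign convention
    that the diagonal of the triangular factor is nonnegative."""
    Q, R = np.linalg.qr(A)
    return Q * np.where(np.diag(R) < 0.0, -1.0, 1.0)

rng = np.random.default_rng(0)
T1 = qf(rng.standard_normal((1024, 128)))    # T_1 = qf(T~_1), sizes as in the frame experiment
assert np.allclose(T1.T @ T1, np.eye(128))   # T_1^T T_1 = I_128, i.e., T_1 in St(128, 1024)
```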

Denoising In this experiment, we compare PPNNs with Haar wavelet thresholding, both for the discrete Haar basis and for Haar frames arising from the undecimated (translation invariant) version of the Haar transform. In particular, the experiment is linked to the starting point of our considerations, namely wavelet and frame shrinkage. For further details on the corresponding filter banks we refer to [15, 37]. As a quality measure for our experiments, we choose the average peak signal-to-noise ratio (PSNR) over the test set. Recall that for a prediction \(x\in \mathbb {R}^{n}\) and ground truth \(y\in \mathbb {R}^{n}\), the PSNR is defined by

$$\begin{aligned} \mathrm {PSNR}(x,y)=10\log _{10}\Big (\frac{(\max y - \min y)^2}{\sum _{i=1}^{n} (x_i-y_i)^2}\Big ). \end{aligned}$$
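In code, this quality measure reads as follows (a direct transcription of the formula above, with x and y as NumPy arrays).

```python
import numpy as np

def psnr(x, y):
    """PSNR of a prediction x with respect to the ground truth y, as defined above."""
    return 10.0 * np.log10((y.max() - y.min()) ** 2 / np.sum((x - y) ** 2))
```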

Since we focus on the Haar filter, we restrict our attention to piecewise constant signals with mean 0. By \((x_i,y_i)\in \mathbb {R}^{d}\times \mathbb {R}^{d}\), \(i=1,\ldots ,N\), we denote pairs consisting of piecewise constant signals \(y_i\) of length \(d=2^7=128\) and their noisy versions \(x_i=y_i+\epsilon _i\), where \(\epsilon _i\) is white noise with standard deviation \(\sigma = 0.1\). For the signal generation, we choose

  • The number of constant parts of \(y_i\) as \(\max \{2,t_i\}\), where \(t_i\) is a realization of a random variable following the Poisson distribution with mean 5;

  • The positions of the discontinuities of \(y_i\) as realizations of a uniform distribution;

  • The signal intensity of \(y_i\) on each constant part as a realization of the standard normal distribution, where we finally subtract the mean of the signal.

Using this procedure, we generate training data \((x_i,y_i)_{i=1}^N\) and test data \((x_i,y_i)_{i=N+1}^{N+N_\text {test}}\) with \(N=500{,}000\) and \(N_\text {test}=1000\). The average PSNR of the noisy signals in the test set is 25.22. We use PPNNs with \(K-1 \in \{1, 2, 3\}\) layers and set \(T_K=I_d\) and \(b_K=0\). In all examples, a batch size of 32 and a learning rate of 0.5 are used.
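The following minimal NumPy sketch illustrates the data generation just described; implementation details such as drawing the discontinuity positions on the integer grid and merging coinciding breakpoints are our own choices.

```python
import numpy as np

def make_pair(rng, d=128, noise_std=0.1):
    """One pair (x, y): a piecewise constant signal y of length d with mean 0 and its
    noisy version x = y + eps."""
    parts = max(2, rng.poisson(5))                        # number of constant parts
    cuts = np.sort(rng.integers(1, d, size=parts - 1))    # uniformly drawn discontinuities
    values = rng.standard_normal(parts)                   # one intensity per constant part
    y = np.repeat(values, np.diff(np.r_[0, cuts, d]))     # piecewise constant signal
    y -= y.mean()                                         # subtract the mean
    x = y + noise_std * rng.standard_normal(d)            # white noise with sigma = 0.1
    return x, y

rng = np.random.default_rng(0)
train = [make_pair(rng) for _ in range(1000)]             # 500,000 pairs are used in the paper
```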

We are interested in two different settings:

1. Learned orthogonal matrices versus Haar basis. First, we consider PPNNs with 128 neurons in each hidden layer and the componentwise soft shrinkage \(S_\lambda \) as activation function. In particular, all matrices \(T_k\) have to be orthogonal. The denoising results of our learned PPNNs are compared with soft wavelet shrinkage with respect to the discrete orthogonal Haar basis in \(\mathbb R^{128}\), i.e., the signal is decomposed over all 6 scales by

$$\begin{aligned} \Psi (x) = H^* S_{\lambda }(H x), \end{aligned}$$
(30)

where \( H := H_2 \, \cdots \, H_7 \) with matrices

$$\begin{aligned} H_{j} := \begin{pmatrix} \tilde{H}_j & 0\\ 0 & I_{2^7 - 2^{j}} \end{pmatrix}, \quad \tilde{H}_j := \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & 1 & 0 & 0 & \ldots & 0 & 0 & 0 & 0\\ 0 & 0 & 1 & 1 & \ldots & 0 & 0 & 0 & 0\\ & & & & \vdots & & & & \\ 0 & 0 & 0 & 0 & \ldots & 1 & 1 & 0 & 0\\ 0 & 0 & 0 & 0 & \ldots & 0 & 0 & 1 & 1\\ 1 & -1 & 0 & 0 & \ldots & 0 & 0 & 0 & 0\\ 0 & 0 & 1 & -1 & \ldots & 0 & 0 & 0 & 0\\ & & & & \vdots & & & & \\ 0 & 0 & 0 & 0 & \ldots & 1 & -1 & 0 & 0\\ 0 & 0 & 0 & 0 & \ldots & 0 & 0 & 1 & -1 \end{pmatrix} \in \mathbb R^{2^j,2^j}. \end{aligned}$$
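For illustration, a minimal NumPy sketch that builds \(H = H_2 \cdots H_7\) from the blocks \(\tilde H_j\) and applies the shrinkage (30); the threshold value and the placeholder signal are only examples.

```python
import numpy as np

def haar_block(j):
    """The 2^j x 2^j matrix H~_j: pairwise averages on top, pairwise differences below."""
    m = 2 ** (j - 1)
    A = np.zeros((2 * m, 2 * m))
    for k in range(m):
        A[k, 2 * k] = A[k, 2 * k + 1] = 1.0     # averaging rows
        A[m + k, 2 * k] = 1.0                   # difference rows
        A[m + k, 2 * k + 1] = -1.0
    return A / np.sqrt(2.0)

def haar_matrix(d=128):
    """H = H_2 ... H_7 with H_j = blockdiag(H~_j, I_{d - 2^j}); H_7 is applied first."""
    H = np.eye(d)
    for j in range(7, 1, -1):
        Hj = np.eye(d)
        Hj[:2 ** j, :2 ** j] = haar_block(j)
        H = Hj @ H
    return H

def soft(x, lam):
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

H = haar_matrix()
x = np.random.default_rng(0).standard_normal(128)   # placeholder signal
x_denoised = H.T @ soft(H @ x, 0.15)                 # Psi(x) = H^* S_lambda(H x), cf. (30)
```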

The average PSNRs on the test data are given in Table 1. For determining the optimal threshold in \(S_\lambda \), we implemented two different methods. The first one is 5-fold cross validation (CV): the training data is divided into 5 subsets, and each subset is used once as a test set with the remaining samples as training set; the test loss for a given \(\lambda \) is averaged over all 5 trials to judge the quality of the model. The tested parameters \(\lambda \) are chosen in [0.05, 0.3] with steps of 0.05 for a NN with one layer; for NNs with two and three layers, the tested parameters are divided by 2 and 3, respectively. The second method is to treat \(\lambda \) as a trainable variable of the neural network and to optimize it via stochastic gradient descent (SGD) during the training process. It appears that with only one hidden layer, Haar wavelet shrinkage is still better than the learned orthogonal matrix. If we increase the number of layers, then PPNNs lead to a better average PSNR.

Two exemplary noisy signals and their denoised versions are shown in Fig. 2. Since we have learned the orthogonal matrices \(T_k\) with respect to the quadratic loss function, the visual quality of the PPNN-denoised signals is clearly not satisfactory even though the PSNR improves. The visual impression of the signals denoised by Haar wavelet shrinkage can be improved (smoother signals) by increasing the threshold to \(\lambda = 0.3\), at the price of a worse PSNR. To achieve a similar behavior with orthogonal matrices learned by PPNNs, we would have to choose a different loss function.

Table 1 PSNRs (average on test data) for denoising piecewise constant signals
Fig. 2 Two denoising examples using the Haar basis and learned orthogonal matrices. The signals denoised by PPNNs look better for an increasing number of layers. For Haar wavelet shrinkage, a smoother denoised signal can be attained by increasing the threshold to \(\lambda = 0.3\), although this signal has a smaller PSNR

2. Learned Stiefel matrices versus Haar frame. Haar wavelet shrinkage can be improved by using Haar wavelet frames within a so-called “algorithme à trous”, see [32]. We apply a method similar to (30), but with a rectangular matrix H whose rows form a Haar frame. More precisely, the Haar filter is used without subsampling. This results, in contrast to the original Haar transform, in a translation invariant multiscale transform. Instead of the matrix \(H_7\), the nonsubsampled (convolution) matrix

$$\begin{aligned} \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & 1 & 0 & 0 & \ldots & 0 & 0 & 0 & 0\\ 0 & 1 & 1 & 0 & \ldots & 0 & 0 & 0 & 0\\ & & & & \vdots & & & & \\ 0 & 0 & 0 & 0 & \ldots & 0 & 1 & 1 & 0\\ 0 & 0 & 0 & 0 & \ldots & 0 & 0 & 1 & 1\\ 1 & 0 & 0 & 0 & \ldots & 0 & 0 & 0 & 1\\ 1 & -1 & 0 & 0 & \ldots & 0 & 0 & 0 & 0\\ 0 & 1 & -1 & 0 & \ldots & 0 & 0 & 0 & 0\\ & & & & \vdots & & & & \\ 0 & 0 & 0 & 0 & \ldots & 0 & 1 & -1 & 0\\ 0 & 0 & 0 & 0 & \ldots & 0 & 0 & 1 & -1\\ -1 & 0 & 0 & 0 & \ldots & 0 & 0 & 0 & 1 \end{pmatrix} \in \mathbb R^{256,128} \end{aligned}$$

is applied to the original signal. Note that we assume a periodic continuation of the signal. In each of the following steps \(j=1, \ldots ,6\), we keep the lower part and transform the upper smoothed part again by essentially the same matrix, where \(2^{j}-1\) zeros are inserted between the filter coefficients. The output signal has size \(8\cdot 128\), where the last part of the output is just the averaged signal, which equals zero for our mean-free signals. Overall, the original signal is multiplied by \(H\in \mathbb {R}^{1024, 128}\), where for \(j\in \{0,\ldots ,6\}\) and \(i\in \{0,\ldots ,128-2j\}\) the \((i+128 j)\)-th row is given by one element of a Haar wavelet frame

$$\begin{aligned} \bigl (\, \underbrace{0\;\ldots \;0}_{i \text { times } 0} \;\; \underbrace{1\;\ldots \;1}_{j \text { times } 1} \;\; \underbrace{-1\;\ldots \;-1}_{j \text { times } -1} \;\; \underbrace{0\;\ldots \;0}_{128-i-2j \text { times } 0} \,\bigr ) \end{aligned}$$

and the last 128 rows of \(H\) are given by \((1 \;\ldots \;1)\). It is well known that for the above translation invariant Haar frame transform, a scale-dependent shrinkage has to be applied: starting with the threshold \(\lambda \), the scales should be thresholded by

$$\begin{aligned} \frac{1}{\sqrt{2}^{j}}\lambda , \quad j=0,\ldots ,6. \end{aligned}$$
(31)

For an explanation of this statement we refer to [43]. In summary, we obtain

$$\begin{aligned} \hat{\Psi }(x)=H ^*\tilde{S}_\lambda (H x), \end{aligned}$$

where \(\tilde{S}_\lambda \) denotes the scale-wise adapted thresholding.
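As an illustration, a minimal sketch of \(\tilde S_\lambda\), assuming that the coefficient vector \(Hx\) is organized in blocks of 128 entries per scale \(j = 0,\ldots,6\) as described above, followed by the 128 averaging coefficients, which we leave untouched since they vanish for mean-free signals.

```python
import numpy as np

def scale_adapted_shrinkage(c, lam, d=128):
    """S~_lambda: soft shrinkage with threshold lam / sqrt(2)^j on the j-th block of d
    coefficients, j = 0,...,6, cf. (31); the last d (averaging) coefficients are kept."""
    out = c.astype(float).copy()
    for j in range(7):
        lam_j = lam / np.sqrt(2.0) ** j
        block = out[j * d:(j + 1) * d]
        out[j * d:(j + 1) * d] = np.sign(block) * np.maximum(np.abs(block) - lam_j, 0.0)
    return out

# hat Psi(x) = H^T S~_lambda(H x), where H is the 1024 x 128 frame matrix described above
# (H_frame is a hypothetical name for that matrix):
# x_denoised = H_frame.T @ scale_adapted_shrinkage(H_frame @ x, lam=0.05)
```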

Now, we compare this scale-dependent Haar frame soft thresholding method with a learned PPNN with 1024 neurons in the hidden layers and the componentwise soft shrinkage \(S_\lambda \) as activation function. The optimal threshold in \(S_\lambda \) is again determined either by threefold cross validation, where the tested parameters are chosen in [0.01, 0.1] with steps of 0.01, or by treating \(\lambda \) as a trainable variable of the neural network and optimizing it via stochastic gradient descent. We emphasize that, in contrast to the Haar frame shrinkage procedure with (31), the same threshold is used for every component in the activation function of our PPNN. The resulting PSNRs are given in Table 2. As expected, using the same threshold on all scales in the classical Haar frame shrinkage performs worse than the scale-adapted Haar frame shrinkage. PPNNs with learned Stiefel matrices perform better for an increasing number of layers.

Finally, Fig. 3 shows the denoised versions of the signals from Fig. 2. The results are visually better than in the previous figure, although still not satisfactory due to the loss function used.

Table 2 PSNRs (average on test data) for denoising piecewise constant signals
Fig. 3 Denoising for the signals in Fig. 2. The undecimated Haar frame with scale-adapted shrinkage and learned Stiefel matrices of the same size as the Haar frame are compared for \(\lambda \) from Table 2. The PPNN-denoised signals are visually nicer than those obtained with scale-adapted Haar frame shrinkage

Classification In this example, we train a PPNN for classifying the MNIST data set. The length of the input signals is \(d=28^2\). We consider a PPNN with \(K-1 =5\) layers with \(n_1=n_2=784\), \(n_3=n_4=400\) and \(n_5=200\) neurons, and the componentwise applied ReLU activation function \(\sigma (x) = \max (0,x)\). To get 10 output elements (probabilities) in (0, 1), we use an additional sixth layer

$$\begin{aligned} g (T_K x + b_K), \quad T_K \in \mathbb R^{10,d}, \, b_K \in \mathbb R^{10} \end{aligned}$$

with another activation function \(g(x) := \frac{1}{1 + \exp (-x)}\), the sigmoid function. For training, we use a batch size of 1024 and a learning rate of 5. After 1000 epochs, we reach an accuracy of 0.9855 on the test set. One epoch takes about one second on an NVIDIA Quadro M5000 GPU. In Fig. 4, the training and test loss of our PPNN during training are plotted.
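A minimal NumPy sketch of the forward pass of this classification network follows; the random initialization is for illustration only (the paper trains the weights with SGD on the Stiefel manifold), and all names are ours.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ppnn_classifier(x, Ts, bs, T_K, b_K):
    """Forward pass: five layers x -> T_k^T relu(T_k x + b_k) with T_k T_k^T = I_{n_k},
    followed by the final layer g(T_K x + b_K) with the sigmoid g."""
    for T, b in zip(Ts, bs):
        x = T.T @ relu(T @ x + b)      # every PPNN layer maps R^d back to R^d
    return sigmoid(T_K @ x + b_K)      # ten outputs in (0, 1)

rng = np.random.default_rng(0)
d, ns = 784, [784, 784, 400, 400, 200]
Ts = [np.linalg.qr(rng.standard_normal((d, n)))[0].T for n in ns]   # random Stiefel matrices
bs = [np.zeros(n) for n in ns]
probs = ppnn_classifier(rng.standard_normal(d), Ts, bs,
                        0.01 * rng.standard_normal((10, d)), np.zeros(10))
```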

Remark 7.1

As already mentioned in Remark 6.4, NNs with Stiefel matrices were also applied in [26]. The authors of [26] reported that the training process using Riemannian optimization on the Stiefel manifold could be unstable or divergent. We do not observe such instabilities in our setting.

Fig. 4 Training loss (solid) and test loss (dashed) of a PPNN on the MNIST data set. The x-axis corresponds to the number of epochs and the y-axis to the associated loss value

Adversarial Attacks Neural networks with bounded Lipschitz constants have been successfully applied to defend against adversarial attacks, see [22, 46]. In this example, we demonstrate that a PPNN is more robust under adversarial attacks than a standard neural network. Assume that we have a neural network \(f=(f_1,\ldots ,f_{10}):\mathbb {R}^{28^2}\rightarrow (0,1)^{10}\) for classifying MNIST, e.g., the PPNN described in the previous example, and let an input \(x\in \mathbb {R}^{28^2}\) be given. Then, we perform an adversarial attack in the following way (a code sketch follows the list):

  • Set \(\nu :={\mathop {\hbox {argmax}}\nolimits _{i\in \{1,\ldots ,10\}}} f_i(x)\), and \(g := \nabla _x \tfrac{f_\nu (x)}{\Vert f(x)\Vert _1}\).

  • Initialize \(\epsilon = 10^{-2}\) and while \(\nu ={\mathop {\hbox {argmax}}\nolimits _{i\in \{1,\ldots ,10\}}} f_i(x-\epsilon g)\) update \(\epsilon =2\epsilon \).
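A minimal sketch of this attack loop; f and grad_f stand for the network and for the gradient of \(f_\nu(x)/\Vert f(x)\Vert_1\) with respect to \(x\), which in practice would be obtained by automatic differentiation, and the iteration cap is our own safeguard.

```python
import numpy as np

def attack(f, grad_f, x, eps0=1e-2, max_doublings=30):
    """Adversarial attack as described above: move against the normalized score gradient
    and double epsilon until the predicted class changes."""
    nu = int(np.argmax(f(x)))
    g = grad_f(x, nu)                                # gradient of f_nu(x) / ||f(x)||_1
    eps = eps0
    for _ in range(max_doublings):
        if int(np.argmax(f(x - eps * g))) != nu:
            break                                    # prediction has changed
        eps *= 2.0
    return eps * g, float(np.linalg.norm(eps * g))   # perturbation and the recorded norm
```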

This procedure is applied to two neural networks. More precisely, the first one is the PPNN from the previous example and the second one is a neural network with the same structure as the PPNN, but without the orthogonality constraint. We train the standard neural network using the Adam optimizer with a learning rate of \(10^{-4}\) and end up with an accuracy of 0.9863. Then, we perform an adversarial attack on both of these networks and record the norm \(\Vert \epsilon g\Vert _2\) of the noise that changes the prediction. We do this for each input \(x_k\in [0,255]^{28^2}\), \(k=1,\ldots ,10{,}000\), in the test set and compute the mean, standard deviation and median of these norms. For the PPNN, we record an average norm of \(38.28\pm 24.51\) and a median of 33.71. For the standard neural network, we record an average norm of \(30.48\pm 15.72\) and a median of 28.75. Overall, the PPNN appears to be more stable against such adversarial attacks.

8 Conclusions

In this paper, we have shown that for real Hilbert spaces \(\mathcal {H}\) and \(\mathcal {K}\), a proximity operator \(\mathrm {Prox}:\mathcal {K}\rightarrow \mathcal {K}\) and a bounded linear operator \(T:\mathcal {H}\rightarrow \mathcal {K}\) with closed range, the operator \(T^\dagger \, \mathrm {Prox}(T \cdot + b)\) with \(b \in \mathcal {K}\) is a proximity operator on \(\mathcal {H}\). As a consequence, the well-known frame soft shrinkage operator can be seen as a proximity operator. Using these new relations, we have discussed special neural networks arising from Parseval frames and stable activation functions. Our networks are Lipschitz networks, which are moreover averaged operators. They include recently proposed networks containing matrices whose transposes lie in a Stiefel manifold and interpret them from another, more general point of view.

In our future work, we want to explore for which learning tasks the higher flexibility of our PPNNs is advantageous. Taking more general operators T into account may also be useful. In particular, we will apply our PNNs within Plug-and-Play algorithms. Another question that we want to address is how to constrain our Stiefel matrices further, e.g., towards convolutional structures or sparsity constraints. Depending on the application, we have to design appropriate loss functions as well as incorporate regularizing terms.

Table 3 Stable activation functions and their corresponding proximal mappings, see [13]

For our experiments, the stochastic gradient algorithm on Stiefel manifolds worked well. However, other minimization methods could be taken into account. In [26], for example, the authors proposed an orthogonal weight normalization algorithm that was inspired by the fact that the eigenvalue decomposition is differentiable. Finally, we would like to mention that a proximal backpropagation algorithm, which takes implicit instead of explicit gradient steps to update the network parameters during training, was proposed in [19].

A better understanding of the convergence of the cyclic proximal point algorithm, see Remark 5.6, as well as suitable early stopping criteria when the network \(\Phi \) is applied iteratively, may help to design NNs and to understand their success.