Parseval Proximal Neural Networks

The aim of this paper is twofold. First, we show that a certain concatenation of a proximity operator with an affine operator is again a proximity operator on a suitable Hilbert space. Second, we use our findings to establish so-called proximal neural networks (PNNs) and stable Parseval (frame) proximal neural networks (PPNNs). Let $\mathcal{H}$ and $\mathcal{K}$ be real Hilbert spaces, $b \in \mathcal{K}$ and $T \in \mathcal{B} (\mathcal{H},\mathcal{K})$ a linear operator with closed range and Moore-Penrose inverse $T^\dagger$. Based on the well-known characterization of proximity operators by Moreau, we prove that for any proximity operator $\text{Prox} \colon \mathcal{K} \to \mathcal{K}$ the operator $T^\dagger \, \text{Prox} ( T \cdot + b)$ is a proximity operator on $\mathcal{H}$ equipped with a suitable norm. In particular, it follows for the frequently applied soft shrinkage operator $\text{Prox} = S_{\lambda}\colon \ell_2 \rightarrow \ell_2$ and any frame analysis operator $T\colon \mathcal{H} \to \ell_2$ that the frame shrinkage operator $T^\dagger\, S_\lambda\, T$ is a proximity operator on a suitable Hilbert space. Further, the concatenation of proximity operators on $\mathbb R^d$ equipped with different norms establishes a PNN. If the network arises from Parseval frame analysis or synthesis operators, it forms an averaged operator, called a PPNN. The involved linear operators, or their transposes, lie in a Stiefel manifold, so that minimization methods on Stiefel manifolds can be applied for training such networks. Finally, some proof-of-concept examples demonstrate the performance of PPNNs.


Introduction
Wavelet and frame shrinkage operators have become very popular in recent years. A certain starting point was the iterative shrinkage-thresholding algorithm (ISTA) in [15], which was interpreted as a special case of the forward-backward algorithm in [13]. For relations with other algorithms see also [7,33]. Let T ∈ R^{n×d}, n ≥ d, have full column rank. Then, the problem

argmin_{y ∈ R^d} ½ ‖x − y‖²₂ + λ ‖T y‖₁,  λ > 0,  (1)

is known as the analysis point of view. For orthogonal T ∈ R^{d×d}, the solution of (1) is given by the frame soft shrinkage operator T† S_λ T = T* S_λ T. If T ∈ R^{n×d} with n ≤ d and T T* = I_n, the solution of problem (1) is given by (I_d − T* T) + T* S_λ T, see [5, Theorem 6.15]. For arbitrary T ∈ R^{n×d}, n ≥ d, there is no analytic expression for the solution of (1) in the literature. The question whether the frame shrinkage operator can itself be seen as a proximity operator has recently been studied in [19]. The authors showed that the set-valued operator (T† S_λ T)^{-1} − I_d is maximally cyclically monotone, which implies that it is a proximity operator with respect to some norm in R^d. In this paper, we prove that for any operator T ∈ B(H, K) with closed range, b ∈ K and any proximity operator Prox : K → K, the new operator T† Prox(T · + b) : H → H is also a proximity operator on the linear space H, but equipped with another inner product. The above mentioned finite dimensional setting is included as a special case. In contrast to [19], we directly approach the problem using a classical result of Moreau [28]. Moreover, we provide the function defining the proximity operator. Here, we would like to mention that this function can also be deduced from Proposition 3.9 in [11]. However, since this deduction appears to be more space-consuming than the direct proof of our Theorem 3.4, we prefer to give a direct approach.
There are several fields where our results may be of interest. Different norms in the definition of the proximity operator were successfully used in variable metric algorithms, see [9]. Our findings may also be of interest in connection with so-called Plug-and-Play algorithms [8,34,36].
Recently, it was shown that many activation functions appearing in neural networks are indeed proximity operators [12]. Based on these observations and our previous findings, we consider neural networks which are concatenations of proximity operators and call them proximal neural networks. For stability reasons, PNNs related to linear operators from Stiefel manifolds are of special interest. They form so-called averaged operators and are consequently nonexpansive. We refer to these networks as Parseval (frame) proximal neural networks (PPNNs). Orthogonal matrices have already shown advantages in training recurrent neural networks (RNNs) [2,3,24,26,37,39]. Using orthogonal matrices, vanishing or exploding gradients in learning RNNs can be avoided [16]. The more general setting of learning rectangular matrices from a Stiefel manifold was proposed, e.g., in [22], but with a different focus than in this paper. The most relevant paper with respect to our setting is [23], where the authors considered the so-called optimization over multiple dependent Stiefel manifolds (OMDSM). We will see that the NNs in [23] are special cases of our PNNs, so that our analysis ensures that they are averaged operators.
Our paper is organized as follows: We begin with preliminaries on convex analysis in Hilbert spaces in Section 2. In Section 3, we prove our general results on the interplay between proximity and certain affine operators. As a special case, we show in Section 4 that the frame soft shrinkage operator is itself a proximity operator. In Section 5, we use our findings to set up neural networks as concatenations of proximity operators on R^d equipped with different norms related to linear operators. If these operators or their transposes are in a Stiefel manifold, which is equivalent to the property that their columns or rows form a Parseval frame in the Euclidean space, then our proposed network is actually an averaged operator. Section 6 deals with the training of PPNNs by stochastic gradient descent on Stiefel manifolds. In Section 7, we provide first numerical examples. Finally, Section 8 contains conclusions and addresses further research questions.

Preliminaries
Let H be a real Hilbert space with inner product ⟨·,·⟩ and norm ‖·‖. By Γ₀(H) we denote the set of proper, convex, lower semi-continuous functions on H mapping into (−∞, ∞]. For f ∈ Γ₀(H) and λ > 0, the proximity operator prox_{λf} : H → H and its Moreau envelope M_{λf} : H → R are defined by

prox_{λf}(x) := argmin_{y ∈ H} { ½ ‖x − y‖² + λ f(y) } and M_{λf}(x) := min_{y ∈ H} { ½ ‖x − y‖² + λ f(y) }.

Clearly, the proximity operator and its Moreau envelope depend on the underlying space H, in particular on the chosen inner product. Recall that an operator A : H → H is called firmly nonexpansive if for all x, y ∈ H the following relation is fulfilled:

‖Ax − Ay‖² + ‖(Id − A)x − (Id − A)y‖² ≤ ‖x − y‖²,

or, equivalently, ⟨Ax − Ay, x − y⟩ ≥ ‖Ax − Ay‖². Obviously, firmly nonexpansive operators are nonexpansive.
For a Fréchet differentiable function Φ : H → R, the gradient ∇Φ(x) ∈ H at x ∈ H is defined as the vector satisfying

⟨∇Φ(x), h⟩ = DΦ(x)(h) for all h ∈ H,  (2)

where DΦ : H → B(H, R) denotes the Fréchet derivative of Φ, i.e., for all x, h ∈ H,

lim_{‖h‖→0} |Φ(x + h) − Φ(x) − DΦ(x)(h)| / ‖h‖ = 0.

Note that the gradient crucially depends on the chosen inner product in H. The following results can be found, e.g., in [4, Props. 12.27, 12.29].
Theorem 2.2. The operator Prox : H → H is a proximity operator if and only if it is nonexpansive and there exists a function Ψ ∈ Γ 0 (H), such that for any x ∈ H we have Prox(x) ∈ ∂Ψ(x), where ∂Ψ denotes the subdifferential of Ψ.
Thanks to (2), we conclude that Prox : H → H is a proximity operator if and only if it is nonexpansive and the gradient of a convex, differentiable function Φ : H → R. Note that recently, the characterization of Bregman proximity operators in a more general setting was discussed in [21]. In the following example, we recall the Moreau envelope and the proximity operator related to the soft thresholding operator.
For H = R and f := |·|, the proximity operator prox_{λf} is the soft shrinkage (soft thresholding) operator

S_λ(x) := sgn(x) max{|x| − λ, 0},

and the Moreau envelope is the Huber function

M_{λf}(x) = ½ x² if |x| ≤ λ, and M_{λf}(x) = λ|x| − ½ λ² otherwise.

For H = R^d and f(x) := ‖x‖₁, we can just use a componentwise approach. Then S_λ is defined componentwise, and the Moreau envelope reads as the sum of the componentwise Huber functions.
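As a numerical sanity check, soft shrinkage and its Huber envelope can be implemented in a few lines. The following Python sketch (the helper names soft_shrink and huber are ours, not notation from the text) also allows comparing the envelope against a brute-force minimization of the defining problem.

```python
import numpy as np

def soft_shrink(x, lam):
    """Componentwise soft shrinkage S_lambda, the proximity operator of lam*|.|."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def huber(x, lam):
    """Moreau envelope of lam*|.| (the Huber function), evaluated componentwise."""
    return np.where(np.abs(x) <= lam, 0.5 * x**2, lam * np.abs(x) - 0.5 * lam**2)
```

Evaluating the defining minimum on a fine grid reproduces the Huber values, illustrating that the envelope is the optimal value of the proximal problem.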

Interplay Between Proximity and Linear Operators
Let H and K be real Hilbert spaces with inner products ⟨·,·⟩_H and ⟨·,·⟩_K and corresponding norms ‖·‖_H and ‖·‖_K, respectively. By B(H, K) we denote the space of bounded linear operators from H to K. The kernel and the range of T are denoted by N(T) and R(T), respectively. In this section, we show that for any operator T ∈ B(H, K) with closed range R(T), b ∈ K and any proximity operator Prox : K → K, the operator T† Prox(T · + b) : H → H is itself a proximity operator on the linear space H equipped with a suitable (equivalent) norm ‖·‖_{H_T}, i.e., there exists a function f ∈ Γ₀(H) such that T† Prox(T · + b) = prox_f with respect to this norm.

Throughout this section, let T ∈ B(H, K) have closed range. Then the same holds true for its adjoint T*, and the following (orthogonal) decompositions hold:

H = N(T) ⊕ R(T*),  K = N(T*) ⊕ R(T).  (3)

The Moore–Penrose inverse (generalized inverse, pseudo-inverse) T† ∈ B(K, H) is given pointwise as the unique minimal norm least squares solution, i.e., T† y is the element of smallest norm among all minimizers of ‖T z − y‖_K, see [4]. Further, it satisfies R(T†) = R(T*) and

T† T = P_{R(T*)},  T T† = P_{R(T)},  (4)

where P_C is the orthogonal projection onto the closed, convex set C, see [4, Prop. 3.28]. Then it follows

T T† T = T,  T† T T† = T†.  (5)

Every T ∈ B(H, K) gives rise to an inner product in H via

⟨x, y⟩_{H_T} := ⟨T x, T y⟩_K + ⟨P_{N(T)} x, P_{N(T)} y⟩_H  (6)

with corresponding norm

‖x‖_{H_T} := (‖T x‖²_K + ‖P_{N(T)} x‖²_H)^{1/2}.  (7)

If T is injective, the second summand vanishes. Without the closed range assumption, (7) in general only induces a pre-Hilbert structure. Since T ∈ B(H, K) has closed range, the norms ‖·‖_H and ‖·‖_{H_T} are equivalent on H due to

‖x‖²_{H_T} ≤ (‖T‖² + 1) ‖x‖²_H and ‖x‖²_H = ‖P_{R(T*)} x‖²_H + ‖P_{N(T)} x‖²_H ≤ max{‖T†‖², 1} ‖x‖²_{H_T}

for all x ∈ H. The norm equivalence also ensures the completeness of H equipped with the new norm. To emphasize that we consider the linear space H with this norm, we write H_T. For special T ∈ B(H, K), the inner product (6) coincides with the one in H.

Lemma 3.1. If T* T = Id_H, then ⟨x, y⟩_{H_T} = ⟨x, y⟩_H for all x, y ∈ H.

Proof. If T* T = Id_H, then T is injective, such that (6) implies ⟨x, y⟩_{H_T} = ⟨T x, T y⟩_K = ⟨x, T* T y⟩_H = ⟨x, y⟩_H.
To apply the characterization of proximal mappings in H T by Moreau, see Theorem 2.2, we have to compute gradients in H T . Here, the following result is crucial.
Lemma 3.2. Let H and K be real Hilbert spaces with inner products ⟨·,·⟩_H and ⟨·,·⟩_K, respectively. For an operator T ∈ B(H, K) with closed range, let H_T be the Hilbert space with inner product (6). For (Fréchet) differentiable Φ : H → R, the gradients ∇_H Φ and ∇_{H_T} Φ with respect to the different inner products are related by

∇_{H_T} Φ(x) = (T* T + P_{N(T)})^{-1} ∇_H Φ(x).

Proof. The gradient ∇_{H_T} Φ(x) at x ∈ H in the space H_T is given by the vector satisfying, for all h ∈ H, ⟨∇_{H_T} Φ(x), h⟩_{H_T} = DΦ(x)(h) = ⟨∇_H Φ(x), h⟩_H. By (6), the inner product depends on the chosen inner product in H through ⟨u, h⟩_{H_T} = ⟨(T* T + P_{N(T)}) u, h⟩_H for all u, h ∈ H. Since T* T + P_{N(T)} is boundedly invertible, the desired result follows.

Theorem 3.3. Let b ∈ K, T ∈ B(H, K) with closed range and Prox : K → K a proximity operator. Then A := T† Prox(T · + b) is a proximity operator on H_T.

Proof. In view of Theorems 2.1 and 2.2, it suffices to show that A is nonexpansive on H_T and that there exists a convex function Ψ : H → R with A = ∇_{H_T} Ψ.

1. First, we show that A is firmly nonexpansive, and thus nonexpansive. By (3), we see that A maps into R(T*), so that P_{N(T)}(Ax − Ay) = 0. Using this and (4), it follows with p_x := Prox(T x + b) and p_y := Prox(T y + b) that

‖Ax − Ay‖²_{H_T} = ‖T(Ax − Ay)‖²_K = ‖P_{R(T)}(p_x − p_y)‖²_K.

By (6) and (4), we obtain

⟨Ax − Ay, x − y⟩_{H_T} = ⟨P_{R(T)}(p_x − p_y), T x − T y⟩_K = ⟨p_x − p_y, (T x + b) − (T y + b)⟩_K,

and since Prox is firmly nonexpansive with respect to ‖·‖_K, the last expression is bounded from below by ‖p_x − p_y‖²_K ≥ ‖P_{R(T)}(p_x − p_y)‖²_K, so that A is firmly nonexpansive.

2. It remains to prove that there exists a convex function Ψ : H → R with A = ∇_{H_T} Ψ. Since Prox is a proximity operator on K, there exists a convex, differentiable function Φ : K → R with Prox = ∇Φ. Then, a natural candidate is given by Ψ := Φ(T · + b). Using the definition of the gradient and the chain rule, it holds for all x, h ∈ H that

DΨ(x)(h) = DΦ(T x + b)(T h) = ⟨Prox(T x + b), T h⟩_K = ⟨T* Prox(T x + b), h⟩_H,

i.e., ∇_H Ψ(x) = T* Prox(T x + b). Incorporating Lemma 3.2 and the identity (T* T + P_{N(T)})^{-1} T* = T†, we conclude A = ∇_{H_T} Ψ. Finally, Ψ is convex since it is the concatenation of a convex function with an affine function.
Let (f □ g)(x) := inf_{y ∈ H} { f(y) + g(x − y) } denote the infimal convolution of f, g ∈ Γ₀(H), and let ι_S denote the indicator function of the set S, taking the value 0 if x ∈ S and +∞ otherwise. For Prox := prox_g with g ∈ Γ₀(K), we are actually able to explicitly compute f ∈ Γ₀(H) such that T† Prox(T · + b) = prox_f on H_T. Clearly, this also gives an alternative proof of Theorem 3.3.
Theorem 3.4. Let b ∈ K, T ∈ B(H, K) with closed range and Prox := prox_g for some g ∈ Γ₀(K). Then T† prox_g(T · + b) : H_T → H_T is the proximity operator on H_T of a function f ∈ Γ₀(H) given explicitly in terms of g, T and b by (10); for special choices of T and b this expression simplifies to (11).

Proof. By (5) and (3), and further by (4), a direct computation of the minimizer of ½ ‖x − y‖²_{H_T} + f(y) over y ∈ H shows that T† prox_g(T · + b) is the proximity operator on H_T of f in (10).
Note that for injective T and b = 0, the function f is in general a weaker regularizer than g, i.e., f ≤ g; this is necessary, since otherwise (11) would lead to a contradiction.

Frame Soft Shrinkage as Proximity Operator
In this section, we investigate the frame soft shrinkage as a special proximity operator. Let K = ℓ₂ be the Hilbert space of square summable sequences c = {c_k}_{k∈N} with norm ‖c‖_{ℓ₂} := (Σ_{k∈N} |c_k|²)^{1/2}. Given a frame {x_k}_{k∈N} of H, the corresponding analysis operator T : H → ℓ₂ is defined as T x := {⟨x, x_k⟩_H}_{k∈N}. Its adjoint T* : ℓ₂ → H is the synthesis operator given by T* c := Σ_{k∈N} c_k x_k. By composing T and T*, we obtain the frame operator T* T x = Σ_{k∈N} ⟨x, x_k⟩_H x_k. The frame operator T* T is invertible on H, see [10], such that every x ∈ H has the expansion x = Σ_{k∈N} ⟨x, x_k⟩_H (T* T)^{-1} x_k.

Lemma 4.1. An operator T ∈ B(H, ℓ₂) is the analysis operator of a frame of H if and only if T is injective and R(T) is closed.

Proof. If T is the analysis operator for a frame {x_k}_{k∈N}, then T is bounded, injective and has closed range, see [10]. Conversely, assume that T ∈ B(H, ℓ₂) is injective and that R(T) is closed. By (3), it holds R(T*) = H. Let {δ_k}_{k∈N} be the canonical basis of ℓ₂ and set x_k := T* δ_k. Then T x = {⟨x, x_k⟩_H}_{k∈N}, and since T is bounded and, by injectivity and the closed range, bounded from below, {x_k}_{k∈N} is a frame of H.

The soft shrinkage operator S_λ on ℓ₂ (applied componentwise) is the proximity operator corresponding to the function g := λ ‖·‖₁, λ > 0. As an immediate consequence of Theorem 3.4, we obtain the following corollary.
Corollary 4.2. Assume that T : H → ℓ₂ is an analysis operator for some frame of H and Prox : ℓ₂ → ℓ₂ is an arbitrary proximity operator. Then T† Prox T is itself a proximity operator on H equipped with the norm ‖·‖_{H_T}. In particular, this applies for Prox := S_λ, λ > 0, so that the frame soft shrinkage T† S_λ T is a proximity operator on H_T.

Finally, let us have a look at the finite dimensional setting with H := R^d, K := R^n, n ≥ d. Then, for any T ∈ R^{n,d} with full rank d and the proximity operator S_λ, λ > 0, on R^n, the operator T† S_λ(T ·) is the proximity operator on H_T of the function f in (10); the corresponding finite dimensional expression is given in (12).

Example 4.3. We want to compute f for the matrix T : R¹ → R² given by T = (1, 2)^T and the soft shrinkage operator S_λ on R² with λ > 0. Note that this example was also considered in [19]. By (12), f(y) can be written as the minimum of a strictly convex function g_y over a scalar variable. By Fermat's theorem, the unique minimizer of g_y is given by −y/2 for |y| ≤ (2/5)λ, which yields a closed expression for f(y) in this regime. For |y| > (2/5)λ, the function g_y is differentiable at −(λ/5) sgn(y) with vanishing derivative, so that the minimizer of g_y is −(λ/5) sgn(y), which yields f(y) for |y| > (2/5)λ. Choosing, e.g., λ = 1/3, we obtain a function f which is a good approximation of |y|.
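In one dimension, a map is a proximity operator exactly when it is nondecreasing and nonexpansive, so the claim of Example 4.3 can be checked numerically. A small Python sketch for T = (1, 2)^T (helper names are ours):

```python
import numpy as np

lam = 1.0
T = np.array([[1.0], [2.0]])
T_pinv = np.linalg.pinv(T)                      # here (1/5) * [1, 2]

def soft_shrink(x, lam):
    """Componentwise soft shrinkage S_lambda."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def frame_shrink(y):
    """The frame soft shrinkage T^dagger S_lambda(T y) for scalar y."""
    return (T_pinv @ soft_shrink(T * y, lam)).item()

# sample the operator on a grid and check monotonicity and nonexpansiveness
ys = np.linspace(-3.0, 3.0, 601)
vals = np.array([frame_shrink(y) for y in ys])
```

The sampled map is piecewise linear with slopes in [0, 1], consistent with it being the proximity operator of a convex function.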

Proximal Neural Networks
In this section, we consider neural networks (NNs) consisting of K ∈ N layers with dimensions n₁, ..., n_K, defined by mappings Φ = Φ(·; u) : R^d → R^{n_K} of the form

Φ := A_K ∘ σ ∘ A_{K−1} ∘ · · · ∘ σ ∘ A₁.  (13)

Such NNs are composed of affine functions A_k : R^{n_{k−1}} → R^{n_k} given by

A_k(x) := L_k x + b_k  (14)

with weight matrices L_k ∈ R^{n_k,n_{k−1}}, n₀ = d, bias vectors b_k ∈ R^{n_k} as well as a nonlinear activation function σ : R → R acting at each component, i.e., for x = (x_j)_{j=1}^n we have σ(x) = (σ(x_j))_{j=1}^n. The parameter set u := (L_k, b_k)_{k=1}^K of such a NN has the overall dimension D := n₀n₁ + n₁n₂ + · · · + n_{K−1}n_K + n₁ + · · · + n_K. For an illustration see Fig. 1.
In [12], the notion of stable activation functions was introduced. An activation function σ : R → R is called stable if it is monotonically increasing, 1-Lipschitz continuous and satisfies σ(0) = 0. The following result was proved in [12].
Lemma 5.1. A function σ : R → R is a stable activation function if and only if there exists g ∈ Γ 0 (R) having 0 as a minimizer such that σ = prox g .
Various common activation functions σ and corresponding functions g ∈ Γ₀(R) are listed in Tab. 3 in the appendix. For T_k ∈ R^{n_k,d}, we consider the norm (7) and denote it by

‖·‖_{T_k}.  (15)

In the previous sections, we have considered two different kinds of proximity operators, namely prox_g with respect to the Euclidean norm and prox_{T_k,g} with respect to the norm (15),

prox_{T_k,g}(x) := argmin_{y ∈ R^d} { ½ ‖x − y‖²_{T_k} + g(y) }.  (16)

Further, we derived a function f_k depending on g, T_k and b_k, see Theorem 3.4, such that T_k† σ(T_k x + b_k) = prox_{T_k,f_k}(x). Based on our observations in the previous sections, we consider the following special NNs. We choose a stable activation function σ = prox_g for some g ∈ Γ₀(R) and matrices T_k ∈ R^{n_k,d}, as well as bias vectors b_k ∈ R^{n_k}, k = 1, ..., K, and construct according to (14) the affine mappings A_k := T_k · + b_k. Then, the NN

Φ(·; u) := (T_K · + b_K) ∘ T_{K−1}† σ(T_{K−1} · + b_{K−1}) ∘ · · · ∘ T₁† σ(T₁ · + b₁)  (17)

can be rewritten as

Φ(·; u) = (T_K · + b_K) ∘ prox_{T_{K−1},f_{K−1}} ∘ · · · ∘ prox_{T₁,f₁}.  (18)

We call Φ a proximal neural network (PNN) with network parameters u := (T_k, b_k)_{k=1}^K. Next, we investigate stability properties of such networks. Recall that an operator Ψ : H → H on a Hilbert space H is α-averaged, α ∈ (0, 1), if there exists a nonexpansive operator R : H → H such that Ψ = (1 − α) Id + α R. The following theorem summarizes properties of α-averaged operators, cf. [4] and [32] for the third statement.

Theorem 5.2. i) An operator is firmly nonexpansive if and only if it is ½-averaged; in particular, proximity operators are ½-averaged. ii) The concatenation of K operators which are α_k-averaged with respect to the same norm is α-averaged with α = K/(K − 1 + 1/max_k α_k). iii) For an α-averaged operator Ψ : H → H with a nonempty fixed point set, the sequence generated by the iteration x^{(r+1)} = Ψ(x^{(r)}) converges weakly for every starting point x^{(0)} ∈ H to a fixed point of Ψ.
In the following, we study special PNNs, which are α-averaged operators such that x (r+1) = Φ(x (r) ; u) converges to a fixed point of Φ.
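A quick numerical illustration of this convergence (our own toy example, not from the paper): for a one-layer PNN x ↦ T^T S_λ(T x) with orthogonal T, the iteration drives every starting point to the fixed point 0, the minimizer of λ‖T x‖₁.

```python
import numpy as np

rng = np.random.default_rng(3)
d = 6
T, _ = np.linalg.qr(rng.standard_normal((d, d)))   # orthogonal, T^T T = I_d

def soft_shrink(x, lam):
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def phi(x, lam=0.5):
    """One-layer PNN T^T S_lam(T x); an averaged operator, so iterating converges."""
    return T.T @ soft_shrink(T @ x, lam)

x = rng.standard_normal(d)
for _ in range(100):
    x = phi(x)
```

Each iteration shrinks the coefficients T x by λ in magnitude, so the iterates reach the fixed point after finitely many steps in this example.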
Lemma 5.3. Let Φ be a PNN of the form (18) with T_K = I_d, and assume that for k = 1, ..., K − 1 one of the following conditions holds: i) T_k^T T_k = I_d or T_k T_k^T = I_{n_k}; ii) T_k^T T_k = d_k I_d for some d_k > 0. Then Φ is an α-averaged operator with α = (K − 1)/K.

The setting is in particular fulfilled if the rows of T_k, k = 1, ..., K − 1, form a tight frame.
Proof. i) By Lemma 3.1, we know that ‖·‖_{T_k} = ‖·‖₂, so that Φ is the concatenation of K − 1 proximity operators on R^d with respect to the Euclidean norm with f_k as in Theorem 3.4. Now, the assertion follows from Theorem 5.2.
ii) By assumption, we obtain ‖x‖_{T_k} = √d_k ‖x‖₂ for all x ∈ R^d. Clearly, if f_k ∈ Γ₀(R^d), then the same holds true for d_k^{-1} f_k. Hence, Φ becomes the concatenation of K − 1 proximity operators on R^d, all with respect to the norm ‖·‖_{T_1}. Then, the assertion follows from Theorem 5.2.
Remark 5.4. Lemma 5.3 i) can be generalized to the case where T_K ∈ R^{d,d} is a symmetric positive semi-definite matrix with norm not larger than 1. In this case, T_K can be written in the form T_K = V Σ V^T with an orthogonal matrix V ∈ R^{d,d} and a diagonal matrix Σ with diagonal entries in [0, 1]. Thus, for every x, y ∈ R^d,

‖T_K x − T_K y‖²₂ = ‖Σ V^T (x − y)‖²₂ ≤ ⟨Σ V^T (x − y), V^T (x − y)⟩ = ⟨T_K x − T_K y, x − y⟩.

This shows that T_K · + b_K is firmly nonexpansive and therefore ½-averaged. Consequently, Φ in (18) is the concatenation of K ½-averaged operators with respect to the Euclidean norm. Hence Φ is itself α-averaged with α = K/(K+1).

Remark 5.5. In [12], the following NN structure was studied: Let H₀, ..., H_K be a sequence of real Hilbert spaces with H₀ = H_K = H. Further, let W_k ∈ B(H_{k−1}, H_k) and let P_k : H_k → H_k, k = 1, ..., K, be firmly nonexpansive operators. For this case, Combettes and Pesquet [12] posed conditions on W_k such that the corresponding network is α-averaged for some α ∈ (1/2, 1), see (19). For H = R^d equipped with the Euclidean norm, H_k = R^d equipped with the norm (15) and T_K = I_d, our PNN Φ has exactly the form (19) with P_k := prox_{T_k,f_k} : H_k → H_k and the embedding operators W_k : H_{k−1} → H_k, k = 1, ..., K. For the special PNNs in Lemma 5.3 it holds W_k = I_d, such that the conditions in [12] are fulfilled.
In the rest of this paper, we restrict our attention to matrices T_k ∈ R^{n_k,d} fulfilling T_k^T T_k = I_d or T_k T_k^T = I_{n_k}, and T_K = I_d, i.e., the rows, resp. columns, of T_k form a Parseval frame. In both cases T_k† = T_k^T, so that the PNN in (18) has the form

Φ(·; u) = (· + b_K) ∘ T_{K−1}^T σ(T_{K−1} · + b_{K−1}) ∘ · · · ∘ T₁^T σ(T₁ · + b₁)

with the "usual" proximity operator, cf. (16), i.e., T_k^T σ(T_k · + b_k) = prox_{f_k} with respect to the Euclidean norm. Due to the use of Parseval frames, we call these networks Parseval (frame) proximal neural networks (PPNNs). By our previous considerations, see Lemma 5.3, PPNNs are averaged operators.
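A PPNN forward pass is only a few lines of code. The following Python sketch (our own minimal implementation; random Stiefel matrices are obtained from a QR decomposition) also checks the nonexpansiveness that the averagedness of PPNNs guarantees for arbitrary parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def stiefel(n, d, rng):
    """A random matrix T with orthonormal columns, i.e., T^T T = I_d (n >= d)."""
    q, _ = np.linalg.qr(rng.standard_normal((n, d)))
    return q

def soft_shrink(x, lam):
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def ppnn(x, Ts, bs, lam):
    """Forward pass x -> T_k^T S_lam(T_k x + b_k), k = 1, ..., K-1 (T_K = I_d, b_K = 0)."""
    for T, b in zip(Ts, bs):
        x = T.T @ soft_shrink(T @ x + b, lam)
    return x
```

Since every layer is a proximity operator, the network is nonexpansive no matter which Stiefel matrices and biases are plugged in.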
The following facts on Stiefel manifolds can be found, e.g., in [1]. For d ≤ n, the (compact) Stiefel manifold is defined as

St(d, n) := { T ∈ R^{n,d} : T^T T = I_d }.

For d = 1, this reduces to the sphere S^{n−1}, and for d = n we obtain the orthogonal group O(n). In general, St(d, n) is a manifold of dimension nd − ½ d(d+1) with tangent space at T ∈ St(d, n) given by

T_T St(d, n) = { T A + T_⊥ B : A ∈ R^{d,d} skew-symmetric, B ∈ R^{n−d,d} },

where the columns of T_⊥ ∈ R^{n,n−d} form an orthonormal basis of the orthogonal complement of R(T), fulfilling T_⊥^T T_⊥ = I_{n−d} and T^T T_⊥ = 0. The Riemannian gradient of a function on St(d, n) can be obtained by the orthogonal projection of its gradient in R^{n,d} onto the tangent space. The orthogonal projection of X ∈ R^{n,d} onto T_T St(d, n) can be written in the form

P_{T_T St(d,n)}(X) = W T  (22)

with a skew-symmetric matrix W ∈ R^{n,n} depending on X and T. To emphasize that for fixed T the matrix W depends on X, we will also write W_X.
A frequently applied retraction onto St(d, n) is

R_T(X) := qf(T + X),  (23)

where qf(A) denotes the Q factor of the decomposition of a matrix A ∈ R^{n,d} with linearly independent columns as A = QR with Q ∈ St(d, n) and R an upper triangular matrix of size d × d with strictly positive diagonal elements. The complexity of the QR decomposition using the Householder algorithm is 2d²(n − d/3), see [20]. Since the computation of the QR decomposition appears to be time consuming on a GPU, we prefer to apply another retraction based on the Cayley transform of the skew-symmetric matrices W in (22), see [30,38], namely

R_T(X) := (I_n − ½ W_X)^{-1} (I_n + ½ W_X) T.  (24)

By straightforward computation it can be seen that W_X and W_{P_T X} coincide, so that the retraction (24), extended to the whole R^{n,d}, fulfills

R_T(X) = R_T(P_{T_T St(d,n)}(X)).  (25)

Remark 6.1. The retraction (24) has the drawback that it contains a matrix inversion. In our numerical algorithm, the following simple fixed point iteration is used for computing the matrix R = R_T(X) for fixed T and X. By definition, R fulfills the fixed point equation

R = T + ½ W_X (T + R).  (26)

Starting with an arbitrary R^{(0)} ∈ St(d, n), we apply the iteration R^{(i+1)} = T + ½ W_X (T + R^{(i)}), which converges by Banach's fixed point theorem to the fixed point of (26) if ½ ρ(W_X) < 1, where ρ(W_X) denotes the spectral radius of W_X.
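The Cayley retraction (24) and the inversion-free fixed point iteration of Remark 6.1 can be sketched as follows (Python; the skew-symmetric W is generated randomly here instead of coming from a gradient, and its scaling is chosen small enough that the iteration contracts):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 10, 4

T, _ = np.linalg.qr(rng.standard_normal((n, d)))   # a point on St(d, n)
A = 0.05 * rng.standard_normal((n, n))
W = A - A.T                                        # skew-symmetric, as in (22)

def cayley_retraction(T, W):
    """Exact Cayley retraction (I - W/2)^{-1} (I + W/2) T; stays on St(d, n)."""
    I = np.eye(len(W))
    return np.linalg.solve(I - 0.5 * W, (I + 0.5 * W) @ T)

def cayley_fixed_point(T, W, iters=50):
    """Inversion-free evaluation via the iteration R <- T + (W/2)(T + R)."""
    R = T.copy()
    for _ in range(iters):
        R = T + 0.5 * W @ (T + R)
    return R
```

The Cayley transform of a skew-symmetric matrix is orthogonal, so the retracted point satisfies R^T R = I_d up to numerical precision, and both evaluations agree once the iteration has converged.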
We want to train a PPNN by minimizing

J(u) := Σ_{i=1}^N ℓ(Φ(x_i; u), y_i),  (27)

where ℓ : R^d × R^d → R is a loss function which is differentiable with respect to the first d variables and (x_i, y_i) ∈ R^d × R^d, i = 1, ..., N, are given training data.
Example 6.2. Let us specify two special cases of PPNNs with one layer.
i) For one layer without bias and componentwise soft shrinkage σ = S_λ as activation function, i.e., Φ(x; u) = T₁^T S_λ(T₁ x), we learn Parseval frames from the summands ℓ(T₁^T S_λ(T₁ x_i), y_i), e.g., for denoising tasks with x_i as a noisy version of y_i. Here, we want to mention the significant amount of work on dictionary learning, see [17], which starts with the same goal.
ii) For x_i = y_i, i = 1, ..., N, the above network could be used as a so-called autoencoder. Again for one layer, now without activation function, b₁ = 0 and ℓ = h(‖x − y‖) with some norm ‖·‖ on R^d, we get

argmin_{T₁ : T₁ T₁^T = I_{n₁}} Σ_{i=1}^N h(‖x_i − T₁^T T₁ x_i‖).

For the Euclidean norm and h(x) = x², we get the classical PCA approach, and for h(x) = x the robust rotationally invariant L₁-norm PCA recently discussed in [25,29].
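For the Euclidean norm and h(x) = x², the resulting problem is classical PCA and can be solved in closed form via the SVD, which gives a handy reference for checking learned solutions. A sketch on synthetic data (centering the data first, as usual in PCA):

```python
import numpy as np

rng = np.random.default_rng(7)
# synthetic data whose variance is concentrated in the first two coordinates
X = rng.standard_normal((200, 5)) * np.array([3.0, 2.0, 0.3, 0.2, 0.1])
Xc = X - X.mean(axis=0)

# the optimal T_1 with T_1 T_1^T = I_k consists of the top-k right singular vectors
k = 2
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
T1 = Vt[:k]
recon = Xc @ T1.T @ T1        # projection onto the learned k-dimensional subspace
```

By the Eckart-Young theorem, no other rank-k orthogonal projection achieves a smaller squared reconstruction error on this data.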
The following remark points out that special cases of our PPNNs were already considered in the literature.

Remark 6.3. In [23], NNs with weight matrices L_k ∈ R^{n_k,n_{k−1}}, k ∈ {1, ..., K − 1}, such that the matrices or their transposes lie in a Stiefel manifold, were examined. The authors called this approach optimization over multiple dependent Stiefel manifolds (OMDSM). Indeed, for the following reasons, these NNs are special cases of our PPNNs if n_k ≤ d for all k = 1, ..., K − 1. In particular, this implies that the NNs considered in [23] (with appropriately chosen last layer) are averaged operators.

i) Case n_k ≤ n_{k−1}: Let L_k^T ∈ St(n_k, n_{k−1}), i.e., L_k L_k^T = I_{n_k}. Choosing an arbitrary fixed T_{k−1} ∈ R^{n_{k−1},d} with T_{k−1} T_{k−1}^T = I_{n_{k−1}}, we want to find T_k ∈ R^{n_k,d} such that

T_k T_k^T = I_{n_k} and T_k T_{k−1}^T = L_k.  (28)

It is straightforward to verify that T_k := L_k T_{k−1} has the desired properties. Note that if the transposes of T_k and T_{k−1} are in a Stiefel manifold, this does not necessarily hold for the transpose of T_k T_{k−1}^T. Therefore, our PPNNs are more general.

ii) Case n_{k−1} < n_k: Let L_k ∈ St(n_{k−1}, n_k), i.e., L_k^T L_k = I_{n_{k−1}}. For an arbitrary fixed T_{k−1} ∈ R^{n_{k−1},d} with T_{k−1} T_{k−1}^T = I_{n_{k−1}}, we want to find T_k ∈ R^{n_k,d} fulfilling (28). To this end, we complete L_k to an orthogonal matrix L̂_k ∈ R^{n_k,n_k} and T_{k−1} to a matrix T̃_{k−1} ∈ R^{n_k,d} with orthonormal rows. By straightforward computation we verify that T_k := L̂_k T̃_{k−1} satisfies (28), such that this case also fits into our PPNN framework.
We apply a stochastic gradient descent algorithm on the Stiefel manifold to find a minimizer of (27). To this end, we compute the Euclidean gradient with respect to one layer and apply the usual backpropagation for multiple layers; the Euclidean gradients of the summands of (27) with respect to T_k and b_k are given in Lemma 6.4, where the gradient of ℓ is taken with respect to the first d variables. The proof follows by straightforward computations which are carried out in the appendix. Now, we can formulate stochastic gradient descent (SGD) for J as in Algorithm 1. This algorithm works for an arbitrary retraction, in particular the retraction in (23). In our numerical computations, we use the special retraction (24) in connection with the fixed point iteration from Remark 6.1. Then, by (25), the projection step 3 of the algorithm can be skipped and the retraction can be directly applied to the Euclidean gradient.

Numerical Results

By (x_i, y_i) ∈ R^d × R^d, i = 1, ..., N, we denote pairs of piecewise constant signals y_i of length d = 2⁷ = 128 and their noisy versions x_i = y_i + ε_i, where ε_i is white noise with mean 0 and standard deviation σ = 0.1. For the signal generation, we choose
• the number of constant parts of y_i as max{2, t_i}, where t_i is the realization of a random variable following the Poisson distribution with mean 5;
• the discontinuities of y_i as realizations of a uniform distribution;
• the signal intensity of y_i on every constant part as realization of the standard normal distribution, where we finally subtract the mean of the signal.
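A possible implementation of this signal generation (our reading of the three bullet points above; the uniformly distributed discontinuities are realized by drawing jump positions without replacement):

```python
import numpy as np

def sample_signal(rng, d=128, noise_std=0.1):
    """One piecewise constant signal y and its noisy version x = y + noise."""
    n_parts = max(2, rng.poisson(5))                       # number of constant parts
    jumps = np.sort(rng.choice(np.arange(1, d), size=n_parts - 1, replace=False))
    bounds = np.concatenate(([0], jumps, [d]))
    y = np.zeros(d)
    for a, b in zip(bounds[:-1], bounds[1:]):
        y[a:b] = rng.standard_normal()                     # intensity of each part
    y -= y.mean()                                          # finally subtract the mean
    x = y + noise_std * rng.standard_normal(d)
    return x, y
```

Drawing many such pairs yields the training and test sets described below.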
Using this procedure, we generate training data (x_i, y_i), i = 1, ..., N, and test data with N = 500000 and N_test = 1000. The average PSNR of the noisy signals in the test set is 25.22. We use PPNNs with K − 1 ∈ {1, 2, 3} layers and set T_K = I_d and b_K = 0.
In all examples, a batch size of 32 and a learning rate of 0.5 are used.
We are interested in two different settings:

1. Learned orthogonal matrices versus Haar basis. First, we consider PPNNs with 128 neurons in each hidden layer and componentwise soft shrinkage S_λ as activation function. In particular, all matrices T_k have to be orthogonal. The denoising results of our learned PPNN are compared with soft wavelet shrinkage with respect to the discrete orthogonal Haar basis in R^128, i.e., the signal is decomposed on all 6 scales by the orthogonal transform H := H₂ · · · H₇, built from matrices H_j performing one level of the Haar transform, the coefficients are shrunken and the signal is synthesized again, see (29). The average PSNRs on the test data are given in Tab. 1, where the optimal threshold in S_λ was determined by 5-fold cross validation. More precisely, the training data is divided into 5 subsets and each is used once as a test set with the remaining samples as training set. The test loss for given λ is averaged over all 5 trials for judging the quality of the model. Two exemplary noisy signals and their denoised versions are shown in Fig. 2. Since we have learned the orthogonal matrices T_k with respect to the quadratic loss function, the visual quality of the PPNN denoised signals is clearly not satisfactory even with an improved PSNR. The visual impression of signals denoised by Haar wavelet shrinkage can be improved (smoother signal) by increasing the threshold to λ = 0.3, resulting in a worse PSNR. To achieve a similar behavior with orthogonal matrices learned by PPNNs, we would have to choose a different loss function.
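For reference, the orthogonal Haar shrinkage baseline can be sketched as follows (Python; a standard recursive construction of the full orthonormal Haar matrix, not code from the paper — our version transforms down to the coarsest scale):

```python
import numpy as np

def haar_matrix(d):
    """Orthonormal Haar matrix of size d x d (d a power of two), built recursively."""
    if d == 1:
        return np.array([[1.0]])
    H = haar_matrix(d // 2)
    top = np.kron(H, [1.0, 1.0])                 # coarse part: pairwise averages
    bot = np.kron(np.eye(d // 2), [1.0, -1.0])   # finest-scale details
    return np.vstack([top, bot]) / np.sqrt(2.0)

def soft_shrink(x, lam):
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def haar_denoise(x, lam):
    """Soft wavelet shrinkage x -> H^T S_lam(H x) in the orthonormal Haar basis."""
    H = haar_matrix(len(x))
    return H.T @ soft_shrink(H @ x, lam)
```

Since H is orthogonal, the map x ↦ H^T S_λ(H x) is itself a proximity operator by the results of Section 4.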
2. Learned Stiefel matrices versus Haar frame. Haar wavelet shrinkage can be improved by using Haar wavelet frames within a so-called "algorithme à trous", see [27]. We apply a similar method as in (29), but with a rectangular matrix H whose rows form a Haar frame. More precisely, the Haar filter is used without subsampling. This results, in contrast to the original Haar transform, in a translation-invariant multiscale transform. Instead of the matrix H₇, a nonsubsampled (convolution) matrix is applied to the original signal; here, we assume a periodic continuation of the signal. In each of the following steps j = 1, ..., 6, we keep the lower part and transform again the upper smoothed part by essentially the same matrix, where 2^j − 1 zeros are inserted between the filter coefficients. The output signal has size 8 · 128, where the last part of the signal is just the averaged signal, which is equal to zero.
For an explanation of this statement we refer to [35]. In summary, we obtain the frame shrinkage (30), where S̃_λ denotes the scale-wise adapted soft thresholding. Now, we compare this scale-dependent Haar frame soft thresholding method with a learned PPNN with 1024 neurons in the hidden layers and componentwise soft shrinkage S_λ as activation function. The optimal threshold in S_λ is determined by 3-fold cross validation, where the tested parameters are chosen in [0.01, 0.1] with steps of 0.01. We emphasize that, in contrast to the Haar frame shrinkage procedure (30), the same threshold for each component is used in the activation function of our PPNN. The resulting PSNRs are given in Tab. 2. As expected, using the same threshold on all scales in the classical Haar frame shrinkage is worse than the scale-adapted Haar frame shrinkage. PPNNs with learned Stiefel matrices perform better for an increasing number of layers.
Finally, Fig. 3 contains the denoised signals from Fig. 2. The results are visually better than in the previous figure, although still not satisfactory due to the loss function used.
3. Classification on MNIST. Finally, we train a PPNN for classification on the MNIST data set with another activation function, namely the sigmoid function σ(x) := 1/(1 + exp(−x)). For training, we use a batch size of 1024 and a learning rate of 5, ending up with an accuracy of 0.9855 on the test set. In Fig. 4, the training and test loss of our PPNN during training are plotted.

Remark 7.1. As already mentioned in Remark 6.3, NNs with Stiefel matrices were also applied in [23]. The authors of [23] reported that the training process using Riemannian optimization on the Stiefel manifold could be unstable or divergent. We do not observe such instabilities in our setting.

Conclusions
Figure 4: Training loss (blue) and test loss (orange) of a PPNN on the MNIST data set. The x-axis corresponds to the number of epochs and the y-axis to the associated loss value.

In this paper, we have shown that for real Hilbert spaces H and K, a proximity operator Prox : K → K and a linear bounded operator T : H → K with closed range, the operator T† Prox(T · + b)
with b ∈ K is a proximity operator on H equipped with a suitable norm. As a consequence, the famous frame soft shrinkage operator can be seen as a proximity operator. Using these new relations, we have discussed special neural networks arising from Parseval frames and stable activation functions. These networks include recently proposed ones containing matrices whose transposes are in a Stiefel manifold, and our analysis interprets them from another, more general point of view. In our future work, we want to explore for which learning tasks the higher flexibility of our PPNNs is advantageous. Taking more general operators T into account may also be useful. Another question we want to address is how to constrain our Stiefel matrices further, e.g., towards convolutional networks and sparsity constraints. Depending on the application, we have to design appropriate loss functions as well as incorporate regularizing terms.
For our experiments, the stochastic gradient algorithm on Stiefel manifolds worked well. However, other minimization methods could be taken into account. In [23], for example, the authors proposed an orthogonal weight normalization algorithm which was inspired by the fact that the eigenvalue decomposition is differentiable. Finally, we would like to mention that a proximal backpropagation algorithm, taking implicit instead of explicit gradient steps to update the network parameters during neural network training, was proposed in [18].
A better understanding of the convergence of the cyclic proximal point algorithm, see Remark 5.6, and suitable early stopping criteria if the network Φ is iteratively used may help to design NNs and to understand their success.

A. Gradient computation
Proof of Lemma 6.4: For potential further applications, we compute the gradient of a slightly more general functional. The gradient with respect to T_k, and, by the chain rule, the gradient with respect to b_k, then yield the assertion.

B. Activation functions
The following table lists various functions g ∈ Γ₀(R) whose proximity operator σ = prox_g is a common activation function in NNs, taken from [12].