Inertial Stochastic PALM (iSPALM) and Applications in Machine Learning

Inertial algorithms for minimizing nonsmooth and nonconvex functions, such as the inertial proximal alternating linearized minimization algorithm (iPALM), have demonstrated their superiority with respect to computation time over their non-inertial variants. In many problems in imaging and machine learning, the objective functions have a special form involving huge data, which encourages the application of stochastic algorithms. While algorithms based on stochastic gradient descent are still used in the majority of applications, recently also stochastic algorithms for minimizing nonsmooth and nonconvex functions were proposed. In this paper, we derive an inertial variant of a stochastic PALM algorithm with a variance-reduced gradient estimator, called iSPALM, and prove linear convergence of the algorithm under certain assumptions. Our inertial approach can be seen as a generalization of momentum methods, which are widely used to speed up and stabilize optimization algorithms, in particular in machine learning, to nonsmooth problems. Numerical experiments for learning the weights of a so-called proximal neural network and the parameters of Student-t mixture models show that our new algorithm outperforms both stochastic PALM and its deterministic counterparts.


Introduction
Recently, duality concepts were successfully applied for minimizing nonsmooth and nonconvex functions appearing in certain applications in image and data processing.
A frequently applied algorithm in this direction is the proximal alternating linearized minimization algorithm (PALM) of Bolte, Sabach and Teboulle [4], based on results in [1,2]. Pock and Sabach [37] realized that the convergence speed of PALM can be considerably improved by inserting some inexpensive inertial steps and called the accelerated algorithm iPALM. In many problems in imaging and machine learning, parts of the objective function can often be written as a sum of a huge number of functions sharing the same structure. In general, the computation of the gradient of these parts is too time and storage consuming, so that stochastic gradient approximations are applied, see, e.g., [5] and the references therein. A combination of the simple stochastic gradient descent (SGD) estimator with PALM was first discussed by Xu and Yin in [47]. The authors refer to their method as block stochastic gradient iteration and do not mention the connection to PALM. Under rather strong assumptions on the objective function $F$, they proved that the sequence $(x^k)_k$ produced by their algorithm is such that $\mathbb{E}\big[\mathrm{dist}(0, \partial F(x^k))\big]$ converges to zero as $k \to \infty$. Another idea for a stochastic variant of PALM was proposed by Davis et al. [11]. The authors introduce an asynchronous variant of PALM with stochastic noise in the gradient and call it SAPALM. Assuming an explicit bound on the variance of the noise, they proved certain convergence results; this bound is not fulfilled for the gradient estimators considered in this paper. Further, we like to mention that a stochastic variant of the primal-dual algorithm of Chambolle and Pock [9] for solving convex problems was developed in [8].
Replacing the simple stochastic gradient descent estimator by more sophisticated so-called variance-reduced gradient estimators, Driggs et al. [14] could weaken the assumptions on the objective function in [47] and improve the estimates on the convergence rate of a stochastic PALM algorithm. They called the corresponding algorithm SPRING. Note that the advantages of variance reduction for accelerating stochastic gradient methods were discussed by several authors, see, e.g., [24,40].
In this paper, we merge a stochastic PALM algorithm with an inertial procedure to obtain a new iSPALM algorithm. The inertial parameters can also be viewed as a generalization of momentum parameters to nonsmooth problems. Momentum parameters are widely used to speed up and stabilize optimization algorithms based on (stochastic) gradient descent. In particular, for machine learning applications it is known that momentum algorithms [32,38,39,42] as well as their stochastic modifications like the Adam optimizer [25] perform much better than plain (stochastic) gradient descent, see e.g. [16,44]. From this point of view, inertial or momentum parameters are one of the core ingredients of an efficient optimization algorithm for minimizing the loss in data-driven approaches. We examine the convergence behavior of iSPALM both theoretically and numerically. Under certain assumptions on the parameters of the algorithm, which also appear in the iPALM algorithm, we show that iSPALM converges linearly. In particular, we have to adapt the definition of variance-reduced gradient estimators to the sequence produced by iSPALM. More precisely, we have to introduce inertial variance-reduced gradient estimators. In the numerical part, we focus on two examples, namely (i) MNIST classification with proximal neural networks (PNNs), and (ii) parameter learning for Student-t mixture models (MMs).
PNNs basically replace the standard layer $\sigma(Tx+b)$ of a feed-forward neural network by $T^{\mathrm{T}}\sigma(Tx+b)$ and require that $T$ is an element of the (compact) Stiefel manifold, i.e., has orthonormal columns, see [19,21]. This implies that PNNs are 1-Lipschitz and hence more stable under adversarial attacks than a neural network of comparable size without the orthogonality constraints. While the PNNs in [19] were trained using SGD on the Stiefel manifold, we train them in this paper by adding the characteristic function of the feasible weights to the loss to incorporate the orthogonality constraints and use PALM, iPALM, SPRING and iSPALM for the optimization.
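To make the layer structure concrete, the following sketch shows one such layer in PyTorch; the class name, sizes and initialization are our own illustrative choices, not the reference implementation of [19,21].

```python
import torch

class PNNLayer(torch.nn.Module):
    """One PNN layer x -> T^T sigma(T x + b), with T on the Stiefel manifold."""

    def __init__(self, dim_in, dim_hidden):
        super().__init__()
        # initialize T with orthonormal columns (requires dim_hidden >= dim_in);
        # during training the constraint is maintained by projecting back
        self.T = torch.nn.Parameter(torch.linalg.qr(torch.randn(dim_hidden, dim_in)).Q)
        self.b = torch.nn.Parameter(torch.zeros(dim_hidden))

    def forward(self, x):
        s = torch.nn.functional.elu(torch.nn.functional.linear(x, self.T, self.b))
        return s @ self.T  # applies T^T row-wise; with T^T T = I the layer is 1-Lipschitz
```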
Learned MMs provide a powerful tool in data and image processing. While Gaussian MMs are mostly used in the field, more robust methods can be achieved by using heavier-tailed distributions such as the Student-t distribution. In [45], it was shown that Student-t MMs are superior to Gaussian ones for modeling image patches, and the authors proposed an application in image compression. Image denoising based on Student-t models was addressed in [27] and image deblurring in [13,48]. Further applications include robust image segmentation [3,34,43] and superresolution [20] as well as registration [15,49]. For learning MMs, a maximizer of the corresponding log-likelihood has to be computed. Usually an expectation maximization (EM) algorithm [26,30,35] or certain accelerations of it [6,31,46] are applied for this purpose. However, if the MM has many components and we are given large data, a stochastic optimization approach appears to be more efficient. Indeed, recently also stochastic variants of the EM algorithm were proposed [7,10], but they show various disadvantages, and we are not aware of a convergence result for these algorithms. In particular, one assumption on the stochastic EM algorithm is that the underlying distribution family is an exponential family, which is not the case for MMs.
In this paper, we propose for the first time to use the (inertial) PALM algorithms as well as their stochastic variants for maximizing a modified version of the log-likelihood function.
This paper is organized as follows: In Section 2, we provide the notation used throughout the paper. To understand the differences between existing algorithms and our novel one, we discuss PALM and iPALM together with convergence results in Section 3. Section 4 contains their stochastic variants, where our iSPALM is new. We discuss the convergence behavior of iSPALM in Section 5. In Section 6, we propose a model for learning the parameters of Student-t MMs based on its log-likelihood function. We show how (inertial) PALM and its stochastic variants (inertial) SPRING can be used for the optimization. Further, we prove that our model fulfills the assumptions needed for the convergence of these algorithms. Section 7 compares the performance of the four algorithms for two examples. We provide the code online. Finally, conclusions are drawn and directions of further research are addressed in Section 8.

Preliminaries
In this section, we introduce the basic notation and results which we will use throughout this paper. For a proper and lower semi-continuous function $f\colon \mathbb{R}^d \to (-\infty,\infty]$ and $\tau > 0$, the proximal mapping $\mathrm{prox}_\tau^f\colon \mathbb{R}^d \to \mathcal{P}(\mathbb{R}^d)$ is defined by
$$\mathrm{prox}_\tau^f(x) := \operatorname*{argmin}_{y\in\mathbb{R}^d} \Big\{ \tfrac{\tau}{2}\|x-y\|^2 + f(y) \Big\}, \tag{1}$$
where $\mathcal{P}(\mathbb{R}^d)$ denotes the power set of $\mathbb{R}^d$. The proximal mapping admits the following properties, see e.g. [41].

Proposition 2.1. Let $f\colon \mathbb{R}^d \to (-\infty,\infty]$ be proper and lower semi-continuous with $\inf_{\mathbb{R}^d} f > -\infty$. Then the following holds true:
(i) The set $\mathrm{prox}_\tau^f(x)$ is nonempty and compact for any $x \in \mathbb{R}^d$ and $\tau > 0$.
(ii) If $f$ is convex, then $\mathrm{prox}_\tau^f(x)$ contains exactly one value for any $x \in \mathbb{R}^d$ and $\tau > 0$.
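For illustration (our example, not from the paper), consider $f = \lambda\|\cdot\|_1$: this function is convex, so by Proposition 2.1(ii) the proximal mapping is single-valued, and with definition (1) it is the componentwise soft-thresholding with threshold $\lambda/\tau$.

```python
import numpy as np

def prox_l1(x, tau, lam=1.0):
    """prox of f = lam * ||.||_1: argmin_y { tau/2 * ||x - y||^2 + lam * ||y||_1 }."""
    return np.sign(x) * np.maximum(np.abs(x) - lam / tau, 0.0)

print(prox_l1(np.array([1.5, -0.2, 0.7]), tau=2.0))  # -> [ 1.  -0.   0.2]
```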
To describe critical points, we will need the definition of (general) subgradients.
Definition 2.2. Let $f\colon \mathbb{R}^d \to (-\infty,\infty]$ be a proper and lower semi-continuous function and $v \in \mathbb{R}^d$. Then we call $v$ a regular subgradient of $f$ at $x \in \mathrm{dom}\, f$, written $v \in \hat\partial f(x)$, if
$$f(y) \ge f(x) + \langle v, y - x\rangle + o(\|y-x\|) \quad \text{for all } y \in \mathbb{R}^d,$$
and a (general) subgradient of $f$ at $x$, written $v \in \partial f(x)$, if there exist sequences $x^k \to x$ with $f(x^k) \to f(x)$ and $v^k \in \hat\partial f(x^k)$ with $v^k \to v$. The following proposition lists useful properties of subgradients.
Proposition 2.3. Let $f\colon \mathbb{R}^d \to (-\infty,\infty]$ be proper and lower semi-continuous. Then the following holds true:
(i) the set $\partial f(x)$ is closed;
(ii) if $g\colon \mathbb{R}^d \to \mathbb{R}$ is continuously differentiable, then $\partial(f+g)(x) = \partial f(x) + \nabla g(x)$;
(iii) if $f(x_1,x_2) = f_1(x_1) + f_2(x_2)$ with proper, lower semi-continuous $f_1, f_2$, then $\partial f(x_1,x_2) = \partial f_1(x_1) \times \partial f_2(x_2)$.

Proof. Part (i) was proved in [41, Theorem 8.6 and Proposition 8.12] and part (ii) in [41, Exercise 8.8]. Concerning part (iii), a direct computation with the definition of regular subgradients proves the claim for regular subgradients.
For general subgradients, the claim follows by passing to the limit in the statement for regular subgradients. By [41, Theorem 10.1], any local minimizer $\hat x$ of a proper and lower semi-continuous function $f$ fulfills $0 \in \partial f(\hat x)$; in particular, it is a critical point of $f$. In this paper, we consider functions
$$F(x_1,x_2) := H(x_1,x_2) + f(x_1) + g(x_2) \tag{2}$$
with proper, lower semi-continuous functions $f\colon \mathbb{R}^{d_1} \to (-\infty,\infty]$ and $g\colon \mathbb{R}^{d_2} \to (-\infty,\infty]$ bounded from below and a continuously differentiable function $H\colon \mathbb{R}^{d_1}\times\mathbb{R}^{d_2} \to \mathbb{R}$. By Proposition 2.3 it holds
$$\partial F(x_1,x_2) = \big(\nabla_{x_1} H(x_1,x_2) + \partial f(x_1)\big) \times \big(\nabla_{x_2} H(x_1,x_2) + \partial g(x_2)\big). \tag{3}$$
The generalized gradient $G^F_{\tau_1,\tau_2}$ of $F$ is defined componentwise via the proximal steps of a PALM iteration with step size parameters $\tau_1, \tau_2 > 0$.
To motivate this definition, note that $0 \in G^F_{\tau_1,\tau_2}(x_1,x_2)$ implies that $(x_1,x_2)$ is a critical point of $F$. This can be seen as follows: for such $(x_1,x_2)$, the proximal steps reproduce the point itself. Using (1), this implies $0 \in \nabla_{x_1} H(x_1,x_2) + \partial f(x_1)$, and similarly we get $0 \in \nabla_{x_2} H(x_1,x_2) + \partial g(x_2)$. By (3) we conclude that $(x_1,x_2)$ is a critical point of $F$.

PALM and iPALM
In this section, we review PALM [4] and its inertial version iPALM [37].

PALM
The following Algorithm 3.1 for minimizing (2) was proposed in [4].
Algorithm 3.1 (Proximal Alternating Linearized Minimization, PALM). Input: $x_1^0 \in \mathbb{R}^{d_1}$, $x_2^0 \in \mathbb{R}^{d_2}$ and step size parameters $\tau_1^k, \tau_2^k > 0$. For $k = 0, 1, \ldots$ iterate:
$$x_1^{k+1} \in \mathrm{prox}^f_{\tau_1^k}\Big(x_1^k - \tfrac{1}{\tau_1^k}\nabla_{x_1} H(x_1^k, x_2^k)\Big),$$
$$x_2^{k+1} \in \mathrm{prox}^g_{\tau_2^k}\Big(x_2^k - \tfrac{1}{\tau_2^k}\nabla_{x_2} H(x_1^{k+1}, x_2^k)\Big).$$

To prove convergence of PALM, the following additional assumptions on $H$ are needed:

Assumption 3.1. (i) For any fixed $x_2 \in \mathbb{R}^{d_2}$, the gradient $\nabla_{x_1} H(\cdot, x_2)$ is globally Lipschitz continuous with Lipschitz constant $L_1(x_2)$; similarly, for any fixed $x_1 \in \mathbb{R}^{d_1}$, the gradient $\nabla_{x_2} H(x_1, \cdot)$ is globally Lipschitz continuous with Lipschitz constant $L_2(x_1)$.
(ii) There exist $\lambda_i^-, \lambda_i^+ > 0$, $i = 1,2$, such that $\lambda_1^- \le L_1(x_2^k) \le \lambda_1^+$ and $\lambda_2^- \le L_2(x_1^k) \le \lambda_2^+$ for all $k \in \mathbb{N}$.

Remark 3.2. Assume that $H \in C^2(\mathbb{R}^{d_1}\times\mathbb{R}^{d_2})$ fulfills Assumption 3.1(i). Then the authors of [4] showed that there are partial Lipschitz constants $L_1(x_2)$ and $L_2(x_1)$ such that Assumption 3.1(ii) is satisfied.
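A minimal sketch of one PALM iteration (our rendering of Algorithm 3.1; the callables for the partial gradients, proximal mappings and partial Lipschitz constants are assumptions supplied by the user):

```python
def palm_step(x1, x2, grad1, grad2, prox_f, prox_g, L1, L2, gamma1=1.1, gamma2=1.1):
    """One PALM iteration: alternating proximal gradient steps on the two blocks,
    with step sizes tau_i = gamma_i * L_i for gamma_i > 1 (cf. Theorem 3.3)."""
    tau1 = gamma1 * L1(x2)
    x1 = prox_f(x1 - grad1(x1, x2) / tau1, tau1)   # block 1 update
    tau2 = gamma2 * L2(x1)
    x2 = prox_g(x2 - grad2(x1, x2) / tau2, tau2)   # block 2 uses the new x1
    return x1, x2
```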
The following theorem was proven in [4, Lemma 3, Theorem 1]. For the definition of KL functions, see Appendix A. Here we just mention that semi-algebraic functions are KL functions, see, e.g., [4].

Theorem 3.3. Let $F$ given by (2) fulfill Assumption 3.1 and let $\nabla H$ be Lipschitz continuous on bounded subsets of $\mathbb{R}^{d_1}\times\mathbb{R}^{d_2}$. Let $(x_1^k, x_2^k)_k$ be the sequence generated by PALM, where the step size parameters fulfill $\tau_1^k = \gamma_1 L_1(x_2^k)$ and $\tau_2^k = \gamma_2 L_2(x_1^{k+1})$ for some $\gamma_1, \gamma_2 > 1$. Then, for $\eta := \min\{(\gamma_1 - 1)\lambda_1^-, (\gamma_2 - 1)\lambda_2^-\}$, it holds
$$\frac{\eta}{2}\,\big\|(x_1^{k+1}, x_2^{k+1}) - (x_1^k, x_2^k)\big\|^2 \le F(x_1^k, x_2^k) - F(x_1^{k+1}, x_2^{k+1}), \qquad k \in \mathbb{N}.$$
If $F$ is in addition a KL function and the sequence $(x_1^k, x_2^k)_k$ is bounded, then it converges to a critical point of $F$.

iPALM
To speed up the performance of PALM, the inertial variant iPALM in Algorithm 3.2 was suggested in [37].

Algorithm 3.2 (Inertial Proximal Alternating Linearized Minimization, iPALM). Input: $x_i^0 = x_i^{-1} \in \mathbb{R}^{d_i}$, inertial parameters $\alpha_i^k, \beta_i^k \ge 0$ and step size parameters $\tau_i^k > 0$, $i = 1,2$. For $k = 0, 1, \ldots$ iterate:
$$y_1^k = x_1^k + \alpha_1^k(x_1^k - x_1^{k-1}), \qquad z_1^k = x_1^k + \beta_1^k(x_1^k - x_1^{k-1}),$$
$$x_1^{k+1} \in \mathrm{prox}^f_{\tau_1^k}\Big(y_1^k - \tfrac{1}{\tau_1^k}\nabla_{x_1} H(z_1^k, x_2^k)\Big),$$
$$y_2^k = x_2^k + \alpha_2^k(x_2^k - x_2^{k-1}), \qquad z_2^k = x_2^k + \beta_2^k(x_2^k - x_2^{k-1}),$$
$$x_2^{k+1} \in \mathrm{prox}^g_{\tau_2^k}\Big(y_2^k - \tfrac{1}{\tau_2^k}\nabla_{x_2} H(x_1^{k+1}, z_2^k)\Big).$$

Remark 3.4 (Relation to Momentum Methods). The inertial parameters in iPALM can be viewed as a generalization of momentum parameters to nonsmooth functions. To see this, note that iPALM with one block, $f = 0$ and $\beta_k = 0$ reads as
$$x^{k+1} = x^k + \alpha_k(x^k - x^{k-1}) - \tfrac{1}{\tau_k}\nabla H(x^k).$$
By introducing $g^k := x^k - x^{k-1}$, this can be rewritten as
$$g^{k+1} = \alpha_k g^k - \tfrac{1}{\tau_k}\nabla H(x^k), \qquad x^{k+1} = x^k + g^{k+1}.$$
This is exactly the momentum method as introduced by Polyak in [38]. Similarly, if $f = 0$ and $\alpha_k = \beta_k$, iPALM can be rewritten as
$$y^k = x^k + \alpha_k(x^k - x^{k-1}), \qquad x^{k+1} = y^k - \tfrac{1}{\tau_k}\nabla H(y^k),$$
which is known as Nesterov's Accelerated Gradient (NAG) [32]. Consequently, iPALM can be viewed as a generalization of both the classical momentum method and NAG to the nonsmooth case. Even if there exists no proof of tighter convergence rates for iPALM than for PALM, this motivates that the inertial steps really accelerate PALM, since NAG has tighter convergence rates than plain gradient descent.
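The two special cases of the remark can be written out directly in code; the following sketch (ours, with a fixed step size $\tau$ and momentum $\alpha$ for simplicity) implements both recursions for a smooth $H$:

```python
import numpy as np

def heavy_ball(grad_H, x0, tau=0.1, alpha=0.5, iters=100):
    """Polyak momentum: iPALM with one block, f = 0 and beta_k = 0."""
    x, g = x0.astype(float), np.zeros_like(x0, dtype=float)
    for _ in range(iters):
        g = alpha * g - tau * grad_H(x)    # g^{k+1} = alpha * g^k - tau * grad H(x^k)
        x = x + g                          # x^{k+1} = x^k + g^{k+1}
    return x

def nag(grad_H, x0, tau=0.1, alpha=0.5, iters=100):
    """Nesterov's accelerated gradient: iPALM with f = 0 and alpha_k = beta_k."""
    x, x_old = x0.astype(float), x0.astype(float)
    for _ in range(iters):
        y = x + alpha * (x - x_old)        # extrapolation point y^k
        x, x_old = y - tau * grad_H(y), x  # gradient step taken at y^k
    return x

# e.g. minimizing H(x) = ||x||^2 / 2 via heavy_ball(lambda x: x, np.ones(3))
```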
To prove the convergence of iPALM the parameters of the algorithm must be carefully chosen.

Assumption 3.5 (Conditions on the Parameters of iPALM). Let $L_1(x_2^k)$, $L_2(x_1^k)$ be defined by Assumption 3.1. There exists some $\epsilon > 0$ such that for all $k \in \mathbb{N}$ and $i = 1,2$ the following holds true:
(i) $0 \le \alpha_i^k \le \bar\alpha_i < \frac12$ and $0 \le \beta_i^k \le \bar\beta_i < 1$;
(ii) the step size parameters $\tau_1^k$ and $\tau_2^k$ are chosen in dependence on $\bar\alpha_i$, $\bar\beta_i$, $\epsilon$ and the partial Lipschitz constants as specified in [37].

The following theorem was proven in [37, Theorem 4.1].
Theorem 3.6. Let $F$ given by (2) be a KL function. Suppose that $H$ fulfills Assumption 3.1 and that $\nabla H$ is Lipschitz continuous on bounded subsets of $\mathbb{R}^{d_1} \times \mathbb{R}^{d_2}$. Further, let the parameters of iPALM fulfill Assumption 3.5. If the sequence $(x_1^k, x_2^k)_k$ generated by iPALM is bounded, then it converges to a critical point of $F$.

Remark 3.7. Even though we stated PALM and iPALM just for two blocks $(x_1, x_2)$ of variables, the convergence proofs from [4] and [37] work for more than two blocks as well.

Stochastic Variants of PALM and iPALM
For many problems in imaging and machine learning, the function $H$ in (2) has the form
$$H(x_1, x_2) = \frac{1}{n}\sum_{i=1}^n h_i(x_1, x_2), \tag{4}$$
where $n$ is large. Then the computation of the full gradients in PALM and iPALM is very time consuming. Therefore, stochastic approximations of the gradients were considered in the literature.

Stochastic PALM and SPRING
The idea to combine stochastic gradient estimators with a PALM scheme was first discussed by Xu and Yin in [47]. The authors replaced the gradient in Algorithm 3.1 by the stochastic gradient descent (SGD) estimator
$$\tilde\nabla_{x_1} H(x_1, x_2) := \frac{1}{b}\sum_{i \in B} \nabla_{x_1} h_i(x_1, x_2),$$
where $B \subset \{1, \ldots, n\}$ is a random subset (mini-batch) of fixed batch size $b = |B|$, and analogously for $\tilde\nabla_{x_2} H$.
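In code, the SGD estimator might be sketched as follows (our notation; `grad_h_i(i, x1, x2)` is assumed to return $\nabla_{x_1} h_i(x_1,x_2)$):

```python
import numpy as np

def sgd_estimator(grad_h_i, x1, x2, n, batch_size, rng=None):
    """Mini-batch SGD estimate of grad_{x1} H for H = (1/n) * sum_i h_i."""
    rng = rng or np.random.default_rng()
    batch = rng.choice(n, size=batch_size, replace=False)  # mini-batch B
    return sum(grad_h_i(i, x1, x2) for i in batch) / batch_size
```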
This gives Algorithm 4.1, which we call SPALM.

Algorithm 4.1 (Stochastic Proximal Alternating Linearized Minimization, SPALM/SPRING). This is Algorithm 3.1 with the gradients $\nabla_{x_1} H$ and $\nabla_{x_2} H$ replaced by stochastic gradient estimators $\tilde\nabla_{x_1} H$ and $\tilde\nabla_{x_2} H$.

Xu and Yin showed, under rather strong assumptions, in particular that $f$ and $g$ have to be Lipschitz continuous and the variance of the SGD estimator has to be bounded, that there exists a subsequence $(x_1^k, x_2^k)_k$ of iterates generated by Algorithm 4.1 such that $\mathbb{E}\big[\mathrm{dist}(0, \partial F(x_1^k, x_2^k))\big]$ converges to zero as $k \to \infty$. If $F$, $f$ and $g$ are strongly convex, the authors also proved convergence of the function values to the infimum of $F$. Driggs et al. [14] could weaken the assumptions and improve the convergence rate by replacing the SGD estimator by so-called variance-reduced gradient estimators. Let $\mathbb{E}_k := \mathbb{E}(\,\cdot\,|\,x^1,\ldots,x^k)$ denote the conditional expectation given the first $k$ sequence elements. Then these estimators have to fulfill the properties in Definition 4.1 of [14], which require in particular that the mean squared estimation error decays geometrically along the iterates.
While the SGD estimator is not variance-reduced, many popular gradient estimators such as the SAGA [12] and SARAH [33] estimators have this property. Since for many problems in image processing and machine learning the SAGA estimator is not applicable due to its high memory requirements, we focus on the SARAH estimator in this paper.

Definition 4.2 (SARAH Estimator). For $k = 0$, the SARAH estimator is given by the full gradient,
$$\tilde\nabla^0_{x_1} H := \nabla_{x_1} H(x_1^0, x_2^0),$$
and $p > 1$ is a fixed chosen parameter. Further, we define $B_i^k$ to be random subsets uniformly drawn from $\{1, \ldots, n\}$ of fixed batch size $b$. Then, for $k = 1, 2, \ldots$, the SARAH estimator reads as
$$\tilde\nabla^k_{x_1} H := \begin{cases} \nabla_{x_1} H(x_1^k, x_2^k), & \text{with probability } \frac1p,\\[4pt] \tilde\nabla^{k-1}_{x_1} H + \frac1b \displaystyle\sum_{i\in B_1^k}\big(\nabla_{x_1} h_i(x_1^k, x_2^k) - \nabla_{x_1} h_i(x_1^{k-1}, x_2^{k-1})\big), & \text{otherwise},\end{cases}$$
and analogously for $\tilde\nabla_{x_2} H$. In the sequel, we assume that the family of the random elements generating the estimator is independent.
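A minimal stateful sketch of this estimator (ours; the refresh probability $1/p$ and the recursive mini-batch correction follow the definition above):

```python
import numpy as np

class SarahEstimator:
    """SARAH estimate of grad_{x1} H for H = (1/n) * sum_i h_i."""

    def __init__(self, grad_h_i, n, batch_size, p, rng=None):
        self.grad_h_i, self.n, self.b, self.p = grad_h_i, n, batch_size, p
        self.rng = rng or np.random.default_rng()
        self.estimate, self.prev = None, None

    def __call__(self, x1, x2):
        if self.estimate is None or self.rng.random() < 1.0 / self.p:
            # full gradient (always at k = 0, later with probability 1/p)
            self.estimate = sum(self.grad_h_i(i, x1, x2)
                                for i in range(self.n)) / self.n
        else:
            # recursive update by mini-batch gradient differences
            batch = self.rng.choice(self.n, size=self.b, replace=False)
            self.estimate = self.estimate + sum(
                self.grad_h_i(i, x1, x2) - self.grad_h_i(i, *self.prev)
                for i in batch) / self.b
        self.prev = (x1, x2)
        return self.estimate
```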
The following proposition was shown in [14].

Proposition 4.3. Let $H$ be of the form (4) with functions $h_i$ having a globally $M$-Lipschitz continuous gradient. Then the SARAH gradient estimator is variance-reduced with parameter $\rho = \frac1p$.

The convergence results in the next theorem were proven in [14]. We refer to the type of convergence in (6) as linear convergence. Note that the parameter $V_2$ from the definition of variance reduction does not appear in the theorem. Actually, Driggs et al. [14] need the assumption containing $V_2$ only to prove tighter convergence rates for semi-algebraic functions $F$.
Theorem 4.4. Suppose that the step size sequence is non-increasing and fulfills the bounds required in [14], and that the gradient estimator is variance-reduced. Then, with $t$ drawn uniformly from $\{0, \ldots, T-1\}$, the expectation of the squared generalized gradient at $(x_1^t, x_2^t)$ converges to zero with rate $O(1/T)$ as $T \to \infty$. Furthermore, if for some $\gamma > 0$ the function $F$ fulfills a global error bound with respect to the generalized gradient, then
$$\mathbb{E}\big(F(x_1^T, x_2^T)\big) - F^* = O\big((1-\Theta)^T\big), \tag{6}$$
where $\Theta = \min\big(\frac{\gamma\eta\beta^2}{4}, \frac{\rho}{2}\big)$. In particular, the sequence $\mathbb{E}(F(x_1^T, x_2^T))$ converges linearly to the optimal value $F^*$.

Finally, we like to mention that Davis et al. [11] considered an asynchronous variant of PALM with stochastic noise in the gradient. Their approach requires an explicit bound on the noise, which is not fulfilled for the above gradient estimators. Thus, focus and setting in [11] differ from those of SPRING.

iSPALM
In particular for machine learning and deep learning, the combination of momentum-like methods and stochastic gradient estimators has turned out to be essential, as discussed e.g. in [16,44]. Thus, inspired by iPALM, we propose the inertial stochastic PALM (iSPALM) algorithm outlined in Algorithm 4.2.
Algorithm 4.2 (Inertial Stochastic Proximal Alternating Linearized Minimization, iSPALM). Input: $x_i^0 = x_i^{-1} \in \mathbb{R}^{d_i}$, inertial parameters $\alpha_i^k, \beta_i^k \ge 0$ and step size parameters $\tau_i^k > 0$, $i = 1,2$. For $k = 0, 1, \ldots$ iterate:
$$y_1^k = x_1^k + \alpha_1^k(x_1^k - x_1^{k-1}), \qquad z_1^k = x_1^k + \beta_1^k(x_1^k - x_1^{k-1}),$$
$$x_1^{k+1} \in \mathrm{prox}^f_{\tau_1^k}\Big(y_1^k - \tfrac{1}{\tau_1^k}\tilde\nabla_{x_1} H(z_1^k, x_2^k)\Big),$$
$$y_2^k = x_2^k + \alpha_2^k(x_2^k - x_2^{k-1}), \qquad z_2^k = x_2^k + \beta_2^k(x_2^k - x_2^{k-1}),$$
$$x_2^{k+1} \in \mathrm{prox}^g_{\tau_2^k}\Big(y_2^k - \tfrac{1}{\tau_2^k}\tilde\nabla_{x_2} H(x_1^{k+1}, z_2^k)\Big).$$

Remark 4.5. Similarly as in Remark 3.4, iSPALM can be viewed as a generalization of the stochastic versions of the momentum method and NAG to the nonsmooth case. Note that in the stochastic setting the theoretical error bounds of momentum methods are not tighter than for a plain gradient descent; an overview of these convergence results can be found in [16,44]. Consequently, we are not able to show tighter convergence rates for iSPALM than for stochastic PALM. Nevertheless, stochastic momentum methods such as momentum SGD and the Adam optimizer [25] are widely used and have shown a better convergence behavior than plain SGD in a huge number of applications.
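For the first block, one iSPALM update might be sketched as follows (our rendering of Algorithm 4.2; `grad_est` is a stochastic estimator of $\nabla_{x_1} H$, e.g. SARAH):

```python
def ispalm_block1_step(x1, x1_old, x2, grad_est, prox_f, tau1, alpha1, beta1):
    """One iSPALM update of the first block with two inertial points."""
    y1 = x1 + alpha1 * (x1 - x1_old)   # inertial point for the proximal step
    z1 = x1 + beta1 * (x1 - x1_old)    # inertial point for the gradient estimator
    return prox_f(y1 - grad_est(z1, x2) / tau1, tau1)
```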
To prove that the generalized gradients of the sequence of iterates produced by iSPALM converge to zero, some properties of the gradient estimator are required. The authors of [14] assumed that the estimators are evaluated at $(x_1^k, x_2^k)$ and $(x_1^{k+1}, x_2^k)$. In contrast, we require that the gradient estimators are evaluated at $(z_1^k, x_2^k)$ and $(x_1^{k+1}, z_2^k)$ for $k \in \mathbb{N}_0$. To prove a counterpart of Theorem 4.4, we modify Definition 4.1. In particular, we need only the first part of (i).
Definition 4.6 (Inertial Variance-Reduced Gradient Estimator). A gradient estimator $\tilde\nabla$ is called inertial variance-reduced for a differentiable function $H$ with constants $V_1, V_\Upsilon \ge 0$ and $\rho \in (0,1]$ if there exists a sequence of random variables $(\Upsilon_k)_{k\in\mathbb{N}}$ with $\mathbb{E}(\Upsilon_1) < \infty$ such that the following holds true:
(i) the conditional expectation of the squared estimation error at the inertial points is bounded by $\Upsilon_k$ plus $V_1$ times the expected squared distances of subsequent iterates;
(ii) the sequence $(\Upsilon_k)_k$ decays geometrically, that is, $\mathbb{E}_k(\Upsilon_{k+1}) \le (1-\rho)\,\Upsilon_k$ plus $V_\Upsilon$ times the expected squared distances of subsequent iterates.

To prove that the SARAH gradient estimator is inertial variance-reduced and that iSPALM converges, we need the following auxiliary lemma, which can be proved analogously to [14].

Lemma 4.7. Let $(x^k)_k$ be an arbitrary sequence with inertial points $y_i^k$ and $z_i^k$ as in Algorithm 4.2. Then, for any $k \in \mathbb{N}$ and $i = 1,2$, the squared distances between the inertial points of subsequent steps can be bounded by the squared distances of the corresponding iterates.

Now we can show the desired property of the SARAH gradient estimator.
Proposition 4.8. Let $H$ be of the form (4) with functions $h_i$ having a globally $M$-Lipschitz continuous gradient. Then the SARAH estimator $\tilde\nabla$ is inertial variance-reduced with parameter $\rho = \frac1p$ and suitable constants $V_1$ and $V_\Upsilon$. The proof is given in Appendix B.

Convergence Analysis of iSPALM
We assume that the parameters of iSPALM fulfill the following conditions.
Assumption 5.1 (Conditions on the Parameters of iSPALM). Let $\lambda_i^+$, $i = 1,2$, and $L_1(x_2^k)$, $L_2(x_1^k)$ be defined by Assumption 3.1 and $\rho$, $V_1$, $V_\Upsilon$ by Definition 4.6. Further, let $H\colon \mathbb{R}^{d_1}\times\mathbb{R}^{d_2}\to\mathbb{R}$ be given by (4) with functions $h_i$ having a globally $M$-Lipschitz continuous gradient. There exist $\bar\alpha_i$, $\bar\beta_i$ and $\varepsilon > 0$ such that for all $k \in \mathbb{N}$ and $i = 1,2$ the following holds true:
(i) $0 \le \alpha_i^k \le \bar\alpha_i < \frac12$ and $0 \le \beta_i^k \le \bar\beta_i < 1$;
(ii) the step size parameters $\tau_i^k$, $i = 1,2$, are chosen in dependence on $S := \frac{4(\rho V_1 + V_\Upsilon)}{\rho M} + \varepsilon$ and suitable constants $\delta_i$, $i = 1,2$.

To analyze the convergence behavior of iSPALM, we start with an auxiliary lemma which can be proven analogously to [37, Lemma 3.2].

Lemma 5.2. Let $\psi = \sigma + h$, where $h\colon \mathbb{R}^d \to \mathbb{R}$ is a continuously differentiable function with $L_h$-Lipschitz continuous gradient, and $\sigma\colon \mathbb{R}^d \to (-\infty,\infty]$ is proper and lower semi-continuous with $\inf_{\mathbb{R}^d}\sigma > -\infty$. Then, for any $u, v, w \in \mathbb{R}^d$ and any $u^+ \in \mathrm{prox}^\sigma_\tau\big(u - \frac1\tau\nabla h(w)\big)$ with $\tau > 0$, the value $\psi(u^+)$ can be bounded by $\psi(v)$ plus weighted squared distances between the points $u$, $v$, $w$ and $u^+$.

Now we can establish a result on the expectation of the squared distance of subsequent iterates. Note that analogous results were shown for PALM, iPALM and SPRING. Here we use a function $\Psi$ which contains not only the current function value but also the distance of the iterates to the previous ones. A similar idea was used in the convergence proof of iPALM [37]. Nevertheless, incorporating the stochastic gradient estimator makes the proof much more involved.
Theorem 5.3. Let $F\colon \mathbb{R}^{d_1}\times\mathbb{R}^{d_2}\to(-\infty,\infty]$ be given by (2) and fulfill Assumption 3.1. Let $(x_1^k, x_2^k)_k$ be generated by iSPALM with parameters fulfilling Assumption 5.1, where we use an inertial variance-reduced gradient estimator $\tilde\nabla$. Then there exists $\gamma > 0$ such that the quantities $\Psi_k$ decrease in expectation up to multiples of the expected squared distances of subsequent iterates; in particular, $\sum_k \mathbb{E}\big(\|x^{k+1}-x^k\|^2\big) < \infty$.

Proof. By Lemma 5.2 with $\psi := H(\cdot, x_2) + f$, we obtain a descent estimate for the first block. Combined with (7), this yields an estimate involving the gradient estimation error, and Lemma 4.7 allows us to bound the distances of the inertial points by the distances of subsequent iterates. Analogously, we conclude for $\psi := H(x_1, \cdot) + g$ a corresponding estimate for the second block. Adding the two inequalities, we obtain (8). Reformulating (8) in terms of $\Psi_k$, setting $s_1^k = s_2^k := M$, using that $L_i^k \le M$, taking the conditional expectation $\mathbb{E}_k$ in (10) and using that $\tilde\nabla$ is an inertial variance-reduced estimator, we arrive at (11). Since $\tilde\nabla$ is inertial variance-reduced, we know from Definition 4.6(ii) that the $\Upsilon_k$ decay geometrically. Inserting this in (11) and using the definition of $S$ yields (13). Choosing $\tau_i^k$ and $\delta_i$, $i = 1,2$, as in Assumption 5.1(ii), we obtain by a straightforward computation that $a_i^k = \delta_i$ for $i = 1,2$ and all $k \in \mathbb{N}$. Applying this in (13) and using that, by the definition of $S$, the remaining coefficient is bounded below by $\varepsilon$, we arrive at the claimed descent for some $\gamma > 0$. Taking the full expectation and summing up for $k = 1, \ldots, T$ finishes the proof.
Next, we want to relate the sequence of iterates generated by iSPALM to the subgradient of the objective function. Such a relation was also established for the (inertial) PALM algorithm. However, due to the stochastic gradient estimator, the proof differs significantly from its deterministic counterparts. Note that the convergence analysis of SPRING in [14] does not use the subdifferential but the so-called generalized gradient $G^F_{\tau_1,\tau_2}$. This is not fully satisfying, since it is not clear how this generalized gradient is related to the (sub)differential of the objective function in limit processes with varying $\tau_1$ and $\tau_2$. In particular, it is easy to find examples of $F$ and sequences $(\tau_1^k)_k$ and $(\tau_2^k)_k$ such that the generalized gradient $G^F_{\tau_1^k,\tau_2^k}(x_1,x_2)$ is non-zero but converges to zero for fixed $x_1$ and $x_2$.
Theorem 5.4. Under the assumptions of Theorem 5.3, there exists some $C > 0$ such that the expected squared distance of $0$ to $\partial F(x_1^{k+1}, x_2^{k+1})$ is bounded by $C$ times the expected squared distances of subsequent iterates. In particular, it holds $\mathbb{E}\big[\mathrm{dist}(0, \partial F(x_1^k, x_2^k))^2\big] \to 0$ as $k \to \infty$.

Proof. By the definition of $x_1^{k+1}$, (1) and Proposition 2.3, the optimality condition of the proximal step for the first block yields an inclusion in $\partial f(x_1^{k+1})$, and analogously for the second block. Then we obtain by Proposition 2.3 an element $v \in \partial F(x_1^{k+1}, x_2^{k+1})$ composed of iterate differences and gradient estimation terms, and it remains to show that the squared norm of $v$ is bounded in expectation as claimed. Since $\nabla H$ is $M$-Lipschitz continuous and $(a+b)^2 \le 2(a^2+b^2)$, we can estimate $\|v\|^2$ by squared iterate distances and squared gradient estimation errors. Using Lemma 4.7 and the fact that $\tilde\nabla$ is inertial variance-reduced, this leads to a bound with a constant $C_0$. Noting that $\mathrm{dist}(0, \partial F(x_1^{k+1}, x_2^{k+1}))^2 \le \|v\|^2$, taking the conditional expectation $\mathbb{E}_k$ and using once more that $\tilde\nabla$ is inertial variance-reduced, we conclude the assertion. Taking the full expectation on both sides and setting $C := C_0 + 3V_1$ proves the claim.
Using Theorem 5.4, we can show the linear decay of the expected squared distance of the subgradient to 0.
In [14], the authors proved global convergence of the objective function evaluated at the iterates of SPRING in expectation if a global error bound is fulfilled for some $\mu > 0$. Using this error bound, we can also prove global convergence of iSPALM in expectation with a linear convergence rate. Note that the authors of [14] used the generalized gradient instead of the subgradient also for this error bound. As before, this seems unsuitable due to the heavy dependence of the generalized gradient on the step size parameters.
Theorem 5.6 (Convergence of iSPALM). Let the assumptions of Theorem 5.3 hold true and assume in addition that $F$ fulfills the global error bound. Then the sequence $\mathbb{E}(F(x_1^T, x_2^T))$ converges linearly to the optimal value of $F$.

Proof. Setting suitable constants $\Theta_0$ and $\Theta_1$ in terms of $C$, $C_\Upsilon$ and $s$, applying the resulting recursion iteratively and using that $\Upsilon_{T+1} \ge 0$ yields the claim.

Student-t Mixture Models
In this section, we show how PALM and its inertial and stochastic variants can be applied to learn Student-t MMs. To this end, we denote by $\mathrm{Sym}(d)$ the linear space of symmetric $d\times d$ matrices, by $\mathrm{SPD}(d)$ the cone of symmetric positive definite $d\times d$ matrices and by $\Delta_K$ the probability simplex in $\mathbb{R}^K$. The density of the $d$-dimensional Student-t distribution $T_\nu(\mu,\Sigma)$ with $\nu > 0$ degrees of freedom, location $\mu \in \mathbb{R}^d$ and scatter matrix $\Sigma \in \mathrm{SPD}(d)$ is given by
$$f(x|\nu,\mu,\Sigma) = \frac{\Gamma\big(\frac{d+\nu}{2}\big)}{\Gamma\big(\frac{\nu}{2}\big)(\nu\pi)^{d/2}|\Sigma|^{1/2}}\Big(1 + \tfrac1\nu (x-\mu)^{\mathrm{T}}\Sigma^{-1}(x-\mu)\Big)^{-\frac{d+\nu}{2}},$$
with the Gamma function $\Gamma(s) := \int_0^\infty t^{s-1}\mathrm{e}^{-t}\,\mathrm{d}t$. The expectation of the Student-t distribution is $\mathbb{E}(X) = \mu$ for $\nu > 1$ and the covariance matrix is given by $\mathrm{Cov}(X) = \frac{\nu}{\nu-2}\Sigma$ for $\nu > 2$; otherwise these quantities are undefined. The smaller the value of $\nu$, the heavier are the tails of the $T_\nu(\mu,\Sigma)$ distribution. For $\nu \to \infty$, the Student-t distribution $T_\nu(\mu,\Sigma)$ converges to the normal distribution $\mathcal{N}(\mu,\Sigma)$, and for $\nu = 0$ it is related to the projected normal distribution on the sphere $\mathbb{S}^{d-1} \subset \mathbb{R}^d$. Figure 1 illustrates this behavior for the one-dimensional standard Student-t distribution. A Student-t MM with $K$ components can be described by a latent random variable mapping into $\{1, \ldots, K\}$ that selects the component; its density is
$$p(x) = \sum_{k=1}^K \alpha_k f(x|\nu_k, \mu_k, \Sigma_k), \qquad \alpha = (\alpha_1,\ldots,\alpha_K) \in \Delta_K.$$
For samples $\mathcal{X} = (x_1, \ldots, x_n)$, we aim to find the parameters of the Student-t MM by minimizing its negative log-likelihood function
$$L(\alpha,\nu,\mu,\Sigma|\mathcal{X}) = -\sum_{i=1}^n \log\Big(\sum_{k=1}^K \alpha_k f(x_i|\nu_k,\mu_k,\Sigma_k)\Big) \tag{17}$$
subject to the parameter constraints. A first idea to rewrite this problem in the form (2) incorporates the constraints via $f_1 := \iota_{\Delta_K}$, $f_2 := \iota_{\mathbb{R}^K_{>0}}$, $f_3 := 0$, $f_4 := \iota_{\mathrm{SPD}(d)^K}$, where $\iota_S$ denotes the indicator function of the set $S$ defined by $\iota_S(x) := 0$ if $x \in S$ and $\iota_S(x) := \infty$ otherwise. Indeed, one of the authors applied PALM and iPALM to such a setting without any convergence guarantee in [20]. The problem is that $L$ is not defined on the whole Euclidean space, and since $L(\alpha,\nu,\mu,\Sigma) \to \infty$ if $\Sigma_k \to 0$ for some $k$, the function can also not be continuously extended to it. Moreover, the functions $f_2$ and $f_4$ are not lower semi-continuous. Consequently, the objective (17) does not fulfill the assumptions required for the convergence of PALM and iPALM or their stochastic variants. Therefore, we modify the above model: we use surjective mappings from the whole Euclidean space onto the constraint sets to reshape problem (17) as the unconstrained optimization problem (19). Note that the functions $f_i$, $i = 1, \ldots, 4$, are then just zero.
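For concreteness, the negative log-likelihood with an unconstrained parametrization might be implemented as follows. This is a sketch with illustrative reparametrizations (softmax weights, softplus degrees of freedom, Cholesky factors for the scatter matrices), which are assumptions of ours and not necessarily the specific surjective mappings used to obtain (19):

```python
import torch

def student_t_mm_nll(X, a, nu_raw, mu, L_raw):
    """Negative log-likelihood of a Student-t mixture model.

    X: (n, d) samples; a: (K,) weight logits; nu_raw: (K,) raw degrees of
    freedom; mu: (K, d) locations; L_raw: (K, d, d) raw Cholesky factors,
    giving Sigma_k = L_k L_k^T with positive diagonal (hence SPD).
    """
    n, d = X.shape
    alpha = torch.softmax(a, dim=0)                          # weights on the simplex
    nu = torch.nn.functional.softplus(nu_raw)                # nu > 0
    diag = torch.nn.functional.softplus(torch.diagonal(L_raw, dim1=-2, dim2=-1))
    L = torch.tril(L_raw, diagonal=-1) + torch.diag_embed(diag)
    diff = X[:, None, :] - mu[None, :, :]                    # (n, K, d)
    sol = torch.linalg.solve_triangular(L, diff[..., None], upper=False)[..., 0]
    maha = (sol ** 2).sum(-1)                                # squared Mahalanobis distances
    logdet = torch.log(diag).sum(-1)                         # = 0.5 * log|Sigma|
    log_f = (torch.lgamma((nu + d) / 2) - torch.lgamma(nu / 2)
             - 0.5 * d * torch.log(nu * torch.pi) - logdet
             - 0.5 * (nu + d) * torch.log1p(maha / nu))
    return -torch.logsumexp(torch.log(alpha) + log_f, dim=1).sum()
```

Minimizing this function with the PALM-type methods then treats the four parameter groups as the four blocks.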
For problem (19), PALM and iPALM basically reduce to block gradient descent algorithms as in Algorithm 6.1 and 6.2, respectively. Note that we use $\beta_i^k = 0$ for all $k, i$ and $\alpha_i^k = \rho_k$ for $i = 1, \ldots, 4$ as iPALM parameters in Algorithm 6.2. For the stochastic variants SPRING and iSPALM, we just have to replace the gradient by a stochastic gradient estimator.

Algorithm 6.1 (Proximal Alternating Linearized Minimization (PALM) for Student-t MMs).

Finally, we will show that $H$ in (19)
• is a KL function which is bounded from below, and
• has partial gradients which are globally Lipschitz continuous.
Then we know by Remark 3.2 that Assumption 3.1(ii) is also fulfilled. Further, $\nabla H$ is continuous on bounded sets. Hence, choosing the parameters of PALM resp. iPALM as required by Theorem 3.3 resp. 3.6, we conclude that the sequences generated by both algorithms converge to a critical point of $H$, provided that they are bounded. Similarly, if we assume in addition that the stochastic gradient estimators are variance-reduced resp. inertial variance-reduced, we can conclude the corresponding convergence results for SPRING resp. iSPALM.

Computation of Gradients and Approximate Lipschitz Constants. Since the global and partial Lipschitz constants of $H$ are usually unknown, we estimate them locally using the second order derivative of $H$, which exists in our examples. If $H$ acts on a high-dimensional space, it is often computationally too costly to compute the full Hessian matrix. Thus we compute a local Lipschitz constant only in the gradient direction, see (21). For the stochastic algorithms, we replace $H$ by the approximated function $\tilde H(x_1,x_2) := \frac{1}{|B_i^k|}\sum_{i\in B_i^k} h_i(x_1,x_2)$, where $B_i^k$ is the current mini-batch. The analytical computation of $\tilde L_i$ in (21) is still hard. Even computing the gradient of a complicated function $H$ can be error-prone and laborious. Therefore, we compute the (partial) gradients of $H$ or $\tilde H$, respectively, using the reverse mode of algorithmic differentiation (also called backpropagation), see e.g. [17]. By the chain rule, we can compute $\tilde L_i(x_1, x_2)$ by applying the reverse mode twice, as illustrated in the sketch below. If we neglect the taping, the execution time of this procedure can provably be bounded by a constant times the execution time of $H$, see [17, Section 5.4]. Therefore, this procedure gives us an accurate and computationally very efficient estimate of the local partial Lipschitz constant.
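A sketch of this double-backpropagation estimate (ours; `H` is assumed to map a parameter tensor to a scalar loss, and the quotient $\|\nabla^2 H(x)\,g\|/\|g\|$ with $g = \nabla H(x)$ serves as the local directional Lipschitz constant):

```python
import torch

def local_lipschitz_estimate(H, x):
    """Estimate a local Lipschitz constant of grad H at x in gradient direction.

    First reverse-mode pass: g = grad H(x). Second pass: the gradient of
    <grad H(x), g> with g treated as a constant is the Hessian-vector
    product (grad^2 H(x)) g, so the quotient estimates the local curvature.
    """
    x = x.detach().requires_grad_(True)
    g = torch.autograd.grad(H(x), x, create_graph=True)[0]
    hvp = torch.autograd.grad(torch.sum(g * g.detach()), x)[0]
    return (hvp.norm() / g.detach().norm()).item()
```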
Inertial Parameters. For iPALM and iSPALM-SARAH we have to choose the inertial parameters $\alpha_i^k \ge 0$ and $\beta_i^k \ge 0$. With respect to our convergence results, we have to assume that there exist $\alpha_i^k \le \bar\alpha_i < \frac12$ and $\beta_i^k \le \bar\beta_i < 1$, $i = 1,2$. Note that for convex functions $f$ and $g$, the authors in [37] proved that the assumption on the $\alpha$'s can be relaxed to $\alpha_i^k \le \bar\alpha_i < 1$ and suggested the choice $\alpha_i^k = \frac{k-1}{k+2}$. Unfortunately, we cannot show this for iSPALM, and indeed we observe instability and divergence of iSPALM-SARAH if we choose $\alpha_i^k > \frac12$. Therefore, we choose for iSPALM-SARAH inertial parameters proportional to $\frac{k-1}{k+2}$, where the proportionality factor involves a scalar $0 < s_2 < 1$ that is manually chosen depending on the application.

Proximal Neural Networks (PNNs). As described in the introduction, a PNN consists of layers of the form $T^{\mathrm{T}}\sigma(Tx+b)$ with matrices $T$ in the Stiefel manifold. It was demonstrated in [19] that this kind of network is more stable under adversarial attacks than the same network without the orthogonality constraints.
Training PNNs with iSPALM-SARAH. Now we want to train a PNN with $K-1 = 3$ layers and $n_1 = 784$, $n_2 = 400$ and $n_3 = 200$ for MNIST classification. In order to apply our theory, we use the exponential linear unit (ELU) as activation function, which is differentiable with a 1-Lipschitz gradient. The loss function then has the form (2), where $f$ is the characteristic function of the set $\mathcal{U}$ of feasible weights. Since $H$ is unfortunately not Lipschitz continuous, we propose a slight modification: note that for any $u = (T_1, \ldots, T_4, b_1, \ldots, b_4)$ which appears as $x^k$, $y^k$ or $z^k$ in PALM, iPALM, SPRING-SARAH or iSPALM-SARAH, there exist $v, w \in \mathcal{U}$ such that $u = v + w$. In particular, the iterates stay in a bounded set, so that we can replace $H$ by a function which coincides with $H$ on this set and fulfills the required Lipschitz assumptions.

For the implementation, we need to calculate $\mathrm{prox}^f$, which is the orthogonal projection $P_{\mathcal{U}}$ onto $\mathcal{U}$. This includes the projection of the matrices $T_i$, $i = 1,2,3$, onto the Stiefel manifold. In [23, Sections 7.3, 7.4] it is shown that the projection of a matrix $A$ onto the Stiefel manifold is given by the $U$-factor of the polar decomposition $A = US$, where $U \in \mathrm{St}(d,n)$ and $S$ is symmetric and positive definite. Note that $U$ is unique only if $A$ is non-singular. Several possibilities for computing $U$ are considered in [22, Chapter 8]. In particular, $U$ is given by $VW^{\mathrm{T}}$, where $A = V\Sigma W^{\mathrm{T}}$ is the singular value decomposition of $A$. For our numerical experiments we use a Newton-type iteration with $Y_0 = A$, which converges for any non-singular $A$ to $U$, see [22].
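In code, the SVD variant of the projection might be sketched as follows (ours); the Newton-type iteration of [22] can serve as a faster alternative for well-conditioned matrices:

```python
import torch

def project_stiefel(A):
    """Orthogonal projection onto the Stiefel manifold: the U-factor of the
    polar decomposition A = U S, computed via the SVD A = V Sigma W^T as
    U = V W^T (unique whenever A is non-singular)."""
    V, _, Wh = torch.linalg.svd(A, full_matrices=False)
    return V @ Wh
```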
B. Proof of Proposition 4.8. Altogether we obtain bounds of the required form, which proves properties (i) and (ii) of Definition 4.6. Taking the full expectation in (24) and iterating, we want to show that the resulting error terms vanish. Since the first summand converges to zero for $k$ large enough, it remains to prove that for an arbitrary $\epsilon > 0$ there exists some $k_0 \in \mathbb{N}$ such that for all $k \ge k_0$ the remaining sum becomes not larger than $\epsilon$. Recalling the definition of the involved quantities, the sum can be estimated accordingly for $k \ge k_0$, and we are done.
D. Proof of Lemma 6.2. Since the sum of Lipschitz continuous functions is Lipschitz continuous, it is sufficient to show the claim for the summands of $H$ in (19). Hence we consider a single summand $h(\alpha,\nu,\mu,\Sigma) := \log\big(\sum_{j=1}^K \alpha_j f_j\big)$ and use the abbreviations $f_l := f(x|\nu_l,\mu_l,\Sigma_l)$ and $s_l := (x-\mu_l)^{\mathrm{T}}\Sigma_l^{-1}(x-\mu_l)$. Then the derivative with respect to $\alpha$ is given by $\frac{\partial L(\alpha,\nu,\mu,\Sigma|\mathcal{X})}{\partial\alpha_l}$. Using that for $g = \log f$ the relation $g' = \frac{f'}{f}$ holds true, the derivatives with respect to $\mu_l$, $\Sigma_l$ and $\nu_l$ have an analogous quotient form. Together with (25)-(27), we obtain that the resulting expression is bounded, so that $\nabla_\alpha h(\cdot,\nu,\mu,\Sigma)$ is globally Lipschitz continuous.
For the derivative with respect to $\nu_l$, we show that the functions $g_i$, $i = 1,2$, are Lipschitz continuous and bounded, which implies that $\frac{\partial}{\partial\nu_l} h(\alpha,\cdot,\mu,\Sigma)$ is Lipschitz continuous. The value $|g_2(\nu_l)|$ can be bounded in terms of $\nu_l$, $s_l$ and the digamma function $\psi$; since $g_2$ is continuous, we conclude that $g_2$ is bounded. Using again (30) and the recurrence $\psi'(x+1) = \psi'(x) - \frac{1}{x^2}$ of the trigamma function, as well as the monotonicity of the digamma function and its derivatives, we see that $g_2'$ has a finite limit. Since $g_2'$ is continuous, this yields that it is bounded, which implies that $g_2$ is Lipschitz continuous.
This expression is bounded. Similarly, we get the corresponding bound for the remaining indices $j$. Thus, $g_1$ is Lipschitz continuous.


Figure 2: Objective function versus number of epochs and versus execution time for estimating the parameters of Student-t MMs. Panels: (a) average objective versus epochs; (b) average objective versus execution time; (c) standard deviation of the objective versus epochs; (d) average squared norm of the gradient versus epochs.
Now we run the PALM, iPALM, SPRING-SARAH and iSPALM-SARAH algorithms for 200 epochs using a batch size of $b = 1500$. One epoch comprises 40 steps for SPRING-SARAH and iSPALM-SARAH and 1 step for PALM and iPALM. We repeat the experiment 10 times with the same initialization and plot the mean and standard deviation of the resulting loss values to represent the randomness of the algorithms. Figure 3 shows the mean and standard deviation of the loss versus the number of epochs and the execution time, as well as the squared norm of the Riemannian gradient for the iterates of iSPALM-SARAH after each epoch. We observe that iSPALM-SARAH performs much better than SPRING-SARAH and that iPALM performs much better than PALM. Therefore, this example demonstrates the importance of the inertial parameters in iPALM and iSPALM-SARAH. Further, iSPALM-SARAH and SPRING-SARAH outperform their deterministic versions significantly. The weights resulting from iSPALM-SARAH reach after 200 epochs an average accuracy of 0.985 on the test set.

By (29) we obtain
$$\nabla_{\mu_l} h(\alpha,\nu,\mu,\Sigma) = \underbrace{\frac{\alpha_l f_l}{\sum_{j}\alpha_j f_j}}_{g_1}\;\underbrace{\frac{d+\nu_l}{\nu_l+s_l}\,\Sigma_l^{-1}(x-\mu_l)}_{g_2}.$$
As in the second part of the proof, it suffices to show that $g_i$, $i = 1,2$, are bounded and Lipschitz continuous. Calculating the Jacobian
$$\nabla_{\mu_l} g_2(\mu_l) = \frac{d+\nu_l}{(\nu_l+s_l)^2}\,\Sigma_l^{-1}(\mu_l-x)(\mu_l-x)^{\mathrm{T}}\Sigma_l^{-1} + \frac{d+\nu_l}{\nu_l+s_l}\,\Sigma_l^{-1}$$
and taking the Frobenius norm $\|\cdot\|_F$, we obtain that $\|\nabla_{\mu_l} g_2(\mu_l)\|_F$ is bounded, since the factors $\Sigma_l^{-1}(\mu_l-x)$ are Lipschitz continuous and bounded. Thus also $g_2 = h_1 h_2$ is Lipschitz continuous and bounded. Finally, the function $g_1$ maps into the interval $[0,1]$ and is Lipschitz continuous by the same arguments as in the second part of the proof.

Figure 3: Loss function versus number of epochs and versus execution time for training a PNN for MNIST classification. Panels: (a) average loss versus epochs on the training set; (b) average loss versus execution time on the training set; (c) standard deviation of the loss versus execution time on the training set; (d) Riemannian gradient of the loss versus epochs on the training set; (e) average loss versus epochs on the test set; (f) average loss versus execution time on the test set.