A Riemannian Newton Trust-Region Method for Fitting Gaussian Mixture Models

Gaussian Mixture Models are a powerful tool in Data Science and Statistics that is mainly used for clustering and density approximation. In practice, the task of estimating the model parameters is often solved by the Expectation Maximization (EM) algorithm, which has its benefits in its simplicity and low per-iteration costs. However, EM converges slowly if there is a large share of hidden information or overlapping clusters. Recent advances in manifold optimization for Gaussian Mixture Models have gained increasing interest. We introduce an explicit formula for the Riemannian Hessian for Gaussian Mixture Models. Building on this, we propose a new Riemannian Newton Trust-Region method which outperforms current approaches both in terms of runtime and number of iterations. We apply our method to clustering problems and density approximation tasks. Compared to existing methods, our method is particularly powerful for data with a large share of hidden information.


Introduction
Gaussian Mixture Models are widely recognized in Data Science and Statistics. The fact that any probability density can be approximated by a Gaussian Mixture Model with a sufficient number of components makes it an attractive tool in statistics. However, this comes with some computational limitations, some of which are described in Ormoneit and Tresp (1995), Lee and McLachlan (2013), Coretto (2021). Nevertheless, we here focus on the benefits of Gaussian Mixture Models. Besides the goal of density approximation, the possibility of modeling latent features by the underlying components makes it also a strong tool for soft clustering tasks. Typical applications are found in the areas of image analysis (Alfò et al., 2008; Dresvyanskiy et al., 2020; Zoran and Weiss, 2012), pattern recognition (Wu et al., 2012; Bishop, 2006), econometrics (Articus and Burgard, 2014; Compiani and Kitamura, 2016) and many others.
We state the Gaussian Mixture Model in the following: Let K ∈ N be given. The Gaussian Mixture Model (with K components) is given by the (multivariate) density function p of the form

p(x) = Σ_{j=1}^K α_j p_N(x; μ_j, Σ_j),   (1)

with positive mixture components α_j that sum up to 1 and Gaussian density functions p_N with means μ_j ∈ R^d and covariance matrices Σ_j ∈ R^{d×d}. In order to have a well-defined expression, we impose Σ_j ≻ 0, i.e. the Σ_j are symmetric positive definite.
Given observations x_1, . . . , x_m, the goal of parameter estimation for Gaussian Mixture Models consists in maximizing the log-likelihood. This yields the optimization problem

max_{(α_1,...,α_K) ∈ Δ_K, μ_j ∈ R^d, Σ_j ≻ 0}  Σ_{i=1}^m log Σ_{j=1}^K α_j p_N(x_i; μ_j, Σ_j),   (2)

where Δ_K = {(α_1, . . . , α_K) : α_j ∈ R_+ ∀j, Σ_{j=1}^K α_j = 1} is the K-dimensional probability simplex and the covariance matrices Σ_j are restricted to the set of positive definite matrices.
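For concreteness, the objective in (2) can be evaluated stably with a log-sum-exp over components. The following NumPy sketch is our own illustration (the helper names are hypothetical, not from the paper):

```python
import numpy as np

def gauss_logpdf(X, mu, Sigma):
    """Log-density of N(mu, Sigma), evaluated at the rows of X."""
    d = mu.shape[0]
    L = np.linalg.cholesky(Sigma)            # Sigma must be symmetric p.d.
    sol = np.linalg.solve(L, (X - mu).T)     # L^{-1}(x - mu), columnwise
    log_det = 2.0 * np.sum(np.log(np.diag(L)))
    return -0.5 * (d * np.log(2 * np.pi) + log_det + np.sum(sol**2, axis=0))

def gmm_log_likelihood(X, alphas, mus, Sigmas):
    """sum_i log sum_j alpha_j p_N(x_i; mu_j, Sigma_j), via log-sum-exp."""
    log_terms = np.stack([np.log(a) + gauss_logpdf(X, mu, S)
                          for a, mu, S in zip(alphas, mus, Sigmas)], axis=1)
    mx = log_terms.max(axis=1)
    return float(np.sum(mx + np.log(np.sum(np.exp(log_terms - mx[:, None]),
                                           axis=1))))
```

The log-sum-exp guards against underflow when components are far apart, which matters for the well-separated settings discussed below.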
In practice, this problem is commonly solved by the Expectation Maximization (EM) algorithm. It is known that the Expectation Maximization algorithm converges fast if the K clusters are well separated. Ma et al. (2000) showed that in such a case, the convergence rate is superlinear. However, Expectation Maximization has its speed limits for highly overlapping clusters, where the latent variables have a non-negligible probability among more than one cluster. In such a case, the convergence is linear (Xu and Jordan, 1996), which might result in slow parameter estimation despite very low per-iteration costs.
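For reference, one EM iteration computes the responsibilities (E-step) and then applies the closed-form parameter updates (M-step). A minimal sketch with our own function names (not the paper's implementation):

```python
import numpy as np

def _gauss_logpdf(X, mu, Sigma):
    """log p_N(x; mu, Sigma) for the rows of X."""
    d = mu.shape[0]
    L = np.linalg.cholesky(Sigma)
    sol = np.linalg.solve(L, (X - mu).T)
    return -0.5 * (d * np.log(2 * np.pi)
                   + 2.0 * np.sum(np.log(np.diag(L)))
                   + np.sum(sol**2, axis=0))

def em_step(X, alphas, mus, Sigmas):
    """One EM iteration for a GMM (illustrative sketch)."""
    m, _ = X.shape
    K = len(alphas)
    # E-step: responsibilities r_ij proportional to alpha_j p_N(x_i; mu_j, Sigma_j)
    logr = np.stack([np.log(alphas[j]) + _gauss_logpdf(X, mus[j], Sigmas[j])
                     for j in range(K)], axis=1)
    logr -= logr.max(axis=1, keepdims=True)
    r = np.exp(logr)
    r /= r.sum(axis=1, keepdims=True)
    # M-step: closed-form maximizers of the expected complete-data log-likelihood
    Nj = r.sum(axis=0)
    alphas_new = Nj / m
    mus_new = (r.T @ X) / Nj[:, None]
    Sigmas_new = []
    for j in range(K):
        diff = X - mus_new[j]
        Sigmas_new.append((r[:, j, None] * diff).T @ diff / Nj[j])
    return alphas_new, mus_new, Sigmas_new
```

Note that the M-step keeps the covariance updates positive semidefinite by construction, which is the constraint-handling property referred to below.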
From a nonlinear optimization perspective, the problem in (2) can be seen as a constrained nonlinear optimization problem. However, the positive definiteness constraint on the covariance matrices Σ_j is a challenge for applying standard nonlinear optimization algorithms. While this constraint is naturally fulfilled in the EM algorithm, we cannot simply drop it, as alternative methods might leave the parameter space. Approaches to this problem like introducing a Cholesky decomposition (Salakhutdinov et al., 2003; Xu and Jordan, 1996), or using interior point methods via smooth convex inequalities (Vanderbei and Benson, 1999), can be applied, and one might hope for faster convergence with Newton-type algorithms. Methods based on a Conjugate Gradient algorithm (or a combination of EM and CG) led to faster convergence for highly overlapping clusters (Salakhutdinov et al., 2003). However, imposing the positive-definiteness constraint by a Cholesky decomposition induces an additional numeric overhead.
In recent approaches, Hosseini and Sra (2015, 2020) suggest to exploit the geometric structure of the set of positive definite matrices. As an open set in the set of symmetric matrices, it admits a manifold structure (Bhatia, 2007), and thus the concepts of Riemannian optimization can be applied. Riemannian optimization, i.e. optimizing over parameters that live on a smooth manifold, is well studied and has gained increasing interest in the domain of Data Science, for example for tensor completion problems (Heidel and Schulz, 2018). However, the idea is quite new for Gaussian Mixture Models, and Hosseini and Sra (2015, 2020) showed promising results with a Riemannian LBFGS and a Riemannian Stochastic Gradient Descent algorithm. Their results are based on a reformulation of the log-likelihood in (2) that turns out to be very efficient in terms of runtime. By design, the algorithms investigated in Hosseini and Sra (2015, 2020) do not use exact second-order information of the objective function. Driven by the quadratic local convergence of the Riemannian Newton method, we thus might hope for faster algorithms with the availability of the Riemannian Hessian. In the present work, we derive a formula for the Riemannian Hessian of the reformulated log-likelihood and suggest a Riemannian Newton Trust-Region method for parameter estimation of Gaussian Mixture Models.
The paper is organized as follows. In Section 2, we introduce the reader to the concepts of Riemannian optimization and the Riemannian setting of the reformulated log-likelihood for Gaussian Mixture Models. In particular, we derive the expression for the Riemannian Hessian in Subsection 2.3, which is a key contribution enabling richer Riemannian methods for Gaussian Mixture Models. In Section 3 we present the Riemannian Newton Trust-Region method and prove global convergence and superlinear local convergence for our problem. We compare our proposed method with existing algorithms on both artificial and real-world data sets in Section 4 for the tasks of clustering and density approximation.

Riemannian Setting for Gaussian Mixture Models
In the following, we lay the foundations of Riemannian optimization and afterwards specify the characteristics for Gaussian Mixture Models. In particular, we introduce a formula for the Riemannian Hessian for the reformulated problem, which is the basis for second-order optimization algorithms.

Riemannian Optimization
To construct Riemannian optimization methods, we briefly state the main concepts of Optimization on Manifolds, or Riemannian Optimization. Good introductions are Absil et al. (2008) and Boumal (2020); we here follow the notation of Absil et al. (2008). The concepts of Riemannian optimization are based on concepts from unconstrained Euclidean optimization algorithms, generalized to (possibly nonlinear) manifolds.
A manifold is a space that locally resembles Euclidean space, meaning that we can locally map points on the manifold to R^n via bicontinuous mappings, where n denotes the dimension of the manifold. In order to define a generalization of differentials, Riemannian optimization methods require smooth manifolds, meaning that the transition mappings are smooth functions. As manifolds are in general not vector spaces, standard optimization algorithms like line-search methods cannot be directly applied, as the iterates might leave the admissible set. Instead, one moves along tangent vectors in tangent spaces T_θM, which locally approximate the manifold at points θ ∈ M. Tangent spaces are first-order approximations of the manifold at specific points, and the tangent bundle TM is the disjoint union of the tangent spaces T_θM. In Riemannian manifolds, each tangent space T_θM for θ ∈ M is endowed with an inner product ⟨·, ·⟩_θ that varies smoothly with θ. The inner product is essential for Riemannian optimization methods as it admits a notion of length associated with the manifold. The optimization methods also require some local pull-back from the tangent spaces T_θM to the manifold M, which can be interpreted as moving along a specific curve on M (dotted curve in Figure 2.1). This is realized by the concept of retractions: retractions are mappings from the tangent bundle TM to the manifold M with rigidity conditions: R_θ(0_θ) = θ, and we move through the zero element 0_θ with velocity ξ_θ, i.e. DR_θ(0_θ)[ξ_θ] = ξ_θ.

Figure 2.1: Retraction-based Riemannian Optimization
Roughly speaking, a step of a Riemannian optimization algorithm works as follows:
• At the iterate θ^t, compute a new step ξ_{θ^t} on the tangent space T_{θ^t}M.
• Pull the new step back to the manifold by applying the retraction at θ^t, i.e. set θ^{t+1} = R_{θ^t}(ξ_{θ^t}).
Here, the crucial part that has an impact on convergence speed is the computation of the new step on the tangent space, just like in the Euclidean case. As Riemannian optimization algorithms are a generalization of Euclidean unconstrained optimization algorithms, we thus introduce generalizations of the gradient and the Hessian.
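This two-step scheme can be illustrated on the unit sphere as a simple stand-in manifold (not the GMM manifold used later). In the sketch below, which is our own illustration, the Riemannian gradient is the tangential projection of the Euclidean gradient and the retraction is the metric projection back onto the sphere:

```python
import numpy as np

def sphere_retract(x, xi):
    """Projection retraction on the unit sphere: R_x(xi) = (x + xi)/||x + xi||."""
    v = x + xi
    return v / np.linalg.norm(v)

def sphere_grad(x, egrad):
    """Riemannian gradient = projection of the Euclidean gradient onto T_x S^{n-1}."""
    return egrad - (x @ egrad) * x

# Minimize f(x) = -a.x over the unit sphere; the minimizer is a/||a||.
a = np.array([3.0, 4.0])
x = np.array([1.0, 0.0])
for _ in range(100):
    g = sphere_grad(x, -a)          # Euclidean gradient of f is -a
    x = sphere_retract(x, -0.1 * g) # step on the tangent space, then retract
```

Each iteration takes a step in the tangent space and retracts it, exactly the two bullet points above; only the manifold, gradient, and retraction change for Gaussian Mixture Models.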
Riemannian Gradient. In order to characterize Riemannian gradients, we need a notion of differential of functions defined on manifolds.
The differential of f : M → R at θ is the linear operator Df(θ) : T_θM → R defined by

Df(θ)[ξ_θ] = d/dt f(c(t))|_{t=0},

where c is a smooth curve on M with c(0) = θ and ċ(0) = ξ_θ. The Riemannian gradient can be uniquely characterized by the differential of the function f and the inner product associated with the manifold: The Riemannian gradient of a smooth function f : M → R on a Riemannian manifold is a mapping grad f : M → TM such that, for all θ ∈ M, grad f(θ) is the unique tangent vector in T_θM satisfying

⟨grad f(θ), ξ_θ⟩_θ = Df(θ)[ξ_θ] for all ξ_θ ∈ T_θM.

Riemannian Hessian. We can also generalize the Hessian to its Riemannian version. To do this, we need a tool to differentiate along tangent spaces, namely the Riemannian connection (for details see (Absil et al., 2008, Section 5.3)).
The Riemannian Hessian of f : M → R at θ is the linear operator Hess f(θ) : T_θM → T_θM defined by

Hess f(θ)[ξ_θ] = ∇_{ξ_θ} grad f,

where ∇ is the Riemannian connection with respect to the Riemannian manifold.

Reformulation of the Log-likelihood
In Hosseini and Sra (2015), the authors experimentally showed that applying the concepts of Riemannian optimization directly to the objective in (2) cannot compete with Expectation Maximization. This can mainly be attributed to the fact that the maximization in the M-step of EM, i.e. the maximization of the log-likelihood of a single Gaussian, is a concave problem that is easy to solve; it even admits a closed-form solution. In contrast, when considering Riemannian optimization for (2), the maximization of the log-likelihood of a single Gaussian is not geodesically concave (concave along the shortest curve connecting two points on the manifold). The following reformulation introduced by Hosseini and Sra (2015) removes this geometric mismatch and results in a speed-up for the Riemannian algorithms.
We augment the observations x_i by introducing y_i = (x_i^T, 1)^T ∈ R^{d+1} for i = 1, . . . , m and consider the optimization problem

max_{θ=((S_1,...,S_K),(η_1,...,η_{K−1}))} L̂(θ) = Σ_{i=1}^m log Σ_{j=1}^K α_j(η) q_N(y_i; S_j),   (3)

where q_N(y; S) = (2π)^{1/2} exp(1/2) p_N(y; 0, S), the S_j are symmetric positive definite matrices of size (d+1)×(d+1), and the mixing weights are parameterized free of constraints by α_j(η) = exp(η_j) / Σ_{k=1}^K exp(η_k) with η_K := 0. This means that instead of considering Gaussians of d-dimensional variables x, we now consider Gaussians of (d+1)-dimensional variables y with zero mean and covariance matrices S_j. The reformulation leads to faster Riemannian algorithms, as it has been shown in Hosseini and Sra (2015) that the maximization of a single Gaussian is then geodesically concave.
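The effect of the lifting can be checked numerically: with S in the block form [[Σ + μμ^T, μ], [μ^T, 1]] and y = (x, 1), the lifted zero-mean density differs from the original Gaussian density only by the constant factor (2π)^{1/2} e^{1/2}. The sketch below is our own verification of this identity (helper names are hypothetical):

```python
import numpy as np

def gauss_logpdf(y, S):
    """log p_N(y; 0, S) for a single vector y."""
    n = y.shape[0]
    _, logdet = np.linalg.slogdet(S)
    return -0.5 * (n * np.log(2 * np.pi) + logdet + y @ np.linalg.solve(S, y))

# Lifted parameters: S = [[Sigma + mu mu^T, mu], [mu^T, 1]], y = (x, 1).
mu = np.array([1.0, -0.5])
Sigma = np.array([[1.5, 0.2], [0.2, 0.8]])
x = np.array([0.3, 0.7])
S = np.block([[Sigma + np.outer(mu, mu), mu[:, None]],
              [mu[None, :], np.ones((1, 1))]])
y = np.append(x, 1.0)

# log p_N(x; mu, Sigma) = log p_N(y; 0, S) + 0.5 * (log(2 pi) + 1)
lhs = gauss_logpdf(x - mu, Sigma)
rhs = gauss_logpdf(y, S) + 0.5 * (np.log(2 * np.pi) + 1.0)
```

The check uses det(S) = det(Σ) and y^T S^{-1} y = (x − μ)^T Σ^{-1} (x − μ) + 1, which follow from the block structure of S.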
The relationship of the maximizers is given by

S_j^* = [[Σ_j^* + μ_j^* (μ_j^*)^T, μ_j^*], [(μ_j^*)^T, 1]],   (5)
α_j^* = exp(η_j^*) / Σ_{k=1}^K exp(η_k^*),  with η_K^* = 0.   (6)

This means that instead of solving the original optimization problem (2), we can solve the reformulated problem (3) on its according parameter space and transform the optima back by the relationships (5) and (6).
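A small sketch of this back-transformation (our own helper names, not the paper's code): lift (μ, Σ) into the augmented covariance S, read the parameters back off its blocks, and map η to mixing weights via the softmax with η_K = 0.

```python
import numpy as np

def lift(mu, Sigma):
    """Embed (mu, Sigma) into the augmented covariance S, cf. (5)."""
    d = mu.shape[0]
    S = np.empty((d + 1, d + 1))
    S[:d, :d] = Sigma + np.outer(mu, mu)
    S[:d, d] = mu
    S[d, :d] = mu
    S[d, d] = 1.0
    return S

def recover(S):
    """Read mu and Sigma back off the blocks of S, normalizing by the corner entry."""
    d = S.shape[0] - 1
    c = S[d, d]
    mu = S[:d, d] / c
    Sigma = S[:d, :d] / c - np.outer(mu, mu)
    return mu, Sigma

def alphas_from_eta(eta):
    """Mixing weights alpha_j = exp(eta_j) / sum_k exp(eta_k), with eta_K = 0, cf. (6)."""
    e = np.exp(np.append(eta, 0.0))
    return e / e.sum()
```

The normalization by the corner entry makes `recover` applicable even to iterates whose lower-right entry is not exactly 1.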
Penalizing the objective. When applying Riemannian optimization algorithms to the reformulated problem (3), covariance singularity is a challenge. Although this is rarely observed in practice, it might result in unstable algorithms. This is due to the fact that the objective in (3) is unbounded from above, see Appendix A for details. The same problem is extensively studied for the original problem (2) and makes convergence theory hard to investigate. An alternative consists in considering the maximum a posteriori log-likelihood as the objective. If conjugate priors are used for the variables μ_j, Σ_j, the optimization problem remains structurally unchanged and results in a bounded objective (see Snoussi and Mohammad-Djafari (2002)). Adapted versions of Expectation Maximization have been proposed in the literature and are often applied in practice.
A similar approach has been proposed in Hosseini and Sra (2020), where the reformulated objective (3) is penalized by an additive term that consists of the logarithm of a Wishart prior, i.e. each of the K components is penalized with a term ψ(S_j, Ψ) (7), where Ψ is the block matrix

Ψ = [ (γ/β) Λ + κ λ λ^T   κ λ ]
    [ κ λ^T               κ   ]

for λ ∈ R^d, Λ ∈ R^{d×d} and γ, β, ρ, ν, κ ∈ R. If we assume that ρ = γ(d + ν + 1) + β, the results of Theorem 1 are still valid for the penalized version, see Hosseini and Sra (2020). Besides, the authors introduce an additive term (8) to penalize very tiny clusters by introducing Dirichlet priors for the mixing coefficients α_j. In total, the penalized problem is given by

max_{θ ∈ M} L̂_pen(θ) = L̂(θ) + Σ_{j=1}^K ψ(S_j, Ψ) + ϕ(η, ζ),   (9)

where ϕ(η, ζ) denotes the Dirichlet penalizer with parameter ζ. The use of such an additive penalizer leads to a bounded objective:

Theorem 2 The penalized optimization problem in (9) is bounded from above.
Proof. We follow the proof for the original objective (2) from Snoussi and Mohammad-Djafari (2002). We bound the penalized objective L̂_pen from above: applying Bernoulli's inequality in a first step and the positive definiteness of the S_j in a second step yields the upper bound (10), where a is a positive constant independent of S_j and S_k. By applying the relationship det(A)^{1/n} ≤ (1/n) tr(A), which holds for symmetric positive semidefinite A ∈ R^{n×n} by the inequality of arithmetic and geometric means, we obtain from the right hand side of (10) the bound (11) for a constant b > 0. The crucial part on the right side of (11) is when one of the S_k approaches a singular matrix and thus the determinant approaches zero; then we reach the boundary of the parameter space. We study this case in further detail. Without loss of generality, let k = 1 be the component where this occurs, and let S_1^* be a singular positive semidefinite matrix of rank r < d + 1. Then there exists an eigendecomposition S_1^* = U D U^T with D = diag(λ_1, . . . , λ_r, 0, . . . , 0) and U an orthogonal square matrix of size d + 1. Now consider the sequence S_1^(n) = U diag(λ_1, . . . , λ_r, ε_n, . . . , ε_n) U^T with ε_n converging to 0 as n → ∞. Inserting S_1^(n), the right side of (11) converges to 0 as n → ∞ by the rule of L'Hôpital.
With Theorem 2, we are able to study the convergence theory of the reformulated problem (3) in Section 3.

Riemannian Characteristics of the reformulated Problem
To solve the reformulated problem (3) or the penalized reformulated problem (9), we specify the Riemannian characteristics of the optimization problem. It is an optimization problem over the product manifold

M = (P_{d+1})^K × R^{K−1},   (13)

where P_{d+1} is the set of strictly positive definite matrices of dimension d + 1. The set of symmetric matrices is tangent to the set of positive definite matrices, as P_{d+1} is an open subset of it. Thus the tangent space of the manifold (13) is given by

T_θM = (S_{d+1})^K × R^{K−1},   (14)

where S_{d+1} is the set of symmetric matrices of dimension d + 1. The inner product that is commonly associated with the manifold of positive definite matrices is the intrinsic inner product

⟨ξ_S, χ_S⟩_S = tr(S^{−1} ξ_S S^{−1} χ_S),   (15)

where S ∈ P_{d+1} and ξ_S, χ_S ∈ S_{d+1}. The inner product defined on the tangent space (14) is the sum over all component-wise inner products and reads

⟨ξ_θ, χ_θ⟩_θ = Σ_{j=1}^K tr(S_j^{−1} ξ_{S_j} S_j^{−1} χ_{S_j}) + Σ_{r=1}^{K−1} ξ_{η_r} χ_{η_r}.   (16)

The retraction used is the exponential map on the manifold, given componentwise by

R_S(ξ_S) = S^{1/2} exp(S^{−1/2} ξ_S S^{−1/2}) S^{1/2},   (17)

see Jeuris et al. (2012).
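The intrinsic inner product (15) and the exponential-map retraction (17) translate directly into code. A minimal NumPy sketch (our own helper names; matrix functions are computed via eigendecompositions):

```python
import numpy as np

def _sym_fun(A, fun):
    """Apply a scalar function to a symmetric matrix via its eigendecomposition."""
    w, V = np.linalg.eigh(A)
    return (V * fun(w)) @ V.T

def spd_inner(S, xi, chi):
    """Intrinsic inner product <xi, chi>_S = tr(S^{-1} xi S^{-1} chi), cf. (15)."""
    Sinv = np.linalg.inv(S)
    return float(np.trace(Sinv @ xi @ Sinv @ chi))

def spd_retract(S, xi):
    """Exponential map R_S(xi) = S^{1/2} exp(S^{-1/2} xi S^{-1/2}) S^{1/2}, cf. (17)."""
    Sh = _sym_fun(S, np.sqrt)
    Shi = _sym_fun(S, lambda w: 1.0 / np.sqrt(w))
    inner = Shi @ xi @ Shi
    inner = 0.5 * (inner + inner.T)      # symmetrize against round-off
    return Sh @ _sym_fun(inner, np.exp) @ Sh
```

By construction, `spd_retract` returns a symmetric positive definite matrix for every symmetric tangent vector, so the iterates never leave P_{d+1}.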
Riemannian Gradient and Hessian. We now specify the Riemannian Gradient and the Riemannian Hessian in order to apply second-order methods on the manifold. The Riemannian Hessian in Theorem 4 is novel for the problem of fitting Gaussian Mixture Models and provides a way of making second-order methods applicable.
Theorem 3 The Riemannian gradient grad L̂(θ) = ((grad L̂(θ))_{S_1}, . . . , (grad L̂(θ))_{S_K}, (grad L̂(θ))_η) of the reformulated problem reads

(grad L̂(θ))_{S_l} = (1/2) Σ_{i=1}^m f_{il} (y_i y_i^T − S_l),   l = 1, . . . , K,
(grad L̂(θ))_{η_r} = Σ_{i=1}^m (f_{ir} − α_r),   r = 1, . . . , K − 1,

where f_{il} = α_l q_N(y_i; S_l) / Σ_{j=1}^K α_j q_N(y_i; S_j).
The additive terms for the penalizers in (7), (8) are obtained analogously. Proof. The Riemannian gradient of a product manifold is the Cartesian product of the individual expressions (Absil et al., 2008). We compute the Riemannian gradients with respect to S_1, . . . , S_K and η.
The gradient with respect to η is the classical Euclidean gradient, which we obtain by using the chain rule. For the derivative of the penalizer with respect to η_r, we get (grad Pen(θ))_{η_r} = (grad_e Pen(θ))_{η_r}. The Riemannian gradient with respect to the matrices S_1, . . . , S_K is the projection of the Euclidean gradient onto the subspace T_{S_j}P_{d+1} (with inner product (15)), see Absil et al. (2008), Boumal (2020). The relationship between the Euclidean gradient grad_e f and the Riemannian gradient grad f for an arbitrary function f : P_n → R with respect to the intrinsic inner product (15) on the set of positive definite matrices reads

grad f(S) = S grad_e f(S) S,   (18)

see for example Jeuris et al. (2012). In a first step, we thus compute the Euclidean gradient with respect to a matrix S_l in (19), where we use the Leibniz rule, the chain rule and the fact that S_l^{−1} is symmetric. Combining (18) and (19) yields the Riemannian gradient with respect to S_l. Analogously, we compute the Euclidean gradient of the matrix penalizer ψ(S_j, Ψ) and use the relationship (18) to get the Riemannian gradient of the matrix penalizer.
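The conversion rule (18) can be sanity-checked numerically. The sketch below (our own check, not from the paper) uses f(S) = log det(S), whose Euclidean gradient is S^{-1}; the Riemannian gradient is then S itself, and the directional derivative must equal the intrinsic inner product of the Riemannian gradient with the direction:

```python
import numpy as np

# f(S) = log det(S): Euclidean gradient S^{-1}, hence via (18) the
# Riemannian gradient is S S^{-1} S = S.
# Check D f(S)[xi] = <grad f(S), xi>_S numerically.
S = np.array([[2.0, 0.3], [0.3, 1.0]])
xi = np.array([[0.5, -0.1], [-0.1, 0.2]])        # symmetric direction
Sinv = np.linalg.inv(S)

rgrad = S @ Sinv @ S                             # conversion rule (18)
inner = np.trace(Sinv @ rgrad @ Sinv @ xi)       # intrinsic inner product (15)

eps = 1e-6                                       # central finite difference
fd = (np.linalg.slogdet(S + eps * xi)[1]
      - np.linalg.slogdet(S - eps * xi)[1]) / (2 * eps)
```

The same finite-difference check can be applied componentwise to the gradient formulas of Theorem 3.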
To apply Newton-like algorithms, we derived a formula for the Riemannian Hessian of the reformulated problem. It is stated in Theorem 4: Theorem 4 Let θ ∈ M and ξ_θ ∈ T_θM. The Riemannian Hessian Hess L̂(θ)[ξ_θ] is given componentwise by (20) and (21) for l = 1, . . . , K and r = 1, . . . , K − 1, and the Hessian for the additive penalizer Hess Pen(θ)[ξ_θ] reads accordingly. Proof. It can be shown that for a product manifold M = M_1 × M_2 with ∇^1, ∇^2 being the Riemannian connections of M_1 and M_2, respectively, the Riemannian connection ∇ of M for vector fields X = (X_1, X_2) and Y = (Y_1, Y_2) is given by ∇_X Y = (∇^1_{X_1} Y_1, ∇^2_{X_2} Y_2), see do Carmo (1992). Applying this to our problem, we apply the Riemannian connections of the single components to the Riemannian gradient derived in Theorem 3 and obtain (22) for l = 1, . . . , K and r = 1, . . . , K − 1.
We will now specify the single components of (22). Let (grad L̂(θ))_{S_l} and (grad L̂(θ))_{η_r} denote the Riemannian gradient from Theorem 3 at position S_l and η_r, respectively. For the latter part in (22), we observe that the Riemannian connection on the Euclidean component is the classical vector field differentiation (Absil et al., 2008, Section 5.3). We obtain (23), where D_{S_j}(·)[ξ_{S_j}] and D_{η_r}(·)[ξ_{η_r}] denote the classical Fréchet derivatives with respect to S_j and η_r along the directions ξ_{S_j} and ξ_{η_r}, respectively. The first part on the right hand side of (23) follows directly, and the second part follows by applying the chain rule, the Leibniz rule and the relationship α_l = exp(η_l) / Σ_{k=1}^K exp(η_k) with η_K = 0.
For the Hessian with respect to the matrices S_l, we first need to specify the Riemannian connection with respect to the inner product (15). It is uniquely determined as the solution to the Koszul formula (Absil et al., 2008, Section 5.3), hence we need to find an affine connection that satisfies this formula. For a positive definite matrix S and vector fields ξ_S, ζ_S on P_{d+1}, this solution is given by

∇_{ζ_S} ξ_S = D(ξ_S)[ζ_S] − (1/2) (ζ_S S^{−1} ξ_S + ξ_S S^{−1} ζ_S),

see Jeuris et al. (2012), where D(ξ_S)[ζ_S] denotes the classical Fréchet derivative of ξ_S along the direction ζ_S. Hence, for the first part in (22), we get (24). After applying the chain rule and the Leibniz rule, we obtain (25) and (26). We plug (25), (26) into (24) and use the Riemannian gradient at position S_l for the last term in (24). After some rearrangement of terms, we obtain the expression in (21).

Riemannian Newton Trust-Region Algorithm
Equipped with the Riemannian gradient and the Riemannian Hessian, we are now in the position to apply Newton-type algorithms to our optimization problem. As studying definiteness of the Riemannian Hessian from Theorem 4 is hard, we suggest a safeguarding strategy for the Newton method by applying a Riemannian Newton Trust-Region method.

Riemannian Newton Trust-Region Method
The Riemannian Newton Trust-Region Algorithm is the retraction-based generalization of the standard Trust-Region method (Conn et al., 2000) on manifolds, where the quadratic subproblem uses the Hessian information for an objective function f that we seek to minimize. Theory on the Riemannian Newton Trust-Region method can be found in detail in Absil et al. (2008), we here state the Riemannian Newton Trust-Region method in Algorithm 1. Furthermore, we will study both global and local convergence theory for our penalized problem.
Algorithm 1: Riemannian Trust-Region algorithm (Absil et al., 2008)
Input: objective f with Hessian H, initial iterate θ^0 ∈ M, initial TR-radius Δ_0, maximal TR-radius Δ̄, rejection threshold ρ' ∈ [0, 1/4), acceptance parameters 0 ≤ ω_1 ≤ ω_2 ≤ 1, τ_1 ≤ 1/4, τ_2 > 1, termination criteria
1 Set t = 0;
2 while termination criteria not fulfilled do
3   Obtain s_t by (approximately) solving the TR-subproblem
      min_{s ∈ T_{θ^t}M, ‖s‖ ≤ Δ_t}  m_{θ^t}(s) = f(θ^t) + ⟨grad f(θ^t), s⟩_{θ^t} + (1/2) ⟨H[s], s⟩_{θ^t};
4   Evaluate ρ_t = (f(θ^t) − f(R_{θ^t}(s_t))) / (m_{θ^t}(0) − m_{θ^t}(s_t));
5   Update the TR-radius: set Δ_{t+1} = τ_1 Δ_t if ρ_t < ω_1, set Δ_{t+1} = min(τ_2 Δ_t, Δ̄) if ρ_t > ω_2 and ‖s_t‖ = Δ_t, and keep Δ_{t+1} = Δ_t otherwise;
6   Accept the step, θ^{t+1} = R_{θ^t}(s_t), if ρ_t > ρ'; otherwise set θ^{t+1} = θ^t;
7   Set t = t + 1;
8 end
Global convergence. In the following theorem, we show that the Riemannian Newton Trust-Region algorithm applied to the reformulated penalized problem converges to a stationary point under suitable conditions: we assume that the mixing proportion of each cluster is above a certain threshold in each iteration and that the covariance matrices S_j do not get arbitrarily large in each iteration.
Proof. According to general global convergence results for Riemannian manifolds, convergence to a stationary point is given if the iterates remain bounded, see (Absil et al., 2008, Proposition 7.4.5) and the proofs in (Absil et al., 2008, Section 7.4). Since the rejection threshold in Algorithm 1 is nonnegative, we get L̂_pen(θ^{t+1}) ≥ L̂_pen(θ^t) for all iterations t = 0, 1, 2, . . . . We show that the iterates S_j^t remain in the interior of P_{d+1}. For this, assume there exists a subsequence θ^{t_i} with λ_min(S_j^{t_i}) → 0 as t_i → ∞. By the proof of Theorem 2, this implies that L̂_pen(θ^{t_i}) tends to −∞, a contradiction to the monotonicity of the iterates. Thus, there exists a lower bound C_l > 0 such that λ_min(S_j^t) ≥ C_l for all t and j. For the upper bound, we consider the sets of successful and unsuccessful steps generated by the algorithm until iteration t. Let ξ^t be the tangent vector returned by solving the quadratic subproblem in line 3 of Algorithm 1 and θ^{t+1} = R_{θ^t}(ξ^t) be the retraction of ξ^t at iteration t, see (17). Due to the boundedness of the quadratic subproblem in Algorithm 1, there exists Δ̄ > 0 such that ‖ξ^t_{S_j}‖ ≤ Δ̄ for all j = 1, . . . , K. From this and assumption (28), boundedness from above follows directly.
We now show boundedness of the η-component. By the inequality (30), we have (31), and due to (31) there exists a constant C > 0 such that (32) holds. We will study ϕ(η^t, ζ) for η^t at the boundary in the following. We distinguish between the following two cases: 1. Assume there exists j ∈ {1, . . . , K − 1} such that η_j^t → ∞.
2. Assume there does not exist a j ∈ {1, . . . , K − 1} such that η_j^t → ∞. We first study the first case. For this, we distinguish two subcases: 1a) Only some of the η_j^t approach ∞. Without loss of generality, assume that η_j^t → ∞ for j ∈ {1, . . . , l} with l < K − 1, and that η_j^t ≤ c for all j > l for a constant c > 0. Recall that α_j = exp(η_j) / Σ_{k=1}^K exp(η_k) for all j = 1, . . . , K with η_K = 0. By assumption (27), the α_j^t are bounded away from zero. With the rule of L'Hôpital, we then obtain a contradiction to Σ_{j=1}^K α_j = 1.
1b) Now assume that η_j^t → ∞ with the same speed for all j = 1, . . . , K − 1, otherwise we are in case 1a). Without loss of generality, we set n = η_j^t and let n → ∞. Using η_K = 0, the penalization term ϕ(η^t, ζ) then tends to −∞, which is a contradiction to (32).
Thus, the iterates {θ^t} remain in a compact set, which yields convergence to a stationary point.
We showed that Algorithm 1 applied on the problem of fitting Gaussian Mixture Models converges to a stationary point for all starting values. From general convergence theory for Riemannian Trust-Region algorithms (Absil et al., 2008, Section 7.4.2), under some assumptions, the convergence speed of Algorithm 1 is superlinear. In the following, we show that these assumptions are fulfilled and specify the convergence rate.
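For illustration, the accept/reject and radius-update mechanism of Algorithm 1 can be sketched as follows. To stay short, this toy version (our own code, not the paper's implementation) solves the subproblem with a Cauchy-point step instead of the truncated CG used in our method:

```python
import numpy as np

def trust_region(f, grad, hess, retract, theta0, Delta0=1.0, Delta_max=10.0,
                 rho_prime=0.1, max_iter=100, tol=1e-6):
    """Toy retraction-based trust-region loop in the spirit of Algorithm 1.
    The subproblem is solved by a Cauchy-point step (not truncated CG)."""
    theta, Delta = theta0, Delta0
    for _ in range(max_iter):
        g = grad(theta)
        gnorm = np.linalg.norm(g)
        if gnorm < tol:
            break
        H = hess(theta)
        # Cauchy point: minimize the quadratic model along -g inside the radius
        curv = g @ H @ g
        tau = min(gnorm**3 / (Delta * curv), 1.0) if curv > 0 else 1.0
        s = -tau * (Delta / gnorm) * g
        model_decrease = -(g @ s + 0.5 * s @ H @ s)
        rho = (f(theta) - f(retract(theta, s))) / model_decrease
        if rho > rho_prime:                    # accept the step
            theta = retract(theta, s)
        if rho < 0.25:                         # shrink the radius
            Delta *= 0.25
        elif rho > 0.75 and tau == 1.0:        # grow on full boundary steps
            Delta = min(2.0 * Delta, Delta_max)
    return theta
```

With `retract(theta, s) = theta + s` this reduces to the classical Euclidean trust-region method; on the GMM manifold, the retraction (17) and the Riemannian gradient and Hessian take the corresponding roles.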

Local convergence. Local convergence results for (Riemannian) Trust-Region methods depend on the solver for the quadratic subproblem. If the quadratic subproblem in Algorithm 1 is (approximately) solved with sufficient decrease, the local convergence close to a maximizer of L̂_pen is superlinear under mild assumptions. The truncated Conjugate Gradient method is a typical choice that yields sufficient decrease; we suggest using this matrix-free method for Gaussian Mixture Models. We state the algorithm in Appendix B. The local superlinear convergence is stated in Theorem 6.
Theorem 6 (Local convergence) Consider Algorithm 1, where the quadratic subproblem is solved by the truncated Conjugate Gradient method and f = −L̂_pen. Let v ∈ M be a nondegenerate local minimizer of f, H the Hessian of the reformulated penalized problem from Theorem 4 and δ < 1 the parameter of the termination criterion of tCG (Appendix B, line 18 in Algorithm 2). Then there exists c > 1 such that, for all sequences {θ^t} generated by Algorithm 1 converging to v, there exists T > 0 such that for all t > T,

‖grad L̂_pen(θ^{t+1})‖ ≤ c ‖grad L̂_pen(θ^t)‖^{δ+1}.   (33)

Proof. Let v ∈ M be a nondegenerate local minimizer of f (i.e. a maximizer of L̂_pen). We choose the termination criterion of tCG such that δ < 1. According to (Absil et al., 2008, Theorem 7.4.12), it suffices to show that Hess L̂_pen(R_θ(ξ)) is Lipschitz continuous at 0_θ in a neighborhood of v. From the proof of Theorem 2, we know that close to v we are bounded away from the boundary of M, i.e. bounded away from points on the manifold with singular S_j, j = 1, . . . , K. The extreme value theorem yields the local Lipschitz continuity, such that all requirements of (Absil et al., 2008, Theorem 7.4.12) are fulfilled and the statement of Theorem 6 holds true.
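A sketch of a Steihaug-type truncated CG for the subproblem follows (our own code; the details of Algorithm 2 in Appendix B may differ). It works matrix-free with Hessian-vector products and stops on negative curvature, on hitting the trust-region boundary, or when the residual satisfies the superlinear termination rule with parameter δ:

```python
import numpy as np

def _to_boundary(s, p, Delta):
    """Positive root tau of ||s + tau p|| = Delta."""
    a, b, c = p @ p, 2.0 * (s @ p), s @ s - Delta**2
    return (-b + np.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)

def truncated_cg(g, Hv, Delta, kappa=0.1, delta=1.0, max_iter=100):
    """Steihaug-type truncated CG for min_s <g, s> + 0.5 <H s, s>, ||s|| <= Delta.
    Hv(p) is a Hessian-vector product, so the method is matrix-free."""
    s = np.zeros_like(g)
    r, p = g.copy(), -g.copy()
    r0 = np.linalg.norm(g)
    for _ in range(max_iter):
        Hp = Hv(p)
        curv = p @ Hp
        if curv <= 0.0:                              # negative curvature: go to boundary
            return s + _to_boundary(s, p, Delta) * p
        alpha = (r @ r) / curv
        if np.linalg.norm(s + alpha * p) >= Delta:   # step would leave the region
            return s + _to_boundary(s, p, Delta) * p
        s = s + alpha * p
        r_new = r + alpha * Hp
        # termination rule ||r|| <= ||r0|| min(kappa, ||r0||^delta)
        if np.linalg.norm(r_new) <= r0 * min(kappa, r0**delta):
            return s
        beta = (r_new @ r_new) / (r @ r)
        p = -r_new + beta * p
        r = r_new
    return s
```

On the GMM manifold, the Euclidean dot products are replaced by the inner product (16), and `Hv` is the Hessian operator from Theorem 4 applied to a tangent vector.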

Practical Design choices
We apply Algorithm 1 to our cost function (9), where we seek to minimize f = −L̂_pen. The quadratic subproblem in line 3 of Algorithm 1 is solved by the truncated Conjugate Gradient method (tCG) with the inner product (16). To speed up convergence of the tCG, we further use a preconditioner: at iteration t, we store the gradients computed in tCG and build an inverse Hessian approximation via the LBFGS formula. This inverse Hessian approximation is then used for the minimization of the next subproblem m_{θ^{t+1}}. The use of such preconditioners has been suggested by Morales and Nocedal (2000) for solving a sequence of slowly varying systems of linear equations, and it gave a speed-up in convergence for our method. The initial TR-radius is set by the method suggested by Sartenaer (1997), which is based on the model trust along the steepest-descent direction. The parameters ω_1, ω_2, τ_1, τ_2, ρ' in Algorithm 1 are chosen according to the suggestions in Gould et al. (2005) and Conn et al. (2000).

Numerical Results
We test our method on the penalized objective function (9) on both simulated and real-world data sets. We compare our method against the (penalized) Expectation Maximization algorithm and the Riemannian LBFGS method proposed by Hosseini and Sra (2015, 2020). For all methods, we used the same initialization by running k-means++ (Arthur and Vassilvitskii, 2007), and we stopped all methods when the difference of the average log-likelihood for two subsequent iterates falls below 1e-10 or when the number of iterations exceeds 1500 (clustering) or 3000 (density approximation). For the Riemannian LBFGS method, we used the MixEst package (Hosseini and Mash'al, 2015) kindly provided by one of the authors. For the Riemannian Trust-Region method, we mainly followed the code provided by the pymanopt Python package (Townsend et al., 2016), but adapted it for a faster implementation by computing the matrix inverse S_j^{−1} only once per iteration. We used Python version 3.7. The experiments were conducted on an Intel Xeon CPU X5650 at 2.67 GHz with 24 cores and 20 GB RAM.
In Subsection 4.1, we test our method on clustering problems on simulated data and on real-world datasets from the UCI Machine Learning Repository (Dua and Graff, 2017). In Subsection 4.2 we consider Gaussian Mixture Models as probability density estimators and show the applicability of our method.

Clustering
We test our method for clustering tasks on different artificial (Subsection 4.1.1) and real-world (Subsection 4.1.2) data sets.

Simulated Data
As the convergence speed of EM depends on the level of separation of the data, we test our methods on data sets with different degrees of separation, as proposed in Dasgupta (1999). The distributions are sampled such that their means satisfy

‖μ_i − μ_j‖ ≥ c · sqrt(d · max(λ_max(Σ_i), λ_max(Σ_j)))

for i, j = 1, . . . , K, i ≠ j, where c models the degree of separation. Additionally, a low eccentricity (or condition number) of the covariance matrices has an impact on the performance of Expectation Maximization (Dasgupta, 1999), for which reason we also consider different values of the eccentricity e = λ_max(Σ_j)/λ_min(Σ_j). The eccentricity is a measure of how much the data scatters.
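A minimal sketch of such a sampler (our own implementation, not the paper's simulation protocol): with unit covariance matrices, the separation condition reduces to ‖μ_i − μ_j‖ ≥ c√d, which we enforce by rejection sampling of the means.

```python
import numpy as np

def sample_separated_gmm(m, d, K, c, rng):
    """Sample K unit-covariance Gaussian clusters whose means are pairwise
    c-separated; with Sigma_j = I the condition is ||mu_i - mu_j|| >= c sqrt(d)."""
    sep = c * np.sqrt(d)
    means = [rng.normal(size=d) * sep]
    while len(means) < K:                      # rejection sampling of the means
        cand = rng.normal(size=d) * sep
        if all(np.linalg.norm(cand - mu) >= sep for mu in means):
            means.append(cand)
    means = np.array(means)
    labels = rng.integers(K, size=m)           # balanced weights alpha_j = 1/K
    X = rng.normal(size=(m, d)) + means[labels]
    return X, means, labels
```

Varying the covariance eigenvalues instead of using the identity allows controlling the eccentricity e in the same setup.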
We test our method on 20- and 40-dimensional data and an equal distribution among the clusters, i.e. we set α_j = 1/K for all j = 1, . . . , K. Although it is known that unbalanced mixing coefficients α_j also result in slower EM convergence, this effect is less strong than the level of overlap (Naim and Gildea, 2012), so we only show simulation results with balanced clusters. Results for unbalanced mixing coefficients and a varying number of components K are shown for real-world data in Subsection 4.1.2.
We first take a look at the 20-dimensional data sets, for which we simulated m = 1000 data points for each parameter setting. In Table 4.1, we show the results for very scattered data, that is, e = 1. We see that, as predicted by the literature, Expectation Maximization converges slowly in such a case. This effect is even stronger for a lower separation constant c. The effect of the eccentricity becomes even clearer when comparing the results of Table 4.1 with Table 4.2. The Riemannian algorithms also converge more slowly for lower values of the eccentricity e and separation level c; however, they seem to suffer less from hidden information than Expectation Maximization. The proposed Riemannian Newton Trust-Region algorithm (R-NTR) beats the other methods in terms of runtime and number of iterations (see Figure 4.1a). The Riemannian LBFGS (R-LBFGS) method by Hosseini and Sra also shows faster convergence than EM, but the gain from the exact second-order information provided by the Riemannian Hessian is evident. However, the R-LBFGS results created by the MixEst toolbox show long runtimes compared to the other methods. We see from Figure 4.1b that the average penalized log-likelihood is slightly higher for R-LBFGS in some experiments. Still, the objective evaluated at the point satisfying the termination criterion is at a competitive level for all methods (see also Table 4.1). When increasing the eccentricity (Table 4.2), we see that the Riemannian methods still converge in fewer iterations than EM, but our method is not faster in runtime. This is because EM benefits from very low per-iteration costs and the gain in the number of iterations is less pronounced in this case. However, the Riemannian Newton Trust-Region method is not substantially slower. Furthermore, the average log-likelihood values (ALL) are roughly equal for all methods, so we might assume that all methods stopped close to a similar optimum.
This is also underlined by comparable mean squared errors (MSE) with respect to the true parameters from which the input data has been sampled. On average, Riemannian Newton Trust-Region gives the best results in terms of runtime and number of iterations. In Table 4.3, we show results for dimension d = 40 and low eccentricity (e = 1) with the same simulation protocol as above (in particular, m = 1000). We observed that our method only performed very few Newton-like steps and instead exceeded the trust region within the tCG many times, leading to poorer steps (see also Figure 4.2b). One possible reason is that the number of parameters increases quadratically with d, that is, O(Kd^2), while at the same time we did not increase the number of observations m = 1000. If we are too far from a local optimum and the clusters are not well initialized due to few observations, the factor f_{il} in the Hessian (Theorem 4) becomes small, leading to potentially large conjugate gradient steps (see Algorithm 2). Although this affects the E-step of the Expectation Maximization algorithm as well, the effect seems to be much more severe in our method. To underline this, we show simulation results for a higher number of observations, m = 10,000, in Table 4.4 with the same true parameters α_j, μ_j, Σ_j as in Table 4.3. As expected, the superiority of our method in runtime becomes visible: the R-NTR method beats Expectation Maximization by a factor of 4. Just like for the lower dimension d = 20, the mean average log-likelihood and the errors are comparable between our method and EM, whereas R-LBFGS shows slightly worse results, although it now attains runtimes comparable to our method.
We thus see that the ratio between the number of observations and the number of parameters must be large enough in order to benefit from the Hessian information in our method.
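The O(Kd^2) growth of the parameter count discussed above is easy to make explicit. The following sketch counts the free parameters of a full-covariance GMM (the choice K = 5 is ours, purely for illustration):

```python
def gmm_param_count(K, d):
    """Number of free parameters of a K-component, d-dimensional GMM
    with full covariance matrices: the mixing weights contribute K - 1
    (they sum to 1), the means K*d, and each symmetric covariance
    matrix d*(d+1)/2 entries."""
    return (K - 1) + K * d + K * d * (d + 1) // 2

# Doubling d roughly quadruples the parameter count while the number of
# observations m = 1000 stays fixed in the experiments above.
for d in (20, 40):
    n_params = gmm_param_count(5, d)
    print(f"d = {d}: {n_params} parameters, m/params = {1000 / n_params:.2f}")
```

For d = 20 the 1000 observations still roughly match the parameter count, while for d = 40 they fall well below it, consistent with the degraded steps observed in Table 4.3.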

Real-World Data
We tested our method on real-world data sets from the UCI Machine Learning Repository (Dua and Graff, 2017) in addition to the simulated data sets. For this, we normalized the data sets and tested the methods for different values of K.
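The text does not state which normalization scheme was used; a common choice is column-wise z-scoring, sketched below under that assumption (function name and scheme are ours):

```python
def zscore_normalize(rows):
    """Column-wise z-score normalization (zero mean, unit variance per
    feature). This is one standard preprocessing choice; the paper does
    not specify its exact scheme, so this is an assumption."""
    n = len(rows)
    d = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    stds = [(sum((r[j] - means[j]) ** 2 for r in rows) / n) ** 0.5
            for j in range(d)]
    # Constant columns (std = 0) are mapped to 0 to avoid division by zero.
    return [[(r[j] - means[j]) / stds[j] if stds[j] > 0 else 0.0
             for j in range(d)] for r in rows]

normalized = zscore_normalize([[1.0, 10.0], [3.0, 20.0], [5.0, 30.0]])
```

Normalization puts all features on a comparable scale, which matters both for the clustering geometry and for the conditioning of the optimization.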
Combined Cycle Power Plant Data Set (Kaya and Tufekci, 2012).
In Table 4.5, we show the results for the combined cycle power plant data set, which has 4 features. Although the dimension is quite low, we can beat EM both in terms of runtime and number of iterations for almost all K by applying the Riemannian Newton Trust-Region method. This underlines the results shown for artificial data in Subsection 4.1.1. The gain from our method becomes even stronger for a large number of components K, where the overlap between clusters is large: our method reaches a local optimum in up to 15 times fewer iterations, with a runtime saving by a factor of close to 4.

MAGIC Gamma Telescope Data Set (Bock et al., 2004).
We also study the behaviour on a data set with higher dimension and a larger number of observations, the MAGIC Gamma Telescope data set, see Table 4.6. Here, we also observe a lower number of iterations for the Riemannian optimization methods. Similar to the combined cycle power plant data set, this effect becomes even stronger for a high number of clusters, where the share of hidden information is large. Our method shows by far the best runtimes. For this data set, the average log-likelihood values are very close to each other, except for K = 15, where the ALL is worse for the Riemannian methods. It seems that in this case, the R-NTR and R-LBFGS methods end in different local maxima than EM. However, convergence to global maxima is theoretically not ensured for any of the methods, and for all of them a globalization strategy like a split-and-merge approach (Li and Li, 2009) might improve the final ALL values. As the MAGIC Gamma Telescope data set is a classification data set with 2 classes, we further report the classification performance in Table 4.7a. We see that the geodesic distance defined on the manifold and the weighted mean squared errors (wMSE) are comparable between all three methods.
In Table 4.7b, we also report the Adjusted Rand Index (Hubert and Arabie, 1985) for all methods. Although the clustering performance is very low compared to the true class labels (first row), it is equal among the three methods.
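The Adjusted Rand Index reported above can be computed directly from the contingency table of two labelings. A self-contained sketch of the Hubert-Arabie formula (our own implementation, not the one used for the tables):

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index (Hubert and Arabie, 1985).
    1.0 means the two partitions are identical (up to relabeling);
    values near 0 indicate agreement at chance level."""
    n = len(labels_a)
    contingency = Counter(zip(labels_a, labels_b))  # n_ij counts
    a = Counter(labels_a)  # row sums a_i
    b = Counter(labels_b)  # column sums b_j
    index = sum(comb(c, 2) for c in contingency.values())
    sum_a = sum(comb(c, 2) for c in a.values())
    sum_b = sum(comb(c, 2) for c in b.values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    return (index - expected) / (max_index - expected)
```

Since the index is invariant under relabeling, `adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])` evaluates to 1.0, which is why cluster labels need not be matched to class labels before comparison.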
We show results on additional real-world data sets in Appendix C.

Gaussian Mixture Models as Density Approximators
Besides the task of clustering (multivariate) data, Gaussian Mixture Models with enough components are also well-known to serve as approximators of smooth probability density functions (Scott, 1992).
In this subsection, we present the applicability of our method to Gaussian mixture density approximation of a Beta-Gamma distribution and compare against EM and R-LBFGS for the estimation of the Gaussian components.
We consider a bivariate Beta-Gamma distribution with parameters α_Beta = 0.5, β_Beta = 0.5, a_Gamma = 1, β_Gamma = 1, where the joint distribution is characterized by a Gaussian copula. The density surface is visualized in Figure 4.3.
We simulated 1000 realizations of the Beta(0.5, 0.5)-Gamma(1, 1) distribution and fitted a Gaussian Mixture Model over 100 simulation runs. We considered different values of K and compared the (approximated) root mean integrated squared error (RMISE). The RMISE and its computational approximation are given by (Gross et al., 2015)

\mathrm{RMISE}(f, \hat f) = \sqrt{\mathbb{E}\left[\int \big(f(x) - \hat f(x)\big)^2 \, dx\right]} \approx \sqrt{\sum_{n=1}^{N} \big(f(x_n) - \hat f(x_n)\big)^2 \, \delta_g^2},

where f denotes the underlying true density, i.e. Beta(0.5, 0.5)-Gamma(1, 1), \hat f the density approximator (GMM), and N is the number of equidistant grid points with associated grid width δ_g. For our simulation study, we chose N = 16384 grid points in the box [0, 5] × [0, 10].

We show the results in Table 4.8, where we fit the parameters of the GMM with our method (R-NTR) and compare against GMM approximations fitted with EM and R-LBFGS. We observe that the RMISE is of comparable size for all methods and even slightly better for our method for K = 2, K = 5 and K = 10. Just as for the clustering results in Subsection 4.1, we obtain much lower runtimes and a much lower total number of iterations with R-NTR. This is a remarkable improvement, especially for a larger number of components.

We also observe that for all methods, the mean average log-likelihood (ALL) of the training data sets with 1000 observations attains higher values with an increasing number of components K. This supports the fact that the approximation power of GMMs for arbitrary density functions is expected to increase if we add additional Gaussian components (Scott, 1992, Goodfellow et al., 2016). On the other hand, the RMISE (which is not based on the training data) increased in our experiments with larger K, which means that we are in a situation of overfitting. The drawback of overfitting is well-known for EM (Andrews, 2018), and we also observed it for the R-NTR and R-LBFGS methods. However, the RMISE values are comparable, so none of the methods substantially outperforms another in terms of overfitting.
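The grid approximation of the RMISE can be sketched as follows; the midpoint-rule grid layout and all names are our own choices, not taken from the paper's implementation:

```python
import math

def rmise_grid(f, f_hat, x_range, y_range, n_per_axis):
    """Grid approximation of the root (mean) integrated squared error
    between a true bivariate density f and an approximation f_hat on a
    box. Uses a midpoint rule on an n_per_axis x n_per_axis grid; the
    expectation over simulation runs would be taken by averaging the
    squared errors over repeated fits."""
    (x0, x1), (y0, y1) = x_range, y_range
    dx = (x1 - x0) / n_per_axis
    dy = (y1 - y0) / n_per_axis
    total = 0.0
    for i in range(n_per_axis):
        for j in range(n_per_axis):
            x = x0 + (i + 0.5) * dx  # cell midpoint
            y = y0 + (j + 0.5) * dy
            total += (f(x, y) - f_hat(x, y)) ** 2 * dx * dy
    return math.sqrt(total)

# Sanity check on the box [0, 5] x [0, 10] used in the study: the error
# of a density against itself is exactly zero.
uniform = lambda x, y: 0.02  # uniform density on a box of area 50
err = rmise_grid(uniform, lambda x, y: 0.0, (0, 5), (0, 10), 32)
```

Note that the RMISE is evaluated against the true density on the grid, not against the training sample, which is what makes it sensitive to the overfitting discussed above.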
This can also be seen from Figure 4.4, which shows the distribution of the pointwise errors for K = 2 and K = 5. Although the R-LBFGS method shows higher error values on the boundary of the support of the distribution for K = 5, the errors are similarly distributed among the three methods at a comparable level. To address the problem of overfitting, we suggest methodologies such as cross-validation (Murphy, 2013) or applying a split-and-merge approach to the optimized parameters (Li and Li, 2009).
The results show that our method is well-suited for density estimation tasks and especially for clustering of real-world and simulated data.

Conclusion
We proposed a Riemannian Newton Trust-Region method for Gaussian Mixture Models. For this, we derived an explicit formula for the Riemannian Hessian and gave results on convergence theory. Our method is a fast alternative to the well-known Expectation Maximization algorithm and to existing Riemannian approaches for Gaussian Mixture Models. Especially for highly overlapping components, the numerical results show that our method leads to an enormous speed-up, both in terms of runtime and total number of iterations, for medium-sized problems. This makes it a favorable method for density approximation as well as for difficult clustering tasks on data of higher dimensions. Here, especially the availability of the Riemannian Hessian increases the convergence speed compared to quasi-Newton algorithms. For higher-dimensional data, we experimentally observed that our method still works very well and is faster than EM if the number of observations is large enough. We plan to examine this further and to look into data sets of even higher dimensions. There, it is common to impose a special structure on the covariance matrices to tackle the curse of dimensionality (McLachlan et al., 2019). Adapted versions of Expectation Maximization exist, and the transfer of Riemannian methods to such constrained settings is the subject of current research.

A Unboundedness of the Reformulated Problem
Theorem 7 The reformulated objective (3) without the penalization term is unbounded from above.
We consider the simplified case where K = 1 and investigate the log-likelihood in the variable θ^(n) = S_1^(n), since α_1 = 1. Similar to the proof of Theorem 2, we introduce a singular positive semidefinite matrix S_s of rank r < d + 1 and a sequence of positive definite matrices S_1^(n) converging to S_s for n → ∞. The reformulated objective at θ^(n) reads … With the decomposition (S_1^(n))^{-1} = U^T D_n U as in (12), we see that … Now assume that one of the y_i lies in the kernel of the matrix Ũ_s, where …

C Additional Numerical Results
We present numerical results additional to Section 4 both for simulated clustering data and for real-world data.
C.1 Simulated data

Wine Quality Data Set (Cortez et al., 2009).
The wine quality data set also provides classification labels: we can distinguish between white and red wine, or between 7 quality labels. Besides the clustering performance of the methods (Table C.4), we also show the goodness of fit of our method for K = 2 and K = 7.

Table C.5: Model quality for the (normalized) wine data set for K = 2 (red/white wine).
(a) Weighted mean squared errors for the (normalized) wine data set for K = 2, weighted by the mixing coefficients of the respective method.