1 Introduction

Gaussian Mixture Models are widely recognized in Data Science and Statistics. The fact that any probability density can be approximated by a Gaussian Mixture Model with a sufficient number of components makes it an attractive tool in statistics. However, this comes with computational limitations, some of which are described in Ormoneit and Tresp (1995), Lee and McLachlan (2013) and Coretto (2021). Nevertheless, we here focus on the benefits of Gaussian Mixture Models. Besides density approximation, the possibility of modeling latent features by the underlying components also makes it a strong tool for soft clustering tasks. Typical applications are found in image analysis (Alfò et al. 2008; Dresvyanskiy et al. 2020; Zoran and Weiss 2012), pattern recognition (Wu et al. 2012; Bishop 2006), econometrics (Articus and Burgard 2014; Compiani and Kitamura 2016), and many others.

We state the Gaussian Mixture Model as follows:

Let \(K \in \mathbb {N}\) be given. The Gaussian Mixture Model (with K components) is given by the (multivariate) density function p of the form

$$\begin{aligned} p(x) = \sum \limits _{j=1}^K \alpha _j p_{{\mathcal {N}}}(x; \mu _j, \Sigma _j), \qquad x \in \mathbb {R}^d, \end{aligned}$$
(1)

with positive mixing weights \(\alpha _j\) that sum to 1 and Gaussian density functions \(p_{{\mathcal {N}}}\) with means \(\mu _j \in \mathbb {R}^d\) and covariance matrices \(\Sigma _j \in \mathbb {R}^{d \times d}\). In order to have a well-defined expression, we impose \(\Sigma _j \succ 0\), i.e. the \(\Sigma _j\) are symmetric positive definite.

Given observations \(x_1,\ldots , x_m\), the goal of parameter estimation for Gaussian Mixture Models consists in maximizing the log-likelihood. This yields the optimization problem

$$\begin{aligned} \max _{\begin{array}{c} {\alpha } \in \varDelta _K \\ \mu _j \in \mathbb {R}^d \\ \Sigma _j \succ 0 \end{array}} \sum \limits _{i=1}^m \log \left( \sum \limits _{j=1}^K \alpha _j p_{{\mathcal {N}}}(x_i; \mu _j, \Sigma _j)\right) , \end{aligned}$$
(2)

where

$$\begin{aligned} \varDelta _K = \left\{ (\alpha _1,\dots , \alpha _K), \text { } \alpha _j \in \mathbb {R}^{+} \forall j, \quad \sum \limits _{j=1}^K \alpha _j =1\right\} \end{aligned}$$

is the K-dimensional probability simplex and the covariance matrices \(\Sigma _j\) are restricted to the set of positive definite matrices.
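For reference, the objective in (2) can be evaluated directly, for example with SciPy. The following minimal sketch (our naming, not taken from the paper) computes the log-likelihood for given parameters:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, alphas, mus, sigmas):
    """Log-likelihood (2) of observations X (shape m x d) under a K-component GMM."""
    dens = np.column_stack([a * multivariate_normal.pdf(X, mean=mu, cov=S)
                            for a, mu, S in zip(alphas, mus, sigmas)])
    return np.log(dens.sum(axis=1)).sum()
```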

In practice, this problem is commonly solved by the Expectation Maximization (EM) algorithm. It is known that the EM algorithm converges fast if the K clusters are well separated; Ma et al. (2000) showed that in such a case the convergence rate is superlinear. However, EM reaches its speed limits for highly overlapping clusters, where the latent variables have a non-negligible probability under more than one cluster. In such a case, the convergence is linear (Xu and Jordan 1996), which might result in slow parameter estimation despite very low per-iteration costs.

From a nonlinear optimization perspective, the problem in (2) can be seen as a constrained nonlinear optimization problem. However, the positive definiteness constraint on the covariance matrices \(\Sigma _j\) is a challenge for applying standard nonlinear optimization algorithms. While this constraint is naturally fulfilled in the EM algorithm, we cannot simply drop it, as alternative methods might leave the parameter space. Approaches such as introducing a Cholesky decomposition (Salakhutdinov et al. 2003; Xu and Jordan 1996) or using interior point methods via smooth convex inequalities (Vanderbei and Benson 1999) can be applied, and one might hope for faster convergence with Newton-type algorithms. Using a Conjugate Gradient algorithm (or a combination of EM and CG) led to faster convergence for highly overlapping clusters (Salakhutdinov et al. 2003). However, imposing the positive-definiteness constraint via a Cholesky decomposition induces additional numerical overhead.

In recent approaches, Hosseini and Sra (2015, 2020) suggest exploiting the geometric structure of the set of positive definite matrices. As an open set in the set of symmetric matrices, it admits a manifold structure (Bhatia 2007), and thus the concepts of Riemannian optimization can be applied. Riemannian optimization, i.e. optimization over parameters that live on a smooth manifold, is well studied and has gained increasing interest in Data Science, for example for tensor completion problems (Heidel and Schulz 2018). However, the idea is quite new for Gaussian Mixture Models, and Hosseini and Sra (2015, 2020) showed promising results with a Riemannian LBFGS and a Riemannian Stochastic Gradient Descent algorithm. Their results are based on a reformulation of the log-likelihood in (2) that turns out to be very efficient in terms of runtime. By design, the algorithms investigated in Hosseini and Sra (2015, 2020) do not use exact second-order information of the objective function. Driven by the quadratic local convergence of the Riemannian Newton method, we thus might hope for faster algorithms once the Riemannian Hessian is available. In the present work, we derive a formula for the Riemannian Hessian of the reformulated log-likelihood and suggest a Riemannian Newton Trust-Region method for parameter estimation of Gaussian Mixture Models.

An arXiv preprint of this article is available (Sembach et al. 2021).

The paper is organized as follows. In Sect. 2, we introduce the reader to the concepts of Riemannian optimization and the Riemannian setting of the reformulated log-likelihood for Gaussian Mixture Models. In particular, we derive the expression for the Riemannian Hessian in Sect. 2.3, which is a key contribution toward richer Riemannian algorithms for Gaussian Mixture Models. In Sect. 3 we present the Riemannian Newton Trust-Region method and prove global convergence and superlinear local convergence for our problem. In Sect. 4, we compare our proposed method against existing algorithms on both artificial and real-world data sets for the tasks of clustering and density approximation.

2 Riemannian setting for Gaussian mixture models

We first build the foundations of Riemannian optimization and then specify the characteristics for Gaussian Mixture Models. In particular, we introduce a formula for the Riemannian Hessian of the reformulated problem, which is the basis for second-order optimization algorithms.

2.1 Riemannian optimization

To construct Riemannian optimization methods, we briefly state the main concepts of optimization on manifolds, or Riemannian optimization. Good introductions are Absil et al. (2008) and Boumal (2020); we here follow the notation of Absil et al. (2008). The concepts of Riemannian optimization generalize unconstrained Euclidean optimization algorithms to (possibly nonlinear) manifolds.

A manifold is a space that locally resembles Euclidean space, meaning that points on the manifold can locally be mapped to \(\mathbb {R}^n\) via bicontinuous mappings; here, n denotes the dimension of the manifold. In order to define a generalization of differentials, Riemannian optimization methods require smooth manifolds, meaning that the transition mappings are smooth functions. As manifolds are in general not vector spaces, standard optimization algorithms like line-search methods cannot be applied directly, since the iterates might leave the admissible set. Instead, one moves along tangent vectors in tangent spaces \(T_{\theta }{\mathcal {M}}\) attached to points \(\theta \in {\mathcal {M}}\). Tangent spaces are first-order approximations of the manifold at specific points, and the tangent bundle \(T{\mathcal {M}}\) is the disjoint union of the tangent spaces \(T_{\theta }{\mathcal {M}}\). In Riemannian manifolds, each tangent space \(T_{\theta }{\mathcal {M}}\), \(\theta \in {\mathcal {M}}\), is endowed with an inner product \(\langle \cdot , \cdot \rangle _{\theta }\) that varies smoothly with \(\theta \). The inner product is essential for Riemannian optimization methods, as it provides a notion of length on the manifold. The optimization methods also require a local pull-back from the tangent spaces \(T_{\theta } {\mathcal {M}}\) to the manifold \({\mathcal {M}}\), which can be interpreted as moving along a specific curve on \({\mathcal {M}}\) (dotted curve in Fig. 1). This is realized by the concept of retractions: retractions are mappings \(R_{\theta }\) from the tangent bundle \(T{\mathcal {M}}\) to the manifold \({\mathcal {M}}\) that satisfy two rigidity conditions. First, the retraction of the zero element \(0_{\theta } \in T_{\theta }{\mathcal {M}}\) at \(\theta \) is \(\theta \) itself; second, we move through \(0_{\theta }\) with velocity \(\xi _{\theta } \in T_{\theta }{\mathcal {M}}\), i.e. \(DR_{\theta } ( 0_{\theta } )[ \xi _{\theta } ] = \xi _{\theta }\) (see Fig. 1).

Roughly speaking, a step of a Riemannian optimization algorithm works as follows:

  • At iterate \(\theta ^t\), compute a new step \(\xi _{\theta ^t}\) in the tangent space \(T_{\theta ^t}{\mathcal {M}}\)

  • Pull the new step back to the manifold by applying the retraction at point \(\theta ^t\), i.e. set \(\theta ^{t+1} = R_{\theta ^t} (\xi _{\theta ^t})\)

Here, the crucial part that impacts the convergence speed is the computation of the new step in the tangent space, just like in the Euclidean case. As Riemannian optimization algorithms are a generalization of Euclidean unconstrained optimization algorithms, we thus introduce generalizations of the gradient and the Hessian.
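To make this two-step template concrete, the following minimal NumPy/SciPy sketch (all names are ours and not taken from any package) performs one retraction-based gradient ascent step on the manifold of symmetric positive definite matrices, using the intrinsic Riemannian gradient (cf. (18) below) and the exponential-map retraction (cf. (17) below):

```python
import numpy as np
from scipy.linalg import expm

def spd_retraction(S, xi):
    """Map the symmetric tangent vector xi at S back onto the SPD manifold,
    cf. the exponential-map retraction (17)."""
    return S @ expm(np.linalg.solve(S, xi))

def spd_ascent_step(S, euclid_grad, step_size=1.0):
    """One retraction-based gradient ascent step on the SPD manifold."""
    sym = 0.5 * (euclid_grad + euclid_grad.T)
    riem_grad = S @ sym @ S        # Riemannian gradient w.r.t. the intrinsic metric, cf. (18)
    return spd_retraction(S, step_size * riem_grad)
```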

Fig. 1 Retraction-based Riemannian optimization

Riemannian Gradient. In order to characterize Riemannian gradients, we need a notion of differential of functions defined on manifolds.

The differential of \(f : {\mathcal {M}}\rightarrow \mathbb {R}\) at \(\theta \) is the linear operator \({{\,\mathrm{D}\,}}f(\theta ) : T_{\theta }{\mathcal {M}} \rightarrow \mathbb {R}\) defined by:

$$\begin{aligned} {{\,\mathrm{D}\,}}f (\theta )[v] = \frac{d}{dt} f(c(t))\biggr |_{t=0}, \end{aligned}$$

where \(c: I \rightarrow {\mathcal {M}}\), \(0 \in I \subset \mathbb {R}\) is a smooth curve on \({\mathcal {M}}\) with \(c'(0) = v\).

The Riemannian gradient can be uniquely characterized by the differential of the function f and the inner product associated with the manifold:

The Riemannian gradient of a smooth function \(f: {\mathcal {M}}\rightarrow \mathbb {R}\) on a Riemannian manifold is a mapping \({{\,\mathrm{grad}\,}}f: {\mathcal {M}}\rightarrow T{\mathcal {M}}\) such that, for all \(\theta \in {\mathcal {M}}\), \({{\,\mathrm{grad}\,}}f(\theta )\) is the unique tangent vector in \(T_{\theta }{\mathcal {M}}\) satisfying

$$\begin{aligned} \langle {{\,\mathrm{grad}\,}}f(\theta ), \xi _{\theta } \rangle _{\theta } = {{\,\mathrm{D}\,}}f(\theta ) [\xi _{\theta }] \quad \forall \xi _{\theta } \in T_{\theta }{\mathcal {M}}. \end{aligned}$$

Riemannian Hessian. Just like we defined Riemannian gradients, we can also generalize the Hessian to its Riemannian version. To do this, we need a tool to differentiate along tangent spaces, namely the Riemannian connection (for details see Absil et al. 2008, Section 5.3).

The Riemannian Hessian of \(f: {\mathcal {M}}\rightarrow \mathbb {R}\) at \(\theta \) is the linear operator \({{\,\mathrm{Hess}\,}}f(\theta ): T_{\theta }{\mathcal {M}}\rightarrow T_{\theta }{\mathcal {M}}\) defined by

$$\begin{aligned} {{\,\mathrm{Hess}\,}}f(\theta )[\xi _{\theta }] = \nabla _{\xi _{\theta }} {{\,\mathrm{grad}\,}}f(\theta ), \end{aligned}$$

where \(\nabla \) is the Riemannian connection with respect to the Riemannian manifold.

2.2 Reformulation of the log-likelihood

In Hosseini and Sra (2015), the authors experimentally showed that applying the concepts of Riemannian optimization directly to the objective in (2) cannot compete with Expectation Maximization. This can mainly be attributed to the fact that the maximization in the M-step of EM, i.e. the maximization of the log-likelihood of a single Gaussian, is a concave problem and thus very easy to solve; it even admits a closed-form solution. However, when considering Riemannian optimization for (2), the maximization of the log-likelihood of a single Gaussian is not geodesically concave (concave along the shortest curve connecting two points on the manifold). The following reformulation introduced by Hosseini and Sra (2015) removes this geometric mismatch and results in a speed-up for the Riemannian algorithms.

We augment the observations \(x_i\) by introducing the observations \(y_i = (x_i, 1)^T \in \mathbb {R}^{d+1}\) for \(i=1, \dots , m\) and consider the optimization problem

$$\begin{aligned} \max _{\theta = ((S_1,\dots , S_K), (\eta _1,\dots , \eta _{K-1}))} \hat{{\mathcal {L}}} (\theta ) = \sum \limits _{i=1}^m \log \left( \sum \limits _{j=1}^K h^i(\theta _j)\right) , \end{aligned}$$
(3)

where

$$\begin{aligned} h^i(\theta _j)&= \frac{\exp (\eta _j)}{\sum \limits _{k=1}^K \exp (\eta _k)} \frac{ \exp \left( \frac{1}{2}\left( 1-y_i^T S_j^{-1} y_i \right) \right) }{\sqrt{(2\pi )^d \det (S_j)}} \nonumber \\&= \frac{\exp (\eta _j)}{\sum \limits _{k=1}^K \exp (\eta _k)}q_{{\mathcal {N}}}(y_i;S_j), \end{aligned}$$
(4)

with \(q_{{\mathcal {N}}}(y_i;S_j) = \sqrt{2\pi } \exp (\frac{1}{2}) p_{{\mathcal {N}}}(y_i;0,S_j)\) for parameters \({\theta _j = (S_j, \eta _j)}\), \(j=1, \dots , K-1\) and \({\theta _K = (S_K, 0)}\).

This means that instead of considering Gaussians of d-dimensional variables x, we now consider Gaussians of \(d+1\)-dimensional variables y with zero mean and covariance matrices \(S_j\). The reformulation leads to faster Riemannian algorithms, as it has been shown in Hosseini and Sra (2015) that the maximization of a single Gaussian

$$\begin{aligned} \max _{S \succ 0} \sum \limits _{i=1}^m \log q_{{\mathcal {N}}}(y_i; S) \end{aligned}$$

is geodesically concave.

Furthermore, this reformulation is faithful, as the original problem (2) and the reformulated problem (3) are equivalent in the following sense:

Theorem 1

(Hosseini and Sra 2015, Theorem 2.2) A local maximum of the reformulated GMM log-likelihood \( \hat{{\mathcal {L}}} (\theta )\) with maximizer \(\theta ^{*} = ({\theta _1}^{*},\dots ,{\theta _K}^{*})\), \({\theta _j}^{*} = ({S_j}^{*}, {\eta _j}^{*})\) is a local maximum of the original log-likelihood \({\mathcal {L}}((\alpha _j, \mu _j,\Sigma _j)_{j})\) with maximizer \(({\alpha _j}^{*}, {\mu _j}^{*}, {\Sigma _j}^{*})_{j}\). Here, \({\mathcal {L}}\) denotes the objective in the problem (2).

The relationship of the maximizers is given by

$$\begin{aligned} S_j^{*}&= \left( \begin{array}{cc} \Sigma _j^{*} + \mu _j^{*}{\mu _j^{*}}^T &{} \quad \mu _j^{*} \\ {\mu _j^{*}}^T &{} \quad 1\\ \end{array} \right) , \end{aligned}$$
(5)
$$\begin{aligned} \eta _j^{*}&= \log \left( \frac{\alpha _j^{*}}{\alpha _K^{*}}\right) \quad j=1, \dots , K-1; \quad \eta _K \equiv 0 . \end{aligned}$$
(6)

This means that instead of solving the original optimization problem (2), we can solve the reformulated problem (3) on its corresponding parameter space and transform the optima back via the relationships (5) and (6).
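For illustration, a minimal NumPy sketch (ours) of this back-transformation: given a maximizer \((S_j^{*}, \eta _j^{*})\) of (3), it recovers \((\alpha _j^{*}, \mu _j^{*}, \Sigma _j^{*})\) by inverting (5) and (6), assuming the lower-right entry of each \(S_j^{*}\) equals 1 as it does at a local maximum.

```python
import numpy as np

def to_original_parameters(S_list, eta):
    """Invert the relations (5) and (6): recover (alpha_j, mu_j, Sigma_j)
    from (S_j, eta_j); assumes S_j[-1, -1] == 1 (cf. Theorem 1)."""
    eta_full = np.append(eta, 0.0)                     # eta_K is fixed to 0
    alpha = np.exp(eta_full) / np.exp(eta_full).sum()  # softmax, inverts (6)
    mus = [S[:-1, -1] for S in S_list]                 # last column of the upper block
    sigmas = [S[:-1, :-1] - np.outer(mu, mu)           # Sigma_j = A_j - mu_j mu_j^T
              for S, mu in zip(S_list, mus)]
    return alpha, mus, sigmas
```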

Penalizing the objective. When applying Riemannian optimization algorithms to the reformulated problem (3), covariance singularity is a challenge. Although this is not observed in many cases in practice, it might result in unstable algorithms. This is due to the fact that the objective in (3) is unbounded from above, see “Appendix A” for details. The same problem has been extensively studied for the original problem (2) and makes convergence theory hard to investigate. An alternative consists in considering the maximum a posteriori log-likelihood as the objective. If conjugate priors are used for the variables \(\mu _j, \Sigma _j\), the optimization problem remains structurally unchanged and results in a bounded objective (see Snoussi and Mohammad-Djafari 2002). Adapted versions of Expectation Maximization have been proposed in the literature and are often applied in practice.

A similar approach has been proposed in Hosseini and Sra (2020) where the reformulated objective (3) is penalized by an additive term that consists of the logarithm of the Wishart prior, i.e. we penalize each of the K components with

$$\begin{aligned} \psi (S_j, \Psi ) = -\frac{\rho }{2} \log \det (S_j) - \frac{\beta }{2} {{\,\mathrm{tr}\,}}(\Psi {S_j}^{-1}), \end{aligned}$$
(7)

where \(\Psi \) is the block matrix

$$\begin{aligned} \Psi = \left( \begin{array}{cc} \frac{\gamma }{\beta }\Lambda + \kappa \lambda \lambda ^T &{} \kappa \lambda \\ \kappa \lambda ^T &{} \kappa \end{array}\right) \end{aligned}$$

for \(\lambda \in \mathbb {R}^d\), \(\Lambda \in \mathbb {R}^{d \times d}\) and \(\gamma , \beta , \rho , \nu , \kappa \in \mathbb {R}\). If we assume that \({\rho = \gamma (d+\nu + 1) + \beta }\), the results of Theorem 1 are still valid for the penalized version, see Hosseini and Sra (2020). Besides, the authors introduce an additive term to penalize very tiny clusters by introducing Dirichlet priors for the mixing coefficients \(\alpha _j = \frac{\exp (\eta _j)}{\sum \limits _{k=1}^K \exp (\eta _k)}\), i.e.

$$\begin{aligned} \varphi (\eta , \zeta ) = \zeta \left( \sum \limits _{j=1}^K \eta _j - K\log \left( \sum \limits _{k=1}^K \exp (\eta _k)\right) \right) . \end{aligned}$$
(8)

In total, the penalized problem is given by

$$\begin{aligned} \max _{\theta } \hat{{\mathcal {L}}}_{\text {pen}} (\theta ; \Psi , \zeta ) = \sum \limits _{i=1}^m \log \left( \sum \limits _{j=1}^K h^i(\theta _j)\right) + Pen(\theta ), \end{aligned}$$
(9)

where

$$\begin{aligned} Pen(\theta ) =\sum \limits _{j=1}^K \psi (S_j, \Psi ) + \varphi (\eta , \zeta ). \end{aligned}$$
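A direct NumPy transcription of this additive penalizer (our sketch; the hyperparameters \(\Psi , \rho , \beta , \zeta \) must be supplied by the user) reads:

```python
import numpy as np

def penalizer(S_list, eta, Psi, rho, beta, zeta):
    """Additive penalizer Pen(theta): Wishart log-prior (7) on each S_j plus
    the Dirichlet-type term (8) on eta (with eta_K = 0)."""
    K = len(S_list)
    psi = sum(-0.5 * rho * np.linalg.slogdet(S)[1]       # -rho/2 * log det(S_j)
              - 0.5 * beta * np.trace(Psi @ np.linalg.inv(S))
              for S in S_list)
    eta_full = np.append(eta, 0.0)
    phi = zeta * (eta_full.sum() - K * np.log(np.exp(eta_full).sum()))
    return psi + phi
```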

The use of such an additive penalizer leads to a bounded objective:

Theorem 2

The penalized optimization problem in (9) is bounded from above.

Proof

We follow the proof for the original objective (2) from Snoussi and Mohammad-Djafari (2002). The penalized objective reads

$$\begin{aligned} \hat{{\mathcal {L}}}_{\text {pen}} (\theta ; \Psi , \zeta ) = \sum \limits _{i=1}^m \log \left( \sum \limits _{j=1}^K q_{{\mathcal {N}}}^{\text {AP}} (\theta _j, S_j; \Psi , \zeta ) \right) , \end{aligned}$$

where

$$\begin{aligned}&q_{{\mathcal {N}}}^{\text {AP}} (\theta _j, S_j; \Psi , \zeta ) = \\&\quad \text { } h^i(\theta _j) \bigg (\prod _{k=1}^K \det (S_k)^{-\frac{\rho }{2}}\exp \left( -\frac{1}{2}{{\,\mathrm{tr}\,}}(S_k^{-1}\Psi )\right) {\alpha _k}^{\zeta } \bigg )^{1/m}. \end{aligned}$$

We get the upper bound

$$\begin{aligned}&q_{{\mathcal {N}}}^{\text {AP}} (\theta _j, S_j; \Psi , \zeta ) \nonumber \\&\quad \le h^i(\theta _j) \prod _{k=1}^K \det (S_k)^{-\frac{\rho }{2}} \exp \left( -\frac{1}{2} {{\,\mathrm{tr}\,}}(S_k^{-1}\Psi )\right) \alpha _k^{\zeta } \nonumber \\&\quad \le a \alpha _j \det (S_j)^{-\frac{d}{2}} \prod _{k=1}^K \det (S_k)^{-\frac{\rho }{2}}\exp \left( -\frac{1}{2} {{\,\mathrm{tr}\,}}(S_k^{-1}\Psi )\right) \nonumber \\&\quad = a \alpha _j (\det (S_j))^{-\frac{d+\rho }{2}} \exp \left( -\frac{1}{2} {{\,\mathrm{tr}\,}}(S_j^{-1}\Psi )\right) \nonumber \\&\qquad \qquad \qquad \quad \times \prod \limits _{\begin{array}{c} k=1 \\ k \ne j \end{array}}^K \det (S_k)^{-\frac{\rho }{2}} \exp \left( -\frac{1}{2} {{\,\mathrm{tr}\,}}(S_k^{-1}\Psi )\right) , \end{aligned}$$
(10)

where we applied Bernoulli’s inequality in the first inequality and used the positive definiteness of \(S_j\) in the second inequality. a is a positive constant independent of \(S_j\) and \(S_k\).

By applying the relationship \(\det (A)^{1/n} \le \frac{1}{n} {{\,\mathrm{tr}\,}}(A)\) for \(A \in \mathbb {R}^{n \times n}\) by the inequality of arithmetic and geometric means, we get for the right hand side of (10)

$$\begin{aligned}&\det (S_k)^{-\frac{b}{2}} \exp (-\frac{1}{2} {{\,\mathrm{tr}\,}}(S_k^{-1}\Psi )) \nonumber \\&\qquad \qquad \le (\det (S_k))^{- \frac{b}{2}} \exp \left( -\frac{d+1}{2} \left( \frac{\det (\Psi )}{\det (S_k)}\right) ^{\frac{1}{d+1}}\right) \end{aligned}$$
(11)

for a constant \(b > 0\). The crucial part on the right side of (11) is when one of the \(S_k\) approaches a singular matrix and thus the determinant approaches zero. Then, we reach the boundary of the parameter space. We study this issue in further detail:

Without loss of generality, let \(k=1\) be the component where this occurs. Let \(S_1^{*}\) be a singular positive semidefinite matrix of rank \(r < d+1\). Then, there exists a decomposition of the form

$$\begin{aligned} S_1^{*} = U^T D U, \end{aligned}$$

where \(D = {{\,\mathrm{diag}\,}}(0, \dots , 0, \lambda _{d-r}, \lambda _{d-r+1}, \dots , \lambda _{d+1})\), \(\lambda _{l} > 0\) for \(l=d-r,\dots , d+1\) and U an orthogonal square matrix of size \(d+1\). Now consider the sequence \(S_1^{(n)}\) given by

$$\begin{aligned} S_1^{(n)} = U^{T} D^{(n)} U, \end{aligned}$$
(12)

where

$$\begin{aligned} D^{(n)} = {{\,\mathrm {diag}\,}}(\lambda _1^{(n)}, \dots , \lambda _{d-r-1}^{(n)},\lambda _{d-r}, \lambda _{d-r+1}, \dots , \lambda _{d+1}) \end{aligned}$$

with \(\left( \lambda _l^{(n)}\right) _{l=1, \dots , d-r-1} \) converging to 0 as \(n \rightarrow \infty \). Then, the matrix \(S_1^{(n)}\) converges to \(S_1^{*}\). Setting \(\lambda ^{(n)} = \prod \limits _{l=1}^{d-r-1} \lambda _l^{(n)}\) and \(\lambda ^{+} = \prod \limits _{l=d-r}^{d+1} \lambda _l\), the right side of (11) reads

$$\begin{aligned} \left( \lambda ^{(n)} \lambda ^{+}\right) ^{-\frac{b}{2}}\exp \left( -\frac{d+1}{2} \left( \frac{\det (\Psi )}{\lambda ^{+}\lambda ^{(n)}}\right) ^{\frac{1}{d+1}}\right) , \end{aligned}$$

which converges to 0 as \(n \rightarrow \infty \) by l'Hôpital's rule. \(\square \)

With Theorem 2, we are able to study the convergence theory of the reformulated problem (3) in Sect. 3.

2.3 Riemannian characteristics of the reformulated problem

To solve the reformulated problem (3) or the penalized reformulated problem (9), we specify the Riemannian characteristics of the optimization problem. It is an optimization problem over the product manifold

$$\begin{aligned} {\mathcal {M}}= \left( \mathbb {P}^{d+1}\right) ^K \times \mathbb {R}^{K-1}, \end{aligned}$$
(13)

where \(\mathbb {P}^{d+1}\) is the set of strictly positive definite matrices of dimension \(d+1\). Since \(\mathbb {P}^{d+1}\) is an open subset of the set of symmetric matrices, its tangent space at any point can be identified with the set of symmetric matrices. Thus, the tangent space of the manifold (13) is given by

$$\begin{aligned} T_{\theta } {\mathcal {M}}&= \left( \mathbb {S}^{d+1}\right) ^K \times \mathbb {R}^{K-1}, \end{aligned}$$
(14)

where \(\mathbb {S}^{d+1}\) is the set of symmetric matrices of dimension \(d+1\). The inner product that is commonly associated with the manifold of positive definite matrices is the intrinsic inner product

$$\begin{aligned} \langle \xi _S, \chi _S \rangle _S = {{\,\mathrm{tr}\,}}(S^{-1}\xi _S S^{-1} \chi _S), \end{aligned}$$
(15)

where \(S \in \mathbb {P}^{d+1}\) and \(\xi _S, \chi _S \in \mathbb {S}^{d+1}\). The inner product defined on the tangent space (14) is the sum over all component-wise inner products and reads

$$\begin{aligned} \langle \xi _{\theta },\chi _{\theta } \rangle _{\theta }&=\sum \limits _{j=1}^K {{\,\mathrm{tr}\,}}(S_j^{-1}\xi _{S_j} S_j^{-1} \chi _{S_j}) + \xi _{\eta }^T \chi _{\eta }, \end{aligned}$$
(16)

where

$$\begin{aligned} \theta&= ((S_1, \dots , S_K), \eta ) \quad&\in {\mathcal {M}},\\ \xi _{\theta }&= \left( (\xi _{S_1}, \dots , \xi _{S_K}), \xi _{\eta }\right) \quad&\in T_{\theta }{\mathcal {M}}, \\ \chi _{\theta }&=\left( (\chi _{S_1}, \dots , \chi _{S_K}), \chi _{\eta }\right) \quad&\in T_{\theta } {\mathcal {M}}. \end{aligned}$$

The retraction we use is the exponential map on the manifold given by

$$\begin{aligned} R_{\theta }(\xi ) = \left( \begin{array}{c} \left( S_j\exp \left( S_j^{-1} \xi _ {S_j}\right) \right) _{j=1,\dots ,K}\\ (\eta _j + \xi _{\eta _j})_{j=1,\dots , K-1} \end{array} \right) , \end{aligned}$$
(17)

see Jeuris et al. (2012).
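In code, the metric (16) and the retraction (17) on the product manifold can be sketched as follows (our notation: a point \(\theta \) is a pair consisting of a list of SPD matrices and the vector \(\eta \), and tangent vectors are structured the same way):

```python
import numpy as np
from scipy.linalg import expm

def inner_product(theta, xi, chi):
    """Metric (16): sum of intrinsic SPD inner products plus the Euclidean part."""
    S_list, _ = theta
    (xi_S, xi_eta), (chi_S, chi_eta) = xi, chi
    val = sum(np.trace(np.linalg.solve(S, a) @ np.linalg.solve(S, b))
              for S, a, b in zip(S_list, xi_S, chi_S))
    return val + xi_eta @ chi_eta

def retraction(theta, xi):
    """Exponential-map retraction (17) on (P^{d+1})^K x R^{K-1}."""
    S_list, eta = theta
    xi_S, xi_eta = xi
    new_S = [S @ expm(np.linalg.solve(S, a)) for S, a in zip(S_list, xi_S)]
    return new_S, eta + xi_eta
```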

Riemannian Gradient and Hessian. We now specify the Riemannian Gradient and the Riemannian Hessian in order to apply second-order methods on the manifold. The Riemannian Hessian in Theorem 4 is novel for the problem of fitting Gaussian Mixture Models and provides a way of making second-order methods applicable.

Theorem 3

The Riemannian gradient of the reformulated problem reads \({{{\,\mathrm{grad}\,}}\hat{{\mathcal {L}}}(\theta ) = \left( \chi _S, \chi _{\eta } \right) }\), with

$$\begin{aligned} \chi _S&= \left( \frac{1}{2} \sum \limits _{i=1}^m f_l^i (y_i {y_i}^T - S_l)\right) _{l=1,\dots , K}, \\ \chi _{\eta }&= \left( \sum \limits _{i=1}^m f_r^i - \alpha _r \sum \limits _{j=1}^K f_j^i \right) _{r=1,\dots , K-1}, \end{aligned}$$

where

$$\begin{aligned} f_l^i&= \frac{h^i(\theta _l)}{\sum \limits _{j=1}^K h^i(\theta _j)},\\ \alpha _r&= \frac{\exp (\eta _r)}{\sum \limits _{k=1}^K \exp (\eta _k)}. \end{aligned}$$

The additive terms for the penalizers in (7), (8) are given by

$$\begin{aligned} {{\,\mathrm{grad}\,}}{{\,\mathrm{Pen}\,}}(\theta ) = \left( \begin{array}{c} \left( -\frac{1}{2}\left( \rho S_l - \beta \Psi \right) \right) _{l=1,\dots , K} \\ \zeta \left( 1- K \alpha _r \right) _{r=1,\dots , K-1} \end{array} \right) . \end{aligned}$$

Proof

The Riemannian gradient of a product manifold is the Cartesian product of the individual expressions (Absil et al. 2008). We compute the Riemannian gradients with respect to \(S_1, \dots , S_K\) and \(\eta \).

The gradient with respect to \(\eta \) is the classical Euclidean gradient, hence we get by using the chain rule

$$\begin{aligned} \left( {{\,\mathrm {grad}\,}}\hat{{\mathcal {L}}}(\theta )\right) _{\eta _r}&= \left( {{\,\mathrm {grad}\,}}^{e} \hat{{\mathcal {L}}}(\theta )\right) _{\eta _r} \\&= \sum \limits _{i=1}^m\left( \sum \limits _{k=1}^K h^i(\theta _k)\right) ^{-1} \sum \limits _{j=1}^K \frac{\partial h^i(\theta _j)}{\partial \eta _r} \\&= \sum \limits _{i=1}^m \sum \limits _{j=1}^K f_j^i \left( {1\text {1}}_{\{j=r\}}- \frac{\exp (\eta _r)}{\sum \limits _{k=1}^K \exp (\eta _k)} \right) , \end{aligned}$$

for \(r=1, \dots , K-1\), where \({1\text {1}}_{\{j=r\}} =1\) if \(j=r\) and 0 otherwise.

For the derivative of the penalizer with respect to \(\eta _r\), we get

$$\begin{aligned} \left( {{\,\mathrm{grad}\,}}Pen(\theta )\right) _{\eta _r}&=\left( {{\,\mathrm{grad}\,}}^{e} Pen(\theta )\right) _{\eta _r} \\&= \zeta \left( 1-K\frac{\exp (\eta _r)}{\sum \limits _{j=1}^{K}\exp (\eta _j)}\right) . \end{aligned}$$

The Riemannian gradient with respect to the matrices \(S_1, \ldots , S_K\) is the Euclidean gradient projected onto the subspace \(T_{S_j} \mathbb {P}^{d+1}\) (with inner product (15)), see Absil et al. (2008) and Boumal (2020). The relationship between the Euclidean gradient \({{\,\mathrm{grad}\,}}^{e} f\) and the Riemannian gradient \({{\,\mathrm{grad}\,}}f\) for an arbitrary function \(f: \mathbb {P}^n \rightarrow \mathbb {R}\), with respect to the intrinsic inner product (15) defined on the set of positive definite matrices, reads

$$\begin{aligned} {{\,\mathrm{grad}\,}}f(S) = \frac{1}{2} S \left( {{\,\mathrm{grad}\,}}^{e} f(S) + \left( {{\,\mathrm{grad}\,}}^{e} f(S)\right) ^T \right) S, \end{aligned}$$
(18)

see for example Hosseini and Sra (2015) and Jeuris et al. (2012). In a first step, we thus compute the Euclidean gradient with respect to a matrix \(S_l\):

$$\begin{aligned} \left( {{\,\mathrm{grad}\,}}^{e} \hat{{\mathcal {L}}}(\theta )\right) _{S_l} = -\frac{1}{2} \sum \limits _{i=1}^m f_l^i (S_l^{-1}y_i {y_i}^T S_l^{-1}- S_l^{-1}), \end{aligned}$$
(19)

where we used the Leibniz rule and the partial matrix derivatives

$$\begin{aligned}&\frac{\partial \left( \det (S_l)^{-1/2}\right) }{\partial S_l} = -\frac{1}{2} (\det (S_l))^{-1/2} S_l^{-1},\\&\frac{\partial \exp \left( -\frac{1}{2}y_i^TS_l^{-1}y_i\right) }{\partial S_l} = \frac{1}{2}\exp \left( -\frac{1}{2}y_i^TS_l^{-1}y_i\right) \\&\qquad \quad \times S_l^{-1}y_i {y_i}^T S_l^{-1}, \end{aligned}$$

which holds by the chain rule and the fact that \(S_l^{-1}\) is symmetric.

Using the relationship (18) and using (19) yields the Riemannian gradient with respect to \(S_l\). It is given by

$$\begin{aligned} \left( {{\,\mathrm{grad}\,}}\hat{{\mathcal {L}}}(\theta )\right) _{S_l} = \frac{1}{2} \sum \limits _{i=1}^m f_l^i ( y_i {y_i}^T - S_l). \end{aligned}$$

Analogously, we compute the Euclidean gradient of the matrix penalizer \(\psi (S_j, \Psi )\) and use the relationship (18) to get the Riemannian gradient of the matrix penalizer. \(\square \)
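The following sketch (ours, plain NumPy; numerical safeguards such as log-domain computations are omitted) evaluates \(h^i(\theta _j)\), the weights \(f_l^i\) and the Riemannian gradient from Theorem 3 for augmented observations \(y_i\) stacked in a matrix Y of shape (m, d+1):

```python
import numpy as np

def responsibilities(Y, S_list, eta):
    """Return F with F[i, l] = f_l^i = h^i(theta_l) / sum_j h^i(theta_j),
    and the mixing weights alpha derived from eta (with eta_K = 0)."""
    m, d1 = Y.shape
    d = d1 - 1
    eta_full = np.append(eta, 0.0)
    alpha = np.exp(eta_full) / np.exp(eta_full).sum()
    H = np.empty((m, len(S_list)))
    for j, S in enumerate(S_list):
        quad = np.einsum('ij,jk,ik->i', Y, np.linalg.inv(S), Y)   # y_i^T S_j^{-1} y_i
        q = np.exp(0.5 * (1.0 - quad)) / np.sqrt((2 * np.pi) ** d * np.linalg.det(S))
        H[:, j] = alpha[j] * q                                    # h^i(theta_j), cf. (4)
    return H / H.sum(axis=1, keepdims=True), alpha

def riemannian_gradient(Y, S_list, eta):
    """Riemannian gradient (chi_S, chi_eta) of the reformulated log-likelihood (Theorem 3)."""
    F, alpha = responsibilities(Y, S_list, eta)
    YY = Y[:, :, None] * Y[:, None, :]                            # outer products y_i y_i^T
    chi_S = [0.5 * ((F[:, l, None, None] * YY).sum(0) - F[:, l].sum() * S)
             for l, S in enumerate(S_list)]
    chi_eta = F[:, :-1].sum(0) - alpha[:-1] * F.sum()             # note sum_j f_j^i = 1, so F.sum() == m
    return chi_S, chi_eta
```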

To apply Newton-like algorithms, we derived a formula for the Riemannian Hessian of the reformulated problem. It is stated in Theorem 4:

Theorem 4

Let \(\theta \in {\mathcal {M}}\) and \(\xi _{\theta } \in T_{\theta }{\mathcal {M}}\). The Riemannian Hessian is given by

$$\begin{aligned} {{\,\mathrm{Hess}\,}}\left( \hat{{\mathcal {L}}}(\theta ) \right) [\xi _{\theta }] = \left( \zeta _{S}, \zeta _{\eta }\right) \in T_{\theta }{\mathcal {M}}, \end{aligned}$$

where

$$\begin{aligned} \zeta _{S_l}&= -\frac{1}{4} \sum \limits _{i=1}^m {f_l }^i \bigg [ {C_l}^i - \bigg ( {a_l}^i- \sum \limits _{j=1}^K {f_j}^i{a_j}^i\bigg )(y_i {y_i}^T - S_l) \bigg ] \end{aligned}$$
(20)
$$\begin{aligned} \zeta _{\eta _r}&= \frac{1}{2} \sum \limits _{i=1}^m \bigg [ {f_r}^i \bigg ({a_r}^i - \sum \limits _{j=1}^K {f_j}^i {a_j}^i\bigg )&\nonumber \\&\qquad \qquad - 2\alpha _r \left( \xi _{\eta _r} - \sum \limits _{j=1}^{K-1} \alpha _j \xi _{\eta _j}\right) \bigg ]&\end{aligned}$$
(21)

for \(l=1,\dots , K\), \(r=1,\dots , K-1 \) and

$$\begin{aligned}&{a_l}^i = {y_i}^T {S_l}^{-1} \xi _{S_l}{S_l}^{-1}y_i - {{\,\mathrm{tr}\,}}(S_l^{-1}\xi _{S_l}) + 2\xi _{\eta _l}, \\&{f_l}^i = \frac{h^i(\theta _l)}{\sum \limits _{j=1}^K h^i(\theta _j)}, \\&{C_l}^i = y_i {y_i}^T{S_l}^{-1}\xi _{S_l} + \xi _{S_l}{S_l}^{-1}y_i {y_i}^T \\&\alpha _r = \frac{\exp (\eta _r)}{\sum \limits _{k=1}^K \exp (\eta _k)}, \\&\xi _{\eta _K} \equiv 0. \end{aligned}$$

The Hessian for the additive penalizer reads \({{\,\mathrm{Hess}\,}}({{\,\mathrm{Pen}\,}}(\theta ))[\xi _{\theta }] = \left( {\zeta _{S}}^{pen}, {\zeta _{\eta }}^{pen} \right) \), where

$$\begin{aligned} \zeta _{S_l}^{pen}&= \frac{\beta }{4}\left( \Psi {S_l}^{-1} \xi _{S_l} + \xi _{S_l}{S_l}^{-1}\Psi \right) , \\ \zeta _{\eta _r}^{pen}&= K\zeta \alpha _r \left( \xi _{\eta _r} - \sum \limits _{j=1}^{K-1} \alpha _j \xi _{\eta _j}\right) . \end{aligned}$$

Proof

It can be shown that for a product manifold \({\mathcal {M}} = {\mathcal {M}}_1 \times {\mathcal {M}}_2\) with \(\nabla ^1, \nabla ^2\) being the Riemannian connections of \({\mathcal {M}}_1, {\mathcal {M}}_2\), respectively, the Riemannian connection \(\nabla \) of \({\mathcal {M}}\) for vector fields \(X, Y\) on \({\mathcal {M}}\) is given by

$$\begin{aligned} \nabla _Y (X) = \nabla _{Y_1}^1 X_1 \times \nabla _{Y_2}^2 X_2, \end{aligned}$$

where \(X_1, Y_1 \in T{\mathcal {M}}_1\) and \(X_2, Y_2 \in T{\mathcal {M}}_2\) (Carmo 1992).

Applying this to our problem, we use the Riemannian connections of the individual components on the Riemannian gradient derived in Theorem 3. This yields

$$\begin{aligned}&{{\,\mathrm{Hess}\,}}\hat{{\mathcal {L}}}(\theta )[\xi _{\theta }] = \nabla _{\theta } {{\,\mathrm{grad}\,}}\hat{{\mathcal {L}}}(\theta ) \nonumber \\&\qquad \qquad = \left( \left( \nabla _{\xi _{S_l}} ^{(pd)} {{\,\mathrm{grad}\,}}\hat{{\mathcal {L}}}(\theta )\right) _{l}, \left( \nabla _{\xi _{\eta _r}}^{e} {{\,\mathrm{grad}\,}}\hat{{\mathcal {L}}}(\theta )\right) _{r}\right) ^{T} \end{aligned}$$
(22)

for \(l=1, \dots , K\) and \(r=1,\dots , K-1\).

We will now specify the single components of (22). Let

$$\begin{aligned} {{\,\mathrm{grad}\,}}_{S_l}\hat{{\mathcal {L}}}(\theta )&= \left( {{\,\mathrm{grad}\,}}\hat{{\mathcal {L}}}(\theta )\right) _{S_l}, \\ {{\,\mathrm{grad}\,}}_{\eta _r}\hat{{\mathcal {L}}}(\theta )&= \left( {{\,\mathrm{grad}\,}}\hat{{\mathcal {L}}}(\theta )\right) _{\eta _r} \end{aligned}$$

denote the Riemannian gradient from Theorem 3 at position \(S_l\), \(\eta _r\), respectively.

For the latter part in (22), we observe that the Riemannian connection \(\nabla _{\xi _{\eta _r}}^{e}\) for \(\xi _{\eta _r} \in \mathbb {R}\) is the classical vector field differentiation (Absil et al. 2008, Section 5.3). We obtain

$$\begin{aligned} \nabla _{\xi _{\eta _r}}^{e}{{\,\mathrm{grad}\,}}\hat{{\mathcal {L}}}(\theta )&= \sum \limits _{j=1}^K {{\,\mathrm{D}\,}}_{S_j} ({{\,\mathrm{grad}\,}}_{\eta _r}\hat{{\mathcal {L}}}(\theta ))[\xi _{S_j}] \nonumber \\&\quad \quad + \sum \limits _{j=1}^{K-1} {{\,\mathrm{D}\,}}_{\eta _j} ({{\,\mathrm{grad}\,}}_{\eta _r} \hat{{\mathcal {L}}}(\theta ))[\xi _{\eta _j}], \end{aligned}$$
(23)

where \({{\,\mathrm{D}\,}}_{S_j}(\cdot )[\xi _{S_j}]\) and \({{\,\mathrm{D}\,}}_{\eta _j}(\cdot )[\xi _{\eta _j}]\) denote the classical Fréchet derivatives with respect to \(S_j\) and \(\eta _j\) along the directions \(\xi _{S_j}\) and \(\xi _{\eta _j}\), respectively.

For the first part on the right hand side of (23), we have

$$\begin{aligned}&\sum \limits _{j=1}^K {{\,\mathrm{D}\,}}_{S_j}({{\,\mathrm{grad}\,}}_{\eta _r} \hat{{\mathcal {L}}}(\theta ))[\xi _{S_j}] \\&= \frac{1}{2}\sum \limits _{i=1}^m \bigg [ \frac{h^i(\theta _r)}{\sum \limits _{k=1}^K h^i(\theta _k)}\bigg ({y_i}^T {S_r}^{-1} \xi _{S_r}{S_r}^{-1}y_i - {{\,\mathrm{tr}\,}}(S_r^{-1}\xi _{S_r}) \\&\qquad - \sum \limits _{j=1}^K \frac{h^i(\theta _j)}{\sum \limits _{k=1}^K h^i(\theta _k)} ({y_i}^T {S_j}^{-1} \xi _{S_j}{S_j}^{-1}y_i - {{\,\mathrm{tr}\,}}(S_j^{-1}\xi _{S_j}))\bigg )\bigg ] \end{aligned}$$

and for the second part

$$\begin{aligned}&\sum \limits _{j=1}^{K-1} {{\,\mathrm{D}\,}}_{\eta _j}({{\,\mathrm{grad}\,}}_{\eta _r} \hat{{\mathcal {L}}}(\theta ))[\xi _{\eta _j}] \\&\qquad =\sum \limits _{i=1}^m \bigg [ \bigg (\frac{h^i(\theta _r)}{\sum \limits _{k=1}^K h^i(\theta _k)}-\alpha _r \bigg ) \xi _{\eta _r} + \alpha _r \sum \limits _{j=1}^{K-1} \alpha _j \xi _{\eta _j}\\&\qquad \qquad - \frac{h^i(\theta _r)}{\sum \limits _{k=1}^K h^i(\theta _k)} \sum \limits _{j=1}^{K-1} \frac{h^i(\theta _j)}{\sum \limits _{k=1}^K h^i(\theta _k)} \xi _{\eta _j} \bigg ] \end{aligned}$$

by applying the chain rule, the Leibniz rule and the relationship \(\alpha _l = \frac{\exp (\eta _l)}{\sum \limits _{k=1}^K \exp (\eta _k)}\). Plugging the terms into (23), this yields the expression for \(\zeta _{\eta _r}\) in (21).

For the Hessian with respect to the matrices \(S_l\), we first need to specify the Riemannian connection with respect to the inner product (15). It is uniquely determined as the solution to the Koszul formula (Absil et al. 2008, Section 5.3), hence we need to find an affine connection that satisfies this formula. For a positive definite matrix S, symmetric matrices \(\nu _{S}, \xi _{S}\) and the Fréchet derivative \({{\,\mathrm{D}\,}}(\xi _S)[\nu _{S}]\), this solution is given by (Jeuris et al. 2012; Sra and Hosseini 2015)

$$\begin{aligned} \nabla _{\nu _{S}}^{(pd)} \xi _{S} = {{\,\mathrm{D}\,}}(\xi _S)[\nu _{S}] - \frac{1}{2} (\nu _{S} S^{-1}\xi _{S} + \xi _{S}S^{-1} \nu _{S}), \end{aligned}$$

where \(\xi _S\), \(\nu _S\) are vector fields on \({\mathcal {M}}\) and \({{\,\mathrm{D}\,}}(\xi _S)[\nu _{S}]\) denotes the classical Fréchet derivative of \(\xi _S\) along the direction \(\nu _S\). Hence, for the first part in (22), we get

$$\begin{aligned}&\left( \nabla _{\xi _{S_l}} ^{(pd)} {{\,\mathrm{grad}\,}}\hat{{\mathcal {L}}}(\theta )\right) _l \nonumber \\&= \bigg (\sum \limits _{j=1}^K {{\,\mathrm{D}\,}}_{S_j}({{\,\mathrm{grad}\,}}_{S_l} \hat{{\mathcal {L}}}(\theta ))[\xi _{S_j}] + \sum \limits _{j=1}^{K-1} {{\,\mathrm{D}\,}}_{\eta _j} ({{\,\mathrm{grad}\,}}_{S_l} \hat{{\mathcal {L}}}(\theta ))[\xi _{\eta _j}] \nonumber \\&\qquad - \frac{1}{2} \left( {{\,\mathrm{grad}\,}}_{S_l} \hat{{\mathcal {L}}}(\theta )S_l^{-1}\xi _{S_l} + \xi _{S_l} S_l^{-1} {{\,\mathrm{grad}\,}}_{S_l}\hat{{\mathcal {L}}}(\theta )\right) \bigg )_l. \end{aligned}$$
(24)

After applying the chain rule and Leibniz rule, we obtain

$$\begin{aligned}&\sum \limits _{j=1}^K {{\,\mathrm{D}\,}}_{S_j}({{\,\mathrm{grad}\,}}_{S_l}\hat{{\mathcal {L}}}(\theta ))[\xi _{S_j}] \nonumber \\&= -\frac{1}{4} \sum \limits _{i=1}^m f_l^i \bigg [2\xi _{S_l} - \bigg (({y_i}^T {S_l}^{-1} \xi _{S_l}{S_l}^{-1}y_i - {{\,\mathrm{tr}\,}}({S_l}^{-1}\xi _{S_l})) \nonumber \\&\qquad \qquad + \sum \limits _{j=1}^K f_j^i ( {y_i}^T {S_j}^{-1} \xi _{S_j}{S_j}^{-1}y_i - {{\,\mathrm{tr}\,}}({S_j}^{-1}\xi _{S_j}))\bigg ) \nonumber \\&\qquad \qquad \times (y_i {y_i}^T - S_l) \bigg ] \end{aligned}$$
(25)

and

$$\begin{aligned}&\sum \limits _{j=1}^{K-1} {{\,\mathrm{D}\,}}_{\eta _j} ({{\,\mathrm{grad}\,}}_{S_l}\hat{{\mathcal {L}}}(\theta ))[\xi _{\eta _j}] \nonumber \\&\qquad =\frac{1}{2} \sum \limits _{i=1}^m \left( \frac{h^i(\theta _l)}{\sum \limits _{k=1}^K h^i(\theta _k)} \bigg (\xi _{\eta _l} - \sum \limits _{j=1}^{K-1} \frac{h^i(\theta _j)}{\sum \limits _{k=1}^K h^i(\theta _k)} \xi _{\eta _j}\bigg )\right) \nonumber \\&\qquad \qquad \qquad \times (y_i {y_i}^T - S_l). \end{aligned}$$
(26)

We plug (25), (26) into (24) and use the Riemannian gradient at position \(S_l\) for the last term in (24). After some rearrangement of terms, we obtain the expression for \(\zeta _{S_l}\) in (20).

The computation of \({{\,\mathrm{Hess}\,}}(\text {Pen}(\theta ))[\xi _{\theta }]\) is analogous by replacing \(\hat{{\mathcal {L}}}\) with \(\varphi (\eta , \zeta )\) in (23) and with \(\psi (S_l, \Psi )\) in (24). \(\square \)
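For completeness, a direct transcription of (20) and (21) into a Hessian-vector product (our sketch; it expects the responsibilities matrix F and the weights alpha, e.g. from the responsibilities helper in the previous sketch, and omits the penalizer terms):

```python
import numpy as np

def hessian_vector_product(Y, S_list, F, alpha, xi_S, xi_eta):
    """Riemannian Hessian of the reformulated log-likelihood applied to
    xi = (xi_S, xi_eta), transcribed from (20)-(21) in Theorem 4."""
    m, K = F.shape
    xi_eta_full = np.append(xi_eta, 0.0)                    # xi_{eta_K} = 0
    Sinv = [np.linalg.inv(S) for S in S_list]
    YY = Y[:, :, None] * Y[:, None, :]                      # y_i y_i^T
    # a_l^i = y_i^T S_l^{-1} xi_{S_l} S_l^{-1} y_i - tr(S_l^{-1} xi_{S_l}) + 2 xi_{eta_l}
    A = np.stack([np.einsum('ij,jk,ik->i', Y, Sinv[l] @ xi_S[l] @ Sinv[l], Y)
                  - np.trace(Sinv[l] @ xi_S[l]) + 2.0 * xi_eta_full[l]
                  for l in range(K)], axis=1)
    a_bar = (F * A).sum(axis=1)                             # sum_j f_j^i a_j^i
    zeta_S = []
    for l in range(K):
        C = YY @ (Sinv[l] @ xi_S[l]) + (xi_S[l] @ Sinv[l]) @ YY       # C_l^i
        term = C - (A[:, l] - a_bar)[:, None, None] * (YY - S_list[l])
        zeta_S.append(-0.25 * (F[:, l, None, None] * term).sum(axis=0))
    centred = xi_eta - (alpha[:-1] * xi_eta).sum()          # xi_{eta_r} - sum_j alpha_j xi_{eta_j}
    zeta_eta = 0.5 * ((F[:, :-1] * (A[:, :-1] - a_bar[:, None])).sum(axis=0)
                      - 2.0 * m * alpha[:-1] * centred)
    return zeta_S, zeta_eta
```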

3 Riemannian Newton trust-region algorithm

Equipped with the Riemannian gradient and the Riemannian Hessian, we are now in the position to apply Newton-type algorithms to our optimization problem. As studying the positive definiteness of the Riemannian Hessian from Theorem 4 is hard, we suggest introducing a safeguarding strategy for the Newton method by applying a Riemannian Newton Trust-Region method.

3.1 Riemannian Newton trust-region method

The Riemannian Newton Trust-Region algorithm is the retraction-based generalization of the standard trust-region method (Conn et al. 2000) to manifolds, where the quadratic subproblem uses the Hessian information of an objective function f that we seek to minimize. Theory on the Riemannian Newton Trust-Region method can be found in detail in Absil et al. (2008); we state the method in Algorithm 1. Furthermore, we will study both global and local convergence theory for our penalized problem.

Algorithm 1 Riemannian Newton Trust-Region method
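For orientation, the following generic sketch shows the structure of such a retraction-based trust-region loop (our code, not the paper's implementation; tangent vectors are flat NumPy arrays, the subproblem is only solved to the Cauchy point instead of the truncated CG of Appendix B, and the update factors 0.25/0.75/2 are common textbook defaults rather than the parameters \(\omega _1, \omega _2, \tau _1, \tau _2\) of Algorithm 1):

```python
import numpy as np

def rtr_minimize(f, grad, hess_vec, retract, inner, theta,
                 delta=1.0, delta_max=100.0, rho_prime=0.1, n_iter=200):
    """Skeleton of a retraction-based (Riemannian) trust-region method."""
    for _ in range(n_iter):
        g = grad(theta)
        gnorm = np.sqrt(inner(theta, g, g))
        gHg = inner(theta, g, hess_vec(theta, g))
        # Cauchy step: minimize the quadratic model along -g inside the trust region
        tau = 1.0 if gHg <= 0 else min(gnorm ** 3 / (delta * gHg), 1.0)
        s = -(tau * delta / gnorm) * g
        snorm = np.sqrt(inner(theta, s, s))
        pred = -(inner(theta, g, s) + 0.5 * inner(theta, s, hess_vec(theta, s)))
        ared = f(theta) - f(retract(theta, s))
        rho = ared / pred if pred > 0 else -np.inf        # quality of the model
        if rho > rho_prime:                               # accept the step
            theta = retract(theta, s)
        if rho < 0.25:                                    # poor model fit: shrink the radius
            delta *= 0.25
        elif rho > 0.75 and np.isclose(snorm, delta):     # good fit at the boundary: grow
            delta = min(2.0 * delta, delta_max)
    return theta
```

In the actual method, the Cauchy step is replaced by the truncated CG solution of the quadratic subproblem (Appendix B), and the preconditioning and radius-initialization heuristics of Sect. 3.2 come on top of this skeleton.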

Global convergence. In the following theorem, we show that the Riemannian Newton Trust-Region algorithm applied to the reformulated penalized problem converges to a stationary point. To the best of our knowledge, this has not been proved before.

Theorem 5

(Global convergence) Consider the penalized reformulated objective \(\hat{{\mathcal {L}}}_{pen}\) from (9). If we apply the Riemannian Newton Trust-Region Algorithm (Algorithm 1) to minimize \(f= -\hat{{\mathcal {L}}}_{pen}\), it holds

$$\begin{aligned} \lim \limits _{t \rightarrow \infty } {{\,\mathrm{grad}\,}}\hat{{\mathcal {L}}}_{pen}(\theta ^{t}) = 0. \end{aligned}$$
(27)

Proof

According to general global convergence results for Riemannian manifolds, convergence to a stationary point is given if the level set \(\{\theta : f(\theta ) \le f(\theta ^0)\}\) is a compact Riemannian manifold (Absil et al. 2008, Proposition 7.4.5) and the function f is lower bounded. Since the penalized reformulated objective \(\hat{{\mathcal {L}}}_{pen}\) is upper bounded by Theorem 2, the latter is fulfilled. In the following, we will show that the iterates produced by Algorithm 1 stay in a compact set:

Let \(\theta ^0= ((S_1^0, \dots , S_K^0), \eta ^0) \in {\mathcal {M}}\) be the starting point of Algorithm 1 applied on the objective \(f = - \hat{{\mathcal {L}}}_{pen}\). We further define the set of successful (unsuccessful) steps \({\mathfrak {S}}_t\) (\({\mathfrak {F}}_t\)) generated by the algorithm until iteration t by

$$\begin{aligned} {\mathfrak {S}}_t&= \{l \in \{0,1,\dots , t\}: \rho _l > \rho '\}, \\ {\mathfrak {F}}_t&= \{l \in \{0,1,\dots , t\}: \rho _l \le \rho '\}. \end{aligned}$$

At iterate t, we consider \(\theta ^t = ((S_1^t, \dots , S_K^t), \eta ^t) \in {\mathcal {M}}\). Let \(s^t = \left( \left( s_{S_1}^t, \dots ,s_{S_K}^t \right) , {s_{\eta }}^t \right) \in T_{\theta ^t}{\mathcal {M}}\) be the tangent vector returned by solving the quadratic subproblem in line 3 of Algorithm 1 and \(R_{S_j^t} = S_j^t \exp \left( (S_j^t)^{-1} s_{S_j}^t\right) \) be the retraction part of \(S_j\), see (17). Then

$$\begin{aligned}&\left\Vert \theta ^{t}\right\Vert = \left\Vert R_{\theta ^{t-1}}(s^t)\right\Vert {1\text {1}}_{\{t \in {\mathfrak {S}}_t\}} + \left\Vert \theta ^{t-1}\right\Vert {1\text {1}}_{\{t \in {\mathfrak {F}}_t\}} \nonumber \\&\le \sum \limits _{j=1}^K \left( \left\Vert R_{S_j^t} (s_{S_j}^t)\right\Vert {1\text {1}}_{\{t \in {\mathfrak {S}}_t\}} + \left\Vert S_j^{t-1}\right\Vert {1\text {1}}_{\{t \in {\mathfrak {F}}_t\}} \right) + \left\Vert \eta ^t\right\Vert , \end{aligned}$$
(28)

where \({1\text {1}}_{\{t \in {\mathfrak {S}}_t\}}\) is the indicator function, i.e. \({1\text {1}}_{\{t \in {\mathfrak {S}}_t\}} =1\) if \(t \in {\mathfrak {S}}_t\) and 0 otherwise.

Since \(\left\Vert s^t\right\Vert _{\theta ^t} \le \varDelta \) for all t, there exists \({\bar{\varDelta }}^1 > 0\), \({\bar{\varDelta }}^{2} >0\) such that \(\left\Vert s_{S_j}^t\right\Vert \le {\bar{\varDelta }}^{1}\) for all \(j=1, \dots , K\) and \(\left\Vert s_{\eta }^t\right\Vert \le {\bar{\varDelta }}^{2}\). Hence, the second part on the right hand side of (28) yields

$$\begin{aligned} \left\Vert \eta ^t\right\Vert&= \left\Vert \eta ^0 + \sum \limits _{l=1}^{t-1} s_{\eta }^l {1\text {1}}_{\{t \in {\mathfrak {S}}_t\}}\right\Vert \\&\le \left\Vert \eta ^0\right\Vert + \sum \limits _{l=1}^{t-1} \left\Vert s_{\eta }^l\right\Vert \\&\le \left\Vert \eta ^0\right\Vert + t {\bar{\varDelta }}^2. \end{aligned}$$

For the first part on the right-hand side of (28), we take a closer look at \(\left\Vert R_{S_j^t} (s_{S_j}^t)\right\Vert \). For better readability, we omit the index j and simply write \(\left\Vert R_{S^t} (s_{S}^t)\right\Vert \). Let \(\lambda (S^l)\) denote the minimal eigenvalue of S at iteration l. By the Cauchy–Schwarz inequality and using the retraction (17), we get

$$\begin{aligned}&\left\Vert R_{S^t} (s_{S}^t)\right\Vert \end{aligned}$$
(29)
$$\begin{aligned}&\quad \le \left\Vert S^{t-1}\right\Vert \exp \left( \left\Vert (S^{t-1})^{-1}\right\Vert \left\Vert s_S^t\right\Vert \right) \nonumber \\&\quad \le \left\Vert S^{t-1}\right\Vert \exp \left( \frac{{\bar{\varDelta }}^1}{\left\Vert S^{t-1}\right\Vert }\right) \nonumber \\&\quad \le \left\Vert S^{t-1}\right\Vert \exp \left( \frac{{\bar{\varDelta }}^1}{(d+1)\lambda (S^{t-1})}\right) \nonumber \\&\quad \le \left\Vert S^{t-2}\right\Vert \exp \left( \frac{{\bar{\varDelta }}^1}{(d+1)\lambda (S^{t-1})}\right) \end{aligned}$$
(30)
$$\begin{aligned}&\quad \quad \times \left( \exp \left( \frac{{\bar{\varDelta }}^1}{(d+1)\lambda (S^{t-2})} \right) {{1\text {1}}}_{\{t-1 \in {\mathfrak {S}}_t\}} + {1\text {1}}_{\{t \in {\mathfrak {F}}_t\}}\right) . \end{aligned}$$
(31)

By applying (31) iteratively to (28), we get

$$\begin{aligned} \left\| S^t\right\| \le \left\| S^0\right\| \exp \left( \frac{{\bar{\varDelta }}^1}{d+1} \sum \limits _{l=0}^{t-1} \frac{1}{\lambda (S^l)} {1\text {1}}_{\{l \in {\mathfrak {S}}_t\}} \right) . \end{aligned}$$
(32)

We now show that the eigenvalues of \(S_l\) cannot become arbitrarily small. For this, assume that there exists a subsequence of minimal eigenvalues \(\{\lambda (S_{l_i})\}_{i}\) such that \(\lambda (S_{l_i}) \rightarrow 0\) for \(i \rightarrow \infty \). Then, according to the proof of Theorem 2, this yields \(\hat{{\mathcal {L}}}_{pen} (\theta ^{l_i}) \rightarrow -\infty \) for \(i \rightarrow \infty \). This is a contradiction, as the Riemannian Trust-Region method in Algorithm 1 ensures a decrease in f (increase in \(\hat{{\mathcal {L}}}_{pen}\)) in every iteration.

Thus, the right hand side of (32) cannot become infinitely large and there exists a constant \(C > 0\) such that \(\left\Vert \theta ^{t}\right\Vert \le C\). Hence, we see that our algorithm applied to the (penalized) problem of fitting Gaussian Mixture Models converges to stationary points. \(\square \)

We have shown that Algorithm 1 applied to the problem of fitting Gaussian Mixture Models converges to a stationary point for all starting values. From the general convergence theory for Riemannian Trust-Region algorithms (Absil et al. 2008, Section 7.4.2), under some assumptions, the convergence speed of Algorithm 1 is superlinear. In the following, we show that these assumptions are fulfilled and specify the convergence rate.

Local convergence. Local convergence results for (Riemannian) Trust-Region methods depend on the solver for the quadratic subproblem. If the quadratic subproblem in Algorithm 1 is (approximately) solved with sufficient decrease, the local convergence close to a maximizer of \(\hat{{\mathcal {L}}}_{pen}\) is superlinear under mild assumptions. The truncated Conjugate Gradient method is a typical choice that returns a sufficient decrease; we suggest using this matrix-free method for Gaussian Mixture Models. We state the algorithm in “Appendix B”. The local superlinear convergence is stated in Theorem 6.

Theorem 6

(Local convergence) Consider Algorithm 1, where the quadratic subproblem is solved by the truncated Conjugate Gradient Method and \({f {=}- \hat{{\mathcal {L}}}_{pen}}\). Let \(v \in {\mathcal {M}}\) be a nondegenerate local minimizer of f, H the Hessian of the reformulated penalized problem from Theorem 4 and \(\delta {<} 1\) the parameter of the termination criterion of tCG (“Appendix B”, line 18 in Algorithm 2). Then there exists \(c >1\) such that, for all sequences \(\{\theta ^t\}\) generated by Algorithm 1 converging to v, there exists \(T {>}0\) such that for all \(t {>} T\),

$$\begin{aligned} \left\Vert {{\,\mathrm{grad}\,}}\hat{{\mathcal {L}}}_{pen}(\theta ^{t+1})\right\Vert \le c \left\Vert {{\,\mathrm{grad}\,}}\hat{{\mathcal {L}}}_{pen}(\theta ^t)\right\Vert ^{\delta +1}. \end{aligned}$$
(33)

Proof

Let \(v \in {\mathcal {M}}\) be a nondegenerate local minimizer of f (maximizer of \(\hat{{\mathcal {L}}}_{pen}\)). We choose the termination criterion of tCG such that \(\delta <1\). According to Absil et al. (2008, Theorem 7.4.12), it suffices to show that \({{\,\mathrm{Hess}\,}}\hat{{\mathcal {L}}}_{pen}(R_{\theta }(\xi ))\) is Lipschitz continuous at \(0_{\theta }\) in a neighborhood of v. From the proof of Theorem 2, we know that close to v, we are bounded away from the boundary of \({\mathcal {M}}\), i.e. we are bounded away from points on the manifold with singular \(S_j\), \(j=1,\dots , K\). Thus, there exist neighborhoods \(B_{\epsilon _1}\), \(B_{\epsilon _2}\) such that for \(\theta \in B_{\epsilon _1}(v) \subset {\mathcal {M}}\) and \(\xi \in B_{\epsilon _2}\), the function \({{\,\mathrm{Hess}\,}}\hat{{\mathcal {L}}}_{pen}(R_{\theta }(\xi ))\) is continuously differentiable. From this, the local Lipschitz continuity follows directly with the Heine–Borel theorem. Thus, all assumptions of Absil et al. (2008, Theorem 7.4.12) are fulfilled and (33) holds. \(\square \)

3.2 Practical design choices

We apply Algorithm 1 to our cost function (9), where we seek to minimize \(f= - \hat{{\mathcal {L}}}_{pen}\). The quadratic subproblem in line 3 of Algorithm 1 is solved by the truncated Conjugate Gradient method (tCG) with the inner product (16). To speed up convergence of the tCG, we further use a preconditioner: at iteration t, we store the gradients computed in tCG and build an inverse Hessian approximation via the LBFGS formula. This inverse Hessian approximation is then used for the minimization of the next subproblem \({\hat{m}}_{\theta ^{t+1}}\). The use of such preconditioners has been suggested by Morales and Nocedal (2000) for solving a sequence of slowly varying systems of linear equations and gave a speed-up in convergence for our method. The initial TR radius is set by the method suggested by Sartenaer (1997), which is based on the model trust along the steepest-descent direction. The parameters \(\omega _1, \omega _2, \tau _1, \tau _2, \rho '\) in Algorithm 1 are chosen according to the suggestions in Gould et al. (2005) and Conn et al. (2000).

4 Numerical results

We test our method on the penalized version (9) of the objective function (3) on both simulated and real-world data sets. We compare our method against the (penalized) Expectation Maximization algorithm and the Riemannian LBFGS method proposed by Hosseini and Sra (2015). For all methods, we used the same initialization by running k-means++ (Arthur and Vassilvitskii 2007) and stopped all methods when the difference of the average log-likelihood for two subsequent iterates falls below \(10^{-10}\) or when the number of iterations exceeds 1500 (clustering) or 3000 (density approximation). For the Riemannian LBFGS method suggested by Hosseini and Sra (2015, 2020), we used the MixEst package (Hosseini and Mash’al 2015) kindly provided by one of the authors. For the Riemannian Trust-Region method, we mainly followed the code of the pymanopt Python package (Townsend et al. 2016), but adapted it for a faster implementation by computing the matrix inverse \(S_j^{-1}\) only once per iteration. We used Python version 3.7. The experiments were conducted on an Intel Xeon CPU X5650 at 2.67 GHz with 24 cores and 20 GB RAM.

In Sect. 4.1, we test our method on clustering problems on simulated data and on real-world datasets from the UCI Machine Learning Repository (Dua and Graff 2017). In Sect. 4.2 we consider Gaussian Mixture Models as probability density estimators and show the applicability of our method.

4.1 Clustering

We test our method for clustering tasks on different artificial (Sect. 4.1.1) and real-world (Sect. 4.1.2) data sets.

4.1.1 Simulated data

As the convergence speed of EM depends on the level of separation between the clusters, we test our methods on data sets with different degrees of separation, as proposed in Dasgupta (1999) and Hosseini and Sra (2015). The distributions are sampled such that their means satisfy

$$\begin{aligned} \left\Vert \mu _{i} - \mu _{j}\right\Vert ^2 \ge c\max _{i,j}\left( {{\,\mathrm{tr}\,}}(\Sigma _{i}), {{\,\mathrm{tr}\,}}(\Sigma _{j})\right) \end{aligned}$$

for \(i,j=1,\dots , K\), \(i \ne j\), where c models the degree of separation. Additionally, a low eccentricity (or condition number) of the covariance matrices has an impact on the performance of Expectation Maximization (Dasgupta 1999), for which reason we also consider different values of the eccentricity \(e = \sqrt{\frac{\lambda _{max}(\Sigma _j)}{\lambda _{min}(\Sigma _j)}}\), a measure of how strongly the data scatter.

We test our method on 20- and 40-dimensional data with an equal distribution among the clusters, i.e. we set \(\alpha _j = \frac{1}{K}\) for all \(j=1, \dots , K\). Although it is known that unbalanced mixing coefficients \(\alpha _j\) also result in slower EM convergence, this effect is less strong than the level of overlap (Naim and Gildea 2012), so we only show simulation results with balanced clusters. Results for unbalanced mixing coefficients and a varying number of components K are shown for real-world data in Sect. 4.1.2.
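One possible way to generate such data (our sketch; the exact protocol of Dasgupta (1999) may differ in details) is to draw covariance matrices with prescribed eccentricity and rejection-sample means until the separation condition holds:

```python
import numpy as np

rng = np.random.default_rng(0)

def covariance_with_eccentricity(d, e):
    """Random SPD matrix with sqrt(lambda_max / lambda_min) = e."""
    Q, _ = np.linalg.qr(rng.normal(size=(d, d)))        # random orthogonal basis
    return Q @ np.diag(np.linspace(1.0, e ** 2, d)) @ Q.T

def simulate_gmm_data(m, d, K, c, e):
    """m samples from a balanced K-component GMM with separation c and eccentricity e."""
    sigmas = [covariance_with_eccentricity(d, e) for _ in range(K)]
    sep = c * max(np.trace(S) for S in sigmas)          # conservative bound for all pairs
    mus = [rng.normal(size=d)]
    while len(mus) < K:                                 # rejection sampling of the means
        cand = rng.normal(scale=np.sqrt(sep), size=d)
        if all(np.sum((cand - mu) ** 2) >= sep for mu in mus):
            mus.append(cand)
    labels = rng.integers(K, size=m)                    # alpha_j = 1/K
    X = np.array([rng.multivariate_normal(mus[k], sigmas[k]) for k in labels])
    return X, labels
```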

Table 1 Simulation results of 20 runs for dimensions \(d=20\), number of components \(K=5\) and eccentricity \(e=1\)
Table 2 Simulation results of 20 runs for dimensions \(d=20\), number of components \(K=5\) and eccentricity \(e=10\)
Fig. 2 Average penalized log-likelihood reduction for highly overlapping clusters: \(d=20\), \(K=5\), \(e=1\), \(c=0.2\)

Fig. 3 Average penalized log-likelihood for highly overlapping clusters: \(d=20\), \(K=5\), \(e=1\), \(c=0.2\)

We first take a look at the 20-dimensional data sets, for which we simulated \(m=1000\) data points for each parameter setting. In Table 1, we show the results for very scattered data, that is \(e=1\). We see that, as predicted by the literature, Expectation Maximization converges slowly in such a case. This effect is even stronger for a lower separation constant c. The effect of the eccentricity becomes even clearer when comparing the results of Table 1 with Table 2. The Riemannian algorithms also converge more slowly for lower values of the eccentricity e and the separation level c; however, they seem to suffer less from hidden information than Expectation Maximization. The proposed Riemannian Newton Trust-Region algorithm (R-NTR) beats the other methods in terms of runtime and number of iterations (see Fig. 2). The Riemannian LBFGS (R-LBFGS) method by Hosseini and Sra (2015) also shows faster convergence than EM, but the gain from the exact second-order information provided by the Riemannian Hessian is evident. However, the R-LBFGS results created by the MixEst toolbox show long runtimes compared to the other methods. We see from Fig. 3 that the average penalized log-likelihood is slightly higher for R-LBFGS in some experiments. Still, the objective evaluated at the point satisfying the termination criterion is at a competitive level for all methods (see also Table 1).

When increasing the eccentricity (Table 2), we see that the Riemannian methods still need fewer iterations than EM, but our method is no longer faster in terms of runtime. This is because EM benefits from very low per-iteration costs and the gain in the number of iterations is less pronounced in this case. However, we see that the Riemannian Newton Trust-Region method is not substantially slower. Furthermore, the average log-likelihood values (ALL) are more or less equal for all methods, so we might assume that all methods stopped close to a similar optimum. This is also underlined by comparable mean squared errors (MSE) with respect to the true parameters from which the input data has been sampled. On average, the Riemannian Newton Trust-Region method gives the best results in terms of runtime and number of iterations.

Table 3 Simulation results of 20 runs for dimensions \(d=40\), number of components \(K=5\) and eccentricity \(e=1\) with \(m=1000\) observations

In Table 3, we show results for dimension \(d=40\), low eccentricity (\(e=1\)) and the same simulation protocol as above (in particular, \(m=1000\)). We observed that with our method, we only performed very few Newton-like steps and instead exceeded the trust-region radius within the tCG many times, leading to poorer steps (see also Fig. 5). One possible reason is that the number of parameters increases quadratically with d, that is in \({\mathcal {O}}(K d^2)\), while at the same time we did not increase the number of observations \(m=1000\). If we are too far from a local optimum and the clusters are not well initialized due to few observations, the factor \(f_l^i\) in the Hessian (Theorem 4) becomes small, leading to potentially large conjugate gradient steps (see Algorithm 2). Although this affects the E-step in the Expectation Maximization algorithm as well, the effect seems to be much more severe in our method.

To underline this, we show simulation results for a higher number of observations, \(m=10{,}000\), in Table 4 with the same true parameters \(\alpha _j, \mu _j, \Sigma _j\) as in Table 3. As expected, the superiority of our method in terms of runtime becomes visible: the R-NTR method beats Expectation Maximization by a factor of 4. Just like for the lower dimension \(d=20\), the mean average log-likelihood and the errors are comparable between our method and EM, whereas R-LBFGS shows slightly worse results although it now attains runtimes comparable to our method.

We thus see that the ratio between the number of observations and the number of parameters must be large enough in order to benefit from the Hessian information in our method.

Table 4 Simulation results of 20 runs for dimensions \(d=40\), number of components \(K=5\) and eccentricity \(e=1\) with \(m=10{,}000\) observations

4.1.2 Real-world data

In addition to the simulated data sets, we tested our method on real-world data sets from the UCI Machine Learning Repository (Dua and Graff 2017). For this, we normalized the data sets and tested the methods for different values of K.

Combined Cycle Power Plant Data Set (Kaya and Tufekci 2012). In Table 5, we show the results for the combined cycle power plant data set, a data set with 4 features. Although the dimension is quite low, we see that we can beat EM both in terms of runtime and number of iterations for almost all K by applying the Riemannian Newton Trust-Region method. This underlines the results shown for artificial data in Sect. 4.1.1. The gain of our method becomes even stronger when we consider a large number of components K, where the overlap between clusters is large: we then reach a local optimum in up to 15 times fewer iterations and with a time saving of a factor close to 4.

Fig. 4 Average penalized log-likelihood reduction for overlapping clusters: \(d=40\), \(K=5\), \(e=1\), \(c=1\)

Fig. 5 Average penalized log-likelihood for overlapping clusters: \(d=40\), \(K=5\), \(e=1\), \(c=1\)

Table 5 Results of the (normalized) combined cycle power plant data set for different numbers of components

MAGIC Gamma Telescope Data Set (Bock et al. 2004). We also study the behaviour on a data set with higher dimension and a larger number of observations, the MAGIC Gamma Telescope Data Set, see Table 6. Here, we also observe a lower number of iterations for the Riemannian optimization methods. Similar to the combined cycle power plant data set, this effect becomes even stronger for a high number of clusters, where the ratio of hidden information is large. Our method shows by far the best runtimes. For this data set, the average log-likelihood values are very close to each other, except for \(K=15\), where the ALL is worse for the Riemannian methods. It seems that in this case, the R-NTR and R-LBFGS methods end in different local maxima than EM. However, convergence to a global maximum is theoretically not ensured for any of the methods, and for all of them a globalization strategy like a split-and-merge approach (Li and Li 2009) might improve the final ALL values. As the MAGIC Gamma Telescope data set is a classification data set with 2 classes, we further report the classification performance in Table 7. We see that the geodesic distance defined on the manifold and the weighted mean squared errors (wMSE) are comparable between all three methods. In Table 8, we also report the Adjusted Rand Index (Hubert and Arabie 1985) for all methods. Although the clustering performance is very low compared to the true class labels (first row), we see that it is equal among the three methods.
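For reference, an Adjusted Rand Index as reported in Table 8 can be computed from hard cluster assignments (argmax of the responsibilities) as in the following sketch; the data and labels below are synthetic placeholders, not the MAGIC data.

```python
# Hedged sketch: Adjusted Rand Index from a fitted mixture's hard assignments
# (argmax of the responsibilities) against true class labels. Synthetic
# placeholder data, two overlapping clusters in 10 dimensions.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(500, 10)),
               rng.normal(1.0, 1.0, size=(500, 10))])
true_labels = np.repeat([0, 1], 500)

resp = GaussianMixture(n_components=2).fit(X).predict_proba(X)   # responsibilities
hard_labels = resp.argmax(axis=1)                                # soft -> hard clustering
print(adjusted_rand_score(true_labels, hard_labels))             # invariant to label permutation
```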

Table 6 Results of the (normalized) MAGIC Gamma Telescope data set for different numbers of components
Table 7 Weighted mean squared errors of the (normalized) MAGIC Gamma Telescope data set for \(K=2\)
Table 8 Adjusted Rand Index for the (normalized) MAGIC Gamma Telescope data set (\(K=2\))

We show results on additional real-world data sets in “Appendix C”.

4.2 Gaussian mixture models as density approximators

Besides the task of clustering (multivariate) data, Gaussian Mixture Models are also well known to serve as approximators of smooth probability density functions, given enough components (Scott 1992).

In this subsection, we demonstrate the applicability of our method to Gaussian mixture density approximation of a Beta-Gamma distribution and compare against EM and R-LBFGS for the estimation of the Gaussian components.

We consider a bivariate Beta-Gamma distribution with parameters \(\alpha _{Beta} =0.5\), \(\beta _{Beta} =0.5\), \(\alpha _{Gamma}=1\), \(\beta _{Gamma} = 1\), where the joint distribution is characterized by a Gaussian copula. The density function surface is visualized in Fig. 6.
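A minimal sketch of how such data can be generated via a Gaussian copula is given below; the copula correlation \(\rho\) is a hypothetical choice for illustration, as it is not specified here.

```python
# Minimal sketch (assumption for illustration): sampling from the bivariate
# Beta(0.5, 0.5)-Gamma(1, 1) distribution via a Gaussian copula. The copula
# correlation rho is a hypothetical choice.
import numpy as np
from scipy import stats

def sample_beta_gamma(m, rho=0.5, seed=0):
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, rho], [rho, 1.0]])
    z = rng.multivariate_normal(mean=np.zeros(2), cov=cov, size=m)  # correlated Gaussian draws
    u = stats.norm.cdf(z)                                           # map to uniform margins
    x1 = stats.beta.ppf(u[:, 0], a=0.5, b=0.5)                      # Beta(0.5, 0.5) margin
    x2 = stats.gamma.ppf(u[:, 1], a=1.0, scale=1.0)                 # Gamma(1, 1) margin
    return np.column_stack([x1, x2])

X = sample_beta_gamma(1000)   # 1000 realizations as in the simulation study
```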

Fig. 6 Probability density of the bivariate Beta(0.5, 0.5)-Gamma(1, 1) distribution

Table 9 Simulation results averaged over 100 simulation runs for the approximation of a Beta(0.5, 0.5)-Gamma(1, 1) distribution by Gaussian Mixture Models with different values of K
Fig. 7 Contours of the pointwise root mean squared error (RMSE) for density approximation via GMMs of the Beta(0.5, 0.5)-Gamma(1, 1) distribution

In each of 100 simulation runs, we simulated 1000 realizations of the Beta(0.5, 0.5)-Gamma(1, 1) distribution and fitted a Gaussian Mixture Model. We considered different values of K and compared the (approximated) root mean integrated squared error (RMISE).

The RMISE and its computational approximation are given by (Gross et al. 2015):

$$\begin{aligned} RMISE({\hat{f}})&= \sqrt{{\mathbb {E}} \left( \int \left( f(x) - {\hat{f}}(x)\right) ^2 dx\right) } \\&\approx \sqrt{\frac{1}{N} \sum \limits _{r=1}^N \left( f(g_r) - {\hat{f}}(g_r)\right) ^2 \delta _g^2 }, \end{aligned}$$

where f denotes the true underlying density, i.e. the Beta(0.5, 0.5)-Gamma(1, 1) density, \({\hat{f}}\) the density approximator (GMM), N the number of equidistant grid points, and \(\delta _g\) the associated grid width.
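A minimal sketch of this grid approximation for a single fitted model is shown below; the outer expectation is then approximated by averaging these values over the simulation runs. The two normal densities in the usage example are placeholders only, so that the snippet is self-contained.

```python
# Hedged sketch of the grid approximation of the RMISE for one fitted model.
import numpy as np
from scipy.stats import multivariate_normal

def rmise_single_run(f_true, f_hat, box=((0, 5), (0, 10)), n_per_axis=128):
    g1 = np.linspace(*box[0], n_per_axis)
    g2 = np.linspace(*box[1], n_per_axis)
    G1, G2 = np.meshgrid(g1, g2)
    grid = np.column_stack([G1.ravel(), G2.ravel()])   # N = 128^2 = 16,384 grid points
    cell_area = (g1[1] - g1[0]) * (g2[1] - g2[0])      # plays the role of delta_g^2
    sq_err = (f_true(grid) - f_hat(grid)) ** 2
    return np.sqrt(np.mean(sq_err) * cell_area)        # sqrt( (1/N) sum (f - f_hat)^2 * delta_g^2 )

# placeholder densities, only to demonstrate the call
f = lambda x: multivariate_normal.pdf(x, mean=[2.0, 5.0], cov=np.eye(2))
f_hat = lambda x: multivariate_normal.pdf(x, mean=[2.1, 5.2], cov=1.1 * np.eye(2))
print(rmise_single_run(f, f_hat))
```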

For our simulation study, we chose 16,384 grid points in the box \([0,5] \times [0,10]\). We show the results in Table 9, where we fit the parameters of the GMM with our method (R-NTR) and compare against GMM approximations whose parameters are fitted with EM and R-LBFGS. We observe that the RMISE is of comparable size for all methods and even slightly better for our method for \(K=2\), \(K=5\) and \(K=10\). Just as for the clustering results in Sect. 4.1, R-NTR requires much lower runtimes and a much lower total number of iterations. This improvement is especially remarkable for a larger number of components.

We also observe that, for all methods, the mean average log-likelihood (ALL) of the training data sets with 1000 observations attains higher values with an increasing number of components K. This supports the fact that the approximation power of GMMs for arbitrary density functions is expected to increase when adding further Gaussian components (Scott 1992; Goodfellow et al. 2016). On the other hand, the RMISE (which is not based on the training data) increased in our experiments with larger K, which means that we are in a situation of overfitting. The drawback of overfitting is well known for EM (Andrews 2018), and we also observed it for the R-NTR and R-LBFGS methods. However, the RMISE values are comparable, so none of the methods substantially outperforms another in terms of overfitting. This can also be seen from Fig. 7, which shows the distribution of the pointwise errors for \(K=2\) and \(K=5\). Although the R-LBFGS method shows higher error values on the boundary of the support of the distribution for \(K=5\), the errors show similar distributions among the three methods at a comparable level. Methodologies such as cross-validation (Murphy 2013) or applying a split-and-merge approach to the optimized parameters (Li and Li 2009) could address the problem of overfitting.

The results show that our method is well suited for density estimation tasks and especially for the clustering of real-world and simulated data.

5 Conclusion

We proposed a Riemannian Newton Trust-Region method for Gaussian Mixture Models. For this, we derived an explicit formula for the Riemannian Hessian and gave results on the convergence theory. Our method is a fast alternative to the well-known Expectation Maximization algorithm and to existing Riemannian approaches for Gaussian Mixture Models. Especially for highly overlapping components, the numerical results show that our method leads to an enormous speed-up both in terms of runtime and the total number of iterations for medium-sized problems. This makes it a favorable method for density approximation as well as for difficult clustering tasks for data with higher dimensions. Here, especially the availability of the Riemannian Hessian increases the convergence speed compared to quasi-Newton algorithms. For higher-dimensional data, we experimentally observed that our method still works very well and is faster than EM if the number of observations is large enough. We plan to examine this further and to look into data sets with higher dimensions. Here, it is common to impose a special structure on the covariance matrices to tackle the curse of dimensionality (McLachlan et al. 2019). Adapted versions of Expectation Maximization exist, and transferring Riemannian methods to such constrained settings is the subject of current research.