
Penalized maximum likelihood estimator for mixture of von Mises–Fisher distributions


The von Mises–Fisher distribution is one of the most widely used probability distributions to describe directional data. Finite mixtures of von Mises–Fisher distributions have found numerous applications. However, the likelihood function for the finite mixture of von Mises–Fisher distributions is unbounded and consequently the maximum likelihood estimation is not well defined. To address the problem of likelihood degeneracy, we consider a penalized maximum likelihood approach whereby a penalty function is incorporated. We prove strong consistency of the resulting estimator. An Expectation–Maximization algorithm for the penalized likelihood function is developed and experiments are performed to examine its performance.


In many areas of statistical modelling, data are represented as directions or unit length vectors (Mardia 1972; Jupp 1995; Mardia and Jupp 2000). The analysis of directional data has attracted much research interest in various disciplines, from hydrology (Chen et al. 2013) to biology (Boomsma et al. 2006), and from image analysis (Zhe et al. 2019) to text mining (Banerjee et al. 2005). The von Mises–Fisher (vMF) distribution is one of the most commonly used distributions to model data distributed on the surface of the unit hypersphere (Fisher 1953; Mardia and Jupp 2000). The vMF distribution has been applied successfully in many domains (e.g., Sinkkonen and Kaski 2002; Mcgraw et al. 2006; Bangert et al. 2010).

A mixture of vMF distributions (Banerjee et al. 2005) assumes that each observation is drawn from one of p vMF distributions. Applications of the vMF mixture model are diverse, including image analysis (Mcgraw et al. 2006) and text mining (Banerjee et al. 2005). More recently, it has been shown that the vMF mixture model can approximate any continuous density function on the unit hypersphere to an arbitrary degree of accuracy given a sufficient number of mixture components (Ng and Kwong 2020).

Various estimation strategies have been developed to perform model estimation, including the maximum likelihood approach (Banerjee et al. 2005) and Bayesian methods (Bagchi and Kadane 1991; Taghia et al. 2014). The maximum likelihood approach, which is typically performed using the Expectation–Maximization (EM) algorithm (Dempster et al. 1977; Banerjee et al. 2005), is among the most popular approaches to parameter estimation. However, as we show in Sect. 3, the likelihood function is unbounded from above, and consequently a global maximum likelihood estimate (MLE) fails to exist.

The unboundedness of the likelihood function occurs in various mixture modelling contexts, particularly for mixture models with location-scale family distributions, including the mixture of normal distributions (Ciuperca et al. 2003; Chen et al. 2008) and the mixture of Gamma distributions (Chen et al. 2016). Various approaches have been developed to tackle the likelihood degeneracy problem, including resorting to local estimates (Peters et al. 1978), compactification of the parameter space (Redner 1981), and constrained maximization of the likelihood function (Hathaway 1985).

An alternative solution to the likelihood degeneracy problem is to penalize the likelihood function such that the resulting penalized likelihood function is bounded and the existence of the penalized MLE is guaranteed. The penalized maximum likelihood approach has been applied to normal mixture models (Ciuperca et al. 2003; Chen et al. 2008) and to two-parameter Gamma mixture models (Chen et al. 2016). A penalized likelihood approach also has a Bayesian interpretation (Good and Gaskins 1971; Ciuperca et al. 2003), whereby the penalized likelihood function corresponds to a posterior density and the penalized maximum likelihood solution to the maximum a posteriori estimate.

Previously, the penalized maximum likelihood approach was applied to study the mixture of von Mises distributions (Chen et al. 2008), where consistency results were obtained. The von Mises distribution is a special case of the von Mises–Fisher distribution defined on the circle. We generalize the results in Chen et al. (2008) to the sphere of arbitrary dimension. The consistency proof in Chen et al. (2008) relies heavily on the univariate properties of the von Mises distribution, and generalization of the result to higher dimensions is not straightforward. In this paper we prove a few useful technical lemmas before proving the main results. To handle the non-identifiability of the mixture models, we use the framework of Redner (1981) to obtain consistency in the quotient space.

In this paper, we consider the penalized likelihood approach to tackle the problem of likelihood unboundedness for the mixture of vMF distributions. We incorporate a penalty term into the likelihood function and maximize the resulting penalized likelihood function. We study conditions on the penalty function to ensure consistency of the penalized maximum likelihood estimator (PMLE). We develop an Expectation–Maximization algorithm to perform model estimation based on the penalized likelihood function. The rest of the paper is structured as follows. Section 2 introduces the background on vMF mixtures and key notation used in the subsequent sections. The problem of likelihood degeneracy is formally presented in Sect. 3. Section 4 develops the penalized maximum likelihood approach and discusses conditions on the penalty function to ensure strong consistency of the resulting estimator. An Expectation–Maximization algorithm is also developed in Sect. 4, and its performance is examined in Sect. 5. Section 6 illustrates the proposed EM algorithm using a data application. We conclude the paper with a discussion section.


The probability density function of a d-dimensional vMF distribution is given by:

$$\begin{aligned} f({\mathbf {x}};\varvec{\mu }, \kappa ) = c_{d}(\kappa ) e^{\kappa \varvec{\mu }^{T} {\mathbf {x}}} , \end{aligned}$$

where \({\mathbf {x}} \in {\mathbb {S}}^{d-1}\) is a d-dimensional unit vector (i.e. \( {\mathbb {S}}^{d-1} := \{ {\mathbf {x}} \in {\mathbb {R}}^d: ||{\mathbf {x}}|| = 1 \}\) where \(||\cdot ||\) is the \(L_2\) norm), \(\varvec{\mu } \in {\mathbb {S}}^{d-1}\) is the mean direction, and \(\kappa \ge 0\) is the concentration parameter. The normalizing constant \(c_d(\kappa )\) has the form

$$\begin{aligned} c_d(\kappa ) = \frac{\kappa ^{d/2 - 1}}{(2 \pi )^{d/2} I_{d/2-1}(\kappa )}, \end{aligned}$$

where \(I_r(\cdot )\) is the modified Bessel function of the first kind of order r. The vMF distribution becomes increasingly concentrated at the mean direction \(\varvec{\mu }\) as the concentration parameter \(\kappa \) increases. The case \(\kappa =0\) corresponds to the uniform distribution on \({\mathbb {S}}^{d-1}\).
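As an illustration of the density above, the following is a minimal Python sketch of the vMF log-density. The function names `vmf_log_c` and `vmf_logpdf` are hypothetical and not part of the paper. The sketch assumes NumPy and SciPy are available and uses the exponentially scaled Bessel function `scipy.special.ive`, so that \(\log I_{d/2-1}(\kappa )\) can be evaluated without overflow for large \(\kappa \). The case \(\kappa = 0\) is handled through the limit \(c_d(0) = \varGamma (d/2) / (2 \pi ^{d/2})\), the reciprocal of the surface area of \({\mathbb {S}}^{d-1}\).

```python
import numpy as np
from scipy.special import gammaln, ive


def vmf_log_c(kappa, d):
    """Log of the normalizing constant c_d(kappa) for a scalar kappa >= 0."""
    nu = d / 2.0 - 1.0
    if kappa == 0.0:
        # Uniform limit: c_d(0) = Gamma(d/2) / (2 * pi^(d/2)).
        return gammaln(d / 2.0) - np.log(2.0) - (d / 2.0) * np.log(np.pi)
    # ive(nu, kappa) = I_nu(kappa) * exp(-kappa), so add kappa back on the log scale.
    log_bessel = np.log(ive(nu, kappa)) + kappa
    return nu * np.log(kappa) - (d / 2.0) * np.log(2.0 * np.pi) - log_bessel


def vmf_logpdf(x, mu, kappa):
    """Log-density log f(x; mu, kappa) for unit vectors x (rows) and mu."""
    x = np.asarray(x, dtype=float)
    mu = np.asarray(mu, dtype=float)
    d = mu.shape[-1]
    return vmf_log_c(kappa, d) + kappa * (x @ mu)


# Example: density at the mean direction on the sphere S^2 (d = 3).
mu = np.array([0.0, 0.0, 1.0])
print(np.exp(vmf_logpdf(mu, mu, 2.0)))
```

For \(d = 3\) the constant reduces to the closed form \(c_3(\kappa ) = \kappa / (4 \pi \sinh \kappa )\), which gives a convenient sanity check for the sketch.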

The probability density function of a mixture of vMF distributions with p mixture components can be expressed as

$$\begin{aligned} g({\mathbf {x}}; \{\pi _k, \varvec{\mu }_k, \kappa _k\}_{k=1}^{p}) = \sum _{k=1}^{p} \pi _k f({\mathbf {x}}; \varvec{\mu }_k, \kappa _k) , \end{aligned}$$

where \((\pi _1, \ldots , \pi _p)\) are the mixing proportions, \((\varvec{\mu }_k, \kappa _k)\) are the parameters for the kth component of the mixture, and \(f(\cdot ; \varvec{\mu }_k, \kappa _k)\) is the vMF density function defined in (1).

We let \(\varTheta := \{ \theta \equiv (\varvec{\mu }, \kappa ): \varvec{\mu } \in {\mathbb {S}}^{d-1}, \kappa \ge 0 \}\) be the parameter space of the vMF distribution, with the metric \(\rho ( \cdot , \cdot )\) defined as

$$\begin{aligned} \rho (\theta _1, \theta _2) = \text{ arccos }(\varvec{\mu }_1^{T} \varvec{ \mu }_2) + |\kappa _1 - \kappa _2| , \end{aligned}$$

for \(\theta _1 = (\varvec{\mu }_1, \kappa _1), \theta _2 = (\varvec{\mu }_2, \kappa _2)\). For any \(\theta = (\varvec{\mu }, \kappa ) \in \varTheta \), we write \(f_{\theta }(\cdot ) := f(\cdot ; \varvec{\mu }, \kappa )\) for the density function and \(\gamma _{\theta }\) for the corresponding measure. The space of mixing probabilities is denoted by \(\varPi := \{ (\pi _1, \ldots , \pi _p): \sum _{k=1}^{p} \pi _k = 1, \pi _k \ge 0, k = 1, \ldots , p \}\). A p-component mixture of vMF distributions can be expressed as \(\gamma = \sum _{k=1}^{p} \pi _k \gamma _{\theta _k}\) where \((\pi _1, \ldots , \pi _p) \in \varPi \) and \((\theta _1, \ldots , \theta _p) \in \varTheta ^{p}\), and where \(\varTheta ^{p} = \varTheta \times \cdots \times \varTheta \) is the product of the parameter spaces. We define the product space \(\varGamma := \varPi \times \varTheta ^{p}\), and we slightly abuse notation to let \(\gamma \) denote both the mixing measure and the parameters in \(\varGamma \). While \(\varGamma \) is a natural parameterization of the family of mixtures of vMF distributions, elements of \(\varGamma \) are not identifiable. Thus, we let \({\tilde{\varGamma }}\) be the quotient topological space obtained from \(\varGamma \) by identifying all parameters \((\pi _1, \ldots , \pi _p, \theta _1, \ldots , \theta _p)\) such that their corresponding densities are equal almost everywhere. For the rest of the paper, we assume that the number of mixture components p is known.

Likelihood degeneracy

We investigate the likelihood degeneracy problem of the vMF mixture model in this section. For any observations generated from a vMF mixture model with two or more mixture components, we show that the resulting likelihood function on the parameter space \(\varGamma \) is unbounded above. As discussed in Sect. 1, likelihood degeneracy is a common problem for mixture models with location-scale distributions, including the normal mixtures. In the case of normal mixture distributions, one can show that by setting the mean parameter of a mixture component equal to one of the observations and letting the variance of the same mixture component converge to zero while holding the other parameters fixed, the likelihood function diverges to positive infinity (Chen et al. 2008).

For the vMF mixture distributions, the likelihood unboundedness can be best understood in the special case of \(x \in {\mathbb {S}}^1\), that is, the mixture of von Mises distributions. The von Mises distribution, also known as the circular normal distribution, approaches a normal distribution for large concentration parameter \(\kappa \):

$$\begin{aligned} f(x|\mu , \kappa ) \approx \frac{1}{\sigma \sqrt{2 \pi }} \exp \bigg [ \frac{-(x - \mu )^2}{2 \sigma ^2} \bigg ] , \end{aligned}$$

with \(\sigma ^2 = 1 / \kappa \), and the approximation converges uniformly as \(\kappa \) goes to infinity. Therefore, the likelihood function of a mixture of von Mises distributions diverges to infinity when the mean parameter of a mixture component is set equal to one of the observations and the corresponding concentration parameter diverges to infinity.

We now consider the general case of the vMF mixture models. Let \(\mathcal{X} = \{{\mathbf {x}}_1, \ldots , {\mathbf {x}}_n\}\) be the observations generated from a mixture of vMF distributions with density function \(\sum _{k=1}^{p} \pi _k f_{\theta _k}(\cdot )\) where \(\theta _k = (\varvec{\mu }_k, \kappa _k)\). The likelihood function can be expressed as:

$$\begin{aligned} L(\mathcal{X}; \varvec{\theta }, \varvec{\pi }) = \prod _{i=1}^{n} \sum _{k=1}^{p} \pi _k f({\mathbf {x}}_i; \varvec{\mu }_k, \kappa _k) , \end{aligned}$$

where \(\varvec{\theta } = (\theta _1, \ldots , \theta _p) = ((\varvec{\mu }_1, \kappa _1), \ldots , (\varvec{\mu }_p, \kappa _p))\) and \(\varvec{\pi } = (\pi _1, \ldots , \pi _p)\). We can show that by setting the mean direction \(\varvec{\mu }_k\) of one of the mixture components equal to an arbitrary observation and letting the corresponding concentration parameter \(\kappa _k\) go to infinity, the resulting likelihood function diverges.

Theorem 1

For any observations \(\mathcal{X} = ({\mathbf {x}}_1, \ldots , {\mathbf {x}}_n)\), there exists a sequence \((\varvec{\theta }^{(q)}, \varvec{\pi }^{(q)})\), \(q=1,2,\ldots \) such that \(L(\mathcal{X}; \varvec{\theta }^{(q)}, \varvec{\pi }^{(q)}) \uparrow \infty \) as \(q \rightarrow \infty \).

The proof of Theorem 1 is provided in the “Appendix”. The unboundedness of the likelihood function on the parameter space implies that the maximum likelihood estimator is not well defined.
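The degeneracy can also be observed numerically. The sketch below is an illustration only and is not the construction used in the proof: it assumes \(d = 3\), a hypothetical toy data set consisting of the six points \(\pm e_1, \pm e_2, \pm e_3\), and a two-component mixture with equal weights in which the second component is uniform (\(\kappa _2 = 0\)). The mean direction of the first component is fixed at the first observation and \(\kappa _1\) is increased. The function names are hypothetical, and the closed form \(c_3(\kappa ) = \kappa / (4 \pi \sinh \kappa )\) is used to keep the log-density numerically stable.

```python
import numpy as np


def vmf3_logpdf(x, mu, kappa):
    """Log-density of the vMF distribution on S^2 (d = 3) for kappa > 0.

    Uses c_3(kappa) = kappa / (4 pi sinh(kappa)), rewritten so that no
    large exponential is ever formed.
    """
    return (np.log(kappa) - np.log(2.0 * np.pi)
            + kappa * (x @ mu - 1.0)
            - np.log1p(-np.exp(-2.0 * kappa)))


def mixture_loglik(X, mu, kappa):
    """Log-likelihood of 0.5 * vMF(mu, kappa) + 0.5 * Uniform(S^2)."""
    log_f1 = np.log(0.5) + vmf3_logpdf(X, mu, kappa)
    log_f2 = np.full(len(X), np.log(0.5) - np.log(4.0 * np.pi))
    return np.logaddexp(log_f1, log_f2).sum()


# Toy data (an assumption for illustration): the six axis points on S^2.
X = np.vstack([np.eye(3), -np.eye(3)])

# Put the mean direction on the first observation and let kappa grow.
kappas = [1e2, 1e4, 1e6, 1e8]
logliks = [mixture_loglik(X, X[0], k) for k in kappas]
for k, ll in zip(kappas, logliks):
    print(f"kappa = {k:.0e}   log-likelihood = {ll:.3f}")
```

In this toy setting the contribution of the first observation grows like \(\log \kappa _1\), while the contributions of the remaining observations stay bounded below because of the uniform component, so the printed log-likelihood values increase without bound.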

Penalized maximum likelihood estimation


Let \(\gamma _0 \in \varGamma \) be the true mixing measure for the mixture of vMF distributions with corresponding density function \(f_0\) on \({\mathbb {S}}^{d-1}\). We let M be the maximum of the true density \(f_0\):

$$\begin{aligned} M := \max _{{\mathbf {x}} \in {\mathbb {S}}^{d-1}} f_0({\mathbf {x}}) , \end{aligned}$$

and define the metric \(d({\mathbf {x}}, {\mathbf {y}}) = \arccos ({\mathbf {x}}^{T} {\mathbf {y}})\) on \({\mathbb {S}}^{d-1}\) as the angle between two unit vectors \({\mathbf {x}}, {\mathbf {y}} \in {\mathbb {S}}^{d-1}\). For any fixed \({\mathbf {x}} \in {\mathbb {S}}^{d-1}\) and positive number \(\epsilon \), the \(\epsilon \)-ball in \({\mathbb {S}}^{d-1}\) centered at \({\mathbf {x}}\) is defined as \(B_{\epsilon }({\mathbf {x}}) = \{{\mathbf {y}} \in {\mathbb {S}}^{d-1}: d({\mathbf {x}}, {\mathbf {y}}) < \epsilon \}\). For any measurable set \(B \subset {\mathbb {S}}^{d-1}\), the spherical measure of B is given by \( \omega (B) := \int _{B} d \omega \), where \(d \omega \) is the standard surface measure on \({\mathbb {S}}^{d-1}\).

For any \({\mathbf {x}} \in {\mathbb {S}}^{d-1}\) and small positive number \(\epsilon \), the measure of the ball \(B_{2\epsilon }({\mathbf {x}})\) in \({\mathbb {S}}^{d-1}\) is given by (Li 2011)

$$\begin{aligned} \omega (B_{2 \epsilon }({\mathbf {x}}))&= \frac{2 \pi ^{(d-1)/2}}{\varGamma (\frac{d-1}{2})} \int _{0}^{2\epsilon } \sin ^{d-2}(\theta ) d\theta \\&\le 2^{d-1} \frac{2 \pi ^{(d-1)/2}}{\varGamma (\frac{d-1}{2})} \epsilon ^{d-1} \\&= A_2 \epsilon ^{d-1}, \end{aligned}$$

where

$$\begin{aligned} A_2 = 2^{d-1} \frac{2 \pi ^{(d-1)/2}}{\varGamma (\frac{d-1}{2})} . \end{aligned}$$
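The inequality in the bound on \(\omega (B_{2 \epsilon }({\mathbf {x}}))\) can be checked with the elementary bound \(0 \le \sin \theta \le \theta \), which holds on \([0, 2\epsilon ]\) for small \(\epsilon \). A short worked version, assuming \(d \ge 2\), is:

$$\begin{aligned} \int _{0}^{2\epsilon } \sin ^{d-2}(\theta ) d\theta \le \int _{0}^{2\epsilon } \theta ^{d-2} d\theta = \frac{(2\epsilon )^{d-1}}{d-1} \le 2^{d-1} \epsilon ^{d-1} . \end{aligned}$$

For \(d = 2\) both inequalities hold with equality, which matches the fact that the arc \(B_{2\epsilon }({\mathbf {x}})\) on the circle has length \(4 \epsilon \).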

We define the function \(\delta (\cdot )\) by

$$\begin{aligned} \delta (\epsilon ) := M A_2 \epsilon ^{d-1} , \end{aligned}$$

where the constants M and \(A_2\) are defined in Eqs. (4) and (7), respectively. The function \(\delta (\cdot )\) plays a crucial role in Lemmas 1 and 2, which are analogous to Lemmas 1 and 2 in Chen et al. (2008). They provide (almost sure) upper bounds on the number of observations in a small \(\epsilon \)-ball in \({\mathbb {S}}^{d-1}\). The upper bound in Lemma 1 is for each fixed \(\epsilon \) in an interval, whereas the upper bound in Lemma 2 holds uniformly for all \(\epsilon \) in the same interval. The proof of Lemma 1 is given in the “Appendix”. The proof of Lemma 2 is similar to the proof of Lemma 2 in Chen et al. (2008) and is omitted. Lemmas 1 and 2 are crucial to ensure consistency of the penalized maximum likelihood estimator.

We note that Lemmas 1 and 2 may be generalized by relaxing the assumption that the true density is a mixture of vMF densities. This is possible because the vMF assumption does not play a crucial role. Such a generalization has been obtained for normal mixtures (Chen et al. 2016, Lemma 3.2). However, this is not required for the proof of our main result.

Lemma 1

For any sufficiently small positive number \(\xi _0\), as \(n \rightarrow \infty \), and for each fixed \(\epsilon \) such that

$$\begin{aligned} \frac{\log n}{M n A_2} \le \epsilon ^{d-1} < \xi _0 , \end{aligned}$$

the following inequality holds except for a zero probability event:

$$\begin{aligned} \sup _{\varvec{\mu } \in {\mathbb {S}}^{d-1}} \bigg \{ \frac{1}{n} \sum _{i=1}^{n} I\big (X_i \in B_{\epsilon }(\varvec{\mu })\big ) \bigg \} \le 2 \delta (\epsilon ). \end{aligned}$$

Uniformly for all \(\epsilon \) such that

$$\begin{aligned} 0< \epsilon ^{d-1} < \frac{\log n}{M n A_2}, \end{aligned}$$

the following inequality holds except for a zero probability event:

$$\begin{aligned} \sup _{\varvec{\mu } \in {\mathbb {S}}^{d-1}} \bigg \{ \frac{1}{n} \sum _{i=1}^{n} I\big (X_i \in B_{\epsilon }(\varvec{\mu })\big ) \bigg \} \le 2 \frac{(\log n)^2}{n} . \end{aligned}$$

Lemma 2

For any sufficiently small positive number \(\xi _0\), as \(n \rightarrow \infty \), uniformly for all \(\epsilon \) such that

$$\begin{aligned} \frac{\log n}{M n A_2} \le \epsilon ^{d-1} < \xi _0 , \end{aligned}$$

the following inequality holds except for a zero probability event:

$$\begin{aligned} \sup _{\varvec{\mu } \in {\mathbb {S}}^{d-1}} \bigg \{ \frac{1}{n} \sum _{i=1}^{n} I\big (X_i \in B_{\epsilon }(\varvec{\mu })\big ) \bigg \} \le 4 \delta (\epsilon ). \end{aligned}$$

Uniformly for all \(\epsilon \) such that

$$\begin{aligned} 0< \epsilon ^{d-1} < \frac{\log n}{M n A_2}, \end{aligned}$$

the following inequality holds except for a zero probability event:

$$\begin{aligned} \sup _{\varvec{\mu } \in {\mathbb {S}}^{d-1}} \bigg \{ \frac{1}{n} \sum _{i=1}^{n} I\big (X_i \in B_{\epsilon }(\varvec{\mu })\big ) \bigg \} \le 2 \frac{(\log n)^2}{n} . \end{aligned}$$

Penalized maximum likelihood estimator

For any mixing measure of a p-component mixture \(\gamma = \sum _{l=1}^{p} \pi _l \gamma _{\theta _l} \) in \(\varGamma \), and n i.i.d. observations \(\mathcal{X}\), the penalized log-likelihood function is defined as

$$\begin{aligned} pl_n(\gamma ) = l_n(\gamma ) + p_n(\varvec{\kappa }) \end{aligned}$$

where \(l_n(\gamma )\) is the log-likelihood function:

$$\begin{aligned} l_n(\gamma ) = \sum _{i=1}^{n} \log \bigg \{ \sum _{k=1}^{p} \pi _k f({\mathbf {x}}_i; \varvec{\mu }_k, \kappa _k) \bigg \}, \end{aligned}$$

and \(p_n(\cdot )\) is a penalty function that depends on \(\varvec{\kappa } = (\kappa _1, \ldots , \kappa _p)\). Note that we slightly abuse notation by letting \(p_n(\cdot )\) denote the penalty function and p denote the number of mixture components. We impose the following conditions on the penalty function \(p_n(\cdot )\).

  1. C1

    \(p_n(\varvec{\kappa }) = \sum _{l=1}^{p} {\tilde{p}}_n(\kappa _l)\),

  2. C2

    For \(l=1,\ldots , p\), \(\sup _{\kappa _l > 0} \max \{0, {\tilde{p}}_n(\kappa _l) \} = o(n)\) and \({\tilde{p}}_n(\kappa _l) = o(n)\) for each fixed \(\kappa _l \ge 0\),

  3. C3

    For \(l=1, \ldots , p\), and for

    $$\begin{aligned} 0 < \frac{1}{\log (\kappa _l)^{2d - 2}} \le \frac{\log n}{M n A_2} , \end{aligned}$$

    \({\tilde{p}}_n(\kappa _l) \le -3 (\log n)^{2} \log \kappa _l\) for large enough n.

Conditions C1–C3 on the penalty function are analogous to the three conditions proposed in Chen et al. (2008). Condition C1 assumes that the penalty function is of additive form. Condition C2 ensures that the penalty is not overly strong, while condition C3 allows the penalty to be severe when the concentration parameter is very large. Recall the true mixing measure \(\gamma _0 \in \varGamma \), and let \({\hat{\gamma }}_n\) denote the maximizer of the penalized log-likelihood function defined in Eq. (13). The following main result of this paper shows that the maximizer of the penalized log-likelihood function is strongly consistent.

Theorem 2

Let \({\hat{\gamma }}_n\) be the maximizer of the penalized log-likelihood \(pl_n(\gamma )\), then \({\hat{\gamma }}_n \rightarrow \gamma _0\) almost surely in the quotient topological space \({\tilde{\varGamma }}\).

EM algorithm

We develop an Expectation–Maximization algorithm to maximize the penalized log-likelihood function defined in Eq. (13). By condition C1, the penalty function is assumed to have the form \(p_n(\varvec{\kappa }) = \sum _{l=1}^{p} {\tilde{p}}_n(\kappa _l)\). We consider \({\tilde{p}}_n(\kappa _l)\) of the form \({\tilde{p}}_n(\kappa _l) = - \psi _n \kappa _l\) for all l, where \(\psi _n \propto n^{-1}\) is a constant that depends on the sample size n. In particular, we may set \(\psi _n = \zeta / n\) for some constant \(\zeta > 0\) or \(\psi _n = S_x / n\) where \(S_x\) is the sample circular variance.

The resulting penalty function clearly satisfies condition C2. We note that condition C3 is also satisfied since for

$$\begin{aligned} 0 < \frac{1}{\log (\kappa _l)^{2d - 2}} \le \frac{\log n}{M n A_2} , \end{aligned}$$

we have

$$\begin{aligned} \kappa _l \approx \exp \big ( (n / \log n)^{1/(2d-2)} \big ) . \end{aligned}$$
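The verification of condition C3 can be spelled out a little further for the choice \(\psi _n = \zeta / n\). The following is a sketch under two assumptions: the expression \(\log (\kappa _l)^{2d-2}\) in C3 is read as \((\log \kappa _l)^{2d-2}\), and the symbol \(t_n\) is introduced here only for illustration. On the range considered in C3,

$$\begin{aligned} \log \kappa _l \ge t_n := \bigg ( \frac{M n A_2}{\log n} \bigg )^{1/(2d-2)} , \end{aligned}$$

and the requirement \({\tilde{p}}_n(\kappa _l) = -\psi _n \kappa _l \le -3 (\log n)^2 \log \kappa _l\) is equivalent to

$$\begin{aligned} \frac{\kappa _l}{\log \kappa _l} \ge \frac{3 n (\log n)^2}{\zeta } . \end{aligned}$$

Since \(\kappa / \log \kappa \) is increasing for \(\kappa > e\), the left hand side is at least \(e^{t_n} / t_n\), which grows faster than any power of n, whereas the right hand side grows like \(n (\log n)^2\). Under these assumptions the inequality therefore holds for all sufficiently large n.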

The EM algorithm developed in Banerjee et al. (2005) can be easily modified to incorporate an additional penalty function. The E-Step of the penalized EM involves computing the conditional probabilities:

$$\begin{aligned} p(Z_i = h|{\mathbf {x}}_i, \varvec{\theta }) = \frac{\pi _h f({\mathbf {x}}_i; \theta _h)}{ \sum _{l=1}^{p} \pi _l f({\mathbf {x}}_i;\theta _l) }, \quad h=1,\ldots , p, \end{aligned}$$
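The E-step above can be sketched in Python as follows. This is a minimal illustration rather than the implementation used in the paper: the function names `vmf_log_c` and `e_step` are hypothetical, the mixing proportions are assumed to be strictly positive, and the posterior probabilities are computed on the log scale with `scipy.special.logsumexp` to avoid underflow when some \(\kappa _l\) is large.

```python
import numpy as np
from scipy.special import gammaln, ive, logsumexp


def vmf_log_c(kappa, d):
    """Log of the normalizing constant c_d(kappa) for a scalar kappa >= 0."""
    nu = d / 2.0 - 1.0
    if kappa == 0.0:
        # Uniform limit: c_d(0) = Gamma(d/2) / (2 * pi^(d/2)).
        return gammaln(d / 2.0) - np.log(2.0) - (d / 2.0) * np.log(np.pi)
    # ive(nu, kappa) = I_nu(kappa) * exp(-kappa).
    log_bessel = np.log(ive(nu, kappa)) + kappa
    return nu * np.log(kappa) - (d / 2.0) * np.log(2.0 * np.pi) - log_bessel


def e_step(X, pi, mu, kappa):
    """Return p(Z_i = h | x_i) as an (n, p) array and the mixture log-likelihood.

    X: (n, d) unit vectors, pi: (p,), mu: (p, d) unit vectors, kappa: (p,).
    """
    d = X.shape[1]
    log_c = np.array([vmf_log_c(k, d) for k in kappa])
    # log(pi_h) + log f(x_i; theta_h) for every observation i and component h.
    log_w = np.log(pi) + log_c + (X @ mu.T) * kappa
    log_norm = logsumexp(log_w, axis=1, keepdims=True)
    return np.exp(log_w - log_norm), float(log_norm.sum())


# Small example with hypothetical parameter values.
X = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
pi = np.array([0.3, 0.7])
mu = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
kappa = np.array([5.0, 5.0])
resp, loglik = e_step(X, pi, mu, kappa)
print(resp)
```

In this example the third observation is orthogonal to both mean directions and the two concentration parameters are equal, so its posterior probabilities reduce to the mixing proportions.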

where \(Z_i\) is the latent variable denoting the cluster membership of the ith observation. For the M-step, using the method of Lagrange multipliers, we optimize the full conditional penalized log-likelihood function below

$$\begin{aligned}&\sum _{l=1}^{p} \bigg [ \sum _{i=1}^{n} (\log ( \pi _l ) + \log (c_d(\kappa _l)) ) p(Z_i=l| {\mathbf {x}}_i, \varvec{\theta })\\&\quad + \sum _{i=1}^{n} \kappa _l \varvec{\mu }_l^{T} {\mathbf {x}}_i p(Z_i=l| {\mathbf {x}}_i, \varvec{\theta }) - \psi _n \kappa _l + \lambda _l (1 - \varvec{\mu }_l^{T} \varvec{\mu }_l ) \bigg ] \end{aligned}$$

with respect to \(\varvec{\mu }_h, \kappa _h, \pi _h\) for \(h=1,\ldots ,p\), which gives:

$$\begin{aligned} {\hat{\pi }}_h= & {} \frac{1}{n} \sum _{i=1}^{n} p(Z_i = h|{\mathbf {x}}_i, \varvec{\theta }) \end{aligned}$$
$$\begin{aligned} \hat{\varvec{\mu }}_h= & {} \frac{ r_h }{ ||r_h||} \end{aligned}$$
$$\begin{aligned} \frac{I_{d/2}({\hat{\kappa }}_h)}{I_{d/2-1}({\hat{\kappa }}_h)}= & {} \frac{- \psi _n + ||r_h||}{\sum _{i=1}^{n} p(Z_i=h|{\mathbf {x}}_i, \varvec{\theta })} \end{aligned}$$

where \( r_h = \sum _{i=1}^{n} {\mathbf {x}}_i p(Z_i = h|{\mathbf {x}}_i, \varvec{\theta })\). We note that the assumption on \(\psi _n\) implies that \(-\psi _n + ||r_h|| \ge 0\) almost surely as \(n \rightarrow \infty \). For a finite sample size, however, there is a non-zero probability that \( -\psi _n + ||r_h|| < 0\), in which case the updating equation for \(\kappa _h\) is not well defined since the left hand side of Eq. (17) is non-negative. The left hand side of Eq. (17) is a strictly increasing function from \([0, \infty )\) to [0, 1) (Schou 1978; Hornik and Grün 2014), and in particular \({\hat{\kappa }}_h = 0\) whenever

$$\begin{aligned} \frac{I_{d/2}({\hat{\kappa }}_h)}{I_{d/2-1}({\hat{\kappa }}_h)} = 0. \end{aligned}$$

Thus, we can simply set \(\kappa _h = 0\) whenever \( -\psi _n + ||r_h|| < 0\). To solve Eq. (17) for \({\hat{\kappa }}_h\), various approximations have been proposed (Banerjee et al. 2005; Tanabe et al. 2007; Song et al. 2012). Section 2.2 of Hornik and Grün (2014) contains a detailed review of available approximations. We consider the approximation used in Banerjee et al. (2005):

$$\begin{aligned} {\hat{\kappa }}_h \approx \frac{\rho _h (d - \rho _h^2)}{ 1 - \rho _h^{2}} , \end{aligned}$$


where

$$\begin{aligned} \rho _h = \frac{- \psi _n + ||r_h||}{\sum _{i=1}^{n} p(Z_i=h|{\mathbf {x}}_i, \varvec{\theta })} . \end{aligned}$$
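The M-step updates, together with the approximation for \({\hat{\kappa }}_h\), can be sketched in Python as follows. This is an illustration under stated assumptions and not the authors' code: the function name `m_step` is hypothetical, `resp` denotes the \(n \times p\) array of posterior probabilities from the E-step, every component is assumed to have positive total weight and a non-zero \(r_h\), and negative values of \(-\psi _n + ||r_h||\) are handled by setting \({\hat{\kappa }}_h = 0\) as described above.

```python
import numpy as np


def m_step(X, resp, psi_n):
    """Penalized M-step for a vMF mixture.

    X: (n, d) unit vectors, resp: (n, p) posterior probabilities,
    psi_n: non-negative penalty constant (psi_n = 0 gives the ordinary update).
    """
    n, d = X.shape
    n_h = resp.sum(axis=0)                   # sum_i p(Z_i = h | x_i)
    pi_hat = n_h / n
    r = resp.T @ X                           # (p, d) weighted resultant vectors r_h
    r_norm = np.linalg.norm(r, axis=1)
    mu_hat = r / r_norm[:, None]
    # rho_h = (-psi_n + ||r_h||) / n_h, truncated at zero so that kappa_h = 0
    # whenever the numerator is negative.
    rho = np.clip((r_norm - psi_n) / n_h, 0.0, None)
    # Approximation of Banerjee et al. (2005) to the solution of Eq. (17).
    kappa_hat = rho * (d - rho**2) / (1.0 - rho**2)
    return pi_hat, mu_hat, kappa_hat


# Small example with hard (0/1) posterior probabilities, for illustration.
X = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0],
              [-1.0, 0.0, 0.0], [0.0, -1.0, 0.0]])
resp = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
pi_hat, mu_hat, kappa_hat = m_step(X, resp, psi_n=1.0 / len(X))
print(pi_hat, kappa_hat)
```

Because \(||r_h|| \le \sum _{i=1}^{n} p(Z_i=h|{\mathbf {x}}_i, \varvec{\theta })\), a positive \(\psi _n\) keeps \(\rho _h\) strictly below 1 in this sketch, so the approximate update for \({\hat{\kappa }}_h\) remains finite even when a component collapses onto a single observation.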

We initialize the EM algorithm by randomly assigning the observations to mixture components, and the algorithm is terminated when the change in the penalized log-likelihood falls below a small threshold, which is set at \(10^{-5}\) in the experiments.

Simulation studies

We perform simulation studies to investigate the performance of the proposed EM algorithm for maximizing the penalized likelihood function. We generate data from the mixture of vMF distributions with two and three mixture components and with dimensions \(d=2,3,4\). For each model, data are generated with increasing sample sizes to assess the convergence of the estimated parameters toward the true parameters. The concentration parameters \(\varvec{\kappa }\) and the mixing proportions \(\varvec{\pi }\) are pre-specified, whereas the mean directions \(\varvec{\mu }\) are drawn from the uniform distribution on the surface of the unit hypersphere.

Table 1 Simulation results for the vMF mixtures with two mixture components

For the model with two mixture components, we specify the mixing proportions as \(\varvec{\pi } = (0.5, 0.5)\) and the concentration parameters as \(\varvec{\kappa } = (10, 1)\). For the model with three mixture components, we set \(\varvec{\pi } = (0.4, 0.3, 0.3)\) and \(\varvec{\kappa } = (10, 5, 1)\). For illustrative purposes, we consider the penalty function \({\tilde{p}}_n(\kappa _l) = - (1/n) \kappa _l\). For each combination of dimension d and sample size n, we simulate 500 random samples from the model, and the EM algorithm developed in Sect. 4.3 is used to obtain the parameter estimates. We measure the distance between the estimated parameters and the true parameters for each random sample. For the mean direction parameters \(\varvec{\mu }\), the distance is measured using the metric \(d({\mathbf {x}}, {\mathbf {y}}) = \text{ arccos }({\mathbf {x}}^T {\mathbf {y}})\).
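The angular metric used for the mean directions can be computed as in the short Python sketch below. The function name `angular_distance` is hypothetical. The inner product is clipped to \([-1, 1]\) before applying the arccosine, which is an implementation detail assumed here to guard against rounding errors for nearly parallel unit vectors and is not discussed in the paper.

```python
import numpy as np


def angular_distance(x, y):
    """Angle arccos(x^T y) between two unit vectors x and y, in radians."""
    inner = float(np.dot(x, y))
    # Clip to [-1, 1] so that rounding error cannot produce NaN.
    return float(np.arccos(np.clip(inner, -1.0, 1.0)))


e1 = np.array([1.0, 0.0, 0.0])
e2 = np.array([0.0, 1.0, 0.0])
print(angular_distance(e1, e2))   # a right angle, i.e. pi / 2
```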

Simulation results for the two and three mixture cases are presented in Tables 1 and 2, respectively. The average and the standard deviation of the distance between the true and the estimated parameters over 500 replications are reported. We observe that the estimated parameters converge to the true parameters as n increases. We notice that the mean direction parameter can be estimated with higher precision when the corresponding concentration parameter is large. This is expected since observations are more closely clustered with a large concentration parameter.

Table 2 Simulation results for the vMF mixtures with three mixture components

Tables 3 and 4 show the number of degeneracies when running the EM algorithm for computing the ordinary MLE for mixtures of vMF distributions. Observations are generated from a mixture of vMF distributions with one mixture component for Table 3 and with two mixture components for Table 4. We vary the dimension of the data from \(d=3\) to \(d=4\) and the sample size from \(n=100\) to \(n=500\). Mixtures of vMF distributions with \(p=2,3,4,5\) components are fitted to the generated data. We compute the ordinary MLE using the EM algorithm and record the number of times that the EM algorithm fails to converge out of 1000 simulation runs. The EM algorithm is considered to have failed to converge if one of the concentration parameters becomes exceedingly large (greater than \(10^{10}\)). Tables 3 and 4 show that the EM algorithm tends to fail to converge with smaller sample sizes. We also note that when the fitted model has a larger number of mixture components p, the EM algorithm is more likely to fail to converge.

Table 3 Number of degeneracies of the EM algorithm when computing the ordinary MLE for mixture of vMF distributions with one mixture component
Table 4 Number of degeneracies of the EM algorithm when computing the ordinary MLE for mixture of vMF distributions with two mixture components

Data application

We illustrate the EM algorithm for maximum penalized log-likelihood using the household data set from the R package HSAUR3. The data set contains the household expenditures of 20 single men and 20 single women on four commodity groups. As in Hornik and Grün (2014), we focus on three of the commodity groups (housing, food and service). Mixtures of vMF distributions with 2 and 3 mixture components are fitted to the data using the EM algorithms for the ordinary MLE and for the penalized MLE. The results are shown in Tables 5 and 6, respectively, where the estimated parameters \(\hat{\varvec{\pi }}, \hat{\varvec{\mu }}, \hat{\varvec{\kappa }}\) are shown for all cases. The estimated parameters for the MLE and for the penalized MLE are very similar for both \(p=2\) and \(p=3\). The log-likelihood evaluated at the MLE is slightly larger than the penalized log-likelihood evaluated at the penalized MLE. More interestingly, we observe that for each case the largest concentration parameter obtained under the penalized MLE is smaller than that obtained under the MLE. This behavior suggests that the incorporation of a penalty function pulls the estimate of the largest concentration parameter back towards 0 and prevents the divergence of the likelihood function.

Table 5 Maximum likelihood estimates obtained from fitting mixtures of vMF distributions to the household expenses example
Table 6 Penalized maximum likelihood estimates obtained from fitting mixtures of vMF distributions to the household expenses example


In this paper we considered a penalized maximum likelihood approach to the estimation of the mixture of vMF distributions. By incorporating a suitable penalty function, we showed that the resulting penalized MLE is strongly consistent. An EM algorithm was derived to maximize the penalized likelihood function, and its performance and behavior were examined using simulation studies and a data application. The techniques used in this work to prove consistency could be applicable to study other mixture models for spherical observations.


  • Bagchi P, Kadane JB (1991) Laplace approximations to posterior moments and marginal distributions on circles, spheres, and cylinders. Can J Stat 19:67–77

    MathSciNet  Article  Google Scholar 

  • Banerjee A, Dhillon IS, Ghosh J, Sra S (2005) Clustering on the unit hypersphere using von Mises–Fisher distributions. J Mach Learn Res 6:1345–1382

    MathSciNet  MATH  Google Scholar 

  • Bangert M, Hennig P, Oelfke U (2010) Using an infinite Von Mises–Fisher mixture model to cluster treatment beam directions in external radiation therapy. IEEE, Piscataway, pp 746–751

    Google Scholar 

  • Boomsma W, Kent J, Mardia K, Taylor C, Hamelryck T (2006) Graphical models and directional statistics capture protein structure. In: Barber S, Baxter P, Mardia K, Walls R (eds) Interdisciplinary statistics and bioinformatics. Leeds University Press, pp 91–94, null. Conference date: 04-07-2006 Through 06-07-2006

  • Chen J, Tan X, Zhang R (2008) Inference for normal mixtures in mean and variance. Stat Sin 18:443–465

    MathSciNet  MATH  Google Scholar 

  • Chen L, Singh VP, Guo S, Fang B, Liu P (2013) A new method for identification of flood seasons using directional statistics. Hydrol Sci J 58:28–40

    Article  Google Scholar 

  • Chen J, Li S, Tan X (2016) Consistency of the penalized MLE for two-parameter gamma mixture models. Sci China Math 59:2301–2318

    MathSciNet  Article  Google Scholar 

  • Ciuperca G, Ridolfi A, Idier J (2003) Penalized maximum likelihood estimator for normal mixtures. Scand J Stat 30:45–59

    MathSciNet  Article  Google Scholar 

  • Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc Ser B 39:1–38 (with discussion)

    MathSciNet  MATH  Google Scholar 

  • Fisher R (1953) Dispersion on a sphere. Proc R Soc Lond Ser A 217:295–305

    MathSciNet  Article  Google Scholar 

  • Goodd IJ, Gaskins RA (1971) Nonparametric roughness penalties for probability densities. Biometrika 58:255–277

    MathSciNet  Article  Google Scholar 

  • Hathaway RJ (1985) A constrained formulation of maximum-likelihood estimation for normal mixture distributions. Ann Stat 13:795–800

    MathSciNet  Article  Google Scholar 

  • Hornik K, Grün B (2014) movMF: an R package for fitting mixtures of von Mises–Fisher distributions. J Stat Softw 58:1–31

    Article  Google Scholar 

  • Jupp PE (1995) Some applications of directional statistics to astronomy. In: New trends in probability and statistics, vol 3 (Tartu/Pühajärve, 1994), VSP, Utrecht, pp 123–133

  • Li S (2011) Concise formulas for the area and volume of a hyperspherical cap. Asian J Math Stat 4:66–70

    MathSciNet  Article  Google Scholar 

  • Mardia KV (1972) Statistics of directional data, probability and mathematical statistics, no. 13. Academic Press, London

    Google Scholar 

  • Mardia KV, Jupp PE (2000) Directional statistics. Wiley series in probability and statistics. Wiley, Chichester

  • Mcgraw T, Vemuri B, Yezierski B, Mareci T (2006) Von Mises–Fisher mixture model of the diffusion ODF. In: Proceedings/IEEE international symposium on biomedical imaging: from nano to macro. IEEE international symposium on biomedical imaging 2006, pp 65–68

  • Ng TLJ, Kwong K-K (2020) Universal approximation on the hypersphere. Commun Stat Theory Methods 0:1–11

  • Peters BC Jr, Walker HF (1978) An iterative procedure for obtaining maximum-likelihood estimates of the parameters for a mixture of normal distributions. SIAM J Appl Math 46:362–378

  • Redner R (1981) Note on the consistency of the maximum likelihood estimate for nonidentifiable distributions. Ann Stat 9:225–228

  • Schou G (1978) Estimation of the concentration parameter in von Mises–Fisher distributions. Biometrika 65:369–377

  • Sinkkonen J, Kaski S (2002) Clustering based on conditional distributions in an auxiliary space. Neural Comput 14:217–239

  • Song H, Liu J, Wang G (2012) High-order parameter approximation for von Mises–Fisher distributions. Appl Math Comput 218:11880–11890

  • Taghia J, Ma Z, Leijon A (2014) Bayesian estimation of the von–Mises Fisher mixture model with variational inference. IEEE Trans Pattern Anal Mach Intell 36:1701–1715

  • Tanabe A, Fukumizu K, Oba S, Takenouchi T, Ishii S (2007) Parameter estimation for von Mises–Fisher distributions. Comput Stat 22:145–157

  • Zhe X, Chen S, Yan H (2019) Directional statistics-based deep metric learning for image classification and retrieval. Pattern Recognit 93:113–123


Open Access funding provided by the IReL Consortium.

Author information

Authors and Affiliations


Corresponding author

Correspondence to Tin Lok James Ng.

Ethics declarations

Conflict of interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.



Proof of Theorem 1


We fix \(\varvec{\theta } := ((\varvec{\mu }_1, \kappa _1), \ldots , (\varvec{\mu }_p, \kappa _p)) \in \varTheta ^{p}\) such that \(\varvec{\mu }_l = {\mathbf {x}}_m\) for some pair \((l, m)\). We construct a sequence \((\varvec{\theta }^{(q)}, \varvec{\pi }^{(q)}), q=1,2,\ldots \) and show that \(L(\mathcal{X}; \varvec{\theta }^{(q)}, \varvec{\pi }^{(q)}) \uparrow \infty \) as \(q \uparrow \infty \).

For \(k=1,2,\ldots ,p\) and \(q=1,2,\ldots \), we let \(\varvec{\mu }^{(q)}_{k} = \varvec{\mu }_k\), and \(\pi ^{(q)}_{k} = (1 - 1/q) \pi _k + 1/(q p)\). It is easy to verify that \(\sum _{k=1}^{p} \pi ^{(q)}_k = 1\). For each q, we let \(\kappa _l^{(q)} = q\), and for \(k \ne l\), we let \(\kappa _k^{(q)} = \kappa _k\). Since \(\pi _k^{(q)} \ge 1 / (q p)\), the likelihood is lower bounded by:

$$\begin{aligned} L(\mathcal{X}; \varvec{\theta }^{(q)}, \varvec{\pi }^{(q)}) \ge \frac{1}{(q p)^{n}} \prod _{i=1}^{n} \sum _{k=1}^{p} f({\mathbf {x}}_i; \varvec{\mu }_k, \kappa _k^{(q)}) . \end{aligned}$$

For the mth observation, we have

$$\begin{aligned} \sum _{k=1}^{p} f({\mathbf {x}}_m; \varvec{\mu }_k, \kappa _k^{(q)}) \ge f({\mathbf {x}}_m; \varvec{\mu }_l, \kappa _l^{(q)}) = c_d(\kappa _l^{(q)}) e^{\kappa _l^{(q)}} . \end{aligned}$$

For any \(i \ne m\), and \(h \ne l\), we have

$$\begin{aligned} \sum _{k=1}^{p} f({\mathbf {x}}_i; \varvec{\mu }_k, \kappa _k^{(q)}) \ge f({\mathbf {x}}_i; \varvec{\mu }_h, \kappa _h^{(q)}) = c_d(\kappa _h^{(q)}) e^{\kappa _h^{(q)} \varvec{\mu }_h^{T} {\mathbf {x}}_i} . \end{aligned}$$

Therefore, the likelihood function can be lower bounded by

$$\begin{aligned} L(\mathcal{X}; \varvec{\theta }^{(q)}, \varvec{\pi }^{(q)})\ge & {} \frac{1}{(qp)^{n}} c_d(\kappa _l^{(q)}) e^{\kappa _l^{(q)}} \prod _{i \ne m} c_d(\kappa _h^{(q)}) e^{\kappa _h^{(q)} \varvec{\mu }_h^{T} {\mathbf {x}}_i} \\= & {} \frac{1}{(qp)^{n}} c_d(q) e^{q} \prod _{i \ne m} c_d(\kappa _h) e^{\kappa _h \varvec{\mu }_h^{T} {\mathbf {x}}_i} . \end{aligned}$$

Since \(c_d(\kappa ) = \mathcal{O}(\kappa ^{d/2 - 1/2})\), we have \(L(\mathcal{X}; \varvec{\theta }^{(q)}, \varvec{\pi }^{(q)}) \uparrow \infty \) as \(q \uparrow \infty \). \(\square \)
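The degeneracy can also be observed numerically. The following is a minimal sketch, not part of the original proof, which assumes \(d = 3\) (where \(c_3(\kappa ) = \kappa / (4 \pi \sinh \kappa )\) is available in closed form), \(p = 2\) components, a toy data set consisting of the six signed coordinate axes, and illustrative base weights (0.3, 0.7). The first component is centred on the first observation with \(\kappa _1^{(q)} = q\), and the weights follow the sequence \(\pi _k^{(q)}\) defined above. All function and variable names are hypothetical.

```python
import math

def log_c3(kappa):
    """Log normalising constant of the vMF density on S^2: c_3(k) = k / (4*pi*sinh(k))."""
    # Written as log(k) - log(2*pi) - k - log(1 - exp(-2k)) to avoid overflow for large k.
    return math.log(kappa) - math.log(2 * math.pi) - kappa - math.log1p(-math.exp(-2 * kappa))

def log_vmf3(x, mu, kappa):
    """Log vMF density on S^2 evaluated at the unit vector x."""
    return log_c3(kappa) + kappa * sum(a * b for a, b in zip(x, mu))

def mixture_loglik(data, weights, params):
    """Log-likelihood of a finite vMF mixture, accumulated with log-sum-exp."""
    total = 0.0
    for x in data:
        terms = [math.log(w) + log_vmf3(x, mu, k) for w, (mu, k) in zip(weights, params)]
        top = max(terms)
        total += top + math.log(sum(math.exp(t - top) for t in terms))
    return total

# Toy data (an assumption for illustration): the six signed coordinate axes of R^3.
DATA = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
BASE_WEIGHTS = (0.3, 0.7)  # illustrative mixing proportions

def loglik_along_sequence(q):
    """Log-likelihood at the q-th element of the degenerate sequence."""
    p = len(BASE_WEIGHTS)
    weights = [(1 - 1 / q) * w + 1 / (q * p) for w in BASE_WEIGHTS]
    # Component 1 sits on the first observation with concentration q;
    # component 2 is held fixed at mu = (0, 0, 1), kappa = 1.
    params = [(DATA[0], float(q)), ((0.0, 0.0, 1.0), 1.0)]
    return mixture_loglik(DATA, weights, params)

for q in (10, 100, 1000, 10000):
    print(q, round(loglik_along_sequence(q), 3))
```

In this toy setting the printed log-likelihood keeps increasing with q, roughly at the rate \(\log q\), which is the behaviour the penalty term is designed to suppress.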

Technical lemmas

The following technical lemmas are useful for the proof of the main results. Lemma 3 gives the asymptotic expansion of the modified Bessel function of the first kind and is straightforward to derive.

Lemma 3

Let \(I_r(\cdot )\) be the modified Bessel function of the first kind of order r. As \(z \rightarrow \infty \), we have

$$\begin{aligned} I_r(z) = \frac{e^{z}}{\sqrt{2\pi z}} (1 + \mathcal{O}(z^{-1})) . \end{aligned}$$
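The expansion in Lemma 3 can be checked numerically. The sketch below is an illustration only: it evaluates \(\log I_r(z)\) from the standard power series \(I_r(z) = \sum _{k \ge 0} (z/2)^{2k+r} / (k! \, \varGamma (k+r+1))\) using the Python standard library, and then reports \(z\) times the relative deviation of \(I_r(z)\) from \(e^{z}/\sqrt{2 \pi z}\), which should stay bounded if the remainder is \(\mathcal{O}(z^{-1})\). The function names and the truncation at 400 terms are assumptions chosen for moderate z.

```python
import math

def log_bessel_i(r, z, terms=400):
    """log I_r(z) from the power series sum_k (z/2)^(2k+r) / (k! * Gamma(k+r+1))."""
    log_half_z = math.log(z / 2)
    logs = [(2 * k + r) * log_half_z - math.lgamma(k + 1) - math.lgamma(k + r + 1)
            for k in range(terms)]
    top = max(logs)
    # Log-sum-exp keeps the summation stable when individual terms are huge.
    return top + math.log(sum(math.exp(v - top) for v in logs))

def scaled_ratio(r, z):
    """I_r(z) * sqrt(2*pi*z) / exp(z), which Lemma 3 states is 1 + O(1/z)."""
    return math.exp(log_bessel_i(r, z) - z) * math.sqrt(2 * math.pi * z)

for r in (0, 0.5, 1):
    for z in (10.0, 50.0, 100.0):
        print(r, z, z * (scaled_ratio(r, z) - 1.0))
```

The truncation level is adequate only for the moderate arguments used here; a larger z would need more terms or a dedicated special-function library.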

Lemma 4 concerns the covering of the surface of the unit hypersphere with \(B_{\epsilon }\)-balls and is needed for the proof of Lemma 1.

Lemma 4

For any sufficiently small positive number \(\epsilon \), there exist points \(\varvec{\eta }_1, \ldots , \varvec{\eta }_m \in {\mathbb {S}}^{d-1}\) with \(m \le A_1/\epsilon ^{d-1}\), where \(A_1 > 0\) is some constant which depends on the dimension d, such that for any \({\mathbf {x}} \in {\mathbb {S}}^{d-1}\), there exists \(\varvec{\eta }_j\) with \(B_{\epsilon }({\mathbf {x}}) \subset B_{2 \epsilon }(\varvec{\eta }_j) \).


Fix \(\epsilon > 0\) and consider an open cover \(\{ B_{\epsilon }(\varvec{\eta }_i)\}_i\) of \({\mathbb {S}}^{d-1}\). By compactness of \({\mathbb {S}}^{d-1}\), there exists a finite subcover \(\{B_{\epsilon }(\varvec{\eta }_1), \ldots , B_{\epsilon }(\varvec{\eta }_m)\}\) of \({\mathbb {S}}^{d-1}\). Let \({\mathbf {x}} \in {\mathbb {S}}^{d-1}\) be fixed, and let \( {\mathbf {z}} \in B_{\epsilon }({\mathbf {x}})\) be arbitrary. We must show that \(d({\mathbf {z}},\varvec{\eta }_i) < 2 \epsilon \) for some i.

Since \(\{B_{\epsilon }(\varvec{\eta }_1), \ldots , B_{\epsilon }(\varvec{\eta }_m)\}\) is an open cover of \({\mathbb {S}}^{d-1}\), we must have \({\mathbf {x}} \in B_{\epsilon }(\varvec{\eta }_i)\) for some i. Therefore,

$$\begin{aligned} d({\mathbf {z}}, \varvec{\eta }_i) \le d({\mathbf {z}},{\mathbf {x}}) + d({\mathbf {x}}, \varvec{\eta }_i) < \epsilon + \epsilon = 2\epsilon . \end{aligned}$$

Since \({\mathbf {z}}\) was arbitrary, we have \(B_{\epsilon }({\mathbf {x}}) \subset B_{2 \epsilon }(\varvec{\eta }_i)\). Since \({\mathbf {x}}\) is arbitrary, the result follows.

The bound \(m \le A_1/\epsilon ^{d-1}\) for some constant \(A_1 > 0\) is clearly true for \(d=2\), and the general case follows by induction on d with a geometric argument. \(\square \)
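For the case \(d = 2\) the covering can be written down explicitly. The sketch below is an illustration under stated assumptions: points on the unit circle are represented by their angles, the metric is arc length, and the centres are equally spaced with spacing strictly below \(2 \epsilon \). The helper names and the constant \(A_1 = \pi + 1\) (valid for \(\epsilon \le 1\)) are choices made for this example, not taken from the paper.

```python
import math
import random

def circle_dist(a, b):
    """Geodesic (arc-length) distance between two angles on the unit circle."""
    gap = abs(a - b) % (2 * math.pi)
    return min(gap, 2 * math.pi - gap)

def epsilon_net(eps):
    """Equally spaced centres with spacing 2*pi/m < 2*eps, so every point is within eps of one."""
    m = math.floor(math.pi / eps) + 1
    return [2 * math.pi * j / m for j in range(m)]

def check_cover(eps, trials=2000, seed=0):
    """Randomly test that B_eps(x) is contained in B_{2 eps}(eta_j) for the nearest centre."""
    rng = random.Random(seed)
    centres = epsilon_net(eps)
    for _ in range(trials):
        x = rng.uniform(0.0, 2 * math.pi)
        eta = min(centres, key=lambda c: circle_dist(x, c))
        if circle_dist(x, eta) >= eps:
            return False
        z = x + rng.uniform(-eps, eps)  # a point at distance at most eps from x
        if circle_dist(z, eta) >= 2 * eps:
            return False
    return True

for eps in (0.5, 0.1, 0.02):
    print(eps, len(epsilon_net(eps)), check_cover(eps))
```

The number of centres is at most \(\pi / \epsilon + 1\), in line with the \(A_1 / \epsilon ^{d-1}\) bound for \(d = 2\).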

Proof of Lemma 1


Let \(\epsilon \) be a small positive number. By Lemma 4, there exist \(\varvec{\eta }_1, \ldots , \varvec{\eta }_m \in {\mathbb {S}}^{d-1}\) with \(m \le A_1 / \epsilon ^{d-1} \) such that for any \({\mathbf {x}} \in {\mathbb {S}}^{d-1}\), we have \(B_{\epsilon }({\mathbf {x}}) \subset B_{2 \epsilon }(\varvec{\eta }_j)\) for some j. Consequently, we have that

$$\begin{aligned} \sup _{\varvec{\mu } \in {\mathbb {S}}^{d-1}} \bigg \{ \frac{1}{n} \sum _{i=1}^{n} I\big (X_i \in B_{\epsilon }(\varvec{\mu })\big ) \bigg \}\le & {} \max _{j=1,\ldots ,m} \bigg \{ \frac{1}{n} \sum _{i=1}^{n} I\big (X_i \in B_{2 \epsilon }(\varvec{\eta }_j)\big ) \bigg \} \\\le & {} \max _{j=1,\ldots ,m} \bigg \{ \frac{1}{n} \sum _{i=1}^{n} I \big ( X_i \in B_{2\epsilon }(\varvec{\eta }_j) \big ) - \gamma _0(B_{2\epsilon }(\varvec{\eta }_{j})) \bigg \} \\&+ \max _{j=1, \ldots , m} \bigg \{ \gamma _0(B_{2 \epsilon }(\varvec{\eta }_j)) \bigg \} , \end{aligned}$$

where \( \gamma _0(B_{2\epsilon }(\varvec{\eta }_j)) = \gamma _0(X \in B_{2\epsilon }(\varvec{\eta }_j))\) is the probability that a random variable X drawn from \(\gamma _0\) takes a value in the \(2 \epsilon \)-ball \(B_{2\epsilon }(\varvec{\eta }_j)\). For each \(j=1, \ldots , m\), \(\gamma _0(B_{2 \epsilon }(\varvec{\eta }_j))\) can be bounded above by

$$\begin{aligned} \gamma _0(B_{2\epsilon }(\varvec{\eta }_j)) = \int _{B_{2 \epsilon }(\varvec{\eta }_j)} f_0({\mathbf {x}}) d \omega ({\mathbf {x}}) \le M \omega (B_{2\epsilon }(\varvec{\eta }_j)) = M A_2 \epsilon ^{d-1} = \delta (\epsilon ) , \end{aligned}$$

where we recall that the constants M and \(A_2\) are defined in Eqs. (4) and (7), respectively, and the function \(\delta (\cdot )\) is defined in Eq. (8). This implies that

$$\begin{aligned} \max _{j=1, \ldots , m} \bigg \{ \gamma _0(B_{2\epsilon }(\varvec{\eta }_j)) \bigg \} \le \delta (\epsilon ) . \end{aligned} \tag{18}$$

Define the quantities

$$\begin{aligned} \varDelta _{nj} := \bigg | \frac{1}{n} \sum _{i=1}^{n} I \big ( X_i \in B_{2\epsilon }(\varvec{\eta }_j) \big ) - \gamma _0(B_{2\epsilon }(\varvec{\eta }_{j})) \bigg | , \quad j=1,\ldots ,m. \end{aligned}$$

For \(t > 0\), by Bernstein’s inequality we have

$$\begin{aligned} {\mathbb {P}}( \varDelta _{nj} \ge t )\le & {} 2 \exp \bigg ( - \frac{ \frac{1}{2} n^2 t^2 }{n \gamma _0(B_{2\epsilon }(\varvec{\eta }_j)) (1-\gamma _0(B_{2\epsilon }(\varvec{\eta }_j))) + \frac{1}{3} nt } \bigg ) \\\le & {} 2 \exp \bigg ( - \frac{n^2 t^2}{2 n M A_2 \epsilon ^{d-1} + \frac{2}{3} n t } \bigg ) \\= & {} 2 \exp \bigg ( - \frac{n t^2}{2\delta (\epsilon ) + \frac{2}{3} t } \bigg ) . \end{aligned} \tag{19}$$

We note that \(\delta (\epsilon ) > \log n / n\) whenever \(\epsilon ^{d-1} > \log n / (M n A_2)\). Letting \(t = \delta (\epsilon )\) in the inequality above, we obtain

$$\begin{aligned} {\mathbb {P}}(\varDelta _{nj} \ge \delta (\epsilon )) \le 2 n^{-3} . \end{aligned}$$

Since \(\epsilon ^{d-1} > \log n / (M n A_2)\) implies \(m < n\) for sufficiently large n, we apply the union bound to obtain

$$\begin{aligned} {\mathbb {P}} \Big ( \max _j \varDelta _{nj} \ge \delta (\epsilon ) \Big ) \le 2 n^{-2} . \end{aligned} \tag{20}$$

Combining the two inequalities (18) and (20), the first conclusion of the lemma follows by applying the Borel-Cantelli lemma.

For the second statement of the lemma, we observe that \( 0< \epsilon ^{d-1} < \log n / (M n A_2)\) implies that \(\delta (\epsilon ) < \log n / n\). Let \(t := (\log n)^{2} / n\). For large enough n, we have

$$\begin{aligned} 2 \delta (\epsilon ) < t / 3 . \end{aligned}$$

Substituting t into inequality (19) gives

$$\begin{aligned} {\mathbb {P}} \bigg ( \varDelta _{nj} \ge \frac{ (\log n)^{2} }{ n } \bigg ) \le \exp ( - (\log n)^{2} ) \le n^{-3} . \end{aligned}$$

The second conclusion of the lemma follows from an application of the Borel–Cantelli lemma. \(\square \)
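The arithmetic behind the second regime can be verified directly. The sketch below is an illustration only: it evaluates the final Bernstein expression \(2 \exp \big ( - n t^2 / (2 \delta + \tfrac{2}{3} t) \big )\), including its leading factor 2, at \(t = (\log n)^{2}/n\), with \(\delta \) set to the boundary value \(\log n / n\) as a worst case. The function names are hypothetical.

```python
import math

def bernstein_tail(n, t, delta):
    """Right-hand side of the Bernstein bound: 2 * exp(-n t^2 / (2 delta + 2 t / 3))."""
    return 2.0 * math.exp(-n * t * t / (2.0 * delta + 2.0 * t / 3.0))

def second_regime_bound(n):
    """Bound at t = (log n)^2 / n with delta at the boundary value log(n) / n."""
    delta = math.log(n) / n
    t = math.log(n) ** 2 / n
    return bernstein_tail(n, t, delta)

for n in (10**3, 10**4, 10**6):
    log_n = math.log(n)
    condition = 2 * log_n / n < (log_n ** 2 / n) / 3  # the requirement 2 * delta < t / 3
    print(n, condition, second_regime_bound(n), 2 * math.exp(-log_n ** 2), 2 * n ** -3.0)
```

With \(\delta \) at the boundary, the requirement \(2 \delta < t/3\) reduces to \(\log n > 6\), which gives a concrete sense of what "large enough n" means in this step.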

Proof of strong consistency of PMLE


We prove the strong consistency of PMLE for the case of two mixture components. The proof of strong consistency for the general case of p mixture components is analogous but significantly more tedious. We briefly outline the key steps for the proof of the p mixture components case, which are also along the lines of Section 3.3 of Chen et al. (2008).

Recall that a two component mixture mixing measure has the form \( \gamma = \pi \gamma _{\theta _1} + (1 - \pi ) \gamma _{\theta _2}\), where \(0 \le \pi \le 1\), \(\theta _1 = (\varvec{\mu }_1, \kappa _1)\), \(\theta _2 = (\varvec{\mu }_2, \kappa _2)\) and the corresponding penalized log-likelihood function is given by

$$\begin{aligned} pl_n(\gamma ) = l_n(\gamma ) + {\tilde{p}}_n(\kappa _1) + {\tilde{p}}_n(\kappa _2) . \end{aligned}$$

Let \(K_0 = E_0 \log f(X; \gamma _0)\) where \(E_0\) denotes the expectation under the true probability measure \(\gamma _0\). We follow the strategy in Chen et al. (2008) to divide the parameter space \( \varGamma = \varPi \times \varTheta ^{2}\) into three regions

$$\begin{aligned} \varGamma _1 := \{\pi , 1 - \pi , (\varvec{\mu }_1, \kappa _1), (\varvec{\mu }_2, \kappa _2): \kappa _1 \ge \kappa _2 \ge \nu _0 \} , \end{aligned}$$
$$\begin{aligned} \varGamma _2 := \{\pi , 1 - \pi , (\varvec{\mu }_1, \kappa _1), (\varvec{\mu }_2, \kappa _2): \kappa _1 \ge \tau _0, \kappa _2 \le \nu _0\} , \end{aligned}$$
$$\begin{aligned} \varGamma _3 := \varGamma - (\varGamma _1 \cup \varGamma _2) . \end{aligned}$$

We require \(\nu _0\) and \(\tau _0\) to be sufficiently large; their exact magnitudes are specified later. We will show that the penalized MLE \({\hat{\gamma }}\) is almost surely not in \(\varGamma _1\) or \(\varGamma _2\). Therefore, \({\hat{\gamma }}\) must be in \(\varGamma _3\), and its consistency follows from Theorem 5 of Redner (1981).
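The partition can be made concrete with a small classifier. The sketch below is an illustration under stated assumptions: the two components are relabelled so that \(\kappa _1 \ge \kappa _2\), the thresholds satisfy \(\tau _0 > \nu _0\), and the numerical threshold values in the usage lines are hypothetical. The region depends only on the concentration parameters, so the mixing proportion and mean directions are omitted.

```python
def classify_region(kappa_1, kappa_2, nu_0, tau_0):
    """Return the region of the two-component parameter space containing (kappa_1, kappa_2)."""
    # Relabel the components so that the larger concentration comes first.
    hi, lo = max(kappa_1, kappa_2), min(kappa_1, kappa_2)
    if lo >= nu_0:
        return "Gamma_1"  # both concentrations are large
    if hi >= tau_0:
        return "Gamma_2"  # one very large concentration, the other below nu_0
    return "Gamma_3"      # the remaining region, where both concentrations are bounded

# Hypothetical thresholds nu_0 = 10 and tau_0 = 100.
print(classify_region(50.0, 40.0, 10.0, 100.0))
print(classify_region(500.0, 3.0, 10.0, 100.0))
print(classify_region(50.0, 3.0, 10.0, 100.0))
```

The boundary case \(\kappa _2 = \nu _0\) belongs to both \(\varGamma _1\) and \(\varGamma _2\) as defined; the sketch assigns it to \(\varGamma _1\), which is immaterial for the argument since both regions are excluded.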

We first consider the region \( \varGamma _1 \), and define the following index sets:

$$\begin{aligned} D_1 := \{i: X_i \in B_{\epsilon _1}(\varvec{\mu }_1)\}, \quad D_2 := \{i: X_i \in B_{\epsilon _2}(\varvec{\mu }_2)\} , \end{aligned}$$

where

$$\begin{aligned} \epsilon _i = \frac{1}{(\log \kappa _i)^{2}}, \quad i=1,2 . \end{aligned}$$

\(D_1\) and \(D_2\) consist of observations that are very close to \(\varvec{\mu }_1\) and \(\varvec{\mu }_2\), respectively. We separately assess the likelihood contributions of the observations in \(D_1\) and \(D_2\) in Lemmas 5 and 6.

By Lemmas 5 and 6, the maximizer \({\hat{\gamma }}_n\) of \(pl_n(\gamma )\) is almost surely in \(\varGamma _3\). Lemmas 5 and 6 also imply that \(\gamma _0 \in \varGamma _3\). We want to apply Theorem 5 of Redner (1981) to conclude the strong consistency of the estimator \({\hat{\gamma }}_n\). First, it is clear from the definition of the metric \(\rho \) on \(\varTheta \) given in Eq. (3) that \(\varGamma _3\) is a compact subset of \(\varGamma \) containing \(\gamma _0\). For any \(\theta = (\varvec{\mu }, \kappa ) \in \varTheta \), we can choose r sufficiently small such that for all \(\theta ' =(\varvec{\mu }', \kappa ')\) with \(\rho (\theta , \theta ') < r\), the density \(f(\cdot ; \varvec{\mu }', \kappa ')\) is bounded. Thus, we have

$$\begin{aligned} \int _{ {\mathbb {S}}^{d-1} } \log \bigg ( \max \big ( 1, \sup _{ \theta ': \rho (\theta , \theta ') < r} f({\mathbf {x}}, \varvec{\mu }', \kappa ') \big ) \bigg ) d \gamma _0\le & {} \int _{{\mathbb {S}}^{d-1}} C d \gamma _0 \\\le & {} C, \end{aligned}$$

for some constant C. Therefore, Assumption 2a of Redner (1981) is also satisfied. For any \(\theta _1, \theta _2 \in \varTheta \) where \(\theta _1 = (\varvec{\mu }_1, \kappa _1)\) and \(\theta _2 = (\varvec{\mu }_2, \kappa _2)\), since \(f(\cdot , \varvec{\mu }_1, \kappa _1)\) is bounded, we have

$$\begin{aligned} \int _{{\mathbb {S}}^{d-1}} |\log f({\mathbf {x}}; \varvec{\mu }_1, \kappa _1)| d \gamma _{\theta _2} < \infty . \end{aligned}$$

Therefore, Assumption 4 of Redner (1981) is also satisfied. We can conclude by applying Theorem 5 of Redner (1981) that \({\hat{\gamma }}_n \rightarrow \gamma _0\) almost surely in the quotient space \({\hat{\varGamma }}\).

We outline the key steps of the proof for the general case of p mixture components, which follows the same strategy as the proof for the two-component case. We divide the parameter space \(\varGamma = \varPi \times \varTheta ^p\) into \(p+1\) regions

$$\begin{aligned} \varGamma _1:= & {} \{ (\pi _1, \ldots , \pi _p), (\varvec{\mu }_1, \kappa _1), \ldots , (\varvec{\mu }_p, \kappa _p): \kappa _1 \ge \kappa _2 \ge \cdots \ge \kappa _p \ge \nu _{10} \} ,\\ \varGamma _k:= & {} \{ (\pi _1, \ldots , \pi _p), (\varvec{\mu }_1, \kappa _1), \ldots , (\varvec{\mu }_p, \kappa _p): \kappa _1 \ge \cdots \ge \kappa _{p-k+1} \ge \nu _{k0}, \\&\quad \nu _{(k-1)0} \ge \kappa _{p-k+2} \ge \cdots \ge \kappa _p \} , \end{aligned}$$

for \(k=2, \ldots , p\), and \(\varGamma _{p+1} = \varGamma - \cup _{k=1}^{p} \varGamma _k \).

Given suitable constraints on the values of \(\nu _{10}, \ldots , \nu _{p0}\), an extension of Lemma 5 shows that

$$\begin{aligned} \sup _{\gamma \in \varGamma _1} pl_n(\gamma ) - pl_n(\gamma _0) \rightarrow - \infty ,\quad \text{ a.s. } , \end{aligned}$$

and an extension of Lemma 6 shows that

$$\begin{aligned} \sup _{\gamma \in \varGamma _k} pl_n(\gamma ) - pl_n(\gamma _0) \rightarrow - \infty , \quad \text{ a.s. } \end{aligned}$$

for \(k=2, \ldots , p\). Therefore, the probability that \({\hat{\gamma }}_n\), the maximizer of \(pl_n(\gamma )\), belongs to any of \(\varGamma _1, \ldots , \varGamma _p\) goes to zero. With \({\hat{\gamma }}_n\) almost surely in \(\varGamma _{p+1}\), which is a compact subset of \(\varGamma \), we can apply Theorem 5 of Redner (1981) to conclude the strong consistency of \({\hat{\gamma }}_n\). \(\square \)

Lemma 5

\( \sup _{\gamma \in \varGamma _1} pl_n(\gamma ) - pl_n(\gamma _0) \rightarrow - \infty , \quad \text{ a.s. } \)


The log-likelihood contribution of the observations in any index set D is given by

$$\begin{aligned} l_n(\gamma ;D) = \sum _{i \in D} \log \bigg ( \pi c_d(\kappa _1) \exp (\kappa _1 {\mathbf {x}}_i^{T} \varvec{\mu }_1) + (1-\pi ) c_d(\kappa _2) \exp (\kappa _2 {\mathbf {x}}_i^{T} \varvec{\mu }_2) \bigg ) . \end{aligned}$$

For any observation i in \(D_1\), its likelihood contribution is bounded above by \(c_d(\kappa _1) \exp (\kappa _1)\), and by the asymptotic expansion of the modified Bessel function of the first kind in Lemma 3, we have

$$\begin{aligned} c_d(\kappa _1) \exp (\kappa _1) \le A_3 \sqrt{\kappa _1} \end{aligned} \tag{21}$$

for some constant \(A_3 > 0\). Consequently, the log-likelihood of observations in \(D_1\) is bounded above by

$$\begin{aligned} l_n(\gamma ;D_1) \le n(D_1) \log ( A_3 \sqrt{\kappa _1}) , \end{aligned}$$

where \(n(D_1)\) is the number of observations in \(D_1\). By Lemma 2, for

$$\begin{aligned} \frac{ \log n}{M n A_2} \le \epsilon _1^{d-1} < \xi _0 , \end{aligned}$$

\(n(D_1)\) is almost surely bounded above by

$$\begin{aligned} n(D_1) = \sum _{i=1}^{n} I(X_i \in B_{\epsilon _1}(\varvec{\mu }_1)) \le 4 n \delta (\epsilon _1) = 4 n M A_2 \epsilon _1^{d-1} . \end{aligned}$$

Therefore, recalling that \(\epsilon _1 = 1 / (\log \kappa _1)^{2}\), \(l_n(\gamma ;D_1)\) can be bounded above by:

$$\begin{aligned} l_n(\gamma ;D_1)\le & {} 4 n M A_2 \epsilon _1^{d-1} \log ( A_3 \sqrt{\kappa _1} ) \\\le & {} A_4 n \frac{1}{(\log \kappa _1)^{2 d - 3}} \\\le & {} A_4 n \frac{1}{(\log \nu _0)^{2 d - 3}} , \end{aligned} \tag{22}$$

where \(A_4 > 0\) is some constant, and the last inequality follows from \(\kappa _1 \ge \nu _0\). For

$$\begin{aligned} 0< \epsilon _1^{d-1} < \frac{ \log n }{ M n A_2} , \end{aligned}$$

we have \( n(D_1) \le 2 (\log n )^{2}\) almost surely by Lemma 2. Therefore, with condition C3 on the penalty function \({\tilde{p}}_n(\kappa _1)\), almost surely

$$\begin{aligned} n(D_1) \log ( A_3 \sqrt{\kappa _1} ) + {\tilde{p}}_n(\kappa _1) \le 2 (\log n)^{2} \log ( A_3 \sqrt{\kappa _1} ) + {\tilde{p}}_n(\kappa _1) < 0 . \end{aligned} \tag{23}$$

The two bounds (22) and (23) can be combined to form

$$\begin{aligned} l_n(\gamma ;D_1) + {\tilde{p}}_n(\kappa _1) \le A_4 n \frac{1}{ (\log \nu _0)^{2 d -3} } . \end{aligned} \tag{24}$$

The same approach can be used to derive the same bound for observations in \(D_1^{c} D_2\):

$$\begin{aligned} l_n(\gamma ;D_1^{c} D_2) + {\tilde{p}}_n(\kappa _2) \le A_4 n \frac{1}{ (\log \nu _0)^{2 d -3} } . \end{aligned} \tag{25}$$

For any observation \({\mathbf {x}}\) that falls outside both \(D_1\) and \(D_2\), we have that \(\arccos ({\mathbf {x}}^{T} \varvec{\mu }_1) > \epsilon _1\) and \(\arccos ({\mathbf {x}}^{T} \varvec{\mu }_2) > \epsilon _2\). Using the Taylor expansion of \(\cos (\epsilon )\) for positive \(\epsilon \) around 0, we can show that for \({\mathbf {x}} \in D_1^{c} D_2^{c}\), we have

$$\begin{aligned} {\mathbf {x}}^{T} \varvec{\mu }_i \le 1 - \frac{\epsilon _i^{2}}{3} = 1 - \frac{1}{3 (\log \kappa _i)^{4}}, \quad i = 1, 2 . \end{aligned}$$

Consequently, recalling the inequality (21), and for large enough \(\nu _0\), the log-likelihood contribution of such \({\mathbf {x}}\) is bounded above by

$$\begin{aligned} \log A_3 + \log \sqrt{\kappa _i} - \frac{\kappa _i}{3 (\log \kappa _i)^{4}} \le \log \kappa _i - \frac{\kappa _i}{4 (\log \kappa _i)^{4}} \le \log \nu _0 - \frac{\nu _0}{4 (\log \nu _0)^{4}} . \end{aligned}$$
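Both ingredients of this step can be checked numerically. The sketch below is an illustration only: it verifies the cosine bound \(\cos (\epsilon ) \le 1 - \epsilon ^{2}/3\) at \(\epsilon = 1/(\log \kappa )^{2}\), which underlies the bound \({\mathbf {x}}^{T} \varvec{\mu }_i \le 1 - 1 / (3 (\log \kappa _i)^{4})\), and then evaluates the per-observation bound \(h(\kappa ) = \log \kappa - \kappa / (4 (\log \kappa )^{4})\) on a grid of powers of ten. The function names and the grid are hypothetical.

```python
import math

def cap_bound_holds(kappa):
    """Check cos(eps) <= 1 - eps^2 / 3 at eps = 1 / (log kappa)^2."""
    eps = 1.0 / math.log(kappa) ** 2
    return math.cos(eps) <= 1.0 - eps ** 2 / 3.0

def h(kappa):
    """Per-observation bound log(kappa) - kappa / (4 * (log kappa)^4)."""
    return math.log(kappa) - kappa / (4.0 * math.log(kappa) ** 4)

for exponent in range(3, 10):
    kappa = 10.0 ** exponent
    print(f"kappa=1e{exponent}: cap bound {cap_bound_holds(kappa)}, h = {h(kappa):.2f}")
```

On this grid \(h\) is still positive at \(\kappa = 10^{6}\) and negative from \(10^{7}\) onwards, which indicates how large \(\nu _0\) has to be before the bound becomes useful.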

For large enough n, we must have \(n(D_1^{c} D_2^{c}) \ge n / 2 \) almost surely. Therefore, almost surely the log-likelihood of the observations in \(D_1^{c} D_2^{c}\) is bounded above by

$$\begin{aligned} l_n(\gamma ;D_1^{c}D_2^{c}) \le (n/2) \bigg ( \log \nu _0 - \frac{\nu _0}{ 4 (\log \nu _0)^{4} } \bigg ) . \end{aligned} \tag{26}$$

For sufficiently large \(\nu _0\), the following inequalities hold

$$\begin{aligned}&2 A_4 \frac{1}{(\log \nu _0)^{2d-3}} \le 1, \\&\log \nu _0 - \frac{\nu _0}{4 (\log \nu _0)^{4}} \le 2 K_0 - 4 . \end{aligned}$$

Therefore, combining the bounds (24), (25), (26), the penalized log-likelihood can be bounded above by

$$\begin{aligned} pl_n(\gamma )\le & {} 2 A_4 n \frac{1}{(\log \nu _0)^{2d-3}} + (n/2) \bigg ( \log \nu _0 - \frac{\nu _0}{4 (\log \nu _0)^{4}} \bigg ) \\\le & {} n + (n/2)(2 K_0 - 4) \\= & {} n(K_0 - 1) . \end{aligned}$$

By the strong law of large numbers, we have \(n^{-1} pl_n(\gamma _0) \rightarrow K_0\) almost surely. Therefore,

$$\begin{aligned} \sup _{\gamma \in \varGamma _1} pl_n(\gamma ) - pl_n(\gamma _0) \rightarrow -\infty \end{aligned}$$

almost surely. \(\square \)

Lemma 6

\( \sup _{\gamma \in \varGamma _2} pl_n(\gamma ) - pl_n(\gamma _0) \rightarrow - \infty , \quad \text{ a.s. } \)


To establish a similar result for \(\varGamma _2\), we define

$$\begin{aligned} g({\mathbf {x}};\gamma ) = \pi \exp \bigg (\frac{\kappa _1}{2} ({\mathbf {x}}^{T} \varvec{\mu }_1 - 1)\bigg ) + (1-\pi ) c_d(\kappa _2) \exp \big ( \kappa _2 {\mathbf {x}}^{T} \varvec{\mu }_2 \big ) . \end{aligned}$$

We note that the first term on the right-hand side above is not a vMF density, and is well defined as \(\kappa _1 \rightarrow \infty \). Straightforward calculation shows that for all \(\gamma \in \varGamma _2\), we have

$$\begin{aligned} \int _{{\mathbb {S}}^{d-1}} g({\mathbf {x}}; \gamma ) d \omega ({\mathbf {x}}) < 1 . \end{aligned}$$

Therefore, by Jensen’s inequality, for all \(\gamma \in \varGamma _2\),

$$\begin{aligned} E_0 \bigg [ \log \frac{g(X;\gamma )}{f_0(X)} \bigg ] < 0 , \end{aligned}$$

where we recall that \(f_0(\cdot )\) is the true density function. Since \(\sup _{\gamma \in \varGamma _2} g({\mathbf {x}};\gamma )\) is bounded and \(g({\mathbf {x}};\gamma )\) is continuous in \(\gamma \) almost surely w.r.t. \(f_0({\mathbf {x}})\), it follows that

$$\begin{aligned} \sup _{\gamma \in \varGamma _2} \bigg \{ \frac{1}{n} \sum _{i=1}^{n} \log \Big ( \frac{g(X_i;\gamma )}{f_0(X_i)} \Big ) \bigg \} \rightarrow - \eta (\tau _0) < 0 , \quad \text{ a.s. } \end{aligned} \tag{27}$$

where \(\eta (\tau _0) > 0 \) is an increasing function of \(\tau _0\). Hence, we can find \(\tau _0 > \nu _0\) such that

$$\begin{aligned} A_4 \frac{1}{(\log \tau _0)^{2d-3}} \le \eta (\nu _0)/4 < \eta (\tau _0)/4 . \end{aligned} \tag{28}$$

For any observation in \(D_1\), its log-likelihood contribution is no larger than

$$\begin{aligned} \log (A_3 \sqrt{\kappa _1} ) + \log g({\mathbf {x}}; \gamma ) . \end{aligned}$$

For sufficiently large \(\tau _0\), the log-likelihood contribution of any observation not in \(D_1\) is no more than \(\log g({\mathbf {x}};\gamma )\). This follows since for large enough \(\kappa _1 > \tau _0\),

$$\begin{aligned} c_d(\kappa _1) e^{\kappa _1 {\mathbf {x}}^{T} \varvec{\mu }_1}= & {} c_d(\kappa _1) e^{\kappa _1 ({\mathbf {x}}^{T} \varvec{\mu }_1 - 1)} e^{\kappa _1} \\\le & {} A_3 \sqrt{\kappa _1} e^{\kappa _1 ({\mathbf {x}}^{T} \varvec{\mu }_1 - 1)} \\\le & {} e^{ \frac{\kappa _1}{2} ({\mathbf {x}}^{T} \varvec{\mu }_1 - 1) } . \end{aligned}$$

Therefore, the penalized log-likelihood is almost surely bounded above by

$$\begin{aligned}&\sup _{\varGamma _2} pl_n(\gamma ) - pl_n(\gamma _0) \\&\quad \le \sup _{\kappa _1 \ge \tau _0} \bigg \{ \sum _{i \in D_1} \log (A_3 \sqrt{\kappa _1}) + {\tilde{p}}_n(\kappa _1) \bigg \} + \sup _{\varGamma _2} \bigg \{ \sum _{i=1}^{n} \log \frac{g(X_i;\gamma )}{f_0(X_i)} \bigg \} + p_n(\varvec{\kappa }_0) \\&\quad \le A_4 n \frac{1}{(\log \tau _0)^{2d-3}} - \frac{3}{4} \eta (\tau _0) n + p_n(\varvec{\kappa }_0) \\&\quad \le -\frac{1}{2} \eta (\tau _0) n +p_n(\varvec{\kappa }_0) \rightarrow -\infty \end{aligned}$$

where \(\varvec{\kappa }_0\) is the vector of the concentration parameters of the true measure \(\gamma _0\), the second inequality follows from (24) and (27), and the last inequality follows from (28). \(\square \)

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit

About this article

Cite this article

Ng, T.L.J. Penalized maximum likelihood estimator for mixture of von Mises–Fisher distributions. Metrika (2022).


  • Mixture of von Mises–Fisher distributions
  • Penalized maximum likelihood estimation
  • Strong consistency