In this section, we propose alternatives to the classical EM algorithm for computing the parameters of the Student t distribution, along with convergence results. In particular, we are interested in estimating the degree of freedom parameter ν, for which the function F plays a central role.
Algorithm 1 with weights \(w_{i} = \frac {1}{n}\), i = 1,…,n, is the classical EM algorithm. Note that the function in the third M-Step
$$ \begin{array}{@{}rcl@{}} {\varPhi}_{r} \left( \frac{\nu}{2} \right) &:= \phi \left( \frac{\nu}{2} \right) \underbrace{ - \phi \left( \frac{\nu_{r} + d}{2} \right) + \sum\limits_{i=1}^{n} w_{i} \left( \gamma_{i,r} - \log(\gamma_{i,r} ) - 1 \right)}_{c_{r}} \end{array} $$
has a unique zero: by (8), the function ϕ is negative and monotone increasing with \(\lim _{x \rightarrow \infty } \phi (x) = 0^{-}\), and \(c_{r} > 0\). Concerning the convergence of the EM algorithm, it is known that the values of the objective function \(L(\nu_{r},\mu_{r},{\varSigma}_{r})\) are monotone decreasing in r and that a subsequence of the iterates converges to a critical point of L(ν, μ, Σ) if such a point exists, see [5].
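To illustrate the ν update of the classical EM step, the following minimal sketch finds the zero of \({\varPhi}_{r}\). It assumes \(\phi(x) = \psi(x) - \log(x)\) with the digamma function ψ, which is consistent with the properties of ϕ stated above; the function names and the bracketing strategy are our own choices, not the paper's.

```python
# Minimal sketch of the EM nu update: solve Phi_r(nu/2) = 0, assuming
# phi(x) = digamma(x) - log(x) as in (8); w and gamma_r hold the weights
# w_i and the current gamma_{i,r}.
import numpy as np
from scipy.special import digamma
from scipy.optimize import brentq

def phi(x):
    # phi < 0, strictly increasing, phi(x) -> 0^- for x -> infinity
    return digamma(x) - np.log(x)

def em_nu_update(nu_r, d, w, gamma_r):
    # c_r collects all terms of Phi_r that do not depend on the unknown nu
    c_r = -phi((nu_r + d) / 2) + np.sum(w * (gamma_r - np.log(gamma_r) - 1))
    g = lambda nu: phi(nu / 2) + c_r      # Phi_r(nu/2)
    lo, hi = 1e-6, 2.0
    while g(hi) < 0:                      # expand until the bracket
        hi *= 2.0                         # contains the sign change
    return brentq(g, lo, hi)              # unique zero since c_r > 0
```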
Algorithm 2 differs from the EM algorithm in the iteration of Σ, where the factor \(\frac {1}{\sum \limits _{i=1}^{n} w_{i} \gamma _{i,r}}\) is now incorporated. The computation of this factor requires no additional computational effort, but it speeds up the performance, in particular for smaller ν. This kind of acceleration was suggested in [12, 24]. For fixed ν ≥ 1, it was shown in [32] that this algorithm is indeed an EM algorithm arising from a different choice of the hidden variable than the one used in the standard approach, see also [15]. Thus, it follows for fixed ν ≥ 1 that the sequence \(L(\nu, \mu_{r},{\varSigma}_{r})\) is monotone decreasing. However, we also iterate over ν. In contrast to the EM Algorithm 1, our ν iteration step depends on \(\mu_{r+1}\) and \({\varSigma}_{r+1}\) instead of \(\mu_{r}\) and \({\varSigma}_{r}\). This is important for our convergence results. Note that in both cases the accelerated algorithm can no longer be interpreted as an EM algorithm, so the convergence results of the classical EM approach are no longer available.
Let us mention that a Jacobi variant of Algorithm 2 for fixed ν, i.e.,
$$ {\varSigma}_{r+1} = \sum\limits_{i=1}^{n} \frac{w_{i}\gamma_{i,r} (x_{i}-\mu_{r})(x_{i}-\mu_{r})^{\mathrm{T}} }{{\sum}_{i=1}^{n} w_{i} \gamma_{i,r}}, $$
with \(\mu_{r}\) instead of \(\mu_{r+1}\), together with a convergence proof, was suggested in [17]. The main reason for this index choice was that we were able to prove monotone convergence of a simplified version of the algorithm for estimating the location and scale of Cauchy noise (d = 1, ν = 1), which could not be achieved with the variant incorporating \(\mu_{r+1}\) (see [16]). This simplified version is known as the myriad filter in image processing. In this paper, we keep the original variant from the EM algorithm (14), since we are mainly interested in the computation of ν.
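For concreteness, here is a minimal sketch of the accelerated (μ, Σ) step, assuming the E-step weights \(\gamma_{i,r} = \frac{\nu_{r} + d}{\nu_{r} + \delta_{i,r}}\) with \(\delta_{i,r} = (x_{i} - \mu_{r})^{\mathrm T}{\varSigma}_{r}^{-1}(x_{i} - \mu_{r})\); the function name is hypothetical, and as in (14) the Σ update uses \(\mu_{r+1}\).

```python
# Sketch of the accelerated (mu, Sigma) step of Algorithm 2: compared with
# the classical EM step, Sigma is additionally divided by sum_i w_i gamma_i.
import numpy as np

def accelerated_m_step(X, w, nu, mu, Sigma):
    n, d = X.shape
    diff = X - mu                                     # x_i - mu_r
    delta = np.einsum('ni,ij,nj->n', diff,
                      np.linalg.inv(Sigma), diff)     # delta_{i,r}
    gamma = (nu + d) / (nu + delta)                   # gamma_{i,r}
    wg = w * gamma
    mu_new = wg @ X / wg.sum()                        # mu_{r+1}
    diff_new = X - mu_new
    # Sigma_{r+1} with the extra normalization by sum_i w_i gamma_{i,r}
    Sigma_new = (diff_new * wg[:, None]).T @ diff_new / wg.sum()
    return mu_new, Sigma_new
```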
Instead of the above algorithms, we suggest taking the critical point equation (4) more directly into account in the next two algorithms.
Finally, Algorithm 4 computes the update of ν by directly finding a zero of the whole function F in (4) given \(\mu_{r}\) and \({\varSigma}_{r}\). The existence of such a zero was discussed in the previous section. The zero is computed by an inner loop which iterates the update step of ν from Algorithm 3. We will see that this iteration indeed converges to a zero of F.
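A hedged sketch of this inner loop, again assuming \(\phi(x) = \psi(x) - \log(x)\) and with hypothetical function names, could look as follows; here delta holds the current distances \(\delta_{i,r+1}\).

```python
# Sketch of the inner loop of Algorithm 4 (GMMF): iterate the MMF-style
# partial zero nu_{l+1} = zero of F_a(x) + F_b(nu_l) until it stabilizes,
# which by Corollary 3 yields a zero of the full function F.
import numpy as np
from scipy.special import digamma
from scipy.optimize import brentq

def phi(x):
    return digamma(x) - np.log(x)

def gmmf_nu_update(nu_r, d, w, delta, tol=1e-12, max_iter=500):
    F_a = lambda x: phi(x / 2) - phi((x + d) / 2)
    F_b = lambda nu: np.sum(w * ((nu + d) / (nu + delta)
                                 - np.log((nu + d) / (nu + delta)) - 1))
    nu = nu_r
    for _ in range(max_iter):
        c, hi = F_b(nu), 2.0              # c > 0 since t - log(t) - 1 >= 0
        while F_a(hi) + c < 0:            # bracket the unique zero
            hi *= 2.0
        nu_new = brentq(lambda x: F_a(x) + c, 1e-6, hi)
        if abs(nu_new - nu) < tol:        # inner loop has converged
            return nu_new
        nu = nu_new
    return nu
```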
In the rest of this section, we prove that the sequence \((L(\nu_{r},\mu_{r},{\varSigma}_{r}))_{r}\) generated by Algorithms 2 and 3 decreases in each iteration step and that there exists a subsequence of the iterates which converges to a critical point.
We will need the following auxiliary lemma.
Lemma 1
Let \(F_{a},F_{b}\colon \mathbb {R}_{>0}\to \mathbb {R}\) be continuous functions, where \(F_{a}\) is strictly increasing and \(F_{b}\) is strictly decreasing. Define \(F := F_{a} + F_{b}\). For any initial value \(x_{0} > 0\), assume that the sequence generated by
$$ x_{l+1} = \text{ zero of } F_{a}(x)+F_{b}(x_{l}) $$
is uniquely determined, i.e., each of the functions on the right-hand side has a unique zero. Then the following holds:
- i) If \(F(x_{0}) < 0\), then \((x_{l})_{l}\) is strictly increasing and \(F(x) < 0\) for all \(x \in [x_{l},x_{l+1}]\), \(l \in \mathbb N_{0}\).
- ii) If \(F(x_{0}) > 0\), then \((x_{l})_{l}\) is strictly decreasing and \(F(x) > 0\) for all \(x \in [x_{l+1},x_{l}]\), \(l \in \mathbb N_{0}\).
Furthermore, assume that there exists \(x_{-} > 0\) with \(F(x) < 0\) for all \(x < x_{-}\) and \(x_{+} > 0\) with \(F(x) > 0\) for all \(x > x_{+}\). Then, the sequence \((x_{l})_{l}\) converges to a zero \(x^{*}\) of F.
Proof
We consider case i), i.e., \(F(x_{0}) < 0\); case ii) follows in a similar way.
We show by induction that \(F(x_{l}) < 0\) and \(x_{l+1} > x_{l}\) for all \(l \in \mathbb N_{0}\). Then it holds for all \(l\in \mathbb N_{0}\) and \(x \in (x_{l},x_{l+1})\) that \(F_{a}(x) + F_{b}(x) < F_{a}(x) + F_{b}(x_{l}) < F_{a}(x_{l+1}) + F_{b}(x_{l}) = 0\), where we used that \(F_{b}\) is strictly decreasing and \(F_{a}\) is strictly increasing. Together with \(F(x_{l}) < 0\) and \(F(x_{l+1}) < 0\), this yields \(F(x) < 0\) for all \(x \in [x_{l},x_{l+1}]\), \(l \in \mathbb N_{0}\).
Induction step. Let \(F_{a}(x_{l}) + F_{b}(x_{l}) < 0\). Since \(F_{a}(x_{l+1}) + F_{b}(x_{l}) = 0 > F_{a}(x_{l}) + F_{b}(x_{l})\) and \(F_{a}\) is strictly increasing, we have \(x_{l+1} > x_{l}\). Using that \(F_{b}\) is strictly decreasing, we get \(F_{b}(x_{l+1}) < F_{b}(x_{l})\) and consequently
$$ F(x_{l+1}) = F_{a}(x_{l+1}) + F_{b}(x_{l+1}) < F_{a}(x_{l+1}) + F_{b}(x_{l})=0. $$
Assume now that \(F(x) > 0\) for all \(x > x_{+}\). Since the sequence \((x_{l})_{l}\) is strictly increasing and \(F(x_{l}) < 0\), it must be bounded from above by \(x_{+}\). Therefore, it converges to some \(x^{*}\in \mathbb {R}_{>0}\). Now, it holds by the continuity of \(F_{a}\) and \(F_{b}\) that
$$ 0 =\lim\limits_{l\to\infty} F_{a}(x_{l+1}) + F_{b}(x_{l}) = F_{a}(x^{*}) + F_{b}(x^{*}) = F(x^{*}). $$
Hence \(x^{*}\) is a zero of F. □
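The mechanism of Lemma 1 can be checked on a toy example, say \(F_{a}(x) = \log(x)\) and \(F_{b}(x) = e^{-x} - 1\), where the partial zero has a closed form and no root finder is needed; this choice of \(F_{a}, F_{b}\) is purely illustrative.

```python
# Toy check of Lemma 1 (case i): with F_a(x) = log(x) increasing and
# F_b(x) = exp(-x) - 1 decreasing, the update x_{l+1} = zero of
# F_a(x) + F_b(x_l) reads x_{l+1} = exp(1 - exp(-x_l)).
import numpy as np

F = lambda x: np.log(x) + np.exp(-x) - 1

x = 0.1                               # F(0.1) < 0, so case i) applies
for _ in range(50):
    x = np.exp(1 - np.exp(-x))        # strictly increasing iterates
print(x, F(x))                        # converges to a zero of F (~ 2.51)
```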
For the setting in Algorithm 4, Lemma 1 implies the following corollary.
Corollary 3
Let \(F_{a} (\nu ) := \phi \left (\frac {\nu }{2} \right ) - \phi \left (\frac {\nu +d}{2} \right )\) and
$$ F_{b} (\nu) := \sum\limits_{i=1}^{n} w_{i} \left( \frac{\nu + d}{\nu + \delta_{i,r+1}} - \log\left( \frac{\nu + d}{\nu + \delta_{i,r+1}} \right) - 1 \right),\quad r \in \mathbb{N}_{0}. $$
Assume that there exists \(\nu_{+} > 0\) such that \(F := F_{a} + F_{b} > 0\) for all \(\nu \geq \nu_{+}\). Then the sequence \((\nu_{r,l})_{l}\) generated by the r-th inner loop of Algorithm 4 converges to a zero of F.
Note that by Corollary 2 the above condition on F is fulfilled in each iteration step, e.g., if \(\delta _{i,r} \not \in [d - \sqrt {2d} , d + \sqrt {2d}]\) for i = 1,…,n and \(r \in \mathbb {N}_{0}\).
Proof
From the previous section, we know that \(F_{a}\) is strictly increasing and \(F_{b}\) is strictly decreasing. Both functions are continuous. If \(F(\nu_{r}) < 0\), then we know from Lemma 1 that \((\nu_{r,l})_{l}\) is increasing and converges to a zero \(\nu _{r}^{*}\) of F.
If \(F(\nu_{r}) > 0\), then we know from Lemma 1 that \((\nu_{r,l})_{l}\) is decreasing. The condition that there exists \(x_{-}\in \mathbb {R}_{>0}\) with \(F(x) < 0\) for all \(x < x_{-}\) is fulfilled since \(\lim _{x \rightarrow 0} F(x) = -\infty \). Hence, by Lemma 1, the sequence converges to a zero \(\nu _{r}^{*}\) of F. □
To prove that the objective function decreases in each step of Algorithms 2–4, we need the following lemma.
Lemma 2
Let \(F_{a},F_{b}\colon \mathbb {R}_{>0}\to \mathbb {R}\) be continuous functions, where \(F_{a}\) is strictly increasing and \(F_{b}\) is strictly decreasing. Define \(F := F_{a} + F_{b}\) and let \(G\colon \mathbb {R}_{>0}\to \mathbb {R}\) be an antiderivative of F, i.e., \(F= \frac {\mathrm {d}}{\mathrm {d}x} G\). For an arbitrary \(x_{0} > 0\), let \((x_{l})_{l}\) be the sequence generated by
$$ x_{l+1} = \text{ zero of } F_{a}(x) + F_{b}(x_{l}). $$
Then the following holds true:
- i) The sequence \((G(x_{l}))_{l}\) is monotone decreasing, with \(G(x_{l}) = G(x_{l+1})\) if and only if \(x_{0}\) is a critical point of G. If \((x_{l})_{l}\) converges, then the limit \(x^{*}\) fulfills
$$ G(x_{0}) \geq G(x_{1}) \geq G(x^{*}), $$
with equality if and only if \(x_{0}\) is a critical point of G.
- ii) Let \(F = \tilde F_{a} + \tilde F_{b}\) be another splitting of F into continuous functions \(\tilde F_{a}, \tilde F_{b}\), where the first is strictly increasing and the second is strictly decreasing. Assume that \(\tilde F_{a}^{\prime }(x) > F_{a}^{\prime }(x)\) for all x > 0. Then it holds for \(y_{1} := \text { zero of } \tilde F_{a}(x) + \tilde F_{b}(x_{0})\) that \(G(x_{0}) \geq G(y_{1}) \geq G(x_{1})\), with equality if and only if \(x_{0}\) is a critical point of G.
Proof
i) If \(F(x_{0}) = 0\), then \(x_{0}\) is a critical point of G; in this case \(x_{l} = x_{0}\) for all l, so that \((G(x_{l}))_{l}\) is constant.
Let \(F(x_{0}) < 0\). By Lemma 1, we know that \((x_{l})_{l}\) is strictly increasing and that \(F(x) < 0\) for \(x \in [x_{l},x_{l+1}]\), \(l \in \mathbb {N}_{0}\). By the Fundamental Theorem of Calculus, it holds
$$ G(x_{l+1})=G(x_{l})+{\int}_{x_{l}}^{x_{l+1}} F(\nu) d\nu. $$
Thus, \(G(x_{l+1}) < G(x_{l})\).
Let \(F(x_{0}) > 0\). By Lemma 1, we know that \((x_{l})_{l}\) is strictly decreasing and that \(F(x) > 0\) for \(x \in [x_{l+1},x_{l}]\), \(l \in \mathbb {N}_{0}\). Then
$$ G(x_{l}) = G(x_{l+1})+{\int}_{x_{l+1}}^{x_{l}} F(\nu) d\nu $$
implies \(G(x_{l+1}) < G(x_{l})\). The rest of assertion i) follows immediately.

ii) It remains to show that \(G(x_{1}) \leq G(y_{1})\). Let \(F(x_{0}) < 0\). Then we have \(y_{1} \geq x_{0}\) and \(x_{1} \geq x_{0}\). By the Fundamental Theorem of Calculus, we obtain
$$ F(x_{0}) + {\int}_{x_{0}}^{x_{1}} F_{a}^{\prime}(x)dx = F_{a}(x_{0})+{\int}_{x_{0}}^{x_{1}} F_{a}^{\prime}(x) dx + F_{b} (x_{0}) = F_{a} (x_{1}) + F_{b} (x_{0})=0,$$
and
$$ F(x_{0}) + {\int}_{x_{0}}^{y_{1}} \tilde F_{a}^{\prime}(x)dx=\tilde F_{a}(x_{0})+{\int}_{x_{0}}^{y_{1}}\tilde F_{a}^{\prime}(x) dx+\tilde F_{b}(x_{0}) =\tilde F_{a}(y_{1})+\tilde F_{b}(x_{0})=0. $$
This yields
$$ {\int}_{x_{0}}^{x_{1}} F_{a}^{\prime}(x) dx={\int}_{x_{0}}^{y_{1}}\tilde F_{a}^{\prime}(x)dx, $$
and since \(\tilde F^{\prime }_{a}(x) > F^{\prime }_{a}(x)\), it follows that \(y_{1} \leq x_{1}\), with equality if and only if \(x_{0} = x_{1}\), i.e., if \(x_{0}\) is a critical point of G. Since \(F(x) < 0\) on \((x_{0},x_{1})\), it holds
$$ G(x_{1})=G(y_{1})+{\int}_{y_{1}}^{x_{1}}F(x) dx \leq G(y_{1}), $$
with equality if and only if \(x_{0} = x_{1}\). The case \(F(x_{0}) > 0\) can be handled similarly. □
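Lemma 2(ii) can likewise be checked numerically on toy splittings; here \(F(x) = \log(x) + e^{-x} - 1\) with antiderivative \(G(x) = x\log(x) - 2x - e^{-x}\), and the second splitting makes the increasing part steeper, so its step \(y_{1}\) stays behind \(x_{1}\). The functions are purely illustrative.

```python
# Toy check of Lemma 2(ii): the splitting whose increasing part has the
# larger slope takes the smaller step and decreases G less.
import numpy as np
from scipy.optimize import brentq

F = lambda x: np.log(x) + np.exp(-x) - 1
G = lambda x: x * np.log(x) - 2 * x - np.exp(-x)   # antiderivative of F

x0 = 0.1                                           # F(x0) < 0
# splitting 1: F_a = log(x), F_b = exp(-x) - 1  ->  x1 in closed form
x1 = np.exp(1 - np.exp(-x0))
# splitting 2: Ft_a = log(x) + x/2 (steeper), Ft_b = exp(-x) - 1 - x/2
y1 = brentq(lambda x: np.log(x) + x / 2 + np.exp(-x0) - 1 - x0 / 2,
            1e-9, 10.0)
print(y1 <= x1, G(x0) >= G(y1) >= G(x1))           # True True
```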
Lemma 2 implies the following relation between the values of the objective function L for Algorithms 2–4.
Corollary 4
For the same fixed \(\nu _{r}>0, \mu _{r}\in \mathbb {R}^{d}, {\varSigma }_{r}\in \text {SPD}(d)\), define \(\mu_{r+1}\), \({\varSigma}_{r+1}\), \(\nu _{r+1}^{\text {aEM}}\), \(\nu _{r+1}^{\text {MMF}}\), and \(\nu _{r+1}^{\text {GMMF}}\) by Algorithms 2, 3, and 4, respectively. For the GMMF algorithm, assume that the inner loop converges. Then it holds
$$ \begin{array}{@{}rcl@{}} L(\nu_{r},\mu_{r+1},{\varSigma}_{r+1}) &\geq& L(\nu_{r+1}^{\text{aEM}},\mu_{r+1},{\varSigma}_{r+1}) \geq L(\nu_{r+1}^{\text{MMF}},\mu_{r+1},{\varSigma}_{r+1})\\ &\geq& L(\nu_{r+1}^{\text{GMMF}},\mu_{r+1},{\varSigma}_{r+1}). \end{array} $$
Equality holds true if and only if \(\frac {\mathrm {d}}{\mathrm {d}\nu }L(\nu _{r},\mu _{r+1},{\varSigma }_{r+1})=0\), and in this case \(\nu _{r} = \nu _{r+1}^{\text {aEM}} = \nu _{r+1}^{\text {MMF}} = \nu _{r+1}^{\text {GMMF}}\).
Proof
For \(G(\nu) := L(\nu, \mu_{r+1},{\varSigma}_{r+1})\), we have \(\frac {\mathrm {d}}{\mathrm {d}\nu } L(\nu ,\mu _{r+1},{\varSigma }_{r+1}) = F(\nu )\), where
$$ F(\nu) := \phi\left( \frac{\nu}{2} \right) -\phi\left( \frac{\nu +d}{2} \right) + \sum\limits_{i=1}^{n} w_{i} \left( \frac{\nu + d}{\nu + \delta_{i,r+1}} - \log\left( \frac{\nu + d}{\nu + \delta_{i,r+1}} \right) - 1 \right). $$
We use the splitting
$$F = F_{a} + F_{b} = \tilde F_{a} + \tilde F_{b}$$
with
$$ F_{a} (\nu):= \phi\left( \frac\nu2 \right)- \phi\left( \frac{\nu + d}{2} \right), \quad \tilde F_{a}(\nu) := \phi\left( \frac\nu2 \right), $$
$$ F_{b}(\nu) := \sum\limits_{i=1}^{n} w_{i} \left( \frac{\nu + d}{\nu + \delta_{i,r+1}} - \log\left( \frac{\nu + d}{\nu + \delta_{i,r+1}} \right) - 1 \right), $$
and
$$ \tilde F_{b} (\nu):= - \phi \left( \frac{\nu+d}{2} \right) + F_{b}(\nu). $$
By the considerations in the previous section, we know that \(F_{a}\), \(\tilde F_{a}\) are strictly increasing and \(F_{b}\), \(\tilde F_{b}\) are strictly decreasing. Moreover, since \(\phi ^{\prime } > 0\) we have \(\tilde F^{\prime }_{a} > F^{\prime }_{a}\). Hence it follows from Lemma 2(ii) that
$$L(\nu_{r},\mu_{r+1},{\varSigma}_{r+1}) \ge L\left( \nu_{r+1}^{\text{aEM}},\mu_{r+1},{\varSigma}_{r+1}\right) \ge L\left( \nu_{r+1}^{\text{MMF}},\mu_{r+1},{\varSigma}_{r+1}\right).$$
Finally, we conclude by Lemma 2(i) that
$$L\left( \nu_{r+1}^{\text{MMF}},\mu_{r+1},{\varSigma}_{r+1}\right) \ge L\left( \nu_{r+1}^{\text{GMMF}},\mu_{r+1},{\varSigma}_{r+1}\right).$$
□
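The ordering of Corollary 4 can be verified numerically. The sketch below again assumes \(\phi(x) = \psi(x) - \log(x)\) and uses, as a stand-in for the ν-dependent part of L, a function G with \(G^{\prime} = F\); d, w, and delta are toy data playing the role of the quantities at iterate r + 1.

```python
# Sanity check of Corollary 4: aEM and MMF nu updates from the same
# iterate, compared through G with G' = F.
import numpy as np
from scipy.special import digamma, gammaln
from scipy.optimize import brentq

d, nu_r = 3, 1.0
rng = np.random.default_rng(0)
delta = rng.chisquare(d, size=200) * 3.0          # toy Mahalanobis distances
w = np.full(200, 1 / 200)

phi = lambda x: digamma(x) - np.log(x)
F_b = lambda nu: np.sum(w * ((nu + d) / (nu + delta)
                             - np.log((nu + d) / (nu + delta)) - 1))
# nu-dependent part of the objective, chosen such that dG/dnu = F
G = lambda nu: 2 * (gammaln(nu / 2) - gammaln((nu + d) / 2)
                    + d / 2 * np.log(nu)
                    + (nu + d) / 2 * np.sum(w * np.log(1 + delta / nu)))

def solve(g):                                     # zero of an increasing g
    lo, hi = 1e-6, 2.0
    while g(hi) < 0:
        hi *= 2.0
    return brentq(g, lo, hi)

nu_aEM = solve(lambda x: phi(x / 2) - phi((nu_r + d) / 2) + F_b(nu_r))
nu_MMF = solve(lambda x: phi(x / 2) - phi((x + d) / 2) + F_b(nu_r))
print(G(nu_r) >= G(nu_aEM) >= G(nu_MMF))          # True
```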
Concerning the convergence of the three algorithms we have the following result.
Theorem 3
Let \((\nu_{r},\mu_{r},{\varSigma}_{r})_{r}\) be the sequence generated by Algorithm 2, 3, or 4, respectively, starting with arbitrary initial values \(\nu _{0} >0,\mu _{0}\in \mathbb {R}^{d},{\varSigma }_{0}\in \text {SPD}(d)\). For the GMMF algorithm, we assume that in each step the inner loop converges. Then it holds for all \(r\in \mathbb N_{0}\) that
$$ L(\nu_{r},\mu_{r},{\varSigma}_{r}) \geq L(\nu_{r+1},\mu_{r+1},{\varSigma}_{r+1}), $$
with equality if and only if \((\nu_{r},\mu_{r},{\varSigma}_{r}) = (\nu_{r+1},\mu_{r+1},{\varSigma}_{r+1})\).
Proof
By the general convergence results of the accelerated EM algorithm for fixed ν, see also [17], it holds
$$ L(\nu_{r},\mu_{r+1},{\varSigma}_{r+1})\leq L(\nu_{r},\mu_{r},{\varSigma}_{r}), $$
with equality if and only if \((\mu_{r},{\varSigma}_{r}) = (\mu_{r+1},{\varSigma}_{r+1})\). By Corollary 4, it holds
$$ L(\nu_{r+1},\mu_{r+1},{\varSigma}_{r+1})\leq L(\nu_{r},\mu_{r+1},{\varSigma}_{r+1}), $$
with equality if and only if \(\nu_{r} = \nu_{r+1}\). The combination of both results proves the claim. □
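To make Theorem 3 concrete, the following self-contained sketch runs a few MMF iterations (Algorithm 3) on toy data and prints the objective after every step; the printed values are monotone decreasing. Here L is taken as twice the weighted negative log-likelihood with the constant \(d\log \pi\) term dropped, which is consistent with \(\frac{\mathrm {d}}{\mathrm {d}\nu }L = F\) from the proof of Corollary 4, and \(\phi(x) = \psi(x) - \log(x)\) is assumed as before.

```python
# End-to-end sketch of Algorithm 3 (MMF): accelerated (mu, Sigma) step
# followed by the nu update with delta_{i,r+1}; L decreases monotonically.
import numpy as np
from scipy.special import digamma, gammaln
from scipy.optimize import brentq

phi = lambda x: digamma(x) - np.log(x)

def mahalanobis(X, mu, Sigma):
    diff = X - mu
    return np.einsum('ni,ij,nj->n', diff, np.linalg.inv(Sigma), diff)

def L(nu, mu, Sigma, X, w):                       # up to additive constants
    d = X.shape[1]
    delta = mahalanobis(X, mu, Sigma)
    return (2 * (gammaln(nu / 2) - gammaln((nu + d) / 2)
                 + d / 2 * np.log(nu))
            + np.log(np.linalg.det(Sigma))
            + (nu + d) * np.sum(w * np.log(1 + delta / nu)))

rng = np.random.default_rng(1)
X = rng.standard_t(df=2.5, size=(500, 2))         # toy heavy-tailed sample
n, d = X.shape
w = np.full(n, 1 / n)
nu, mu, Sigma = 1.0, X.mean(axis=0), np.cov(X.T)
for r in range(20):
    gamma = (nu + d) / (nu + mahalanobis(X, mu, Sigma))   # gamma_{i,r}
    wg = w * gamma
    mu = wg @ X / wg.sum()                        # accelerated mu step
    diff = X - mu
    Sigma = (diff * wg[:, None]).T @ diff / wg.sum()      # accelerated Sigma
    delta = mahalanobis(X, mu, Sigma)             # delta_{i,r+1}
    c = np.sum(w * ((nu + d) / (nu + delta)
                    - np.log((nu + d) / (nu + delta)) - 1))
    g = lambda x: phi(x / 2) - phi((x + d) / 2) + c       # F_a(x) + F_b(nu_r)
    hi = 2.0
    while g(hi) < 0:
        hi *= 2.0
    nu = brentq(g, 1e-6, hi)                      # MMF nu update
    print(r, L(nu, mu, Sigma, X, w))              # monotone decreasing
```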
Lemma 3
Let \(T = (T_{1}, T_{2}, T_{3}): \mathbb {R}_{>0} \times \mathbb {R}^{d} \times SPD(d) \rightarrow \mathbb {R}_{>0} \times \mathbb {R}^{d} \times SPD(d)\) be the operator of one iteration step of Algorithm 2 (or 3). Then T is continuous.
Proof
We show the statement for Algorithm 3; for Algorithm 2, it can be shown analogously. Clearly, the mapping \((T_{2},T_{3})(\nu,\mu,{\varSigma})\) is continuous. Further, we have
$$T_{1}(\nu,\mu,{\varSigma}) = \text{zero of } {\varPsi}(x, \nu,T_{2}(\nu,\mu,{\varSigma}),T_{3}(\nu,\mu,{\varSigma})),$$
where
$$ \begin{array}{@{}rcl@{}} &{{\varPsi}}&(x, \nu,\mu,{\varSigma}) =\phi\left( \frac{x}{2}\right)-\phi\left( \frac{x+d}{2}\right)\\ &&+ \sum\limits_{i=1}^{n} w_{i}\left( \tfrac{\nu+d}{\nu+(x_{i}-\mu)^{T}{\varSigma}^{-1}(x_{i}-\mu)}-\log\left( \tfrac{\nu+d}{\nu+(x_{i}-\mu)^{T}{\varSigma}^{-1}(x_{i}-\mu)}\right)-1\right). \end{array} $$
It is sufficient to show that the zero of Ψ depends continuously on ν, \(T_{2}\), and \(T_{3}\). The continuously differentiable function Ψ is strictly increasing in x, so that \(\frac {\partial }{\partial x} {\varPsi }(x,\nu ,T_{2},T_{3})>0\). Since \({\varPsi }(T_{1},\nu ,T_{2},T_{3}) = 0\), the Implicit Function Theorem yields the following statement: there exists an open neighborhood U × V of \((T_{1},\nu ,T_{2},T_{3})\) with \(U\subset \mathbb {R}_{>0}\) and \(V\subset \mathbb {R}_{>0}\times \mathbb {R}^{d}\times \text{SPD}(d)\), and a continuously differentiable function G: V → U such that for all \((x, \nu, \mu, {\varSigma}) \in U \times V\) it holds
$$ {\varPsi}(x,\nu,\mu,{\varSigma})=0 \quad \text{if and only if}\quad G(\nu,\mu,{\varSigma})=x. $$
Thus, the zero of Ψ depends continuously on ν, \(T_{2}\), and \(T_{3}\). □
This implies the following theorem.
Theorem 4
Let \((\nu_{r},\mu_{r},{\varSigma}_{r})_{r}\) be the sequence generated by Algorithm 2 or 3 with arbitrary initial values \(\nu _{0} >0,\mu _{0}\in \mathbb {R}^{d},{\varSigma }_{0}\in \text {SPD}(d)\). Then every cluster point of \((\nu_{r},\mu_{r},{\varSigma}_{r})_{r}\) is a critical point of L.
Proof
The mapping T defined in Lemma 3 is continuous. Further, we know from its definition that (ν, μ, Σ) is a critical point of L if and only if it is a fixed point of T. Let \((\hat \nu ,\hat \mu ,\hat {\varSigma })\) be a cluster point of \((\nu_{r},\mu_{r},{\varSigma}_{r})_{r}\). Then there exists a subsequence \((\nu _{r_{s}},\mu _{r_{s}},{\varSigma }_{r_{s}})_{s}\) which converges to \((\hat \nu ,\hat \mu ,\hat {\varSigma })\). Further, we know by Theorem 3 that \(L_{r} = L(\nu_{r},\mu_{r},{\varSigma}_{r})\) is decreasing. Since \((L_{r})_{r}\) is bounded from below, it converges. Now it holds
$$ \begin{array}{@{}rcl@{}} L\left( \hat \nu,\hat \mu,\hat {\varSigma}\right)&=&\underset{s\to\infty}{\lim}L\left( \nu_{r_{s}},\mu_{r_{s}},{\varSigma}_{r_{s}}\right)\\ &=&\underset{s\to\infty}{\lim}L_{r_{s}}=\underset{s\to\infty}{\lim}L_{r_{s}+1}\\ &=&\underset{s\to\infty}{\lim}L\left( \nu_{r_{s}+1},\mu_{r_{s}+1},{\varSigma}_{r_{s}+1}\right)\\ &=&\underset{s\to\infty}{\lim}L\left( T\left( \nu_{r_{s}},\mu_{r_{s}},{\varSigma}_{r_{s}}\right)\right)=L\left( T\left( \hat\nu,\hat\mu,\hat{\varSigma}\right)\right). \end{array} $$
By Theorem 3 and the definition of T we have that L(ν, μ, Σ) = L(T(ν, μ, Σ)) if and only if (ν, μ, Σ) = T(ν, μ, Σ). By the definition of the algorithm this is the case if and only if (ν, μ, Σ) is a critical point of L. Thus \((\hat \nu ,\hat \mu ,\hat {\varSigma })\) is a critical point of L. □