Alternatives to the EM algorithm for ML estimation of location, scatter matrix, and degree of freedom of the Student t distribution

In this paper, we consider maximum likelihood estimation of the degree of freedom parameter ν, the location parameter µ and the scatter matrix Σ of the multivariate Student-t distribution. In particular, we are interested in estimating the degree of freedom parameter ν that determines the tails of the corresponding probability density function and has rarely been considered in detail in the literature so far. We prove that under certain assumptions a minimizer of the negative log-likelihood function exists, where we have to take special care of the case ν → ∞, for which the Student-t distribution approaches the Gaussian distribution. As alternatives to the classical EM algorithm we propose three other algorithms which cannot be interpreted as EM algorithms. For fixed ν, the first algorithm is an accelerated EM algorithm known from the literature. However, since we do not fix ν, we cannot apply standard convergence results for the EM algorithm. The other two algorithms differ from this algorithm in the iteration step for ν. We show how the objective function behaves for the different updates of ν and prove for all three algorithms that it decreases in each iteration step. We compare the algorithms as well as some accelerated versions by numerical simulation and apply one of them for estimating the degree of freedom parameter in images corrupted by Student-t noise.


Introduction
The motivation for this work arises from certain tasks in image processing, where the robustness of methods plays an important role. In this context, the Student-t distribution and the closely related Student-t mixture models became popular in various image processing tasks. In [31] it has been shown that Student-t mixture models are superior to Gaussian mixture models for modeling image patches and the authors proposed an application in image compression. Image denoising based on Student-t models was addressed in [17] and image deblurring in [6,34]. Further applications include robust image segmentation [4,25,29] as well as robust registration [8,35].
In one dimension and for ν = 1, the Student-t distribution coincides with the one-dimensional Cauchy distribution. This distribution has been proposed to model a very impulsive noise behavior, and one of the first papers which suggested a variational approach in connection with wavelet shrinkage for denoising of images corrupted by Cauchy noise was [3]. A variational method consisting of a data term that resembles the noise statistics and a total variation regularization term has been introduced in [23,28]. Based on an ML approach, the authors of [16] introduced a so-called generalized myriad filter that estimates both the location and the scale parameter of the Cauchy distribution. They used this filter in a nonlocal denoising approach, where for each pixel of the image they chose as samples of the distribution those pixels having a similar neighborhood and replaced the initial pixel by its filtered version.
We also want to mention that a unified framework for images corrupted by white noise that can handle (range constrained) Cauchy noise as well was suggested in [14].
In contrast to the above pixelwise replacement, the state-of-the-art algorithm of Lebrun et al. [18] for denoising images corrupted by white Gaussian noise restores the image patchwise based on a maximum a posteriori approach. In the Gaussian setting, their approach is equivalent to minimum mean square error estimation, and more generally, the resulting estimator can be seen as a particular instance of a best linear unbiased estimator (BLUE).
For denoising images corrupted by additive Cauchy noise, a similar approach was addressed in [17] based on ML estimation for the family of Student-t distributions, of which the Cauchy distribution forms a special case. The authors call this approach the generalized multivariate myriad filter. However, all these approaches assume that the degree of freedom parameter ν of the Student-t distribution is known, which might not be the case in practice. In this paper we consider the estimation of the degree of freedom parameter based on an ML approach. In contrast to maximum likelihood estimators of the location and/or scatter parameter(s) µ and Σ, to the best of our knowledge the question of existence of a joint maximum likelihood estimator has not been analyzed before, and in this paper we provide first results in this direction. Usually the negative log-likelihood functions of the Student-t distributions and mixture models are minimized using the EM algorithm derived, e.g., in [13,21,22,26]. For fixed ν, there exists an accelerated EM algorithm [12,24,32] which appears to be more efficient than the classical one for smaller parameters ν. We examine the convergence of the accelerated version if also the degree of freedom parameter ν has to be estimated. Also for unknown degrees of freedom, there exists an accelerated version of the EM algorithm, the so-called ECME algorithm [20], which differs from our algorithm. Further, we propose two modifications of the ν iteration step which lead to efficient algorithms for a wide range of parameters ν. Finally, we address further accelerations of our algorithms by the squared iterative methods (SQUAREM) [33] and the damped Anderson acceleration with restarts and ε-monotonicity (DAAREM) [9].
The paper is organized as follows: In Section 2 we introduce the Student-t distribution, the negative log-likelihood function L and its derivatives. The question of the existence of a minimizer of L is addressed in Section 3. Section 4 deals with the solution of the equation arising when setting the gradient of L with respect to ν to zero. The results of this section will be important for the convergence considerations of our algorithms in Section 5, where we propose three alternatives to the classical EM algorithm and prove that the objective function L decreases for the iterates produced by these algorithms. Finally, we provide two kinds of numerical results in Section 6. First, we compare the different algorithms by numerical examples which indicate that the new ν iterations are very efficient for estimating ν of different magnitudes. Second, we come back to the original motivation of this paper and estimate the degree of freedom parameter ν from images corrupted by one-dimensional Student-t noise.

Likelihood of the Multivariate Student-t Distribution
The density function of the d-dimensional Student-t distribution T_ν(µ, Σ) with ν > 0 degrees of freedom, location parameter µ ∈ R^d and symmetric, positive definite scatter matrix Σ ∈ SPD(d) is given by

p(x | ν, µ, Σ) = Γ((d+ν)/2) / ( Γ(ν/2) (νπ)^{d/2} |Σ|^{1/2} ) · ( 1 + (x − µ)^T Σ^{−1} (x − µ)/ν )^{−(d+ν)/2},

with the Gamma function Γ(s) := ∫_0^∞ t^{s−1} e^{−t} dt. The expectation of the Student-t distribution is E(X) = µ for ν > 1 and the covariance matrix is given by Cov(X) = ν/(ν−2) Σ for ν > 2; otherwise these quantities are undefined. The smaller the value of ν, the heavier are the tails of the T_ν(µ, Σ) distribution. For ν → ∞, the Student-t distribution T_ν(µ, Σ) converges to the normal distribution N(µ, Σ), and for ν = 0 it is related to the projected normal distribution on the sphere S^{d−1} ⊂ R^d. Figure 1 illustrates this behavior. The Student-t distributions belong to the class of elliptically symmetric distributions. These distributions are stable under linear transforms in the following sense: Let X ∼ T_ν(µ, Σ), let A ∈ R^{d×d} be an invertible matrix and let b ∈ R^d. Then AX + b ∼ T_ν(Aµ + b, AΣA^T). Furthermore, T_ν(µ, Σ) admits the following stochastic representation, which can be used to generate samples from T_ν(µ, Σ) based on samples from the multivariate standard normal distribution N(0, I) and the Gamma distribution Γ(ν/2, ν/2): Let Z ∼ N(0, I) and Y ∼ Γ(ν/2, ν/2) be independent; then

X = µ + Σ^{1/2} Z / √Y ∼ T_ν(µ, Σ).    (1)

For i.i.d. samples x_i ∈ R^d, i = 1, ..., n, the likelihood function of the Student-t distribution is the product of the densities p(x_i | ν, µ, Σ), and the log-likelihood function is its logarithm. In the following, we are interested in the negative log-likelihood function, which up to the factor 2/n and with weights w_i = 1/n reads as

L(ν, µ, Σ) = −2 log Γ((d+ν)/2) + 2 log Γ(ν/2) + d log(νπ) + log |Σ| + (d+ν) ∑_{i=1}^n w_i log( 1 + (x_i − µ)^T Σ^{−1} (x_i − µ)/ν ).

In this paper, we allow for arbitrary weights from the open probability simplex ∆_n := {w = (w_1, ..., w_n) ∈ R^n_{>0} : ∑_{i=1}^n w_i = 1}. In this way, we might express different levels of confidence in single samples or handle the occurrence of multiple samples. Using ∂ log(|X|)/∂X = X^{−1}, see [27], the derivatives of L with respect to µ, Σ and ν are given by

∂L/∂µ = −2(d+ν) ∑_{i=1}^n w_i Σ^{−1}(x_i − µ)/(ν + δ_i),
∂L/∂Σ = Σ^{−1} − (d+ν) ∑_{i=1}^n w_i Σ^{−1}(x_i − µ)(x_i − µ)^T Σ^{−1}/(ν + δ_i),
∂L/∂ν = φ(ν/2) − φ((d+ν)/2) + ∑_{i=1}^n w_i ( (d+ν)/(ν + δ_i) − log((d+ν)/(ν + δ_i)) ) − 1,

with δ_i := (x_i − µ)^T Σ^{−1}(x_i − µ), the function φ(x) := ψ(x) − log(x), x > 0, and the digamma function ψ(x) := Γ'(x)/Γ(x). Setting the derivatives to zero results in the equations

µ = ( ∑_{i=1}^n w_i x_i/(ν + δ_i) ) / ( ∑_{i=1}^n w_i/(ν + δ_i) ),    (2)
Σ = (d+ν) ∑_{i=1}^n w_i (x_i − µ)(x_i − µ)^T/(ν + δ_i),    (3)
0 = F(ν) := φ(ν/2) − φ((d+ν)/2) + ∑_{i=1}^n w_i ( (d+ν)/(ν + δ_i) − log((d+ν)/(ν + δ_i)) ) − 1.    (4)

Computing the trace of both sides of (3) multiplied by Σ^{−1} and using the linearity and permutation invariance of the trace operator, we obtain d = (d+ν) ∑_{i=1}^n w_i δ_i/(ν + δ_i), or equivalently ∑_{i=1}^n w_i (d+ν)/(ν + δ_i) = 1.
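The stochastic representation (1) translates directly into a sampling routine. The following sketch, which assumes NumPy/SciPy and the notation introduced above (the function names and the uniform weights are illustrative choices, not part of the paper), draws samples from T_ν(µ, Σ) and evaluates the weighted negative log-likelihood L.

```python
import numpy as np
from scipy.special import gammaln

def sample_student_t(n, nu, mu, Sigma, seed=None):
    """Draw n samples from T_nu(mu, Sigma) via the stochastic representation (1)."""
    rng = np.random.default_rng(seed)
    d = mu.shape[0]
    Z = rng.multivariate_normal(np.zeros(d), Sigma, size=n)   # Sigma^{1/2} Z with Z ~ N(0, I)
    Y = rng.gamma(shape=nu / 2, scale=2 / nu, size=n)          # Y ~ Gamma(nu/2, nu/2)
    return mu + Z / np.sqrt(Y)[:, None]

def neg_log_likelihood(nu, mu, Sigma, X, w=None):
    """Weighted negative log-likelihood L(nu, mu, Sigma) (up to the factor 2/n)."""
    n, d = X.shape
    w = np.full(n, 1.0 / n) if w is None else w
    diff = X - mu
    delta = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(Sigma), diff)   # delta_i
    return (-2 * gammaln((d + nu) / 2) + 2 * gammaln(nu / 2) + d * np.log(nu * np.pi)
            + np.log(np.linalg.det(Sigma)) + (d + nu) * np.sum(w * np.log1p(delta / nu)))

# small usage example
mu, Sigma = np.zeros(2), np.eye(2)
X = sample_student_t(1000, nu=3.0, mu=mu, Sigma=Sigma, seed=0)
print(neg_log_likelihood(3.0, mu, Sigma, X))
```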
We are interested in critical points of the negative log-likelihood function L, i.e., in solutions (µ, Σ, ν) of (2)–(4), and in particular in minimizers of L.

Existence of Critical Points
In this section, we examine whether the negative log-likelihood function L has a minimizer, where we restrict our attention to the case µ = 0. For an approach how to extend the results to arbitrary µ for fixed ν we refer to [17]. To the best of our knowledge, this is the first work that provides results in this direction. The question of existence is, however, crucial in the context of ML estimation, since it lays the foundation for any convergence result for the EM algorithm or its variants. In fact, the authors of [13] observed the divergence of the EM algorithm in some of their numerical experiments, which is in accordance with our observations.
For fixed ν > 0, it is known that there exists a unique solution of (3), and for ν = 0 that there exist solutions of (3) which differ only by a multiplicative positive constant, see, e.g., [17]. In contrast, if we do not fix ν, we roughly have to distinguish between the two cases that the samples tend to come from a Gaussian distribution, i.e., ν → ∞, or not. The results are presented in Theorem 3.2.
For µ = 0, the negative log-likelihood function becomes

L(ν, Σ) = −2 log Γ((d+ν)/2) + 2 log Γ(ν/2) + d log(νπ) + log |Σ| + (d+ν) ∑_{i=1}^n w_i log( 1 + x_i^T Σ^{−1} x_i/ν ).

Further, for a fixed ν > 0, set L_ν(Σ) := L(ν, Σ). To prove the next existence theorem we will need two lemmas, whose proofs are given in the appendix.

Theorem 3.2. Let x_i ∈ R^d, i = 1, ..., n, and w ∈ ∆_n fulfill Assumption 3.1. Then one of the following cases holds:
(i) There exists a minimizing sequence (ν_r, Σ_r)_r of L(ν, Σ) such that (ν_r)_r has a bounded subsequence. Then L(ν, Σ) has a minimizer.
(ii) For every minimizing sequence (ν_r, Σ_r)_r of L(ν, Σ) we have lim_{r→∞} ν_r = ∞. Then (Σ_r)_r converges to the maximum likelihood estimator Σ̂ = ∑_{i=1}^n w_i x_i x_i^T of the normal distribution N(0, Σ̂).
Proof. Case 1: Assume that there exists a minimizing sequence (ν_r, Σ_r)_r of L such that (ν_r)_r has a bounded subsequence. In particular, using Lemma A.1, we have that (ν_r)_r has a cluster point ν* > 0 and a subsequence (ν_{r_k})_k converging to ν*. Clearly, the sequence (ν_{r_k}, Σ_{r_k})_k is again a minimizing sequence, so that we skip the second index in the following. By Lemma A.2, the set {Σ_r : r ∈ N} is a compact subset of SPD(d). Therefore there exists a subsequence (Σ_{r_k})_k which converges to some Σ* ∈ SPD(d). By continuity of L we conclude that (ν*, Σ*) is a minimizer of L.
Case 2: Assume that for every minimizing sequence (ν_r, Σ_r)_r it holds that ν_r → ∞ as r → ∞. We rewrite the likelihood function as in (5). Next we show by contradiction that {Σ_r : r ∈ N} is contained in SPD(d) and bounded: Denote the eigenvalues of Σ_r by λ_{r1} ≥ ... ≥ λ_{rd}. Assume that either {λ_{r1} : r ∈ N} is unbounded or that {λ_{rd} : r ∈ N} has zero as a cluster point. Then we know by [17, Theorem 4.3] that there exists a subsequence of (Σ_r)_r, which we again denote by (Σ_r)_r, such that for any fixed ν > 0 the values L_ν(Σ_r) tend to infinity as r → ∞. By (5) this yields L(ν_r, Σ_r) → ∞, which contradicts the assumption that (ν_r, Σ_r)_r is a minimizing sequence of L. Hence {Σ_r : r ∈ N} is a bounded subset of SPD(d).
Finally, we show that any subsequence of (Σ_r)_r has a subsequence which converges to Σ̂ = ∑_{i=1}^n w_i x_i x_i^T. Then the whole sequence (Σ_r)_r converges to Σ̂. Let (Σ_{r_k})_k be a subsequence of (Σ_r)_r. Since it is bounded, it has a convergent subsequence (Σ_{r_{k_l}})_l which converges to some Σ̃ in the closure of {Σ_r : r ∈ N} ⊂ SPD(d). For simplicity, we denote (Σ_{r_{k_l}})_l again by (Σ_r)_r. Since (Σ_r)_r converges, we know that also (x_i^T Σ_r^{−1} x_i)_r converges and is bounded. By lim_{r→∞} ν_r = ∞ we know that the functions x ↦ (1 + x/ν_r)^{ν_r} converge locally uniformly to x ↦ exp(x) as r → ∞. Thus we can pass to the limit in (5) and obtain that the limit of L(ν_r, Σ_r) coincides, up to an additive constant, with the negative log-likelihood of the normal distribution N(0, Σ̃). By taking the derivative with respect to Σ we see that the latter is minimal if and only if Σ = Σ̂. On the other hand, by similar computations as above we get the reverse estimate for the infimum, so that Σ̃ = Σ̂. This finishes the proof.

Zeros of F
In this section, we are interested in the existence of solutions of (4), i.e., in zeros of F for arbitrary fixed µ and Σ. Setting x := ν/2 > 0, t := d/2 and s_i := δ_i/2 = (x_i − µ)^T Σ^{−1}(x_i − µ)/2, we rewrite the function F in (4) as

F(ν) = ∑_{i=1}^n w_i F_{s_i}(x),   F_s(x) := A(x) + B_s(x),    (7)

where

A(x) := φ(x) − φ(x + t)   and   B_s(x) := (x + t)/(x + s) − log( (x + t)/(x + s) ) − 1.

The digamma function ψ and φ = ψ − log(·) are well examined in the literature, see [1]. The function φ(x) is the expectation of the logarithm of a random variable which is Γ(x, x) distributed, and it is well known that −φ is completely monotone. This implies that the negative of A is also completely monotone, i.e., for all x > 0 and m ∈ N_0 we have (−1)^m A^{(m)}(x) ≤ 0; in particular A < 0, A' > 0 and A'' < 0. On the other hand, we have that B_s(x) ≡ 0 if s = t, in which case F_s = A < 0 and has therefore no zero. If s ≠ t, then B_s is completely monotone, i.e., for all x > 0 and m ∈ N_0 we have (−1)^m B_s^{(m)}(x) ≥ 0; in particular B_s > 0, B_s' < 0 and B_s'' > 0. If the samples x_i stem from a Gaussian random variable X with zero mean and covariance Σ, then Y := X^T Σ^{−1} X is χ²_d distributed with E(Y) = d and Var(Y) = 2d. Thus we would expect that for samples x_i from such a random variable X the corresponding values s_i cluster around t. These considerations are reflected in the following theorem and corollary.

Theorem 4.1. For F_s : R_{>0} → R given by (7) the following relations hold true:
i) If s ∈ [t − √t, t + √t], then F_s(x) < 0 for all x > 0; in particular, F_s has no zero.
ii) If s ∉ [t − √t, t + √t], then there exists x_+ such that F_s(x) > 0 for all x ≥ x_+. In particular, F_s has a zero.
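To make the role of F tangible, the following sketch evaluates F(ν) = ∑_i w_i F_{s_i}(ν/2) via the decomposition above and brackets a zero with a standard root finder; it reuses the samples X, µ and Σ from the previous snippet and is only an illustration under these assumptions.

```python
import numpy as np
from scipy.special import digamma
from scipy.optimize import brentq

def phi(x):
    """phi(x) = psi(x) - log(x) with the digamma function psi."""
    return digamma(x) - np.log(x)

def F_of_nu(nu, mu, Sigma, X, w=None):
    """Critical point function F(nu) from (4) for fixed mu and Sigma, via (7)."""
    n, d = X.shape
    w = np.full(n, 1.0 / n) if w is None else w
    x, t = nu / 2.0, d / 2.0
    diff = X - mu
    s = 0.5 * np.einsum('ij,jk,ik->i', diff, np.linalg.inv(Sigma), diff)  # s_i = delta_i / 2
    gamma = (x + t) / (x + s)
    A = phi(x) - phi(x + t)
    B = gamma - np.log(gamma) - 1.0                                        # B_{s_i}(x)
    return A + np.sum(w * B)

# usage (X, mu, Sigma from the sampling snippet); bracket a sign change before calling brentq
# if F_of_nu(0.1, mu, Sigma, X) < 0 < F_of_nu(1000.0, mu, Sigma, X):
#     nu_hat = brentq(F_of_nu, 0.1, 1000.0, args=(mu, Sigma, X))
```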
Proof. We want to sandwich F_s between two rational functions P_s and P_s + Q whose zeros can easily be described.
Since the trigamma function ψ' has the series representation ψ'(x) = ∑_{k=0}^∞ 1/(x + k)², see [1], we obtain a corresponding series representation of the terms in F_s. For x > 0, let R(x) and T(x) denote the rectangular and trapezoidal rule, respectively, for computing the integral I(x) with step size 1.
By considering the first and second derivatives of g we see that the integrand in I(x) is strictly decreasing and strictly convex. Thus, P_s(x) < F_s(x). Using Euler's summation formula, we obtain an upper bound of the form P_s(x) + Q(x). The leading coefficient of x^5 of the polynomial in the numerator is 2(t − (s − t)²), which fulfills (12). Therefore, if s ∉ [t − √t, t + √t], then there exists x_+ large enough such that the numerator becomes smaller than zero for all x ≥ x_+. Consequently, by (10), F_s(x) ≤ P_s(x) + Q(x) < 0 for all x ≥ x_+. Thus, F_s is decreasing on [x_+, ∞). By (10), we conclude that F_s has a zero.

The following corollary states that F_s has exactly one zero if s > t + √t. Unfortunately, we do not have such a result for s < t − √t.

Corollary 4.2. Let F_s : R_{>0} → R be given by (7). If s > t + √t and t ≥ 1, then F_s has exactly one zero.

Proof. By Theorem 4.1 ii) and since lim_{x→0} F_s(x) = −∞ and lim_{x→∞} F_s(x) = 0^+, it remains to prove that F_s has at most one zero. Let x_0 > 0 be the smallest number such that F_s(x_0) = 0.
We prove that F_s(x) < 0 for all x > x_0. To this end, we show that h_s(x) is strictly decreasing. By (11) we have, for s > t, a representation in terms of I(x) and R(x), where I(x) is the integral and R(x) the corresponding rectangular rule with step size 1 of the function g := g_1 + g_2. We show that R(x) − I(x) < 0 for all x > 0. Let T(x), T_i(x) be the trapezoidal rules with step size 1 corresponding to I(x) and I_i(x), i = 1, 2. Since g_2 is a decreasing, concave function, we conclude T_2(x) − I_2(x) < 0. Using Euler's summation formula in (13) for g_1 and since g_1 is a positive function, we can write the remaining difference as a rational expression. All coefficients of x are less than or equal to zero for t ≥ 1, which implies that h_s is strictly decreasing.
Theorem 4.1 implies the following corollary.
we have by Theorem 4.1 that F_{s_i}(x) < 0 for all x > 0. Clearly, the same holds true for the whole function F, so that it cannot have a zero.
Since lim_{x→0} F(x) = −∞, this implies that F has a zero.

Algorithms
In this section, we propose alternatives to the classical EM algorithm for computing the parameters of the Student-t distribution along with convergence results. In particular, we are interested in estimating the degree of freedom parameter ν, where the function F is of particular interest.

Algorithm 1: EM algorithm (EM) with weights w
The equation in the ν-update of Algorithm 1, which is of the form φ(ν/2) + c_r = 0, has a unique solution since by (8) the function φ < 0 is monotonically increasing with lim_{x→∞} φ(x) = 0^−, and c_r > 0. Concerning the convergence of the EM algorithm, it is known that the values of the objective function L(ν_r, µ_r, Σ_r) are monotonically decreasing in r and that a subsequence of the iterates converges to a critical point of L(ν, µ, Σ) if such a point exists, see [5].
Algorithm 2 differs from the EM algorithm in the iteration of Σ, where an additional normalizing factor is now incorporated. The computation of this factor requires no additional computational effort, but speeds up the performance in particular for smaller ν. This kind of acceleration was suggested in [12,24]. For fixed ν ≥ 1, it was shown in [32] that this algorithm is indeed an EM algorithm arising from another choice of the hidden variable than the one used in the standard approach, see also [15]. Thus, it follows for fixed ν ≥ 1 that the sequence L(ν, µ_r, Σ_r) is monotonically decreasing. However, we also iterate over ν. In contrast to the EM Algorithm 1, our ν iteration step depends on µ_{r+1} and Σ_{r+1} instead of µ_r and Σ_r. This is important for our convergence results. Note that in both cases the accelerated algorithm can no longer be interpreted as an EM algorithm, so that the convergence results of the classical EM approach are no longer available.
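For illustration, the following sketch performs the µ and Σ updates of such an accelerated EM-type iteration for fixed ν. The update formulas written in the comments follow the standard accelerated EM scheme from the literature [12,24,32] and are an assumption of this sketch rather than a verbatim copy of Algorithm 2; the ν-step is omitted.

```python
import numpy as np

def aem_step_fixed_nu(X, w, nu, mu, Sigma):
    """One accelerated EM-type update of (mu, Sigma) for fixed nu.

    Assumed updates: gamma_i = (d + nu) / (nu + delta_i),
      mu^+    = sum_i w_i gamma_i x_i / sum_i w_i gamma_i,
      Sigma^+ = sum_i w_i gamma_i (x_i - mu^+)(x_i - mu^+)^T / sum_i w_i gamma_i.
    """
    n, d = X.shape
    diff = X - mu
    delta = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(Sigma), diff)
    gamma = (d + nu) / (nu + delta)                     # E-step weights
    wg = w * gamma
    mu_new = (wg @ X) / wg.sum()                        # weighted location update
    diff_new = X - mu_new
    Sigma_new = (diff_new * wg[:, None]).T @ diff_new / wg.sum()   # normalized scatter update
    return mu_new, Sigma_new

# usage: iterate until the parameters stabilize (X as before, uniform weights, fixed nu)
# w = np.full(len(X), 1.0 / len(X))
# for _ in range(100):
#     mu, Sigma = aem_step_fixed_nu(X, w, 3.0, mu, Sigma)
```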
Let us mention that a Jacobi variant of Algorithm 2 for fixed ν, i.e., with the Σ update performed with µ_r instead of µ_{r+1}, including a convergence proof, was suggested in [17]. The main reason for this index choice was that we were able to prove monotone convergence of a simplified version of the algorithm for estimating the location and scale of Cauchy noise (d = 1, ν = 1), which could not be achieved with the variant incorporating µ_{r+1}, see [16]. This simplified version is known as the myriad filter in image processing. In this paper, we keep the original variant (14) from the EM algorithm since we are mainly interested in the computation of ν.
Instead of the above algorithms, we suggest taking the critical point equation (4) more directly into account in the next two algorithms.
Algorithm 3 computes in its ν-step a zero of F_a(·) + F_b(ν_r), where F = F_a + F_b is a splitting of the function F in (4) into a continuous, strictly increasing part F_a and a continuous, strictly decreasing part F_b. The following lemma analyzes the corresponding fixed point iteration.

Lemma 5.1. Let F_a, F_b : R_{>0} → R be continuous functions, where F_a is strictly increasing and F_b is strictly decreasing. Define F := F_a + F_b. For any initial value x_0 > 0, assume that the sequence generated by

x_{l+1} := zero of F_a(x) + F_b(x_l),   l ∈ N_0,

is uniquely determined, i.e., the functions on the right-hand side have a unique zero. Then it holds:
i) If F(x_0) < 0, then (x_l)_l is strictly increasing and F(x) < 0 for all x ∈ [x_l, x_{l+1}], l ∈ N_0.
ii) If F(x_0) > 0, then (x_l)_l is strictly decreasing and F(x) > 0 for all x ∈ [x_{l+1}, x_l], l ∈ N_0.
Furthermore, assume that there exists x_− > 0 with F(x) < 0 for all x < x_− and x_+ > 0 with F(x) > 0 for all x > x_+. Then the sequence (x_l)_l converges to a zero x* of F.
Proof. We consider case i), i.e., F(x_0) < 0; case ii) follows in a similar way. We show by induction that F(x_l) < 0 and x_{l+1} > x_l for all l ∈ N_0. Since F_a(x_{l+1}) + F_b(x_l) = 0 > F(x_l) = F_a(x_l) + F_b(x_l) and F_a is strictly increasing, we have x_{l+1} > x_l. Using that F_b is strictly decreasing, we get F(x_{l+1}) = F_a(x_{l+1}) + F_b(x_{l+1}) < F_a(x_{l+1}) + F_b(x_l) = 0. Assume now that F(x) > 0 for all x > x_+. Since the sequence (x_l)_l is strictly increasing and F(x_l) < 0, it must be bounded from above by x_+. Therefore it converges to some x* ∈ R_{>0}. Now it holds by the continuity of F_a and F_b that F(x*) = F_a(x*) + F_b(x*) = lim_{l→∞} ( F_a(x_{l+1}) + F_b(x_l) ) = 0. Hence x* is a zero of F.
For the setting in Algorithm 4, Lemma 5.1 implies the following corollary.

Corollary 5.2. Assume that there exists x_+ > 0 with F(x) > 0 for all x > x_+. Then, for any ν_r > 0, the inner sequence (ν_{r,l})_l generated by Algorithm 4 converges to a zero ν*_r of F.

Proof. From the previous section we know that F_a is strictly increasing and F_b is strictly decreasing. Both functions are continuous. If F(ν_r) < 0, then we know from Lemma 5.1 that (ν_{r,l})_l is increasing and converges to a zero ν*_r of F. If F(ν_r) > 0, then we know from Lemma 5.1 that (ν_{r,l})_l is decreasing. The condition that there exists x_− ∈ R_{>0} with F(x) < 0 for all x < x_− is fulfilled since lim_{x→0} F(x) = −∞.
Hence, by Lemma 5.1, the sequence converges to a zero ν * r of F .
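The splitting iteration of Lemma 5.1 is easy to reproduce on a toy example. The sketch below uses the hypothetical splitting F_a(x) = log x (strictly increasing) and F_b(x) = 1/(1+x) − 1/2 (strictly decreasing), for which F = F_a + F_b has its zero at x = 1 and the inner update x_{l+1} = zero of F_a(·) + F_b(x_l) has a closed form.

```python
import numpy as np

# toy splitting: F_a increasing, F_b decreasing, F = F_a + F_b vanishes at x = 1
F_a = np.log
F_b = lambda x: 1.0 / (1.0 + x) - 0.5

def splitting_iteration(x0, steps=20):
    """x_{l+1} solves F_a(x) + F_b(x_l) = 0, here explicitly x_{l+1} = exp(-F_b(x_l))."""
    xs = [x0]
    for _ in range(steps):
        xs.append(np.exp(-F_b(xs[-1])))
    return np.array(xs)

xs = splitting_iteration(0.2)                 # F(0.2) < 0, so the sequence increases (Lemma 5.1 i))
print(xs[-1], F_a(xs[-1]) + F_b(xs[-1]))      # close to 1 and 0, respectively
```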
To prove that the objective function decreases in each step of Algorithms 2–4 we need the following lemma.

Lemma 5.3. Let F_a, F_b : R_{>0} → R be continuous functions, where F_a is strictly increasing and F_b is strictly decreasing. Define F := F_a + F_b and let G : R_{>0} → R be an antiderivative of F, i.e., F = d/dx G. For an arbitrary x_0 > 0, let (x_l)_l be the sequence generated by x_{l+1} := zero of F_a(x) + F_b(x_l). Then the following holds true:
i) The sequence (G(x_l))_l is monotonically decreasing; in particular, G(x_1) ≤ G(x_0) with equality if and only if x_0 is a critical point of G.
ii) Let F = F̃_a + F̃_b be another splitting of F with continuous functions F̃_a, F̃_b, where the first one is strictly increasing and the second one strictly decreasing. Assume that F̃_a(x) > F_a(x) for all x > 0. Then it holds for y_1 := zero of F̃_a(x) + F̃_b(x_0) that G(x_0) ≥ G(y_1) ≥ G(x_1), with equality if and only if x_0 is a critical point of G.
Proof. i) Let F(x_0) < 0. By Lemma 5.1 we know that (x_l)_l is strictly increasing and that F(x) < 0 for all x ∈ [x_l, x_{l+1}], l ∈ N_0. By the Fundamental Theorem of Calculus it holds G(x_{l+1}) − G(x_l) = ∫_{x_l}^{x_{l+1}} F(t) dt < 0. Let F(x_0) > 0. By Lemma 5.1 we know that (x_l)_l is strictly decreasing and that F(x) > 0 for all x ∈ [x_{l+1}, x_l], which analogously implies G(x_{l+1}) < G(x_l). Now, the rest of assertion i) follows immediately.
ii) It remains to show that G(x_1) ≤ G(y_1). Let F(x_0) < 0. Then we have y_1 ≥ x_0 and x_1 ≥ x_0. By the Fundamental Theorem of Calculus and since F̃_a(x) > F_a(x), we obtain y_1 ≤ x_1, with equality if and only if x_0 = x_1. Since F(x) < 0 on [x_0, x_1], this yields G(y_1) ≥ G(x_1), with equality if and only if x_0 = x_1. The case F(x_0) > 0 can be handled similarly.
Concerning the convergence of the three algorithms we have the following result.
Lemma 5.6. Let T = (T_1, T_2, T_3) : R_{>0} × R^d × SPD(d) → R_{>0} × R^d × SPD(d) be the operator of one iteration step of Algorithm 2 (or 3). Then T is continuous.
Proof. We show the statement for Algorithm 3; for Algorithm 2 it can be shown analogously. Clearly the mapping (T_2, T_3)(ν, µ, Σ) is continuous. Hence it is sufficient to show that the zero of Ψ depends continuously on ν, T_2 and T_3. Now the continuously differentiable function Ψ is strictly increasing in x, so that ∂/∂x Ψ(x, ν, T_2, T_3) > 0. By Ψ(T_1, ν, T_2, T_3) = 0, the Implicit Function Theorem yields the following statement: There exists an open neighborhood U × V of (T_1, (ν, T_2, T_3)) with U ⊂ R_{>0} and V ⊂ R_{>0} × R^d × SPD(d), and a continuously differentiable function G : V → U such that for all (x, (ν, µ, Σ)) ∈ U × V it holds Ψ(x, ν, µ, Σ) = 0 if and only if x = G(ν, µ, Σ). Thus the zero of Ψ depends continuously on ν, T_2 and T_3. This implies the following theorem, which states that every cluster point of the iterates (ν_r, µ_r, Σ_r)_r of Algorithms 2 and 3 is a critical point of L.
Proof. The mapping T defined in Lemma 5.6 is continuous. Further, we know from its definition that (ν, µ, Σ) is a critical point of L if and only if it is a fixed point of T. Let (ν̃, µ̃, Σ̃) be a cluster point of (ν_r, µ_r, Σ_r)_r. Then there exists a subsequence (ν_{r_s}, µ_{r_s}, Σ_{r_s})_s which converges to (ν̃, µ̃, Σ̃). Further, we know by Theorem 5.5 that L_r = L(ν_r, µ_r, Σ_r) is decreasing. Since (L_r)_r is bounded from below, it converges. Now it holds L(ν̃, µ̃, Σ̃) = lim_{s→∞} L(ν_{r_s}, µ_{r_s}, Σ_{r_s}) = lim_{s→∞} L(T(ν_{r_s}, µ_{r_s}, Σ_{r_s})) = L(T(ν̃, µ̃, Σ̃)). By Theorem 5.5 and the definition of T we have that L(ν, µ, Σ) = L(T(ν, µ, Σ)) if and only if (ν, µ, Σ) = T(ν, µ, Σ). By the definition of the algorithm this is the case if and only if (ν, µ, Σ) is a critical point of L. Thus (ν̃, µ̃, Σ̃) is a critical point of L.

Numerical Results
In this section we give numerical examples illustrating the developed theory. First, we compare the four different algorithms in Subsection 6.1. Then, in Subsection 6.2, we address further accelerations of our algorithms by SQUAREM [33] and DAAREM [9] and also show a comparison with the ECME algorithm [20]. Finally, in Subsection 6.3, we provide an application in image analysis by determining the degree of freedom parameter in images corrupted by Student-t noise.

Comparison of Algorithms
In this section, we compare the numerical performance of the classical EM Algorithm 1 and the proposed Algorithms 2, 3 and 4. To this aim, we performed the following Monte Carlo simulation: Based on the stochastic representation (1) of the Student-t distribution, we drew n = 1000 i.i.d. realizations of the T_ν(µ, Σ) distribution with location parameter µ = 0 and different scatter matrices Σ and degree of freedom parameters ν. Then we used the four algorithms to compute the ML estimator (ν̂, µ̂, Σ̂).
We initialize all algorithms with the sample mean for µ and the sample covariance matrix for Σ. Furthermore, we set the initial value ν = 3, and in all algorithms the zero of the respective function is computed by Newton's method. As a stopping criterion we use the relative distance between two consecutive iterates, where we take the logarithm of ν, because T_ν(µ, Σ) converges to the normal distribution as ν → ∞ and therefore the difference between T_ν(µ, Σ) and T_{ν+1}(µ, Σ) becomes small for large ν.
To quantify the performance of the algorithms, we count the number of iterations until the stopping criterion is reached. Since the inner loop of the GMMF is potentially time consuming, we additionally measure the execution time until the stopping criterion is reached. This experiment is repeated N = 10,000 times for different values of ν ∈ {1, 2, 5, 10}. Afterwards we calculate the average number of iterations and the average execution times. The results are given in Table 1. We observe that the performance of the algorithms depends on Σ.
Further we see that the performance of the aEM algorithm is always better than that of the classical EM algorithm. Moreover, all algorithms need more time to estimate large ν.
This seems natural since the likelihood function becomes very flat for large ν. Further, the GMMF needs the lowest number of iterations. However, for small ν the execution time of the GMMF is larger than that of the MMF and the aEM algorithm. This can be explained by the fact that the ν step is less relevant for small ν but is still time consuming in the GMMF. The MMF needs slightly more iterations than the GMMF, but if ν is not extremely large its execution time is smaller than that of the GMMF and the aEM algorithm. In summary, we propose the MMF as the algorithm of choice.
In Figure 2 we exemplarily show the functional values L(ν_r, µ_r, Σ_r) of the four algorithms for samples generated with different values of ν and Σ = I. Note that the x-axis of the plots is in log-scale. We see that the convergence speed (in terms of the number of iterations) of the EM algorithm is much slower than that of the MMF/GMMF. For small ν the convergence speed of the aEM algorithm is close to that of the GMMF/MMF, but for large ν it is close to that of the EM algorithm.
In Figure 3 we show the histograms of the ν-output of 1000 runs for different values of ν and Σ = I. Since the ν-outputs of all algorithms are very close together, we only plot the output of the GMMF. We see that the accuracy of the estimation of ν decreases for increasing ν.
This can be explained by the fact that the likelihood function becomes very flat for large ν, so that the estimation of ν becomes much harder.

Comparison with Other Accelerations of the EM Algorithm
In this section, we compare our algorithms with the Expectation/Conditional Maximization Either (ECME) algorithm [19,20] and apply the SQUAREM acceleration [33] as well as the damped Anderson Acceleration (DAAREM) [9] to our algorithms.

ECME algorithm:
The ECME algorithm was first proposed in [19]. Some numerical examples of the behavior of the ECME algorithm for estimating the parameters (ν, µ, Σ) of a Student-t distribution T_ν(µ, Σ) are given in [20]. The idea of ECME is to replace the M-step of the EM algorithm by the following update of the parameters (ν_r, µ_r, Σ_r): First, we fix ν = ν_r and compute the update (µ_{r+1}, Σ_{r+1}) of the parameters (µ_r, Σ_r) by performing one step of the EM algorithm for fixed degree of freedom (CM1 step). Second, we fix (µ, Σ) = (µ_r, Σ_r) and compute the update ν_{r+1} of ν_r by maximizing the likelihood function with respect to ν (CM2 step). The resulting algorithm is given in Algorithm 5. It is similar to the GMMF (Algorithm 4), but uses the Σ-update of the EM algorithm (Algorithm 1) instead of the Σ-update of the aEM algorithm (Algorithm 2). The authors of [19] showed a similar convergence result as for the EM algorithm. Alternatively, we could prove Theorem 5.5 for the ECME algorithm analogously as for the GMMF algorithm.
Next, we consider two acceleration schemes of arbitrary fixed point algorithms ϑ r+1 = G(ϑ r ).
The columns of F_r ∈ R^{p×m_r} are given by f(ϑ_{r−m_r+j+1}) − f(ϑ_{r−m_r+j}) for j = 0, ..., m_r − 1. An equivalent formulation of the update step (15) uses the matrix X_r ∈ R^{p×m_r}, whose columns are given by ϑ_{r−m_r+j+1} − ϑ_{r−m_r+j} for j = 0, ..., m_r − 1.
The Anderson acceleration can be viewed as a special case of a multisecant quasi-Newton procedure to solve f (ϑ) = 0.For more details we refer to [7,9].
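The following sketch shows a plain (undamped, non-restarted) Anderson acceleration step for a generic fixed point map G in the multisecant form described above. It is a generic illustration, not the DAAREM variant used in the experiments; the memory parameter m and the least-squares solve are choices of this sketch.

```python
import numpy as np

def anderson_step(thetas, residuals, m=5):
    """One Anderson acceleration step for theta = G(theta).

    thetas, residuals: previous iterates theta_k and residuals f(theta_k) = G(theta_k) - theta_k.
    """
    theta_r, f_r = thetas[-1], residuals[-1]
    m_r = min(m, len(thetas) - 1)
    if m_r == 0:
        return theta_r + f_r                                  # plain fixed point step G(theta_r)
    # columns: differences of iterates / residuals over the memory window
    X = np.column_stack([thetas[-m_r + j] - thetas[-m_r + j - 1] for j in range(m_r)])
    F = np.column_stack([residuals[-m_r + j] - residuals[-m_r + j - 1] for j in range(m_r)])
    gamma, *_ = np.linalg.lstsq(F, f_r, rcond=None)           # least-squares coefficients gamma^(r)
    return theta_r + f_r - (X + F) @ gamma                    # multisecant update

def anderson_accelerate(G, theta0, iters=50, m=5):
    thetas, residuals = [theta0], [G(theta0) - theta0]
    for _ in range(iters):
        theta_new = anderson_step(thetas, residuals, m=m)
        thetas.append(theta_new)
        residuals.append(G(theta_new) - theta_new)
    return thetas[-1]

# toy usage: accelerate the fixed point iteration of G(x) = cos(x)
print(anderson_accelerate(np.cos, np.array([1.0])))
```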
The DAAREM acceleration modifies the Anderson acceleration in three points. The first modification is to restart the algorithm after m steps, that is, to set m_r = min(m, c_r) instead of m_r = min(m, r), where c_r ∈ {1, ..., m} is defined by c_r = r mod m. The second modification is to add a damping term in the computation of the coefficients γ^{(r)}, controlled by damping parameters δ_r. We initialize δ_1 = 1/(1 + α^κ) and decrease the exponent of α in each step by 1, down to a minimum of κ − D for some parameter D ∈ N_{>0}.
The third modification is to enforce that the negative log-likelihood function L does not increase by more than ε in one iteration step. To do this, we compute the update ϑ_{r+1} using the Anderson acceleration. If L(ϑ_{r+1}) > L(ϑ_r) + ε, we use our original fixed point algorithm in this step, i.e., we set ϑ_{r+1} = G(ϑ_r).
We summarize the DAAREM acceleration in Algorithm 6. In our numerical experiments we use for the parameters the values suggested by [9], that is, ε = 0.01, c = 0, α = 1.2, κ = 25, D = 2κ and m = min(p/2, 10), where p is the number of parameters in ϑ.

Constant area detection: In order to detect constant regions in an image, we adopt an idea presented in [30]. It is based on Kendall's τ-coefficient, which is a measure of rank correlation, and the associated z-score, see [10,11]. In the following, we briefly summarize the main ideas behind this approach. For finding constant regions we proceed as follows: First, the image grid G is partitioned into K small, non-overlapping regions G = ∪_{k=1}^K R_k, and for each region we consider the hypothesis testing problem H_0: R_k is constant vs. H_1: R_k is not constant.
To decide whether to reject H_0 or not, we observe the following: Consider a fixed region R_k and let I, J ⊆ R_k be two disjoint index sets of the same cardinality. Denote by u_I and u_J the vectors containing the values of u at the positions indexed by I and J. Then, under H_0, the vectors u_I and u_J are uncorrelated (in fact even independent) for all choices of I, J ⊆ R_k with I ∩ J = ∅ and |I| = |J|. As a consequence, the rejection of H_0 can be reformulated as the question whether we can find I, J such that u_I and u_J are significantly correlated, since in this case there has to be some structure in the image region R_k and it cannot be constant. In order to quantify the correlation, we use Kendall's τ-coefficient, whose key idea is to focus on the rank (i.e., on the relative order) of the values rather than on the values themselves. In this vein, a block is considered homogeneous if the ranking of the pixel values is uniformly distributed, regardless of the spatial arrangement of the pixels. In the following, we assume that we have extracted two disjoint subsequences x = u_I and y = u_J from a region R_k with I and J as above. Let (x_i, y_i) and (x_j, y_j) be two pairs of observations. Then the pairs are said to be concordant if x_i < x_j and y_i < y_j or x_i > x_j and y_i > y_j, discordant if x_i < x_j and y_i > y_j or x_i > x_j and y_i < y_j, and tied if x_i = x_j or y_i = y_j.
Next, let x, y ∈ R^n be two sequences without tied pairs and let n_c and n_d denote the number of concordant and discordant pairs, respectively. Then Kendall's τ coefficient [10] is defined as

τ := (n_c − n_d) / ( n(n−1)/2 ).
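As a quick illustration (not the authors' implementation), Kendall's τ can be computed directly from this definition and cross-checked against SciPy:

```python
import numpy as np
from scipy.stats import kendalltau

def kendall_tau(x, y):
    """Kendall's tau from concordant/discordant pairs (assumes no ties)."""
    n = len(x)
    n_c = n_d = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = np.sign(x[i] - x[j]) * np.sign(y[i] - y[j])
            if s > 0:
                n_c += 1          # concordant pair
            elif s < 0:
                n_d += 1          # discordant pair
    return (n_c - n_d) / (n * (n - 1) / 2)

rng = np.random.default_rng(0)
x, y = rng.normal(size=50), rng.normal(size=50)
print(kendall_tau(x, y), kendalltau(x, y)[0])   # the two values should agree
```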
From this definition we see that if the agreement between the two rankings is perfect, i.e., the two rankings are the same, then the coefficient attains its maximal value 1. On the other hand, if the disagreement between the two rankings is perfect, i.e., one ranking is the reverse of the other, then the coefficient attains its minimal value −1.

A. Auxiliary Lemmas
Lemma A.1. Let x_i ∈ R^d, i = 1, ..., n, and w ∈ ∆_n fulfill Assumption 3.1. Let (ν_r, Σ_r)_r be a sequence in R_{>0} × SPD(d) such that (ν_r)_r has a subsequence which converges to zero. Then (ν_r, Σ_r)_r cannot be a minimizing sequence of L(ν, Σ).

Proof. We split off a function g(ν) from L(ν, Σ), for which it holds lim_{ν→0} g(ν) = ∞. Hence it is sufficient to show that the remaining part of L(ν_r, Σ_r) is bounded from below along the subsequence. Note that Assumption 3.1 ensures x_i ≠ 0 and x_i^T x_i > 0 for i = 1, ..., n. Then we get that (L_{ν_r}(Σ_r))_r is bounded from below and (ν_r, Σ_r)_r cannot be a minimizing sequence.
Denote by p_{r,1} ≥ ... ≥ p_{r,d} > 0 the eigenvalues of P_r. Since {P_r : r ∈ N} is bounded, there exists some C > 0 with C ≥ p_{r,1} for all r ∈ N. Thus one of the following cases is fulfilled:
i) There exists a constant c > 0 such that p_{r,d} > c for all r ∈ N.
ii) There exists a subsequence (P_{r_k})_k of (P_r)_r which converges to some P ∈ ∂SPD(d). Hence (L_{ν_r}(Σ_r))_r is bounded from below and (ν_r, Σ_r)_r cannot be a minimizing sequence.

Case 2ii)
We use similar arguments as in the proof of [17, Theorem 4.3]. Let (P_{r_k})_k be a subsequence of (P_r)_r which converges to some P ∈ ∂SPD(d). Hence the terms L_j(P_r) for j ≥ q + 1 can be estimated accordingly. We prove for k ≥ q + 1 by induction that for sufficiently large r ∈ N the estimate (19) holds. If we multiply both sides with log(p_{r,d}), this yields (19) for k = d.


Figure 2: Plots of L(ν_r, µ_r, Σ_r) on the y-axis and r on the x-axis for all algorithms.

Figure 3: Histograms of the output ν̂ from the algorithms.
Simulation Study: To compare the performance of all of these algorithms, we again perform a Monte Carlo simulation. As in the previous section, we draw n = 100 i.i.d. realizations of the T_ν(µ, Σ) distribution.

Figure: Noisy images with detected homogeneous areas and histograms of the estimates for ν and σ². The constant areas were detected with the code of [30]. The true parameters used to generate the noisy images were ν = 1 and σ = 10 for the top row and ν = 5 and σ = 10 for the bottom row; the obtained estimates are (geometric mean in brackets) ν = 1.0437 (1.0291) and σ = 10.3845 (10.3111) for the top row and ν = 5.4140 (5.0423) and σ = 10.5500 (10.1897) for the bottom row.


Case 2i): Let c > 0 with p_{r,d} ≥ c for all r ∈ N. Then lim inf_{r→∞} log(|P_r|) ≥ log(c^d) = d log(c), and hence ∑_i w_i log(x_i^T P_r^{−1} x_i) + log(|P_r|) ≥ ∑_i w_i log(x_i^T P_r^{−1} x_i) + d log(c) + const.

d ∑_{i∈I_j} w_i ( log(x_i^T P_r^{−1} x_i) + log(p_{r,j}) ) + (1 − d ∑_{i∈I_j} w_i) log(p_{r,j}) = d ∑_{i∈I_j} w_i log(p_{r,j} x_i^T P_r^{−1} x_i) + (1 − d ∑_{i∈I_j} w_i) log(p_{r,j}). Thus we conclude that lim inf_{r→∞} L_0(P_r) is bounded from below by the lim inf of the corresponding sums of log(p_{r,j}). It remains to show that there exists c > 0 such that these sums are bounded from below by c.

d ∑_{i∈Ĩ_{k−1}} w_i log(p_{r,j}) + (1 − d ∑_{i∈I_k} w_i) log(p_{r,k}) ≥ d ( ∑_{i∈Ĩ_k} w_i − k/d ) log(p_{r,k+1}) + (1 − d ∑_{i∈I_k} w_i) log(p_{r,k}), and since ∑_{i∈Ĩ_k} w_i < |Ĩ_k|/d ≤ k/d by Assumption 3.1 and p_{r,k+1} ≤ p_{r,k} < 1, finally ≥ d ( ∑_{i∈Ĩ_k} w_i − k/d ) log(p_{r,k}) + (1 − d ∑_{i∈I_k} w_i) log(p_{r,k}) = d ( ∑_{i∈Ĩ_k} w_i − (k − 1)/d ) log(p_{r,k}).

Table 2: Average number of iterations (top) and execution times (bottom) and the corresponding standard deviations for the different algorithms.