Alternatives to the EM algorithm for ML estimation of location, scatter matrix, and degree of freedom of the Student t distribution

Hasannasab, Marzieh; Hertrich, Johannes; Laus, Friederike; Steidl, Gabriele

doi:10.1007/s11075-020-00959-w


Original Paper
Open access
Published: 23 September 2020

Volume 87, pages 77–118, (2021)

Numerical Algorithms

Marzieh Hasannasab ORCID: orcid.org/0000-0002-3975-5545¹,
Johannes Hertrich¹,
Friederike Laus² &
…
Gabriele Steidl¹


A Correction to this article was published on 15 July 2021

This article has been updated

Abstract

In this paper, we consider maximum likelihood estimation of the degree of freedom parameter ν, the location parameter μ and the scatter matrix Σ of the multivariate Student t distribution. In particular, we are interested in estimating the degree of freedom parameter ν, which determines the tails of the corresponding probability density function and has rarely been considered in detail in the literature so far. We prove that under certain assumptions a minimizer of the negative log-likelihood function exists, where we have to take special care of the case $\nu \rightarrow \infty $, for which the Student t distribution approaches the Gaussian distribution. As alternatives to the classical EM algorithm we propose three other algorithms which cannot be interpreted as EM algorithms. For fixed ν, the first algorithm is an accelerated EM algorithm known from the literature. However, since we do not fix ν, we cannot apply standard convergence results for the EM algorithm. The other two algorithms differ from this algorithm in the iteration step for ν. We show how the objective function behaves for the different updates of ν and prove for all three algorithms that it decreases in each iteration step. We compare the algorithms as well as some accelerated versions by numerical simulation and apply one of them to estimate the degree of freedom parameter in images corrupted by Student t noise.


1 Introduction

The motivation for this work arises from certain tasks in image processing, where the robustness of methods plays an important role. In this context, the Student t distribution and the closely related Student t mixture models became popular in various image processing tasks. In [31] it has been shown that Student t mixture models are superior to Gaussian mixture models for modeling image patches and the authors proposed an application in image compression. Image denoising based on Student t models was addressed in [17] and image deblurring in [6, 34]. Further applications include robust image segmentation [4, 25, 29] as well as robust registration [8, 35].

In one dimension and for ν = 1, the Student t distribution coincides with the one-dimensional Cauchy distribution. This distribution has been proposed to model a very impulsive noise behavior and one of the first papers which suggested a variational approach in connection with wavelet shrinkage for denoising of images corrupted by Cauchy noise was [3]. A variational method consisting of a data term that resembles the noise statistics and a total variation regularization term has been introduced in [23, 28]. Based on an ML approach the authors of [16] introduced a so-called generalized myriad filter that estimates both the location and the scale parameter of the Cauchy distribution. They used the filter in a nonlocal denoising approach, where for each pixel of the image they chose as samples of the distribution those pixels having a similar neighborhood and replaced the initial pixel by its filtered version. We also want to mention that a unified framework for images corrupted by white noise that can handle (range constrained) Cauchy noise as well was suggested in [14].

In contrast to the above pixelwise replacement, the state-of-the-art algorithm of Lebrun et al. [18] for denoising images corrupted by white Gaussian noise restores the image patchwise based on a maximum a posteriori approach. In the Gaussian setting, their approach is equivalent to minimum mean square error estimation, and more generally, the resulting estimator can be seen as a particular instance of a best linear unbiased estimator (BLUE). For denoising images corrupted by additive Cauchy noise, a similar approach was addressed in [17] based on ML estimation for the family of Student t distributions, of which the Cauchy distribution forms a special case. The authors call this approach the generalized multivariate myriad filter.

However, all these approaches assume that the degree of freedom parameter ν of the Student t distribution is known, which might not be the case in practice. In this paper we consider the estimation of the degree of freedom parameter based on an ML approach. In contrast to maximum likelihood estimators of the location and/or scatter parameter(s) μ and Σ, to the best of our knowledge the question of existence of a joint maximum likelihood estimator has not been analyzed before, and in this paper we provide first results in this direction. Usually the negative log-likelihood functions of Student t distributions and mixture models are minimized using the EM algorithm derived e.g. in [13, 21, 22, 26]. For fixed ν, there exists an accelerated EM algorithm [12, 24, 32] which appears to be more efficient than the classical one for smaller parameters ν. We examine the convergence of the accelerated version if the degree of freedom parameter ν also has to be estimated. For unknown degrees of freedom, there also exists an accelerated version of the EM algorithm, the so-called ECME algorithm [20], which differs from our algorithm. Further, we propose two modifications of the ν iteration step which lead to efficient algorithms for a wide range of parameters ν. Finally, we address further accelerations of our algorithms by the squared iterative methods (SQUAREM) [33] and the damped Anderson acceleration with restarts and 𝜖-monotonicity (DAAREM) [9].

The paper is organized as follows: In Section 2 we introduce the Student t distribution, the negative $\log $-likelihood function L and its derivatives. The question of the existence of a minimizer of L is addressed in Section 3. Section 4 deals with the solution of the equation arising when setting the gradient of L with respect to ν to zero. The results of this section will be important for the convergence consideration of our algorithms in Section 5. We propose three alternatives to the classical EM algorithm and prove that the objective function L decreases for the iterates produced by these algorithms. Finally, we provide two kinds of numerical results in Section 5. First, we compare the different algorithms by numerical examples which indicate that the new ν iterations are very efficient for estimating ν of different magnitudes. Second, we come back to the original motivation of this paper and estimate the degree of freedom parameter ν from images corrupted by one-dimensional Student t noise. The code is provided online (Footnote 1).

2 Likelihood of the multivariate student t distribution

The density function of the d-dimensional Student t distribution T_ν(μ, Σ) with ν > 0 degrees of freedom, location parameter $\mu \in \mathbb {R}^{d}$ and symmetric, positive definite scatter matrix Σ ∈ SPD(d) is given by

$$ p(x|\nu,\mu,{\varSigma}) = \frac{\Gamma\left( \frac{d+\nu}{2}\right)}{\Gamma\left( \frac{\nu}{2}\right) \nu^{\frac{d}{2}} \pi^{\frac{d}{2}} {\left| {\varSigma} \right|}^{\frac{1}{2}}} \frac{1}{\left( 1 +\frac1\nu(x-\mu)^{\mathrm{T}} {\varSigma}^{-1}(x-\mu) \right)^{\frac{d+\nu}{2}}}, $$

with the Gamma function $ {\Gamma }(s) := {\int \limits }_{0}^{\infty } t^{s-1}\mathrm {e}^{-t} \mathrm {d}t $. The expectation of the Student t distribution is $\mathbb {E}(X) = \mu $ for ν > 1 and the covariance matrix is given by $Cov(X) =\frac {\nu }{\nu -2} {\varSigma }$ for ν > 2; otherwise, the quantities are undefined. The smaller the value of ν, the heavier the tails of the T_ν(μ, Σ) distribution. For $\nu \to \infty $, the Student t distribution T_ν(μ, Σ) converges to the normal distribution $\mathcal {N}(\mu ,{\varSigma })$ and for ν = 0 it is related to the projected normal distribution on the sphere $\mathbb {S}^{d-1}\subset \mathbb {R}^{d}$. Figure 1 illustrates this behavior for the one-dimensional standard Student t distribution.

Like the normal distribution, the d-dimensional Student t distribution belongs to the class of elliptically symmetric distributions. These distributions are stable under linear transforms in the following sense: Let $X\sim T_{\nu }(\mu ,{\varSigma })$ and $A\in \mathbb {R}^{d\times d}$ be an invertible matrix and let $b\in \mathbb {R}^{d}$. Then $AX + b\sim T_{\nu }\left (A\mu + b, A{\varSigma } A^{\mathrm {T}}\right )$. Furthermore, the Student t distribution T_ν(μ, Σ) admits the following stochastic representation, which can be used to generate samples from T_ν(μ, Σ) based on samples from the multivariate standard normal distribution $\mathcal {N}(0,I)$ and the Gamma distribution ${\Gamma }\left (\tfrac {\nu }{2},\tfrac {\nu }{2}\right )$: Let $Z\sim \mathcal {N}(0,I)$ and $Y\sim {\Gamma }\left (\tfrac {\nu }{2},\tfrac {\nu }{2}\right )$ be independent, then

$$ X = \mu + \frac{{\varSigma}^{\frac{1}{2}}Z}{\sqrt{Y}}\sim T_{\nu}(\mu,{\varSigma}). $$

(1)
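The representation (1) translates directly into a sampling routine. The following is a minimal sketch using only the Python standard library, under two assumptions that are not stated in the text: the Gamma distribution ${\Gamma }\left (\tfrac {\nu }{2},\tfrac {\nu }{2}\right )$ is read in shape/rate form (so that Y has expectation 1), and the Cholesky factor C with CCᵀ = Σ is used in place of the symmetric square root, which gives the same distribution because Z is rotation invariant. The helper names `cholesky` and `sample_student_t` are hypothetical.

```python
import math
import random

def cholesky(S):
    """Lower-triangular C with C C^T = S for a small SPD matrix S (list of lists)."""
    d = len(S)
    C = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            acc = S[i][j] - sum(C[i][k] * C[j][k] for k in range(j))
            C[i][j] = math.sqrt(acc) if i == j else acc / C[j][j]
    return C

def sample_student_t(nu, mu, Sigma, rng):
    """Draw one sample X = mu + C Z / sqrt(Y) following representation (1)."""
    d = len(mu)
    C = cholesky(Sigma)  # assumption: Cholesky factor instead of Sigma^(1/2)
    z = [rng.gauss(0.0, 1.0) for _ in range(d)]
    # Assumption: Gamma(nu/2, nu/2) is shape/rate; gammavariate takes shape and scale.
    y = rng.gammavariate(nu / 2.0, 2.0 / nu)
    return [mu[i] + sum(C[i][k] * z[k] for k in range(i + 1)) / math.sqrt(y)
            for i in range(d)]

rng = random.Random(0)
mu_true = [1.0, -2.0]
Sigma_true = [[2.0, 0.5], [0.5, 1.0]]
samples = [sample_student_t(5.0, mu_true, Sigma_true, rng) for _ in range(20000)]
# For nu > 1 the sample mean should be close to mu.
mean = [sum(s[i] for s in samples) / len(samples) for i in range(2)]
print(mean)
```

For ν ≤ 1 the expectation does not exist, so the sample mean in the last lines is only meaningful for ν > 1.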

For i.i.d. samples $x_{i} \in \mathbb R^{d}$, i = 1,…,n, the likelihood function of the Student t distribution T_ν(μ, Σ) is given by

$$ \mathcal{L}(\nu,\!\mu,\!{\varSigma}|x_{1},\!\ldots,\!x_{n}) = \frac{\Gamma\left( \frac{d+\nu}{2}\right)^{n}}{\Gamma\left( \frac{\nu}{2}\right)^{n}(\pi \nu)^{\frac{nd}{2}}\left| {\varSigma} \right|^{\frac{n}{2}} } \prod\limits_{i=1}^{n} \frac{1}{\left( 1{}+{}\frac{1}{\nu}(x_{i}{}-{}\mu)^{\mathrm{T}} {\varSigma}^{-1} (x_{i}-\mu)\right)^{\frac{d+\nu}{2}}}, $$

and the log-likelihood function by

$$ \begin{array}{@{}rcl@{}} \ell(\nu,\mu,{\varSigma}|x_{1},\ldots,x_{n}) &=& n \log\left( {\Gamma}\left( \tfrac{d+\nu}{2}\right)\right) - n \log \left( {\Gamma}\left( \tfrac{\nu}{2}\right)\right)-\tfrac{nd}{2}\log(\pi\nu) \\ &&- \frac{n}{2}\log \left| {\varSigma} \right| - \tfrac{d+\nu}{2} \sum\limits_{i=1}^{n} \log\left( 1+\frac{1}{\nu}(x_{i}-\mu)^{\mathrm{T}} {\varSigma}^{-1} (x_{i}-\mu) \right). \end{array} $$

In the following, we are interested in the negative log-likelihood function, which, up to the factor $\frac {2}{n}$ and an additive constant independent of (ν, μ, Σ), and with weights $w_{i} = \frac {1}{n}$, reads as

$$ \begin{array}{@{}rcl@{}} L(\nu,\mu,{\varSigma}) &=& -2\log\left( {\Gamma}\left( \tfrac{d+\nu}{2}\right)\right)+ 2 \log\left( {\Gamma}\left( \tfrac{\nu}{2}\right)\right) - \nu \log(\nu) \\ &&\quad + (d + \nu)\sum\limits_{i=1}^{n} w_{i} \log\left( \nu + (x_{i}-\mu)^{\mathrm{T}} {\varSigma}^{-1} (x_{i}-\mu) \right)+ \log \left| {\varSigma} \right|. \end{array} $$
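To make the relation between ℓ and L concrete, the following sketch evaluates both in the one-dimensional case d = 1 with Σ = σ² (a simplification chosen here to avoid matrix code; the function names are hypothetical). With uniform weights, $-\frac{2}{n}\ell$ and L differ only by the constant $d\log (\pi )$, which the last lines compute numerically.

```python
import math

def neg_log_likelihood(nu, mu, sigma2, xs, ws):
    """L(nu, mu, Sigma) for d = 1 with Sigma = sigma2 and weights ws."""
    d = 1
    data_term = sum(w * math.log(nu + (x - mu) ** 2 / sigma2)
                    for x, w in zip(xs, ws))
    return (-2.0 * math.lgamma((d + nu) / 2.0) + 2.0 * math.lgamma(nu / 2.0)
            - nu * math.log(nu) + (d + nu) * data_term + math.log(sigma2))

def log_density(x, nu, mu, sigma2):
    """Logarithm of the d = 1 Student t density p(x | nu, mu, sigma2)."""
    delta = (x - mu) ** 2 / sigma2
    return (math.lgamma((1 + nu) / 2.0) - math.lgamma(nu / 2.0)
            - 0.5 * math.log(nu * math.pi) - 0.5 * math.log(sigma2)
            - 0.5 * (1 + nu) * math.log1p(delta / nu))

xs = [-1.3, 0.2, 0.7, 2.5, 4.0]          # illustrative data, not from the paper
ws = [1.0 / len(xs)] * len(xs)           # uniform weights w_i = 1/n
nu, mu, sigma2 = 3.0, 0.5, 1.7
L_value = neg_log_likelihood(nu, mu, sigma2, xs, ws)
scaled = -2.0 / len(xs) * sum(log_density(x, nu, mu, sigma2) for x in xs)
gap = scaled - L_value                   # constant offset, here d * log(pi)
print(gap, math.log(math.pi))
```

Since the offset does not depend on (ν, μ, Σ), both functions have the same minimizers.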

In this paper, we allow for arbitrary weights from the open probability simplex $ \mathring {\Delta }_{n} := \left \{w = (w_{1},\ldots ,w_{n}) \in \mathbb R_{>0}^{n}: {\sum }_{i=1}^{n} w_{i} = 1 \right \} $. In this way, we might express different levels of confidence in single samples or handle the occurrence of multiple samples. Using $ \frac {\partial \log (\left | X \right |)}{\partial X} = X^{-1} $ and $ \frac {\partial a^{\mathrm {T}} X^{-1}b }{\partial X} =- {\left (X^{-\mathrm {T}}\right )}a b^{\mathrm {T}} {\left (X^{-\mathrm {T}}\right )} $ (see [27]), the derivatives of L with respect to μ, Σ and ν are given by

$$ \begin{array}{@{}rcl@{}} \frac{\partial L}{\partial \mu}(\nu,\mu,{\varSigma}) & =& -2(d+\nu )\sum\limits_{i=1}^{n} w_{i} \frac{ {\varSigma}^{-1}(x_{i}-\mu)}{\nu + (x_{i}-\mu)^{\mathrm{T}} {\varSigma}^{-1} (x_{i}-\mu)},\\ \frac{\partial L}{\partial {\varSigma}}(\nu,\mu,{\varSigma}) & =& - (d+\nu ) \sum\limits_{i=1}^{n} w_{i} \frac{ {\varSigma}^{-1}(x_{i}-\mu)(x_{i}-\mu)^{\mathrm{T}} {\varSigma}^{-1} }{\nu + (x_{i}-\mu)^{\mathrm{T}} {\varSigma}^{-1} (x_{i}-\mu)}+{\varSigma}^{-1},\\ \frac{\partial L}{\partial \nu}(\nu,\mu,{\varSigma} ) & =& \phi\left( \frac{\nu}{2}\right) - \phi \left( \frac{\nu + d}{2}\right) + \sum\limits_{i=1}^{n} w_{i} \left( \frac{\nu + d}{\nu + (x_{i}-\mu)^{\mathrm{T}} {\varSigma}^{-1} (x_{i}-\mu)}\right.\\ && \quad \left. - \log\left( \frac{\nu + d}{\nu + (x_{i}-\mu)^{\mathrm{T}} {\varSigma}^{-1} (x_{i}-\mu)} \right) - 1\right), \end{array} $$

with

$$ \phi(x) := \psi(x) - \log (x), \qquad x >0$$

and the digamma function

$$ \psi(x) = \frac{\mathrm{d}}{\mathrm{d}x}\log\left( {\Gamma}(x)\right) = \frac{{\Gamma}^{\prime}(x)}{\Gamma(x)}. $$

Setting the derivatives to zero results in the equations

$$ \begin{array}{@{}rcl@{}} 0 &=& \sum\limits_{i=1}^{n} w_{i} \frac{x_{i}-\mu}{\nu+(x_{i}-\mu)^{\mathrm{T}} {\varSigma}^{-1} (x_{i}-\mu)}, \end{array} $$

(2)

$$ \begin{array}{@{}rcl@{}} I &=& (d+\nu)\sum\limits_{i=1}^{n} w_{i} \frac{{\varSigma}^{-\frac{1}{2}}(x_{i}-\mu)(x_{i}-\mu)^{\mathrm{T}} {{\varSigma}^{-\frac{1}{2}}} }{\nu+(x_{i}-\mu)^{\mathrm{T}} {\varSigma}^{-1} (x_{i}-\mu)} , \end{array} $$

(3)

$$ \begin{array}{@{}rcl@{}} 0 &=& F\left( \frac{\nu }{2} \right) := \phi\left( \frac{\nu }{2}\right) - \phi\left( \frac{\nu +d}{2}\right) \\ &&\quad + \sum\limits_{i=1}^{n} w_{i} \left( \tfrac{\nu + d}{\nu + (x_{i}-\mu)^{\mathrm{T}} {\varSigma}^{-1} (x_{i}-\mu)}- \log\left( \tfrac{\nu + d}{\nu + (x_{i}-\mu)^{\mathrm{T}} {\varSigma}^{-1} (x_{i}-\mu)} \right) - 1\right). \end{array} $$

(4)

Computing the trace of both sides of (3) and using the linearity and permutation invariance of the trace operator we obtain

$$ \begin{array}{@{}rcl@{}} d&=&\text{tr}(I) =(d+\nu)\sum\limits_{i=1}^{n} w_{i} \frac{\text{tr}\left( {\varSigma}^{-\frac{1}{2}}(x_{i}-\mu)(x_{i}-\mu)^{\mathrm{T}} {{\varSigma}^{-\frac{1}{2}}}\right)}{\nu+(x_{i}-\mu)^{\mathrm{T}} {\varSigma}^{-1} (x_{i}-\mu)} \\ &=& (d+\nu)\sum\limits_{i=1}^{n} w_{i} \frac{(x_{i}-\mu)^{\mathrm{T}} {\varSigma}^{-1} (x_{i}-\mu)}{\nu+(x_{i}-\mu)^{\mathrm{T}} {\varSigma}^{-1} (x_{i}-\mu)}, \end{array} $$

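Since $\frac {a}{\nu + a} = 1 - \frac {\nu }{\nu + a}$ for a ≥ 0 and ${\sum }_{i=1}^{n} w_{i} = 1$, this yields for ν > 0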

$$ 1= (d+\nu) \sum\limits_{i=1}^{n} w_{i} \frac{1}{\nu+(x_{i}-\mu)^{\mathrm{T}} {\varSigma}^{-1} (x_{i}-\mu)}. $$

We are interested in critical points of the negative log-likelihood function L, i.e., in solutions (μ, Σ, ν) of (2)–(4), and in particular in minimizers of L.
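For fixed ν, equations (2) and (3) can be rearranged into a simple fixed-point iteration for μ and Σ. The sketch below does this for d = 1 with Σ = σ²; it is meant only as an illustration of the critical point equations, not as one of the algorithms proposed in this paper, and the function names, starting values and stopping rule are assumptions. The returned residuals measure how well (2), (3) and the trace identity above are satisfied.

```python
def fixed_point_location_scatter(xs, ws, nu, max_iter=5000, tol=1e-15):
    """Iterate the rearranged equations (2) and (3) for d = 1 and fixed nu > 0."""
    d = 1
    mu = sum(w * x for x, w in zip(xs, ws))                   # start: weighted mean
    sigma2 = sum(w * (x - mu) ** 2 for x, w in zip(xs, ws))   # start: weighted variance
    for _ in range(max_iter):
        # gamma_i = (d + nu) / (nu + delta_i) with delta_i = (x_i - mu)^2 / sigma2
        gamma = [(d + nu) / (nu + (x - mu) ** 2 / sigma2) for x in xs]
        norm = sum(w * g for w, g in zip(ws, gamma))
        mu_new = sum(w * g * x for x, w, g in zip(xs, ws, gamma)) / norm
        sigma2_new = sum(w * g * (x - mu_new) ** 2
                         for x, w, g in zip(xs, ws, gamma))
        step = abs(mu_new - mu) + abs(sigma2_new - sigma2)
        mu, sigma2 = mu_new, sigma2_new
        if step < tol:
            break
    return mu, sigma2

def residuals(xs, ws, nu, mu, sigma2):
    """Residuals of (2), of (3) for d = 1, and of the trace identity."""
    d = 1
    deltas = [(x - mu) ** 2 / sigma2 for x in xs]
    r2 = sum(w * (x - mu) / (nu + dl) for x, w, dl in zip(xs, ws, deltas))
    r3 = 1.0 - (d + nu) * sum(w * dl / (nu + dl) for w, dl in zip(ws, deltas))
    r_trace = 1.0 - (d + nu) * sum(w / (nu + dl) for w, dl in zip(ws, deltas))
    return r2, r3, r_trace

xs = [-2.0, -0.5, 0.1, 0.3, 0.8, 1.5, 9.0]   # illustrative data with one outlier
ws = [1.0 / len(xs)] * len(xs)
mu_hat, sigma2_hat = fixed_point_location_scatter(xs, ws, nu=3.0)
print(mu_hat, sigma2_hat, residuals(xs, ws, 3.0, mu_hat, sigma2_hat))
```

The ν equation (4) is deliberately left out here; its zeros are the subject of Section 4.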

3 Existence of critical points

In this section, we examine whether the negative log-likelihood function L has a minimizer, where we restrict our attention to the case μ = 0. For an approach to extending the results to arbitrary μ for fixed ν, we refer to [17]. To the best of our knowledge, this is the first work that provides results in this direction. The question of existence is, however, crucial in the context of ML estimation, since it lays the foundation for any convergence result for the EM algorithm or its variants. In fact, the authors of [13] observed the divergence of the EM algorithm in some of their numerical experiments, which is in accordance with our observations.

For fixed ν > 0, it is known that there exists a unique solution of (3) and for ν = 0 that there exist solutions of (3) which differ only by a multiplicative positive constant (see, e.g., [17]). In contrast, if we do not fix ν, we roughly have to distinguish between two cases: the samples tend to come from a Gaussian distribution, i.e., $\nu \to \infty $, or they do not. The results are presented in Theorem 1.

We make the following general assumption:

Assumption 1

Any subset of at most d samples x_i, i ∈{1,…,n}, is linearly independent and $\max \limits \{w_{i}:i=1,\ldots ,n\}<\frac {1}{d }$.

For μ = 0, the negative log-likelihood function becomes

$$ \begin{array}{@{}rcl@{}} L(\nu,{\varSigma}) &:=& -2\log\left( {\Gamma}\left( \frac{d+\nu}{2}\right)\right)+2\log\left( {\Gamma}\left( \frac\nu2\right)\right)-\nu\log(\nu)\\ &\quad& +(d+\nu)\sum\limits_{i=1}^{n} w_{i} \log\left( \nu+ x_{i}^{\mathrm{T}}{\varSigma}^{-1}x_{i}\right)+\log(\left| {\varSigma} \right|)\\ &=&-2\log\left( {\Gamma}\left( \frac{d+\nu}{2}\right)\right)+2\log\left( {\Gamma}\left( \frac\nu2\right)\right)-\nu\log(\nu)\\ &\quad& +(d+\nu)\log(\nu)+(d+\nu)\sum\limits_{i=1}^{n}w_{i}\log\left( 1+ \frac1\nu x_{i}^{\mathrm{T}}{\varSigma}^{-1}x_{i}\right)+\log(\left| {\varSigma} \right|). \end{array} $$

Further, for a fixed ν > 0, set

$$ L_{\nu}({\varSigma}) := (d+\nu)\sum\limits_{i=1}^{n} w_{i} \log\left( \nu+ x_{i}^{\mathrm{T}}{\varSigma}^{-1}x_{i}\right)+\log(\left| {\varSigma} \right|). $$

To prove the next existence theorem we will need two lemmas, whose proofs are given in the Appendix.

Theorem 1

Let $x_{i} \in \mathbb R^{d}$, i = 1,…,n and $w \in \mathring {\Delta }_{n}$ fulfill Assumption 1. Then exactly one of the following statements holds:

(i)
There exists a minimizing sequence (ν_r,Σ_r)_r of L, such that $\{\nu _{r}:r\in \mathbb N\}$ has a finite cluster point. Then we have $argmin_{(\nu ,{\varSigma })\in \mathbb {R}_{>0}\times \text {SPD}(d)} L(\nu ,{\varSigma })\neq \emptyset $ and every $(\hat \nu ,\hat {\varSigma })\in argmin_{(\nu ,{\varSigma })\in \mathbb {R}_{>0}\times \text {SPD}(d)}L(\nu ,{\varSigma })$ is a critical point of L.
(ii)
For every minimizing sequence (ν_r,Σ_r)_r of L(ν, Σ) we have $\underset {r\to \infty }{\lim } \nu _{r}=\infty $. Then (Σ_r)_r converges to the maximum likelihood estimator $\hat {\varSigma }={\sum }_{i=1}^{n} w_{i}x_{i}x_{i}^{\mathrm {T}}$ of the normal distribution $\mathcal {N}(0,{\varSigma })$.

Proof

Case 1: Assume that there exists a minimizing sequence (ν_r,Σ_r)_r of L, such that (ν_r)_r has a bounded subsequence. In particular, using Lemma 4, we have that (ν_r)_r has a cluster point ν^∗ > 0 and a subsequence $(\nu _{r_{k}})_{k}$ converging to ν^∗. Clearly, the sequence $(\nu _{r_{k}},{\varSigma }_{r_{k}})_{k}$ is again a minimizing sequence so that we skip the second index in the following. By Lemma 5, the set $\overline {\{{\varSigma }_{r}:r\in \mathbb N\}}$ is a compact subset of SPD(d). Therefore there exists a subsequence $({\varSigma }_{r_{k}})_{k}$ which converges to some Σ^∗∈SPD(d). Now we have by continuity of L(ν, Σ) that

$$ L(\nu^{*},{\varSigma}^{*})=\lim\limits_{k\to\infty}L(\nu_{r_{k}},{\varSigma}_{r_{k}})=\min_{(\nu,{\varSigma})\in\mathbb{R}_{>0}\times\text{SPD}(d)} L(\nu,{\varSigma}). $$

Case 2: Assume that for every minimizing sequence (ν_r,Σ_r)_r it holds that $\nu _{r}\to \infty $ as $r\to \infty $. We rewrite the negative log-likelihood function as

$$ \begin{array}{@{}rcl@{}} L(\nu,{\varSigma}) &=& 2\log \left( \frac{\Gamma\left( \frac\nu2\right)\left( \frac\nu2\right)^{\frac{d}{2}} }{ {\Gamma}\left( \frac{d+\nu}{2} \right)} \right) +d \log(2)\\ &&\quad+(d+\nu) {\sum}_{i=1}^{n} w_{i} \log \left( 1+\frac1\nu x_{i}^{\mathrm{T}} {\varSigma}^{-1}x_{i}\right)+\log(\left| {\varSigma} \right|). \end{array} $$

Since

$$\underset{\nu \rightarrow \infty}{\lim} \frac{\Gamma\left( \frac\nu2\right)\left( \frac\nu2\right)^{\frac{d}{2}} }{ {\Gamma}\left( \frac{d+\nu}{2} \right)}=1,$$

we obtain

$$ \underset{r\to\infty}{\lim}L(\nu_{r},{\varSigma}_{r})= d\log(2)+ \underset{\nu_{r} \rightarrow \infty}{\lim} (d+\nu_{r})\sum\limits_{i=1}^{n}w_{i}\log\left( 1+\frac1{\nu_{r}} x_{i}^{\mathrm{T}} {\varSigma}_{r}^{-1}x_{i}\right)+\log(\left| {\varSigma}_{r} \right|). $$

(5)

Next we show by contradiction that $\overline {\{{\varSigma }_{r}:r\in \mathbb N\}}$ is in SPD(d) and bounded: Denote the eigenvalues of Σ_r by λ_r1 ≥⋯ ≥ λ_rd. Assume that either $\{\lambda _{r1}:r\in \mathbb N\}$ is unbounded or that $\{\lambda _{rd}:r\in \mathbb N\}$ has zero as a cluster point. Then, we know by [17, Theorem 4.3] that there exists a subsequence of (Σ_r)_r, which we again denote by (Σ_r)_r, such that for any fixed ν > 0 it holds

$$ \underset{r\to\infty}{\lim} L_{\nu} ({\varSigma}_{r})=\infty. $$

Since $k\mapsto \left (1+\frac {x}{k}\right )^{k}$ is monotone increasing for every x ≥ 0, for ν_r ≥ d + 1 we have

$$ \begin{array}{@{}rcl@{}} (d+\nu_{r})\sum\limits_{i=1}^{n} w_{i} \log\left( 1+\frac1{\nu_{r}}x_{i}^{\mathrm{T}} {\varSigma}_{r}^{-1}x_{i}\right) &=&\sum\limits_{i=1}^{n} w_{i} \log\left( \left( 1+\frac1{\nu_{r}}x_{i}^{\mathrm{T}} {\varSigma}_{r}^{-1}x_{i}\right)^{\nu_{r}+d}\right)\\ &\geq& \sum\limits_{i=1}^{n} w_{i} \log\left( \left( 1+\frac1{\nu_{r}}x_{i}^{\mathrm{T}} {\varSigma}_{r}^{-1}x_{i}\right)^{\nu_{r}}\right)\\ &\geq& \sum\limits_{i=1}^{n} w_{i} \log\left( \left( 1+\frac1{d+1}x_{i}^{\mathrm{T}} {\varSigma}_{r}^{-1}x_{i}\right)^{d+1}\right)\\ &=& (d + 1)\sum\limits_{i=1}^{n} w_{i} \log\left( 1+\frac1{d+1}x_{i}^{\mathrm{T}} {\varSigma}_{r}^{-1}x_{i}\right)\\ &\geq& (d+1)\sum\limits_{i=1}^{n} w_{i} \log\left( 1+x_{i}^{\mathrm{T}} {\varSigma}_{r}^{-1}x_{i}\right) - \log\left( (d+1)^{d+1}\right). \end{array} $$

By (5) this yields

$$ \begin{array}{@{}rcl@{}} \underset{r\to\infty}{\lim}L(\nu_{r},{\varSigma}_{r}) &\geq& d\log(2) - \log\left( (d+1)^{d+1}\right) \\ &&\quad+ \underset{r\to\infty}{\lim} \left( (d+1)\sum\limits_{i=1}^{n} w_{i} \log\left( 1+x_{i}^{\mathrm{T}} {\varSigma}_{r}^{-1}x_{i}\right)+\log(\left| {\varSigma}_{r} \right|) \right) \\ &=&d\log(2)- \log\left( (d+1)^{d+1}\right) +\underset{r\to\infty}{\lim}L_{1}({\varSigma}_{r})=\infty. \end{array} $$

This contradicts the assumption that (ν_r,Σ_r)_r is a minimizing sequence of L. Hence, $\overline {\{{\varSigma }_{r}:r\in \mathbb N\}}$ is a bounded subset of SPD(d). Finally, we show that any subsequence of (Σ_r)_r has a subsequence which converges to $\hat {\varSigma }={\sum }_{i=1}^{n} w_{i} x_{i}x_{i}^{\mathrm {T}}$. Then the whole sequence (Σ_r)_r converges to $\hat {\varSigma }$. Let $\left ({\varSigma }_{r_{k}}\right )_{k}$ be a subsequence of (Σ_r)_r. Since it is bounded, it has a convergent subsequence $\left ({\varSigma }_{r_{k_{l}}}\right )_{l}$ which converges to some $\tilde {\varSigma }\in \overline {\{{\varSigma }_{r}:r\in \mathbb N\}}\subset \text {SPD}(d)$. For simplicity, we denote $\left ({\varSigma }_{r_{k_{l}}}\right )_{l}$ again by (Σ_r)_r. Since (Σ_r)_r converges, we know that also $\left (x_{i}^{\mathrm {T}} {\varSigma }_{r}^{-1}x_{i}\right )_{r}$ converges and is bounded. By $\underset {r\to \infty }{\lim }\nu _{r}=\infty $ we know that the functions $x\mapsto \left (1+\frac {x}{\nu _{r}}\right )^{\nu _{r}}$ converge locally uniformly to $x\mapsto \exp (x)$ as $r\to \infty $. Thus we obtain

$$ \begin{array}{@{}rcl@{}} &&\underset{r\to\infty}{\lim}(d+\nu_{r})\sum\limits_{i=1}^{n} w_{i} \log\left( 1+\frac1{\nu_{r}}x_{i}^{\mathrm{T}} {\varSigma}_{r}^{-1}x_{i}\right)\\ &&\qquad=\underset{r\to\infty}{\lim}\sum\limits_{i=1}^{n} w_{i}\log\left( \left( 1+\frac1{\nu_{r}}x_{i}^{\mathrm{T}} {\varSigma}_{r}^{-1}x_{i}\right)^{d+\nu_{r}}\right) \\ &&\qquad=\underset{r\to\infty}{\lim} \sum\limits_{i=1}^{n} w_{i} \log\left( \underset{r\to\infty}{\lim}\left( 1+\frac1{\nu_{r}}x_{i}^{\mathrm{T}} {\varSigma}_{r}^{-1}x_{i}\right)^{\nu_{r}}\left( 1+\frac1{\nu_{r}}x_{i}^{\mathrm{T}} {\varSigma}_{r}^{-1}x_{i}\right)^{d}\right)\\ &&\qquad=\underset{r\to\infty}{\lim}\sum\limits_{i=1}^{n} w_{i} \log\left( \underset{r\to\infty}{\lim}\left( 1+\frac1{\nu_{r}}x_{i}^{\mathrm{T}} {\varSigma}_{r}^{-1}x_{i}\right)^{\nu_{r}}\right)\\ &&\qquad=\sum\limits_{i=1}^{n} w_{i} \log\left( \exp\left( x_{i}^{\mathrm{T}}\tilde{{\varSigma}}^{-1}x_{i}\right)\right)=\sum\limits_{i=1}^{n} w_{i}x_{i}^{\mathrm{T}} \tilde{\varSigma}^{-1}x_{i}. \end{array} $$

Hence, we have

$$ \begin{array}{@{}rcl@{}} \underset{(\nu,{\varSigma})\in\mathbb{R}_{>0}\times\text{SPD}(d)}{\inf}L(\nu,{\varSigma})=\underset{r\to\infty}{\lim} L(\nu_{r},{\varSigma}_{r}) =d\log(2)+\sum\limits_{i=1}^{n} w_{i}x_{i}^{\mathrm{T}}\tilde{\varSigma}^{-1}x_{i}+\log\left( \left|\tilde{\varSigma}\right|\right). \end{array} $$

By taking the derivative with respect to Σ we see that the right-hand side is minimal if and only if ${\varSigma }=\hat {\varSigma }={\sum }_{i=1}^{n}w_{i}x_{i}x_{i}^{\mathrm {T}}$. On the other hand, by similar computations as above we get

$$ \begin{array}{@{}rcl@{}} \underset{(\nu,{\varSigma})\in\mathbb{R}_{>0}\times\text{SPD}(d)}{\inf}L(\nu,{\varSigma}) &\leq& \underset{r\to\infty}{\lim} L\left( \nu_{r},\hat{\varSigma}\right)\\ &=& d\log(2) + \log\left( \left|\hat{\varSigma}\right|\right) \\ &&\quad+ \underset{r \rightarrow \infty}{\lim} (d+\nu_{r}) \sum\limits_{i=1}^{n} w_{i} \log \left( 1+\frac1{\nu_{r}}x_{i}^{\mathrm{T}} \hat {\varSigma}^{-1}x_{i}\right)\\ &=&d\log(2) + \sum\limits_{i=1}^{n} w_{i}x_{i}^{\mathrm{T}} \hat{\varSigma}^{-1}x_{i}+\log\left( \left|\hat{\varSigma}\right|\right), \end{array} $$

so that $\tilde {\varSigma }=\hat {\varSigma }$. This finishes the proof. □
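The limit used in Case 2 can be checked numerically. The sketch below, again restricted to d = 1 and μ = 0 with hypothetical function names and illustrative data, compares L(ν, Σ) for growing ν with the limiting expression $d\log (2)+{\sum }_{i} w_{i}x_{i}^{\mathrm {T}}{\varSigma }^{-1}x_{i}+\log \left | {\varSigma } \right |$ and evaluates the latter around $\hat {\varSigma }={\sum }_{i} w_{i}x_{i}x_{i}^{\mathrm {T}}$.

```python
import math

def L_centered(nu, sigma2, xs, ws):
    """L(nu, Sigma) for d = 1 and mu = 0, with Sigma = sigma2 > 0."""
    d = 1
    data_term = sum(w * math.log(nu + x * x / sigma2) for x, w in zip(xs, ws))
    return (-2.0 * math.lgamma((d + nu) / 2.0) + 2.0 * math.lgamma(nu / 2.0)
            - nu * math.log(nu) + (d + nu) * data_term + math.log(sigma2))

def gaussian_limit(sigma2, xs, ws):
    """Limit of L(nu, sigma2) for nu -> infinity in the case d = 1."""
    d = 1
    quad = sum(w * x * x for x, w in zip(xs, ws)) / sigma2
    return d * math.log(2.0) + quad + math.log(sigma2)

xs = [-1.2, -0.4, 0.3, 0.9, 2.1]          # illustrative data, all nonzero
ws = [0.2] * 5
sigma2_hat = sum(w * x * x for x, w in zip(xs, ws))   # Gaussian ML estimate

# Difference between L and its limit for increasing nu.
gaps = {nu: L_centered(nu, sigma2_hat, xs, ws) - gaussian_limit(sigma2_hat, xs, ws)
        for nu in (1e1, 1e3, 1e6)}
print(gaps)
```

In this example the gap shrinks as ν grows, in line with the convergence of the Student t distribution to the normal distribution.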

4 Zeros of F

In this section, we are interested in the existence of solutions of (4), i.e., in zeros of F for arbitrary fixed μ and Σ. Setting $x := \frac {\nu }{2} > 0$, $t := \frac {d}{2}$ and

$$ s_{i} := \frac12 (x_{i} - \mu)^{\mathrm{T}} {\varSigma}^{-1} (x_{i} - \mu), \quad i=1,\ldots,n, $$

we rewrite the function F in (4) as

$$ \begin{array}{@{}rcl@{}} F(x) &=& \phi (x) - \phi(x+t) + \sum\limits_{i=1}^{n} w_{i} \left( \frac{x+t}{x + s_{i}}- \log\left( \frac{x + t}{x + s_{i}} \right) - 1\right) \\ &=& \sum\limits_{i=1}^{n} w_{i} F_{s_{i}} (x) = \sum\limits_{i=1}^{n} w_{i} \left( A(x) + B_{s_{i}}(x) \right), \end{array} $$

(6)

where

$$ F_{s}(x) := A(x) + B_{s}(x) $$

(7)

and

$$ A(x) := \phi (x) - \phi(x+t),\qquad B_{s} (x) := \frac{x+t}{x + s}- \log\left( \frac{x + t}{x + s} \right) - 1. $$

The digamma function ψ and $\phi = \psi - \log (\cdot )$ are well examined in the literature (see [1]). The value ϕ(x) is the expectation of the logarithm of a random variable which is Γ(x, x) distributed. It holds $-\frac {1}{x} < \phi (x) < - \frac {1}{2x}$ and it is well-known that − ϕ is completely monotone. This implies that the negative of A is also completely monotone, i.e., for all x > 0 and $m \in \mathbb N_{0}$ we have

$$ (-1)^{m+1} \phi^{(m)} (x) > 0, \qquad (-1)^{m+1} A^{(m)} (x) > 0, $$

in particular A < 0, $A^{\prime } > 0$ and $A^{\prime \prime } < 0$. Further, it is easy to check that

$$ \begin{array}{@{}rcl@{}} \underset{x\rightarrow 0}{\lim} \phi(x) &=& -\infty, \qquad \underset{x\rightarrow \infty}{\lim} \phi(x) = 0^{-}, \end{array} $$

(8)

$$ \begin{array}{@{}rcl@{}} \underset{x\rightarrow 0}{\lim} A(x) &=& -\infty, \qquad \underset{x\rightarrow \infty}{\lim} A(x) = 0^{-}. \end{array} $$

(9)

On the other hand, we have B_s(x) ≡ 0 if s = t, in which case F_s = A < 0 and F_s therefore has no zero. If s≠t, then B_s is completely monotone, i.e., for all x > 0 and $m \in \mathbb N_{0}$,

$$ (-1)^{m} B_{s}^{(m)} (x) > 0, $$

in particular B_s > 0, $B_{s}^{\prime } < 0$ and $B_{s}^{\prime \prime } >0$, and

$$ B_{s}(0) = \frac{t}{s} - \log \left( \frac{t}{s} \right) - 1 > 0, \qquad \underset{x\rightarrow \infty}{\lim} B_{s} (x) = 0^{+}. $$

Hence, we have

$$ \underset{x \rightarrow 0}{\lim} F_{s}(x) = -\infty, \qquad \underset{x \rightarrow \infty}{\lim} F_{s}(x) = 0. $$

(10)

If $X \sim {\mathcal N}(\mu ,{\varSigma })$ is a d-dimensional random vector, then $Y := (X-\mu )^{\mathrm {T}} {\varSigma }^{-1} (X-\mu ) \sim {\chi _{d}^{2}}$ with $\mathbb E (Y) = d$ and Var(Y) = 2d. Thus, we would expect that for samples x_i from such a random variable X the corresponding values $(x_{i} - \mu )^{\mathrm {T}} {\varSigma }^{-1} (x_{i} - \mu )$ lie with high probability in the interval $[d - \sqrt {2d},d+ \sqrt {2d}]$, respectively $s_{i} \in [t -\sqrt {t}, t + \sqrt {t}]$. These considerations are reflected in the following theorem and corollary.

Theorem 2

For $F_{s}: \mathbb {R}_{>0} \rightarrow \mathbb {R}$ given by (7) the following relations hold true:

i) If $s \in [t - \sqrt {t},t+ \sqrt {t}] \cap \mathbb R_{>0}$, then F_s(x) < 0 for all x > 0 so that F_s has no zero.
ii) If s > 0 and $s \not \in [t - \sqrt {t},t+ \sqrt {t}]$, then there exists x₊ such that F_s(x) > 0 for all x ≥ x₊. In particular, F_s has a zero.
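The two cases of the theorem can be illustrated numerically. The following sketch evaluates F_s from (7) with a hand-written digamma approximation (recurrence plus the standard asymptotic series, an implementation choice made here to stay within the Python standard library) and locates a zero by bisection for one s outside the interval. The helper names, the grid and the bisection bracket are assumptions; a zero x of F corresponds to ν = 2x in (4).

```python
import math

def digamma(x):
    """psi(x) for x > 0: recurrence psi(x) = psi(x+1) - 1/x, then asymptotic series."""
    result = 0.0
    while x < 10.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result + math.log(x) - 0.5 / x - series

def phi(x):
    return digamma(x) - math.log(x)

def F(x, s, t):
    """F_s(x) = A(x) + B_s(x) as in (7)."""
    ratio = (x + t) / (x + s)
    return phi(x) - phi(x + t) + ratio - math.log(ratio) - 1.0

def bisect_zero(s, t, lo=1e-3, hi=1e3, steps=200):
    """Bisection for a zero of F_s, assuming F_s(lo) < 0 < F_s(hi)."""
    for _ in range(steps):
        mid = 0.5 * (lo + hi)
        if F(mid, s, t) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

t = 1.0                                         # corresponds to d = 2
grid = [0.01 * 1.5 ** k for k in range(24)]     # x from 0.01 to roughly 112
inside_negative = all(F(x, 1.5, t) < 0.0 for x in grid)   # s in [t - sqrt(t), t + sqrt(t)]
root = bisect_zero(10.0, t)                     # s = 10 lies outside the interval
print(inside_negative, root, 2.0 * root)
```

A grid check of this kind is of course no substitute for the proof; it only shows the sign pattern on the chosen points.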

Proof

We have

$$ \begin{array}{@{}rcl@{}} F_{s}^{\prime}(x) &=& \phi^{\prime}\left( x\right) - \phi^{\prime}(x+t) - \frac{(s-t)^{2}}{(x +s)^{2}(x+t)}\\ &=& \psi^{\prime}(x) - \psi^{\prime}(x+t) - \frac{t}{x(x+t)} - \frac{(s-t)^{2}}{(x +s)^{2}(x+t)}. \end{array} $$

We want to sandwich $F^{\prime }_{s}$ between two rational functions P_s and P_s + Q whose zeros can easily be described.

Since the trigamma function $\psi ^{\prime }$ has the series representation

$$ \psi^{\prime}(x) = \sum\limits_{k=0}^{\infty} \frac{1}{(x+k)^{2}}, $$

see [1], we obtain

$$ F_{s}^{\prime}(x) = \sum\limits_{k=0}^{\infty}\left( \frac{1}{(x+k)^{2}} - \frac{1}{(x+k+t)^{2}}\right) - \frac{t}{x(x+t)} - \frac{(s-t)^{2}}{(x+s)^{2}(x+t)}. $$

(11)

For x > 0, we have

$$I(x) = {\int}_{0}^{\infty} \underbrace{\frac{1}{(x+u)^{2}}-\frac{1}{(x+u+t)^{2}}}_{g(u)} du =\frac1x-\frac1{x+t} = \frac{t}{(x+t)x}.$$

Let R(x) and T(x) denote the rectangular and trapezoidal rule, respectively, for computing the integral with step size 1. Then, we verify

$$R(x)=\sum\limits_{k=0}^{\infty} g(k)=\sum\limits_{k=0}^{\infty} \left( \frac1{(x+k)^{2}}-\frac1{(x+k+t)^{2}}\right) $$

so that

$$ \begin{array}{@{}rcl@{}} F_{s}^{\prime}(x) &=& \left( R(x) - T(x) \right) + \left( T(x) - I(x) \right) - \frac{(s-t)^{2}}{(x+s)^{2}(x+t)}\\ & =& \frac12 \left( \frac{1}{x^{2}} -\frac{1}{(x+t)^{2}} \right)+ \left( T(x) - I(x) \right) - \frac{(s-t)^{2}}{(x+s)^{2}(x+t)}. \end{array} $$

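By considering the first and second derivatives of g we see that the integrand in I(x) is strictly decreasing and strictly convex. For a strictly convex integrand the trapezoidal rule overestimates the integral, so that T(x) − I(x) > 0. Thus, $ P_{s}(x) < F_{s}^{\prime }(x) $, where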

$$ \begin{array}{@{}rcl@{}} P_{s}(x) &:=& \frac12 \left( \frac{1}{x^{2}} -\frac{1}{(x+t)^{2}} \right) - \frac{(s-t)^{2}}{(x+s)^{2}(x+t)}\\ &=& \frac{(2tx + t^{2})(x+s)^{2} - (s-t)^{2} x^{2}(x+t)}{2x^{2}(x+s)^{2}(x+t)^{2}}\\ &=& \frac{p_{s}(x)}{2x^{2}(x+s)^{2}(x+t)^{2}}. \end{array} $$

with p_s(x) : = a₃x³ + a₂x² + a₁x + a₀ and

$$ \begin{array}{@{}rcl@{}} a_{0} &=& t^{2}s^{2} > 0,\quad\quad\quad\quad\quad\quad\quad a_{1} = 2st(s+t) > 0, \\ a_{2} &=& t\left( 4s+t - (s-t)^{2}\right),\quad\quad a_{3} = 2\left( t- (s-t)^{2} \right). \end{array} $$

For t ≥ 1, we have

$$ a_{3} \ge 0 \quad \Longleftrightarrow \quad s \in [t - \sqrt{t}, t + \sqrt{t}] $$

(12)

and

$$a_{2} \geq 0 \quad \Longleftrightarrow \quad s \in [t+2-\sqrt{4+ 5t}, t+2 + \sqrt{4+ 5t}] \supset [t - \sqrt{t}, t + \sqrt{t}].$$

For $t=\frac 12$, it holds $[t+2-\sqrt {4+ 5t}, t+2 + \sqrt {4+ 5t}]\supset [0,t+\sqrt {t}]$.

Thus, for $s \in [t - \sqrt {t}, t + \sqrt {t}]$, by the sign rule of Descartes, p_s(x) has no positive zero which implies

$$ 0 \le P_{s}(x) < F_{s}^{\prime}(x) \quad \text{for} \quad s \in [t - \sqrt{t}, t + \sqrt{t}] \cap \mathbb R_{>0}. $$

Hence, the continuous function F_s is monotone increasing and by (10) we obtain F_s(x) < 0 for all x > 0 if $s \in [t - \sqrt {t}, t + \sqrt {t}] \cap \mathbb R_{>0}$. Let s > 0 and $s \not \in [t - \sqrt {t}, t + \sqrt {t}]$. By

$$ T(x)-I(x)=\sum\limits_{k=0}^{\infty} \left( \frac12(g(k+1)+g(k)) - {{\int}_{0}^{1}} g(k+u) du \right) $$

and Euler’s summation formula, we obtain

$$ T(x) - I(x) = \sum\limits_{k=0}^{\infty} \frac{1}{12} \left( g^{\prime}(k+1) - g^{\prime}(k) \right) - \frac{1}{720} g^{(4)}(\xi_{k}), \quad \xi_{k} \in (k,k+1) $$

with $g^{\prime }(u) = -\frac {2}{(x+u)^{3}}+\frac {2}{(x+u+t)^{3}}$ and $g^{(4)}(u) = \frac {5!}{(x+u)^{6}}-\frac {5!}{(x+u+t)^{6}}$, so that

$$ \begin{array}{@{}rcl@{}} T(x) - I(x) &=& -\frac{1}{12} g^{\prime}(0) + \sum\limits_{k=0}^{\infty} \frac16\frac1{(x+\xi_{k}+t)^{6}}-\frac16\frac1{(x+\xi_{k})^{6}}\\ &<&- \frac{1}{12}g^{\prime}(0) =\frac16\frac{3t x^{2} + 3t^{2}x + t^{3}}{x^{3}(x+t)^{3}}. \end{array} \tag{13} $$

Therefore, we conclude

$$ F_{s}^{\prime}(x) < P_{s}(x) + \underbrace{\frac16\frac{3t x^{2} + 3t^{2}x + t^{3}}{x^{3}(x+t)^{3}}}_{Q(x)} = \frac{p_{s}(x) x (x+t) + (t x^{2} + t^{2}x + \frac13 t^{3})(x+s)^{2}}{2 x^{3}(x+s)^{2}(x+t)^{3}}. $$

The main coefficient of x⁵ of the polynomial in the numerator is $2\left (t-(s-t)^{2}\right )$ which fulfills (12). Therefore, if $s \not \in [t - \sqrt {t}, t + \sqrt {t}]$, then there exists x₊ large enough such that the numerator becomes smaller than zero for all x ≥ x₊. Consequently, $F^{\prime }_{s}(x) \leq P_{s}(x) + Q(x)<0$ for all x ≥ x₊. Thus, F_s is decreasing on $[x_{+},\infty )$. By (10), we conclude that F_s has a zero. □

The following corollary states that F_s has exactly one zero if $s > t+ \sqrt {t}$. Unfortunately, we do not have such a result for $s < t - \sqrt {t}$.

Corollary 1

Let $F_{s}: \mathbb {R}_{>0} \rightarrow \mathbb {R}$ be given by (7). If $s >t + \sqrt {t}$, t ≥ 1, then F_s has exactly one zero.

Proof

By Theorem 2ii) and since $\lim _{x\rightarrow 0} F_{s}(x) = -\infty $ and $\lim _{x\rightarrow \infty } F_{s}(x) = 0^{+}$, it remains to prove that $F_{s}^{\prime }$ has at most one zero. Let x₀ > 0 be the smallest number such that $F_{s}^{\prime }(x_{0})=0$. We prove that $F_{s}^{\prime }(x)<0$ for all x > x₀. To this end, we show that $h_{s}(x):= F_{s}^{\prime }(x)(x+s)^{2}(x+t)$ is strictly decreasing. By (11) we have

$$ h_{s}(x) = (x+s)^{2}(x+t)\left( \sum\limits_{k=0}^{\infty}\frac{1}{(x+k)^{2}} - \frac{1}{(x+k+t)^{2}} - \frac{t}{x(x+t)} \right)- (s-t)^{2}, $$

and for s > t further

$$ \begin{array}{@{}rcl@{}} h_{s}^{\prime}(x) &=& \left( 2(x+s)(x+t)+ (x+s)^{2}\right)\left( \sum\limits_{k=0}^{\infty}\frac{1}{(x+k)^{2}} - \frac{1}{(x+k+t)^{2}} - \frac{t}{x(x+t)} \right) \\ &&\quad + (x+s)^{2}(x+t)\left( \sum\limits_{k=0}^{\infty}\frac{-2}{(x+k)^{3}} + \frac{2}{(x+k+t)^{3}} + \frac{t(2x+t)}{x^{2}(x+t)^{2}} \right)\\ &\leq& 3(x+s)^{2} \left( \sum\limits_{k=0}^{\infty}\frac{1}{(x+k)^{2}} - \frac{1}{(x+k+t)^{2}} - \frac{t}{x(x+t)} \right)\\ &&\quad + (x+s)^{2}(x+t)\left( \sum\limits_{k=0}^{\infty}\frac{-2}{(x+k)^{3}} + \frac{2}{(x+k+t)^{3}} + \frac{t(2x+t)}{x^{2}(x+t)^{2}} \right) \\ &=& (x+s)^{2} (R(x)-I(x)), \end{array} $$

where I(x) is the integral and R(x) the corresponding rectangular rule with step size 1 of the function g := g₁ + g₂ defined as

$$ \begin{array}{@{}rcl@{}} g_{1}(u)&:=& 3\left( \frac{1}{(x+u)^{2}} - \frac{1}{(x+ t + u)^{2}}\right), \\ g_{2}(u)&:=& (x+t)\left( \frac{-2}{(x+u)^{3}} + \frac{2}{(x+t+ u)^{3}}\right). \end{array} $$

We show that R(x) − I(x) < 0 for all x > 0. Let T(x), T_i(x) be the trapezoidal rules with step size 1 corresponding to I(x) and $I_{i}(x)={\int \limits }_{0}^{\infty } g_{i}(u)du$, i = 1,2. Then it follows

$$ R(x)- I(x) = R(x) - T(x) + T(x) - I(x) =R(x) - T(x) + T_{1}(x) - I_{1}(x) + T_{2}(x) - I_{2}(x). $$

Since g₂ is a decreasing, concave function, we conclude T₂(x) − I₂(x) < 0. Using Euler’s summation formula in (13) for g₁, we get

$$ T_{1}(x) - I_{1}(x) = -\frac{1}{12}g_{1}^{\prime}(0) - \frac{1}{720}\sum\limits_{k=0}^{\infty} g_{1}^{(4)}(\xi_{k}), \quad \xi_{k}\in(k,k+1). $$

Since $g_{1}^{(4)}$ is a positive function, we can write

$$ \begin{array}{@{}rcl@{}} R(x) - I(x) &<& R(x) - T(x) + T_{1}(x) - I_{1}(x) \leq \frac{1}{2} g(0) -\frac{1}{12}g_{1}^{\prime}(0)\\ &=& \frac{3}{2}\left( \frac{1}{x^{2}}-\frac{1}{(x+t)^{2}}\right) + \frac{1}{2}(x+t) \left( \frac{-2}{x^{3}} + \frac{2}{(x+t)^{3}}\right) \\ &&\quad- \frac{1}{2}\left( \frac{-1}{x^{3}} + \frac{1}{(x+t)^{3}}\right)\\ &=&\frac{t}{2} \frac{(- 3 t + 3 )x^{2} +\left( - 5 t^{2} + 3t\right)x -2 t^{3} +t^{2}}{x^{3}(x+t)^{3}}. \end{array} $$

All coefficients of the polynomial in x in the numerator are smaller than or equal to zero for t ≥ 1, which implies that h_s is strictly decreasing. □

Theorem 2 implies the following corollary.

Corollary 2

For $F: \mathbb {R}_{>0} \rightarrow \mathbb {R}$ given by (6) and $\delta _{i} := (x_{i} - \mu )^{\mathrm {T}} {\varSigma }^{-1} (x_{i} - \mu )$, i = 1,…,n, the following relations hold true:

i) If $\delta _{i} \in [d - \sqrt {2d},d+ \sqrt {2d}] \cap \mathbb R_{>0}$ for all i ∈{1,…,n}, then F(x) < 0 for all x > 0 so that F has no zero.
ii) If δ_i > 0 and $\delta _{i} \not \in [d - \sqrt {2d},d+ \sqrt {2d}]$ for all i ∈{1,…,n}, there exists x₊ such that F(x) > 0 for all x ≥ x₊. In particular, F has a zero.

Proof

Consider $F = {\sum }_{i=1}^{n} F_{s_{i}}$. If $\delta _{i} \in [d - \sqrt {2d},d+ \sqrt {2d}] \cap \mathbb R_{>0}$ for all i ∈{1,…,n}, then we have by Theorem 2 that $F_{s_{i}} (x) < 0$ for all x > 0. Clearly, the same holds true for the whole function F such that it cannot have a zero.

If $\delta _{i} \not \in [d - \sqrt {2d},d+ \sqrt {2d}]$ for all i ∈{1,…,n}, then we know by Theorem 2 that there exist x_i+ > 0 such that $F_{s_{i}} (x) > 0$ for x ≥ x_i+. Thus, F(x) > 0 for $x \ge x_{+} := \max \limits _{i}(x_{i+})$. Since $\lim _{x \rightarrow 0} F(x) = -\infty $ this implies that F has a zero. □
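The two sufficient conditions of Corollary 2 are easy to check numerically once the values δ_i are available. The following minimal sketch does this; the function name `classify_deltas` and the returned labels are hypothetical, and the mixed case, which Corollary 2 does not cover, is reported as undecided.

```python
import math

def classify_deltas(deltas, d):
    """Apply the sufficient conditions of Corollary 2 to values delta_i > 0.

    Returns "no zero" (case i), "has zero" (case ii), or "undecided" when
    some delta_i lie inside and some outside the interval.
    """
    if any(delta <= 0 for delta in deltas):
        raise ValueError("all delta_i must be positive")
    lo = d - math.sqrt(2 * d)
    hi = d + math.sqrt(2 * d)
    inside = [lo <= delta <= hi for delta in deltas]
    if all(inside):
        return "no zero"
    if not any(inside):
        return "has zero"
    return "undecided"

# For d = 2 the interval is [0, 4].
print(classify_deltas([1.0, 2.0, 3.0], d=2))
print(classify_deltas([6.0, 8.0, 10.0, 5.0], d=2))
```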

5 Algorithms

In this section, we propose alternatives to the classical EM algorithm for computing the parameters of the Student t distribution along with convergence results. We focus on estimating the degree of freedom parameter ν, for which the function F plays a central role.

Algorithm 1 with weights $w_{i} = \frac {1}{n}$, i = 1,…,n, is the classical EM algorithm. Note that the function in the third M-Step

$$ \begin{array}{@{}rcl@{}} {\varPhi}_{r} \left( \frac{\nu}{2} \right) &:= \phi \left( \frac{\nu}{2} \right) \underbrace{ - \phi \left( \frac{\nu_{r} + d}{2} \right) + \sum\limits_{i=1}^{n} w_{i} \left( \gamma_{i,r} - \log(\gamma_{i,r} ) - 1 \right)}_{c_{r}} \end{array} $$

has a unique zero since by (8) the function ϕ < 0 is monotone increasing with $\lim _{x \rightarrow \infty } \phi (x) = 0^{-}$ and c_r > 0. Concerning the convergence of the EM algorithm it is known that the values of the objective function L(ν_r,μ_r,Σ_r) are monotone decreasing in r and that a subsequence of the iterates converges to a critical point of L(ν, μ, Σ) if such a point exists, see [5].

Algorithm 2 differs from the EM algorithm in the iteration of Σ, where the factor $\frac {1}{\sum \limits _{i=1}^{n} w_{i} \gamma _{i,r}}$ is now incorporated. The computation of this factor requires no additional computational effort, but speeds up the performance, in particular for smaller ν. This kind of acceleration was suggested in [12, 24]. For fixed ν ≥ 1, it was shown in [32] that this algorithm is indeed an EM algorithm arising from another choice of the hidden variable than used in the standard approach, see also [15]. Thus, it follows for fixed ν ≥ 1 that the sequence L(ν, μ_r,Σ_r) is monotone decreasing. However, we also iterate over ν. In contrast to the EM Algorithm 1, our ν iteration step depends on $\mu_{r+1}$ and ${\varSigma}_{r+1}$ instead of μ_r and Σ_r. This is important for our convergence results. Note that for both cases, the accelerated algorithm can no longer be interpreted as an EM algorithm, so that the convergence results of the classical EM approach are no longer available.
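To make the difference between the two Σ-updates concrete, here is a minimal NumPy sketch of one (μ, Σ) step. The algorithm listings are not reproduced in this section, so the weights γ_i = (ν + d)/(ν + δ_i) and the weighted-mean update for μ are assumptions taken from the standard EM scheme for the Student t distribution; the function name `location_scatter_step` is hypothetical.

```python
import numpy as np

def location_scatter_step(X, w, nu, mu, Sigma):
    """One (mu, Sigma) update for fixed nu.

    Returns mu_new, the classical EM scatter update, and the accelerated
    (aEM) update, which only differs by the factor 1 / sum_i w_i gamma_i.
    """
    n, d = X.shape
    diff = X - mu
    # delta_i = (x_i - mu)^T Sigma^{-1} (x_i - mu)
    delta = np.einsum("ij,ij->i", diff, np.linalg.solve(Sigma, diff.T).T)
    gamma = (nu + d) / (nu + delta)      # assumed E-step weights
    wg = w * gamma
    mu_new = wg @ X / wg.sum()           # assumed weighted-mean update
    diff_new = X - mu_new
    Sigma_em = (wg[:, None] * diff_new).T @ diff_new
    Sigma_aem = Sigma_em / wg.sum()      # extra factor of Algorithm 2
    return mu_new, Sigma_em, Sigma_aem

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 2))
w = np.full(50, 1.0 / 50)
mu1, S_em, S_aem = location_scatter_step(X, w, 3.0, X.mean(axis=0), np.cov(X.T))
```

The accelerated update reuses all quantities of the EM update, which reflects the remark that the additional factor comes at no extra computational cost.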

Let us mention that a Jacobi variant of Algorithm 2 for fixed ν, i.e.,

$$ {\varSigma}_{r+1} = \sum\limits_{i=1}^{n} \frac{w_{i}\gamma_{i,r} (x_{i}-\mu_{r})(x_{i}-\mu_{r})^{\mathrm{T}} }{{\sum}_{i=1}^{n} w_{i} \gamma_{i,r}}, $$

with μ_r instead of $\mu_{r+1}$, including a convergence proof, was suggested in [17]. The main reason for this index choice was that we were able to prove monotone convergence of a simplified version of the algorithm for estimating the location and scale of Cauchy noise (d = 1, ν = 1), which could not be achieved with the variant incorporating $\mu_{r+1}$ (see [16]). This simplified version is known as myriad filter in image processing. In this paper, we keep the original variant from the EM algorithm (14) since we are mainly interested in the computation of ν.

Instead of the above algorithms we suggest to take the critical point (4) more directly into account in the next two algorithms.

Finally, Algorithm 4 computes the update of ν by directly finding a zero of the whole function F in (4) given μ_r and Σ_r. The existence of such a zero was discussed in the previous section. The zero computation is done by an inner loop which iterates the update step of ν from Algorithm 3. We will see that the iteration indeed converges to a zero of F.

In the rest of this section, we prove that the sequence (L(ν_r,μ_r,Σ_r))_r generated by Algorithms 2 and 3 decreases in each iteration step and that there exists a subsequence of the iterates which converges to a critical point.

We will need the following auxiliary lemma.

Lemma 1

Let $F_{a},F_{b}\colon \mathbb {R}_{>0}\to \mathbb {R}$ be continuous functions, where F_a is strictly increasing and F_b is strictly decreasing. Define F := F_a + F_b. For any initial value x₀ > 0 assume that the sequence generated by

$$ x_{l+1} = \text{ zero of } F_{a}(x)+F_{b}(x_{l}) $$

is uniquely determined, i.e., the functions on the right-hand side have a unique zero. Then it holds

i) If F(x₀) < 0, then (x_l)_l is strictly increasing and F(x) < 0 for all $x \in [x_{l},x_{l+1}]$, $l \in \mathbb N_{0}$.
ii) If F(x₀) > 0, then (x_l)_l is strictly decreasing and F(x) > 0 for all $x \in [x_{l+1},x_{l}]$, $l \in \mathbb N_{0}$.

Furthermore, assume that there exists x₋ > 0 with F(x) < 0 for all x < x₋ and x₊ > 0 with F(x) > 0 for all x > x₊. Then, the sequence (x_l)_l converges to a zero x^∗ of F.

Proof

We consider the case i) that F(x₀) < 0. Case ii) follows in a similar way.

We show by induction that F(x_l) < 0 and that $x_{l+1} > x_{l}$ for all $l \in \mathbb N$. Then it holds for all $l\in \mathbb N$ and $x \in (x_{l},x_{l+1})$ that $F_{a}(x) + F_{b}(x) < F_{a}(x) + F_{b}(x_{l}) < F_{a}(x_{l+1}) + F_{b}(x_{l}) = 0$. Thus F(x) < 0 for all $x \in [x_{l},x_{l+1}]$, $l \in \mathbb N_{0}$.

Induction step. Let $F_{a}(x_{l}) + F_{b}(x_{l}) < 0$. Since $F_{a}(x_{l+1}) + F_{b}(x_{l}) = 0 > F_{a}(x_{l}) + F_{b}(x_{l})$ and F_a is strictly increasing, we have $x_{l+1} > x_{l}$. Using that F_b is strictly decreasing, we get $F_{b}(x_{l+1}) < F_{b}(x_{l})$ and consequently

$$ F(x_{l+1}) = F_{a}(x_{l+1}) + F_{b}(x_{l+1}) < F_{a}(x_{l+1}) + F_{b}(x_{l})=0. $$

Assume now that F(x) > 0 for all x > x₊. Since the sequence (x_l)_l is strictly increasing and F(x_l) < 0 it must be bounded from above by x₊. Therefore it converges to some $x^{*}\in \mathbb {R}_{>0}$. Now, it holds by the continuity of F_a and F_b that

$$ 0 =\lim\limits_{l\to\infty} F_{a}(x_{l+1}) + F_{b}(x_{l}) = F_{a}(x^{*}) + F_{b}(x^{*}) = F(x^{*}). $$

Hence x^∗ is a zero of F. □
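The iteration of Lemma 1 can be tried on a toy splitting. In the following sketch the functions F_a(x) = x − 1/x and F_b(x) = exp(−x) are assumptions chosen purely for illustration (they are not taken from the paper), and `zero_of_increasing` is a hypothetical bisection helper.

```python
import math

def zero_of_increasing(f, lo=1e-9, hi=1.0):
    """Bisection for the zero of a strictly increasing function with f(lo) < 0."""
    while f(hi) < 0:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Toy splitting (illustrative assumption): F = F_a + F_b
F_a = lambda x: x - 1.0 / x      # strictly increasing on (0, inf)
F_b = lambda x: math.exp(-x)     # strictly decreasing
F = lambda x: F_a(x) + F_b(x)

def iterate(x0, steps=30):
    """x_{l+1} = zero of F_a(x) + F_b(x_l)."""
    xs = [x0]
    for _ in range(steps):
        c = F_b(xs[-1])
        xs.append(zero_of_increasing(lambda x: F_a(x) + c))
    return xs

xs = iterate(0.1)   # F(0.1) < 0, so case i) applies
print(xs[:4], F(xs[-1]))
```

Starting from F(x₀) < 0, the first iterates increase and the residual F(x_l) tends to zero, in line with case i) and the convergence statement.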

For the setting in Algorithm 4, Lemma 1 implies the following corollary.

Corollary 3

Let $F_{a} (\nu ) := \phi \left (\frac {\nu }{2} \right ) - \phi \left (\frac {\nu +d}{2} \right )$ and

$$ F_{b} (\nu) := \sum\limits_{i=1}^{n} w_{i} \left( \frac{\nu + d}{\nu + \delta_{i,r+1}} - \log\left( \frac{\nu + d}{\nu + \delta_{i,r+1}} \right) - 1 \right),\quad r \in \mathbb{N}_{0}. $$

Assume that there exists ν₊ > 0 such that F := F_a + F_b > 0 for all ν ≥ ν₊. Then the sequence (ν_{r, l})_l generated by the r-th inner loop of Algorithm 4 converges to a zero of F.

Note that by Corollary 2 the above condition on F is fulfilled in each iteration step, e.g., if $\delta _{i,r} \not \in [d - \sqrt {2d} , d + \sqrt {2d}]$ for i = 1,…,n and $r \in \mathbb {N}_{0}$.

Proof

From the previous section we know that F_a is strictly increasing and F_b is strictly decreasing. Both functions are continuous. If F(ν_r) < 0, then we know from Lemma 1 that (ν_{r, l})_l is increasing and converges to a zero $\nu _{r}^{*}$ of F.

If F(ν_r) > 0, then we know from Lemma 1 that (ν_{r, l})_l is decreasing. The condition that there exists $x_{-}\in \mathbb {R}_{>0}$ with F(x) < 0 for all x < x₋ is fulfilled since $\lim _{x \rightarrow 0} F(x) = -\infty $. Hence, by Lemma 1, the sequence converges to a zero $\nu _{r}^{*}$ of F. □
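The inner loop covered by Corollary 3 can be sketched in a few lines. The code below assumes that ϕ from (8) is the digamma function minus the logarithm, uses a simple series-based digamma and a bisection solver in place of Newton's method, and takes the values δ_i as given inputs; all function names are hypothetical.

```python
import math

def digamma(x):
    """Digamma via the recurrence psi(x) = psi(x + 1) - 1/x and an asymptotic series."""
    acc = 0.0
    while x < 10.0:
        acc -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 * (1.0 / 252 - inv2 / 240)))
    return acc + math.log(x) - 0.5 / x - series

def phi(x):
    # Assumed definition of phi in (8): digamma(x) - log(x)
    return digamma(x) - math.log(x)

def zero_of_increasing(f, lo=1e-9, hi=1.0):
    """Bisection for the zero of a strictly increasing function with f(lo) < 0."""
    while f(hi) < 0:
        hi *= 2.0
        if hi > 1e12:
            raise ValueError("no sign change found")
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def F_a(nu, d):
    return phi(nu / 2.0) - phi((nu + d) / 2.0)

def F_b(nu, deltas, weights, d):
    total = 0.0
    for w, delta in zip(weights, deltas):
        y = (nu + d) / (nu + delta)
        total += w * (y - math.log(y) - 1.0)
    return total

def gmmf_inner_loop(nu0, deltas, weights, d, tol=1e-10, max_iter=1000):
    """Iterate nu_{l+1} = zero of F_a(nu) + F_b(nu_l) until the change is small."""
    nu = nu0
    for _ in range(max_iter):
        c = F_b(nu, deltas, weights, d)
        nu_next = zero_of_increasing(lambda x: F_a(x, d) + c)
        if abs(nu_next - nu) <= tol * nu:
            return nu_next
        nu = nu_next
    return nu

deltas, weights = [6.0, 8.0, 10.0, 5.0], [0.25] * 4   # all outside [0, 4] for d = 2
nu_star = gmmf_inner_loop(1.0, deltas, weights, d=2)
print(nu_star, F_a(nu_star, 2) + F_b(nu_star, deltas, weights, 2))
```

All δ_i in this example lie outside the interval of Corollary 2, so a zero of F exists and the loop stops close to it.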

To prove that the objective function decreases in each step of the Algorithms 2–4 we need the following lemma.

Lemma 2

Let $F_{a},F_{b}\colon \mathbb {R}_{>0}\to \mathbb {R}$ be continuous functions, where F_a is strictly increasing and F_b is strictly decreasing. Define F := F_a + F_b and let $G\colon \mathbb {R}_{>0}\to \mathbb {R}$ be an antiderivative of F, i.e., $F= \frac {\mathrm {d}}{\mathrm {d}x} G$. For an arbitrary x₀ > 0, let (x_l)_l be the sequence generated by

$$ x_{l+1} = \text{ zero of } F_{a}(x) + F_{b}(x_{l}). $$

Then the following holds true:

i) The sequence (G(x_l))_l is monotone decreasing with $G(x_{l}) = G(x_{l+1})$ if and only if x₀ is a critical point of G. If (x_l)_l converges, then the limit x^∗ fulfills
$$ G(x_{0}) \geq G(x_{1}) \geq G(x^{*}), $$
with equality if and only if x₀ is a critical point of G.
ii) Let $F = \tilde F_{a} + \tilde F_{b}$ be another splitting of F with continuous functions $\tilde F_{a}, \tilde F_{b}$, where the first one is strictly increasing and the second one strictly decreasing. Assume that $\tilde F_{a}^{\prime }(x) > F_{a}^{\prime }(x)$ for all x > 0. Then it holds for $y_{1} := \text { zero of } \tilde F_{a}(x) + \tilde F_{b}(x_{0})$ that G(x₀) ≥ G(y₁) ≥ G(x₁) with equality if and only if x₀ is a critical point of G.

Proof

i) If F(x₀) = 0, then x₀ is a critical point of G.

Let F(x₀) < 0. By Lemma 1 we know that (x_l)_l is strictly increasing and that F(x) < 0 for $x \in [x_{l},x_{l+1}]$, $l \in \mathbb {N}_{0}$. By the Fundamental Theorem of calculus it holds

$$ G(x_{l+1})=G(x_{l})+{\int}_{x_{l}}^{x_{l+1}} F(\nu) d\nu. $$

Thus, $G(x_{l+1}) < G(x_{l})$.

Let F(x₀) > 0. By Lemma 1 we know that (x_l)_l is strictly decreasing and that F(x) > 0 for $x \in [x_{l+1},x_{l}]$, $l \in \mathbb {N}_{0}$. Then

$$ G(x_{l}) = G(x_{l+1})+{\int}_{x_{l+1}}^{x_{l}} F(\nu) d\nu. $$

implies $G(x_{l+1}) < G(x_{l})$. Now, the rest of assertion i) follows immediately.

ii) It remains to show that G(x₁) ≤ G(y₁). Let F(x₀) < 0. Then we have y₁ ≥ x₀ and x₁ ≥ x₀. By the Fundamental Theorem of calculus we obtain

$$ F(x_{0}) + {\int}_{x_{0}}^{x_{1}} F_{a}^{\prime}(x)dx = F_{a}(x_{0})+{\int}_{x_{0}}^{x_{1}} F_{a}^{\prime}(x) dx + F_{b} (x_{0}) = F_{a} (x_{1}) + F_{b} (x_{0})=0,$$

and

$$ F(x_{0}) + {\int}_{x_{0}}^{y_{1}} \tilde F_{a}^{\prime}(x)dx=\tilde F_{a}(x_{0})+{\int}_{x_{0}}^{y_{1}}\tilde F_{a}^{\prime}(x) dx+\tilde F_{b}(x_{0}) =\tilde F_{a}(y_{1})+\tilde F_{b}(x_{0})=0. $$

This yields

$$ {\int}_{x_{0}}^{x_{1}} F_{a}^{\prime}(x) dx={\int}_{x_{0}}^{y_{1}}\tilde F_{a}^{\prime}(x)dx, $$

and since $\tilde F^{\prime }_{a}(x) > F^{\prime }_{a}(x)$ further y₁ ≤ x₁ with equality if and only if x₀ = x₁, i.e., if x₀ is a critical point of G. Since F(x) < 0 on (x₀,x₁) it holds

$$ G(x_{1})=G(y_{1})+{\int}_{y_{1}}^{x_{1}}F(x) dx \leq G(y_{1}), $$

with equality if and only if x₀ = x₁. The case F(x₀) > 0 can be handled similarly. □
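Assertion ii) of Lemma 2 can be illustrated with two splittings of the same toy function. The functions below are assumptions chosen for illustration only: F(x) = x − 1/x + exp(−x) with antiderivative G(x) = x²/2 − log(x) − exp(−x), split once with F_a(x) = x − 1/x and once with the steeper increasing part 2x − 1/x.

```python
import math

def zero_of_increasing(f, lo=1e-9, hi=1.0):
    """Bisection for the zero of a strictly increasing function with f(lo) < 0."""
    while f(hi) < 0:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# First splitting: F = F_a + F_b
F_a = lambda x: x - 1.0 / x
F_b = lambda x: math.exp(-x)
# Second splitting with a steeper increasing part: F = Ft_a + Ft_b
Ft_a = lambda x: 2.0 * x - 1.0 / x
Ft_b = lambda x: math.exp(-x) - x
# Antiderivative of F
G = lambda x: 0.5 * x * x - math.log(x) - math.exp(-x)

x0 = 0.1                                                 # F(x0) < 0
x1 = zero_of_increasing(lambda x: F_a(x) + F_b(x0))      # step with (F_a, F_b)
y1 = zero_of_increasing(lambda x: Ft_a(x) + Ft_b(x0))    # step with (Ft_a, Ft_b)
print(x0, y1, x1)
print(G(x0), G(y1), G(x1))
```

The step based on the steeper increasing part moves less far (x₀ < y₁ < x₁) and yields the smaller decrease of G, as stated in ii).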

Lemma 2 implies the following relation between the values of the objective function L for Algorithms 2–4.

Corollary 4

For the same fixed $\nu _{r}>0, \mu _{r}\in \mathbb {R}^{d}, {\varSigma }_{r}\in \text {SPD}(d)$ define $\mu_{r+1}$, ${\varSigma}_{r+1}$, $\nu _{r+1}^{\text {aEM}}$, $\nu _{r+1}^{\text {MMF}}$ and $\nu _{r+1}^{\text {GMMF}}$ by Algorithm 2, 3 and 4, respectively. For the GMMF algorithm assume that the inner loop converges. Then it holds

$$ \begin{array}{@{}rcl@{}} L(\nu_{r},\mu_{r+1},{\varSigma}_{r+1}) &\geq& L(\nu_{r+1}^{\text{aEM}},\mu_{r+1},{\varSigma}_{r+1}) \geq L(\nu_{r+1}^{\text{MMF}},\mu_{r+1},{\varSigma}_{r+1})\\ &\geq& L(\nu_{r+1}^{\text{GMMF}},\mu_{r+1},{\varSigma}_{r+1}). \end{array} $$

Equality holds true if and only if $\frac {\mathrm {d}}{\mathrm {d}\nu }L(\nu _{r},\mu _{r+1},{\varSigma }_{r+1})=0$ and in this case $\nu _{r} = \nu _{r+1}^{\text {aEM}} = \nu _{r+1}^{\text {MMF}} = \nu _{r+1}^{\text {GMMF}}$.

Proof

For $G(\nu) := L(\nu, \mu_{r+1},{\varSigma}_{r+1})$, we have $\frac {\mathrm {d}}{\mathrm {d}\nu } L(\nu ,\mu _{r+1},{\varSigma }_{r+1}) = F(\nu )$, where

$$ F(\nu) := \phi\left( \frac{\nu}{2} \right) -\phi\left( \frac{\nu +d}{2} \right) + \sum\limits_{i=1}^{n} w_{i} \left( \frac{\nu + d}{\nu + \delta_{i,r+1}} - \log\left( \frac{\nu + d}{\nu + \delta_{i,r+1}} \right) - 1 \right). $$

We use the splitting

$$F = F_{a} + F_{b} = \tilde F_{a} + \tilde F_{b}$$

with

$$ F_{a} (\nu):= \phi\left( \frac\nu2 \right)- \phi\left( \frac{\nu + d}{2} \right), \quad \tilde F_{a} := \phi\left( \frac\nu2 \right), $$

$$ F_{b}(\nu) := \sum\limits_{i=1}^{n} w_{i} \left( \frac{\nu + d}{\nu + \delta_{i,r+1}} - \log\left( \frac{\nu + d}{\nu + \delta_{i,r+1}} \right) - 1 \right), $$

and

$$ \quad \tilde F_{b} (\nu):= - \phi \left( \frac{\nu+d}{2} \right) + F_{b}(\nu). $$

By the considerations in the previous section we know that F_a, $\tilde F_{a}$ are strictly increasing and F_b, $\tilde F_{b}$ are strictly decreasing. Moreover, since $\phi ^{\prime } > 0$ we have $\tilde F^{\prime }_{a} > F^{\prime }_{a}$. Hence it follows from Lemma 2(ii) that

$$L(\nu_{r},\mu_{r+1},{\varSigma}_{r+1}) \ge L\left( \nu_{r+1}^{\text{aEM}},\mu_{r+1},{\varSigma}_{r+1}\right) \ge L\left( \nu_{r+1}^{\text{MMF}},\mu_{r+1},{\varSigma}_{r+1}\right).$$

Finally, we conclude by Lemma 2(i) that

$$L\left( \nu_{r+1}^{\text{MMF}},\mu_{r+1},{\varSigma}_{r+1}\right) \ge L\left( \nu_{r+1}^{\text{GMMF}},\mu_{r+1},{\varSigma}_{r+1}\right).$$

□
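The two splittings used in the proof translate directly into the ν-updates of the aEM and MMF steps. The sketch below computes both from the same ν_r and the same values δ_{i,r+1}; it assumes that ϕ is the digamma function minus the logarithm, replaces Newton's method by bisection, and uses hypothetical function names and example data.

```python
import math

def digamma(x):
    """Digamma via the recurrence psi(x) = psi(x + 1) - 1/x and an asymptotic series."""
    acc = 0.0
    while x < 10.0:
        acc -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 * (1.0 / 252 - inv2 / 240)))
    return acc + math.log(x) - 0.5 / x - series

def phi(x):
    # Assumed definition of phi: digamma(x) - log(x)
    return digamma(x) - math.log(x)

def zero_of_increasing(f, lo=1e-9, hi=1.0):
    """Bisection for the zero of a strictly increasing function with f(lo) < 0."""
    while f(hi) < 0:
        hi *= 2.0
        if hi > 1e12:
            raise ValueError("no sign change found")
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def nu_updates(nu_r, deltas, weights, d):
    """Return (nu_aEM, nu_MMF) for the same nu_r and delta_{i,r+1}."""
    fb = 0.0
    for w, delta in zip(weights, deltas):
        y = (nu_r + d) / (nu_r + delta)
        fb += w * (y - math.log(y) - 1.0)
    # aEM: zero of tilde F_a(nu) + tilde F_b(nu_r), with tilde F_a = phi(nu / 2)
    c_aem = fb - phi((nu_r + d) / 2.0)
    nu_aem = zero_of_increasing(lambda x: phi(x / 2.0) + c_aem)
    # MMF: zero of F_a(nu) + F_b(nu_r), with F_a = phi(nu / 2) - phi((nu + d) / 2)
    nu_mmf = zero_of_increasing(lambda x: phi(x / 2.0) - phi((x + d) / 2.0) + fb)
    return nu_aem, nu_mmf

nu_aem, nu_mmf = nu_updates(1.0, [6.0, 8.0, 10.0, 5.0], [0.25] * 4, d=2)
print(nu_aem, nu_mmf)
```

In this example F(ν_r) < 0, so both updates increase ν, and the MMF step moves further than the aEM step, which matches the ordering of the objective values in Corollary 4.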

Concerning the convergence of the three algorithms we have the following result.

Theorem 3

Let (ν_r,μ_r,Σ_r)_r be the sequence generated by Algorithm 2, 3 or 4, respectively, starting with arbitrary initial values $\nu _{0} >0,\mu _{0}\in \mathbb {R}^{d},{\varSigma }_{0}\in \text {SPD}(d)$. For the GMMF algorithm we assume that in each step the inner loop converges. Then it holds for all $r\in \mathbb N_{0}$ that

$$ L(\nu_{r},\mu_{r},{\varSigma}_{r}) \geq L(\nu_{r+1},\mu_{r+1},{\varSigma}_{r+1}), $$

with equality if and only if $(\nu_{r},\mu_{r},{\varSigma}_{r}) = (\nu_{r+1},\mu_{r+1},{\varSigma}_{r+1})$.

Proof

By the general convergence results of the accelerated EM algorithm for fixed ν, see also [17], it holds

$$ L(\nu_{r},\mu_{r+1},{\varSigma}_{r+1})\leq L(\nu_{r},\mu_{r},{\varSigma}_{r}), $$

with equality if and only if $(\mu_{r},{\varSigma}_{r}) = (\mu_{r+1},{\varSigma}_{r+1})$. By Corollary 4 it holds

$$ L(\nu_{r+1},\mu_{r+1},{\varSigma}_{r+1})\leq L(\nu_{r},\mu_{r+1},{\varSigma}_{r+1}), $$

with equality if and only if $\nu_{r} = \nu_{r+1}$. The combination of both results proves the claim. □

Lemma 3

Let $T = (T_{1}, T_{2}, T_{3}): \mathbb {R}_{>0} \times \mathbb {R}^{d} \times SPD(d) \rightarrow \mathbb {R}_{>0} \times \mathbb {R}^{d} \times SPD(d)$ be the operator of one iteration step of Algorithm 2 (or 3). Then T is continuous.

Proof

We show the statement for Algorithm 3. For Algorithm 2 it can be shown analogously. Clearly the mapping (T₂,T₃)(ν, μ, Σ) is continuous. Moreover, we have

$$T_{1}(\nu,\mu,{\varSigma}) = \text{zero of } {\varPsi}(x, \nu,T_{2}(\nu,\mu,{\varSigma}),T_{3}(\nu,\mu,{\varSigma})),$$

where

$$ \begin{array}{@{}rcl@{}} &{{\varPsi}}&(x, \nu,\mu,{\varSigma}) =\phi\left( \frac{x}{2}\right)-\phi\left( \frac{x+d}{2}\right)\\ &&+ \sum\limits_{i=1}^{n} w_{i}\left( \tfrac{\nu+d}{\nu+(x_{i}-\mu)^{T}{\varSigma}^{-1}(x_{i}-\mu)}-\log\left( \tfrac{\nu+d}{\nu+(x_{i}-\mu)^{T}{\varSigma}^{-1}(x_{i}-\mu)}\right)-1\right). \end{array} $$

It is sufficient to show that the zero of Ψ depends continuously on ν, T₂ and T₃. Now the continuously differentiable function Ψ is strictly increasing in x, so that $\frac {\partial }{\partial x} {\varPsi }(x,\nu ,T_{2},T_{3})>0$. By Ψ(T₁,ν, T₂,T₃) = 0, the Implicit Function Theorem yields the following statement: There exists an open neighborhood U × V of (T₁,ν, T₂,T₃) with $U\subset \mathbb {R}_{>0}$ and $V\subset \mathbb {R}_{>0}\times \mathbb {R}^{d}\times SPD(d)$ and a continuously differentiable function G: V → U such that for all (x, ν, μ, Σ) ∈ U × V it holds

$$ {\varPsi}(x,\nu,\mu,{\varSigma})=0 \quad \text{if and only if}\quad G(\nu,\mu,{\varSigma})=x. $$

Thus the zero of Ψ depends continuously on ν, T₂ and T₃. □

This implies the following theorem.

Theorem 4

Let (ν_r,μ_r,Σ_r)_r be the sequence generated by Algorithm 2 or 3 with arbitrary initial values $\nu _{0} >0,\mu _{0}\in \mathbb {R}^{d},{\varSigma }_{0}\in \text {SPD}(d)$. Then every cluster point of (ν_r,μ_r,Σ_r)_r is a critical point of L.

Proof

The mapping T defined in Lemma 3 is continuous. Further we know from its definition that (ν, μ, Σ) is a critical point of L if and only if it is a fixed point of T. Let $(\hat \nu ,\hat \mu ,\hat {\varSigma })$ be a cluster point of (ν_r,μ_r,Σ_r)_r. Then there exists a subsequence $(\nu _{r_{s}},\mu _{r_{s}},{\varSigma }_{r_{s}})_{s}$ which converges to $(\hat \nu ,\hat \mu ,\hat {\varSigma })$. Further we know by Theorem 3 that L_r = L(ν_r,μ_r,Σ_r) is decreasing. Since (L_r)_r is bounded from below, it converges. Now it holds

$$ \begin{array}{@{}rcl@{}} L\left( \hat \nu,\hat \mu,\hat {\varSigma}\right)&=&\underset{s\to\infty}{\lim}L\left( \nu_{r_{s}},\mu_{r_{s}},{\varSigma}_{r_{s}}\right)\\ &=&\underset{s\to\infty}{\lim}L_{r_{s}}=\underset{s\to\infty}{\lim}L_{r_{s}+1}\\ &=&\underset{s\to\infty}{\lim}L\left( \nu_{r_{s}+1},\mu_{r_{s}+1},{\varSigma}_{r_{s}+1}\right)\\ &=&\underset{s\to\infty}{\lim}L\left( T\left( \nu_{r_{s}},\mu_{r_{s}},{\varSigma}_{r_{s}}\right)\right)=L\left( T\left( \hat\nu,\hat\mu,\hat{\varSigma}\right)\right). \end{array} $$

By Theorem 3 and the definition of T we have that L(ν, μ, Σ) = L(T(ν, μ, Σ)) if and only if (ν, μ, Σ) = T(ν, μ, Σ). By the definition of the algorithm this is the case if and only if (ν, μ, Σ) is a critical point of L. Thus $(\hat \nu ,\hat \mu ,\hat {\varSigma })$ is a critical point of L. □

6 Numerical results

In this section we give two numerical examples of the developed theory. First, we compare the four different algorithms in Section 6.1. Then, in Section 6.2, we address further accelerations of our algorithms by SQUAREM [33] and DAAREM [9] and show also a comparison with the ECME algorithm [20]. Finally, in Section 6.3, we provide an application in image analysis by determining the degree of freedom parameter in images corrupted by Student t noise. We run all experiments on an HP ProBook with an Intel i7-8550U quad-core processor. The code is provided online^{Footnote 2}.

6.1 Comparison of algorithms

In this section, we compare the numerical performance of the classical EM algorithm 1 and the proposed Algorithms 2, 3, and 4. To this aim, we did the following Monte Carlo simulation: Based on the stochastic representation of the Student t distribution, see (1), we draw n = 1000 i.i.d. realizations of the T_ν(μ, Σ) distribution with location parameter μ = 0 and different scatter matrices Σ and degrees of freedom parameters ν. Then, we used Algorithms 2, 3, and 4 to compute the ML estimator $(\hat \nu ,\hat \mu ,\hat {\varSigma })$.
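The sampling step of such a simulation can be sketched as follows. Representation (1) is not reproduced in this section, so the code assumes the standard scale-mixture form X = μ + Z/√Y with Z ~ N(0, Σ) and Y ~ Γ(ν/2, ν/2) (rate parametrization); the function name `sample_student_t` is hypothetical.

```python
import numpy as np

def sample_student_t(n, nu, mu, Sigma, rng):
    """Draw n samples of T_nu(mu, Sigma) via the assumed scale-mixture representation."""
    d = len(mu)
    Z = rng.multivariate_normal(np.zeros(d), Sigma, size=n)
    # Gamma with shape nu/2 and rate nu/2, i.e. scale 2/nu
    Y = rng.gamma(shape=nu / 2.0, scale=2.0 / nu, size=n)
    return mu + Z / np.sqrt(Y)[:, None]

rng = np.random.default_rng(0)
X = sample_student_t(1000, 5.0, np.zeros(2), np.eye(2), rng)
print(X.shape)
```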

We initialize all algorithms with the sample mean for μ and the sample covariance matrix for Σ. Furthermore, we set the initial value ν₀ = 3, and in all algorithms the zero of the respective function is computed by Newton’s method. As a stopping criterion we use the following relative distance:

$$ \frac{ \sqrt{ \left\| \mu_{r+1} - \mu_{r} \right\|^{2} + \left\| {\varSigma}_{r+1} -{\varSigma}_{r} \right\|_{F}^{2}} }{ \sqrt{\|\mu_{r}\|^{2}+\|{\varSigma}_{r}\|_{F}^{2}} } + \frac{ \sqrt{(\log(\nu_{r+1})-\log(\nu_{r}))^{2}}}{|{\log(\nu_{r})}|}<10^{-5}. $$

We take the logarithm of ν in the stopping criterion, because T_ν(μ, Σ) converges to the normal distribution as $\nu \to \infty $ and therefore the difference between T_ν(μ, Σ) and $T_{\nu+1}(\mu,{\varSigma})$ becomes small for large ν.
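The relative stopping criterion can be written as a small helper. This is a sketch with hypothetical function names; note that the ν-term divides by |log ν_r| and is therefore undefined for ν_r = 1.

```python
import numpy as np

def relative_change(nu_old, mu_old, Sigma_old, nu_new, mu_new, Sigma_new):
    """Left-hand side of the relative stopping criterion."""
    num = np.sqrt(np.sum((mu_new - mu_old) ** 2)
                  + np.linalg.norm(Sigma_new - Sigma_old, "fro") ** 2)
    den = np.sqrt(np.sum(mu_old ** 2) + np.linalg.norm(Sigma_old, "fro") ** 2)
    # sqrt((log nu_new - log nu_old)^2) is the absolute log-difference
    nu_term = abs(np.log(nu_new) - np.log(nu_old)) / abs(np.log(nu_old))
    return num / den + nu_term

def has_converged(old, new, tol=1e-5):
    """old and new are tuples (nu, mu, Sigma)."""
    return relative_change(*old, *new) < tol

old = (3.0, np.zeros(2), np.eye(2))
new = (3.0, np.array([1.0, 0.0]), np.eye(2))
print(relative_change(*old, *new))
```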

To quantify the performance of the algorithms, we count the number of iterations until the stopping criterion is reached. Since the inner loop of the GMMF is potentially time consuming, we additionally measure the execution time until the stopping criterion is reached. This experiment is repeated N = 10,000 times for different values of ν ∈{1,2,5,10}. Afterward we calculate the average number of iterations and the average execution times. The results are given in Tables 1 and 2. We observe that the performance of the algorithms depends on Σ. Further, the performance of the aEM algorithm is always better than that of the classical EM algorithm, and all algorithms need a longer time to estimate large ν. This seems natural, since the likelihood function becomes very flat for large ν. The GMMF needs the lowest number of iterations, but for small ν its execution time is larger than that of the MMF and the aEM algorithm. This can be explained by the fact that the ν step has a smaller relevance for small ν but is still time consuming in the GMMF. The MMF needs slightly more iterations than the GMMF, but if ν is not extremely large, its execution time is smaller than for the GMMF and for the aEM algorithm. In summary, we propose the MMF algorithm as the algorithm of choice.

Table 1 Average number of iterations (lowest in bold) and the corresponding standard deviations of the different algorithms

Table 2 The execution times (lowest in bold) and the corresponding standard deviations of the different algorithms

In Fig. 2 we show, as an example, the functional values L(ν_r,μ_r,Σ_r) of the four algorithms for samples generated with different values of ν and Σ = I. Note that the x-axis of the plots is in log-scale. We see that the convergence speed (in terms of number of iterations) of the EM algorithm is much slower than that of the MMF/GMMF. For small ν the convergence speed of the aEM algorithm is close to the GMMF/MMF, but for large ν it is close to the EM algorithm.

In Fig. 3 we show the histograms of the ν-output of 1000 runs for different values of ν and Σ = I. Since the ν-outputs of all algorithms are very close together, we only plot the output of the GMMF. We see that the accuracy of the estimation of ν decreases for increasing ν. This can be explained by the fact that the likelihood function becomes very flat for large ν, such that the estimation of ν becomes much harder.

6.2 Comparison with other accelerations of the EM algorithm

In this section, we compare our algorithms with the Expectation/Conditional Maximization Either (ECME) algorithm [19, 20] and apply the SQUAREM acceleration [33] as well as the damped Anderson Acceleration (DAAREM) [9] to our algorithms.

ECME algorithm:

The ECME algorithm was first proposed in [19]. Some numerical examples of the behavior of the ECME algorithm for estimating the parameters (ν, μ, Σ) of a Student t distribution T_ν(μ, Σ) are given in [20]. The idea of ECME is to replace the M-Step of the EM algorithm by the following update of the parameters (ν_r,μ_r,Σ_r): first, we fix ν = ν_r and compute the update $(\mu_{r+1},{\varSigma}_{r+1})$ of the parameters (μ_r,Σ_r) by performing one step of the EM algorithm for fixed degree of freedom (CM1-Step). Second, we fix (μ, Σ) = (μ_r,Σ_r) and compute the update $\nu_{r+1}$ of ν_r by maximizing the likelihood function with respect to ν (CM2-Step). The resulting algorithm is given in Algorithm 5. It is similar to the GMMF (Algorithm 4), but uses the Σ-update of the EM algorithm (Algorithm 1) instead of the Σ-update of the aEM algorithm (Algorithm 2). The authors of [19] showed a convergence result similar to that for the EM algorithm. Alternatively, we could prove Theorem 3 for the ECME algorithm analogously to the GMMF algorithm.

Next, we consider two acceleration schemes of arbitrary fixed point algorithms $\vartheta_{r+1} = G(\vartheta_{r})$. In our case $\vartheta \in \mathbb {R}^{p}$ is given by (ν, μ, Σ) and G is given by one step of Algorithm 1, 2, 3, 4, or 5.

SQUAREM Acceleration:

The first acceleration scheme, called squared iterative methods (SQUAREM), was proposed in [33]. The idea of SQUAREM is to update the parameters $\vartheta_{r} = (\nu_{r},\mu_{r},{\varSigma}_{r})$ in the following way: we compute $\vartheta_{r,1} = G(\vartheta_{r})$ and $\vartheta_{r,2} = G(\vartheta_{r,1})$. Then, we calculate $s = \vartheta_{r,1} - \vartheta_{r}$ and $v = (\vartheta_{r,2} - \vartheta_{r,1}) - s$. Now we set $\vartheta ^{\prime }=\vartheta _{r}-2\alpha s+\alpha ^{2} v$ and define the update $\vartheta _{r+1}=G(\vartheta ^{\prime })$, where α is chosen as follows. First, we set $\alpha =\min \limits (-\tfrac {\|s\|_{2}}{\|v\|_{2}},-1)$. Then we compute $\vartheta ^{\prime }$ as described before. If $L(\vartheta ^{\prime })<L(\vartheta _{r})$, we keep our choice of α. Otherwise we update α by $\alpha =\tfrac {\alpha -1}{2}$. Note that this scheme terminates as long as $\vartheta_{r}$ is not a critical point of L by the following argument: it holds that $\vartheta_{r} + 2s + v = \vartheta_{r,2}$, which implies that $\lim _{\alpha \to -1}L(\vartheta _{r}-2\alpha s+\alpha ^{2}v)=L(\vartheta _{r,2})\leq L(\vartheta _{r})$ with equality if and only if $\vartheta_{r}$ is a critical point of L, since all our algorithms have the property that L(𝜗) ≥ L(G(𝜗)) with equality if and only if 𝜗 is a critical point of L. By construction this scheme ensures that the negative log-likelihood values of the iterates are decreasing.
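The SQUAREM update described above can be sketched generically for a fixed point map G and an objective L acting on flat parameter vectors. The function name `squarem_step`, the cap on the number of backtracking steps, and the quadratic toy problem are assumptions for illustration; for the Student t setting, (ν, μ, Σ) would have to be packed into a vector, and the extrapolated point would have to stay in the admissible parameter set.

```python
import numpy as np

def squarem_step(theta, G, L, max_backtracks=60):
    """One SQUAREM update of theta for the fixed point map G and objective L."""
    theta1 = G(theta)
    theta2 = G(theta1)
    s = theta1 - theta
    v = (theta2 - theta1) - s
    norm_v = np.linalg.norm(v)
    if norm_v == 0.0:
        return theta2                      # no curvature information available
    alpha = min(-np.linalg.norm(s) / norm_v, -1.0)
    L_ref = L(theta)
    for _ in range(max_backtracks):
        theta_prime = theta - 2.0 * alpha * s + alpha ** 2 * v
        if L(theta_prime) < L_ref:
            return G(theta_prime)
        alpha = (alpha - 1.0) / 2.0        # move alpha towards -1
    return theta2                          # alpha = -1 reproduces theta_{r,2}

# Toy problem (illustrative assumption): gradient descent on a quadratic
A = np.diag([1.0, 10.0])
L = lambda th: 0.5 * th @ A @ th
G = lambda th: th - 0.05 * (A @ th)

theta = np.array([1.0, 1.0])
theta_new = squarem_step(theta, G, L)
print(L(theta), L(G(G(theta))), L(theta_new))
```

On this toy problem the accelerated step reduces the objective more than two plain applications of G, while the backtracking on α keeps the objective values decreasing.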

Damped Anderson Acceleration with Restarts and 𝜖-Monotonicity (DAAREM):

The DAAREM acceleration was proposed in [9]. It is based on the Anderson acceleration, which was introduced in [2]. As for the SQUAREM acceleration, we want to solve the fixed point equation 𝜗 = G(𝜗) with 𝜗 = (ν, μ, Σ) using the iteration $\vartheta_{r+1} = G(\vartheta_{r})$. We also use the equivalent formulation to solve f(𝜗) = 0, where f(𝜗) = G(𝜗) − 𝜗. For a fixed parameter $m\in \mathbb {N}_{>0}$, we define $m_{r}=\min \limits (m,r)$. Then, one update of 𝜗_r using the Anderson Acceleration is given by

$$ \begin{array}{@{}rcl@{}} \vartheta_{r+1}&=&G(\vartheta_{r})-\sum\limits_{j=1}^{m_{r}} (G(\vartheta_{r-m_{r}+j})-G(\vartheta_{r-m_{r}+j-1}))\gamma_{j}^{(r)}\\ &=&\vartheta_{r}+f(\vartheta_{r})-\sum\limits_{j=1}^{m_{r}} ((\vartheta_{r-m_{r}+j}-\vartheta_{r-m_{r}+j-1})+(f(\vartheta_{r-m_{r}+j})\\ &&\quad-f(\vartheta_{r-m_{r}+j-1})))\gamma_{j}^{(r)}, \end{array} \tag{14} $$

with $\gamma ^{(r)}=\left (\mathcal {F}_{r}^{\mathrm {T}}\mathcal {F}_{r}\right )^{-1}\mathcal {F}_{r}^{\mathrm {T}} f(\vartheta _{r})$, where the columns of $\mathcal {F}_{r}\in \mathbb {R}^{p\times m_{r}}$ are given by $f(\vartheta _{r-m_{r}+j+1})-f(\vartheta _{r-m_{r}+j})$ for j = 0,…,m_r − 1. An equivalent formulation of update step (14) is given by

$$ \begin{array}{@{}rcl@{}} \vartheta_{r+1}=\vartheta_{r}+f(\vartheta_{r})-(\mathcal{X}_{r}+\mathcal{F}_{r})\gamma^{(r)}, \end{array} $$

where the columns of $\mathcal {X}_{r}\in \mathbb {R}^{p\times m_{r}}$ are given by $\vartheta _{r-m_{r}+j+1}-\vartheta _{r-m_{r}+j}$ for j = 0,…,m_r − 1. The Anderson acceleration can be viewed as a special case of a multisecant quasi-Newton procedure to solve f(𝜗) = 0. For more details we refer to [7, 9].
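The matrix form of the Anderson update can be sketched with a least-squares solve, which yields the same coefficients as the normal-equation formula whenever the difference matrix of residuals has full column rank. The function name `anderson_step` and the affine test map are assumptions for illustration.

```python
import numpy as np

def anderson_step(thetas, G):
    """One Anderson update from the last m_r + 1 iterates (oldest first)."""
    f = [G(t) - t for t in thetas]
    theta_r, f_r = thetas[-1], f[-1]
    if len(thetas) == 1:
        return theta_r + f_r                  # plain fixed point step
    # Columns: differences of consecutive iterates and of their residuals
    X = np.column_stack([thetas[j + 1] - thetas[j] for j in range(len(thetas) - 1)])
    Fm = np.column_stack([f[j + 1] - f[j] for j in range(len(f) - 1)])
    gamma, *_ = np.linalg.lstsq(Fm, f_r, rcond=None)
    return theta_r + f_r - (X + Fm) @ gamma

# Affine test map (illustrative assumption): G(theta) = M theta + b
M = np.array([[0.5, 0.1], [0.0, 0.3]])
b = np.array([1.0, 1.0])
G = lambda th: M @ th + b

theta0 = np.zeros(2)
theta1 = G(theta0)
theta2 = G(theta1)
theta_new = anderson_step([theta0, theta1, theta2], G)
print(theta_new, G(theta_new))
```

For an affine map in two dimensions, two difference columns already determine the fixed point, so a single accelerated step lands on it; for nonlinear maps the step is only an approximation.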

The DAAREM acceleration modifies the Anderson acceleration in three points. The first modification is to restart the algorithm after m steps. That is, to set $m_{r}=\min \limits (m,c_{r})$ instead of $m_{r}=\min \limits (m,r)$, where c_r ∈{1,…,m} is defined by c_r = r mod m. The second modification is to add a damping term in the computation of the coefficients γ^(r). This means that γ^(r) is given by $\gamma ^{(r)}=(\mathcal {F}_{r}^{\mathrm {T}}\mathcal {F}_{r}+\lambda _{r} I)^{-1}\mathcal {F}_{r}^{\mathrm {T}} f(\vartheta _{r})$ instead of $\gamma ^{(r)}=(\mathcal {F}_{r}^{\mathrm {T}}\mathcal {F}_{r})^{-1}\mathcal {F}_{r}^{\mathrm {T}} f(\vartheta _{r})$. The parameter λ_r is chosen such that

$$ \begin{array}{@{}rcl@{}} \|(\mathcal{F}_{r}^{\mathrm{T}}\mathcal{F}_{r}+\lambda_{r} I)^{-1}\mathcal{F}_{r}^{\mathrm{T}} f(\vartheta_{r})\|_{2}^{2}=\delta_{r}\|(\mathcal{F}_{r}^{\mathrm{T}}\mathcal{F}_{r})^{-1}\mathcal{F}_{r}^{\mathrm{T}} f(\vartheta_{r})\|_{2}^{2} \end{array} \tag{15} $$

for some damping parameters δ_r. We initialize the δ_r by $\delta _{1}=\tfrac 1{1+\alpha ^{\kappa }}$ and decrease the exponent of α in each step by 1 up to a minimum of κ − D for some parameter $D\in \mathbb {N}_{>0}$. The third modification is to enforce that the negative log-likelihood function L does not increase by more than 𝜖 in one iteration step. To do this, we compute the update $\vartheta_{r+1}$ using the Anderson acceleration. If $L(\vartheta_{r+1}) > L(\vartheta_{r}) + \epsilon$, we use our original fixed point algorithm in this step, i.e., we set $\vartheta_{r+1} = G(\vartheta_{r})$.
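Condition (15) determines λ_r only implicitly. One simple way to solve it, used here purely as an assumption (the text does not state how λ_r is computed in [9]), is bisection: the norm of the damped coefficients decreases monotonically in λ ≥ 0. The sketch assumes a full-column-rank difference matrix and 0 < δ_r ≤ 1; the function name `damped_coefficients` is hypothetical.

```python
import numpy as np

def damped_coefficients(Fm, f_r, delta):
    """Find lambda >= 0 with ||gamma(lambda)||^2 = delta * ||gamma(0)||^2 by bisection."""
    FtF = Fm.T @ Fm
    Ftf = Fm.T @ f_r
    eye = np.eye(FtF.shape[0])
    gamma = lambda lam: np.linalg.solve(FtF + lam * eye, Ftf)
    target = delta * np.sum(gamma(0.0) ** 2)
    lo, hi = 0.0, 1.0
    while np.sum(gamma(hi) ** 2) > target:   # enlarge the bracket
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if np.sum(gamma(mid) ** 2) > target:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    return lam, gamma(lam)

rng = np.random.default_rng(1)
Fm = rng.standard_normal((6, 3))
f_r = rng.standard_normal(6)
lam, g = damped_coefficients(Fm, f_r, delta=0.5)
g0 = np.linalg.lstsq(Fm, f_r, rcond=None)[0]
print(lam, np.sum(g ** 2) / np.sum(g0 ** 2))
```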

We summarize the DAAREM acceleration in Algorithm 6. In our numerical experiments we use for the parameters the values suggested by [9], that is 𝜖 = 0.01, 𝜖_c = 0, α = 1.2, κ = 25, D = 2κ and $m=\min \limits (\lceil \tfrac {p}2\rceil ,10)$, where p is the number of parameters in 𝜗.

Simulation Study:

To compare the performance of all of these algorithms we again perform a Monte Carlo simulation. As in the previous section we draw n = 100 i.i.d. realizations of T_ν(μ, Σ) with μ = 0, Σ = 0.1Id and ν ∈{1,2,5,10,100}. Then, we use each of the Algorithms 1, 2, 3, 4 and 5 to compute the ML estimator $(\hat \nu ,\hat \mu ,\hat {\varSigma })$. We use each of these algorithms with no acceleration, with SQUAREM acceleration and with DAAREM acceleration.
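Realizations of T_ν(μ, Σ) can be generated, for instance, through the standard representation of the Student t distribution as a Gaussian scale mixture. The following sketch is not the code used for the experiments; the function name `sample_student_t` is hypothetical, and the dimension d = 2 in the usage comment is an assumption made only for illustration.

```python
import numpy as np

def sample_student_t(n, nu, mu, Sigma, rng):
    """Draw n samples of T_nu(mu, Sigma) via X = mu + Z / sqrt(U / nu),
    where Z ~ N(0, Sigma) and U ~ chi^2 with nu degrees of freedom."""
    mu = np.asarray(mu, dtype=float)
    d = mu.shape[0]
    A = np.linalg.cholesky(np.asarray(Sigma, dtype=float))  # Sigma = A A^T
    Z = rng.standard_normal((n, d)) @ A.T
    U = rng.chisquare(nu, size=n)
    return mu + Z / np.sqrt(U / nu)[:, None]

# Usage (d = 2 is an assumption of this example):
# rng = np.random.default_rng(0)
# X = sample_student_t(100, 5, np.zeros(2), 0.1 * np.eye(2), rng)
```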

We use the same initialization and stopping criteria as in the previous section and repeat this experiment N = 1000 times. To quantify the performance of the algorithms, we count the number of iterations and measure the execution time. The results are given in Tables 3 and 4. Since the DAAREM and SQUAREM accelerations were originally proposed for an absolute stopping criterion, we redo the experiments with the stopping criterion

$$ \sqrt{ \| \mu_{r+1} - \mu_{r} \|^{2} + \| {\varSigma}_{r+1} -{\varSigma}_{r} \|_{F}^{2} +(\log(\nu_{r+1})-\log(\nu_{r}))^{2}}<10^{-8}. $$

The results are given in Tables 5 and 6.
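For concreteness, the absolute stopping criterion can be written as a small helper. This is a minimal sketch; the function names are hypothetical, and only the tolerance 10^{-8} is taken from the text.

```python
import numpy as np

def absolute_change(mu_new, mu_old, Sigma_new, Sigma_old, nu_new, nu_old):
    """Combined change of (mu, Sigma, log nu) between two iterates."""
    return np.sqrt(
        np.sum((np.asarray(mu_new) - np.asarray(mu_old)) ** 2)
        + np.linalg.norm(np.asarray(Sigma_new) - np.asarray(Sigma_old), "fro") ** 2
        + (np.log(nu_new) - np.log(nu_old)) ** 2
    )

def has_converged(mu_new, mu_old, Sigma_new, Sigma_old, nu_new, nu_old, tol=1e-8):
    """True if the absolute change falls below the tolerance."""
    return absolute_change(mu_new, mu_old, Sigma_new, Sigma_old, nu_new, nu_old) < tol
```

Measuring the change of ν on a logarithmic scale keeps the criterion meaningful for both small and very large degrees of freedom.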

Table 3 Average number of iterations (lowest in bold) and the corresponding standard deviations of the different algorithms using a relative stopping criterion


Table 4 The execution times (lowest in bold) and the corresponding standard deviations of the different algorithms using a relative stopping criterion


Table 5 Average number of iterations (lowest in bold) and the corresponding standard deviations of the different algorithms using an absolute stopping criterion


Table 6 The execution times (lowest in bold) and the corresponding standard deviations of the different algorithms using an absolute stopping criterion


We observe that for nearly any choice of the parameters the performance of the GMMF is better than the performance of the ECME. For small ν, the performance of the SQUAREM-aEM is also very good. On the other hand, for large ν the SQUAREM-GMMF behaves very well. Further, for any choice of ν the performance of the SQUAREM-MMF is close to the best algorithm.

6.3 Unsupervised estimation of noise parameters

Next, we provide an application in image analysis. To this end, we consider images corrupted by one-dimensional Student t noise with μ = 0 and unknown Σ ≡ σ² and ν. We provide a method that allows us to estimate ν and σ in an unsupervised way. The basic idea is to consider constant areas of an image, where the signal to noise ratio is weak and differences between pixel values are solely caused by the noise.

Constant area detection:

In order to detect constant regions in an image, we adopt an idea presented in [30]. It is based on Kendall’s τ-coefficient, which is a measure of rank correlation, and the associated z-score, see [10, 11]. In the following, we briefly summarize the main ideas behind this approach. For finding constant regions we proceed as follows: First, the image grid $\mathcal {G}$ is partitioned into K small, non-overlapping regions $\mathcal {G}= \bigcup _{k=1}^{K} R_{k}$, and for each region we consider the hypothesis testing problem

$$ \mathcal{H}_{0}\colon R_{k}\text{ is constant}\qquad \text{vs.}\qquad \mathcal{H}_{1}\colon R_{k}\text{ is not constant} . $$

To decide whether to reject $\mathcal{H}_{0}$ or not, we observe the following: Consider a fixed region R_k and let $I, J\subseteq R_{k}$ be two disjoint subsets of R_k with the same cardinality. Denote by u_I and u_J the vectors containing the values of u at the positions indexed by I and J. Then, under $\mathcal{H}_{0}$, the vectors u_I and u_J are uncorrelated (in fact even independent) for all choices of $I, J\subseteq R_{k}$ with I ∩ J = ∅ and |I| = |J|. As a consequence, the rejection of $\mathcal{H}_{0}$ can be reformulated as the question whether we can find I, J such that u_I and u_J are significantly correlated, since in this case there has to be some structure in the image region R_k and it cannot be constant. To quantify the correlation, we follow [30] and use Kendall’s τ-coefficient together with the associated z-score, see [10, 11]. The key idea is to focus on the rank (i.e., on the relative order) of the values rather than on the values themselves. In this vein, a block is considered homogeneous if the ranking of the pixel values is uniformly distributed, regardless of the spatial arrangement of the pixels. In the following, we assume that we have extracted two disjoint subsequences x = u_I and y = u_J from a region R_k with I and J as above. Let (x_i,y_i) and (x_j,y_j) be two pairs of observations. Then, the pairs are said to be

$$ \left\{\begin{array}{ll} \text{concordant} & \text{if } x_{i}<x_{j} \text{ and } y_{i}<y_{j}\\& \text{or } x_{i}>x_{j} \text{ and } y_{i}>y_{j},\\ \text{discordant} & \text{if } x_{i}<x_{j} \text{ and } y_{i}>y_{j}\\& \text{or } x_{i}>x_{j} \text{ and } y_{i}<y_{j},\\ \text{tied} & \text{if } x_{i}=x_{j} \text{ or } y_{i}=y_{j}. \end{array} \right. $$

Next, let $x,y\in \mathbb {R}^{n}$ be two sequences without tied pairs and let n_c and n_d be the number of concordant and discordant pairs, respectively. Then, Kendall’s τ coefficient [10] is defined as $\tau \colon \mathbb {R}^{n}\times \mathbb {R}^{n}\to [-1,1]$,

$$ \tau(x,y) = \frac{n_{c} - n_{d}}{\frac{n(n-1)}{2}}. $$

From this definition we see that if the agreement between the two rankings is perfect, i.e., the two rankings are the same, then the coefficient attains its maximal value 1. On the other extreme, if the disagreement between the two rankings is perfect, that is, one ranking is the reverse of the other, then the coefficient has value −1. If the sequences x and y are uncorrelated, we expect the coefficient to be approximately zero. Denoting by X and Y the underlying random variables that generated the sequences x and y, we have the following result, whose proof can be found in [10].

Theorem 5

Let X and Y be two arbitrary sequences under $\mathcal{H}_{0}$ without tied pairs. Then, the random variable τ(X, Y) has an expected value of 0 and a variance of $\frac {2(2n+5)}{9n(n-1)}$. Moreover, for $n\to \infty $, the associated z-score $z\colon \mathbb {R}^{n}\times \mathbb {R}^{n}\to \mathbb {R}$,

$$ z(x,y) = \frac{3\sqrt{n(n-1)}}{\sqrt{2(2n+5)}}\tau(x,y)=\frac{3\sqrt{2}(n_{c} - n_{d})}{\sqrt{n(n-1)(2n+5)}} $$

is asymptotically standard normal distributed,

$$ z(X,Y)\overset{n\to \infty}{\sim}\mathcal{N}(0,1). $$
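The definitions of τ and z translate directly into code. The following sketch counts concordant and discordant pairs by brute force, which needs O(n²) operations and is meant for illustration only. The function names are hypothetical, and the two-sided decision rule in `reject_h0` is an assumption of this sketch; the text only states that quantiles of the standard normal distribution are used.

```python
from itertools import combinations
from math import sqrt
from statistics import NormalDist

def kendall_tau_z(x, y):
    """Return (tau, z) for two equally long sequences without tied pairs."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length n >= 2")
    n_c = n_d = 0
    for i, j in combinations(range(n), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            n_c += 1          # concordant pair
        elif s < 0:
            n_d += 1          # discordant pair
        else:
            raise ValueError("tied pairs are not handled in this sketch")
    tau = (n_c - n_d) / (n * (n - 1) / 2)
    z = 3 * sqrt(n * (n - 1)) / sqrt(2 * (2 * n + 5)) * tau
    return tau, z

def reject_h0(z, alpha):
    """Two-sided test (assumption): reject if |z| exceeds the (1 - alpha/2)-quantile."""
    return abs(z) > NormalDist().inv_cdf(1 - alpha / 2)
```

Since the normal approximation is asymptotic, such a test is only meaningful when the sequences are reasonably long.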

With a slight adaptation, Kendall’s τ coefficient can be generalized to sequences with tied pairs (see [11]). As a consequence of Theorem 5, for a given significance level α ∈ (0,1), we can use the quantiles of the standard normal distribution to decide whether to reject $\mathcal{H}_{0}$ or not. In practice, we cannot test every kind of region and every kind of disjoint sequences. As in [30], we restrict our attention to quadratic regions and pairwise comparisons of neighboring pixels. We use four kinds of neighboring relations (horizontal, vertical and two diagonal neighbors) and thus perform four tests in total. We reject the hypothesis $\mathcal{H}_{0}$ that the region is constant as soon as one of the four tests rejects it. Note that by doing so, the final significance level is smaller than the initially chosen one. We start with blocks of size 64 × 64 whose side-length is incrementally decreased until enough constant areas are found.

Parameter estimation.

In each constant region we consider the pixel values in the region as i.i.d. samples of a univariate Student t distribution T_ν(μ, σ²), where we estimate the parameters using Algorithm 3.

After estimating the parameters in each found constant region, the estimated location parameters μ are discarded, while the estimated scale and degrees of freedom parameters σ and ν, respectively, are averaged to obtain the final estimate of the global noise parameters. At this point, as both ν and σ influence the resulting distribution in a multiplicative way, instead of an arithmetic mean, one might use a geometric mean, which is slightly less affected by outliers.
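The two ways of aggregating the per-region estimates can be compared with a small helper. This is an illustrative sketch with a hypothetical function name; the numbers in the usage comment are made up for illustration and are not results from the experiments.

```python
import numpy as np

def aggregate(estimates, geometric=False):
    """Average positive per-region estimates (e.g. of nu or sigma)."""
    est = np.asarray(estimates, dtype=float)
    if geometric:
        # Geometric mean, computed as the exponential of the mean logarithm.
        return float(np.exp(np.mean(np.log(est))))
    return float(np.mean(est))

# Made-up example with one outlying region:
# aggregate([1.0, 1.0, 100.0])                  -> arithmetic mean 34.0
# aggregate([1.0, 1.0, 100.0], geometric=True)  -> about 4.64
```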

In Fig. 4 we illustrate this procedure for two different noise scenarios. The left column in each figure depicts the detected constant areas. The middle and right columns show histograms of the estimated values for ν and σ, respectively. For the constant area detection we use the code of [30]^{Footnote 3}. The true parameters used to generate the noisy images were ν = 1 and σ = 10 for the top row and ν = 5 and σ = 10 for the bottom row, while the obtained estimates are (geometric mean in brackets) $\hat {\nu } = 1.0437$ (1.0291) and $\hat {\sigma }= 10.3845$ (10.3111) for the top row and $\hat {\nu }= 5.4140$ (5.0423) and $\hat {\sigma }=10.5500$ (10.1897) for the bottom row.

A further example is given in Fig. 5. Here, the obtained estimates are (geometric mean in brackets) $\hat {\nu } = 1.0075$ (0.99799) and $\hat {\sigma }= 10.2969$ (10.1508) for the top row and $\hat {\nu }= 5.4184$ (5.1255) and $\hat {\sigma }=10.2295$ (10.1669) for the bottom row.

Change history

15 July 2021
A Correction to this paper has been published: https://doi.org/10.1007/s11075-021-01156-z

Notes

References

[1] Abramowitz, M., Stegun, I.A.: Handbook of mathematical functions: with formulas, graphs, and mathematical tables, volume 55. Courier Corporation (1965)
[2] Anderson, D.G.: Iterative procedures for nonlinear integral equations. J. Assoc. Comput. Mach. 12, 547–560 (1965)
[3] Antoniadis, A., Leporini, D., Pesquet, J.-C.: Wavelet thresholding for some classes of non-Gaussian noise. Statis. Neerlandica 56(4), 434–453 (2002)
[4] Banerjee, A., Maji, P.: Spatially constrained Student’s t-distribution based mixture model for robust image segmentation. J. Mathe. Imag. Vision 60(3), 355–381 (2018)
[5] Byrne, C.L.: The EM algorithm: theory, applications and related methods. Lecture Notes, University of Massachusetts (2017)
[6] Ding, M., Huang, T., Wang, S., Mei, J., Zhao, X.: Total variation with overlapping group sparsity for deblurring images under Cauchy noise. Appl. Math. Comput. 341, 128–147 (2019)
[7] Fang, H.-R., Saad, Y.: Two classes of multisecant methods for nonlinear acceleration. Numer. Linear Algebra Appli. 16(3), 197–221 (2009)
[8] Gerogiannis, D., Nikou, C., Likas, A.: The mixtures of Student’s t-distributions as a robust framework for rigid registration. Image Vis. Comput. 27(9), 1285–1294 (2009)
[9] Henderson, N.C., Varadhan, R.: Damped Anderson acceleration with restarts and monotonicity control for accelerating EM and EM-like algorithms. J. Comput. Graph. Stat. 28(4), 834–846 (2019)
[10] Kendall, M.G.: A new measure of rank correlation. Biometrika 30(1/2), 81–93 (1938)
[11] Kendall, M.G.: The treatment of ties in ranking problems. Biometrika 239–251 (1945)
[12] Kent, J.T., Tyler, D.E., Vardi, Y.: A curious likelihood identity for the multivariate t-distribution. Communications in Statistics-Simulation and Computation 23(2), 441–453 (1994)
[13] Lange, K.L., Little, R.J., Taylor, J.M.: Robust statistical modeling using the t distribution. J. Am. Stat. Assoc. 84(408), 881–896 (1989)
[14] Lanza, A., Morigi, S., Sciacchitano, F., Sgallari, F.: Whiteness constraints in a unified variational framework for image restoration. J. Mathe. Imag. Vision 60(9), 1503–1526 (2018)
[15] Laus, F.: Statistical Analysis and Optimal Transport for Euclidean and Manifold-Valued Data. PhD Thesis, TU Kaiserslautern (2020)
[16] Laus, F., Pierre, F., Steidl, G.: Nonlocal myriad filters for Cauchy noise removal. J. Math. Imag. Vision 60(8), 1324–1354 (2018)
[17] Laus, F., Steidl, G.: Multivariate myriad filters based on parameter estimation of student-t distributions. SIAM J Imaging Sci 12(4), 1864–1904 (2019)
[18] Lebrun, M., Buades, A., Morel, J.-M.: A nonlocal Bayesian image denoising algorithm. SIAM J. Imag. Sci. 6(3), 1665–1688 (2013)
[19] Liu, C., Rubin, D.B.: The ECME algorithm: a simple extension of EM and ECM with faster monotone convergence. Biometrika 81(4), 633–648 (1994)
[20] Liu, C., Rubin, D.B.: ML estimation of the t distribution using EM and its extensions, ECM and ECME. Stat. Sin. 5(1), 19–39 (1995)
[21] McLachlan, G., Krishnan, T.: The EM algorithm and extensions. John Wiley and Sons Inc (1997)
[22] McLachlan, G., Peel, D.: Robust cluster analysis via mixtures of multivariate t-distributions. Volume 1451 of Lecture Notes in Computer Science. Springer, New York (1998)
[23] Mei, J.-J., Dong, Y., Huang, T.-Z., Yin, W.: Cauchy noise removal by nonconvex ADMM with convergence guarantees. J. Sci. Comput. 74(2), 743–766 (2018)
[24] Meng, X.-L., Van Dyk, D.: The EM algorithm - an old folk-song sung to a fast new tune. J. Royal Statis. Soc. Series B (Statis. Methodol.) 59(3), 511–567 (1997)
[25] Nguyen, T.M., Wu, Q.J.: Robust Student’s-t mixture model with spatial constraints and its application in medical image segmentation. IEEE Trans. Med. Imaging 31(1), 103–116 (2012)
[26] Peel, D., McLachlan, G.J.: Robust mixture modelling using the t distribution. Stat. Comput. 10(4), 339–348 (2000)
[27] Petersen, K.B., Pedersen, M.S.: The Matrix Cookbook. Technical University of Denmark, Lecture Notes (2008)
[28] Sciacchitano, F., Dong, Y., Zeng, T.: Variational approach for restoring blurred images with Cauchy noise. SIAM J. Imag. Sci. 8(3), 1894–1922 (2015)
[29] Sfikas, G., Nikou, C., Galatsanos, N.: Robust image segmentation with mixtures of Student’s t-distributions. In: 2007 IEEE International Conference on Image Processing, volume 1, pages I-273–I-276 (2007)
[30] Sutour, C., Deledalle, C.-A., Aujol, J.-F.: Estimation of the noise level function based on a nonparametric detection of homogeneous image regions. SIAM J. Imag. Sci. 8(4), 2622–2661 (2015)
[31] Van Den Oord, A., Schrauwen, B.: The Student-t mixture as a natural image patch prior with application to image compression. J. Mach. Learn. Res. 15(1), 2061–2086 (2014)
[32] Van Dyk, D.A.: Construction, Implementation, and Theory of Algorithms Based on Data Augmentation and Model Reduction. The University of Chicago, PhD Thesis (1995)
[33] Varadhan, R., Roland, C.: Simple and globally convergent methods for accelerating the convergence of any EM algorithm. Scandinavian. J. Statis. Theory Appli 35(2), 335–353 (2008)
[34] Yang, Z., Yang, Z., Gui, G.: A convex constraint variational method for restoring blurred images in the presence of alpha-stable noises. Sensors 18(4), 1175 (2018)
[35] Zhou, Z., Zheng, J., Dai, Y., Zhou, Z., Chen, S.: Robust non-rigid point set registration using Student’s-t mixture model. PloS one 9(3), e91381 (2014)

Acknowledgments

The authors want to thank the anonymous referees for bringing certain accelerations of the EM algorithm to our attention.

Funding

Open Access funding enabled and organized by Projekt DEAL. This work received funding from the German Research Foundation (DFG) within the project STE 571/16-1.

Author information

Authors and Affiliations

Institute of Mathematics, Technische Universität Berlin, Straße des 17. Juni 136, 10623, Berlin, Germany
Marzieh Hasannasab, Johannes Hertrich & Gabriele Steidl
Technische Universität Kaiserslautern, Paul-Ehrlich-Str. 31, 67663, Kaiserslautern, Germany
Friederike Laus


Corresponding author

Correspondence to Marzieh Hasannasab.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

The original online version of this article was revised due to a retrospective Open Access order.

Appendix. Auxiliary lemmas

Lemma 4

Let $x_{i} \in \mathbb R^{d}$, i = 1,…,n and $w \in \mathring{\Delta}_{n}$ fulfill Assumption 1. Let (ν_r,Σ_r)_r be a sequence in $\mathbb {R}_{>0} \times \text {SPD} (d)$ with $\nu _{r} \rightarrow 0$ as $r\rightarrow \infty $ (or such that (ν_r)_r has a subsequence which converges to zero). Then (ν_r,Σ_r)_r cannot be a minimizing sequence of L(ν, Σ).

Proof

We write

$$ L(\nu,{\varSigma})=g(\nu)+L_{\nu}({\varSigma}), $$

where

$$ g(\nu)=2\log\left( {\Gamma}\left( \frac{\nu}{2}\right)\right)-2\log\left( {\Gamma}\left( \frac{d+\nu}{2}\right)\right)-\nu\log(\nu). $$

Then it holds $\lim _{\nu \to 0}g(\nu )=\infty $. Hence it is sufficient to show that (ν_r,Σ_r)_r has a subsequence $(\nu _{r_{k}},{\varSigma }_{r_{k}})_{k}$ such that $\left (L_{\nu _{r_{k}}}({\varSigma }_{r_{k}}) \right )_{k}$ is bounded from below. Denote by λ_{r,1} ≥ … ≥ λ_{r,d} the eigenvalues of Σ_r.

Case 1: Let $\{\lambda _{r,i}:r\in \mathbb N,i=1,\ldots ,d\}\subseteq [a,b]$ for some $0<a\leq b<\infty $. Then it holds $\liminf _{r\to \infty }\log \left | {\varSigma }_{r} \right |\geq \log (a^{d})=d\log (a)$ and

$$ \begin{array}{@{}rcl@{}} \underset{r\to\infty}{\liminf}(d+\nu_{r})\sum\limits_{i=1}^{n}w_{i}\log(\nu_{r}+x_{i}^{\mathrm{T}}{\varSigma}_{r}^{-1}x_{i}) &\geq&\underset{r\to\infty}{\lim}(d+\nu_{r})\sum\limits_{i=1}^{n}w_{i}\log\left( \frac1b x_{i}^{\mathrm{T}} x_{i}\right)\\ &=&d\sum\limits_{i=1}^{n}w_{i}\log\left( \frac1b x_{i}^{\mathrm{T}} x_{i}\right). \end{array} $$

Note that Assumption 1 ensures x_i ≠ 0 and $x_{i}^{\mathrm {T}} x_{i}>0$ for i = 1,…,n. Then we get

$$ \begin{array}{@{}rcl@{}} \underset{r\to\infty}{\liminf}L_{\nu_{r}}({\varSigma}_{r}) &=&\underset{r\to\infty}{\liminf}(d+\nu_{r})\sum\limits_{i=1}^{n}w_{i}\log(\nu_{r}+x_{i}^{\mathrm{T}} {\varSigma}_{r}^{-1}x_{i})+\log\left| {\varSigma}_{r} \right|\\ &\geq& d\sum\limits_{i=1}^{n}w_{i}\log\left( \frac1b x_{i}^{\mathrm{T}} x_{i}\right)+d\log(a). \end{array} $$

Hence $(L_{\nu _{r}}({\varSigma }_{r}))_{r}$ is bounded from below and (ν_r,Σ_r) cannot be a minimizing sequence.

Case 2: Let $\{\lambda _{r,i}:r\in \mathbb N,i=1,\ldots ,d\}\not \subseteq [a,b]$ for all $0<a\leq b<\infty $. Define ρ_r = ∥Σ_r∥_F and $P_{r}=\frac {{\varSigma }_{r}}{\rho _{r}}$. Then, by concavity of the logarithm, it holds

$$ \begin{array}{@{}rcl@{}} L_{\nu_{r}}({\varSigma}_{r}) &=&(d+\nu_{r})\sum\limits_{i=1}^{n}w_{i}\log(\nu_{r}+x_{i}^{\mathrm{T}}{\varSigma}_{r}^{-1}x_{i})+\log(\left| {\varSigma}_{r} \right|)\\ &\geq& d\sum\limits_{i=1}^{n}w_{i}\log(x_{i}^{\mathrm{T}} {\varSigma}_{r}^{-1}x_{i})+ \nu_{r}\log(\nu_{r})+\log(\left| {\varSigma}_{r} \right|)\\ &\geq& d\sum\limits_{i=1}^{n}w_{i}\log\left( \frac1{\rho_{r}}x_{i}^{\mathrm{T}} P_{r}^{-1}x_{i}\right)+\log({\rho_{r}^{d}} \left| P_{r} \right|)+\text{const}\\ &=& \underbrace{d\sum\limits_{i=1}^{n}w_{i}\log(x_{i}^{\mathrm{T}} P_{r}^{-1}x_{i})+\log(\left| P_{r} \right|)}_{=: L_{0}(P_{r})}+\text{const} . \end{array} $$

(16)

Denote by p_{r,1} ≥ … ≥ p_{r,d} > 0 the eigenvalues of P_r. Since $\{P_{r}:r\in \mathbb N\}$ is bounded there exists some C > 0 with C ≥ p_{r,1} for all $r\in \mathbb N$. Thus one of the following cases is fulfilled:

i)
There exists a constant c > 0 such that p_{r, d} > c for all $r\in \mathbb N$.
ii)
There exists a subsequence $(P_{r_{k}})_{k}$ of (P_r)_r which converges to some P ∈ ∂SPD(d).

Case 2i) Let c > 0 with p_{r, d} ≥ c for all $r\in \mathbb N$. Then $\underset {r\to \infty }{\liminf } \log (\left | P_{r} \right |) \geq \log (c^{d})=d\log (c)$ and

$$ \underset{r\to\infty}{\liminf}d\sum\limits_{i=1}^{n}w_{i}\log(x_{i}^{\mathrm{T}} P_{r}^{-1} x_{i})\geq d\sum\limits_{i=1}^{n}w_{i}\log\left( \frac1C x_{i}^{\mathrm{T}} x_{i}\right). $$

By (16) this yields

$$ \begin{array}{@{}rcl@{}} \underset{r\to\infty}{\liminf}L_{\nu_{r}}({\varSigma}_{r}) &\geq& \underset{r\to\infty}{\liminf}d\sum\limits_{i=1}^{n} w_{i}\log(x_{i}^{\mathrm{T}} P_{r}^{-1}x_{i})+\log(\left| P_{r} \right|)+\text{const}\\ &\geq& d\sum\limits_{i=1}^{n} w_{i}\log\left( \frac1C x_{i}^{\mathrm{T}} x_{i}\right)+d\log(c)+\text{const}. \end{array} $$

Hence, $(L_{\nu _{r}}({\varSigma }_{r}))_{r}$ is bounded from below and (ν_r,Σ_r) cannot be a minimizing sequence.

Case 2ii) We use similar arguments as in the proof of [17, Theorem 4.3]. Let $(P_{r_{k}})_{k}$ be a subsequence of (P_r)_r which converges to some P ∈ ∂SPD(d). For simplicity we denote $(P_{r_{k}})_{k}$ again by (P_r)_r. Let p₁ ≥… ≥ p_d ≥ 0 be the eigenvalues of P. Since $\|P\|_{F}=\lim \limits _{r\to \infty } \| P_{r}\|_{F}=1$ it holds p₁ > 0. Let q ∈ {1,…,d − 1} such that

$$p_{1}\geq\ldots\geq p_{q}>p_{q+1}=\ldots=p_{d}=0.$$

By e_{r,1},…,e_{r,d} we denote the orthonormal eigenvectors corresponding to p_{r,1},…,p_{r,d}. Since $(\mathbb S^{d})^{d}$ is compact we can assume (by going over to a subsequence) that (e_{r,1},…,e_{r,d})_r converges to orthonormal vectors (e₁,…,e_d). Define S₀ := {0} and for k = 1,…,d set S_k := span{e₁,…,e_k}. Now, for k = 1,…,d define

$$ W_{k}:= S_{k}\backslash S_{k-1}=\{y\in\mathbb{R}^{d}:\left\langle y, e_{k} \right\rangle\neq 0, \left\langle y, e_{l} \right\rangle=0 \text{ for }l=k+1,\ldots,d\}. $$

Further, let

$$ \tilde I_{k}:=\{i\in\{1,\ldots,n\}:x_{i}\in S_{k}\}\quad\text{and}\quad I_{k}:=\{i\in\{1,\ldots,n\}:x_{i}\in W_{k}\}. $$

Because of $S_{k}=W_{k}\dot \cup S_{k-1}$ we have $\tilde I_{k}=I_{k}\dot \cup \tilde I_{k-1}$ for k = 1,…,d. Due to Assumption 1 we have $\left | I_{k} \right |\leq \left | \tilde I_{k} \right |\leq \dim (S_{k})=k$ for k = 1,…,d − 1. Defining for j = 1,…,d,

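$$ L_{j}(P_{r}):= d\underset{i\in I_{j}}{\sum} w_{i}\log(x_{i}^{\mathrm{T}} P_{r}^{-1} x_{i})+\log(p_{r,j}), $$

it holds $L_{0}(P_{r})={\sum }_{j=1}^{d} L_{j}(P_{r})$, since the sets I₁,…,I_d are pairwise disjoint with union {1,…,n} and $\log (\left | P_{r} \right |)={\sum }_{j=1}^{d}\log (p_{r,j})$. For j ≤ q we get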

$$ \underset{r\to\infty}{\liminf} L_{j}(P_{r}) \geq \underset{r\to\infty}{\liminf}d\underset{i\in I_{j}}{\sum} w_{i}\log\left( \frac1C x_{i}^{\mathrm{T}} x_{i}\right)+\log(p_{r,j}) =d\underset{i\in I_{j}}{\sum}w_{i}\log\left( \frac1C x_{i}^{\mathrm{T}} x_{i}\right)+\log(p_{j}). $$

Since for k ∈{1,…,d} and i ∈ I_k,

$$ x_{i}^{\mathrm{T}} P_{r}^{-1} x_{i}=\sum\limits_{j=1}^{d}\frac1{p_{r,j}}\left\langle x_{i}, e_{r,j} \right\rangle^{2}\geq\frac1{p_{r,k}}\left\langle x_{i}, e_{r,k} \right\rangle^{2}, $$

and $\lim _{r\to \infty }\left \langle x_{i}, e_{r,k} \right \rangle = \left \langle x_{i}, e_{k} \right \rangle \neq 0$, we obtain

$$ \underset{r\to\infty}{\liminf}\,p_{r,k}x_{i}^{\mathrm{T}} P_{r}^{-1} x_{i} \geq\underset{r\to\infty}{\lim}\left\langle x_{i}, e_{r,k} \right\rangle^{2} = \left\langle x_{i}, e_{k} \right\rangle^{2}>0. $$

Hence, it holds for j ≥ q + 1 that

$$ \begin{array}{@{}rcl@{}} L_{j}(P_{r}) &=& d\underset{i\in I_{j}}{\sum}w_{i}\left[\log(x_{i}^{\mathrm{T}} P_{r}^{-1}x_{i})+\log(p_{r,j})\right]+\left( 1-d\underset{i\in I_{j}}{\sum}w_{i}\right)\log(p_{r,j})\\ &=& d\underset{i\in I_{j}}{\sum}w_{i}\log(p_{r,j}x_{i}^{\mathrm{T}} P_{r}^{-1}x_{i})+\left( 1-d\underset{i\in I_{j}}{\sum}w_{i}\right)\log(p_{r,j}). \end{array} $$

Thus, we conclude

$$ \begin{array}{@{}rcl@{}} \underset{r\to\infty}{\liminf}L_{0}(P_{r}) &=& \underset{r\to\infty}{\liminf}\sum\limits_{j=1}^{d} L_{j}(P_{r})\geq\sum\limits_{j=1}^{q}\underset{r\to\infty}{\liminf}L_{j}(P_{r}) + \underset{r\to\infty}{\liminf}\sum\limits_{j=q+1}^{d} L_{j}(P_{r})\\ &\geq& \sum\limits_{j=1}^{q} \left( d\sum\limits_{i\in I_{j}}w_{i}\log\left( \frac1Cx_{i}^{\mathrm{T}} x_{i}\right)+\log(p_{j})\right)+\underset{r\to\infty}{\liminf}\sum\limits_{j=q+1}^{d} d\underset{i\in I_{j}}{\sum}w_{i}\log(p_{r,j}x_{i}^{\mathrm{T}} P_{r}^{-1}x_{i})\\ && +\underset{r\to\infty}{\liminf}\sum\limits_{j=q+1}^{d}\left( 1-d\underset{i\in I_{j}}{\sum}w_{i}\right)\log(p_{r,j})\\ &\geq& \sum\limits_{j=1}^{q}\left( d\underset{i\in I_{j}}{\sum}w_{i}\log\left( \frac1Cx_{i}^{\mathrm{T}} x_{i}\right)+\log(p_{j})\right) +\sum\limits_{j=q+1}^{d} d\underset{i\in I_{j} }{\sum}w_{i}\log\left( \left\langle x_{i}, e_{j} \right\rangle^{2}\right)\\ & & +\underset{r\to\infty}{\liminf}\sum\limits_{j=q+1}^{d} \left( 1-d\underset{i\in I_{j} }{\sum}w_{i}\right)\log(p_{r,j})\\ &=& \text{const}+\underset{r\to\infty}{\liminf}\sum\limits_{j=q+1}^{d}\left( 1-d\underset{i\in I_{j}}{\sum}w_{i}\right)\log(p_{r,j}). \end{array} $$

It remains to show that there exists $\tilde c > 0$ such that

$$ \underset{r\to\infty}{\liminf}\sum\limits_{j=q+1}^{d}\left( 1-d\underset{i\in I_{j}}{\sum}w_{i}\right)\log(p_{r,j}) \ge \tilde c. $$

(17)

We prove for k ≥ q + 1 by induction that for sufficiently large $r\in \mathbb N$ it holds

$$ \sum\limits_{j=k}^{d} \left( 1-d\underset{i\in I_{j}}{\sum}w_{i} \right) \log(p_{r,j})\geq \left( d \underset{i\in\tilde I_{k-1}}{\sum} w_{i} -(k-1)\right)\log(p_{r,k}). $$

(18)

Induction basis k = d: Since $\tilde I_{k}=I_{k}\cup \tilde I_{k-1}$ we have

$$ \underset{i\in\tilde I_{k}}{\sum} w_{i}-\underset{i\in\tilde I_{k-1}}{\sum}w_{i}=\underset{i\in I_{k}}{\sum}w_{i}, $$

and further

$$ 1-d\underset{i\in I_{d}}{\sum} w_{i} = 1 - d\left( \underset{i\in\tilde I_{d}}{\sum} w_{i} - \underset{i\in\tilde I_{d-1}}{\sum} w_{i}\right) = 1-d \left( 1 - \underset{i\in\tilde I_{d-1}}{\sum} w_{i}\right) = d\underset{i\in\tilde I_{d-1}}{\sum} w_{i}-(d-1). $$

If we multiply both sides with $\log (p_{r,d})$ this yields (18) for k = d. Induction step: Assume that (18) holds for some k + 1 with d ≥ k + 1 > q + 1, i.e.,

$$ \sum\limits_{j=k+1}^{d}\left( 1-d\underset{i\in I_{j}}{\sum}w_{i}\right)\log(p_{r,j})\geq d\left( \underset{i\in\tilde I_{k}}{\sum}w_{i}-\frac{k}{d}\right)\log(p_{r,k+1}). $$

Then we obtain

$$ \begin{array}{@{}rcl@{}} &&\sum\limits_{j=k}^{d} \left( 1-d\underset{i\in I_{j}}{\sum}w_{i}\right)\log(p_{r,j})\\ &=&\sum\limits_{j=k+1}^{d}\left( 1-d\underset{i\in I_{j}}{\sum}w_{i}\right)\log(p_{r,j})+\left( 1-d\underset{i\in I_{k}}{\sum} w_{i}\right)\log(p_{r,k})\\ &\geq& d\left( \underset{i\in\tilde I_{k}}{\sum}w_{i}-\frac{k}{d}\right)\log(p_{r,k+1})+\left( 1-d\underset{i\in I_{k}}{\sum} w_{i}\right)\log(p_{r,k}). \end{array} $$

and since ${\sum }_{i\in \tilde I_{k}}w_{i}<\left | \tilde I_{k} \right |\frac 1d\leq \frac {k}{d}$ by Assumption 1 and p_{r, k+ 1} ≤ p_{r, k} < 1 finally

$$ \begin{array}{@{}rcl@{}} &\geq& d \left( \underset{i\in\tilde I_{k}}{\sum}w_{i} - \frac{k}{d}\right) \log(p_{r,k})+\left( 1-d\underset{i\in I_{k}}{\sum} w_{i}\right) \log(p_{r,k})\\ &=&\left( d\underset{i\in\tilde I_{k-1}}{\sum}w_{i}-(k-1)\right)\log(p_{r,k}). \end{array} $$

This shows (18) for k ≥ q + 1. Using (18) with k = q + 1, we get for the left-hand side of (17)

$$ \underset{r\to\infty}{\liminf}\sum\limits_{j=q+1}^{d}\left( 1-d\underset{i\in I_{j}}{\sum}w_{i}\right)\log(p_{r,j}) \geq \underset{r\to\infty}{\liminf}\underbrace{\left( d\underset{i\in\tilde I_{q}}{\sum}w_{i}-q\right)}_{<0}\underbrace{\log(p_{r,q+1})}_{\text{bounded from above}}>-\infty. $$

This finishes the proof. □

Lemma 5

Let (ν_r,Σ_r)_r be a sequence in $\mathbb {R}_{>0}\times \text {SPD}(d)$ such that there exists $\nu _{-}\in \mathbb {R}_{>0}$ with ν₋ ≤ ν_r for all $r\in \mathbb N$. Denote by λ_{r,1} ≥ ⋯ ≥ λ_{r,d} the eigenvalues of Σ_r. If $\{\lambda _{r,1}:r\in \mathbb N\}$ is unbounded or $\{\lambda _{r,d}:r\in \mathbb N\}$ has zero as a cluster point, then there exists a subsequence $(\nu _{r_{k}},{\varSigma }_{r_{k}})_{k}$ of (ν_r,Σ_r)_r, such that $\lim \limits _{k\to \infty }L(\nu _{r_{k}},{\varSigma }_{r_{k}})=\infty $.

Proof

Without loss of generality we assume (by considering a subsequence) that either $\lambda _{r,1}\to \infty $ as $r\to \infty $ and λ_{r,d} ≥ c > 0 for all $r\in \mathbb N$ or that λ_{r,d} → 0 as $r\to \infty $. By [17, Theorem 4.3] for fixed ν = ν₋, we have $L_{\nu _{-}}({\varSigma }_{r}) \to \infty $ as $r\to \infty $.

The function $h\colon \mathbb {R}_{>0}\to \mathbb {R}$ defined by $\nu \mapsto (d+\nu )\log (\nu +k)$ is monotone increasing for all $k\in \mathbb {R}_{\geq 0}$. This can be seen as follows: The derivative of h fulfills

$$ h^{\prime}(\nu)=\frac{d+\nu}{k+\nu}+\log(\nu+k)\geq \frac{1+\nu}{k+\nu}+\log(\nu+k), $$

and since

$$ \frac{\partial}{\partial k} \left( \frac{1+\nu}{k+\nu}+\log(\nu+k)\right)=\frac{k-1}{(k+\nu)^{2}}, $$

the latter function is minimal for k = 1, so that

$$ h^{\prime}(\nu)\geq \frac{1+\nu}{k+\nu}+\log(\nu+k)\geq\frac{1+\nu}{1+\nu}+\log(\nu+1)=1+\log(1+\nu)>0. $$

Using this relation, we obtain

$$ (d+\nu_{r})\sum\limits_{i=1}^{n} w_{i}\log\left( \nu_{r}+x_{i}^{\mathrm{T}}{\varSigma}_{r}^{-1}x_{i}\right) \geq (d+\nu_{-})\sum\limits_{i=1}^{n} w_{i}\log\left( \nu_{-}+x_{i}^{\mathrm{T}}{\varSigma}_{r}^{-1}x_{i}\right) $$

and further

$$ \begin{array}{@{}rcl@{}} L(\nu_{r},{\varSigma}_{r}) &=& (d+\nu_{r})\sum\limits_{i=1}^{n} w_{i}\log\left( \nu_{r}+x_{i}^{\mathrm{T}}{\varSigma}_{r}^{-1}x_{i}\right)+\log(\left| {\varSigma}_{r} \right|)\\ &\geq& (d+\nu_{-})\sum\limits_{i=1}^{n} w_{i}\log\left( \nu_{-}+x_{i}^{\mathrm{T}}{\varSigma}_{r}^{-1}x_{i}\right)+\log(\left| {\varSigma}_{r} \right|)\\ &=&L_{\nu_{-}}({\varSigma}_{r})\to\infty \qquad \text{as} \quad r\to\infty. \end{array} $$

□

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.


About this article

Cite this article

Hasannasab, M., Hertrich, J., Laus, F. et al. Alternatives to the EM algorithm for ML estimation of location, scatter matrix, and degree of freedom of the Student t distribution. Numer Algor 87, 77–118 (2021). https://doi.org/10.1007/s11075-020-00959-w


Received: 16 October 2019
Accepted: 02 June 2020
Published: 23 September 2020
Issue Date: May 2021
DOI: https://doi.org/10.1007/s11075-020-00959-w
