1 Introduction

A Gaussian mixture model consists of several component Gaussian distributions. Given samples of a Gaussian mixture model, one often needs to estimate the parameters of each component Gaussian distribution [24, 32]. Consider a Gaussian mixture model with r components. For each i ∈ [r] := {1,…,r}, let ωi be the positive probability for the i th component Gaussian to appear in the mixture model. We have each ωi > 0 and \({\sum }_{i=1}^{r}\omega _{i}=1\). Suppose the i th Gaussian distribution is \(\mathcal {N}(\mu _{i},{{\varSigma }}_{i})\), where \(\mu _{i}\in \mathbb {R}^{d}\) is the expectation (or mean) and \({{\varSigma }}_{i}\in \mathbb {R}^{d\times d}\) is the covariance matrix. Let \(y\in \mathbb {R}^{d}\) be the random vector for the Gaussian mixture model and let y1,…,yN be independent and identically distributed (i.i.d.) samples from the mixture model. Each yj is sampled from one of the r component Gaussian distributions and is associated with a label Zj ∈ [r] indicating the component it is sampled from. The probability that a sample comes from the i th component is ωi. When only samples without labels are observed, the Zj’s are called latent variables. The density function for the random variable y is

$$ f(y) := \sum\limits_{i=1}^{r}\omega_{i} \frac{1}{\sqrt{(2\pi)^{d} \det {{\varSigma}}_{i}}} \exp\left\{-\frac{1}{2}(y-\mu_{i})^{T} {{\varSigma}}_{i}^{-1} (y-\mu_{i}) \right\}, $$

where μi is the mean and Σi is the covariance matrix for the i th component.

Learning a Gaussian mixture model means estimating the parameters ωi,μi,Σi for each i ∈ [r] from given samples of y. The number of parameters in a covariance matrix grows quadratically with the dimension. Due to the curse of dimensionality, the computation becomes very expensive for large d [35]. Hence, diagonal covariance matrices are preferable in applications. In this paper, we focus on learning Gaussian mixture models with diagonal covariance matrices, i.e.,

$$ {{\varSigma}}_{i} = \text{diag}\left( \sigma_{i1}^{2},\ldots,\sigma_{id}^{2}\right), \quad i=1,\ldots, r. $$

A natural approach for recovering the unknown parameters ωi,μi,Σi is the method of moments. It estimates parameters by solving a system of multivariate polynomial equations, from moments of the random vector y. Directly solving polynomial systems may encounter non-existence or non-uniqueness of statistically meaningful solutions [57]. However, for diagonal Gaussians, the third order moment tensor can help us avoid these troubles.

Let \(M_{3} := \mathbb {E} (y \otimes y \otimes y)\) be the third order moment tensor for y. One can write y = η(z) + ζ(z), where z is a discrete random variable such that Prob(z = i) = ωi, \(\eta (i)=\mu _{i} \in \mathbb {R}^{d}\), and ζ(i) is the random variable ζi obeying the Gaussian distribution \(\mathcal {N}(0,{{\varSigma }}_{i})\). Assume all Σi are diagonal; then

$$ \begin{array}{@{}rcl@{}} M_{3} &=&\sum\limits_{i=1}^{r} \omega_{i} \mathbb{E}\left[(\eta(i)+\zeta_{i})^{\otimes 3}\right]\\ &=& \sum\limits_{i=1}^{r} \omega_{i} \left( \mu_{i}\otimes\mu_{i}\otimes\mu_{i}+ \mathbb{E}[\mu_{i}\otimes \zeta_{i}\otimes \zeta_{i}] + \mathbb{E}[\zeta_{i}\otimes \mu_{i}\otimes \zeta_{i}] +\mathbb{E}[\zeta_{i}\otimes \zeta_{i}\otimes \mu_{i}] \right). \end{array} $$

The second equality holds because ζi has zero mean and

$$ \mathbb{E}[\zeta_{i}\otimes \zeta_{i}\otimes \zeta_{i}]= \mathbb{E}[\mu_{i}\otimes \mu_{i}\otimes \zeta_{i}]= \mathbb{E}[\zeta_{i}\otimes \mu_{i}\otimes \mu_{i}]= \mathbb{E}[\mu_{i}\otimes \zeta_{i}\otimes \mu_{i}]=0. $$

The random variable ζi has a diagonal covariance matrix, so \(\mathbb {E}[(\zeta _{i})_{j}(\zeta _{i})_{l}] = 0\) for \(j \ne l\). Therefore,

$$ \sum\limits_{i=1}^{r}\omega_{i}\mathbb{E}[\mu_{i}\otimes \zeta_{i}\otimes \zeta_{i}] = \sum\limits_{i=1}^{r}\sum\limits_{j=1}^{d} \omega_{i}\sigma_{ij}^{2}\mu_{i}\otimes e_{j}\otimes e_{j} = \sum\limits_{j=1}^{d} a_{j}\otimes e_{j}\otimes e_{j}, $$

where the vectors aj are given by

$$ a_{j} := \sum\limits_{i=1}^{r}\omega_{i}\sigma^{2}_{ij}\mu_{i}, \quad j=1,\ldots,d. $$
(1.1)

Similarly, we have

$$ \sum\limits_{i=1}^{r}\omega_{i}\mathbb{E}[\zeta_{i}\otimes\mu_{i}\otimes \zeta_{i}] = \sum\limits_{j=1}^{d} e_{j}\otimes a_{j}\otimes e_{j}, \quad \sum\limits_{i=1}^{r}\omega_{i}\mathbb{E}[\zeta_{i}\otimes \zeta_{i}\otimes\mu_{i}] = \sum\limits_{j=1}^{d} e_{j}\otimes e_{j}\otimes a_{j}. $$

Therefore, we can express M3 in terms of ωi, μi, Σi as

$$ M_{3}=\sum\limits_{i=1}^{r} \omega_{i}\mu_{i}\otimes\mu_{i}\otimes\mu_{i} + \sum\limits_{j=1}^{d} \left( a_{j}\otimes e_{j}\otimes e_{j}+e_{j}\otimes a_{j}\otimes e_{j} + e_{j}\otimes e_{j}\otimes a_{j} \right). $$
(1.2)

We are particularly interested in the following third order symmetric tensor

$$ \mathcal{F} := \sum\limits_{i=1}^{r} \omega_{i}\mu_{i}\otimes\mu_{i}\otimes\mu_{i}. $$
(1.3)

When the labels i1, i2, i3 are distinct from each other, we have

$$ (M_{3})_{i_{1}i_{2}i_{3}} = (\mathcal{F})_{i_{1}i_{2}i_{3}} \quad \text{for} \quad i_{1} \ne i_{2} \ne i_{3} \ne i_{1}. $$

Denote the label set

$$ {{\varOmega}} = \{(i_{1}, i_{2}, i_{3}): i_{1} \ne i_{2} \ne i_{3} \ne i_{1}, i_{1},i_{2},i_{3} \text{ are labels for }M_{3} \}. $$
(1.4)

The tensor M3 can be estimated from samples of y, so the entries \(\mathcal {F}_{i_{1}i_{2}i_{3}}\) with (i1,i2,i3) ∈Ω can also be obtained from the estimation of M3. To recover the parameters ωi, μi, we first find a tensor decomposition of \(\mathcal {F}\) from the partially given entries \(\mathcal {F}_{i_{1}i_{2}i_{3}}\) with (i1,i2,i3) ∈Ω. Once the parameters ωi, μi are known, we can determine Σi from the expressions of aj as in (1.1).
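The identity (1.2) can be checked numerically. The following is a minimal sketch in Python with NumPy (the dimensions, weights, and parameters below are only illustrative, not from the paper; indices are 0-based in the code): it builds M3 from (1.1)–(1.2), builds \(\mathcal {F}\) from (1.3), and verifies that the two tensors agree on all entries with three distinct labels.

```python
import numpy as np

# Minimal sketch: build M3 via (1.1)-(1.2) and F via (1.3) for illustrative
# parameters, then check that they agree on all entries with distinct labels.
rng = np.random.default_rng(0)
d, r = 6, 2
w = np.array([0.4, 0.6])                        # mixture weights, sum to 1
mu = rng.standard_normal((r, d))                # means mu_i
sig2 = rng.uniform(0.5, 1.5, size=(r, d))       # diagonal variances sigma_{ij}^2

# F = sum_i w_i * mu_i^{(x)3}, as in (1.3)
F = sum(w[i] * np.einsum('a,b,c->abc', mu[i], mu[i], mu[i]) for i in range(r))

# M3 = F + sum_j (a_j(x)e_j(x)e_j + e_j(x)a_j(x)e_j + e_j(x)e_j(x)a_j), as in (1.2)
M3 = F.copy()
for j in range(d):
    a_j = sum(w[i] * sig2[i, j] * mu[i] for i in range(r))   # a_j as in (1.1)
    e_j = np.eye(d)[j]
    M3 += np.einsum('a,b,c->abc', a_j, e_j, e_j) \
        + np.einsum('a,b,c->abc', e_j, a_j, e_j) \
        + np.einsum('a,b,c->abc', e_j, e_j, a_j)

# entries with three distinct labels (the set Omega) are not affected
for i1 in range(d):
    for i2 in range(d):
        for i3 in range(d):
            if len({i1, i2, i3}) == 3:
                assert np.isclose(M3[i1, i2, i3], F[i1, i2, i3])
```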

The above observation leads to the incomplete tensor decomposition problem. For a third order symmetric tensor \(\mathcal {F}\) whose partial entries \(\mathcal {F}_{i_{1}i_{2}i_{3}}\) with (i1,i2,i3) ∈Ω are known, we are looking for vectors p1,…,pr such that

$$ \mathcal{F}_{i_{1}i_{2}i_{3}} = \left( p_{1}^{\otimes 3} + {\cdots} + p_{r}^{\otimes 3}\right)_{i_{1}i_{2}i_{3}} \quad \text{for all }~(i_{1},i_{2},i_{3}) \in {{\varOmega}}. $$

The above is called an incomplete tensor decomposition for \(\mathcal {F}\). To find such a tensor decomposition for \(\mathcal {F}\), a straightforward approach is to do tensor completion: first find unknown tensor entries \(\mathcal {F}_{i_{1}i_{2}i_{3}}\) with (i1,i2,i3)∉Ω such that the completed \(\mathcal {F}\) has low rank, and then compute the tensor decomposition for \(\mathcal {F}\). However, this approach has serious disadvantages. The theory of tensor completion or recovery, especially for symmetric tensors, is not yet mature. Low rank tensor completion or recovery is typically not guaranteed by currently existing methods. Most methods for tensor completion are based on convex relaxations, e.g., nuclear norm or trace minimization [22, 36, 41, 54, 58]. These convex relaxations may not produce low rank completions [51].

In this paper, we propose a new method for determining incomplete tensor decompositions. It is based on the generating polynomial method in [40]. The label set Ω consists of (i1,i2,i3) of distinct i1, i2, i3. We can still determine some generating polynomials, from the partially given tensor entries \(\mathcal {F}_{i_{1}i_{2}i_{3}}\) with (i1,i2,i3) ∈Ω. They can be used to get the incomplete tensor decomposition. We show that this approach works very well when the rank r is roughly not more than half of the dimension d. Consequently, the parameters for the Gaussian mixture model can be recovered from the incomplete tensor decomposition of \(\mathcal {F}\).

Related Work

Gaussian mixture models have broad applications in machine learning problems, e.g., automatic speech recognition [30, 48, 50], hyperspectral unmixing problem [4, 34], background subtraction [32, 60] and anomaly detection [56]. They also have applications in social and biological sciences [25, 53, 59].

There exist various methods for estimating unknown parameters of Gaussian mixture models. A popular one is the expectation-maximization (EM) algorithm, which iteratively approximates the maximum likelihood parameter estimation [16]. This approach is widely used in applications, although its convergence is not always reliable [49]. Dasgupta [13] introduced a method that first projects data to a randomly chosen low-dimensional subspace and then uses the empirical means and covariances of low-dimensional clusters to estimate the parameters. Later, Arora and Kannan [52] extended this idea to arbitrary Gaussians. Vempala and Wong [55] introduced a spectral technique to enhance the separation condition by projecting data to principal components of the sample matrix instead of a random subspace. For other subsequent work, we refer to Dasgupta and Schulman [14], Kannan et al. [29], Achlioptas et al. [1], Chaudhuri and Rao [10], Brubaker and Vempala [7] and Chaudhuri et al. [9].

Another frequently used approach is based on moments, introduced by Pearson [46]. Belkin and Sinha [3] proposed a learning algorithm for identical spherical Gaussians (Σi = σ2I) with arbitrarily small separation between mean vectors. It was also shown in [28] that a mixture of two Gaussians can be learned with provably minimal assumptions. Hsu and Kakade [27] provided a learning algorithm for a mixture of spherical Gaussians, i.e., each covariance matrix is a multiple of the identity matrix. This method is based on moments up to order three and only assumes non-degeneracy instead of separations. For general covariance matrices, Ge et al. [23] proposed a learning method when the dimension d is sufficiently high. More moment-based methods for general latent variable models can be found in [2].

Contributions

This paper proposes a new method for learning diagonal Gaussian mixture models, based on sample estimates of the first and third order moments. Let \(y_{1},\dots ,y_{N}\) be samples and let {(ωi,μi,Σi) : i ∈ [r]} be the parameters of the diagonal Gaussian mixture model, where each covariance matrix Σi is diagonal. We use the samples \(y_{1},\dots ,y_{N}\) to estimate the third order moment tensor M3, as well as the mean vector M1. We have seen that the tensor M3 can be expressed as in (1.2).

For the tensor \(\mathcal {F}\) in (1.3), we have \(\mathcal {F}_{i_{1}i_{2}i_{3}} = (M_{3})_{i_{1}i_{2}i_{3}}\) when the labels i1, i2, i3 are distinct from each other. Other entries of \(\mathcal {F}\) are not known, since the vectors aj are not available. Thus \(\mathcal {F}\) is an incompletely given tensor. We give a new method for computing the incomplete tensor decomposition of \(\mathcal {F}\) when the rank r is low (roughly no more than half of the dimension d). The tensor decomposition of \(\mathcal {F}\) is unique under some genericity conditions [11], so it can be used to recover the parameters ωi, μi. To compute the incomplete tensor decomposition of \(\mathcal {F}\), we use the generating polynomial method in [40, 42]. We look for a special set of generating polynomials for \(\mathcal {F}\), which can be obtained by solving linear least squares problems. This only requires the known entries of \(\mathcal {F}\). The common zeros of these generating polynomials can be determined from eigenvalue decompositions. Under some genericity assumptions, these common zeros can be used to get the incomplete tensor decomposition. After this is done, the parameters ωi, μi can be recovered by solving linear systems. The diagonal covariance matrices Σi can also be estimated by solving linear least squares problems. The tensor M3 is estimated from the samples y1,…,yN. Typically, the tensor entries \((M_{3})_{i_{1}i_{2}i_{3}}\) and \(\mathcal {F}_{i_{1}i_{2}i_{3}}\) are not precisely given. We also provide a stability analysis for this case, showing that the estimated parameters are also accurate when the entries \((M_{3})_{i_{1}i_{2}i_{3}}\) have small errors.

The paper is organized as follows. In Section 2, we review some basic results for symmetric tensor decompositions and generating polynomials. In Section 3, we give a new algorithm for computing an incomplete tensor decomposition for \(\mathcal {F}\), when only its subtensor \(\mathcal {F}_{{{\varOmega }}}\) is known. Section 4 gives the stability analysis when there are errors for the subtensor \(\mathcal {F}_{{{\varOmega }}}\). Section 5 gives the algorithm for learning Gaussian mixture models. Numerical experiments and applications are given in Section 6. We make some conclusions and discussions in Section 7.

2 Preliminary

Notation

Denote by \(\mathbb {N}\), \(\mathbb {C}\) and \(\mathbb {R}\) the sets of nonnegative integers, complex numbers and real numbers respectively. Denote the cardinality of a set L as |L|. Denote by ei the i th standard unit basis vector, i.e., the i th entry of ei is one and all others are zero. For a complex number c, \(\sqrt [n]{c}\) or c1/n denotes the principal n th root of c. For a complex vector v, Re(v) and Im(v) denote the real and imaginary parts of v respectively. A property is said to be generic if it holds in the whole space except on a subset of zero Lebesgue measure. The norm ∥⋅∥ denotes the Euclidean norm of a vector or the Frobenius norm of a matrix. For a vector or matrix, the superscript T denotes the transpose and H denotes the conjugate transpose. For \(i,j \in \mathbb {N}\), [i] denotes the set {1,2,…,i} and [i,j] denotes the set {i,i + 1,…,j} if \(i \le j\). For a vector v, \(v_{i_{1}:i_{2}}\) denotes the subvector \((v_{i_{1}},v_{i_{1}+1},\ldots ,v_{i_{2}})\). For a matrix A, denote by \(A_{[i_{1}:i_{2}, j_{1}:j_{2}]}\) the submatrix of A whose row labels are i1,i1 + 1,…,i2 and whose column labels are j1,j1 + 1,…,j2. For a tensor \(\mathcal {F}\), its subtensor \(\mathcal {F}_{[i_{1}:i_{2},j_{1}:j_{2},k_{1}:k_{2}]}\) is similarly defined.

Let \(\mathrm {S}^{m}(\mathbb {C}^{d})\) (resp., \(\mathrm {S}^{m}(\mathbb {R}^{d})\)) denote the space of m th order symmetric tensors over the vector space \(\mathbb {C}^{d}\) (resp., \(\mathbb {R}^{d}\)). For convenience of notation, the labels for tensors start with 0. A symmetric tensor \(\mathcal {A} \in \mathrm {S}^{m}(\mathbb {C}^{n+1})\) is labelled as

$$ \mathcal{A} = (\mathcal{A}_{i_{1}{\dots} i_{m}} )_{0\le i_{1}, \ldots, i_{m} \le n}, $$

where the entry \(\mathcal {A}_{i_{1}{\ldots } i_{m}}\) is invariant for all permutations of (i1,…,im). The Hilbert–Schmidt norm \(\|\mathcal {A}\|\) is defined as

$$ \|\mathcal{A}\| := \left( \sum\limits_{0\leq i_{1},\dots,i_{m}\leq n}|\mathcal{A}_{i_{1}{\ldots} i_{m}}|^{2}\right)^{1/2}. $$

The norm of a subtensor \(\|\mathcal {A}_{{{\varOmega }}}\|\) is similarly defined. For a vector \(u:=(u_{0},u_{1},\ldots ,u_{n})\in \mathbb {C}^{d}\), the tensor power um := u ⊗⋯ ⊗ u, where u is repeated m times, is defined such that

$$ (u^{\otimes m})_{i_{1}{\ldots} i_{m}} = u_{i_{1}} \times {\cdots} \times u_{i_{m}} . $$

For a symmetric tensor \(\mathcal {F}\), its symmetric rank is

$$ \text{rank}_{\mathrm{S}}(\mathcal{F}) := \min\left\{r ~|~ \mathcal{F}=\sum\limits_{i=1}^{r} u_{i}^{\otimes m}\right\}. $$

There are other types of tensor ranks [31, 33]. In this paper, we only deal with symmetric tensors and symmetric ranks. We refer to [12, 17, 21, 26, 31, 33] for general work about tensors and their ranks. For convenience, if \(r=\text {rank}_{\mathrm {S}}(\mathcal {F})\), we call \(\mathcal {F}\) a rank-r tensor and \(\mathcal {F}={\sum }_{i=1}^{r} u_{i}^{\otimes m}\) is called a rank decomposition.

For a power \(\alpha :=(\alpha _{1},\alpha _{2},\dots ,\alpha _{n})\in \mathbb {N}^{n}\) and \(x:=(x_{1},x_{2},\dots ,x_{n})\), denote

$$ |\alpha| := \alpha_{1}+\alpha_{2}+\cdots+\alpha_{n},\quad x^{\alpha} := x_{1}^{\alpha_{1}}x_{2}^{\alpha_{2}}{\cdots} x_{n}^{\alpha_{n}}, \quad x_{0} :=1. $$

The monomial power set of degree m is denoted as

$$ \mathbb{N}^{n}_{m} := \{\alpha=(\alpha_{1},\alpha_{2},\dots,\alpha_{n}) \in \mathbb{N}^{n}:|\alpha|\leq m\}. $$

For \(\alpha \in \mathbb {N}_{3}^{n}\), we can write that \(x^{\alpha } = x_{i_{1}}x_{i_{2}}x_{i_{3}}\) for some \(0 \le i_{1},i_{2},i_{3} \le n\).

Let \(\mathbb {C}[x]_{m}\) be the space of all polynomials in x with complex coefficients and whose degrees are no more than m. For a cubic polynomial \(p\in \mathbb {C}[x]_{3}\) and \(\mathcal {F}\in \mathrm {S}^{3}(\mathbb {C}^{n+1})\), we define the bilinear product (note that x0 = 1)

$$ \langle p,\mathcal{F}\rangle=\sum\limits_{0\leq i_{1},i_{2},i_{3}\leq n}p_{i_{1}i_{2}i_{3}}\mathcal{F}_{i_{1}i_{2}i_{3}} \quad\text{ for }~p = \sum\limits_{0\leq i_{1},i_{2},i_{3}\leq n}p_{i_{1}i_{2}i_{3}}x_{i_{1}}x_{i_{2}}x_{i_{3}}, $$

where \(p_{i_{1}i_{2}i_{3}}\) are coefficients of p. A polynomial \(g\in \mathbb {C}[x]_{3}\) is called a generating polynomial for a symmetric tensor \(\mathcal {F} \in \mathrm {S}^{3}(\mathbb {C}^{n+1})\) if

$$ \langle g\cdot x^{\beta}, \mathcal{F}\rangle=0\quad\forall\beta\in\mathbb{N}_{3-\text{deg}(g)}^{n} , $$

where deg(g) denotes the degree of g in x. When the order is bigger than 3, we refer to [40] for the definition of generating polynomials. They can be used to compute symmetric tensor decompositions and low rank approximations [40, 42], which are closely related to truncated moment problems and polynomial optimization [20, 37,38,39, 43]. There are special versions of symmetric tensors and their decompositions [19, 44, 45].
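As a small numerical sketch of the bilinear product and the generating polynomial condition (all choices below are illustrative, not from the paper), take \(\mathcal {F} = u^{\otimes 3}\) with u0 = 1; then g = x1 − u1 is a generating polynomial, since ⟨g⋅xβ, \(\mathcal {F}\)⟩ = u1uβ − u1uβ = 0 for every monomial xβ of degree at most 2.

```python
import numpy as np

# Tiny sketch: F = u^{(x)3} with u_0 = 1; g = x_1 - u_1 is a generating polynomial.
n = 3
u = np.array([1.0, 2.0, -1.0, 0.5])              # labels 0..n, with u_0 = 1
F = np.einsum('a,b,c->abc', u, u, u)

def pairing(coeff, F):
    """<p, F> = sum over (i1,i2,i3) of p_{i1 i2 i3} * F_{i1 i2 i3}."""
    return float(np.sum(coeff * F))

# g * x^beta = x_1 * x^beta - u_1 * x^beta, for every monomial x^beta of degree <= 2
for b1 in range(n + 1):
    for b2 in range(n + 1):                       # x^beta = x_{b1} x_{b2} (x_0 = 1)
        C = np.zeros((n + 1,) * 3)
        C[1, b1, b2] += 1.0                       # coefficient of x_1 x_{b1} x_{b2}
        C[0, b1, b2] -= u[1]                      # coefficient of -u_1 x_0 x_{b1} x_{b2}
        assert abs(pairing(C, F)) < 1e-12
```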

3 Incomplete Tensor Decomposition

This section discusses how to compute an incomplete tensor decomposition for a symmetric tensor \(\mathcal {F} \in \mathrm {S}^{3}(\mathbb {C}^{d})\) when only its subtensor \(\mathcal {F}_{{{\varOmega }}}\) is given, for the label set Ω in (1.4). For convenience of notation, the labels for \(\mathcal {F}\) begin with zeros while a vector \(u\in \mathbb {C}^{d}\) is still labelled as u := (u1,…,ud). We set

$$ n:=d-1, \quad x = (x_{1}, \ldots, x_{n}), \quad x_{0} := 1. $$

For a given rank r, denote the monomial sets

$$ \mathscr{B}_{0} := \{x_{1},\dots,x_{r}\}, \quad \mathscr{B}_{1}=\{x_{i} x_{j}: i \in [r], j \in [r+1, n] \}. $$

For a monomial power \(\alpha \in \mathbb {N}^{n}\), by writing \(\alpha \in {\mathscr{B}}_{1}\), we mean that \(x^{\alpha } \in {\mathscr{B}}_{1}\). For each \(\alpha \in {\mathscr{B}}_{1}\), one can write α = ei + ej with i ∈ [r], j ∈ [r + 1,n]. Let \(\mathbb {C}^{[r] \times {\mathscr{B}}_{1}}\) denote the space of matrices labelled by the pair \((k,\alpha )\in [r] \times {\mathscr{B}}_{1}\). For each \(\alpha = e_{i} + e_{j}\in {\mathscr{B}}_{1}\) and \(G\in \mathbb {C}^{[r] \times {\mathscr{B}}_{1}}\), denote the quadratic polynomial in x

$$ \varphi_{ij}[G](x) := \sum\limits_{k =1}^{r}G(k, e_{i}+e_{j})x_{k}- x_{i} x_{j}. $$
(3.1)

Suppose r is the symmetric rank of \(\mathcal {F}\). A matrix \(G\in \mathbb {C}^{[r] \times {\mathscr{B}}_{1}}\) is called a generating matrix of \(\mathcal {F}\) if each φij[G](x), with \(\alpha = e_{i} + e_{j} \in {\mathscr{B}}_{1}\), is a generating polynomial of \(\mathcal {F}\). Equivalently, G is a generating matrix of \(\mathcal {F}\) if and only if

$$ \langle x_{t} \varphi_{ij}[G](x),\mathcal{F} \rangle = {\sum}_{k=1}^{r} G(k,e_{i}+e_{j})\mathcal{F}_{0kt}-\mathcal{F}_{ijt} = 0, \quad t = 0, 1, \ldots, n, $$
(3.2)

for all i ∈ [r], j ∈ [r + 1,n]. The notion of a generating matrix is motivated by the fact that the entire tensor \(\mathcal {F}\) can be recursively determined by G and its first r entries (see [40]). The existence and uniqueness of the generating matrix G are shown as follows.

Theorem 3.1

Suppose \(\mathcal {F}\) has the decomposition

$$ \mathcal{F} = \lambda_{1}\left[\begin{array}{c} 1 \\ u_{1} \end{array}\right]^{\otimes 3}+\cdots+\lambda_{r}\left[\begin{array}{c} 1 \\ u_{r} \end{array}\right]^{\otimes 3} , $$
(3.3)

for vectors \(u_{i} \in \mathbb {C}^{n}\) and scalars \( 0\neq \lambda _{i} \in \mathbb {C}\). If the subvectors (u1)1:r,…,(ur)1:r are linearly independent, then there exists a unique generating matrix \(G\in \mathbb {C}^{[r] \times {\mathscr{B}}_{1}}\) satisfying (3.2) for the tensor \(\mathcal {F}\).

Proof

We first prove the existence. For each i = 1,…,r, denote the vectors vi = (ui)1:r. Under the given assumption, V := [v1vr] is an invertible matrix. For each l = r + 1,…,n, let

$$ N_{l} := V \cdot \text{diag}\left( (u_{1})_{l},\ldots,(u_{r})_{l} \right) \cdot V^{-1}. $$
(3.4)

Then Nlvi = (ui)lvi for i = 1,…,r, i.e., Nl has eigenvalues (u1)l,…,(ur)l with corresponding eigenvectors (u1)1:r,…,(ur)1:r. We select \(G\in \mathbb {C}^{[r] \times {\mathscr{B}}_{1}}\) to be the matrix such that

$$ N_{l} = \left[\begin{array}{ccc} G(1,e_{1}+e_{l}) & {\cdots} & G(r,e_{1}+e_{l}) \\ {\vdots} & {\ddots} & {\vdots} \\ G(1,e_{r}+e_{l}) & {\cdots} & G(r,e_{r}+e_{l}) \end{array}\right],\quad l=r+1,\ldots,n. $$

For each s = 1,…,r and \(\alpha = e_{i} + e_{j} \in {\mathscr{B}}_{1}\) with i ∈ [r], j ∈ [r + 1,n],

$$ \varphi_{ij}[G](u_{s}) =\sum\limits_{k=1}^{r} G(k,e_{i}+e_{j})(u_{s})_{k} - (u_{s})_{i} (u_{s})_{j} = 0. $$

For each t = 1,…,n, it holds that

$$ \begin{array}{@{}rcl@{}} \langle x_{t} \varphi_{ij}[G](x),\mathcal{F} \rangle &=& \left\langle \sum\limits_{k =1}^{r}G(k, e_{i}+e_{j}) x_{t}x_{k} - x_{t} x_{i} x_{j}, \mathcal{F} \right\rangle \\ &=& \left\langle \sum\limits_{k =1}^{r}G(k, e_{i}+e_{j}) x_{t}x_{k} - x_{t} x_{i} x_{j}, \sum\limits_{s=1}^{r} \lambda_{s} \left[\begin{array}{c} 1 \\ u_{s} \end{array}\right]^{\otimes 3} \right\rangle \\ &=& \sum\limits_{k =1}^{r}G(k, e_{i}+e_{j})\sum\limits_{s=1}^{r} \lambda_{s} (u_{s})_{t} (u_{s})_{k} - \sum\limits_{s=1}^{r} \lambda_{s} (u_{s})_{t} (u_{s})_{i} (u_{s})_{j} \\ &=& \sum\limits_{s=1}^{r} \lambda_{s} (u_{s})_{t} \left( \sum\limits_{k =1}^{r}G(k, e_{i}+e_{j})(u_{s})_{k} -(u_{s})_{i} (u_{s})_{j} \right) \\ &=& 0. \end{array} $$

When t = 0, we can similarly get

$$ \begin{array}{@{}rcl@{}} \langle \varphi_{ij}[G](x) ,\mathcal{F} \rangle &=& \left\langle \sum\limits_{k =1}^{r}G(k, e_{i}+e_{j}) x_{k} - x_{i} x_{j}, \mathcal{F} \right\rangle\\ &=& \sum\limits_{s=1}^{r} \lambda_{s} \left( \sum\limits_{k =1}^{r}G(k, e_{i}+e_{j})(u_{s})_{k} -(u_{s})_{i} (u_{s})_{j} \right) \\ &=& 0. \end{array} $$

Therefore, the matrix G satisfies (3.2) and it is a generating matrix for \(\mathcal {F}\).

Second, we prove the uniqueness of such G. For each \(\alpha = e_{i} + e_{j} \in {\mathscr{B}}_{1}\), let

$$ F := \left[\begin{array}{ccc} \mathcal{F}_{011} & {\cdots} & \mathcal{F}_{0r1} \\ {\vdots} & {\ddots} & {\vdots} \\ \mathcal{F}_{01n} & {\cdots} & \mathcal{F}_{0rn} \end{array}\right],\quad g_{ij} := \left[\begin{array}{c} \mathcal{F}_{1ij} \\ {\vdots} \\ \mathcal{F}_{nij} \end{array}\right]. $$

Since G satisfies (3.2), we have FG(:,ei + ej) = gij. The decomposition (3.3) implies that

$$ F = \left[\begin{array}{ccc} u_{1} & {\cdots} & u_{r} \end{array}\right] \cdot \text{diag}(\lambda_{1},\ldots,\lambda_{r}) \cdot \left[\begin{array}{ccc} v_{1} & {\cdots} & v_{r} \end{array}\right]^{T}. $$

The sets {v1,…,vr} and {u1,…,ur} are both linearly independent. Since each λi≠ 0, the matrix F has full column rank. Hence, the generating matrix G satisfying FG(:,ei + ej) = gij for all i ∈ [r],j ∈ [r + 1,n] is unique. □

The following is an example of generating matrices.

Example 3.2

Consider the tensor \(\mathcal {F}\in \mathtt {S}^{3}(\mathbb {C}^{6})\) that is given as

$$ \mathcal{F} = 0.4\cdot(1,1,1,1,1,1)^{\otimes 3} + 0.6\cdot(1,-1,2,-1,2,3)^{\otimes 3}. $$

The rank r = 2, \({\mathscr{B}}_{0}=\{x_{1},x_{2}\}\) and \({\mathscr{B}}_{1} = \{x_{1}x_{3},x_{1}x_{4},x_{1}x_{5},x_{2}x_{3},x_{2}x_{4},x_{2}x_{5}\}\). We have the vectors

$$ u_{1} =(1,1,1,1,1), \quad u_{2} = (-1,2,-1,2,3), \quad v_{1} =(1,1), \quad v_{2} = (-1,2). $$

The matrices N3, N4, N5 as in (3.4) are

$$ \begin{array}{@{}rcl@{}} N_{3} &=& \left[\begin{array}{cc}1 & -1 \\ 1 & 2 \end{array}\right] \left[\begin{array}{cc}1 & 0 \\ 0 & -1 \end{array}\right] \left[\begin{array}{cc}1 & -1 \\ 1 & 2 \end{array}\right]^{-1}= \left[\begin{array}{cc}1/3 & 2/3 \\ 4/3 & -1/3 \end{array}\right], \\ N_{4} &=& \left[\begin{array}{cc}1 & -1 \\ 1 & 2 \end{array}\right] \left[\begin{array}{cc}1 & 0 \\ 0 & 2 \end{array}\right] \left[\begin{array}{cc}1 & -1 \\ 1 & 2 \end{array}\right]^{-1}= \left[\begin{array}{cc}4/3 & -1/3 \\ -2/3 & 5/3 \end{array}\right], \\ N_{5} &=& \left[\begin{array}{cc}1 & -1 \\ 1 & 2 \end{array}\right] \left[\begin{array}{cc}1 & 0 \\ 0 & 3 \end{array}\right] \left[\begin{array}{cc}1 & -1 \\ 1 & 2 \end{array}\right]^{-1}= \left[\begin{array}{cc}5/3 & -2/3 \\ -4/3 & 7/3 \end{array}\right]. \end{array} $$

The entries of the generating matrix G are listed below:

$$ \begin{array}{c|cccccc} & x_{1}x_{3} & x_{2}x_{3} & x_{1}x_{4} & x_{2}x_{4} & x_{1}x_{5} & x_{2}x_{5} \\ \hline G(1,\cdot) & 1/3 & 4/3 & 4/3 & -2/3 & 5/3 & -4/3 \\ G(2,\cdot) & 2/3 & -1/3 & -1/3 & 5/3 & -2/3 & 7/3 \end{array} $$
(3.5)

The generating polynomials in (3.1) are

$$ \begin{array}{@{}rcl@{}} \varphi_{13}[G](x) &=& \frac{1}{3}x_{1}+\frac{2}{3}x_{2}-x_{1}x_{3},\quad \varphi_{23}[G](x) = \frac{4}{3}x_{1}-\frac{1}{3}x_{2}-x_{2}x_{3},\\ \varphi_{14}[G](x) &=& \frac{4}{3}x_{1}-\frac{1}{3}x_{2}-x_{1}x_{4},\quad \varphi_{24}[G](x) = -\frac{2}{3}x_{1}+\frac{5}{3}x_{2}-x_{2}x_{4},\\ \varphi_{15}[G](x) &=& \frac{5}{3}x_{1}-\frac{2}{3}x_{2}-x_{1}x_{5},\quad \varphi_{25}[G](x) = -\frac{4}{3}x_{1}+\frac{7}{3}x_{2}-x_{2}x_{5}. \end{array} $$

The above generating polynomials can be written in the form

$$ \left[\begin{array}{c} \varphi_{1j}[G](x)\\ \varphi_{2j}[G](x) \end{array}\right] = N_{j}\left[\begin{array}{c} x_{1} \\ x_{2} \end{array}\right] - x_{j} \left[\begin{array}{c} x_{1}\\ x_{2} \end{array}\right]\quad \text{for } j=3,4,5. $$

For x to be a common zero of φ1j[G](x) and φ2j[G](x), the subvector (x1,x2) must be an eigenvector of Nj with corresponding eigenvalue xj.
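The construction (3.4) and the eigenvector property above can be reproduced with a few lines of NumPy. The following sketch (not part of the paper) rebuilds N3, N4, N5 of Example 3.2 and checks that each vi = (ui)1:2 is an eigenvector of Nl with eigenvalue (ui)l.

```python
import numpy as np

# Sketch: rebuild N_3, N_4, N_5 of Example 3.2 via (3.4) and check the
# eigenvector relation N_l v_i = (u_i)_l v_i, with v_i = (u_i)_{1:2}.
u1 = np.array([1.0, 1, 1, 1, 1])          # u_1
u2 = np.array([-1.0, 2, -1, 2, 3])        # u_2
r = 2
V = np.column_stack([u1[:r], u2[:r]])     # V = [v_1 v_2]

for l in range(r, 5):                     # Python index l corresponds to label l+1 = 3, 4, 5
    N_l = V @ np.diag([u1[l], u2[l]]) @ np.linalg.inv(V)
    print(np.round(N_l, 4))               # matches N_3, N_4, N_5 in Example 3.2
    assert np.allclose(N_l @ V[:, 0], u1[l] * V[:, 0])
    assert np.allclose(N_l @ V[:, 1], u2[l] * V[:, 1])
```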

3.1 Computing the Tensor Decomposition

We show how to find an incomplete tensor decomposition (3.3) for \(\mathcal {F}\) when only its subtensor \(\mathcal {F}_{{{\varOmega }}}\) is given, where the label set Ω is as in (1.4). Suppose that there exists the decomposition (3.3) for \(\mathcal {F}\), for vectors \(u_{i} \in \mathbb {C}^{n}\) and nonzero scalars \(\lambda _{i} \in \mathbb {C}\). Assume the subvectors (u1)1:r,…,(ur)1:r are linearly independent, so there is a unique generating matrix G for \(\mathcal {F}\), by Theorem 3.1.

For each \(\alpha = e_{i} + e_{j} \in {\mathscr{B}}_{1}\) with i ∈ [r],j ∈ [r + 1,n] and for each

$$ l=r+1,\ldots,j-1,j+1,\ldots,n, $$

the generating matrix G satisfies the equations

$$ \left\langle x_{l} \left( \sum\limits_{k =1}^{r}G(k, e_{i}+e_{j})x_{k} - x_{i} x_{j} \right),\mathcal{F} \right\rangle = \sum\limits_{k =1}^{r}G(k, e_{i}+e_{j}) \mathcal{F}_{0kl} - \mathcal{F}_{ijl} = 0. $$
(3.6)

Let the matrix \(A_{ij}[\mathcal {F}]\in \mathbb {C}^{(n-r-1)\times r}\) and the vector \(b_{ij}[\mathcal {F}]\in \mathbb {C}^{n-r-1}\) be such that

$$ A_{ij}[\mathcal{F}] := \left[\begin{array}{ccc} \mathcal{F}_{0,1,r+1} & {\cdots} & \mathcal{F}_{0,r,r+1} \\ {\vdots} & {\ddots} & {\vdots} \\ \mathcal{F}_{0,1,j-1} & {\cdots} & \mathcal{F}_{0,r,j-1} \\ \mathcal{F}_{0,1,j+1} & {\cdots} & \mathcal{F}_{0,r,j+1} \\ {\vdots} & {\ddots} & {\vdots} \\ \mathcal{F}_{0,1,n} & {\cdots} & \mathcal{F}_{0,r,n} \end{array}\right], \quad b_{ij}[\mathcal{F}] := \left[\begin{array}{c} \mathcal{F}_{i,j,r+1}\\ {\vdots} \\ \mathcal{F}_{i,j,j-1}\\ \mathcal{F}_{i,j,j+1}\\ {\vdots} \\ \mathcal{F}_{i,j,n} \end{array}\right]. $$
(3.7)

For clarity, commas are inserted to separate the labeling numbers of the tensor entries of \(\mathcal {F}\).

The equations in (3.6) can be equivalently written as

$$ A_{ij}[\mathcal{F}] \cdot G(:, e_{i}+e_{j}) = b_{ij}[\mathcal{F}]. $$
(3.8)
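The following is a short NumPy sketch (not from the paper; the dimensions, random data, and index choices are illustrative) of how a column G(:, ei + ej) is obtained from the known entries: it assembles \(A_{ij}[\mathcal {F}]\) and \(b_{ij}[\mathcal {F}]\) as in (3.7), solves (3.8) by least squares, and checks the identity φij[G](us) = 0.

```python
import numpy as np

# Sketch: assemble A_ij[F], b_ij[F] as in (3.7) from entries with distinct
# labels and solve (3.8) for one column G(:, e_i + e_j).
rng = np.random.default_rng(1)
d, r = 8, 3
n = d - 1
lam = rng.uniform(0.5, 2.0, r)
U = rng.standard_normal((r, n))                  # rows are u_1, ..., u_r
P = np.hstack([np.ones((r, 1)), U])              # rows are (1, u_s)
F = np.einsum('s,sa,sb,sc->abc', lam, P, P, P)   # the decomposition (3.3)

i, j = 1, r + 1                                  # i in [r], j in [r+1, n]
rows = [l for l in range(r + 1, n + 1) if l != j]
A = np.array([[F[0, k, l] for k in range(1, r + 1)] for l in rows])
b = np.array([F[i, j, l] for l in rows])         # all labels (0,k,l) and (i,j,l) distinct
g, *_ = np.linalg.lstsq(A, b, rcond=None)        # G(:, e_i + e_j)

# the generating polynomial phi_ij[G] vanishes at every u_s, as in Theorem 3.1
for s in range(r):
    assert np.isclose(g @ U[s, :r], U[s, i - 1] * U[s, j - 1])
```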

If the rank \(r\le \frac {d}{2}-1\), then n − r − 1 = d − r − 2 ≥ r. Thus, the number of rows is not less than the number of columns for the matrices \(A_{ij}[\mathcal {F}]\). If \(A_{ij}[\mathcal {F}]\) has linearly independent columns, then (3.8) uniquely determines G(:,α). In such a case, the matrix G can be fully determined by the linear systems (3.8). Let \(N_{r+1}(G),\ldots ,N_{n}(G) \in \mathbb {C}^{r\times r}\) be the matrices given as

$$ N_{l}(G) = \left[\begin{array}{ccc} G(1,e_{1}+e_{l}) & {\cdots} &G(r,e_{1}+e_{l}) \\ {\vdots} & {\ddots} & {\vdots} \\ G(1,e_{r}+e_{l}) & {\cdots} &G(r,e_{r}+e_{l}) \end{array}\right],\quad l=r+1,\ldots, n. $$
(3.9)

As in the proof of Theorem 3.1, one can see that

$$ N_{l}(G) \left[\begin{array}{c} (u_{i})_{1} \\ {\vdots} \\(u_{i})_{r} \end{array}\right] = (u_{i})_{l} \cdot \left[\begin{array}{c} (u_{i})_{1} \\ {\vdots} \\(u_{i})_{r} \end{array}\right], \quad l=r+1,\ldots, n. $$

The above is equivalent to the equations

$$ N_{l}(G)v_{i} = (w_{i})_{l-r} \cdot v_{i}, \quad l=r+1,\ldots, n, $$

for the vectors (i = 1,…,r)

$$ v_{i} := (u_{i})_{1:r}, \quad w_{i} := (u_{i})_{r+1:n}. $$
(3.10)

Each vi is a common eigenvector of the matrices Nr+1(G),…,Nn(G), and \((w_{i})_{l-r}\) is the associated eigenvalue of Nl(G). These matrices may or may not have repeated eigenvalues, and a single Nl(G) with a repeated eigenvalue does not determine its eigenvectors uniquely. Therefore, we select a generic vector \(\xi := (\xi _{r+1},\dots ,\xi _{n})\) and let

$$ N(\xi) := \xi_{r+1}N_{r+1}(G)+\cdots+\xi_{n}N_{n}(G). $$
(3.11)

The eigenvalues of N(ξ) are ξTw1,…,ξTwr. When w1,…,wr are distinct from each other and ξ is generic, the matrix N(ξ) does not have a repeated eigenvalue and hence it has unique eigenvectors v1,…,vr, up to scaling. Let \(\tilde {v}_{1},\ldots ,\tilde {v}_{r}\) be unit length eigenvectors of N(ξ). They are also common eigenvectors of Nr+ 1(G),…,Nn(G). For each i = 1,…,r, let \(\tilde {w}_{i}\) be the vector such that its j th entry \((\tilde {w}_{i})_{j}\) is the eigenvalue of Nj+r(G), associated to the eigenvector \(\tilde {v}_{i}\), or equivalently,

$$ \tilde{w}_{i}=\left( \tilde{v}_{i}^{H} N_{r+1}(G)\tilde{v}_{i},\dots, \tilde{v}_{i}^{H} N_{n}(G)\tilde{v}_{i}\right),\quad i=1,\ldots,r. $$
(3.12)

Up to a permutation of \((\tilde {v}_{1},\ldots , \tilde {v}_{r})\), there exist scalars γi such that

$$ v_{i} = \gamma_{i} \tilde{v}_{i}, \quad w_{i} = \tilde{w}_{i}. $$
(3.13)

The tensor decomposition of \(\mathcal {F}\) can also be written as

$$ \mathcal{F} = \lambda_{1} \left[\begin{array}{c} 1 \\ \gamma_{1} \tilde{v}_{1} \\ \tilde{w}_{1} \end{array}\right]^{\otimes 3} + {\cdots} +\lambda_{r} \left[\begin{array}{c} 1 \\ \gamma_{r} \tilde{v}_{r} \\ \tilde{w}_{r} \end{array}\right]^{\otimes 3}. $$

The scalars \(\lambda _{1},\dots ,\lambda _{r}\) and \( \gamma _{1},\dots ,\gamma _{r}\) satisfy the linear equations

$$ \begin{array}{@{}rcl@{}} \lambda_{1}\gamma_{1} \tilde{v}_{1} \otimes\tilde{w}_{1} +\cdots+ {\lambda_{r}}{\gamma_{r}} \tilde{v}_{r} \otimes \tilde{w}_{r} &=& \mathcal{F}_{[0,1:r,r+1:n]}, \\ \lambda_{1}{\gamma_{1}^{2}}\tilde{v}_{1}\otimes \tilde{v}_{1} \otimes \tilde{w}_{1}+\cdots+\lambda_{r}{\gamma_{r}^{2}} \tilde{v}_{r}\otimes \tilde{v}_{r}\otimes \tilde{w}_{r} &=&\mathcal{F}_{[1:r,1:r,r+1:n]} . \end{array} $$

Denote the label sets

$$ J_{1} := \left\{ (0,i_{1},i_{2}):~ i_{1} \in [r],~ i_{2} \in [r+1,n] \right\}, \qquad J_{2} := \left\{ (i_{1},i_{2},i_{3}):~ i_{1},i_{2} \in [r],~ i_{1} \ne i_{2},~ i_{3} \in [r+1,n] \right\}. $$
(3.14)

To determine the scalars λi, γi, we can solve the linear least squares

$$ \begin{array}{@{}rcl@{}} &&\underset{(\beta_{1},\ldots,\beta_{r})}{\min} \left\|\mathcal{F}_{J_{1}} - \sum\limits_{i=1}^{r} \beta_{i} \cdot \tilde{v}_{i} \otimes \tilde{w}_{i} \right\|^{2}, \end{array} $$
(3.15)
$$ \begin{array}{@{}rcl@{}} && \underset{(\theta_{1},\ldots,\theta_{r})}{\min} \left\|\mathcal{F}_{J_{2}} - \sum\limits_{k=1}^{r}\theta_{k} \cdot (\tilde{v}_{k} \otimes \tilde{v}_{k} \otimes \tilde{w}_{k})_{J_{2}} \right\|^{2}. \end{array} $$
(3.16)

Let \((\beta _{1}^{\ast },\ldots ,\beta _{r}^{\ast })\), \((\theta _{1}^{\ast },\ldots ,\theta _{r}^{\ast })\) be minimizers of (3.15) and (3.16) respectively. Then, for each i = 1,…,r, let

$$ \lambda_{i} := (\beta_{i}^{\ast})^{2}/\theta_{i}^{\ast}, \quad \gamma_{i} := \theta_{i}^{\ast}/\beta_{i}^{\ast}. $$
(3.17)

For the vectors (i = 1,…,r)

$$ p_{i} := \sqrt[3]\lambda_{i}(1,\gamma_{i} \tilde{v}_{i},\tilde{w}_{i}), $$

the sum \(p_{1}^{\otimes 3}+ {\cdots } +p_{r}^{\otimes 3}\) is a tensor decomposition for \(\mathcal {F}\). This is justified in the following theorem.

Theorem 3.3

Suppose the tensor \(\mathcal {F}\) has the decomposition as in (3.3). Assume that the vectors v1,…,vr are linearly independent and the vectors w1,…,wr are distinct from each other, where v1,…,vr,w1,…,wr are defined as in (3.10). Let ξ be a generically chosen coefficient vector and let p1,…,pr be the vectors produced as above. Then, the tensor decomposition \(\mathcal {F} = p_{1}^{\otimes 3}+ {\cdots } +p_{r}^{\otimes 3}\) is unique.

Proof

Since v1,…,vr are linearly independent, the tensor decomposition (3.3) is unique, up to scalings and permutations. By Theorem 3.1, there is a unique generating matrix G for \(\mathcal {F}\) satisfying (3.2). Under the given assumptions, (3.8) uniquely determines G. Note that ξTw1,…,ξTwr are the eigenvalues of N(ξ) and v1,…,vr are the corresponding eigenvectors. When ξ is generically chosen, the values of ξTw1,…,ξTwr are distinct eigenvalues of N(ξ). So N(ξ) has unique eigenvalue decompositions, and hence (3.13) must hold, up to a permutation of (v1,…,vr). Since the coefficient matrices have full column ranks, the linear least squares problems have unique optimal solutions. Up to a permutation of p1,…,pr, it holds that \(p_{i} = \sqrt [3]{\lambda _{i}} \left [\begin {array}{c} 1 \\ u_{i} \end {array}\right ]\). Then, the conclusion follows readily. □

The following is the algorithm for computing an incomplete tensor decomposition for \(\mathcal {F}\) when only its subtensor \(\mathcal {F}_{{{\varOmega }}}\) is given.

Algorithm 3.4

(Incomplete symmetric tensor decompositions)

  • Input: A third order symmetric subtensor \({\mathcal {F}}_{{{\varOmega }}}\) and the rank \(r =\text {rank}_{S}(\mathcal {F})\le \frac {d}{2}-1\).

  • Step 1: Determine the matrix G by solving (3.8) for each \(\alpha =e_{i}+e_{j} \in {\mathscr{B}}_{1}\).

  • Step 2: Let N(ξ) be the matrix as in (3.11), for a randomly selected vector ξ. Compute the unit length eigenvectors \(\tilde {v}_{1},\ldots ,\tilde {v}_{r}\) of N(ξ) and choose \(\tilde {w}_{i}\) as in (3.12).

  • Step 3: Solve the linear least squares (3.15) and (3.16) to get the coefficients λi, γi as in (3.17).

  • Step 4: For each i = 1,…,r, let \(p_{i} := \sqrt [3]{ \lambda _{i}}(1, \gamma _{i} \tilde {v}_{i}, \tilde {w}_{i})\).

  • Output: The tensor decomposition \(\mathcal {F} = (p_{1})^{\otimes 3}+\cdots +(p_{r})^{\otimes 3}\).
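The following is a minimal NumPy sketch of Algorithm 3.4 (not the paper's code; the function name, indexing conventions, and random seed are ours). It assumes exact tensor entries, generic data, and 2 ≤ r ≤ d/2 − 1, and it reads only the entries of \(\mathcal {F}\) with three distinct labels.

```python
import numpy as np

def incomplete_sym_decomp(F_known, r, seed=0):
    """Sketch of Algorithm 3.4; assumes exact entries, generic data, 2 <= r <= d/2 - 1.
    F_known is d x d x d; only entries with three distinct labels are read.
    Returns an r x d array whose rows p_1, ..., p_r satisfy F ~ sum_i p_i^{(x)3}."""
    d = F_known.shape[0]
    n = d - 1
    # Step 1: solve (3.8) for every column G(:, e_i + e_j) and form N_l(G) as in (3.9)
    N = {l: np.zeros((r, r), dtype=complex) for l in range(r + 1, n + 1)}
    for j in range(r + 1, n + 1):
        rows = [l for l in range(r + 1, n + 1) if l != j]
        A = np.array([[F_known[0, k, l] for k in range(1, r + 1)] for l in rows])
        for i in range(1, r + 1):
            b = np.array([F_known[i, j, l] for l in rows])
            g, *_ = np.linalg.lstsq(A, b, rcond=None)
            N[j][i - 1, :] = g
    # Step 2: eigenvectors of a generic combination N(xi), as in (3.11)-(3.12)
    xi = np.random.default_rng(seed).standard_normal(n - r)
    Nxi = sum(xi[l - r - 1] * N[l] for l in range(r + 1, n + 1))
    _, Vt = np.linalg.eig(Nxi)                   # columns are unit eigenvectors v~_i
    W = np.array([[Vt[:, i].conj() @ N[l] @ Vt[:, i] for l in range(r + 1, n + 1)]
                  for i in range(r)])            # row i is w~_i
    # Step 3: least squares (3.15)-(3.16); beta_i = lambda_i*gamma_i, theta_i = lambda_i*gamma_i^2
    B1 = np.stack([np.outer(Vt[:, i], W[i]).ravel() for i in range(r)], axis=1)
    rhs1 = F_known[0, 1:r + 1, r + 1:n + 1].astype(complex).ravel()
    beta, *_ = np.linalg.lstsq(B1, rhs1, rcond=None)
    pairs = [(a, b) for a in range(r) for b in range(r) if a != b]
    B2 = np.stack([np.concatenate([Vt[a, i] * Vt[b, i] * W[i] for (a, b) in pairs])
                   for i in range(r)], axis=1)
    rhs2 = np.concatenate([F_known[a + 1, b + 1, r + 1:n + 1] for (a, b) in pairs]).astype(complex)
    theta, *_ = np.linalg.lstsq(B2, rhs2, rcond=None)
    lam, gam = beta ** 2 / theta, theta / beta   # as in (3.17)
    # Step 4: p_i = lambda_i^{1/3} * (1, gamma_i v~_i, w~_i)
    return np.array([(lam[i] ** (1 / 3)) * np.concatenate(([1.0 + 0j], gam[i] * Vt[:, i], W[i]))
                     for i in range(r)])
```

Applied to the tensor of Example 3.2 (with the unknown entries left as zeros), this sketch should recover the decomposition of Example 3.5, up to a permutation of the components and cube roots of unity.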

The following is an example of applying Algorithm 3.4.

Example 3.5

Consider the same tensor \(\mathcal {F}\) as in Example 3.2. The monomial sets \({\mathscr{B}}_{0}\), \({\mathscr{B}}_{1}\) are the same. The matrices \(A_{ij}[\mathcal {F}]\) and vectors \(b_{ij}[\mathcal {F}]\) are

$$ \begin{array}{@{}rcl@{}} A_{13}[\mathcal{F}] &=& A_{23}[\mathcal{F}]= \left[\begin{array}{cc} -0.8 & 2.8\\ -1.4 & 4 \end{array}\right], \qquad b_{13}[\mathcal{F}]=\left[\begin{array}{c}1.6\\2.2 \end{array}\right],\quad~~~ b_{23}[\mathcal{F}]=\left[\begin{array}{c}-2\\-3.2 \end{array}\right],\\ A_{14}[\mathcal{F}] &=& A_{24}[\mathcal{F}]= \left[\begin{array}{cc} 1 & -0.8\\ -1.4 & 4 \end{array}\right], \quad~ b_{14}[\mathcal{F}]=\left[\begin{array}{c}1.6\\-3.2 \end{array}\right],\quad b_{24}[\mathcal{F}]=\left[\begin{array}{c}-2\\7.6 \end{array}\right],\\ A_{15}[\mathcal{F}] &=& A_{25}[\mathcal{F}]= \left[\begin{array}{cc} 1 & -0.8\\ -0.8 & 2.8 \end{array}\right], \quad~ b_{15}[\mathcal{F}] = \left[\begin{array}{c}2.2\\-3.2 \end{array}\right],\quad~ b_{25}[\mathcal{F}] = \left[\begin{array}{c}-3.2\\7.6 \end{array}\right]. \end{array} $$

Solving (3.8), we obtain G, which is the same as in (3.5). The matrices N3(G), N4(G), N5(G) are

$$ N_{3}(G)=\left[\begin{array}{cc} 1/3 & 2/3\\4/3 & -1/3 \end{array}\right],\quad N_{4}(G)=\left[\begin{array}{cc} 4/3 & -1/3\\-2/3 & 5/3 \end{array}\right],\quad N_{5}(G)=\left[\begin{array}{cc} 5/3 & -2/3\\-4/3 & 7/3 \end{array}\right]. $$

Choose a generic ξ, say, ξ = (3,4,5), then

$$ N(\xi) = \left[\begin{array}{cc}1/\sqrt{2} & -1/\sqrt{5} \\ 1/\sqrt{2} & 2/\sqrt{5} \end{array}\right] \left[\begin{array}{cc} 12 & 0 \\ 0 & 20 \end{array}\right] \left[\begin{array}{cc}1/\sqrt{2} & -1/\sqrt{5} \\ 1/\sqrt{2} & 2/\sqrt{5} \end{array}\right]^{-1}. $$

The unit length eigenvectors are

$$ \tilde{v}_{1} = (1/\sqrt{2},1/\sqrt{2}), \quad \tilde{v}_{2}=(-1/\sqrt{5},2/\sqrt{5}) . $$

As in (3.12), we get the vectors

$$ \tilde{w}_{1} = (1,1,1),\quad \tilde{w}_{2} = (-1,2,3). $$

Solving (3.15) and (3.16), we get the scalars

$$ \gamma_{1}=\sqrt{2}, \quad \gamma_{2}=\sqrt{5}, \quad \lambda_{1}=0.4, \quad \lambda_{2} = 0.6. $$

This produces the decomposition \(\mathcal {F}=\lambda _{1}u_{1}^{\otimes 3}+\lambda _{2}u_{2}^{\otimes 3}\) for the vectors

$$ u_{1}=(1,\gamma_{1}\tilde{v}_{1},\tilde{w}_{1})=(1,1,1,1,1,1), \quad u_{2}=(1,\gamma_{2}\tilde{v}_{2},\tilde{w}_{2})=(1,-1,2,-1,2,3). $$

Remark 3.6

Algorithm 3.4 requires the value of r. This is generally a hard question. In computational practice, one can estimate the value of r as follows. Let \(\text {Flat}(\mathcal {F})\in \mathbb {C}^{(n+1) \times (n+1)^{2}}\) be the flattening matrix, labelled by (i,(j,k)) such that

$$ \text{Flat}(\mathcal{F})_{i,(j,k)} = \mathcal{F}_{ijk} $$

for all i,j,k = 0,1,…,n. The rank of \(\text {Flat}(\mathcal {F})\) equals the rank of \(\mathcal {F}\) when the vectors p1,…,pr are linearly independent. The rank of \(\text {Flat}(\mathcal {F})\) is not directly available since only the subtensor \((\mathcal {F})_{{{\varOmega }}}\) is known. However, we can calculate the ranks of submatrices of \(\text {Flat}(\mathcal {F})\) whose entries are known. If the tensor \(\mathcal {F}\) as in (3.3) is such that both the sets {v1,…,vr} and {w1,…,wr} are linearly independent, one can see that \({\sum }_{i=1}^{r} \lambda _{i} v_{i}{w_{i}^{T}}\) is a known submatrix of \(\text {Flat}(\mathcal {F})\) whose rank is r. This is generally the case if \(r\le \frac {d}{2}-1\), since vi has length r and wi has length d − 1 − r ≥ r. Therefore, the known submatrices of \(\text {Flat}(\mathcal {F})\) are generally sufficient to estimate \(\text {rank}_{S}(\mathcal {F})\). For instance, consider the case \(\mathcal {F}\in \text {S}^{3}(\mathbb {C}^{7})\). The entries \(\mathcal {F}_{ij0}\) of \(\text {Flat}(\mathcal {F})\) form the matrix

$$ \left[\begin{array}{ccccccc} \ast & \ast & \ast & \ast & \ast & \ast & \ast\\ \ast & \ast & \mathcal{F}_{120} & \mathcal{F}_{130} & \mathcal{F}_{140} & \mathcal{F}_{150} & \mathcal{F}_{160}\\ \ast & \mathcal{F}_{210} & \ast & \mathcal{F}_{230} & \mathcal{F}_{240} & \mathcal{F}_{250} & \mathcal{F}_{260}\\ \ast & \mathcal{F}_{310} & \mathcal{F}_{320} & \ast & \mathcal{F}_{340} & \mathcal{F}_{350} & \mathcal{F}_{360}\\ \ast & \mathcal{F}_{410} & \mathcal{F}_{420} & \mathcal{F}_{430} & \ast & \mathcal{F}_{450} & \mathcal{F}_{460}\\ \ast & \mathcal{F}_{510} & \mathcal{F}_{520} & \mathcal{F}_{530} & \mathcal{F}_{540} & \ast & \mathcal{F}_{560}\\ \ast & \mathcal{F}_{610} & \mathcal{F}_{620} & \mathcal{F}_{630} & \mathcal{F}_{640} & \mathcal{F}_{650} & \ast \end{array}\right], $$

where each ∗ denotes an entry that is not given. The largest submatrices with fully known entries are

$$ \left[\begin{array}{ccc} \mathcal{F}_{410} & \mathcal{F}_{420} & \mathcal{F}_{430}\\ \mathcal{F}_{510} & \mathcal{F}_{520} & \mathcal{F}_{530}\\ \mathcal{F}_{610} & \mathcal{F}_{620} & \mathcal{F}_{630} \end{array}\right], \quad \left[\begin{array}{ccc} \mathcal{F}_{140} & \mathcal{F}_{150} & \mathcal{F}_{160}\\ \mathcal{F}_{240} & \mathcal{F}_{250} & \mathcal{F}_{260}\\ \mathcal{F}_{340} & \mathcal{F}_{350} & \mathcal{F}_{360} \end{array}\right]. $$

The ranks of the above matrices generally equal \(\text {rank}_{S}(\mathcal {F})\) if \(r\le \frac {d}{2}-1 = 2.5\), i.e., if r ≤ 2.
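A minimal sketch of this rank estimation (not from the paper; the submatrix choice and test data are illustrative, and indices are 0-based) computes the numerical rank of one known submatrix of \(\text {Flat}(\mathcal {F})\):

```python
import numpy as np

def estimate_rank(F_known, tol=1e-8):
    """Estimate rank_S(F) from the known submatrix [F_{i,j,0}], 1 <= i <= h < j <= d-1."""
    d = F_known.shape[0]
    h = (d - 1) // 2
    block = F_known[1:h + 1, h + 1:, 0]          # all labels (i, j, 0) are distinct
    s = np.linalg.svd(block, compute_uv=False)
    return int(np.sum(s > tol * s[0]))

# quick check on a random rank-3 symmetric tensor with d = 9
rng = np.random.default_rng(2)
P = np.hstack([np.ones((3, 1)), rng.standard_normal((3, 8))])
F = np.einsum('sa,sb,sc->abc', P, P, P)
print(estimate_rank(F))                          # prints 3
```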

4 Tensor Approximations and Stability Analysis

In some applications, we do not have the subtensor \(\mathcal {F}_{{{\varOmega }}}\) exactly but only have an approximation \(\widehat {\mathcal {F}}_{{{\varOmega }}}\) for it. Algorithm 3.4 can still provide a good rank-r approximation for \(\mathcal {F}\) when it is applied to \(\widehat {\mathcal {F}}_{{{\varOmega }}}\). We define the matrix \(A_{ij}[\widehat {\mathcal {F}}]\) and the vector \(b_{ij}[\widehat {\mathcal {F}}]\) in the same way as in (3.7), for each \(\alpha = e_{i}+e_{j} \in {\mathscr{B}}_{1}\). The generating matrix G for \(\mathcal {F}\) can be approximated by solving the linear least squares

$$ \underset{g \in \mathbb{C}^{r} }{\min} \|A_{ij}[\widehat{\mathcal{F}}] \cdot g - b_{ij}[\widehat{\mathcal{F}}]\|^{2}, $$
(4.1)

for each \(\alpha =e_{i}+e_{j} \in {\mathscr{B}}_{1}\). Let \(\widehat {G}(:,e_{i}+e_{j})\) be the optimizer of the above and \(\widehat {G}\) be the matrix consisting of all such \(\widehat {G}(:,e_{i}+e_{j})\). Then \(\widehat {G}\) is an approximation for G. For each l = r + 1,…,n, define the matrix \(N_{l}(\widehat {G})\) similarly as in (3.9). Choose a generic vector ξ = (ξr+ 1,…,ξn) and let

$$ \widehat{N}(\xi) := \xi_{r+1} N_{r+1}(\widehat{G}) +{\cdots} +\xi_{n}N_{n}(\widehat{G}). $$
(4.2)

The matrix \(\widehat {N}(\xi )\) is an approximation for N(ξ). Let \(\hat {v}_{1},\ldots ,\hat {v}_{r}\) be unit length eigenvectors of \(\widehat {N}(\xi )\). For k = 1,…,r, let

$$ \hat{w}_{k} := \left( (\hat{v}_{k})^{H} N_{r+1}(\widehat{G})\hat{v}_{k},\ldots, (\hat{v}_{k})^{H} N_{n}(\widehat{G}) \hat{v}_{k} \right). $$
(4.3)

For the label sets J1, J2 as in (3.14), the subtensors \(\widehat {\mathcal {F}}_{J_{1}}, \widehat {\mathcal {F}}_{J_{2}}\) are defined in the same way as \(\mathcal {F}_{J_{1}}\), \(\mathcal {F}_{J_{2}}\). Consider the following linear least squares problems

$$ \begin{array}{@{}rcl@{}} &&\underset{(\beta_{1},\ldots,\beta_{r})}{\min} \left\|\widehat{\mathcal{F}}_{J_{1}} - \sum\limits_{k=1}^{r} \beta_{k} \cdot \hat{v}_{k}\otimes \hat{w}_{k} \right\|^{2}, \end{array} $$
(4.4)
$$ \begin{array}{@{}rcl@{}} &&\underset{(\theta_{1},\ldots,\theta_{r})}{\min} \left\|\widehat{\mathcal{F}}_{J_{2}} - \sum\limits_{k=1}^{r}\theta_{k} \cdot (\hat{v}_{k} \otimes \hat{v}_{k} \otimes \hat{w}_{k})_{J_{2}} \right\|^{2}. \end{array} $$
(4.5)

Let \((\hat {\beta }_{1}, \ldots , \hat {\beta }_{r})\) and \((\hat {\theta }_{1}, \ldots , \hat {\theta }_{r})\) be their optimizers respectively. For each k = 1,…,r, let

$$ \hat{\lambda}_{k} := (\hat{\beta}_{k})^{2}/\hat{\theta}_{k}, \quad \hat{\gamma}_{k} := \hat{\theta_{k}}/\hat{\beta}_{k}. $$

This results in the tensor approximation

$$ \mathcal{F} \approx (\hat{p}_{1})^{\otimes 3}+\cdots+(\hat{p}_{r} )^{\otimes 3}, $$

for the vectors \(\hat {p}_{k} := \sqrt [3]{ \hat {\lambda }_{k}}(1, \hat {\gamma }_{k}\hat {v}_{k}, \hat {w}_{k})\). The above may not give an optimal tensor approximation. To get an improved one, we can use \(\hat {p}_{1},\ldots ,\hat {p}_{r}\) as starting points to solve the following nonlinear optimization

$$ \underset{(q_{1},\ldots,q_{r})}{\min} \left\| \left( \sum\limits_{k=1}^{r} (q_{k})^{\otimes 3} - \widehat{\mathcal{F}}\right)_{{{\varOmega}}} \right\|^{2}. $$
(4.6)

The minimizer of the optimization (4.6) is denoted as \((p_{1}^{\ast },\ldots ,p_{r}^{\ast })\).

Summarizing the above, we have the following algorithm for computing a tensor approximation.

Algorithm 4.1

(Incomplete symmetric tensor approximations)

  • Input: A third order symmetric subtensor \(\widehat {\mathcal {F}}_{{{\varOmega }}}\) and a rank \(r\le \frac {d}{2}-1\).

  • Step 1: Find the matrix \(\widehat {G}\) by solving (4.1) for each \(\alpha =e_{i}+e_{j} \in {\mathscr{B}}_{1}\).

  • Step 2: Choose a generic vector ξ and let \(\widehat {N}(\xi )\) be the matrix as in (4.2). Compute unit length eigenvectors \(\hat {v}_{1},\ldots ,\hat {v}_{r}\) of \(\widehat {N}(\xi )\) and define \(\hat {w}_{i}\) as in (4.3).

  • Step 3: Solve the linear least squares (4.4), (4.5) to get the coefficients \(\hat {\lambda }_{i}, \hat {\gamma }_{i}\).

  • Step 4: For each i = 1,…,r, let \(\hat {p}_{i} := \sqrt [3]{ \hat {\lambda }_{i}}(1,\hat {\gamma }_{i} \hat {v}_{i},\hat {w}_{i})\). Then \((\hat {p}_{1})^{\otimes 3}+\cdots +(\hat {p}_{r})^{\otimes 3}\) is a tensor approximation for \(\widehat {\mathcal {F}}\).

  • Step 5: Use \(\hat {p}_{1},\ldots , \hat {p}_{r}\) as starting points to solve the nonlinear optimization (4.6) for an optimizer \((p_{1}^{\ast },\ldots ,p_{r}^{\ast })\).

  • Output: The tensor approximation \((p_{1}^{\ast })^{\otimes 3}+\cdots +(p_{r}^{\ast })^{\otimes 3}\) for \(\widehat {\mathcal {F}}\).
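The refinement in Step 5 can be carried out with a standard nonlinear least squares solver. The following is a sketch (not the paper's implementation; it uses scipy.optimize.least_squares and assumes real starting points) of minimizing (4.6) over the entries indexed by Ω.

```python
import numpy as np
from scipy.optimize import least_squares

def refine(P0, F_hat):
    """Sketch of Step 5: locally minimize (4.6), starting from the real r x d
    array P0 whose rows are (the real parts of) p^_1, ..., p^_r."""
    r, d = P0.shape
    mask = np.ones((d, d, d), dtype=bool)        # True exactly on the label set Omega
    for a in range(d):
        mask[a, a, :] = mask[a, :, a] = mask[:, a, a] = False

    def residual(x):
        P = x.reshape(r, d)
        T = np.einsum('sa,sb,sc->abc', P, P, P)  # sum_k (q_k)^{(x)3}
        return (T - F_hat)[mask]                 # only the Omega entries enter (4.6)

    sol = least_squares(residual, P0.ravel())
    return sol.x.reshape(r, d)                   # rows are p*_1, ..., p*_r
```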

When \(\widehat {\mathcal {F}}\) is close to \(\mathcal {F}\), Algorithm 4.1 also produces a good rank-r tensor approximation for \(\mathcal {F}\). This is shown in the following.

Theorem 4.2

Suppose the tensor \(\mathcal {F} = (p_{1})^{\otimes 3}+\cdots +(p_{r})^{\otimes 3}\), with \(r \le \frac {d}{2}-1\), satisfies the following conditions:

  (i) The leading entry of each pi is nonzero;

  (ii) the subvectors (p1)2:r+1,…,(pr)2:r+1 are linearly independent;

  (iii) the subvectors (p1)[r+2:j,j+2:d],…,(pr)[r+2:j,j+2:d] are linearly independent for each j ∈ [r + 1,n];

  (iv) the eigenvalues of the matrix N(ξ) in (3.11) are distinct from each other.

Let \(\hat {p}_{i}\), \(p_{i}^{\ast }\) be the vectors produced by Algorithm 4.1. If the distance \(\varepsilon := \|(\widehat {\mathcal {F}}-\mathcal {F})_{{{\varOmega }}}\|\) is small enough, then there exist scalars \(\hat {\tau }_{i}\), \(\tau _{i}^{\ast }\) such that

$$ (\hat{\tau}_{i})^{3} = (\tau_{i}^{\ast})^{3}=1, \quad \|\hat{\tau}_{i} \hat{p}_{i}- p_{i}\| = O(\varepsilon), \quad \|\tau_{i}^{\ast} p^{\ast}_{i}- p_{i}\| = O(\varepsilon), $$

up to a permutation of (p1,…,pr), where the constants inside O(⋅) only depend on \(\mathcal {F}\) and the choice of ξ in Algorithm 4.1.

Proof

The conditions (i)–(ii), by Theorem 3.1, imply that there is a unique generating matrix G for \(\mathcal {F}\). The matrix G can be approximated by solving the linear least squares problems (4.1). Note that

$$ \|A_{ij}[\widehat{\mathcal{F}}] - A_{ij}[\mathcal{F}]\| \leq \varepsilon, \quad \|b_{ij}[\widehat{\mathcal{F}}]-b_{ij}[\mathcal{F}]\|\le \varepsilon, $$

for all \(\alpha =e_{i}+e_{j}\in {\mathscr{B}}_{1}\). The matrix \(A_{ij}[\mathcal {F}]\) can be written as

$$ A_{ij}[\mathcal{F}] = [(p_{1})_{[r+2:j,j+2:d]},\ldots,(p_{r})_{[r+2:j,j+2:d]}]\cdot [(p_{1})_{2:r+1},\ldots,(p_{r})_{2:r+1}]^{T}. $$

By the conditions (ii)–(iii), the matrix \(A_{ij}[\mathcal {F}]\) has full column rank for each j ∈ [r + 1,n] and hence the matrix \(A_{ij}[\widehat {\mathcal {F}}]\) has full column rank when ε is small enough. Therefore, the linear least squares problems (4.1) have unique solutions and the solution \(\widehat {G}\) satisfies that

$$ \|\widehat{G}-{G}\| = O(\varepsilon), $$

where O(ε) depends on \(\mathcal {F}\) (see [15, Theorem 3.4]). For each j = r + 1,…,n, \(N_{j}(\widehat {G})\) is part of the generating matrix \(\widehat {G}\), so

$$ \|N_{j}(\widehat{G})-{N}_{j}(G)\|\le \|\widehat{G}-G\| = O(\varepsilon), \quad j=r+1,\ldots,n. $$

This implies that \(\|\widehat {N}(\xi )-N(\xi )\|=O(\varepsilon )\). When ε is small enough, the matrix \(\widehat {N}(\xi )\) does not have repeated eigenvalues, due to the condition (iv). Thus, the matrix N(ξ) has a set of unit length eigenvectors \(\tilde {v}_{1},\ldots ,\tilde {v}_{r}\) with eigenvalues \(\tilde {w}_{1},\ldots ,\tilde {w}_{r}\) respectively, such that

$$ \|\hat{v}_{i}-\tilde{v}_{i}\| = O(\varepsilon), \quad \|\hat{w}_{i}-\tilde{w}_{i}\| = O(\varepsilon). $$

This follows from Proposition 4.2.1 in [8]. The constants inside the above O(⋅) depend only on \(\mathcal {F}\) and ξ. The vectors \(\tilde {w}_{1},\ldots ,\tilde {w}_{r}\) are scalar multiples of the linearly independent vectors (p1)r+2:d,…,(pr)r+2:d respectively, so \(\tilde {w}_{1},\ldots ,\tilde {w}_{r}\) are linearly independent. When ε is small, \({\hat {w}}_{1},\ldots ,{\hat {w}}_{r}\) are linearly independent as well. The scalars \(\hat {\lambda }_{i} \hat {\gamma }_{i} \) and \(\hat {\lambda }_{i}(\hat {\gamma }_{i})^{2}\) are the optimizers of the linear least squares problems (4.4) and (4.5). By Theorem 3.4 in [15], we have

$$ \|\hat{\lambda}_{i}\hat{\gamma}_{i} - \lambda_{i} \gamma_{i}\| = O(\varepsilon),\quad \|\hat{\lambda}_{i}(\hat{\gamma}_{i})^{2} - \lambda_{i} {\gamma_{i}^{2}}\| = O(\varepsilon). $$

The vector pi can be written as \(p_{i} = \sqrt [3]{\lambda _{i}}(1,\gamma _{i} \tilde {v}_{i},\tilde {w}_{i})\), so we must have λi,γi≠ 0 due to the condition (ii). Thus, it holds that

$$ \|\hat{\lambda}_{i} - {\lambda}_{i}\| = O(\varepsilon),\quad \|\hat{\gamma}_{i} - {\gamma}_{i}\| = O(\varepsilon), $$

where constants inside O(⋅) depend only on \(\mathcal {F}\) and ξ. For the vectors \(\tilde {p}_{i}:=\sqrt [3]{\lambda _{i}}(1,\gamma _{i}\tilde {v}_{i},\tilde {w}_{i})\), we have \(\mathcal {F} = {\sum }_{i=1}^{r} \tilde {p}_{i}^{\otimes 3}\), by Theorem 3.3. Since p1,…,pr are linearly independent by the assumption, the rank decomposition of \(\mathcal {F}\) is unique up to scaling and permutation. There exist scalars \(\hat {\tau }_{i}\) such that \((\hat {\tau }_{i})^{3}=1\) and \(\hat {\tau }_{i} \tilde {p}_{i} = p_{i}\), up to a permutation of p1,…,pr. For \(\hat {p}_{i} = \sqrt [3]{\hat {\lambda }_{i}}(1, \hat {\gamma }_{i} \hat {v}_{i} ,\hat {w}_{i})\), we have \(\|\hat {\tau }_{i} \hat {p}_{i}-p_{i}\|=O(\varepsilon )\), where the constants in O(⋅) only depend on \(\mathcal {F}\) and ξ.

Since \(\| \hat {\tau }_{i} \hat {p}_{i} - p_{i}\|=O(\varepsilon )\), we have \(\|({\sum }_{i=1}^{r} (\hat {p}_{i})^{\otimes 3}-\mathcal {F})_{{{\varOmega }}}\| = O(\varepsilon )\). The \((p_{1}^{\ast },\ldots ,p_{r}^{\ast })\) is a minimizer of (4.6), so

$$ \left\|\left( \sum\limits_{i=1}^{r} (p_{i}^{\ast})^{\otimes 3}-\widehat{\mathcal{F}}\right)_{{{\varOmega}}}\right\| \le \left\|\left( \sum\limits_{i=1}^{r} (\hat{p}_{i})^{\otimes 3}-\widehat{\mathcal{F}}\right)_{{{\varOmega}}}\right\| = O(\varepsilon). $$

For the tensor \(\mathcal {F}^{\ast }:={\sum }_{i=1}^{r} (p_{i}^{\ast })^{\otimes 3}\), we get

$$ \|(\mathcal{F}^{\ast}-\mathcal{F})_{{{\varOmega}}}\|\le\|(\mathcal{F}^{\ast}-\widehat{\mathcal{F}})_{{{\varOmega}}}\|+ \|(\widehat{\mathcal{F}}-\mathcal{F})_{{{\varOmega}}}\|=O(\varepsilon) . $$

When Algorithm 4.1 is applied to \((\mathcal {F}^{\ast })_{{{\varOmega }}}\), Step 4 will give the exact decomposition \(\mathcal {F}^{\ast }={\sum }_{i=1}^{r} (p_{i}^{\ast })^{\otimes 3}\). By repeating the previous argument, we can similarly show that \(\|p_{i}- \tau _{i}^{\ast } p_{i}^{\ast }\|= O(\varepsilon )\) for some \(\tau _{i}^{\ast }\) such that \((\tau _{i}^{\ast })^{3}=1\), where the constants in O(⋅) only depend on \(\mathcal {F}\) and ξ. □

Remark 4.3

For the special case that ε = 0, Algorithm 4.1 is the same as Algorithm 3.4, which produces the exact rank decomposition for \(\mathcal {F}\). The conditions in Theorem 4.2 are satisfied for generic vectors p1,…,pr, since \(r\le \frac {d}{2}-1\). The constant in O(⋅) is not explicitly given in the proof. It is related to the condition number \(\kappa (\mathcal {F})\) for tensor decomposition [5]. It was shown by Breiding and Vannieuwenhoven [5] that

$$ \sqrt{\sum\limits_{i=1}^{r}\|p_{i}^{\otimes 3}-\hat{p}_{i}^{\otimes 3}\|^{2}}\leq\kappa(\mathcal{F})\|\mathcal{F}-\hat{\mathcal{F}}\|+c\varepsilon^{2} $$

for some constant c. The continuity of \(\hat {G}\) in \(\hat {\mathcal {F}}\) is implicit in the proof. Eigenvalues and unit eigenvectors of \(\widehat {N}(\xi )\) are continuous in \(\hat {G}\). Furthermore, \(\hat {\lambda }_{i},\hat {\gamma }_{i}\) are continuous in the eigenvalues and unit eigenvectors. All these functions are locally Lipschitz continuous. Each \(\hat {p}_{i}\) is Lipschitz continuous with respect to \(\hat {\mathcal {F}}\) in a neighborhood of \(\mathcal {F}\), which also implies an error bound for \(\hat {p}_{i}\). The tensors \((p_{i}^{\ast })^{\otimes 3}\) are also locally Lipschitz continuous in \(\widehat {\mathcal {F}}\), as illustrated in [6]. This also gives error bounds for the decomposing vectors \(p_{i}^{\ast }\). We refer to [5, 6] for more details about condition numbers of tensor decompositions.

Example 4.4

We consider the same tensor \(\mathcal {F}\) as in Example 3.2. The subtensor \((\mathcal {F})_{{{\varOmega }}}\) is perturbed to \((\widehat {\mathcal {F}})_{{{\varOmega }}}\). The perturbation is randomly generated from the Gaussian distribution \(\mathcal {N}(0, 0.01)\). For neatness of the paper, we do not display \((\widehat {\mathcal {F}})_{{{\varOmega }}}\) here. We use Algorithm 4.1 to compute the incomplete tensor approximation. The matrices \(A_{ij}[\widehat {\mathcal {F}}]\) and vectors \(b_{ij} [\widehat {\mathcal {F}}]\) are given as follows:

$$ \begin{array}{@{}rcl@{}} A_{13}[\widehat{\mathcal{F}}]&=&A_{23} [\widehat{\mathcal{F}}]=\left[\begin{array}{cc} -0.8135 & 2.7988\\ -1.3697 & 4.0149 \end{array}\right],\qquad b_{13}[\widehat{\mathcal{F}}]=\left[\begin{array}{c}1.5980\\2.1879 \end{array}\right],\quad~~ b_{23}[\widehat{\mathcal{F}}]=\left[\begin{array}{c}-2.0047\\-3.2027 \end{array}\right],\\ A_{14}[\widehat{\mathcal{F}}]&=&A_{24}[\widehat{\mathcal{F}}]=\left[\begin{array}{cc} 1.0277 & -0.8020\\-1.3697 & 4.0149 \end{array}\right],\quad b_{14}[\widehat{\mathcal{F}}]=\left[\begin{array}{c}1.5920\\-3.2013 \end{array}\right],\quad b_{24}[\widehat{\mathcal{F}}]=\left[\begin{array}{c}-2.0059\\7.5915 \end{array}\right],\\ A_{15}[\widehat{\mathcal{F}}]&=&A_{25}[\widehat{\mathcal{F}}]=\left[\begin{array}{cc} 1.0277 & -0.8020\\-0.8135 & 2.7988 \end{array}\right],\quad b_{15}[\widehat{\mathcal{F}}]=\left[\begin{array}{c}2.1993\\-3.2020 \end{array}\right],\quad b_{25}[\widehat{\mathcal{F}}]=\left[\begin{array}{c}-3.1917\\7.6153 \end{array}\right]. \end{array} $$

The linear least square problems (4.1) are solved to obtain \(\widehat {G}\) and \(N_{3}(\widehat {G})\), \(N_{4}(\widehat {G})\), \(N_{5}(\widehat {G})\), which are

$$ \begin{array}{@{}rcl@{}} N_{3}(\widehat{G}) &=& \left[\begin{array}{cc} 0.5156 & 0.7208\\1.6132 & -0.2474 \end{array}\right],\quad N_{4}(\widehat{G})=\left[\begin{array}{cc} 1.2631 & -0.3665\\-0.6489 & 1.6695 \end{array}\right],\\ N_{5}(\widehat{G})&=&\left[\begin{array}{cc} 1.6131 & -0.6752\\-1.2704 & 2.3517 \end{array}\right]. \end{array} $$

For ξ = (3,4,5), the eigendecomposition of the matrix \(\widehat {N}(\xi )\) in (4.2) is

$$ \widehat{N}(\xi) = \left[\begin{array}{cc} -0.7078 & 0.4470 \\ -0.7064 & -0.8945 \end{array}\right]\left[\begin{array}{cc}12.0343 & 0 \\ 0 & 20.0786 \end{array}\right] \left[\begin{array}{cc}-0.7524 & 0.4499 \\ -0.6588 & -0.8931 \end{array}\right]^{-1}. $$

It has eigenvectors \(\hat {v}_{1}=(-0.7078,-0.7064)\), \(\hat {v}_{2}=(0.4470,-0.8945)\). The vectors \(\hat {w}_{1}\), \(\hat {w}_{2}\) obtained as in (4.3) are

$$ \hat{w}_{1} = (1.2021,0.9918,0.9899),\quad \hat{w}_{2} = (-1.0389,2.0145,3.0016). $$

By solving (4.4) and (4.5), we obtain the scalars

$$ \hat{\gamma}_{1}=-1.1990,\quad \hat{\gamma}_{2}=-2.1458, \qquad \hat{\lambda}_{1}=0.4521,\quad \hat{\lambda}_{2}=0.6232. $$

Finally, we obtain the approximate decomposition \(\hat {\lambda }_{1} \hat {u}_{1}^{\otimes 3}+\hat {\lambda }_{2} \hat {u}_{2}^{\otimes 3}\) with

$$ \begin{array}{@{}rcl@{}} \hat{u}_{1}&=&(1,\hat{\gamma}_{1}\hat{v}_{1},\hat{w}_{1})=(1,0.8477,0.8479,1.2021,0.9918,0.9899),\\ \hat{u}_{2}&=&(1,\hat{\gamma}_{2}\hat{v}_{2},\hat{w}_{2})=(1,-0.9776,1.9102,-1.0389,2.0145,3.0016). \end{array} $$

It is quite close to the true decomposition of \(\mathcal {F}\).

5 Learning Diagonal Gaussian Mixture

We use the incomplete tensor decomposition or approximation method to learn parameters for Gaussian mixture models. Algorithms 3.4 and 4.1 can be applied to do that.

Let y be the d-dimensional random vector of a Gaussian mixture model with r components, whose parameters are (ωi,μi,Σi), i = 1,…,r. We consider the case that \( r\le \frac {d}{2}-1\). Let y1,…,yN be samples drawn from the Gaussian mixture model. The sample average

$$ \widehat{M}_{1} :=\frac{1}{N} (y_{1} + {\cdots} + y_{N}) $$

is an estimation for the mean \(M_{1}:=\mathbb {E}[y]=\omega _{1} \mu _{1} + {\cdots } + \omega _{r} \mu _{r}\). The symmetric tensor

$$ \widehat{M}_{3} := \frac{1}{N} (y_{1}^{\otimes 3} + {\cdots} + y_{N}^{\otimes 3}) $$

is an estimation for the third order moment tensor \(M_{3}:=\mathbb {E}[y^{\otimes 3}]\). Recall that \(\mathcal {F} = {\sum }_{i=1}^{r} \omega _{i} \mu _{i}^{\otimes 3}\). When all the covariance matrices Σi are diagonal, we have shown in (1.2) that

$$ M_{3} = \mathcal{F} + \sum\limits_{j=1}^{d}(a_{j}\otimes e_{j}\otimes e_{j} +e_{j}\otimes a_{j}\otimes e_{j}+e_{j}\otimes e_{j}\otimes a_{j}). $$

If the labels i1, i2, i3 are distinct from each other, \((M_{3})_{i_{1}i_{2}i_{3}} = (\mathcal {F})_{i_{1}i_{2}i_{3}}\). Recall the label set Ω in (1.4). It holds that

$$ (M_{3})_{{{\varOmega}}} = (\mathcal{F})_{{{\varOmega}}}. $$
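The sample moments and the label set Ω can be formed directly from the data. The following is a small NumPy sketch (not from the paper; the function name is ours, and array indices are 0-based) that computes \(\widehat {M}_{1}\), \(\widehat {M}_{3}\) and a Boolean mask marking the entries with three distinct labels.

```python
import numpy as np

def sample_moments(Y):
    """Y is an N x d array of samples y_1, ..., y_N. Returns M^_1, M^_3 and a
    Boolean mask marking the entries of M^_3 whose three labels are distinct."""
    N, d = Y.shape
    M1 = Y.mean(axis=0)
    M3 = np.einsum('sa,sb,sc->abc', Y, Y, Y) / N
    mask = np.ones((d, d, d), dtype=bool)
    for a in range(d):
        mask[a, a, :] = mask[a, :, a] = mask[:, a, a] = False
    return M1, M3, mask
```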

Note that \((\widehat {M}_{3})_{{{\varOmega }}}\) is only an approximation for (M3)Ω and \((\mathcal {F})_{{{\varOmega }}}\), due to sampling errors. If the rank \(r\le \frac {d}{2}-1\), we can apply Algorithm 4.1 with the input \((\widehat {M}_{3})_{{{\varOmega }}}\), to compute a rank-r tensor approximation for \(\mathcal {F}\). Suppose the tensor approximation produced by Algorithm 4.1 is

$$ \mathcal{F} \approx (p_{1}^{\ast})^{\otimes 3} + {\cdots} + (p_{r}^{\ast})^{\otimes 3}. $$

The computed \(p_{1}^{\ast },\ldots ,p_{r}^{\ast }\) may not be real vectors, even if \(\mathcal {F}\) is real. When the error \(\varepsilon :=\|(\mathcal {F}-\widehat {M}_{3})_{{{\varOmega }}}\|\) is small, by Theorem 4.2, we know

$$ \|\tau_{i}^{\ast} p_{i}^{\ast}-\sqrt[3]{\omega_{i}}\mu_{i}\| = O(\varepsilon) $$

where \((\tau _{i}^{\ast })^{3}=1\). In computation, we choose \(\tau _{i}^{\ast }\) such that \((\tau _{i}^{\ast })^{3}=1\) and the imaginary part vector \(\text {Im}(\tau _{i}^{\ast } p_{i}^{\ast })\) has the smallest norm. This can be done by checking the imaginary part of \(\tau _{i}^{\ast } p_{i}^{\ast }\) for each of

$$ \tau_{i}^{\ast} = 1, \quad -\frac{1}{2}+\frac{\sqrt{-3}}{2}, \quad -\frac{1}{2}-\frac{\sqrt{-3}}{2}. $$

Then we get the real vector

$$ \hat{q}_{i} := \text{Re}(\tau_{i}^{\ast} p_{i}^{\ast}). $$
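A small illustration of this selection in MATLAB (a sketch; the vector p stands for a computed, possibly complex, vector \(p_{i}^{\ast }\) from Algorithm 4.1, and its entries below are placeholders):

```matlab
p = [1.00 + 0.02i; -0.51 - 0.86i; 0.30 + 0.01i];           % placeholder complex vector p_i
taus = [1, -1/2 + (sqrt(3)/2)*1i, -1/2 - (sqrt(3)/2)*1i];  % the three cube roots of 1
[~, k] = min(arrayfun(@(t) norm(imag(t * p)), taus));      % minimize the imaginary part norm
q_hat = real(taus(k) * p);                                 % the real vector q_i
```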

It is expected that \(\hat {q}_{i} \approx \sqrt [3]{\omega _{i}} \mu _{i}\). Since

$$ M_{1} = \omega_{1} \mu_{1} + {\cdots} + \omega_{r} \mu_{r} \approx \omega_{1}^{2/3} \hat{q}_{1} + {\cdots} + \omega_{r}^{2/3} \hat{q}_{r}, $$

the scalars \(\omega _{1}^{2/3},\ldots ,\omega _{r}^{2/3}\) can be obtained by solving the nonnegative linear least squares problem

$$ \underset{(\beta_{1},\ldots,\beta_{r})\in\mathbb{R}^{r}_{+}}{\min} \left\|\widehat{M}_{1}- \sum\limits_{i=1}^{r} \beta_{i} \hat{q}_{i}\right\|^{2}. $$
(5.1)
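A minimal MATLAB sketch of this step is given below (an illustration; the matrix Q collects the real vectors \(\hat {q}_{i}\) as columns and M1hat is the sample mean, both assumed to be available from the previous steps; placeholder data is used so the snippet runs, and all the computed β_i are assumed to be positive).

```matlab
d = 6; r = 2;
Q = randn(d, r);                        % placeholder for [q_1, ..., q_r]
M1hat = Q * ([0.4; 0.6].^(2/3));        % placeholder sample mean (consistent data)
beta = lsqnonneg(Q, M1hat);             % nonnegative least squares (5.1), beta_i ~ omega_i^(2/3)
omega_hat = beta.^(3/2);                % estimated weights
mu_hat = Q * diag(1 ./ sqrt(beta));     % mu_i = q_i / omega_i^(1/3) = q_i / beta_i^(1/2)
```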

Let \((\beta _{1}^{\ast },\ldots ,\beta _{r}^{\ast })\) be an optimizer of (5.1); then \(\hat {\omega }_{i} := (\beta _{i}^{\ast })^{3/2}\) is a good approximation for ωi and the vector

$$ \hat{\mu}_{i} := \hat{q}_{i} / \sqrt[3]{ \hat{\omega}_{i}} $$

is a good approximation for μi. We may use

$$ \hat{\mu}_{i}, \quad \left( \sum\limits_{j=1}^{r} \hat{\omega}_{j} \right)^{-1} \hat{\omega}_{i}, \quad i=1,\ldots, r $$

as starting points to solve the nonlinear optimization

$$ \left\{ \begin{array}{cl} \underset{(\omega_{1},\ldots,\omega_{r}, \mu_{1},\ldots, \mu_{r})}{\min} & \|{\sum}_{i=1}^{r} \omega_{i} \mu_{i}-\widehat{M}_{1}\|^{2} + \|{\sum}_{i=1}^{r} \omega_{i} (\mu_{i}^{\otimes 3})_{{{\varOmega}}}-(\widehat{M}_{3})_{{{\varOmega}}}\|^{2}\\ \text{subject to} & \omega_{1} + {\cdots} +\omega_{r} = 1,~\omega_{1},\ldots,\omega_{r} \ge 0, \end{array} \right. $$
(5.2)

to obtain improved approximations. Suppose an optimizer of (5.2) is

$$ (\omega_{1}^{\ast}, \ldots, \omega_{r}^{\ast}, \mu_{1}^{\ast},\ldots, \mu_{r}^{\ast}). $$
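A hedged MATLAB sketch of this refinement step is shown below. It solves (5.2) with fmincon over the stacked variable x = [ω; vec(μ)]; the function name refine_gmm_means_weights and its interface are ours, and omega0, mu0, M1hat, M3hat are assumed to come from the earlier steps (for instance, omega0 can be the weights \(\hat {\omega }_{i}\) and mu0 the means \(\hat {\mu }_{i}\) obtained above).

```matlab
function [omega_star, mu_star] = refine_gmm_means_weights(omega0, mu0, M1hat, M3hat)
% Sketch of (5.2): refine weights and means by moment matching with fmincon.
[d, r] = size(mu0);
[I1, I2, I3] = ndgrid(1:d, 1:d, 1:d);
Omega = (I1 ~= I2) & (I1 ~= I3) & (I2 ~= I3);      % label set Omega (distinct indices)
x0 = [omega0(:) / sum(omega0); mu0(:)];            % normalized starting weights
Aeq = [ones(1, r), zeros(1, d * r)]; beq = 1;      % weights sum to one
lb = [zeros(r, 1); -inf(d * r, 1)];                % weights are nonnegative
obj = @(x) objective52(x, M1hat, M3hat, Omega, d, r);
xopt = fmincon(obj, x0, [], [], Aeq, beq, lb, []);
omega_star = xopt(1:r);
mu_star = reshape(xopt(r+1:end), d, r);
end

function v = objective52(x, M1hat, M3hat, Omega, d, r)
% Objective of (5.2): match M1 and the Omega-entries of M3.
omega = x(1:r); mu = reshape(x(r+1:end), d, r);
T = zeros(d, d, d);
for i = 1:r
    ui = mu(:, i);
    T = T + omega(i) * reshape(kron(ui, kron(ui, ui)), d, d, d);
end
v = norm(mu * omega - M1hat)^2 + norm(T(Omega) - M3hat(Omega))^2;
end
```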

Now, we discuss how to estimate the diagonal covariance matrices Σi. Let

$$ \mathcal{A}:=M_{3}-\mathcal{F}, \quad \widehat{\mathcal{A}} := \widehat{M}_{3}-(\hat{q}_{1})^{\otimes 3}-{\cdots} - (\hat{q}_{r} )^{\otimes 3}. $$
(5.3)

By (1.2), we know that

$$ \mathcal{A} = \sum\limits_{j=1}^{d}(a_{j}\otimes e_{j}\otimes e_{j}+e_{j}\otimes a_{j}\otimes e_{j}+e_{j}\otimes e_{j}\otimes a_{j}), $$
(5.4)

where \(a_{j}={\sum }_{i=1}^{r}\omega _{i}\sigma _{ij}^{2}\mu _{i}\) for \(j=1,\dots ,d\). The equation (5.4) implies that

$$ (a_{j})_{j}=\frac{1}{3}\mathcal{A}_{jjj},\quad (a_{j})_{i}=\mathcal{A}_{jij}, $$

for \(i,j=1,\dots ,d\) and ij. So we choose vectors \(\hat {a}_{j} \in \mathbb {R}^{d}\) such that

$$ (\hat{a}_{j})_{j}=\frac{1}{3}\widehat{\mathcal{A}}_{jjj},\quad (\hat{a}_{j})_{i}=\widehat{\mathcal{A}}_{jij} \quad \text{ for }~i \ne j. $$
(5.5)

Since \(\hat {a}_{j}\approx {\sum }_{i=1}^{r}\omega _{i}\sigma _{ij}^{2}\mu _{i}\), the covariance matrices \({{\varSigma }}_{i} = \text {diag}(\sigma _{i1}^{2}, \ldots , \sigma _{id}^{2})\) can be estimated by solving the nonnegative linear least squares problems (j = 1,…,d)

$$ \left\{\begin{array}{cl} \underset{(\beta_{1j}, \ldots, \beta_{rj}) }{\min} & \left\|\hat{a}_{j} - {\sum}_{i=1}^{r} \omega^{\ast}_{i}\mu^{\ast}_{i} \beta_{ij}\right\|^{2} \\ \text{subject to} & \beta_{1j} \ge 0,\ldots, \beta_{rj} \ge 0. \end{array} \right. $$
(5.6)

For each j, let \((\beta ^{\ast }_{1j}, \ldots , \beta ^{\ast }_{rj})\) be the optimizer for the above. When \((\widehat {M}_{3})_{{{\varOmega }}}\) is close to (M3)Ω, it is expected that \(\beta ^{\ast }_{ij}\) is close to (σij)2. Therefore, we can estimate the covariance matrices Σi as follows

$$ {{\varSigma}}_{i}^{\ast} := \text{diag}(\beta^{\ast}_{i1}, \ldots, \beta^{\ast}_{id}), \quad (\sigma_{ij}^{\ast})^{2}:=\beta^{\ast}_{ij}. $$
(5.7)
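The covariance recovery in (5.3)-(5.7) can be sketched in MATLAB as follows (an illustration with placeholder data; M3hat, the vectors \(\hat {q}_{i}\) stored as the columns of Q, and the optimizers \(\omega _{i}^{\ast }\), \(\mu _{i}^{\ast }\) of (5.2) are assumed to be available).

```matlab
d = 6; r = 2;
M3hat = randn(d, d, d);                  % placeholder third order sample moment
Q = randn(d, r);                         % placeholder for [q_1, ..., q_r]
omega_star = [0.4; 0.6];                 % placeholder optimizers of (5.2)
mu_star = randn(d, r);
Ahat = M3hat;                            % Ahat = M3hat - sum_i q_i^{(3)}, as in (5.3)
for i = 1:r
    qi = Q(:, i);
    Ahat = Ahat - reshape(kron(qi, kron(qi, qi)), d, d, d);
end
B = mu_star * diag(omega_star);          % column i is omega_i^* mu_i^*
beta_star = zeros(r, d);                 % beta_star(i, j) approximates sigma_ij^2
for j = 1:d
    aj = reshape(Ahat(j, :, j), [], 1);  % (a_j)_i = Ahat(j, i, j) for i ~= j
    aj(j) = Ahat(j, j, j) / 3;           % (a_j)_j = Ahat(j, j, j) / 3, as in (5.5)
    beta_star(:, j) = lsqnonneg(B, aj);  % nonnegative least squares (5.6)
end
Sigma_star = arrayfun(@(i) diag(beta_star(i, :)), 1:r, 'UniformOutput', false);  % as in (5.7)
```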

The following is the algorithm for learning Gaussian mixture models.

Algorithm 5.1

(Learning diagonal Gaussian mixture models)

Input: Samples \(\{y_{1},\ldots ,y_{N}\} \subseteq \mathbb {R}^{d}\) drawn from a Gaussian mixture model and the number r of component Gaussian distributions.

  Step 1.

    Compute the sample averages \(\widehat {M}_{1} := \frac {1}{N} {\sum }_{i=1}^{N} y_{i}\) and \(\widehat {M}_{3} :=\frac {1}{N}{\sum }_{i=1}^{N} y_{i}^{\otimes 3}\).

  Step 2.

    Apply Algorithm 4.1 to the subtensor \((\widehat {\mathcal {F}})_{{{\varOmega }}} := (\widehat {M}_{3})_{{{\varOmega }}}\). Let \((p_{1}^{\ast })^{\otimes 3}+{\cdots } +(p_{r}^{\ast })^{\otimes 3}\) be the obtained rank-r tensor approximation for \(\widehat {\mathcal {F}}\). For each i = 1,…,r, let \(\hat {q}_{i} :=\text {Re}(\tau _{i}p_{i}^{\ast })\) where τi is the cube root of 1 that minimizes the imaginary part vector norm \(\|\text {Im}(\tau _{i} p_{i}^{\ast })\|\).

  Step 3.

    Solve (5.1) to get \(\hat {\omega }_{1},\ldots ,\hat {\omega }_{r}\) and \(\hat {\mu }_{i} = \hat {q}_{i}/\sqrt [3]{\hat {\omega }_{i}},i=1,\ldots ,r\).

  Step 4.

    Use \(\hat {\mu }_{i}\) and the normalized weights \(\hat {\omega }_{i}/(\hat {\omega }_{1}+\cdots +\hat {\omega }_{r})\) as initial points to solve the nonlinear optimization (5.2) for the optimal \(\omega _{i}^{\ast },\mu _{i}^{\ast }\), i = 1,…,r.

  Step 5.

    Get vectors \(\hat {a}_{1}, \ldots , \hat {a}_{d}\) as in (5.5). Solve the optimization (5.6) to get optimizers \(\beta _{ij}^{\ast }\) and then choose \({{\varSigma }}_{i}^{\ast }\) as in (5.7).

Output: Component Gaussian distribution parameters \((\mu ^{\ast }_{i},{{\varSigma }}^{\ast }_{i},\omega ^{\ast }_{i})\), i = 1,…,r.

The sample averages \(\widehat {M}_{1}\), \(\widehat {M}_{3}\) can typically be used as good estimates for the true moments M1, M3. When the value of r is not known, it can be determined as in Remark 3.6. The performance of Algorithm 5.1 is analyzed as follows.

Theorem 5.2

Consider the d-dimensional diagonal Gaussian mixture model with parameters {(ωi,μi,Σi) : i ∈ [r]} and \(r\le \frac {d}{2}-1\). Let \(\{(\omega ^{\ast }_{i},\mu ^{\ast }_{i},{{\varSigma }}^{\ast }_{i}):i\in [r]\}\) be produced by Algorithm 5.1. If the distance \(\varepsilon :=\max \limits (\|M_{3}-\widehat {M}_{3}\|,\|M_{1}-\widehat {M}_{1}\|) \) is small enough and the tensor \(\mathcal {F}={\sum }_{i=1}^{r}\omega _{i} \mu _{i}^{\otimes 3}\) satisfies conditions of Theorem 4.2, then

$$ \|\mu_{i}-\mu^{\ast}_{i}\| = O(\varepsilon),\quad \|\omega_{i}-\omega_{i}^{\ast}\| = O(\varepsilon),\quad \|{{\varSigma}}_{i} - {{\varSigma}}^{\ast}_{i}\| = O(\varepsilon), $$

where the above constants inside O(⋅) only depend on parameters {(ωi,μi,Σi) : i ∈ [r]} and the choice of ξ in Algorithm 5.1.

Proof

For the vectors \(p_{i}:=\sqrt [3]{\omega _{i}}\mu _{i}\), we have \(\mathcal {F} = {\sum }_{i=1}^{r} p_{i}^{\otimes 3}\). Since

$$ \|(\mathcal{F}-\widehat{\mathcal{F}})_{{{\varOmega}}}\| = \|(M_{3}-\widehat{M}_{3})_{{{\varOmega}}}\| \le \varepsilon $$

and \(\mathcal {F}\) satisfies the conditions of Theorem 4.2, we know from Theorem 4.2 that \(\|\tau _{i}^{\ast } p^{\ast }_{i}-p_{i}\|=O(\varepsilon )\) for some \(\tau _{i}^{\ast }\) with \((\tau _{i}^{\ast })^{3}=1\). The constants inside O(ε) depend on the parameters of the Gaussian model and on ξ. Since the vectors pi are real, it follows that \(\|\text {Im}(\tau _{i}^{\ast } p_{i}^{\ast })\|=O(\varepsilon )\). When ε is small enough, such \(\tau _{i}^{\ast }\) is the \(\tau _{i}\) chosen in Step 2 of Algorithm 5.1 to minimize \(\|\text {Im}(\tau _{i}p_{i}^{\ast })\|\), so we have

$$ \| \hat{q}_{i}-p_{i}\|\le \|\tau_{i}p^{\ast}_{i}-p_{i}\|=O(\varepsilon), $$

where \(\hat {q}_{i}=\text {Re}(\tau _{i}p_{i}^{\ast })\) is from Step 2. The vectors \(\hat {q}_{1},\ldots , \hat {q}_{r}\) are linearly independent when ε is small. Thus, the problem (5.1) has a unique solution and the weights \(\hat {\omega }_{i} \) can be found by solving (5.1). Since \(\|M_{1}-\widehat {M}_{1}\|\le \varepsilon \) and \(\|\hat {q}_{i}-p_{i}\|=O(\varepsilon )\), we have \(\|\omega _{i} - \hat {\omega }_{i}\|=O(\varepsilon )\) (see [15, Theorem 3.4]). The mean vectors \(\hat {\mu }_{i}\) are obtained by \(\hat {\mu }_{i} = \hat {q}_{i}/\sqrt [3]{\hat {\omega }_{i}}\), so the approximation error is

$$ \|\mu_{i} - \hat{\mu}_{i}\|=\|{p}_{i}/\sqrt[3]{{\omega}_{i} }- \hat{q}_{i}/\sqrt[3]{\hat{\omega}_{i}}\| = O(\varepsilon). $$

The constants inside the above O(ε) depend on parameters of the Gaussian mixture model and ξ.

The problem (5.2) is solved to obtain \(\omega ^{\ast }_{i}\) and \(\mu _{i}^{\ast }\). Since the optimal value of (5.2) is no larger than its value at the starting point, which is \(O(\varepsilon ^{2})\), we have

$$ \left\|\widehat{M}_{1} - \sum\limits_{i=1}^{r} \omega_{i}^{\ast} \mu_{i}^{\ast}\right\| + \left\|\left(\widehat{\mathcal{F}} - \sum\limits_{i=1}^{r} \omega_{i}^{\ast} (\mu_{i}^{\ast})^{\otimes 3}\right)_{{{\varOmega}}} \right\| = O(\varepsilon). $$

Let \(\mathcal {F}^{\ast }:={\sum }_{i=1}^{r} \omega _{i}^{\ast } (\mu _{i}^{\ast })^{\otimes 3}={\sum }_{i=1}^{r}(\sqrt [3]{\omega _{i}^{\ast }}\mu _{i}^{\ast })^{\otimes 3}\), then

$$ \|(\mathcal{F}-\mathcal{F}^{\ast})_{{{\varOmega}}}\| \le \|(\mathcal{F}-\widehat{\mathcal{F}})_{{{\varOmega}}}\|+\|(\widehat{\mathcal{F}}-\mathcal{F}^{\ast})_{{{\varOmega}}}\| = O(\varepsilon). $$

Theorem 4.2 implies \(\|p_{i}- \sqrt [3]{\omega _{i}^{\ast }}\mu _{i}^{\ast }\|=O(\varepsilon )\). In addition, we have

$$ \left\|\widehat{M_{1}} - \sum\limits_{i=1}^{r} \omega_{i}^{\ast} \mu_{i}^{\ast}\right\|=\left\|\widehat{M_{1}} - \sum\limits_{i=1}^{r} (\omega_{i}^{\ast})^{2/3} \sqrt[3]{\omega_{i}^{\ast}}\mu_{i}^{\ast}\right\| = O(\varepsilon). $$

The first order moment is \(M_{1}= {\sum }_{i=1}^{r} (\omega _{i})^{2/3} p_{i}\). Since \(\|M_{1}-\hat {M}_{1}\|=O(\varepsilon )\) and \(\|p_{i}- \sqrt [3]{\omega _{i}^{\ast }}\mu _{i}^{\ast }\|=O(\varepsilon )\), it holds that \(\|\omega _{i}^{2/3}-(\omega _{i}^{\ast })^{2/3}\|=O(\varepsilon )\) by [15, Theorem 3.4]. This implies that \(\|\omega _{i}-\omega _{i}^{\ast }\|=O(\varepsilon )\), so

$$ \|\mu_{i}-\mu_{i}^{\ast}\|=\|p_{i}/\sqrt[3]{\omega_{i}}-\left( \sqrt[3]{\omega_{i}^{\ast}}\mu_{i}^{\ast}\right)/\sqrt[3]{\omega_{i}^{\ast}}\|=O(\varepsilon). $$

The constants inside the above O(⋅) only depend on parameters {(ωi,μi,Σi) : i ∈ [r]} and ξ.

The covariance matrices Σi are recovered by solving the linear least squares problems (5.6). In these least squares problems, it holds that \(\|\omega _{i}\mu _{i}-\omega _{i}^{\ast }\mu _{i}^{\ast }\|=O(\varepsilon )\) and

$$ \|\mathcal{A}-\widehat{\mathcal{A}}\|\le \|M_{3}-\widehat{M}_{3}\|+ \|\mathcal{F}-\sum\limits_{i=1}^{r} \hat{q}_{i}^{\otimes 3}\|=O(\varepsilon), $$

where tensors \(\mathcal {A}\), \(\widehat {\mathcal {A}}\) are defined in (5.3). When the error ε is small, the vectors \(\omega _{1}^{\ast }\mu _{1}^{\ast },\ldots ,\omega _{r}^{\ast }\mu _{r}^{\ast }\) are linearly independent and hence (5.6) has a unique solution for each j. By [15, Theorem 3.4], we have

$$ \|(\sigma_{ij})^{2} - (\sigma_{ij}^{\ast})^{2}\| = O(\varepsilon). $$

It implies that \(\|{{\varSigma }}_{i}-{{\varSigma }}^{\ast }_{i} \| = O(\varepsilon )\), where the constants inside O(⋅) only depend on parameters {(ωi,μi,Σi) : i ∈ [r]} and ξ. □

6 Numerical Simulations

This section gives numerical experiments for our proposed methods. The computation is implemented in MATLAB R2019b on an Alienware personal computer with an Intel(R) Core(TM) i7-9700K CPU @ 3.60GHz and 16.0 GB of RAM. The MATLAB function lsqnonlin is used to solve (4.6) in Algorithm 4.1 and the MATLAB function fmincon is used to solve (5.2) in Algorithm 5.1. We compare our method with the classical EM algorithm, which is implemented by the MATLAB function fitgmdist (MaxIter is set to 100 and RegularizationValue is set to 0.001).

First, we show the performance of Algorithm 4.1 for computing incomplete symmetric tensor approximations. For a range of dimensions d and ranks r, we generate the tensor \(\mathcal {F} = (p_{1})^{\otimes 3}+\cdots +(p_{r})^{\otimes 3}\), where each pi is randomly generated according to the Gaussian distribution in MATLAB. Then we add the perturbation \((\widehat {\mathcal {F}})_{{{\varOmega }}} = (\mathcal {F})_{{{\varOmega }}} + \mathcal {E}_{{{\varOmega }}}\), where \(\mathcal {E}\) is a randomly generated tensor, also according to the Gaussian distribution in MATLAB, scaled so that \(\|\mathcal {E}_{{{\varOmega }}}\| = \varepsilon \). After that, Algorithm 4.1 is applied to the subtensor \((\widehat {\mathcal {F}})_{{{\varOmega }}}\) to find the rank-r tensor approximation. The approximation quality is measured by the absolute error and the relative error

$$ \text{abs-error} := \|(\mathcal{F}^{\ast}-\mathcal{F})_{{{\varOmega}}}\|, \quad \text{rel-error} := \frac{\|(\mathcal{F}^{\ast}-\widehat{\mathcal{F}})_{{{\varOmega}}}\|} {\|(\mathcal{F}-\widehat{\mathcal{F}})_{{{\varOmega}}}\|}, $$

where \(\mathcal {F}^{\ast }\) is the output of Algorithm 4.1. For each case of (d,r,ε), we generate 100 random instances. The min, average, and max relative errors for each dimension d and rank r are reported in Table 1. The results show that Algorithm 4.1 performs very well for computing tensor approximations.

Table 1 The performance of Algorithm 4.1
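For reference, one instance of this experiment can be generated as in the following hedged sketch (a reconstruction of the described setup, not the script used for Table 1); the function incomplete_tensor_approx is a hypothetical stand-in for an implementation of Algorithm 4.1 and is therefore left as a comment.

```matlab
d = 20; r = 5; epsilon = 1e-2;           % one (d, r, epsilon) case (placeholders)
P = randn(d, r);                         % random vectors p_1, ..., p_r
F = zeros(d, d, d);
for i = 1:r
    pi_ = P(:, i);
    F = F + reshape(kron(pi_, kron(pi_, pi_)), d, d, d);
end
[I1, I2, I3] = ndgrid(1:d, 1:d, 1:d);
Omega = (I1 ~= I2) & (I1 ~= I3) & (I2 ~= I3);     % label set Omega
E = randn(d, d, d);
E = E * (epsilon / norm(E(Omega)));               % perturbation with norm epsilon on Omega
Fhat_Omega = F(Omega) + E(Omega);                 % perturbed subtensor, input to Algorithm 4.1
% Fstar = incomplete_tensor_approx(Fhat_Omega, Omega, r);   % hypothetical implementation
% abs_err = norm(Fstar(Omega) - F(Omega));
% rel_err = norm(Fstar(Omega) - Fhat_Omega) / norm(F(Omega) - Fhat_Omega);
```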

Second, we explore the performance of Algorithm 5.1 for learning diagonal Gaussian mixture models. We compare it with the classical EM algorithm, for which the MATLAB function fitgmdist is used (MaxIter is set to 100 and RegularizationValue is set to 0.0001). The dimensions d = 20,30,40,50,60 are tested, with three values of r for each case of d. We generate 100 random instances of \(\{(\omega _{i},\mu _{i}, {{\varSigma }}_{i}):i=1,\dots ,r\}\) for d ∈{20,30,40}, and 20 random instances for d ∈{50,60}, because the latter cases take relatively more computation time. For each instance, 10000 samples are generated. To generate the weights ω1,…,ωr, we first use the MATLAB function randi to generate a random 10000-dimensional integer vector with entries from [r]; the frequency of occurrence of each i ∈ [r] is then used as the weight ωi. For each diagonal covariance matrix Σi, its diagonal vector is the entrywise square of a random vector generated by the MATLAB function randn. Each sample is generated from one of the r component Gaussian distributions, so the samples are naturally separated into r groups. For each instance, Algorithm 5.1 and the EM algorithm are applied to fit the Gaussian mixture model to the 10000 samples. For each sample, we calculate its likelihood under each component Gaussian distribution of the estimated mixture model; the sample is classified to the i-th group if its likelihood under the i-th component is the largest. The classification accuracy is the rate at which samples are classified to the correct group. In Table 2, for each pair (d,r), we report the accuracy of Algorithm 5.1 in the first row and the accuracy of the EM algorithm in the second row. As one can see, Algorithm 5.1 performs better than the EM algorithm, and its accuracy is not adversely affected as the dimension and rank increase. Indeed, as the difference between the dimension d and the rank r increases, Algorithm 5.1 becomes more and more accurate, which is the opposite of the EM algorithm's behavior. The reason is that the gap between the number of rows and the number of columns of \(A_{ij}[\mathcal {F}]\) in (3.7) increases as d - r becomes bigger, which makes Algorithm 5.1 more robust.

Table 2 Comparison between Algorithm 5.1 and EM for simulations
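The sample generation and the accuracy computation described above can be sketched as follows (a reconstruction with placeholder sizes, not the script used for Table 2). The assignment rule uses the largest component likelihood, as in the text; the common constant -(d/2)log(2π) is omitted since it does not affect the argmax.

```matlab
d = 20; r = 3; N = 10000;
labels0 = randi(r, N, 1);                          % random labels, as generated with randi
omega = histcounts(labels0, 1:r+1)' / N;           % weights = occurrence frequencies
mu = randn(d, r);                                  % component means
sig2 = randn(d, r).^2;                             % diagonal covariance entries
Y = mu(:, labels0) + sqrt(sig2(:, labels0)) .* randn(d, N);   % samples, one per label
% Classification with an estimated model; the true parameters are reused here only
% to keep the snippet self-contained. In the experiments, (mu_est, sig2_est) come
% from Algorithm 5.1 or from fitgmdist (EM).
mu_est = mu; sig2_est = sig2;
loglik = zeros(r, N);
for i = 1:r
    Z = (Y - mu_est(:, i)) ./ sqrt(sig2_est(:, i));            % standardized residuals
    loglik(i, :) = -0.5 * sum(Z.^2, 1) - 0.5 * sum(log(sig2_est(:, i)));
end
[~, labels_est] = max(loglik, [], 1);              % assign to the most likely component
accuracy = mean(labels_est(:) == labels0(:));      % classification accuracy
```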

Last, we apply Algorithm 5.1 to texture classification. We select 8 textured images of 512 × 512 pixels from the VisTex database (Fig. 1). We use the MATLAB function rgb2gray to convert them into grayscale versions, since we only need their structure and texture information. Each image is divided into subimages of 32 × 32 pixels. We perform the discrete cosine transformation (DCT) on each block of 16 × 16 pixels with an overlap of 8 pixels. Each component of the 'wavelet-like' DCT feature is the sum of the absolute values of the DCT coefficients in the corresponding sub-block, so the dimension d of the feature vector extracted from each subimage is 13. We use blocks extracted from the first 160 subimages for training and those from the remaining 96 subimages for testing. We refer to [47] for more details. For each image, we apply Algorithm 5.1 and the EM algorithm to fit a Gaussian mixture model to the image. We choose the number of components r according to Remark 3.6. To classify the test data, we follow the Bayesian decision rule that assigns each block to the texture maximizing the posterior probability, where we assume a uniform prior over all classes [18]. The classification accuracy is the rate at which subimages are correctly classified; it is shown in Table 3. Algorithm 5.1 outperforms the classical EM algorithm in accuracy on six of the eight images.

Fig. 1 Textures from VisTex

Table 3 Classification results on 8 textures

7 Conclusions and Future Work

This paper gives a new algorithm for learning Gaussian mixture models with diagonal covariance matrices. We first give a method, based on generating polynomials, for computing incomplete symmetric tensor decompositions; it is described in Algorithm 3.4. When the input subtensor has small errors, we can similarly compute an incomplete symmetric tensor approximation, which is given by Algorithm 4.1. We have shown in Theorem 4.2 that if the input subtensor is sufficiently close to a low-rank one, the produced tensor approximation is highly accurate. The unknown parameters of Gaussian mixture models can then be recovered with the incomplete tensor decomposition method, as described in Algorithm 5.1. When the estimates of M1 and M3 are accurate, the parameters recovered by Algorithm 5.1 are also accurate. The numerical simulations demonstrate the good performance of the proposed method.

The proposed method deals with the case where the number of Gaussian components is less than one half of the dimension. How can we compute incomplete symmetric tensor decompositions when the set Ω is not of the form (1.4)? How can we learn parameters of Gaussian mixture models with more components? How can we do so when the covariance matrices are not diagonal? These are important and interesting topics for future research.