Learning Diagonal Gaussian Mixture Models and Incomplete Tensor Decompositions

This paper studies how to learn the parameters of diagonal Gaussian mixture models. The problem can be formulated as computing incomplete symmetric tensor decompositions. We use generating polynomials to compute incomplete symmetric tensor decompositions and approximations, and then apply the tensor approximation method to learn diagonal Gaussian mixture models. We also provide a stability analysis: when the first and third order moments are sufficiently accurate, we show that the obtained parameters of the Gaussian mixture models are also highly accurate. Numerical experiments are also provided.


Introduction
A Gaussian mixture model consists of several component Gaussian distributions. For given samples of a Gaussian mixture model, people often need to estimate the parameters of each component Gaussian distribution [24,32]. Consider a Gaussian mixture model with $r$ components. For each $i \in [r] := \{1, \ldots, r\}$, let $\omega_i$ be the positive probability for the $i$th component Gaussian to appear in the mixture model. We have each $\omega_i > 0$ and $\sum_{i=1}^r \omega_i = 1$. Suppose the $i$th Gaussian distribution is $\mathcal{N}(\mu_i, \Sigma_i)$, where $\mu_i \in \mathbb{R}^d$ is the expectation (or mean) and $\Sigma_i \in \mathbb{R}^{d \times d}$ is the covariance matrix. Let $y \in \mathbb{R}^d$ be the random vector of the Gaussian mixture model and let $y_1, \ldots, y_N$ be independent and identically distributed (i.i.d.) samples from the mixture model. Each $y_j$ is sampled from one of the $r$ component Gaussian distributions, associated with a label $Z_j \in [r]$ indicating the component from which it is sampled. The probability that a sample comes from the $i$th component is $\omega_i$. When one observes only samples without labels, the $Z_j$'s are called latent variables. The density function of the random vector $y$ is
$$f(y) = \sum_{i=1}^r \omega_i \frac{1}{\sqrt{(2\pi)^d \det \Sigma_i}} \exp\Big( -\frac{1}{2} (y - \mu_i)^T \Sigma_i^{-1} (y - \mu_i) \Big),$$
where $\mu_i$ is the mean and $\Sigma_i$ is the covariance matrix of the $i$th component.
Learning a Gaussian mixture model is to estimate the parameters $\omega_i, \mu_i, \Sigma_i$ for each $i \in [r]$, from given samples of $y$. The number of parameters in a covariance matrix grows quadratically with the dimension. Due to the curse of dimensionality, the computation becomes very expensive for large $d$ [35]. Hence, diagonal covariance matrices are preferable in applications. In this paper, we focus on learning Gaussian mixture models with diagonal covariance matrices, i.e.,
$$\Sigma_i = \operatorname{diag}\big( \sigma_{i1}^2, \ldots, \sigma_{id}^2 \big), \quad i = 1, \ldots, r.$$
A natural approach for recovering the unknown parameters $\omega_i, \mu_i, \Sigma_i$ is the method of moments, which estimates the parameters by solving a system of multivariate polynomial equations obtained from moments of the random vector $y$. Directly solving such polynomial systems may encounter non-existence or non-uniqueness of statistically meaningful solutions [57]. However, for diagonal Gaussians, the third order moment tensor helps us avoid these troubles.
Let $M_3 := \mathbb{E}(y \otimes y \otimes y)$ be the third order moment tensor of $y$. One can write $y = \eta(z) + \zeta(z)$, where $z$ is a discrete random variable with $\mathrm{Prob}(z = i) = \omega_i$, $\eta(i) = \mu_i \in \mathbb{R}^d$, and $\zeta(i)$ is the random variable $\zeta_i$ obeying the Gaussian distribution $\mathcal{N}(0, \Sigma_i)$. Assume all $\Sigma_i$ are diagonal; then
$$M_3 = \mathbb{E}\big[ (\eta(z) + \zeta(z))^{\otimes 3} \big] = \sum_{i=1}^r \omega_i\, \mu_i^{\otimes 3} + \mathbb{E}[\eta(z) \otimes \zeta(z) \otimes \zeta(z)] + \mathbb{E}[\zeta(z) \otimes \eta(z) \otimes \zeta(z)] + \mathbb{E}[\zeta(z) \otimes \zeta(z) \otimes \eta(z)].$$
The second equality holds because $\zeta_i$ has zero mean and $\mathbb{E}[\zeta_i \otimes \zeta_i \otimes \zeta_i] = 0$. The random variable $\zeta_i$ has a diagonal covariance matrix, so $\mathbb{E}[(\zeta_i)_j (\zeta_i)_l] = 0$ for $j \ne l$. Therefore,
$$\mathbb{E}[\eta(z) \otimes \zeta(z) \otimes \zeta(z)] = \sum_{i=1}^r \omega_i \sum_{j=1}^d \sigma_{ij}^2\, \mu_i \otimes e_j \otimes e_j = \sum_{j=1}^d a_j \otimes e_j \otimes e_j,$$
where the vectors $a_j$ are given by
$$a_j := \sum_{i=1}^r \omega_i \sigma_{ij}^2\, \mu_i, \quad j = 1, \ldots, d. \tag{1.2}$$
Similarly, we have $\mathbb{E}[\zeta(z) \otimes \eta(z) \otimes \zeta(z)] = \sum_{j=1}^d e_j \otimes a_j \otimes e_j$ and $\mathbb{E}[\zeta(z) \otimes \zeta(z) \otimes \eta(z)] = \sum_{j=1}^d e_j \otimes e_j \otimes a_j$. Therefore, we can express $M_3$ in terms of $\omega_i, \mu_i, \Sigma_i$ as
$$M_3 = \sum_{i=1}^r \omega_i\, \mu_i^{\otimes 3} + \sum_{j=1}^d \big( a_j \otimes e_j \otimes e_j + e_j \otimes a_j \otimes e_j + e_j \otimes e_j \otimes a_j \big). \tag{1.3}$$
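To make the moment structure concrete, here is a small numerical sketch (in Python with NumPy; the function name and code organization are ours, not from the paper) that builds $M_3$ from given parameters via the closed form (1.3). A Monte Carlo average of $y_t \otimes y_t \otimes y_t$ over samples drawn from the mixture should converge to this tensor.

```python
import numpy as np

def M3_diagonal(w, Mu, Sig2):
    """Closed form (1.3): third moment tensor of a diagonal Gaussian mixture.
    w: (r,) weights, Mu: (r, d) means, Sig2: (r, d) diagonal variances."""
    d = Mu.shape[1]
    M3 = np.einsum('t,ti,tj,tk->ijk', w, Mu, Mu, Mu)  # sum_i w_i mu_i^{(x)3}
    A = np.einsum('t,tj,ti->ji', w, Sig2, Mu)         # row j is a_j from (1.2)
    E = np.eye(d)
    M3 += np.einsum('ji,jk,jl->ikl', A, E, E)         # sum_j a_j x e_j x e_j
    M3 += np.einsum('jk,ji,jl->kil', E, A, E)         # sum_j e_j x a_j x e_j
    M3 += np.einsum('jk,jl,ji->kli', E, E, A)         # sum_j e_j x e_j x a_j
    return M3
```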
We are particularly interested in the following third order symmetric tensor
$$F := \sum_{i=1}^r \omega_i\, \mu_i^{\otimes 3}. \tag{1.4}$$
When the labels $i_1, i_2, i_3$ are distinct from each other, we have $F_{i_1 i_2 i_3} = (M_3)_{i_1 i_2 i_3}$. Denote the label set
$$\Omega := \{ (i_1, i_2, i_3) : i_1, i_2, i_3 \in [d] \text{ are distinct from each other} \}. \tag{1.5}$$
The tensor $M_3$ can be estimated from the samples of $y$, so the entries $F_{i_1 i_2 i_3}$ with $(i_1, i_2, i_3) \in \Omega$ can also be obtained from the estimate of $M_3$. To recover the parameters $\omega_i, \mu_i$, we first find the tensor decomposition of $F$ from the partially given entries $F_{i_1 i_2 i_3}$ with $(i_1, i_2, i_3) \in \Omega$. Once the parameters $\omega_i, \mu_i$ are known, we can determine $\Sigma_i$ from the expressions of $a_j$ as in (1.2). The above observation leads to the incomplete tensor decomposition problem: for a third order symmetric tensor $F$ whose partial entries $F_{i_1 i_2 i_3}$ with $(i_1, i_2, i_3) \in \Omega$ are known, we look for vectors $p_1, \ldots, p_r$ such that
$$F_{i_1 i_2 i_3} = \big( p_1^{\otimes 3} + \cdots + p_r^{\otimes 3} \big)_{i_1 i_2 i_3} \quad \text{for all } (i_1, i_2, i_3) \in \Omega.$$
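The known part of $F$ is read off directly from an estimate of $M_3$; a minimal sketch (function name ours) that collects exactly the entries indexed by $\Omega$:

```python
from itertools import permutations

def known_entries(M3_hat):
    """Entries of F obtainable from M3: the labels in Omega of (1.5),
    i.e. all triples of pairwise distinct indices of a (d, d, d) array."""
    d = M3_hat.shape[0]
    return {idx: M3_hat[idx] for idx in permutations(range(d), 3)}
```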
The above is called an incomplete tensor decomposition for $F$. To find such a tensor decomposition, a straightforward approach is tensor completion: first find the unknown tensor entries $F_{i_1 i_2 i_3}$ with $(i_1, i_2, i_3) \notin \Omega$ such that the completed $F$ has low rank, and then compute the tensor decomposition of $F$. However, this approach has serious disadvantages. The theory for tensor completion or recovery, especially for symmetric tensors, is not yet mature, and low rank tensor completion or recovery is typically not guaranteed by the currently existing methodology. Most methods for tensor completion are based on convex relaxations, e.g., nuclear norm or trace minimization [22,36,41,54,58], and these convex relaxations may not produce low rank completions [51].
In this paper, we propose a new method for computing incomplete tensor decompositions. It is based on the generating polynomial method in [40]. The label set $\Omega$ consists of triples $(i_1, i_2, i_3)$ of distinct labels. We can still determine some generating polynomials from the partially given tensor entries $F_{i_1 i_2 i_3}$ with $(i_1, i_2, i_3) \in \Omega$, and they can be used to get the incomplete tensor decomposition. We show that this approach works very well when the rank $r$ is roughly no more than half of the dimension $d$. Consequently, the parameters of the Gaussian mixture model can be recovered from the incomplete tensor decomposition of $F$.

Related Work

Gaussian mixture models have broad applications in machine learning, e.g., automatic speech recognition [30,48,50], hyperspectral unmixing [4,34], background subtraction [60,32] and anomaly detection [56]. They also have applications in the social and biological sciences [25,53,59].
There exist various methods for estimating the unknown parameters of Gaussian mixture models. A popular one is the expectation-maximization (EM) algorithm, which iteratively approximates the maximum likelihood parameter estimate [16]. This approach is widely used in applications, but its convergence properties are not very reliable [49]. Dasgupta [13] introduced a method that first projects data to a randomly chosen low-dimensional subspace and then uses the empirical means and covariances of low-dimensional clusters to estimate the parameters. Later, Arora and Kannan [52] extended this idea to arbitrary Gaussians. Vempala and Wang [55] introduced a spectral technique that weakens the required separation condition by projecting data to principal components of the sample matrix instead of a random subspace. For other subsequent work, we refer to Dasgupta and Schulman [14], Kannan et al. [29], Achlioptas et al. [1], Chaudhuri and Rao [10], Brubaker and Vempala [7], and Chaudhuri et al. [9].
Another frequently used approach is based on moments, introduced by Pearson [46]. Belkin and Sinha [3] proposed a learning algorithm for identical spherical Gaussians ($\Sigma_i = \sigma^2 I$) with arbitrarily small separation between mean vectors. It was also shown in [28] that a mixture of two Gaussians can be learned with provably minimal assumptions. Hsu and Kakade [27] provided a learning algorithm for mixtures of spherical Gaussians, i.e., where each covariance matrix is a multiple of the identity matrix; this method is based on moments up to order three and only assumes non-degeneracy instead of separation. For general covariance matrices, Ge et al. [23] proposed a learning method for when the dimension $d$ is sufficiently high. More moment-based methods for general latent variable models can be found in [2].
Contributions

This paper proposes a new method for learning diagonal Gaussian mixture models, based on estimates of the first and third order moments. Let $y_1, \ldots, y_N$ be samples and let $\{(\omega_i, \mu_i, \Sigma_i) : i \in [r]\}$ be the parameters of the diagonal Gaussian mixture model, where each covariance matrix $\Sigma_i$ is diagonal. We use the samples $y_1, \ldots, y_N$ to estimate the third order moment tensor $M_3$, as well as the mean vector $M_1$. We have seen that the tensor $M_3$ can be expressed as in (1.3).
For the tensor $F$ in (1.4), we have $F_{i_1 i_2 i_3} = (M_3)_{i_1 i_2 i_3}$ when the labels $i_1, i_2, i_3$ are distinct from each other. The other entries of $F$ are not known, since the vectors $a_j$ are not available, so $F$ is an incompletely given tensor. We give a new method for computing the incomplete tensor decomposition of $F$ when the rank $r$ is low (roughly no more than half of the dimension $d$). The tensor decomposition of $F$ is unique under some genericity conditions [11], so it can be used to recover the parameters $\omega_i, \mu_i$. To compute the incomplete tensor decomposition of $F$, we use the generating polynomial method in [40,42]. We look for a special set of generating polynomials for $F$, which can be obtained by solving linear least squares problems that use only the known entries of $F$. The common zeros of these generating polynomials can be determined from eigenvalue decompositions. Under some genericity assumptions, these common zeros produce the incomplete tensor decomposition. After this is done, the parameters $\omega_i, \mu_i$ can be recovered by solving linear systems, and the diagonal covariance matrices $\Sigma_i$ can be estimated by solving linear least squares. The tensor $M_3$ is estimated from the samples $y_1, \ldots, y_N$, so the tensor entries $(M_3)_{i_1 i_2 i_3}$ and $F_{i_1 i_2 i_3}$ are typically not precisely given. We also provide a stability analysis for this case, showing that the estimated parameters are accurate when the entries $(M_3)_{i_1 i_2 i_3}$ have small errors.
The paper is organized as follows. In Section 2, we review some basic results on symmetric tensor decompositions and generating polynomials. In Section 3, we give a new algorithm for computing an incomplete tensor decomposition of $F$ when only its subtensor $F_\Omega$ is known. Section 4 gives the stability analysis for when the subtensor $F_\Omega$ has errors. Section 5 gives the algorithm for learning Gaussian mixture models. Numerical experiments and applications are given in Section 6. Conclusions and discussions are given in Section 7.

Preliminary
Notation. Denote by $\mathbb{N}$, $\mathbb{C}$ and $\mathbb{R}$ the sets of nonnegative integers, complex numbers and real numbers, respectively. Denote the cardinality of a set $L$ as $|L|$. Denote by $e_i$ the $i$th standard unit basis vector, i.e., the $i$th entry of $e_i$ is one and all others are zeros. For a complex number $c$, $\sqrt[n]{c}$ or $c^{1/n}$ denotes the principal $n$th root of $c$. For a complex vector $v$, $\mathrm{Re}(v)$ and $\mathrm{Im}(v)$ denote the real part and imaginary part of $v$, respectively. A property is said to be generic if it is true in the whole space except a subset of zero Lebesgue measure. The $\|\cdot\|$ denotes the Euclidean norm of a vector or the Frobenius norm of a matrix. For a vector or matrix, the superscript $T$ denotes the transpose and $H$ denotes the conjugate transpose. For $i, j \in \mathbb{N}$, $[i]$ denotes the set $\{1, 2, \ldots, i\}$ and $[i, j]$ denotes the set $\{i, i+1, \ldots, j\}$ if $i \le j$. For a vector $v$, $v_{i_1:i_2}$ denotes the subvector $(v_{i_1}, v_{i_1+1}, \ldots, v_{i_2})$. For a matrix $A$, denote by $A_{[i_1:i_2,\, j_1:j_2]}$ the submatrix of $A$ whose row labels are $i_1, i_1+1, \ldots, i_2$ and whose column labels are $j_1, j_1+1, \ldots, j_2$. For a tensor $F$, its subtensor $F_{[i_1:i_2,\, j_1:j_2,\, k_1:k_2]}$ is defined similarly.
Let $S^m(\mathbb{C}^d)$ (resp., $S^m(\mathbb{R}^d)$) denote the space of $m$th order symmetric tensors over the vector space $\mathbb{C}^d$ (resp., $\mathbb{R}^d$). For convenience of notation, the labels for tensors start with 0. A symmetric tensor $A \in S^m(\mathbb{C}^{n+1})$ is labelled as
$$A = \big( A_{i_1 \ldots i_m} \big)_{0 \le i_1, \ldots, i_m \le n},$$
where the entry $A_{i_1 \ldots i_m}$ is invariant under all permutations of $(i_1, \ldots, i_m)$. The Hilbert-Schmidt norm $\|A\|$ is defined as
$$\|A\| := \Big( \sum_{0 \le i_1, \ldots, i_m \le n} |A_{i_1 \ldots i_m}|^2 \Big)^{1/2}.$$
The norm of a subtensor $A_\Omega$ is defined similarly. For a vector $u := (u_0, u_1, \ldots, u_n) \in \mathbb{C}^{n+1}$, the tensor power $u^{\otimes m} := u \otimes \cdots \otimes u$, where $u$ is repeated $m$ times, is defined such that $(u^{\otimes m})_{i_1 \ldots i_m} = u_{i_1} \times \cdots \times u_{i_m}$. For a symmetric tensor $F$, its symmetric rank is
$$\operatorname{rank}_S(F) := \min\Big\{ r : F = \sum_{i=1}^r u_i^{\otimes m} \Big\}.$$
There are other types of tensor ranks [31,33]. In this paper, we only deal with symmetric tensors and symmetric ranks. We refer to [12,17,21,26,31,33] for general work about tensors and their ranks. For convenience, if $r = \operatorname{rank}_S(F)$, we call $F$ a rank-$r$ tensor, and $F = \sum_{i=1}^r u_i^{\otimes m}$ is called a rank decomposition.
For a power $\alpha := (\alpha_1, \alpha_2, \ldots, \alpha_n) \in \mathbb{N}^n$ and $x := (x_1, x_2, \ldots, x_n)$, denote
$$|\alpha| := \alpha_1 + \alpha_2 + \cdots + \alpha_n, \quad x^\alpha := x_1^{\alpha_1} x_2^{\alpha_2} \cdots x_n^{\alpha_n}, \quad x^0 := 1.$$
The monomial power set of degree $m$ is denoted as
$$\mathbb{N}^n_m := \{ \alpha \in \mathbb{N}^n : |\alpha| \le m \}.$$
Let $\mathbb{C}[x]_m$ be the space of all polynomials in $x$ with complex coefficients whose degrees are no more than $m$. For a cubic polynomial $p \in \mathbb{C}[x]_3$ and $F \in S^3(\mathbb{C}^{n+1})$, we define the bilinear product (note that $x_0 = 1$)
$$\langle p, F \rangle := \sum_{0 \le i_1, i_2, i_3 \le n} p_{i_1 i_2 i_3} F_{i_1 i_2 i_3}, \quad \text{for } p = \sum_{0 \le i_1, i_2, i_3 \le n} p_{i_1 i_2 i_3}\, x_{i_1} x_{i_2} x_{i_3},$$
where $p_{i_1 i_2 i_3}$ are the coefficients of $p$. A polynomial $g \in \mathbb{C}[x]_3$ is called a generating polynomial for a symmetric tensor $F \in S^3(\mathbb{C}^{n+1})$ if
$$\langle g \cdot x^\beta, F \rangle = 0 \quad \text{for all } \beta \in \mathbb{N}^n \text{ with } |\beta| \le 3 - \deg(g),$$
where $\deg(g)$ denotes the degree of $g$ in $x$. When the order is bigger than 3, we refer to [40] for the definition of generating polynomials. They can be used to compute symmetric tensor decompositions and low rank approximations [40,42], which are closely related to truncated moment problems and polynomial optimization [20,37,38,39,43]. There are special versions of symmetric tensors and their decompositions [19,44,45].
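As an illustration of these definitions, here is a hedged sketch (helper names ours) of the bilinear product and the generating polynomial test for the third order case, where a quadratic $g$ is represented by its coefficient matrix under the convention $x_0 = 1$; multiplying $g$ by $x_t$ for $t = 0, \ldots, n$ exhausts all $\beta$ with $|\beta| \le 3 - \deg(g)$:

```python
import numpy as np

def pairing(P, F):
    """Bilinear product <p, F> for a cubic p with (symmetrized) coefficient
    tensor P[i1, i2, i3], following the convention x_0 = 1."""
    return np.einsum('ijk,ijk->', P, F)

def is_generating(P2, F, tol=1e-10):
    """Test whether g = sum_{i,j} P2[i, j] x_i x_j is a generating
    polynomial for F, i.e. <g * x_t, F> = 0 for every t = 0, ..., n."""
    residues = np.einsum('ij,ijt->t', P2, F)  # one residue per multiplier x_t
    return bool(np.all(np.abs(residues) < tol))
```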

Incomplete Tensor Decomposition
This section discusses how to compute an incomplete tensor decomposition for a symmetric tensor $F \in S^3(\mathbb{C}^d)$ when only its subtensor $F_\Omega$ is given, for the label set $\Omega$ in (1.5). For convenience of notation, the labels for $F$ begin with zero while a vector $u \in \mathbb{C}^d$ is still labelled as $u := (u_1, \ldots, u_d)$. We set $n := d - 1$. For a given rank $r$, denote the monomial set
$$\mathbb{B}_1 := \{ x_i x_j : i \in [r],\ j \in [r+1, n] \}. \tag{3.1}$$
For a monomial power $\alpha \in \mathbb{N}^n$, by writing $\alpha \in \mathbb{B}_1$, we mean that $x^\alpha \in \mathbb{B}_1$, i.e., $\alpha = e_i + e_j$ for some $i \in [r]$ and $j \in [r+1, n]$. A matrix $G \in \mathbb{C}^{[r] \times \mathbb{B}_1}$ is called a generating matrix for $F$ if, for every $\alpha = e_i + e_j \in \mathbb{B}_1$, the polynomial
$$\varphi_{ij}[G](x) := \sum_{k=1}^r G(k, \alpha)\, x_k - x_i x_j \tag{3.2}$$
is a generating polynomial for $F$, i.e.,
$$\langle \varphi_{ij}[G] \cdot x_t,\, F \rangle = 0 \quad \text{for all } t = 0, 1, \ldots, n \text{ and all } \alpha = e_i + e_j \in \mathbb{B}_1. \tag{3.3}$$
The notion of generating matrix is motivated by the fact that the entire tensor $F$ can be recursively determined by $G$ and its first $r$ entries (see [40]). The existence and uniqueness of the generating matrix $G$ are shown as follows.

Theorem 3.1. Suppose the tensor $F \in S^3(\mathbb{C}^{n+1})$ has the decomposition
$$F = \sum_{i=1}^r \lambda_i\, (1, u_i)^{\otimes 3}, \tag{3.4}$$
for vectors $u_i \in \mathbb{C}^n$ and nonzero scalars $\lambda_i \in \mathbb{C}$. If the subvectors $(u_1)_{1:r}, \ldots, (u_r)_{1:r}$ are linearly independent, then the generating matrix $G$ for $F$ satisfying (3.3) exists and is unique.
Proof. We first prove the existence. For each $i = 1, \ldots, r$, denote the vectors $v_i := (u_i)_{1:r}$. Under the given assumption, $v_1, \ldots, v_r$ are linearly independent, so the matrix $V := [v_1, \ldots, v_r]$ is invertible. For each $l = r+1, \ldots, n$, let
$$N_l := V \operatorname{diag}\big( (u_1)_l, \ldots, (u_r)_l \big)\, V^{-1}. \tag{3.5}$$
Then $N_l v_i = (u_i)_l\, v_i$ for $i = 1, \ldots, r$, i.e., $N_l$ has eigenvalues $(u_1)_l, \ldots, (u_r)_l$ with corresponding eigenvectors $(u_1)_{1:r}, \ldots, (u_r)_{1:r}$. We select $G \in \mathbb{C}^{[r] \times \mathbb{B}_1}$ to be the matrix such that
$$G(k, e_s + e_l) = (N_l)_{s,k} \quad \text{for all } s, k \in [r],\ l \in [r+1, n].$$
For each $s = 1, \ldots, r$ and $l = r+1, \ldots, n$, we then have
$$\varphi_{sl}[G](u_i) = (N_l v_i)_s - (u_i)_s (u_i)_l = (u_i)_l (v_i)_s - (u_i)_s (u_i)_l = 0, \quad i = 1, \ldots, r.$$
For each $t = 1, \ldots, n$, it holds that
$$\langle \varphi_{sl}[G] \cdot x_t,\, F \rangle = \sum_{i=1}^r \lambda_i (u_i)_t\, \varphi_{sl}[G](u_i) = 0.$$
When $t = 0$, we can similarly get $\langle \varphi_{sl}[G], F \rangle = \sum_{i=1}^r \lambda_i\, \varphi_{sl}[G](u_i) = 0$. Therefore, the matrix $G$ satisfies (3.3) and it is a generating matrix for $F$. Second, we prove the uniqueness of such a $G$. For each $\alpha = e_i + e_j \in \mathbb{B}_1$, the condition (3.3) is a linear system in $G(:, \alpha)$: let $g_{ij} := \big( \langle x_i x_j x_t, F \rangle \big)_{t=0}^n$ and let $\mathbf{F} := \big( \langle x_k x_t, F \rangle \big)_{0 \le t \le n,\ k \in [r]}$, so that (3.3) reads $\mathbf{F} \cdot G(:, \alpha) = g_{ij}$. One can write $\mathbf{F} = [\tilde u_1, \ldots, \tilde u_r] \operatorname{diag}(\lambda_1, \ldots, \lambda_r)\, [v_1, \ldots, v_r]^T$ with $\tilde u_i := (1, u_i)$. The sets $\{v_1, \ldots, v_r\}$ and $\{\tilde u_1, \ldots, \tilde u_r\}$ are both linearly independent. Since each $\lambda_i \ne 0$, the matrix $\mathbf{F}$ has full column rank. Hence, the generating matrix $G$ satisfying $\mathbf{F} \cdot G(:, e_i + e_j) = g_{ij}$ for all $i \in [r]$, $j \in [r+1, n]$ is unique. $\square$
The following is an example of generating matrices.
Example 3.2. Consider a tensor $F \in S^3(\mathbb{C}^6)$ given as in (3.4) with $r = 2$. The matrices $N_3, N_4, N_5$ are formed as in (3.5), and the entries of the generating matrix $G$ are obtained from them. The generating polynomials in (3.2) are $\varphi_{1j}[G](x)$ and $\varphi_{2j}[G](x)$ for $j = 3, 4, 5$. For $x$ to be a common zero of $\varphi_{1j}[G](x)$ and $\varphi_{2j}[G](x)$, it is required that $(x_1, x_2)$ is an eigenvector of $N_j$ with the corresponding eigenvalue $x_j$. (The numerical data of this example are omitted here.)
3.1. Computing the tensor decomposition. We show how to find an incomplete tensor decomposition (3.4) for $F$ when only its subtensor $F_\Omega$ is given, where the label set $\Omega$ is as in (1.5). Suppose that $F$ has the decomposition (3.4), for vectors $u_i \in \mathbb{C}^n$ and nonzero scalars $\lambda_i \in \mathbb{C}$. Assume the subvectors $(u_1)_{1:r}, \ldots, (u_r)_{1:r}$ are linearly independent, so there is a unique generating matrix $G$ for $F$, by Theorem 3.1. For each $\alpha = e_i + e_j \in \mathbb{B}_1$ with $i \in [r]$, $j \in [r+1, n]$, and for each $l = r+1, \ldots, j-1, j+1, \ldots, n$, the generating matrix $G$ satisfies the equations
$$\sum_{k=1}^r G(k, e_i + e_j)\, F_{0,k,l} = F_{i,j,l}. \tag{3.8}$$
To distinguish changes in the labels of tensor entries of $F$, commas are inserted to separate the labelling numbers. All entries of $F$ appearing in (3.8) have distinct labels, so they are known. The equations in (3.8) can be equivalently written as
$$A_{ij}[F]\, G(:, \alpha) = b_{ij}[F], \tag{3.10}$$
where $A_{ij}[F]$ is the matrix whose rows are $\big( F_{0,1,l}, \ldots, F_{0,r,l} \big)$ and $b_{ij}[F]$ is the vector with entries $F_{i,j,l}$, for $l \in [r+1, n] \setminus \{j\}$ (this defines (3.9)). Since $r \le \frac{d}{2} - 1$, we have $n - r - 1 \ge r$; thus, the number of rows is not less than the number of columns for the matrices $A_{ij}[F]$. If $A_{ij}[F]$ has linearly independent columns, then (3.10) uniquely determines $G(:, \alpha)$. In such a case, the matrix $G$ can be fully determined by the linear systems (3.10). Let $N_{r+1}(G), \ldots, N_n(G) \in \mathbb{C}^{r \times r}$ be the matrices given as
$$N_l(G)(s, k) := G(k, e_s + e_l), \quad s, k \in [r],\ l \in [r+1, n]. \tag{3.11}$$
As in the proof of Theorem 3.1, one can see that
$$N_l(G)\, v_i = (w_i)_{l-r}\, v_i, \quad i = 1, \ldots, r, \tag{3.12}$$
for the vectors ($i = 1, \ldots, r$)
$$v_i := (u_i)_{1:r}, \qquad w_i := (u_i)_{r+1:n}. \tag{3.13}$$
Each $v_i$ is a common eigenvector of the matrices $N_{r+1}(G), \ldots, N_n(G)$, and $(w_i)_{l-r}$ is the associated eigenvalue of $N_l(G)$. These matrices may or may not have repeated eigenvalues. Therefore, we select a generic vector $\xi := (\xi_{r+1}, \ldots, \xi_n)$ and let
$$N(\xi) := \xi_{r+1} N_{r+1}(G) + \cdots + \xi_n N_n(G). \tag{3.14}$$
The eigenvalues of $N(\xi)$ are $\xi^T w_1, \ldots, \xi^T w_r$. When $w_1, \ldots, w_r$ are distinct from each other and $\xi$ is generic, the matrix $N(\xi)$ does not have a repeated eigenvalue and hence it has unique eigenvectors $v_1, \ldots, v_r$, up to scaling. Let $\tilde v_1, \ldots, \tilde v_r$ be unit length eigenvectors of $N(\xi)$. They are also common eigenvectors of $N_{r+1}(G), \ldots, N_n(G)$. For each $i = 1, \ldots, r$, let $\tilde w_i$ be the vector such that its $j$th entry $(\tilde w_i)_j$ is the eigenvalue of $N_{j+r}(G)$ associated to the eigenvector $\tilde v_i$, or equivalently,
$$N_{j+r}(G)\, \tilde v_i = (\tilde w_i)_j\, \tilde v_i, \quad j = 1, \ldots, n-r. \tag{3.15}$$
Up to a permutation of $(\tilde v_1, \ldots, \tilde v_r)$, there exist scalars $\gamma_i$ such that
$$v_i = \gamma_i\, \tilde v_i, \qquad w_i = \tilde w_i, \quad i = 1, \ldots, r. \tag{3.16}$$
The tensor decomposition of $F$ can also be written as
$$F = \sum_{i=1}^r \lambda_i\, \big( 1,\ \gamma_i \tilde v_i,\ \tilde w_i \big)^{\otimes 3}.$$
The scalars $\lambda_1, \ldots, \lambda_r$ and $\gamma_1, \ldots, \gamma_r$ satisfy the linear equations
$$\lambda_1 \gamma_1\, \tilde v_1 \otimes \tilde w_1 + \cdots + \lambda_r \gamma_r\, \tilde v_r \otimes \tilde w_r = F_{[0,\, 1:r,\, r+1:n]},$$
$$\lambda_1 \gamma_1^2\, \tilde v_1 \otimes \tilde v_1 \otimes \tilde w_1 + \cdots + \lambda_r \gamma_r^2\, \tilde v_r \otimes \tilde v_r \otimes \tilde w_r = F_{[1:r,\, 1:r,\, r+1:n]}.$$
Denote the label sets
$$J_1 := \{ (0, k, l) : k \in [r],\ l \in [r+1, n] \}, \qquad J_2 := \{ (i_1, i_2, l) : i_1 \ne i_2 \in [r],\ l \in [r+1, n] \}, \tag{3.17}$$
whose corresponding entries of $F$ are all known. To determine the scalars $\lambda_i, \gamma_i$, we solve the linear least squares
$$\min_{(\beta_1, \ldots, \beta_r)}\ \sum_{(0, k, l) \in J_1} \Big| \Big( \sum_{i=1}^r \beta_i\, \tilde v_i \otimes \tilde w_i \Big)_{k,\, l-r} - F_{0,k,l} \Big|^2, \tag{3.18}$$
$$\min_{(\theta_1, \ldots, \theta_r)}\ \sum_{(i_1, i_2, l) \in J_2} \Big| \Big( \sum_{i=1}^r \theta_i\, \tilde v_i \otimes \tilde v_i \otimes \tilde w_i \Big)_{i_1,\, i_2,\, l-r} - F_{i_1, i_2, l} \Big|^2. \tag{3.19}$$
Let $(\beta_1^*, \ldots, \beta_r^*)$ and $(\theta_1^*, \ldots, \theta_r^*)$ be minimizers of (3.18) and (3.19), respectively. Then, for each $i = 1, \ldots, r$, let
$$\lambda_i := (\beta_i^*)^2 / \theta_i^*, \qquad \gamma_i := \theta_i^* / \beta_i^*. \tag{3.20}$$
For the vectors ($i = 1, \ldots, r$)
$$p_i := \sqrt[3]{\lambda_i}\, \big( 1,\ \gamma_i \tilde v_i,\ \tilde w_i \big),$$
the sum $p_1^{\otimes 3} + \cdots + p_r^{\otimes 3}$ is a tensor decomposition for $F$. This is justified in the following theorem.

Theorem 3.3. Suppose the tensor $F$ has the decomposition as in (3.4). Assume that the vectors $v_1, \ldots, v_r$ are linearly independent and the vectors $w_1, \ldots, w_r$ are distinct from each other, where $v_1, \ldots, v_r, w_1, \ldots, w_r$ are defined as in (3.13). Let $\xi$ be a generically chosen coefficient vector and let $p_1, \ldots, p_r$ be the vectors produced as above. Then the tensor decomposition $F = p_1^{\otimes 3} + \cdots + p_r^{\otimes 3}$ is unique.
Proof. Since $v_1, \ldots, v_r$ are linearly independent, the tensor decomposition (3.4) is unique, up to scalings and permutations. By Theorem 3.1, there is a unique generating matrix $G$ for $F$ satisfying (3.3). Under the given assumptions, the equations (3.10) uniquely determine $G$. Note that $\xi^T w_1, \ldots, \xi^T w_r$ are the eigenvalues of $N(\xi)$ and $v_1, \ldots, v_r$ are the corresponding eigenvectors. When $\xi$ is generically chosen, the values $\xi^T w_1, \ldots, \xi^T w_r$ are distinct eigenvalues of $N(\xi)$, so $N(\xi)$ has a unique eigenvalue decomposition, and hence (3.16) must hold, up to a permutation of $(v_1, \ldots, v_r)$. Since the coefficient matrices of (3.18) and (3.19) have full column rank, these linear least squares problems have unique optimal solutions. Up to a permutation of $p_1, \ldots, p_r$, it holds that $p_i = \sqrt[3]{\lambda_i}\, (1, u_i)$. Then the conclusion follows readily. $\square$
The following is the algorithm for computing an incomplete tensor decomposition for $F$ when only its subtensor $F_\Omega$ is given.

Algorithm 3.4 (Incomplete symmetric tensor decomposition).
Input: The subtensor $F_\Omega$ of $F \in S^3(\mathbb{C}^d)$, for the label set $\Omega$ in (1.5), and a rank $r \le \frac{d}{2} - 1$.
Step 1. Solve the linear systems (3.10) to obtain the generating matrix $G$.
Step 2. Compute the unit length eigenvectors $\tilde v_1, \ldots, \tilde v_r$ of the matrix $N(\xi)$ as in (3.14), for a randomly selected vector $\xi$, and compute the vectors $\tilde w_1, \ldots, \tilde w_r$ as in (3.15).
Step 3. Solve the linear least squares (3.18) and (3.19), and compute the scalars $\lambda_i, \gamma_i$ as in (3.20). Let $p_i := \sqrt[3]{\lambda_i}\, (1, \gamma_i \tilde v_i, \tilde w_i)$ for $i = 1, \ldots, r$.
Output: The decomposing vectors $p_1, \ldots, p_r$ with $F = p_1^{\otimes 3} + \cdots + p_r^{\otimes 3}$.
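Algorithm 3.4 reduces to a handful of dense linear algebra operations. The following Python sketch (NumPy; all names are ours, and we assume $F$ is stored as a full $(n+1)^3$ array whose entries are trusted only on $\Omega$) mirrors Steps 1-3; it is an illustration of the reconstruction above, not a tuned implementation:

```python
import numpy as np

def incomplete_sym3_decomp(F, r, seed=0):
    """Sketch of Algorithm 3.4. F: (n+1, n+1, n+1) array, valid only on
    entries with pairwise distinct labels; r: rank with r <= d/2 - 1."""
    n = F.shape[0] - 1
    rng = np.random.default_rng(seed)
    # Step 1: generating matrix G from the least squares systems (3.10).
    G = {}
    for i in range(1, r + 1):
        for j in range(r + 1, n + 1):
            ls = [l for l in range(r + 1, n + 1) if l != j]
            A = np.array([[F[0, k, l] for k in range(1, r + 1)] for l in ls])
            b = np.array([F[i, j, l] for l in ls])
            G[(i, j)] = np.linalg.lstsq(A, b, rcond=None)[0]
    # Matrices N_l(G) of (3.11): row s of N_l(G) is G(:, e_s + e_l).
    N = {l: np.array([G[(s, l)] for s in range(1, r + 1)])
         for l in range(r + 1, n + 1)}
    # Step 2: eigenvectors of a generic combination N(xi), as in (3.14).
    xi = rng.standard_normal(n - r)
    Nxi = sum(xi[l - r - 1] * N[l] for l in range(r + 1, n + 1))
    V = np.linalg.eig(Nxi)[1]                 # columns are unit vectors v~_t
    # Entries of w~_t: eigenvalues of each N_l on v~_t (Rayleigh quotients).
    W = np.array([[V[:, t].conj() @ N[l] @ V[:, t]
                   for l in range(r + 1, n + 1)] for t in range(r)])
    # Step 3: scalings from the least squares (3.18)-(3.19).
    A1 = np.array([[V[k, t] * W[t, m] for t in range(r)]
                   for k in range(r) for m in range(n - r)])
    b1 = np.array([F[0, k + 1, m + r + 1]
                   for k in range(r) for m in range(n - r)])
    beta = np.linalg.lstsq(A1, b1, rcond=None)[0]       # beta_t = lam_t gam_t
    rows, rhs = [], []
    for k1 in range(r):
        for k2 in range(r):
            if k1 == k2:
                continue                      # diagonal entries not in Omega
            for m in range(n - r):
                rows.append([V[k1, t] * V[k2, t] * W[t, m] for t in range(r)])
                rhs.append(F[k1 + 1, k2 + 1, m + r + 1])
    theta = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)[0]
    lam = (beta**2 / theta).astype(complex)             # as in (3.20)
    gam = (theta / beta).astype(complex)
    # Output: p_t = lam_t^{1/3} * (1, gam_t v~_t, w~_t).
    return [lam[t] ** (1 / 3) *
            np.concatenate(([1.0], gam[t] * V[:, t], W[t]))
            for t in range(r)]
```

In exact arithmetic and in the noiseless case, the returned vectors reproduce $F$ on $\Omega$, up to a permutation and the choice of cube roots; with noisy input this same routine is the backbone of Algorithm 4.1 below.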
Example 3.5. The following is an example of applying Algorithm 3.4. Solving the linear systems (3.10) gives the generating matrix $G$; the unit length eigenvectors $\tilde v_1, \tilde v_2$ of $N(\xi)$ are computed for a random $\xi$; as in (3.15), we get the vectors $\tilde w_1, \tilde w_2$; solving (3.18) and (3.19), we get the scalars $\lambda_i, \gamma_i$. This produces the decomposition $F = \lambda_1 u_1^{\otimes 3} + \lambda_2 u_2^{\otimes 3}$. (The numerical data of this example are omitted here.)

Remark 3.6. Algorithm 3.4 requires the value of $r$ as input. Determining $r$ is generally a hard question. In computational practice, one can estimate the value of $r$ as follows. Let $\operatorname{Flat}(F) \in \mathbb{C}^{(n+1) \times (n+1)^2}$ be the flattening matrix, labelled by $(i, (j, k))$, such that $\operatorname{Flat}(F)_{i,(j,k)} = F_{ijk}$ for all $i, j, k = 0, 1, \ldots, n$. The rank of $\operatorname{Flat}(F)$ equals the rank of $F$ when the vectors $p_1, \ldots, p_r$ are linearly independent. The rank of $\operatorname{Flat}(F)$ is not available since only the subtensor $F_\Omega$ is known. However, we can calculate the ranks of submatrices of $\operatorname{Flat}(F)$ whose entries are all known. If the tensor $F$ as in (3.4) is such that both the sets $\{v_1, \ldots, v_r\}$ and $\{w_1, \ldots, w_r\}$ are linearly independent, one can see that the submatrix of $\operatorname{Flat}(F)$ with row labels $i \in [r]$ and column labels $(j, k)$ for distinct $j, k \in [r+1, n]$ is a known submatrix whose rank is $r$. This is generally the case if $r \le \frac{d}{2} - 1$, since $v_i$ has length $r$ and $w_i$ has length $d - 1 - r \ge r$. Therefore, the known submatrices of $\operatorname{Flat}(F)$ are generally sufficient to estimate $\operatorname{rank}_S(F)$. For instance, consider the case $F \in S^3(\mathbb{C}^7)$: in $\operatorname{Flat}(F)$, the entries $F_{ijk}$ with a repeated label are not given, and the ranks of the largest submatrices with all entries known generally equal $\operatorname{rank}_S(F)$ if $r \le \frac{d}{2} - 1 = 2.5$, i.e., if $r \le 2$.
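One possible implementation of this rank estimate is sketched below (names ours; we assume $r \le d/2 - 1$ and use a singular value threshold in place of exact rank):

```python
import numpy as np

def estimate_rank(F, tol=1e-8):
    """Estimate rank_S(F) from a fully known submatrix of Flat(F), as in
    Remark 3.6: rows are labels 1..m, columns are pairs (j, l) of distinct
    labels above m, so every entry used lies in Omega."""
    n = F.shape[0] - 1
    m = (n - 1) // 2                    # balanced split of the labels
    M = np.array([[F[i, j, l]
                   for j in range(m + 1, n + 1)
                   for l in range(m + 1, n + 1) if l != j]
                  for i in range(1, m + 1)])
    s = np.linalg.svd(M, compute_uv=False)
    return int(np.sum(s > tol * s[0]))  # numerical rank of the submatrix
```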

Tensor approximations and stability analysis
In some applications, we do not have the subtensor $F_\Omega$ exactly but only have an approximation $\widehat F_\Omega$ of it. Algorithm 3.4 can still provide a good rank-$r$ approximation for $F$ when it is applied to $\widehat F_\Omega$. We define the matrix $A_{ij}[\widehat F]$ and the vector $b_{ij}[\widehat F]$ in the same way as in (3.9), for each $\alpha = e_i + e_j \in \mathbb{B}_1$. The generating matrix $G$ for $F$ can be approximated by solving the linear least squares
$$\min_{g \in \mathbb{C}^r}\ \big\| A_{ij}[\widehat F]\, g - b_{ij}[\widehat F] \big\|^2 \tag{4.1}$$
for each $\alpha = e_i + e_j \in \mathbb{B}_1$. Let $\widehat G(:, e_i + e_j)$ be the optimizer of the above and let $\widehat G$ be the matrix consisting of all such $\widehat G(:, e_i + e_j)$. Then $\widehat G$ is an approximation for $G$. For each $l = r+1, \ldots, n$, define the matrix $N_l(\widehat G)$ similarly as in (3.11). Choose a generic vector $\xi = (\xi_{r+1}, \ldots, \xi_n)$ and let
$$\widehat N(\xi) := \xi_{r+1} N_{r+1}(\widehat G) + \cdots + \xi_n N_n(\widehat G). \tag{4.2}$$
The matrix $\widehat N(\xi)$ is an approximation for $N(\xi)$. Let $\hat v_1, \ldots, \hat v_r$ be unit length eigenvectors of $\widehat N(\xi)$. For $k = 1, \ldots, r$, let $\hat w_k$ be the vector whose $j$th entry is the Rayleigh quotient
$$(\hat w_k)_j := \hat v_k^H\, N_{j+r}(\widehat G)\, \hat v_k, \quad j = 1, \ldots, n - r, \tag{4.3}$$
which approximates the eigenvalues in (3.15). Then solve the linear least squares analogous to (3.18) and (3.19):
$$\min_{(\beta_1, \ldots, \beta_r)}\ \sum_{(0, k, l) \in J_1} \Big| \Big( \sum_{k'=1}^r \beta_{k'}\, \hat v_{k'} \otimes \hat w_{k'} \Big)_{k,\, l-r} - \widehat F_{0,k,l} \Big|^2, \tag{4.4}$$
$$\min_{(\theta_1, \ldots, \theta_r)}\ \sum_{(i_1, i_2, l) \in J_2} \Big| \Big( \sum_{k'=1}^r \theta_{k'}\, \hat v_{k'} \otimes \hat v_{k'} \otimes \hat w_{k'} \Big)_{i_1,\, i_2,\, l-r} - \widehat F_{i_1, i_2, l} \Big|^2. \tag{4.5}$$
Let $(\hat\beta_1, \ldots, \hat\beta_r)$ and $(\hat\theta_1, \ldots, \hat\theta_r)$ be their optimizers, respectively. For each $k = 1, \ldots, r$, let
$$\hat\lambda_k := (\hat\beta_k)^2 / \hat\theta_k, \qquad \hat\gamma_k := \hat\theta_k / \hat\beta_k. \tag{4.6}$$
This results in the tensor approximation $\hat p_1^{\otimes 3} + \cdots + \hat p_r^{\otimes 3}$ for $F$, with the vectors $\hat p_k := \sqrt[3]{\hat\lambda_k}\, (1, \hat\gamma_k \hat v_k, \hat w_k)$. The above may not give an optimal tensor approximation. To get an improved one, we can use $\hat p_1, \ldots, \hat p_r$ as starting points to solve the nonlinear optimization
$$\min_{(q_1, \ldots, q_r)}\ \big\| \big( q_1^{\otimes 3} + \cdots + q_r^{\otimes 3} - \widehat F \big)_\Omega \big\|^2. \tag{4.7}$$
The minimizer of the optimization (4.7) is denoted as $(p_1^*, \ldots, p_r^*)$. Summarizing the above, we have the following algorithm for computing a tensor approximation.

Algorithm 4.1 (Incomplete symmetric tensor approximation).
Input: The subtensor $\widehat F_\Omega$ and a rank $r \le \frac{d}{2} - 1$.
Step 1. Solve the linear least squares (4.1) to obtain the approximate generating matrix $\widehat G$.
Step 2. Compute the unit length eigenvectors $\hat v_1, \ldots, \hat v_r$ of $\widehat N(\xi)$ as in (4.2), for a randomly selected vector $\xi$, and compute the vectors $\hat w_1, \ldots, \hat w_r$ as in (4.3).
Step 3. Solve (4.4) and (4.5) and compute $\hat\lambda_k, \hat\gamma_k$ as in (4.6).
Step 4. Let $\hat p_k := \sqrt[3]{\hat\lambda_k}\, (1, \hat\gamma_k \hat v_k, \hat w_k)$; then $\hat p_1^{\otimes 3} + \cdots + \hat p_r^{\otimes 3}$ is a tensor approximation for $F$.
Step 5. Use $\hat p_1, \ldots, \hat p_r$ as starting points to solve the nonlinear optimization (4.7) for an optimizer $(p_1^*, \ldots, p_r^*)$.
Output: The tensor approximation $(p_1^*)^{\otimes 3} + \cdots + (p_r^*)^{\otimes 3}$ for $F$.
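The refinement (4.7) is an ordinary nonlinear least squares problem over the entries indexed by $\Omega$. Here is a sketch using scipy.optimize.least_squares in place of MATLAB's lsqnonlin, under the simplifying assumption of real decomposing vectors (names ours):

```python
import numpy as np
from itertools import permutations
from scipy.optimize import least_squares

def refine(P0, F_hat):
    """Refinement step (4.7). P0: (r, d) array of starting vectors p^_k;
    F_hat: noisy tensor, trusted only on Omega. Returns refined vectors."""
    r, d = P0.shape
    Omega = list(permutations(range(d), 3))

    def residual(x):
        P = x.reshape(r, d)
        T = np.einsum('ti,tj,tk->ijk', P, P, P)   # sum_k p_k^{(x)3}
        return np.array([T[idx] - F_hat[idx] for idx in Omega])

    sol = least_squares(residual, P0.ravel())
    return sol.x.reshape(r, d)
```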
When $\widehat F$ is close to $F$, Algorithm 4.1 also produces a good rank-$r$ tensor approximation for $F$. This is shown by the following theorem: suppose (i) $F$ has the decomposition (3.4) with $r \le \frac{d}{2} - 1$; (ii) the scalars $\lambda_i$ are nonzero and the vectors $v_1, \ldots, v_r$ are linearly independent; (iii) the matrices $A_{ij}[F]$ have full column rank; and (iv) the vectors $w_1, \ldots, w_r$ are distinct from each other. If $\varepsilon := \|(\widehat F - F)_\Omega\|$ is small enough, then the vectors produced by Algorithm 4.1 satisfy
$$\|\hat p_i - p_i\| = O(\varepsilon), \qquad \|p_i^* - p_i\| = O(\varepsilon),$$
up to a permutation of $(p_1, \ldots, p_r)$, where the constants inside $O(\cdot)$ only depend on $F$ and the choice of $\xi$ in Algorithm 4.1.
Proof. The conditions (i)-(ii), by Theorem 3.1, imply that there is a unique generating matrix $G$ for $F$. The matrix $G$ can be approximated by solving the linear least squares problems (4.1), and
$$\|\widehat G - G\| = O(\varepsilon),$$
where the constant inside $O(\varepsilon)$ depends on $F$ (see [15, Theorem 3.4]). For each $j = r+1, \ldots, n$, $N_j(\widehat G)$ is a submatrix of the generating matrix $\widehat G$, so $\|N_j(\widehat G) - N_j(G)\| = O(\varepsilon)$. This implies that $\|\widehat N(\xi) - N(\xi)\| = O(\varepsilon)$. When $\varepsilon$ is small enough, the matrix $\widehat N(\xi)$ does not have repeated eigenvalues, due to the condition (iv). Thus, the matrix $\widehat N(\xi)$ has a set of unit length eigenvectors $\hat v_1, \ldots, \hat v_r$ with eigenvalue vectors $\hat w_1, \ldots, \hat w_r$, respectively, such that
$$\|\hat v_i - \tilde v_i\| = O(\varepsilon), \qquad \|\hat w_i - \tilde w_i\| = O(\varepsilon).$$
This follows from Proposition 4.2.1 in [8]. The constants inside the above $O(\cdot)$ depend only on $F$ and $\xi$. The $\tilde w_1, \ldots, \tilde w_r$ are scalar multiples of the linearly independent vectors $(p_1)_{r+2:d}, \ldots, (p_r)_{r+2:d}$, respectively, so $\tilde w_1, \ldots, \tilde w_r$ are linearly independent. When $\varepsilon$ is small, $\hat w_1, \ldots, \hat w_r$ are linearly independent as well. The scalars $\hat\lambda_i \hat\gamma_i$ and $\hat\lambda_i (\hat\gamma_i)^2$ are optimizers of the linear least squares problems (4.4) and (4.5). By Theorem 3.4 in [15], we have
$$\hat\lambda_i \hat\gamma_i - \lambda_i \gamma_i = O(\varepsilon), \qquad \hat\lambda_i (\hat\gamma_i)^2 - \lambda_i (\gamma_i)^2 = O(\varepsilon).$$
The vector $p_i$ can be written as $p_i = \sqrt[3]{\lambda_i}\, (1, \gamma_i \tilde v_i, \tilde w_i)$, so we must have $\lambda_i, \gamma_i \ne 0$ due to the condition (ii). Thus, it holds that
$$\hat\lambda_i - \lambda_i = O(\varepsilon), \qquad \hat\gamma_i - \gamma_i = O(\varepsilon),$$
where the constants inside $O(\cdot)$ depend only on $F$ and $\xi$. Consider the vectors $\hat p_i := \sqrt[3]{\hat\lambda_i}\, (1, \hat\gamma_i \hat v_i, \hat w_i)$. Since $p_1, \ldots, p_r$ are linearly independent by the assumption, the rank decomposition of $F$ is unique up to scalings and permutations, by Theorem 3.3. There exist scalars $\tau_i$ with $(\tau_i)^3 = 1$ such that, up to a permutation of $p_1, \ldots, p_r$,
$$\|\tau_i \hat p_i - p_i\| = O(\varepsilon),$$
where the constants in $O(\cdot)$ only depend on $F$ and $\xi$. For the tensor $F^* := (p_1^*)^{\otimes 3} + \cdots + (p_r^*)^{\otimes 3}$, we have $\|(F^* - F)_\Omega\| = O(\varepsilon)$, since $(p_1^*, \ldots, p_r^*)$ is a minimizer of (4.7). When Algorithm 4.1 is applied to $(F^*)_\Omega$, Step 4 gives the exact decomposition $F^* = (p_1^*)^{\otimes 3} + \cdots + (p_r^*)^{\otimes 3}$. By repeating the previous argument, we can similarly show that $\|p_i^* - p_i\| = O(\varepsilon)$, where the constants in $O(\cdot)$ only depend on $F$ and $\xi$. $\square$

Remark 4.3. The explicit constant inside $O(\cdot)$ is not given in the proof. It is related to the condition number $\kappa(F)$ for tensor decomposition [5]. It was shown by Breiding and Vannieuwenhoven [5] that, to first order, the error of the decomposing vectors is bounded by $c\,\kappa(F)$ times the perturbation of the tensor, for some constant $c$. The continuity of $\widehat G$ in $\widehat F$ is implicitly implied by the proof. Eigenvalues and unit eigenvectors of $\widehat N(\xi)$ are continuous in $\widehat G$. Furthermore, $\hat\lambda_i, \hat\gamma_i$ are continuous in the eigenvalues and unit eigenvectors. All these functions are locally Lipschitz continuous. So $\hat p_i$ is Lipschitz continuous with respect to $\widehat F$ in a neighborhood of $F$, which also implies an error bound for $\hat p_i$. The tensors $(p_i^*)^{\otimes 3}$ are also locally Lipschitz continuous in $\widehat F$, as illustrated in [6]. This also gives error bounds for the decomposing vectors $p_i^*$. We refer to [5,6] for more details about condition numbers of tensor decompositions.
Example 4.4. We consider the same tensor $F$ as in Example 3.2. The subtensor $F_\Omega$ is perturbed to $\widehat F_\Omega$, where the perturbation is randomly generated from the Gaussian distribution $\mathcal{N}(0, 0.01)$. For neatness of the paper, we do not display $\widehat F_\Omega$ here. We use Algorithm 4.1 to compute the incomplete tensor approximation: the matrices $A_{ij}[\widehat F]$ and vectors $b_{ij}[\widehat F]$ are formed as in (3.9), and the linear least squares problems (4.1) are solved to obtain $\widehat G$ and $N_3(\widehat G), N_4(\widehat G), N_5(\widehat G)$. The computed decomposing vectors are pretty close to those of the decomposition of $F$.

Learning Diagonal Gaussian Mixture Models
We use the incomplete tensor decomposition or approximation method to learn parameters of Gaussian mixture models. Algorithms 3.4 and 4.1 can be applied for this purpose.
Let $y$ be the $d$-dimensional random vector of a Gaussian mixture model, with $r$ components of Gaussian distribution parameters $(\omega_i, \mu_i, \Sigma_i)$, $i = 1, \ldots, r$. We consider the case that $r \le \frac{d}{2} - 1$. Let $y_1, \ldots, y_N$ be samples drawn from the Gaussian mixture model. The sample average
$$\widehat M_1 := \frac{1}{N} \big( y_1 + \cdots + y_N \big)$$
is an estimate for the mean $M_1 := \mathbb{E}[y] = \omega_1 \mu_1 + \cdots + \omega_r \mu_r$. The symmetric tensor
$$\widehat M_3 := \frac{1}{N} \sum_{t=1}^N y_t \otimes y_t \otimes y_t$$
is an estimate for the third order moment tensor $M_3 := \mathbb{E}[y \otimes y \otimes y]$. When all the covariance matrices $\Sigma_i$ are diagonal, we have shown in (1.3) that
$$M_3 = \sum_{i=1}^r \omega_i\, \mu_i^{\otimes 3} + \sum_{j=1}^d \big( a_j \otimes e_j \otimes e_j + e_j \otimes a_j \otimes e_j + e_j \otimes e_j \otimes a_j \big).$$
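The sample moments are one-liners; a sketch (names ours; forming the full $d^3$ array is fine for moderate $d$):

```python
import numpy as np

def sample_moments(Y):
    """Empirical moments M^_1 and M^_3 from a sample array Y of shape (N, d)."""
    M1 = Y.mean(axis=0)
    M3 = np.einsum('ti,tj,tk->ijk', Y, Y, Y) / Y.shape[0]
    return M1, M3
```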
Algorithm 5.1 (Learning diagonal Gaussian mixture models).
Input: Samples $y_1, \ldots, y_N$ and the number $r$ of component Gaussians.
Step 1. Compute the sample moments $\widehat M_1$ and $\widehat M_3$, and apply Algorithm 4.1 to the subtensor $(\widehat M_3)_\Omega$ to get the vectors $p_1^*, \ldots, p_r^*$.
Step 2. For each $i$, choose the cube root of unity $\tau_i$ minimizing $\|\mathrm{Im}(\tau_i p_i^*)\|$ and let $\hat q_i := \mathrm{Re}(\tau_i p_i^*)$.
Step 3. Solve the linear least squares (5.1) to get the weights $\hat\omega_i$ and let $\hat\mu_i := \hat q_i / \sqrt[3]{\hat\omega_i}$.
Step 4. Use the above $\hat\omega_i, \hat q_i$ as initial points to solve the nonlinear optimization (5.2) for the optimal $\omega_i^*, \mu_i^*$, $i = 1, \ldots, r$.
Step 5. Recover the diagonal covariance matrices $\Sigma_i^*$ by solving the linear least squares (5.7).
Output: The estimated parameters $(\omega_i^*, \mu_i^*, \Sigma_i^*)$, $i = 1, \ldots, r$.
The sample averages $\widehat M_1, \widehat M_3$ can typically be used as good estimates of the true moments $M_1, M_3$. When the value of $r$ is not known, it can be determined as in Remark 3.6. The performance of Algorithm 5.1 is analyzed as follows. Let $\varepsilon$ denote the estimation error of the input moments, e.g., $\varepsilon := \max\{ \|\widehat M_1 - M_1\|,\ \|(\widehat M_3 - M_3)_\Omega\| \}$. Since the parameters $\omega_i, \mu_i$ are real, the decomposing vectors $\sqrt[3]{\omega_i}\, \mu_i$ of the tensor (1.4) are real. When $\varepsilon$ is small enough, there exist scalars $\tau_i^*$ with $(\tau_i^*)^3 = 1$ such that $\tau_i^* p_i^*$ is close to $\sqrt[3]{\omega_i}\, \mu_i$; such $\tau_i^*$ is the $\tau_i$ in Step 2 of Algorithm 5.1 that minimizes $\|\mathrm{Im}(\tau_i p_i^*)\|$, so we have $\|\hat q_i - \sqrt[3]{\omega_i}\, \mu_i\| = O(\varepsilon)$ after Step 2. The vectors $\hat q_1, \ldots, \hat q_r$ are linearly independent when $\varepsilon$ is small. Thus, the problem (5.1) has a unique solution and the weights $\hat\omega_i$ can be found by solving (5.1). Since the coefficients of (5.1) have errors of order $O(\varepsilon)$, we get $\hat\omega_i - \omega_i = O(\varepsilon)$ (see [15, Theorem 3.4]). The mean vectors $\hat\mu_i$ are obtained by $\hat\mu_i = \hat q_i / \sqrt[3]{\hat\omega_i}$, so the approximation error is $\|\hat\mu_i - \mu_i\| = O(\varepsilon)$. The constants inside the above $O(\varepsilon)$ depend on the parameters of the Gaussian mixture model and $\xi$.
The problem (5.2) is solved to obtain $\omega_i^*$ and $\mu_i^*$. The first order moment is $M_1 = \omega_1 \mu_1 + \cdots + \omega_r \mu_r$ and $\widehat M_1 - M_1 = O(\varepsilon)$; hence, by [15, Theorem 3.4], $\omega_i - \omega_i^* = O(\varepsilon)$ and $\|\mu_i - \mu_i^*\| = O(\varepsilon)$, where the constants inside $O(\cdot)$ only depend on the parameters $\{(\omega_i, \mu_i, \Sigma_i) : i \in [r]\}$ and $\xi$. The covariance matrices $\Sigma_i$ are recovered by solving the linear least squares (5.7). In these least squares problems, it holds that $\|\widehat A - A\| = O(\varepsilon)$, where the tensors $A, \widehat A$ are defined in (5.3). When the error $\varepsilon$ is small, the vectors $\omega_1^* \mu_1^*, \ldots, \omega_r^* \mu_r^*$ are linearly independent and hence (5.7) has a unique solution for each $j$. By [15, Theorem 3.4], we then have $\|\Sigma_i - \Sigma_i^*\| = O(\varepsilon)$, where the constants inside $O(\cdot)$ only depend on the parameters $\{(\omega_i, \mu_i, \Sigma_i) : i \in [r]\}$ and $\xi$.
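The extracted text does not display (5.7), but by (1.2) and (1.3) a natural implementation is: estimate each $a_j$ from $\widehat M_3$ and the fitted $(\omega_i^*, \mu_i^*)$, then solve $a_j = \sum_i \omega_i^* \sigma_{ij}^2 \mu_i^*$ for the variances by least squares. The following sketch rests on that assumption (names ours):

```python
import numpy as np

def recover_variances(M3_hat, w, Mu):
    """Hedged sketch of the covariance recovery step.
    w: (r,) fitted weights; Mu: (r, d) fitted means. Returns (r, d) variances."""
    r, d = Mu.shape
    B = (w[:, None] * Mu).T                  # columns are w_i mu_i, shape (d, r)
    Sig2 = np.zeros((r, d))
    for j in range(d):
        # (a_j)_l = (M3)_{j,j,l} - sum_i w_i (mu_i)_j^2 (mu_i)_l, for l != j
        a_j = M3_hat[j, j, :] - (w * Mu[:, j] ** 2) @ Mu
        mask = np.arange(d) != j             # only l != j entries are reliable
        Sig2[:, j] = np.linalg.lstsq(B[mask], a_j[mask], rcond=None)[0]
    return np.maximum(Sig2, 0)               # variances must be nonnegative
```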

Numerical Simulations
This section gives numerical experiments for our proposed methods. The computation is implemented in MATLAB R2019b, on an Alienware personal computer with an Intel(R) Core(TM) i7-9700K CPU at 3.60GHz and 16.0GB RAM. The MATLAB function lsqnonlin is used to solve (4.7) in Algorithm 4.1 and the MATLAB function fmincon is used to solve (5.2) in Algorithm 5.1. We compare our method with the classical EM algorithm, which is implemented by the MATLAB function fitgmdist (MaxIter is set to 100 and RegularizationValue is set to 0.001).
First, we show the performance of Algorithm 4.1 for computing incomplete symmetric tensor approximations. For a range of dimensions $d$ and ranks $r$, we generate the tensor $F = (p_1)^{\otimes 3} + \cdots + (p_r)^{\otimes 3}$, where each $p_i$ is randomly generated according to the Gaussian distribution in MATLAB. Then we apply the perturbation $(\widehat F)_\Omega = (F)_\Omega + \mathcal{E}_\Omega$, where $\mathcal{E}$ is a randomly generated tensor, also according to the Gaussian distribution in MATLAB, scaled so that $\|\mathcal{E}_\Omega\| = \varepsilon$. After that, Algorithm 4.1 is applied to the subtensor $(\widehat F)_\Omega$ to find the rank-$r$ tensor approximation. The approximation quality is measured by the absolute error and the relative error
$$\text{abs-error} := \|(F^* - \widehat F)_\Omega\|, \qquad \text{rel-error} := \frac{\|(F^* - \widehat F)_\Omega\|}{\|(\widehat F)_\Omega\|},$$
where $F^*$ is the output of Algorithm 4.1. For each case of $(d, r, \varepsilon)$, we generate 100 random instances. The min, average, and max relative errors for each dimension $d$ and rank $r$ are reported in Table 1.

Second, we explore the performance of Algorithm 5.1 for learning diagonal Gaussian mixture models. We compare it with the classical EM algorithm, for which the MATLAB function fitgmdist is used (MaxIter is set to 100 and RegularizationValue is set to 0.0001). The dimensions $d = 20, 30, 40, 50, 60$ are tested, with three values of $r$ for each case of $d$. We generate 100 random instances of $\{(\omega_i, \mu_i, \Sigma_i) : i = 1, \ldots, r\}$ for $d \in \{20, 30, 40\}$, and 20 random instances for $d \in \{50, 60\}$, because of the longer computation time for the latter cases. For each instance, 10000 samples are generated. To generate the weights $\omega_1, \ldots, \omega_r$, we first use the MATLAB function randi to generate a random 10000-dimensional integer vector with entries from $[r]$; then the occurring frequency of $i \in [r]$ is used as the weight $\omega_i$. For each diagonal covariance matrix $\Sigma_i$, its diagonal vector is set to be the entrywise square of a random vector generated by the MATLAB function randn. Each sample is generated from one of the $r$ component Gaussian distributions, so the samples are naturally separated into $r$ groups. Algorithm 5.1 and the EM algorithm are applied to fit a Gaussian mixture model to the 10000 samples of each instance. For each sample, we calculate its likelihood for each component Gaussian distribution in the estimated mixture model; a sample is classified to the $i$th group if its likelihood for the $i$th component is maximum. The classification accuracy is the rate at which samples are classified to the correct group. In Table 2, for each pair $(d, r)$, we report the accuracy of Algorithm 5.1 in the first row and the accuracy of the EM algorithm in the second row. As one can see, Algorithm 5.1 performs better than the EM algorithm, and its accuracy is not affected as the dimensions and ranks increase. Indeed, as the difference between the dimension $d$ and the rank $r$ increases, Algorithm 5.1 becomes more and more accurate, opposite to the EM algorithm. The reason is that the gap between the number of rows and the number of columns of $A_{ij}[F]$ in (3.9) grows as $d - r$ becomes bigger, which makes Algorithm 5.1 more robust.

Last, we apply Algorithm 5.1 to texture classification. We select 8 textured images of $512 \times 512$ pixels from the VisTex database. We use the MATLAB function rgb2gray to convert them to grayscale, since we only need their structure and texture information. Each image is divided into subimages of $32 \times 32$ pixels. We perform the discrete cosine transform (DCT) on each block of $16 \times 16$ pixels, with an overlap of 8 pixels.
Each component of the 'wavelet-like' DCT feature is the sum of the absolute values of the DCT coefficients in the corresponding sub-block, so the dimension of the feature vector extracted from each subimage is $d = 13$. We use blocks extracted from the first 160 subimages for training and those from the remaining 96 subimages for testing; we refer to [47] for more details. For each image, we apply Algorithm 5.1 and the EM algorithm to fit a Gaussian mixture model to the image. We choose the number of components $r$ according to Remark 3.6. To classify the test data, we follow the Bayes decision rule: each block is assigned to the texture which maximizes the posterior probability, where we assume a uniform prior over all classes [18]. The classification accuracy is the rate at which a subimage is correctly classified, shown in Table 3. Algorithm 5.1 outperforms the classical EM algorithm in accuracy on six of the eight images.
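With a uniform prior, the Bayes decision rule here reduces to a maximum likelihood comparison across the per-texture fitted models; a sketch (names ours, using scipy.stats for the component densities):

```python
import numpy as np
from scipy.stats import multivariate_normal

def classify(x, models):
    """Assign a feature vector x to the texture whose fitted diagonal GMM
    gives the largest likelihood. models: list of (w, Mu, Sig2) triples."""
    scores = []
    for w, Mu, Sig2 in models:
        dens = sum(wi * multivariate_normal.pdf(x, mean=mi, cov=np.diag(si))
                   for wi, mi, si in zip(w, Mu, Sig2))
        scores.append(dens)
    return int(np.argmax(scores))
```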

Conclusions and Future Work
This paper gives a new algorithm for learning Gaussian mixture models with diagonal covariance matrices. We first give a method for computing incomplete symmetric tensor decompositions, based on the use of generating polynomials; the method is described in Algorithm 3.4. When the input subtensor has small errors, we can similarly compute an incomplete symmetric tensor approximation, which is given by Algorithm 4.1. We have shown in Theorem 4.2 that if the input subtensor is sufficiently close to a low rank one, the produced tensor approximation is highly accurate. The unknown parameters of Gaussian mixture models can then be recovered by the incomplete tensor decomposition method, as described in Algorithm 5.1. When the estimates of $M_1$ and $M_3$ are accurate, the parameters recovered by Algorithm 5.1 are also accurate. The computational simulations demonstrate the good performance of the proposed method.
The proposed method deals with the case where the number of Gaussian components is less than one half of the dimension. How do we compute incomplete symmetric tensor decompositions when the set $\Omega$ is not of the form (1.5)? How can we learn parameters of Gaussian mixture models with more components? How can we do so when the covariance matrices are not diagonal? These are important and interesting topics for future research.