1 Introduction

A Gaussian mixture model consists of several component Gaussian distributions. Given samples of a Gaussian mixture model, one often needs to estimate the parameters of each component Gaussian distribution [24, 32]. Consider a Gaussian mixture model with r components. For each i ∈ [r] := {1,…,r}, let ωi be the positive probability for the i th component Gaussian to appear in the mixture model. We have each ωi > 0 and \({\sum }_{i=1}^{r}\omega _{i}=1\). Suppose the i th Gaussian distribution is \(\mathcal {N}(\mu _{i},{{\varSigma }}_{i})\), where \(\mu _{i}\in \mathbb {R}^{d}\) is the expectation (or mean) and \({{\varSigma }}_{i}\in \mathbb {R}^{d\times d}\) is the covariance matrix. Let \(y\in \mathbb {R}^{d}\) be the random vector for the Gaussian mixture model and let y1,…,yN be independent and identically distributed (i.i.d.) samples from the mixture model. Each yj is sampled from one of the r component Gaussian distributions and is associated with a label Zj ∈ [r] indicating the component it is sampled from. The probability that a sample comes from the i th component is ωi. When only samples without labels are observed, the Zj’s are called latent variables. The density function for the random variable y is

$$ f(y) := \sum\limits_{i=1}^{r}\omega_{i} \frac{1}{\sqrt{(2\pi)^{d} \det {{\varSigma}}_{i}}} \exp\left\{-\frac{1}{2}(y-\mu_{i})^{T} {{\varSigma}}_{i}^{-1} (y-\mu_{i}) \right\}, $$

where μi is the mean and Σi is the covariance matrix for the i th component.

Learning a Gaussian mixture model means estimating the parameters ωi,μi,Σi for each i ∈ [r] from given samples of y. The number of parameters in a covariance matrix grows quadratically with the dimension. Due to the curse of dimensionality, the computation becomes very expensive for large d [35]. Hence, diagonal covariance matrices are preferable in applications. In this paper, we focus on learning Gaussian mixture models with diagonal covariance matrices, i.e.,

$$ {{\varSigma}}_{i} = \text{diag}\left( \sigma_{i1}^{2},\ldots,\sigma_{id}^{2}\right), \quad i=1,\ldots, r. $$

A natural approach for recovering the unknown parameters ωi,μi,Σi is the method of moments. It estimates parameters by solving a system of multivariate polynomial equations, from moments of the random vector y. Directly solving polynomial systems may encounter non-existence or non-uniqueness of statistically meaningful solutions [57]. However, for diagonal Gaussians, the third order moment tensor can help us avoid these troubles.

Let \(M_{3} := \mathbb {E} (y \otimes y \otimes y)\) be the third order moment tensor for y. One can write y = η(z) + ζ(z), where z is a discrete random variable such that Prob(z = i) = ωi, \(\eta (i)=\mu _{i} \in \mathbb {R}^{d}\), and ζ(i) is the random variable ζi obeying the Gaussian distribution \(\mathcal {N}(0,{{\varSigma }}_{i})\). Assume all Σi are diagonal; then

$$ \begin{array}{@{}rcl@{}} M_{3} &=&\sum\limits_{i=1}^{r} \omega_{i} \mathbb{E}\left[(\eta(i)+\zeta_{i})^{\otimes 3}\right]\\ &=& \sum\limits_{i=1}^{r} \omega_{i} \left( \mu_{i}\otimes\mu_{i}\otimes\mu_{i}+ \mathbb{E}[\mu_{i}\otimes \zeta_{i}\otimes \zeta_{i}] + \mathbb{E}[\zeta_{i}\otimes \mu_{i}\otimes \zeta_{i}] +\mathbb{E}[\zeta_{i}\otimes \zeta_{i}\otimes \mu_{i}] \right). \end{array} $$

The second equality holds because ζi has zero mean and

$$ \mathbb{E}[\zeta_{i}\otimes \zeta_{i}\otimes \zeta_{i}]= \mathbb{E}[\mu_{i}\otimes \mu_{i}\otimes \zeta_{i}]= \mathbb{E}[\zeta_{i}\otimes \mu_{i}\otimes \mu_{i}]= \mathbb{E}[\mu_{i}\otimes \zeta_{i}\otimes \mu_{i}]=0. $$

The random variable ζi has a diagonal covariance matrix, so \(\mathbb {E}[(\zeta _{i})_{j}(\zeta _{i})_{l}] = 0\) for \(j \ne l\). Therefore,

$$ \sum\limits_{i=1}^{r}\omega_{i}\mathbb{E}[\mu_{i}\otimes \zeta_{i}\otimes \zeta_{i}] = \sum\limits_{i=1}^{r}\sum\limits_{j=1}^{d} \omega_{i}\sigma_{ij}^{2}\mu_{i}\otimes e_{j}\otimes e_{j} = \sum\limits_{j=1}^{d} a_{j}\otimes e_{j}\otimes e_{j}, $$

where the vectors aj are given by

$$ a_{j} := \sum\limits_{i=1}^{r}\omega_{i}\sigma^{2}_{ij}\mu_{i}, \quad j=1,\ldots,d. $$
(1.1)

Similarly, we have

$$ \sum\limits_{i=1}^{r}\omega_{i}\mathbb{E}[\zeta_{i}\otimes\mu_{i}\otimes \zeta_{i}] = \sum\limits_{j=1}^{d} e_{j}\otimes a_{j}\otimes e_{j}, \quad \sum\limits_{i=1}^{r}\omega_{i}\mathbb{E}[\zeta_{i}\otimes \zeta_{i}\otimes\mu_{i}] = \sum\limits_{j=1}^{d} e_{j}\otimes e_{j}\otimes a_{j}. $$

Therefore, we can express M3 in terms of ωi, μi, Σi as

$$ M_{3}=\sum\limits_{i=1}^{r} \omega_{i}\mu_{i}\otimes\mu_{i}\otimes\mu_{i} + \sum\limits_{j=1}^{d} \left( a_{j}\otimes e_{j}\otimes e_{j}+e_{j}\otimes a_{j}\otimes e_{j} + e_{j}\otimes e_{j}\otimes a_{j} \right). $$
(1.2)

We are particularly interested in the following third order symmetric tensor

$$ \mathcal{F} := \sum\limits_{i=1}^{r} \omega_{i}\mu_{i}\otimes\mu_{i}\otimes\mu_{i}. $$
(1.3)

When the labels i1, i2, i3 are distinct from each other, we have

$$ (M_{3})_{i_{1}i_{2}i_{3}} = (\mathcal{F})_{i_{1}i_{2}i_{3}} \quad \text{for} \quad i_{1} \ne i_{2} \ne i_{3} \ne i_{1}. $$

Denote the label set

$$ {{\varOmega}} = \{(i_{1}, i_{2}, i_{3}): i_{1} \ne i_{2} \ne i_{3} \ne i_{1}, i_{1},i_{2},i_{3} \text{ are labels for }M_{3} \}. $$
(1.4)

The tensor M3 can be estimated from samples of y, so the entries \(\mathcal {F}_{i_{1}i_{2}i_{3}}\) with (i1,i2,i3) ∈Ω can also be obtained from the estimation of M3. To recover the parameters ωi, μi, we first find a tensor decomposition of \(\mathcal {F}\) from the partially given entries \(\mathcal {F}_{i_{1}i_{2}i_{3}}\) with (i1,i2,i3) ∈Ω. Once the parameters ωi, μi are known, we can determine Σi from the expressions of aj as in (1.1).
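The identity (1.2) can be checked numerically. The following is a minimal sketch in Python with NumPy (the dimensions, weights, and parameters below are only illustrative, not from the paper; indices are 0-based in the code): it builds M3 from (1.1)–(1.2), builds \(\mathcal {F}\) from (1.3), and verifies that the two tensors agree on all entries with three distinct labels.

```python
import numpy as np

# Minimal sketch: build M3 via (1.1)-(1.2) and F via (1.3) for illustrative
# parameters, then check that they agree on all entries with distinct labels.
rng = np.random.default_rng(0)
d, r = 6, 2
w = np.array([0.4, 0.6])                        # mixture weights, sum to 1
mu = rng.standard_normal((r, d))                # means mu_i
sig2 = rng.uniform(0.5, 1.5, size=(r, d))       # diagonal variances sigma_{ij}^2

# F = sum_i w_i * mu_i^{(x)3}, as in (1.3)
F = sum(w[i] * np.einsum('a,b,c->abc', mu[i], mu[i], mu[i]) for i in range(r))

# M3 = F + sum_j (a_j(x)e_j(x)e_j + e_j(x)a_j(x)e_j + e_j(x)e_j(x)a_j), as in (1.2)
M3 = F.copy()
for j in range(d):
    a_j = sum(w[i] * sig2[i, j] * mu[i] for i in range(r))   # a_j as in (1.1)
    e_j = np.eye(d)[j]
    M3 += np.einsum('a,b,c->abc', a_j, e_j, e_j) \
        + np.einsum('a,b,c->abc', e_j, a_j, e_j) \
        + np.einsum('a,b,c->abc', e_j, e_j, a_j)

# entries with three distinct labels (the set Omega) are not affected
for i1 in range(d):
    for i2 in range(d):
        for i3 in range(d):
            if len({i1, i2, i3}) == 3:
                assert np.isclose(M3[i1, i2, i3], F[i1, i2, i3])
```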

The above observation leads to the incomplete tensor decomposition problem. For a third order symmetric tensor \(\mathcal {F}\) whose partial entries \(\mathcal {F}_{i_{1}i_{2}i_{3}}\) with (i1,i2,i3) ∈Ω are known, we are looking for vectors p1,…,pr such that

$$ \mathcal{F}_{i_{1}i_{2}i_{3}} = \left( p_{1}^{\otimes 3} + {\cdots} + p_{r}^{\otimes 3}\right)_{i_{1}i_{2}i_{3}} \quad \text{for all }~(i_{1},i_{2},i_{3}) \in {{\varOmega}}. $$

The above is called an incomplete tensor decomposition for \(\mathcal {F}\). To find such a tensor decomposition for \(\mathcal {F}\), a straightforward approach is to do tensor completion: first find unknown tensor entries \(\mathcal {F}_{i_{1}i_{2}i_{3}}\) with (i1,i2,i3)∉Ω such that the completed \(\mathcal {F}\) has low rank, and then compute the tensor decomposition for \(\mathcal {F}\). However, this approach has serious disadvantages. The theory of tensor completion or recovery, especially for symmetric tensors, is not yet mature. Low rank tensor completion or recovery is typically not guaranteed by currently existing methods. Most methods for tensor completion are based on convex relaxations, e.g., nuclear norm or trace minimization [22, 36, 41, 54, 58]. These convex relaxations may not produce low rank completions [51].

In this paper, we propose a new method for determining incomplete tensor decompositions. It is based on the generating polynomial method in [40]. The label set Ω consists of (i1,i2,i3) of distinct i1, i2, i3. We can still determine some generating polynomials, from the partially given tensor entries \(\mathcal {F}_{i_{1}i_{2}i_{3}}\) with (i1,i2,i3) ∈Ω. They can be used to get the incomplete tensor decomposition. We show that this approach works very well when the rank r is roughly not more than half of the dimension d. Consequently, the parameters for the Gaussian mixture model can be recovered from the incomplete tensor decomposition of \(\mathcal {F}\).

Related Work

Gaussian mixture models have broad applications in machine learning problems, e.g., automatic speech recognition [30, 48, 50], hyperspectral unmixing problem [4, 34], background subtraction [32, 60] and anomaly detection [56]. They also have applications in social and biological sciences [25, 53, 59].

There exist various methods for estimating unknown parameters of Gaussian mixture models. A popular one is the expectation-maximization (EM) algorithm, which iteratively approximates the maximum likelihood parameter estimation [16]. This approach is widely used in applications, although its convergence is not always reliable [49]. Dasgupta [13] introduced a method that first projects data to a randomly chosen low-dimensional subspace and then uses the empirical means and covariances of low-dimensional clusters to estimate the parameters. Later, Arora and Kannan [52] extended this idea to arbitrary Gaussians. Vempala and Wong [55] introduced a spectral technique to enhance the separation condition by projecting data to principal components of the sample matrix instead of a random subspace. For other subsequent work, we refer to Dasgupta and Schulman [14], Kannan et al. [29], Achlioptas et al. [1], Chaudhuri and Rao [10], Brubaker and Vempala [7] and Chaudhuri et al. [9].

Another frequently used approach is based on moments, introduced by Pearson [46]. Belkin and Sinha [3] proposed a learning algorithm for identical spherical Gaussians (Σi = σ2I) with arbitrarily small separation between mean vectors. It was also shown in [28] that a mixture of two Gaussians can be learned with provably minimal assumptions. Hsu and Kakade [27] provided a learning algorithm for a mixture of spherical Gaussians, i.e., each covariance matrix is a multiple of the identity matrix. This method is based on moments up to order three and only assumes non-degeneracy instead of separations. For general covariance matrices, Ge et al. [23] proposed a learning method when the dimension d is sufficiently high. More moment-based methods for general latent variable models can be found in [2].

Contributions

This paper proposes a new method for learning diagonal Gaussian mixture models, based on sample estimates of the first and third order moments. Let \(y_{1},\dots ,y_{N}\) be samples and let {(ωi,μi,Σi) : i ∈ [r]} be the parameters of the diagonal Gaussian mixture model, where each covariance matrix Σi is diagonal. We use the samples \(y_{1},\dots ,y_{N}\) to estimate the third order moment tensor M3, as well as the mean vector M1. We have seen that the tensor M3 can be expressed as in (1.2).

For the tensor \(\mathcal {F}\) in (1.3), we have \(\mathcal {F}_{i_{1}i_{2}i_{3}} = (M_{3})_{i_{1}i_{2}i_{3}}\) when the labels i1, i2, i3 are distinct from each other. Other entries of \(\mathcal {F}\) are not known, since the vectors aj are not available. Thus \(\mathcal {F}\) is an incompletely given tensor. We give a new method for computing the incomplete tensor decomposition of \(\mathcal {F}\) when the rank r is low (roughly no more than half of the dimension d). The tensor decomposition of \(\mathcal {F}\) is unique under some genericity conditions [11], so it can be used to recover the parameters ωi, μi. To compute the incomplete tensor decomposition of \(\mathcal {F}\), we use the generating polynomial method in [40, 42]. We look for a special set of generating polynomials for \(\mathcal {F}\), which can be obtained by solving linear least squares problems. This only requires the known entries of \(\mathcal {F}\). The common zeros of these generating polynomials can be determined from eigenvalue decompositions. Under some genericity assumptions, these common zeros can be used to get the incomplete tensor decomposition. After this is done, the parameters ωi, μi can be recovered by solving linear systems. The diagonal covariance matrices Σi can also be estimated by solving linear least squares problems. The tensor M3 is estimated from the samples y1,…,yN. Typically, the tensor entries \((M_{3})_{i_{1}i_{2}i_{3}}\) and \(\mathcal {F}_{i_{1}i_{2}i_{3}}\) are not precisely given. We also provide a stability analysis for this case, showing that the estimated parameters are also accurate when the entries \((M_{3})_{i_{1}i_{2}i_{3}}\) have small errors.

The paper is organized as follows. In Section 2, we review some basic results for symmetric tensor decompositions and generating polynomials. In Section 3, we give a new algorithm for computing an incomplete tensor decomposition for \(\mathcal {F}\), when only its subtensor \(\mathcal {F}_{{{\varOmega }}}\) is known. Section 4 gives the stability analysis when there are errors for the subtensor \(\mathcal {F}_{{{\varOmega }}}\). Section 5 gives the algorithm for learning Gaussian mixture models. Numerical experiments and applications are given in Section 6. We make some conclusions and discussions in Section 7.

2 Preliminary

Notation

Denote by \(\mathbb {N}\), \(\mathbb {C}\) and \(\mathbb {R}\) the sets of nonnegative integers, complex numbers and real numbers respectively. Denote the cardinality of a set L as |L|. Denote by ei the i th standard unit basis vector, i.e., the i th entry of ei is one and all others are zero. For a complex number c, \(\sqrt [n]{c}\) or c1/n denotes the principal n th root of c. For a complex vector v, Re(v) and Im(v) denote the real and imaginary parts of v respectively. A property is said to be generic if it holds in the whole space except on a subset of zero Lebesgue measure. The norm ∥⋅∥ denotes the Euclidean norm of a vector or the Frobenius norm of a matrix. For a vector or matrix, the superscript T denotes the transpose and H denotes the conjugate transpose. For \(i,j \in \mathbb {N}\), [i] denotes the set {1,2,…,i} and [i,j] denotes the set {i,i + 1,…,j} if \(i \le j\). For a vector v, \(v_{i_{1}:i_{2}}\) denotes the subvector \((v_{i_{1}},v_{i_{1}+1},\ldots ,v_{i_{2}})\). For a matrix A, denote by \(A_{[i_{1}:i_{2}, j_{1}:j_{2}]}\) the submatrix of A whose row labels are i1,i1 + 1,…,i2 and whose column labels are j1,j1 + 1,…,j2. For a tensor \(\mathcal {F}\), its subtensor \(\mathcal {F}_{[i_{1}:i_{2},j_{1}:j_{2},k_{1}:k_{2}]}\) is similarly defined.

Let \(\mathrm {S}^{m}(\mathbb {C}^{d})\) (resp., \(\mathrm {S}^{m}(\mathbb {R}^{d})\)) denote the space of m th order symmetric tensors over the vector space \(\mathbb {C}^{d}\) (resp., \(\mathbb {R}^{d}\)). For convenience of notation, the labels for tensors start with 0. A symmetric tensor \(\mathcal {A} \in \mathrm {S}^{m}(\mathbb {C}^{n+1})\) is labelled as

$$ \mathcal{A} = (\mathcal{A}_{i_{1}{\dots} i_{m}} )_{0\le i_{1}, \ldots, i_{m} \le n}, $$

where the entry \(\mathcal {A}_{i_{1}{\ldots } i_{m}}\) is invariant for all permutations of (i1,…,im). The Hilbert–Schmidt norm \(\|\mathcal {A}\|\) is defined as

$$ \|\mathcal{A}\| := \left( \sum\limits_{0\leq i_{1},\dots,i_{m}\leq n}|\mathcal{A}_{i_{1}{\ldots} i_{m}}|^{2}\right)^{1/2}. $$

The norm of a subtensor \(\|\mathcal {A}_{{{\varOmega }}}\|\) is similarly defined. For a vector \(u:=(u_{0},u_{1},\ldots ,u_{n})\in \mathbb {C}^{d}\), the tensor power um := u ⊗⋯ ⊗ u, where u is repeated m times, is defined such that

$$ (u^{\otimes m})_{i_{1}{\ldots} i_{m}} = u_{i_{1}} \times {\cdots} \times u_{i_{m}} . $$

For a symmetric tensor \(\mathcal {F}\), its symmetric rank is

$$ \text{rank}_{\mathrm{S}}(\mathcal{F}) := \min\left\{r ~|~ \mathcal{F}=\sum\limits_{i=1}^{r} u_{i}^{\otimes m}\right\}. $$

There are other types of tensor ranks [31, 33]. In this paper, we only deal with symmetric tensors and symmetric ranks. We refer to [12, 17, 21, 26, 31, 33] for general work about tensors and their ranks. For convenience, if \(r=\text {rank}_{\mathrm {S}}(\mathcal {F})\), we call \(\mathcal {F}\) a rank-r tensor and \(\mathcal {F}={\sum }_{i=1}^{r} u_{i}^{\otimes m}\) is called a rank decomposition.

For a power \(\alpha :=(\alpha _{1},\alpha _{2},\dots ,\alpha _{n})\in \mathbb {N}^{n}\) and \(x:=(x_{1},x_{2},\dots ,x_{n})\), denote

$$ |\alpha| := \alpha_{1}+\alpha_{2}+\cdots+\alpha_{n},\quad x^{\alpha} := x_{1}^{\alpha_{1}}x_{2}^{\alpha_{2}}{\cdots} x_{n}^{\alpha_{n}}, \quad x_{0} :=1. $$

The monomial power set of degree m is denoted as

$$ \mathbb{N}^{n}_{m} := \{\alpha=(\alpha_{1},\alpha_{2},\dots,\alpha_{n}) \in \mathbb{N}^{n}:|\alpha|\leq m\}. $$

For \(\alpha \in \mathbb {N}_{3}^{n}\), we can write that \(x^{\alpha } = x_{i_{1}}x_{i_{2}}x_{i_{3}}\) for some \(0 \le i_{1},i_{2},i_{3} \le n\).

Let \(\mathbb {C}[x]_{m}\) be the space of all polynomials in x with complex coefficients and whose degrees are no more than m. For a cubic polynomial \(p\in \mathbb {C}[x]_{3}\) and \(\mathcal {F}\in \mathrm {S}^{3}(\mathbb {C}^{n+1})\), we define the bilinear product (note that x0 = 1)

$$ \langle p,\mathcal{F}\rangle=\sum\limits_{0\leq i_{1},i_{2},i_{3}\leq n}p_{i_{1}i_{2}i_{3}}\mathcal{F}_{i_{1}i_{2}i_{3}} \quad\text{ for }~p = \sum\limits_{0\leq i_{1},i_{2},i_{3}\leq n}p_{i_{1}i_{2}i_{3}}x_{i_{1}}x_{i_{2}}x_{i_{3}}, $$

where \(p_{i_{1}i_{2}i_{3}}\) are coefficients of p. A polynomial \(g\in \mathbb {C}[x]_{3}\) is called a generating polynomial for a symmetric tensor \(\mathcal {F} \in \mathrm {S}^{3}(\mathbb {C}^{n+1})\) if

$$ \langle g\cdot x^{\beta}, \mathcal{F}\rangle=0\quad\forall\beta\in\mathbb{N}_{3-\text{deg}(g)}^{n} , $$

where deg(g) denotes the degree of g in x. When the order is bigger than 3, we refer to [40] for the definition of generating polynomials. They can be used to compute symmetric tensor decompositions and low rank approximations [40, 42], which are closely related to truncated moment problems and polynomial optimization [20, 37,38,39, 43]. There are special versions of symmetric tensors and their decompositions [19, 44, 45].
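As a small numerical sketch of the bilinear product and the generating polynomial condition (all choices below are illustrative, not from the paper), take \(\mathcal {F} = u^{\otimes 3}\) with u0 = 1; then g = x1 − u1 is a generating polynomial, since ⟨g⋅xβ, \(\mathcal {F}\)⟩ = u1uβ − u1uβ = 0 for every monomial xβ of degree at most 2.

```python
import numpy as np

# Tiny sketch: F = u^{(x)3} with u_0 = 1; g = x_1 - u_1 is a generating polynomial.
n = 3
u = np.array([1.0, 2.0, -1.0, 0.5])              # labels 0..n, with u_0 = 1
F = np.einsum('a,b,c->abc', u, u, u)

def pairing(coeff, F):
    """<p, F> = sum over (i1,i2,i3) of p_{i1 i2 i3} * F_{i1 i2 i3}."""
    return float(np.sum(coeff * F))

# g * x^beta = x_1 * x^beta - u_1 * x^beta, for every monomial x^beta of degree <= 2
for b1 in range(n + 1):
    for b2 in range(n + 1):                       # x^beta = x_{b1} x_{b2} (x_0 = 1)
        C = np.zeros((n + 1,) * 3)
        C[1, b1, b2] += 1.0                       # coefficient of x_1 x_{b1} x_{b2}
        C[0, b1, b2] -= u[1]                      # coefficient of -u_1 x_0 x_{b1} x_{b2}
        assert abs(pairing(C, F)) < 1e-12
```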

3 Incomplete Tensor Decomposition

This section discusses how to compute an incomplete tensor decomposition for a symmetric tensor \(\mathcal {F} \in \mathrm {S}^{3}(\mathbb {C}^{d})\) when only its subtensor \(\mathcal {F}_{{{\varOmega }}}\) is given, for the label set Ω in (1.4). For convenience of notation, the labels for \(\mathcal {F}\) begin with zeros while a vector \(u\in \mathbb {C}^{d}\) is still labelled as u := (u1,…,ud). We set

$$ n:=d-1, \quad x = (x_{1}, \ldots, x_{n}), \quad x_{0} := 1. $$

For a given rank r, denote the monomial sets

$$ \mathscr{B}_{0} := \{x_{1},\dots,x_{r}\}, \quad \mathscr{B}_{1}=\{x_{i} x_{j}: i \in [r], j \in [r+1, n] \}. $$

For a monomial power \(\alpha \in \mathbb {N}^{n}\), by writing \(\alpha \in {\mathscr{B}}_{1}\), we mean that \(x^{\alpha } \in {\mathscr{B}}_{1}\). For each \(\alpha \in {\mathscr{B}}_{1}\), one can write α = ei + ej with i ∈ [r], j ∈ [r + 1,n]. Let \(\mathbb {C}^{[r] \times {\mathscr{B}}_{1}}\) denote the space of matrices labelled by the pair \((k,\alpha )\in [r] \times {\mathscr{B}}_{1}\). For each \(\alpha = e_{i} + e_{j}\in {\mathscr{B}}_{1}\) and \(G\in \mathbb {C}^{[r] \times {\mathscr{B}}_{1}}\), denote the quadratic polynomial in x

$$ \varphi_{ij}[G](x) := \sum\limits_{k =1}^{r}G(k, e_{i}+e_{j})x_{k}- x_{i} x_{j}. $$
(3.1)

Suppose r is the symmetric rank of \(\mathcal {F}\). A matrix \(G\in \mathbb {C}^{[r] \times {\mathscr{B}}_{1}}\) is called a generating matrix of \(\mathcal {F}\) if each φij[G](x), with \(\alpha = e_{i} + e_{j} \in {\mathscr{B}}_{1}\), is a generating polynomial of \(\mathcal {F}\). Equivalently, G is a generating matrix of \(\mathcal {F}\) if and only if

$$ \langle x_{t} \varphi_{ij}[G](x),\mathcal{F} \rangle = {\sum}_{k=1}^{r} G(k,e_{i}+e_{j})\mathcal{F}_{0kt}-\mathcal{F}_{ijt} = 0, \quad t = 0, 1, \ldots, n, $$
(3.2)

for all i ∈ [r], j ∈ [r + 1,n]. The notion of a generating matrix is motivated by the fact that the entire tensor \(\mathcal {F}\) can be recursively determined by G and its first r entries (see [40]). The existence and uniqueness of the generating matrix G are shown as follows.

Theorem 3.1

Suppose \(\mathcal {F}\) has the decomposition

$$ \mathcal{F} = \lambda_{1}\left[\begin{array}{c} 1 \\ u_{1} \end{array}\right]^{\otimes 3}+\cdots+\lambda_{r}\left[\begin{array}{c} 1 \\ u_{r} \end{array}\right]^{\otimes 3} , $$
(3.3)

for vectors \(u_{i} \in \mathbb {C}^{n}\) and scalars \( 0\neq \lambda _{i} \in \mathbb {C}\). If the subvectors (u1)1:r,…,(ur)1:r are linearly independent, then there exists a unique generating matrix \(G\in \mathbb {C}^{[r] \times {\mathscr{B}}_{1}}\) satisfying (3.2) for the tensor \(\mathcal {F}\).

Proof

We first prove the existence. For each i = 1,…,r, denote the vectors vi = (ui)1:r. Under the given assumption, V := [v1vr] is an invertible matrix. For each l = r + 1,…,n, let

$$ N_{l} := V \cdot \text{diag}\left( (u_{1})_{l},\ldots,(u_{r})_{l} \right) \cdot V^{-1}. $$
(3.4)

Then Nlvi = (ui)lvi for i = 1,…,r, i.e., Nl has eigenvalues (u1)l,…,(ur)l with corresponding eigenvectors (u1)1:r,…,(ur)1:r. We select \(G\in \mathbb {C}^{[r] \times {\mathscr{B}}_{1}}\) to be the matrix such that

$$ N_{l} = \left[\begin{array}{ccc} G(1,e_{1}+e_{l}) & {\cdots} & G(r,e_{1}+e_{l}) \\ {\vdots} & {\ddots} & {\vdots} \\ G(1,e_{r}+e_{l}) & {\cdots} & G(r,e_{r}+e_{l}) \end{array}\right],\quad l=r+1,\ldots,n. $$

For each s = 1,…,r and \(\alpha = e_{i} + e_{j} \in {\mathscr{B}}_{1}\) with i ∈ [r], j ∈ [r + 1,n],

$$ \varphi_{ij}[G](u_{s}) =\sum\limits_{k=1}^{r} G(k,e_{i}+e_{j})(u_{s})_{k} - (u_{s})_{i} (u_{s})_{j} = 0. $$

For each t = 1,…,n, it holds that

$$ \begin{array}{@{}rcl@{}} \langle x_{t} \varphi_{ij}[G](x),\mathcal{F} \rangle &=& \left\langle \sum\limits_{k =1}^{r}G(k, e_{i}+e_{j}) x_{t}x_{k} - x_{t} x_{i} x_{j}, \mathcal{F} \right\rangle \\ &=& \left\langle \sum\limits_{k =1}^{r}G(k, e_{i}+e_{j}) x_{t}x_{k} - x_{t} x_{i} x_{j}, \sum\limits_{s=1}^{r} \lambda_{s} \left[\begin{array}{c} 1 \\ u_{s} \end{array}\right]^{\otimes 3} \right\rangle \\ &=& \sum\limits_{k =1}^{r}G(k, e_{i}+e_{j})\sum\limits_{s=1}^{r} \lambda_{s} (u_{s})_{t} (u_{s})_{k} - \sum\limits_{s=1}^{r} \lambda_{s} (u_{s})_{t} (u_{s})_{i} (u_{s})_{j} \\ &=& \sum\limits_{s=1}^{r} \lambda_{s} (u_{s})_{t} \left( \sum\limits_{k =1}^{r}G(k, e_{i}+e_{j})(u_{s})_{k} -(u_{s})_{i} (u_{s})_{j} \right) \\ &=& 0. \end{array} $$

When t = 0, we can similarly get

$$ \begin{array}{@{}rcl@{}} \langle \varphi_{ij}[G](x) ,\mathcal{F} \rangle &=& \left\langle \sum\limits_{k =1}^{r}G(k, e_{i}+e_{j}) x_{k} - x_{i} x_{j}, \mathcal{F} \right\rangle\\ &=& \sum\limits_{s=1}^{r} \lambda_{s} \left( \sum\limits_{k =1}^{r}G(k, e_{i}+e_{j})(u_{s})_{k} -(u_{s})_{i} (u_{s})_{j} \right) \\ &=& 0. \end{array} $$

Therefore, the matrix G satisfies (3.2) and it is a generating matrix for \(\mathcal {F}\).

Second, we prove the uniqueness of such G. For each \(\alpha = e_{i} + e_{j} \in {\mathscr{B}}_{1}\), let

$$ F := \left[\begin{array}{ccc} \mathcal{F}_{011} & {\cdots} & \mathcal{F}_{0r1} \\ {\vdots} & {\ddots} & {\vdots} \\ \mathcal{F}_{01n} & {\cdots} & \mathcal{F}_{0rn} \end{array}\right],\quad g_{ij} := \left[\begin{array}{c} \mathcal{F}_{1ij} \\ {\vdots} \\ \mathcal{F}_{nij} \end{array}\right]. $$

Since G satisfies (3.2), we have FG(:,ei + ej) = gij. The decomposition (3.3) implies that

$$ F = \left[\begin{array}{ccc} u_{1} & {\cdots} & u_{r} \end{array}\right] \cdot \text{diag}(\lambda_{1},\ldots,\lambda_{r}) \cdot \left[\begin{array}{ccc} v_{1} & {\cdots} & v_{r} \end{array}\right]^{T}. $$

The sets {v1,…,vr} and {u1,…,ur} are both linearly independent. Since each λi≠ 0, the matrix F has full column rank. Hence, the generating matrix G satisfying FG(:,ei + ej) = gij for all i ∈ [r],j ∈ [r + 1,n] is unique. □

The following is an example of generating matrices.

Example 3.2

Consider the tensor \(\mathcal {F}\in \mathtt {S}^{3}(\mathbb {C}^{6})\) that is given as

$$ \mathcal{F} = 0.4\cdot(1,1,1,1,1,1)^{\otimes 3} + 0.6\cdot(1,-1,2,-1,2,3)^{\otimes 3}. $$

The rank r = 2, \({\mathscr{B}}_{0}=\{x_{1},x_{2}\}\) and \({\mathscr{B}}_{1} = \{x_{1}x_{3},x_{1}x_{4},x_{1}x_{5},x_{2}x_{3},x_{2}x_{4},x_{2}x_{5}\}\). We have the vectors

$$ u_{1} =(1,1,1,1,1), \quad u_{2} = (-1,2,-1,2,3), \quad v_{1} =(1,1), \quad v_{2} = (-1,2). $$

The matrices N3, N4, N5 as in (3.4) are

$$ \begin{array}{@{}rcl@{}} N_{3} &=& \left[\begin{array}{cc}1 & -1 \\ 1 & 2 \end{array}\right] \left[\begin{array}{cc}1 & 0 \\ 0 & -1 \end{array}\right] \left[\begin{array}{cc}1 & -1 \\ 1 & 2 \end{array}\right]^{-1}= \left[\begin{array}{cc}1/3 & 2/3 \\ 4/3 & -1/3 \end{array}\right], \\ N_{4} &=& \left[\begin{array}{cc}1 & -1 \\ 1 & 2 \end{array}\right] \left[\begin{array}{cc}1 & 0 \\ 0 & 2 \end{array}\right] \left[\begin{array}{cc}1 & -1 \\ 1 & 2 \end{array}\right]^{-1}= \left[\begin{array}{cc}4/3 & -1/3 \\ -2/3 & 5/3 \end{array}\right], \\ N_{5} &=& \left[\begin{array}{cc}1 & -1 \\ 1 & 2 \end{array}\right] \left[\begin{array}{cc}1 & 0 \\ 0 & 3 \end{array}\right] \left[\begin{array}{cc}1 & -1 \\ 1 & 2 \end{array}\right]^{-1}= \left[\begin{array}{cc}5/3 & -2/3 \\ -4/3 & 7/3 \end{array}\right]. \end{array} $$

The entries of the generating matrix G are listed below:

$$ \begin{array}{c|cccccc} & x_{1}x_{3} & x_{2}x_{3} & x_{1}x_{4} & x_{2}x_{4} & x_{1}x_{5} & x_{2}x_{5} \\ \hline G(1,\cdot) & 1/3 & 4/3 & 4/3 & -2/3 & 5/3 & -4/3 \\ G(2,\cdot) & 2/3 & -1/3 & -1/3 & 5/3 & -2/3 & 7/3 \end{array} $$
(3.5)

The generating polynomials in (3.1) are

$$ \begin{array}{@{}rcl@{}} \varphi_{13}[G](x) &=& \frac{1}{3}x_{1}+\frac{2}{3}x_{2}-x_{1}x_{3},\quad \varphi_{23}[G](x) = \frac{4}{3}x_{1}-\frac{1}{3}x_{2}-x_{2}x_{3},\\ \varphi_{14}[G](x) &=& \frac{4}{3}x_{1}-\frac{1}{3}x_{2}-x_{1}x_{4},\quad \varphi_{24}[G](x) = -\frac{2}{3}x_{1}+\frac{5}{3}x_{2}-x_{2}x_{4},\\ \varphi_{15}[G](x) &=& \frac{5}{3}x_{1}-\frac{2}{3}x_{2}-x_{1}x_{5},\quad \varphi_{25}[G](x) = -\frac{4}{3}x_{1}+\frac{7}{3}x_{2}-x_{2}x_{5}. \end{array} $$

The above generating polynomials can be written in the form

$$ \left[\begin{array}{c} \varphi_{1j}[G](x)\\ \varphi_{2j}[G](x) \end{array}\right] = N_{j}\left[\begin{array}{c} x_{1} \\ x_{2} \end{array}\right] - x_{j} \left[\begin{array}{c} x_{1}\\ x_{2} \end{array}\right]\quad \text{for } j=3,4,5. $$

For x to be a common zero of φ1j[G](x) and φ2j[G](x), the subvector (x1,x2) must be an eigenvector of Nj with corresponding eigenvalue xj.
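The construction (3.4) and the eigenvector property above can be reproduced with a few lines of NumPy. The following sketch (not part of the paper) rebuilds N3, N4, N5 of Example 3.2 and checks that each vi = (ui)1:2 is an eigenvector of Nl with eigenvalue (ui)l.

```python
import numpy as np

# Sketch: rebuild N_3, N_4, N_5 of Example 3.2 via (3.4) and check the
# eigenvector relation N_l v_i = (u_i)_l v_i, with v_i = (u_i)_{1:2}.
u1 = np.array([1.0, 1, 1, 1, 1])          # u_1
u2 = np.array([-1.0, 2, -1, 2, 3])        # u_2
r = 2
V = np.column_stack([u1[:r], u2[:r]])     # V = [v_1 v_2]

for l in range(r, 5):                     # Python index l corresponds to label l+1 = 3, 4, 5
    N_l = V @ np.diag([u1[l], u2[l]]) @ np.linalg.inv(V)
    print(np.round(N_l, 4))               # matches N_3, N_4, N_5 in Example 3.2
    assert np.allclose(N_l @ V[:, 0], u1[l] * V[:, 0])
    assert np.allclose(N_l @ V[:, 1], u2[l] * V[:, 1])
```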

3.1 Computing the Tensor Decomposition

We show how to find an incomplete tensor decomposition (3.3) for \(\mathcal {F}\) when only its subtensor \(\mathcal {F}_{{{\varOmega }}}\) is given, where the label set Ω is as in (1.4). Suppose that there exists the decomposition (3.3) for \(\mathcal {F}\), for vectors \(u_{i} \in \mathbb {C}^{n}\) and nonzero scalars \(\lambda _{i} \in \mathbb {C}\). Assume the subvectors (u1)1:r,…,(ur)1:r are linearly independent, so there is a unique generating matrix G for \(\mathcal {F}\), by Theorem 3.1.

For each \(\alpha = e_{i} + e_{j} \in {\mathscr{B}}_{1}\) with i ∈ [r],j ∈ [r + 1,n] and for each

$$ l=r+1,\ldots,j-1,j+1,\ldots,n, $$

the generating matrix G satisfies the equations

$$ \left\langle x_{l} \left( \sum\limits_{k =1}^{r}G(k, e_{i}+e_{j})x_{k} - x_{i} x_{j} \right),\mathcal{F} \right\rangle = \sum\limits_{k =1}^{r}G(k, e_{i}+e_{j}) \mathcal{F}_{0kl} - \mathcal{F}_{ijl} = 0. $$
(3.6)

Let the matrix \(A_{ij}[\mathcal {F}]\in \mathbb {C}^{(n-r-1)\times r}\) and the vector \(b_{ij}[\mathcal {F}]\in \mathbb {C}^{n-r-1}\) be such that

$$ A_{ij}[\mathcal{F}] := \left[\begin{array}{ccc} \mathcal{F}_{0,1,r+1} & {\cdots} & \mathcal{F}_{0,r,r+1} \\ {\vdots} & {\ddots} & {\vdots} \\ \mathcal{F}_{0,1,j-1} & {\cdots} & \mathcal{F}_{0,r,j-1} \\ \mathcal{F}_{0,1,j+1} & {\cdots} & \mathcal{F}_{0,r,j+1} \\ {\vdots} & {\ddots} & {\vdots} \\ \mathcal{F}_{0,1,n} & {\cdots} & \mathcal{F}_{0,r,n} \end{array}\right], \quad b_{ij}[\mathcal{F}] := \left[\begin{array}{c} \mathcal{F}_{i,j,r+1}\\ {\vdots} \\ \mathcal{F}_{i,j,j-1}\\ \mathcal{F}_{i,j,j+1}\\ {\vdots} \\ \mathcal{F}_{i,j,n} \end{array}\right]. $$
(3.7)

For clarity, commas are inserted to separate the labeling numbers of the tensor entries of \(\mathcal {F}\).

The equations in (3.6) can be equivalently written as

$$ A_{ij}[\mathcal{F}] \cdot G(:, e_{i}+e_{j}) = b_{ij}[\mathcal{F}]. $$
(3.8)
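The following is a short NumPy sketch (not from the paper; the dimensions, random data, and index choices are illustrative) of how a column G(:, ei + ej) is obtained from the known entries: it assembles \(A_{ij}[\mathcal {F}]\) and \(b_{ij}[\mathcal {F}]\) as in (3.7), solves (3.8) by least squares, and checks the identity φij[G](us) = 0.

```python
import numpy as np

# Sketch: assemble A_ij[F], b_ij[F] as in (3.7) from entries with distinct
# labels and solve (3.8) for one column G(:, e_i + e_j).
rng = np.random.default_rng(1)
d, r = 8, 3
n = d - 1
lam = rng.uniform(0.5, 2.0, r)
U = rng.standard_normal((r, n))                  # rows are u_1, ..., u_r
P = np.hstack([np.ones((r, 1)), U])              # rows are (1, u_s)
F = np.einsum('s,sa,sb,sc->abc', lam, P, P, P)   # the decomposition (3.3)

i, j = 1, r + 1                                  # i in [r], j in [r+1, n]
rows = [l for l in range(r + 1, n + 1) if l != j]
A = np.array([[F[0, k, l] for k in range(1, r + 1)] for l in rows])
b = np.array([F[i, j, l] for l in rows])         # all labels (0,k,l) and (i,j,l) distinct
g, *_ = np.linalg.lstsq(A, b, rcond=None)        # G(:, e_i + e_j)

# the generating polynomial phi_ij[G] vanishes at every u_s, as in Theorem 3.1
for s in range(r):
    assert np.isclose(g @ U[s, :r], U[s, i - 1] * U[s, j - 1])
```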

If the rank \(r\le \frac {d}{2}-1\), then n − r − 1 = d − r − 2 ≥ r. Thus, the number of rows is not less than the number of columns for the matrices \(A_{ij}[\mathcal {F}]\). If \(A_{ij}[\mathcal {F}]\) has linearly independent columns, then (3.8) uniquely determines G(:,α). In such a case, the matrix G can be fully determined by the linear systems (3.8). Let \(N_{r+1}(G),\ldots ,N_{n}(G) \in \mathbb {C}^{r\times r}\) be the matrices given as

$$ N_{l}(G) = \left[\begin{array}{ccc} G(1,e_{1}+e_{l}) & {\cdots} &G(r,e_{1}+e_{l}) \\ {\vdots} & {\ddots} & {\vdots} \\ G(1,e_{r}+e_{l}) & {\cdots} &G(r,e_{r}+e_{l}) \end{array}\right],\quad l=r+1,\ldots, n. $$
(3.9)

As in the proof of Theorem 3.1, one can see that

$$ N_{l}(G) \left[\begin{array}{c} (u_{i})_{1} \\ {\vdots} \\(u_{i})_{r} \end{array}\right] = (u_{i})_{l} \cdot \left[\begin{array}{c} (u_{i})_{1} \\ {\vdots} \\(u_{i})_{r} \end{array}\right], \quad l=r+1,\ldots, n. $$

The above is equivalent to the equations

$$ N_{l}(G)v_{i} = (w_{i})_{l-r} \cdot v_{i}, \quad l=r+1,\ldots, n, $$

for the vectors (i = 1,…,r)

$$ v_{i} := (u_{i})_{1:r}, \quad w_{i} := (u_{i})_{r+1:n}. $$
(3.10)

Each vi is a common eigenvector of the matrices Nr+1(G),…,Nn(G), and \((w_{i})_{l-r}\) is the associated eigenvalue of Nl(G). These matrices may or may not have repeated eigenvalues, and a single Nl(G) with a repeated eigenvalue does not determine its eigenvectors uniquely. Therefore, we select a generic vector \(\xi := (\xi _{r+1},\dots ,\xi _{n})\) and let

$$ N(\xi) := \xi_{r+1}N_{r+1}(G)+\cdots+\xi_{n}N_{n}(G). $$
(3.11)

The eigenvalues of N(ξ) are ξTw1,…,ξTwr. When w1,…,wr are distinct from each other and ξ is generic, the matrix N(ξ) does not have a repeated eigenvalue and hence it has unique eigenvectors v1,…,vr, up to scaling. Let \(\tilde {v}_{1},\ldots ,\tilde {v}_{r}\) be unit length eigenvectors of N(ξ). They are also common eigenvectors of Nr+ 1(G),…,Nn(G). For each i = 1,…,r, let \(\tilde {w}_{i}\) be the vector such that its j th entry \((\tilde {w}_{i})_{j}\) is the eigenvalue of Nj+r(G), associated to the eigenvector \(\tilde {v}_{i}\), or equivalently,

$$ \tilde{w}_{i}=\left( \tilde{v}_{i}^{H} N_{r+1}(G)\tilde{v}_{i},\dots, \tilde{v}_{i}^{H} N_{n}(G)\tilde{v}_{i}\right),\quad i=1,\ldots,r. $$
(3.12)

Up to a permutation of \((\tilde {v}_{1},\ldots , \tilde {v}_{r})\), there exist scalars γi such that

$$ v_{i} = \gamma_{i} \tilde{v}_{i}, \quad w_{i} = \tilde{w}_{i}. $$
(3.13)

The tensor decomposition of \(\mathcal {F}\) can also be written as

$$ \mathcal{F} = \lambda_{1} \left[\begin{array}{c} 1 \\ \gamma_{1} \tilde{v}_{1} \\ \tilde{w}_{1} \end{array}\right]^{\otimes 3} + {\cdots} +\lambda_{r} \left[\begin{array}{c} 1 \\ \gamma_{r} \tilde{v}_{r} \\ \tilde{w}_{r} \end{array}\right]^{\otimes 3}. $$

The scalars \(\lambda _{1},\dots ,\lambda _{r}\) and \( \gamma _{1},\dots ,\gamma _{r}\) satisfy the linear equations

$$ \begin{array}{@{}rcl@{}} \lambda_{1}\gamma_{1} \tilde{v}_{1} \otimes\tilde{w}_{1} +\cdots+ {\lambda_{r}}{\gamma_{r}} \tilde{v}_{r} \otimes \tilde{w}_{r} &=& \mathcal{F}_{[0,1:r,r+1:n]}, \\ \lambda_{1}{\gamma_{1}^{2}}\tilde{v}_{1}\otimes \tilde{v}_{1} \otimes \tilde{w}_{1}+\cdots+\lambda_{r}{\gamma_{r}^{2}} \tilde{v}_{r}\otimes \tilde{v}_{r}\otimes \tilde{w}_{r} &=&\mathcal{F}_{[1:r,1:r,r+1:n]} . \end{array} $$

Denote the label sets

$$ J_{1} := \left\{ (0,i_{1},i_{2}):~ i_{1} \in [r],~ i_{2} \in [r+1,n] \right\}, \qquad J_{2} := \left\{ (i_{1},i_{2},i_{3}):~ i_{1},i_{2} \in [r],~ i_{1} \ne i_{2},~ i_{3} \in [r+1,n] \right\}. $$
(3.14)

To determine the scalars λi, γi, we can solve the linear least squares

$$ \begin{array}{@{}rcl@{}} &&\underset{(\beta_{1},\ldots,\beta_{r})}{\min} \left\|\mathcal{F}_{J_{1}} - \sum\limits_{i=1}^{r} \beta_{i} \cdot \tilde{v}_{i} \otimes \tilde{w}_{i} \right\|^{2}, \end{array} $$
(3.15)
$$ \begin{array}{@{}rcl@{}} && \underset{(\theta_{1},\ldots,\theta_{r})}{\min} \left\|\mathcal{F}_{J_{2}} - \sum\limits_{k=1}^{r}\theta_{k} \cdot (\tilde{v}_{k} \otimes \tilde{v}_{k} \otimes \tilde{w}_{k})_{J_{2}} \right\|^{2}. \end{array} $$
(3.16)

Let \((\beta _{1}^{\ast },\ldots ,\beta _{r}^{\ast })\), \((\theta _{1}^{\ast },\ldots ,\theta _{r}^{\ast })\) be minimizers of (3.15) and (3.16) respectively. Then, for each i = 1,…,r, let

$$ \lambda_{i} := (\beta_{i}^{\ast})^{2}/\theta_{i}^{\ast}, \quad \gamma_{i} := \theta_{i}^{\ast}/\beta_{i}^{\ast}. $$
(3.17)

For the vectors (i = 1,…,r)

$$ p_{i} := \sqrt[3]\lambda_{i}(1,\gamma_{i} \tilde{v}_{i},\tilde{w}_{i}), $$

the sum \(p_{1}^{\otimes 3}+ {\cdots } +p_{r}^{\otimes 3}\) is a tensor decomposition for \(\mathcal {F}\). This is justified in the following theorem.

Theorem 3.3

Suppose the tensor \(\mathcal {F}\) has the decomposition as in (3.3). Assume that the vectors v1,…,vr are linearly independent and the vectors w1,…,wr are distinct from each other, where v1,…,vr,w1,…,wr are defined as in (3.10). Let ξ be a generically chosen coefficient vector and let p1,…,pr be the vectors produced as above. Then, the tensor decomposition \(\mathcal {F} = p_{1}^{\otimes 3}+ {\cdots } +p_{r}^{\otimes 3}\) is unique.

Proof

Since v1,…,vr are linearly independent, the tensor decomposition (3.3) is unique, up to scalings and permutations. By Theorem 3.1, there is a unique generating matrix G for \(\mathcal {F}\) satisfying (3.2). Under the given assumptions, (3.8) uniquely determines G. Note that ξTw1,…,ξTwr are the eigenvalues of N(ξ) and v1,…,vr are the corresponding eigenvectors. When ξ is generically chosen, the values of ξTw1,…,ξTwr are distinct eigenvalues of N(ξ). So N(ξ) has unique eigenvalue decompositions, and hence (3.13) must hold, up to a permutation of (v1,…,vr). Since the coefficient matrices have full column ranks, the linear least squares problems have unique optimal solutions. Up to a permutation of p1,…,pr, it holds that \(p_{i} = \sqrt [3]{\lambda _{i}} \left [\begin {array}{c} 1 \\ u_{i} \end {array}\right ]\). Then, the conclusion follows readily. □

The following is the algorithm for computing an incomplete tensor decomposition for \(\mathcal {F}\) when only its subtensor \(\mathcal {F}_{{{\varOmega }}}\) is given.

Algorithm 3.4

(Incomplete symmetric tensor decompositions)

  • Input: A third order symmetric subtensor \({\mathcal {F}}_{{{\varOmega }}}\) and the rank \(r =\text {rank}_{S}(\mathcal {F})\le \frac {d}{2}-1\).

  • Step 1: Determine the matrix G by solving (3.8) for each \(\alpha =e_{i}+e_{j} \in {\mathscr{B}}_{1}\).

  • Step 2: Let N(ξ) be the matrix as in (3.11), for a randomly selected vector ξ. Compute the unit length eigenvectors \(\tilde {v}_{1},\ldots ,\tilde {v}_{r}\) of N(ξ) and choose \(\tilde {w}_{i}\) as in (3.12).

  • Step 3: Solve the linear least squares (3.15) and (3.16) to get the coefficients λi, γi as in (3.17).

  • Step 4: For each i = 1,…,r, let \(p_{i} := \sqrt [3]{ \lambda _{i}}(1, \gamma _{i} \tilde {v}_{i}, \tilde {w}_{i})\).

  • Output: The tensor decomposition \(\mathcal {F} = (p_{1})^{\otimes 3}+\cdots +(p_{r})^{\otimes 3}\).
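The following is a minimal NumPy sketch of Algorithm 3.4 (not the paper's code; the function name, indexing conventions, and random seed are ours). It assumes exact tensor entries, generic data, and 2 ≤ r ≤ d/2 − 1, and it reads only the entries of \(\mathcal {F}\) with three distinct labels.

```python
import numpy as np

def incomplete_sym_decomp(F_known, r, seed=0):
    """Sketch of Algorithm 3.4; assumes exact entries, generic data, 2 <= r <= d/2 - 1.
    F_known is d x d x d; only entries with three distinct labels are read.
    Returns an r x d array whose rows p_1, ..., p_r satisfy F ~ sum_i p_i^{(x)3}."""
    d = F_known.shape[0]
    n = d - 1
    # Step 1: solve (3.8) for every column G(:, e_i + e_j) and form N_l(G) as in (3.9)
    N = {l: np.zeros((r, r), dtype=complex) for l in range(r + 1, n + 1)}
    for j in range(r + 1, n + 1):
        rows = [l for l in range(r + 1, n + 1) if l != j]
        A = np.array([[F_known[0, k, l] for k in range(1, r + 1)] for l in rows])
        for i in range(1, r + 1):
            b = np.array([F_known[i, j, l] for l in rows])
            g, *_ = np.linalg.lstsq(A, b, rcond=None)
            N[j][i - 1, :] = g
    # Step 2: eigenvectors of a generic combination N(xi), as in (3.11)-(3.12)
    xi = np.random.default_rng(seed).standard_normal(n - r)
    Nxi = sum(xi[l - r - 1] * N[l] for l in range(r + 1, n + 1))
    _, Vt = np.linalg.eig(Nxi)                   # columns are unit eigenvectors v~_i
    W = np.array([[Vt[:, i].conj() @ N[l] @ Vt[:, i] for l in range(r + 1, n + 1)]
                  for i in range(r)])            # row i is w~_i
    # Step 3: least squares (3.15)-(3.16); beta_i = lambda_i*gamma_i, theta_i = lambda_i*gamma_i^2
    B1 = np.stack([np.outer(Vt[:, i], W[i]).ravel() for i in range(r)], axis=1)
    rhs1 = F_known[0, 1:r + 1, r + 1:n + 1].astype(complex).ravel()
    beta, *_ = np.linalg.lstsq(B1, rhs1, rcond=None)
    pairs = [(a, b) for a in range(r) for b in range(r) if a != b]
    B2 = np.stack([np.concatenate([Vt[a, i] * Vt[b, i] * W[i] for (a, b) in pairs])
                   for i in range(r)], axis=1)
    rhs2 = np.concatenate([F_known[a + 1, b + 1, r + 1:n + 1] for (a, b) in pairs]).astype(complex)
    theta, *_ = np.linalg.lstsq(B2, rhs2, rcond=None)
    lam, gam = beta ** 2 / theta, theta / beta   # as in (3.17)
    # Step 4: p_i = lambda_i^{1/3} * (1, gamma_i v~_i, w~_i)
    return np.array([(lam[i] ** (1 / 3)) * np.concatenate(([1.0 + 0j], gam[i] * Vt[:, i], W[i]))
                     for i in range(r)])
```

Applied to the tensor of Example 3.2 (with the unknown entries left as zeros), this sketch should recover the decomposition of Example 3.5, up to a permutation of the components and cube roots of unity.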

The following is an example of applying Algorithm 3.4.

Example 3.5

Consider the same tensor \(\mathcal {F}\) as in Example 3.2. The monomial sets \({\mathscr{B}}_{0}\), \({\mathscr{B}}_{1}\) are the same. The matrices \(A_{ij}[\mathcal {F}]\) and vectors \(b_{ij}[\mathcal {F}]\) are

$$ \begin{array}{@{}rcl@{}} A_{13}[\mathcal{F}] &=& A_{23}[\mathcal{F}]= \left[\begin{array}{cc} -0.8 & 2.8\\ -1.4 & 4 \end{array}\right], \qquad b_{13}[\mathcal{F}]=\left[\begin{array}{c}1.6\\2.2 \end{array}\right],\quad~~~ b_{23}[\mathcal{F}]=\left[\begin{array}{c}-2\\-3.2 \end{array}\right],\\ A_{14}[\mathcal{F}] &=& A_{24}[\mathcal{F}]= \left[\begin{array}{cc} 1 & -0.8\\ -1.4 & 4 \end{array}\right], \quad~ b_{14}[\mathcal{F}]=\left[\begin{array}{c}1.6\\-3.2 \end{array}\right],\quad b_{24}[\mathcal{F}]=\left[\begin{array}{c}-2\\7.6 \end{array}\right],\\ A_{15}[\mathcal{F}] &=& A_{25}[\mathcal{F}]= \left[\begin{array}{cc} 1 & -0.8\\ -0.8 & 2.8 \end{array}\right], \quad~ b_{15}[\mathcal{F}] = \left[\begin{array}{c}2.2\\-3.2 \end{array}\right],\quad~ b_{25}[\mathcal{F}] = \left[\begin{array}{c}-3.2\\7.6 \end{array}\right]. \end{array} $$

Solving (3.8), we obtain G, which is the same as in (3.5). The matrices N3(G), N4(G), N5(G) are

$$ N_{3}(G)=\left[\begin{array}{cc} 1/3 & 2/3\\4/3 & -1/3 \end{array}\right],\quad N_{4}(G)=\left[\begin{array}{cc} 4/3 & -1/3\\-2/3 & 5/3 \end{array}\right],\quad N_{5}(G)=\left[\begin{array}{cc} 5/3 & -2/3\\-4/3 & 7/3 \end{array}\right]. $$

Choose a generic ξ, say, ξ = (3,4,5), then

$$ N(\xi) = \left[\begin{array}{cc}1/\sqrt{2} & -1/\sqrt{5} \\ 1/\sqrt{2} & 2/\sqrt{5} \end{array}\right] \left[\begin{array}{cc} 12 & 0 \\ 0 & 20 \end{array}\right] \left[\begin{array}{cc}1/\sqrt{2} & -1/\sqrt{5} \\ 1/\sqrt{2} & 2/\sqrt{5} \end{array}\right]^{-1}. $$

The unit length eigenvectors are

$$ \tilde{v}_{1} = (1/\sqrt{2},1/\sqrt{2}), \quad \tilde{v}_{2}=(-1/\sqrt{5},2/\sqrt{5}) . $$

As in (3.12), we get the vectors

$$ \tilde{w}_{1} = (1,1,1),\quad \tilde{w}_{2} = (-1,2,3). $$

Solving (3.15) and (3.16), we get the scalars

$$ \gamma_{1}=\sqrt{2}, \quad \gamma_{2}=\sqrt{5}, \quad \lambda_{1}=0.4, \quad \lambda_{2} = 0.6. $$

This produces the decomposition \(\mathcal {F}=\lambda _{1}u_{1}^{\otimes 3}+\lambda _{2}u_{2}^{\otimes 3}\) for the vectors

$$ u_{1}=(1,\gamma_{1}\tilde{v}_{1},\tilde{w}_{1})=(1,1,1,1,1,1), \quad u_{2}=(1,\gamma_{2}\tilde{v}_{2},\tilde{w}_{2})=(1,-1,2,-1,2,3). $$

Remark 3.6

Algorithm 3.4 requires the value of r. This is generally a hard question. In computational practice, one can estimate the value of r as follows. Let \(\text {Flat}(\mathcal {F})\in \mathbb {C}^{(n+1) \times (n+1)^{2}}\) be the flattening matrix, labelled by (i,(j,k)) such that

$$ \text{Flat}(\mathcal{F})_{i,(j,k)} = \mathcal{F}_{ijk} $$

for all i,j,k = 0,1,…,n. The rank of \(\text {Flat}(\mathcal {F})\) equals the rank of \(\mathcal {F}\) when the vectors p1,…,pr are linearly independent. The rank of \(\text {Flat}(\mathcal {F})\) is not directly available since only the subtensor \((\mathcal {F})_{{{\varOmega }}}\) is known. However, we can calculate the ranks of submatrices of \(\text {Flat}(\mathcal {F})\) whose entries are known. If the tensor \(\mathcal {F}\) as in (3.3) is such that both the sets {v1,…,vr} and {w1,…,wr} are linearly independent, one can see that \({\sum }_{i=1}^{r} \lambda _{i} v_{i}{w_{i}^{T}}\) is a known submatrix of \(\text {Flat}(\mathcal {F})\) whose rank is r. This is generally the case if \(r\le \frac {d}{2}-1\), since vi has length r and wi has length d − 1 − r ≥ r. Therefore, the known submatrices of \(\text {Flat}(\mathcal {F})\) are generally sufficient to estimate \(\text {rank}_{S}(\mathcal {F})\). For instance, consider the case \(\mathcal {F}\in \text {S}^{3}(\mathbb {C}^{7})\). The entries \(\mathcal {F}_{ij0}\) of \(\text {Flat}(\mathcal {F})\) form the matrix

$$ \left[\begin{array}{ccccccc} \ast & \ast & \ast & \ast & \ast & \ast & \ast\\ \ast & \ast & \mathcal{F}_{120} & \mathcal{F}_{130} & \mathcal{F}_{140} & \mathcal{F}_{150} & \mathcal{F}_{160}\\ \ast & \mathcal{F}_{210} & \ast & \mathcal{F}_{230} & \mathcal{F}_{240} & \mathcal{F}_{250} & \mathcal{F}_{260}\\ \ast & \mathcal{F}_{310} & \mathcal{F}_{320} & \ast & \mathcal{F}_{340} & \mathcal{F}_{350} & \mathcal{F}_{360}\\ \ast & \mathcal{F}_{410} & \mathcal{F}_{420} & \mathcal{F}_{430} & \ast & \mathcal{F}_{450} & \mathcal{F}_{460}\\ \ast & \mathcal{F}_{510} & \mathcal{F}_{520} & \mathcal{F}_{530} & \mathcal{F}_{540} & \ast & \mathcal{F}_{560}\\ \ast & \mathcal{F}_{610} & \mathcal{F}_{620} & \mathcal{F}_{630} & \mathcal{F}_{640} & \mathcal{F}_{650} & \ast \end{array}\right], $$

where each ∗ denotes an entry that is not given. The largest submatrices with fully known entries are

$$ \left[\begin{array}{ccc} \mathcal{F}_{410} & \mathcal{F}_{420} & \mathcal{F}_{430}\\ \mathcal{F}_{510} & \mathcal{F}_{520} & \mathcal{F}_{530}\\ \mathcal{F}_{610} & \mathcal{F}_{620} & \mathcal{F}_{630} \end{array}\right], \quad \left[\begin{array}{ccc} \mathcal{F}_{140} & \mathcal{F}_{150} & \mathcal{F}_{160}\\ \mathcal{F}_{240} & \mathcal{F}_{250} & \mathcal{F}_{260}\\ \mathcal{F}_{340} & \mathcal{F}_{350} & \mathcal{F}_{360} \end{array}\right]. $$

The ranks of the above matrices generally equal \(\text {rank}_{S}(\mathcal {F})\) if \(r\le \frac {d}{2}-1 = 2.5\), i.e., if r ≤ 2.
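A minimal sketch of this rank estimation (not from the paper; the submatrix choice and test data are illustrative, and indices are 0-based) computes the numerical rank of one known submatrix of \(\text {Flat}(\mathcal {F})\):

```python
import numpy as np

def estimate_rank(F_known, tol=1e-8):
    """Estimate rank_S(F) from the known submatrix [F_{i,j,0}], 1 <= i <= h < j <= d-1."""
    d = F_known.shape[0]
    h = (d - 1) // 2
    block = F_known[1:h + 1, h + 1:, 0]          # all labels (i, j, 0) are distinct
    s = np.linalg.svd(block, compute_uv=False)
    return int(np.sum(s > tol * s[0]))

# quick check on a random rank-3 symmetric tensor with d = 9
rng = np.random.default_rng(2)
P = np.hstack([np.ones((3, 1)), rng.standard_normal((3, 8))])
F = np.einsum('sa,sb,sc->abc', P, P, P)
print(estimate_rank(F))                          # prints 3
```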

4 Tensor Approximations and Stability Analysis

In some applications, we do not have the subtensor \(\mathcal {F}_{{{\varOmega }}}\) exactly but only have an approximation \(\widehat {\mathcal {F}}_{{{\varOmega }}}\) for it. Algorithm 3.4 can still provide a good rank-r approximation for \(\mathcal {F}\) when it is applied to \(\widehat {\mathcal {F}}_{{{\varOmega }}}\). We define the matrix \(A_{ij}[\widehat {\mathcal {F}}]\) and the vector \(b_{ij}[\widehat {\mathcal {F}}]\) in the same way as in (3.7), for each \(\alpha = e_{i}+e_{j} \in {\mathscr{B}}_{1}\). The generating matrix G for \(\mathcal {F}\) can be approximated by solving the linear least squares

$$ \underset{g \in \mathbb{C}^{r} }{\min} \|A_{ij}[\widehat{\mathcal{F}}] \cdot g - b_{ij}[\widehat{\mathcal{F}}]\|^{2}, $$
(4.1)

for each \(\alpha =e_{i}+e_{j} \in {\mathscr{B}}_{1}\). Let \(\widehat {G}(:,e_{i}+e_{j})\) be the optimizer of the above and \(\widehat {G}\) be the matrix consisting of all such \(\widehat {G}(:,e_{i}+e_{j})\). Then \(\widehat {G}\) is an approximation for G. For each l = r + 1,…,n, define the matrix \(N_{l}(\widehat {G})\) similarly as in (3.9). Choose a generic vector ξ = (ξr+ 1,…,ξn) and let

$$ \widehat{N}(\xi) := \xi_{r+1} N_{r+1}(\widehat{G}) +{\cdots} +\xi_{n}N_{n}(\widehat{G}). $$
(4.2)

The matrix \(\widehat {N}(\xi )\) is an approximation for N(ξ). Let \(\hat {v}_{1},\ldots ,\hat {v}_{r}\) be unit length eigenvectors of \(\widehat {N}(\xi )\). For k = 1,…,r, let

$$ \hat{w}_{k} := \left( (\hat{v}_{k})^{H} N_{r+1}(\widehat{G})\hat{v}_{k},\ldots, (\hat{v}_{k})^{H} N_{n}(\widehat{G}) \hat{v}_{k} \right). $$
(4.3)

For the label sets J1, J2 as in (3.14), the subtensors \(\widehat {\mathcal {F}}_{J_{1}}, \widehat {\mathcal {F}}_{J_{2}}\) are defined in the same way as \(\mathcal {F}_{J_{1}}\), \(\mathcal {F}_{J_{2}}\). Consider the following linear least squares problems

$$ \begin{array}{@{}rcl@{}} &&\underset{(\beta_{1},\ldots,\beta_{r})}{\min} \left\|\widehat{\mathcal{F}}_{J_{1}} - \sum\limits_{k=1}^{r} \beta_{k} \cdot \hat{v}_{k}\otimes \hat{w}_{k} \right\|^{2}, \end{array} $$
(4.4)
$$ \begin{array}{@{}rcl@{}} &&\underset{(\theta_{1},\ldots,\theta_{r})}{\min} \left\|\widehat{\mathcal{F}}_{J_{2}} - \sum\limits_{k=1}^{r}\theta_{k} \cdot (\hat{v}_{k} \otimes \hat{v}_{k} \otimes \hat{w}_{k})_{J_{2}} \right\|^{2}. \end{array} $$
(4.5)

Let \((\hat {\beta }_{1}, \ldots , \hat {\beta }_{r})\) and \((\hat {\theta }_{1}, \ldots , \hat {\theta }_{r})\) be their optimizers respectively. For each k = 1,…,r, let

$$ \hat{\lambda}_{k} := (\hat{\beta}_{k})^{2}/\hat{\theta}_{k}, \quad \hat{\gamma}_{k} := \hat{\theta_{k}}/\hat{\beta}_{k}. $$

This results in the tensor approximation

$$ \mathcal{F} \approx (\hat{p}_{1})^{\otimes 3}+\cdots+(\hat{p}_{r} )^{\otimes 3}, $$

for the vectors \(\hat {p}_{k} := \sqrt [3]{ \hat {\lambda }_{k}}(1, \hat {\gamma }_{k}\hat {v}_{k}, \hat {w}_{k})\). The above may not give an optimal tensor approximation. To get an improved one, we can use \(\hat {p}_{1},\ldots ,\hat {p}_{r}\) as starting points to solve the following nonlinear optimization

$$ \underset{(q_{1},\ldots,q_{r})}{\min} \left\| \left( \sum\limits_{k=1}^{r} (q_{k})^{\otimes 3} - \widehat{\mathcal{F}}\right)_{{{\varOmega}}} \right\|^{2}. $$
(4.6)

The minimizer of the optimization (4.6) is denoted as \((p_{1}^{\ast },\ldots ,p_{r}^{\ast })\).

Summarizing the above, we have the following algorithm for computing a tensor approximation.

Algorithm 4.1

(Incomplete symmetric tensor approximations)

  • Input: A third order symmetric subtensor \(\widehat {\mathcal {F}}_{{{\varOmega }}}\) and a rank \(r\le \frac {d}{2}-1\).

  • Step 1: Find the matrix \(\widehat {G}\) by solving (4.1) for each \(\alpha =e_{i}+e_{j} \in {\mathscr{B}}_{1}\).

  • Step 2: Choose a generic vector ξ and let \(\widehat {N}(\xi )\) be the matrix as in (4.2). Compute unit length eigenvectors \(\hat {v}_{1},\ldots ,\hat {v}_{r}\) of \(\widehat {N}(\xi )\) and define \(\hat {w}_{i}\) as in (4.3).

  • Step 3: Solve the linear least squares (4.4), (4.5) to get the coefficients \(\hat {\lambda }_{i}, \hat {\gamma }_{i}\).

  • Step 4: For each i = 1,…,r, let \(\hat {p}_{i} := \sqrt [3]{ \hat {\lambda }_{i}}(1,\hat {\gamma }_{i} \hat {v}_{i},\hat {w}_{i})\). Then \((\hat {p}_{1})^{\otimes 3}+\cdots +(\hat {p}_{r})^{\otimes 3}\) is a tensor approximation for \(\widehat {\mathcal {F}}\).

  • Step 5: Use \(\hat {p}_{1},\ldots , \hat {p}_{r}\) as starting points to solve the nonlinear optimization (4.6) for an optimizer \((p_{1}^{\ast },\ldots ,p_{r}^{\ast })\).

  • Output: The tensor approximation \((p_{1}^{\ast })^{\otimes 3}+\cdots +(p_{r}^{\ast })^{\otimes 3}\) for \(\widehat {\mathcal {F}}\).
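The refinement in Step 5 can be carried out with a standard nonlinear least squares solver. The following is a sketch (not the paper's implementation; it uses scipy.optimize.least_squares and assumes real starting points) of minimizing (4.6) over the entries indexed by Ω.

```python
import numpy as np
from scipy.optimize import least_squares

def refine(P0, F_hat):
    """Sketch of Step 5: locally minimize (4.6), starting from the real r x d
    array P0 whose rows are (the real parts of) p^_1, ..., p^_r."""
    r, d = P0.shape
    mask = np.ones((d, d, d), dtype=bool)        # True exactly on the label set Omega
    for a in range(d):
        mask[a, a, :] = mask[a, :, a] = mask[:, a, a] = False

    def residual(x):
        P = x.reshape(r, d)
        T = np.einsum('sa,sb,sc->abc', P, P, P)  # sum_k (q_k)^{(x)3}
        return (T - F_hat)[mask]                 # only the Omega entries enter (4.6)

    sol = least_squares(residual, P0.ravel())
    return sol.x.reshape(r, d)                   # rows are p*_1, ..., p*_r
```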

When \(\widehat {\mathcal {F}}\) is close to \(\mathcal {F}\), Algorithm 4.1 also produces a good rank-r tensor approximation for \(\mathcal {F}\). This is shown in the following.

Theorem 4.2

Suppose the tensor \(\mathcal {F} = (p_{1})^{\otimes 3}+\cdots +(p_{r})^{\otimes 3}\), with \(r \le \frac {d}{2}-1\), satisfies the following conditions:

  (i) The leading entry of each pi is nonzero;

  (ii) the subvectors (p1)2:r+1,…,(pr)2:r+1 are linearly independent;

  (iii) the subvectors (p1)[r+2:j,j+2:d],…,(pr)[r+2:j,j+2:d] are linearly independent for each j ∈ [r + 1,n];

  (iv) the eigenvalues of the matrix N(ξ) in (3.11) are distinct from each other.

Let \(\hat {p}_{i}\), \(p_{i}^{\ast }\) be the vectors produced by Algorithm 4.1. If the distance \(\varepsilon := \|(\widehat {\mathcal {F}}-\mathcal {F})_{{{\varOmega }}}\|\) is small enough, then there exist scalars \(\hat {\tau }_{i}\), \(\tau _{i}^{\ast }\) such that

$$ (\hat{\tau}_{i})^{3} = (\tau_{i}^{\ast})^{3}=1, \quad \|\hat{\tau}_{i} \hat{p}_{i}- p_{i}\| = O(\varepsilon), \quad \|\tau_{i}^{\ast} p^{\ast}_{i}- p_{i}\| = O(\varepsilon), $$

up to a permutation of (p1,…,pr), where the constants inside O(⋅) only depend on \(\mathcal {F}\) and the choice of ξ in Algorithm 4.1.

Proof

The conditions (i)–(ii), by Theorem 3.1, imply that there is a unique generating matrix G for \(\mathcal {F}\). The matrix G can be approximated by solving the linear least squares problems (4.1). Note that

$$ \|A_{ij}[\widehat{\mathcal{F}}] - A_{ij}[\mathcal{F}]\| \leq \varepsilon, \quad \|b_{ij}[\widehat{\mathcal{F}}]-b_{ij}[\mathcal{F}]\|\le \varepsilon, $$

for all \(\alpha =e_{i}+e_{j}\in {\mathscr{B}}_{1}\). The matrix \(A_{ij}[\mathcal {F}]\) can be written as

$$ A_{ij}[\mathcal{F}] = [(p_{1})_{[r+2:j,j+2:d]},\ldots,(p_{r})_{[r+2:j,j+2:d]}]\cdot [(p_{1})_{2:r+1},\ldots,(p_{r})_{2:r+1}]^{T}. $$

By the conditions (ii)–(iii), the matrix \(A_{ij}[\mathcal {F}]\) has full column rank for each j ∈ [r + 1,n] and hence the matrix \(A_{ij}[\widehat {\mathcal {F}}]\) has full column rank when ε is small enough. Therefore, the linear least squares problems (4.1) have unique solutions and the solution \(\widehat {G}\) satisfies that

$$ \|\widehat{G}-{G}\| = O(\varepsilon), $$

where O(ε) depends on \(\mathcal {F}\) (see [15, Theorem 3.4]). For each j = r + 1,…,n, \(N_{j}(\widehat {G})\) is part of the generating matrix \(\widehat {G}\), so

$$ \|N_{j}(\widehat{G})-{N}_{j}(G)\|\le \|\widehat{G}-G\| = O(\varepsilon), \quad j=r+1,\ldots,n. $$

This implies that \(\|\widehat {N}(\xi )-N(\xi )\|=O(\varepsilon )\). When ε is small enough, the matrix \(\widehat {N}(\xi )\) does not have repeated eigenvalues, due to the condition (iv). Thus, the matrix N(ξ) has a set of unit length eigenvectors \(\tilde {v}_{1},\ldots ,\tilde {v}_{r}\) with eigenvalues \(\tilde {w}_{1},\ldots ,\tilde {w}_{r}\) respectively, such that

$$ \|\hat{v}_{i}-\tilde{v}_{i}\| = O(\varepsilon), \quad \|\hat{w}_{i}-\tilde{w}_{i}\| = O(\varepsilon). $$

This follows from Proposition 4.2.1 in [8]. The constants inside the above O(⋅) depend only on \(\mathcal {F}\) and ξ. The vectors \(\tilde {w}_{1},\ldots ,\tilde {w}_{r}\) are scalar multiples of the linearly independent vectors (p1)r+2:d,…,(pr)r+2:d respectively, so \(\tilde {w}_{1},\ldots ,\tilde {w}_{r}\) are linearly independent. When ε is small, \({\hat {w}}_{1},\ldots ,{\hat {w}}_{r}\) are linearly independent as well. The scalars \(\hat {\lambda }_{i} \hat {\gamma }_{i} \) and \(\hat {\lambda }_{i}(\hat {\gamma }_{i})^{2}\) are the optimizers of the linear least squares problems (4.4) and (4.5). By Theorem 3.4 in [15], we have

$$ \|\hat{\lambda}_{i}\hat{\gamma}_{i} - \lambda_{i} \gamma_{i}\| = O(\varepsilon),\quad \|\hat{\lambda}_{i}(\hat{\gamma}_{i})^{2} - \lambda_{i} {\gamma_{i}^{2}}\| = O(\varepsilon). $$

The vector pi can be written as \(p_{i} = \sqrt [3]{\lambda _{i}}(1,\gamma _{i} \tilde {v}_{i},\tilde {w}_{i})\), so we must have λi,γi≠ 0 due to the condition (ii). Thus, it holds that

$$ \|\hat{\lambda}_{i} - {\lambda}_{i}\| = O(\varepsilon),\quad \|\hat{\gamma}_{i} - {\gamma}_{i}\| = O(\varepsilon), $$

where constants inside O(⋅) depend only on \(\mathcal {F}\) and ξ. For the vectors \(\tilde {p}_{i}:=\sqrt [3]{\lambda _{i}}(1,\gamma _{i}\tilde {v}_{i},\tilde {w}_{i})\), we have \(\mathcal {F} = {\sum }_{i=1}^{r} \tilde {p}_{i}^{\otimes 3}\), by Theorem 3.3. Since p1,…,pr are linearly independent by the assumption, the rank decomposition of \(\mathcal {F}\) is unique up to scaling and permutation. There exist scalars \(\hat {\tau }_{i}\) such that \((\hat {\tau }_{i})^{3}=1\) and \(\hat {\tau }_{i} \tilde {p}_{i} = p_{i}\), up to a permutation of p1,…,pr. For \(\hat {p}_{i} = \sqrt [3]{\hat {\lambda }_{i}}(1, \hat {\gamma }_{i} \hat {v}_{i} ,\hat {w}_{i})\), we have \(\|\hat {\tau }_{i} \hat {p}_{i}-p_{i}\|=O(\varepsilon )\), where the constants in O(⋅) only depend on \(\mathcal {F}\) and ξ.

Since \(\| \hat {\tau }_{i} \hat {p}_{i} - p_{i}\|=O(\varepsilon )\), we have \(\|({\sum }_{i=1}^{r} (\hat {p}_{i})^{\otimes 3}-\mathcal {F})_{{{\varOmega }}}\| = O(\varepsilon )\). The \((p_{1}^{\ast },\ldots ,p_{r}^{\ast })\) is a minimizer of (4.6), so

$$ \left\|\left( \sum\limits_{i=1}^{r} (p_{i}^{\ast})^{\otimes 3}-\widehat{\mathcal{F}}\right)_{{{\varOmega}}}\right\| \le \left\|\left( \sum\limits_{i=1}^{r} (\hat{p}_{i})^{\otimes 3}-\widehat{\mathcal{F}}\right)_{{{\varOmega}}}\right\| = O(\varepsilon). $$

For the tensor \(\mathcal {F}^{\ast }:={\sum }_{i=1}^{r} (p_{i}^{\ast })^{\otimes 3}\), we get

$$ \|(\mathcal{F}^{\ast}-\mathcal{F})_{{{\varOmega}}}\|\le\|(\mathcal{F}^{\ast}-\widehat{\mathcal{F}})_{{{\varOmega}}}\|+ \|(\widehat{\mathcal{F}}-\mathcal{F})_{{{\varOmega}}}\|=O(\varepsilon) . $$

When Algorithm 4.1 is applied to \((\mathcal {F}^{\ast })_{{{\varOmega }}}\), Step 4 will give the exact decomposition \(\mathcal {F}^{\ast }={\sum }_{i=1}^{r} (p_{i}^{\ast })^{\otimes 3}\). By repeating the previous argument, we can similarly show that \(\|p_{i}- \tau _{i}^{\ast } p_{i}^{\ast }\|= O(\varepsilon )\) for some \(\tau _{i}^{\ast }\) such that \((\tau _{i}^{\ast })^{3}=1\), where the constants in O(⋅) only depend on \(\mathcal {F}\) and ξ. □

Remark 4.3

For the special case that ε = 0, Algorithm 4.1 is the same as Algorithm 3.4, which produces the exact rank decomposition for \(\mathcal {F}\). The conditions in Theorem 4.2 are satisfied for generic vectors p1,…,pr, since \(r\le \frac {d}{2}-1\). The constant in O(⋅) is not explicitly given in the proof. It is related to the condition number \(\kappa (\mathcal {F})\) for tensor decomposition [5]. It was shown by Breiding and Vannieuwenhoven [5] that

$$ \sqrt{\sum\limits_{i=1}^{r}\|p_{i}^{\otimes 3}-\hat{p}_{i}^{\otimes 3}\|^{2}}\leq\kappa(\mathcal{F})\|\mathcal{F}-\hat{\mathcal{F}}\|+c\varepsilon^{2} $$

for some constant c. The continuity of \(\hat {G}\) in \(\hat {\mathcal {F}}\) is implicit in the proof. Eigenvalues and unit eigenvectors of \(\widehat {N}(\xi )\) are continuous in \(\hat {G}\). Furthermore, \(\hat {\lambda }_{i},\hat {\gamma }_{i}\) are continuous in the eigenvalues and unit eigenvectors. All these functions are locally Lipschitz continuous. Each \(\hat {p}_{i}\) is Lipschitz continuous with respect to \(\hat {\mathcal {F}}\) in a neighborhood of \(\mathcal {F}\), which also implies an error bound for \(\hat {p}_{i}\). The tensors \((p_{i}^{\ast })^{\otimes 3}\) are also locally Lipschitz continuous in \(\widehat {\mathcal {F}}\), as illustrated in [6]. This also gives error bounds for the decomposing vectors \(p_{i}^{\ast }\). We refer to [5, 6] for more details about condition numbers of tensor decompositions.

Example 4.4

We consider the same tensor \(\mathcal {F}\) as in Example 3.2. The subtensor \((\mathcal {F})_{{{\varOmega }}}\) is perturbed to \((\widehat {\mathcal {F}})_{{{\varOmega }}}\). The perturbation is randomly generated from the Gaussian distribution \(\mathcal {N}(0, 0.01)\). For neatness of the paper, we do not display \((\widehat {\mathcal {F}})_{{{\varOmega }}}\) here. We use Algorithm 4.1 to compute the incomplete tensor approximation. The matrices \(A_{ij}[\widehat {\mathcal {F}}]\) and vectors \(b_{ij} [\widehat {\mathcal {F}}]\) are given as follows:

$$ \begin{array}{@{}rcl@{}} A_{13}[\widehat{\mathcal{F}}]&=&A_{23} [\widehat{\mathcal{F}}]=\left[\begin{array}{cc} -0.8135 & 2.7988\\ -1.3697 & 4.0149 \end{array}\right],\qquad b_{13}[\widehat{\mathcal{F}}]=\left[\begin{array}{c}1.5980\\2.1879 \end{array}\right],\quad~~ b_{23}[\widehat{\mathcal{F}}]=\left[\begin{array}{c}-2.0047\\-3.2027 \end{array}\right],\\ A_{14}[\widehat{\mathcal{F}}]&=&A_{24}[\widehat{\mathcal{F}}]=\left[\begin{array}{cc} 1.0277 & -0.8020\\-1.3697 & 4.0149 \end{array}\right],\quad b_{14}[\widehat{\mathcal{F}}]=\left[\begin{array}{c}1.5920\\-3.2013 \end{array}\right],\quad b_{24}[\widehat{\mathcal{F}}]=\left[\begin{array}{c}-2.0059\\7.5915 \end{array}\right],\\ A_{15}[\widehat{\mathcal{F}}]&=&A_{25}[\widehat{\mathcal{F}}]=\left[\begin{array}{cc} 1.0277 & -0.8020\\-0.8135 & 2.7988 \end{array}\right],\quad b_{15}[\widehat{\mathcal{F}}]=\left[\begin{array}{c}2.1993\\-3.2020 \end{array}\right],\quad b_{25}[\widehat{\mathcal{F}}]=\left[\begin{array}{c}-3.1917\\7.6153 \end{array}\right]. \end{array} $$

The linear least square problems (4.1) are solved to obtain \(\widehat {G}\) and \(N_{3}(\widehat {G})\), \(N_{4}(\widehat {G})\), \(N_{5}(\widehat {G})\), which are

$$ \begin{array}{@{}rcl@{}} N_{3}(\widehat{G}) &=& \left[\begin{array}{cc} 0.5156 & 0.7208\\1.6132 & -0.2474 \end{array}\right],\quad N_{4}(\widehat{G})=\left[\begin{array}{cc} 1.2631 & -0.3665\\-0.6489 & 1.6695 \end{array}\right],\\ N_{5}(\widehat{G})&=&\left[\begin{array}{cc} 1.6131 & -0.6752\\-1.2704 & 2.3517 \end{array}\right]. \end{array} $$

For ξ = (3,4,5), the eigendecomposition of the matrix \(\widehat {N}(\xi )\) in (4.2) is

$$ \widehat{N}(\xi) = \left[\begin{array}{cc} -0.7078 & 0.4470 \\ -0.7064 & -0.8945 \end{array}\right]\left[\begin{array}{cc}12.0343 & 0 \\ 0 & 20.0786 \end{array}\right] \left[\begin{array}{cc}-0.7524 & 0.4499 \\ -0.6588 & -0.8931 \end{array}\right]^{-1}. $$

It has eigenvectors \(\hat {v}_{1}=(-0.7078,-0.7064)\), \(\hat {v}_{2}=(0.4470,-0.8945)\). The vectors \(\hat {w}_{1}\), \(\hat {w}_{2}\) obtained as in (4.3) are

$$ \hat{w}_{1} = (1.2021,0.9918,0.9899),\quad \hat{w}_{2} = (-1.0389,2.0145,3.0016). $$

By solving (4.4) and (4.5), we obtain the scalars

$$ \hat{\gamma}_{1}=-1.1990,\quad \hat{\gamma}_{2}=-2.1458, \qquad \hat{\lambda}_{1}=0.4521,\quad \hat{\lambda}_{2}=0.6232. $$

Finally, we obtain the approximate decomposition \(\hat {\lambda }_{1} \hat {u}_{1}^{\otimes 3}+\hat {\lambda }_{2} \hat {u}_{2}^{\otimes 3}\) with

$$ \begin{array}{@{}rcl@{}} \hat{u}_{1}&=&(1,\hat{\gamma}_{1}\hat{v}_{1},\hat{w}_{1})=(1,0.8477,0.8479,1.2021,0.9918,0.9899),\\ \hat{u}_{2}&=&(1,\hat{\gamma}_{2}\hat{v}_{2},\hat{w}_{2})=(1,-0.9776,1.9102,-1.0389,2.0145,3.0016). \end{array} $$

It is quite close to the true decomposition of \(\mathcal {F}\).

5 Learning Diagonal Gaussian Mixture

We use the incomplete tensor decomposition or approximation method to learn parameters for Gaussian mixture models. Algorithms 3.4 and 4.1 can be applied to do that.

Let y be the d-dimensional random vector of a Gaussian mixture model with r components, whose parameters are (ωi,μi,Σi), i = 1,…,r. We consider the case that \( r\le \frac {d}{2}-1\). Let y1,…,yN be samples drawn from the Gaussian mixture model. The sample average

$$ \widehat{M}_{1} :=\frac{1}{N} (y_{1} + {\cdots} + y_{N}) $$

is an estimation for the mean \(M_{1}:=\mathbb {E}[y]=\omega _{1} \mu _{1} + {\cdots } + \omega _{r} \mu _{r}\). The symmetric tensor

$$ \widehat{M}_{3} := \frac{1}{N} (y_{1}^{\otimes 3} + {\cdots} + y_{N}^{\otimes 3}) $$

is an estimation for the third order moment tensor \(M_{3}:=\mathbb {E}[y^{\otimes 3}]\). Recall that \(\mathcal {F} = {\sum }_{i=1}^{r} \omega _{i} \mu _{i}^{\otimes 3}\). When all the covariance matrices Σi are diagonal, we have shown in (1.2) that

$$ M_{3} = \mathcal{F} + \sum\limits_{j=1}^{d}(a_{j}\otimes e_{j}\otimes e_{j} +e_{j}\otimes a_{j}\otimes e_{j}+e_{j}\otimes e_{j}\otimes a_{j}). $$

If the labels i1, i2, i3 are distinct from each other, \((M_{3})_{i_{1}i_{2}i_{3}} = (\mathcal {F})_{i_{1}i_{2}i_{3}}\). Recall the label set Ω in (1.4). It holds that

$$ (M_{3})_{{{\varOmega}}} = (\mathcal{F})_{{{\varOmega}}}. $$
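The sample moments and the label set Ω can be formed directly from the data. The following is a small NumPy sketch (not from the paper; the function name is ours, and array indices are 0-based) that computes \(\widehat {M}_{1}\), \(\widehat {M}_{3}\) and a Boolean mask marking the entries with three distinct labels.

```python
import numpy as np

def sample_moments(Y):
    """Y is an N x d array of samples y_1, ..., y_N. Returns M^_1, M^_3 and a
    Boolean mask marking the entries of M^_3 whose three labels are distinct."""
    N, d = Y.shape
    M1 = Y.mean(axis=0)
    M3 = np.einsum('sa,sb,sc->abc', Y, Y, Y) / N
    mask = np.ones((d, d, d), dtype=bool)
    for a in range(d):
        mask[a, a, :] = mask[a, :, a] = mask[:, a, a] = False
    return M1, M3, mask
```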

Note that \((\widehat {M}_{3})_{{{\varOmega }}}\) is only an approximation for (M3)Ω and \((\mathcal {F})_{{{\varOmega }}}\), due to sampling errors. If the rank \(r\le \frac {d}{2}-1\), we can apply Algorithm 4.1 with the input \((\widehat {M}_{3})_{{{\varOmega }}}\), to compute a rank-r tensor approximation for \(\mathcal {F}\). Suppose the tensor approximation produced by Algorithm 4.1 is

$$ \mathcal{F} \approx (p_{1}^{\ast})^{\otimes 3} + {\cdots} + (p_{r}^{\ast})^{\otimes 3}. $$

The computed \(p_{1}^{\ast },\ldots ,p_{r}^{\ast }\) may not be real vectors, even if \(\mathcal {F}\) is real. When the error \(\varepsilon :=\|(\mathcal {F}-\widehat {M}_{3})_{{{\varOmega }}}\|\) is small, by Theorem 4.2, we know

$$ \|\tau_{i}^{\ast} p_{i}^{\ast}-\sqrt[3]{\omega_{i}}\mu_{i}\| = O(\varepsilon) $$

where \((\tau _{i}^{\ast })^{3}=1\). In computation, we choose \(\tau _{i}^{\ast }\) such that \((\tau _{i}^{\ast })^{3}=1\) and the imaginary part vector \(\text {Im}(\tau _{i}^{\ast } p_{i}^{\ast })\) has the smallest norm. This can be done by checking the imaginary part of \(\tau _{i}^{\ast } p_{i}^{\ast }\) for each of

$$ \tau_{i}^{\ast} = 1, \quad -\frac{1}{2}+\frac{\sqrt{-3}}{2}, \quad -\frac{1}{2}-\frac{\sqrt{-3}}{2}. $$

Then we get the real vector

$$ \hat{q}_{i} := \text{Re}(\tau_{i}^{\ast} p_{i}^{\ast}). $$
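A small illustration of this selection in MATLAB (a sketch; the vector p stands for a computed, possibly complex, vector \(p_{i}^{\ast }\) from Algorithm 4.1, and its entries below are placeholders):

```matlab
p = [1.00 + 0.02i; -0.51 - 0.86i; 0.30 + 0.01i];           % placeholder complex vector p_i
taus = [1, -1/2 + (sqrt(3)/2)*1i, -1/2 - (sqrt(3)/2)*1i];  % the three cube roots of 1
[~, k] = min(arrayfun(@(t) norm(imag(t * p)), taus));      % minimize the imaginary part norm
q_hat = real(taus(k) * p);                                 % the real vector q_i
```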

It is expected that \(\hat {q}_{i} \approx \sqrt [3]{\omega _{i}} \mu _{i}\). Since

$$ M_{1} = \omega_{1} \mu_{1} + {\cdots} + \omega_{r} \mu_{r} \approx \omega_{1}^{2/3} \hat{q}_{1} + {\cdots} + \omega_{r}^{2/3} \hat{q}_{r}, $$

the scalars \(\omega _{1}^{2/3},\ldots ,\omega _{r}^{2/3}\) can be obtained by solving the nonnegative linear least squares problem

$$ \underset{(\beta_{1},\ldots,\beta_{r})\in\mathbb{R}^{r}_{+}}{\min} \left\|\widehat{M}_{1}- \sum\limits_{i=1}^{r} \beta_{i} \hat{q}_{i}\right\|^{2}. $$
(5.1)
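A minimal MATLAB sketch of this step is given below (an illustration; the matrix Q collects the real vectors \(\hat {q}_{i}\) as columns and M1hat is the sample mean, both assumed to be available from the previous steps; placeholder data is used so the snippet runs, and all the computed β_i are assumed to be positive).

```matlab
d = 6; r = 2;
Q = randn(d, r);                        % placeholder for [q_1, ..., q_r]
M1hat = Q * ([0.4; 0.6].^(2/3));        % placeholder sample mean (consistent data)
beta = lsqnonneg(Q, M1hat);             % nonnegative least squares (5.1), beta_i ~ omega_i^(2/3)
omega_hat = beta.^(3/2);                % estimated weights
mu_hat = Q * diag(1 ./ sqrt(beta));     % mu_i = q_i / omega_i^(1/3) = q_i / beta_i^(1/2)
```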

Let \((\beta _{1}^{\ast },\ldots ,\beta _{r}^{\ast })\) be an optimizer of (5.1); then \(\hat {\omega }_{i} := (\beta _{i}^{\ast })^{3/2}\) is a good approximation for ωi and the vector

$$ \hat{\mu}_{i} := \hat{q}_{i} / \sqrt[3]{ \hat{\omega}_{i}} $$

is a good approximation for μi. We may use

$$ \hat{\mu}_{i}, \quad \left( \sum\limits_{j=1}^{r} \hat{\omega}_{j} \right)^{-1} \hat{\omega}_{i}, \quad i=1,\ldots, r $$

as starting points to solve the nonlinear optimization

$$ \left\{ \begin{array}{cl} \underset{(\omega_{1},\ldots,\omega_{r}, \mu_{1},\ldots, \mu_{r})}{\min} & \|{\sum}_{i=1}^{r} \omega_{i} \mu_{i}-\widehat{M}_{1}\|^{2} + \|{\sum}_{i=1}^{r} \omega_{i} (\mu_{i}^{\otimes 3})_{{{\varOmega}}}-(\widehat{M}_{3})_{{{\varOmega}}}\|^{2}\\ \text{subject to} & \omega_{1} + {\cdots} +\omega_{r} = 1,~\omega_{1},\ldots,\omega_{r} \ge 0, \end{array} \right. $$
(5.2)

to obtain improved approximations. Suppose an optimizer of (5.2) is

$$ (\omega_{1}^{\ast}, \ldots, \omega_{r}^{\ast}, \mu_{1}^{\ast},\ldots, \mu_{r}^{\ast}). $$
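A hedged MATLAB sketch of this refinement step is shown below. It solves (5.2) with fmincon over the stacked variable x = [ω; vec(μ)]; the function name refine_gmm_means_weights and its interface are ours, and omega0, mu0, M1hat, M3hat are assumed to come from the earlier steps (for instance, omega0 can be the weights \(\hat {\omega }_{i}\) and mu0 the means \(\hat {\mu }_{i}\) obtained above).

```matlab
function [omega_star, mu_star] = refine_gmm_means_weights(omega0, mu0, M1hat, M3hat)
% Sketch of (5.2): refine weights and means by moment matching with fmincon.
[d, r] = size(mu0);
[I1, I2, I3] = ndgrid(1:d, 1:d, 1:d);
Omega = (I1 ~= I2) & (I1 ~= I3) & (I2 ~= I3);      % label set Omega (distinct indices)
x0 = [omega0(:) / sum(omega0); mu0(:)];            % normalized starting weights
Aeq = [ones(1, r), zeros(1, d * r)]; beq = 1;      % weights sum to one
lb = [zeros(r, 1); -inf(d * r, 1)];                % weights are nonnegative
obj = @(x) objective52(x, M1hat, M3hat, Omega, d, r);
xopt = fmincon(obj, x0, [], [], Aeq, beq, lb, []);
omega_star = xopt(1:r);
mu_star = reshape(xopt(r+1:end), d, r);
end

function v = objective52(x, M1hat, M3hat, Omega, d, r)
% Objective of (5.2): match M1 and the Omega-entries of M3.
omega = x(1:r); mu = reshape(x(r+1:end), d, r);
T = zeros(d, d, d);
for i = 1:r
    ui = mu(:, i);
    T = T + omega(i) * reshape(kron(ui, kron(ui, ui)), d, d, d);
end
v = norm(mu * omega - M1hat)^2 + norm(T(Omega) - M3hat(Omega))^2;
end
```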

Now, we discuss how to estimate the diagonal covariance matrices Σi. Let

$$ \mathcal{A}:=M_{3}-\mathcal{F}, \quad \widehat{\mathcal{A}} := \widehat{M}_{3}-(\hat{q}_{1})^{\otimes 3}-{\cdots} - (\hat{q}_{r} )^{\otimes 3}. $$
(5.3)

By (1.2), we know that

$$ \mathcal{A} = \sum\limits_{j=1}^{d}(a_{j}\otimes e_{j}\otimes e_{j}+e_{j}\otimes a_{j}\otimes e_{j}+e_{j}\otimes e_{j}\otimes a_{j}), $$
(5.4)

where \(a_{j}={\sum }_{i=1}^{r}\omega _{i}\sigma _{ij}^{2}\mu _{i}\) for \(j=1,\dots ,d\). The equation (5.4) implies that

$$ (a_{j})_{j}=\frac{1}{3}\mathcal{A}_{jjj},\quad (a_{j})_{i}=\mathcal{A}_{jij}, $$

for \(i,j=1,\dots ,d\) and ij. So we choose vectors \(\hat {a}_{j} \in \mathbb {R}^{d}\) such that

$$ (\hat{a}_{j})_{j}=\frac{1}{3}\widehat{\mathcal{A}}_{jjj},\quad (\hat{a}_{j})_{i}=\widehat{\mathcal{A}}_{jij} \quad \text{ for }~i \ne j. $$
(5.5)

Since \(\hat {a}_{j}\approx {\sum }_{i=1}^{r}\omega _{i}\sigma _{ij}^{2}\mu _{i}\), the covariance matrices \({{\varSigma }}_{i} = \text {diag}(\sigma _{i1}^{2}, \ldots , \sigma _{id}^{2})\) can be estimated by solving the nonnegative linear least squares problems (j = 1,…,d)

$$ \left\{\begin{array}{cl} \underset{(\beta_{1j}, \ldots, \beta_{rj}) }{\min} & \left\|\hat{a}_{j} - {\sum}_{i=1}^{r} \omega^{\ast}_{i}\mu^{\ast}_{i} \beta_{ij}\right\|^{2} \\ \text{subject to} & \beta_{1j} \ge 0,\ldots, \beta_{rj} \ge 0. \end{array} \right. $$
(5.6)

For each j, let \((\beta ^{\ast }_{1j}, \ldots , \beta ^{\ast }_{rj})\) be the optimizer for the above. When \((\widehat {M}_{3})_{{{\varOmega }}}\) is close to (M3)Ω, it is expected that \(\beta ^{\ast }_{ij}\) is close to (σij)2. Therefore, we can estimate the covariance matrices Σi as follows

$$ {{\varSigma}}_{i}^{\ast} := \text{diag}(\beta^{\ast}_{i1}, \ldots, \beta^{\ast}_{id}), \quad (\sigma_{ij}^{\ast})^{2}:=\beta^{\ast}_{ij}. $$
(5.7)
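The covariance recovery in (5.3)-(5.7) can be sketched in MATLAB as follows (an illustration with placeholder data; M3hat, the vectors \(\hat {q}_{i}\) stored as the columns of Q, and the optimizers \(\omega _{i}^{\ast }\), \(\mu _{i}^{\ast }\) of (5.2) are assumed to be available).

```matlab
d = 6; r = 2;
M3hat = randn(d, d, d);                  % placeholder third order sample moment
Q = randn(d, r);                         % placeholder for [q_1, ..., q_r]
omega_star = [0.4; 0.6];                 % placeholder optimizers of (5.2)
mu_star = randn(d, r);
Ahat = M3hat;                            % Ahat = M3hat - sum_i q_i^{(3)}, as in (5.3)
for i = 1:r
    qi = Q(:, i);
    Ahat = Ahat - reshape(kron(qi, kron(qi, qi)), d, d, d);
end
B = mu_star * diag(omega_star);          % column i is omega_i^* mu_i^*
beta_star = zeros(r, d);                 % beta_star(i, j) approximates sigma_ij^2
for j = 1:d
    aj = reshape(Ahat(j, :, j), [], 1);  % (a_j)_i = Ahat(j, i, j) for i ~= j
    aj(j) = Ahat(j, j, j) / 3;           % (a_j)_j = Ahat(j, j, j) / 3, as in (5.5)
    beta_star(:, j) = lsqnonneg(B, aj);  % nonnegative least squares (5.6)
end
Sigma_star = arrayfun(@(i) diag(beta_star(i, :)), 1:r, 'UniformOutput', false);  % as in (5.7)
```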

The following is the algorithm for learning Gaussian mixture models.

Algorithm 5.1

(Learning diagonal Gaussian mixture models)

Input: Samples \(\{y_{1},\ldots ,y_{N}\} \subseteq \mathbb {R}^{d}\) drawn from a Gaussian mixture model and the number r of component Gaussian distributions.

  Step 1.

    Compute the sample averages \(\widehat {M}_{1} := \frac {1}{N} {\sum }_{i=1}^{N} y_{i}\) and \(\widehat {M}_{3} :=\frac {1}{N}{\sum }_{i=1}^{N} y_{i}^{\otimes 3}\).

  Step 2.

    Apply Algorithm 4.1 to the subtensor \((\widehat {\mathcal {F}})_{{{\varOmega }}} := (\widehat {M}_{3})_{{{\varOmega }}}\). Let \((p_{1}^{\ast })^{\otimes 3}+{\cdots } +(p_{r}^{\ast })^{\otimes 3}\) be the obtained rank-r tensor approximation for \(\widehat {\mathcal {F}}\). For each i = 1,…,r, let \(\hat {q}_{i} :=\text {Re}(\tau _{i}p_{i}^{\ast })\) where τi is the cube root of 1 that minimizes the imaginary part vector norm \(\|\text {Im}(\tau _{i} p_{i}^{\ast })\|\).

  Step 3.

    Solve (5.1) to get \(\hat {\omega }_{1},\ldots ,\hat {\omega }_{r}\) and \(\hat {\mu }_{i} = \hat {q}_{i}/\sqrt [3]{\hat {\omega }_{i}},i=1,\ldots ,r\).

  Step 4.

    Use \(\hat {\mu }_{i}\) and the normalized weights \(\hat {\omega }_{i}/(\hat {\omega }_{1}+\cdots +\hat {\omega }_{r})\) as initial points to solve the nonlinear optimization (5.2) for the optimal \(\omega _{i}^{\ast },\mu _{i}^{\ast }\), i = 1,…,r.

  Step 5.

    Get vectors \(\hat {a}_{1}, \ldots , \hat {a}_{d}\) as in (5.5). Solve the optimization (5.6) to get optimizers \(\beta _{ij}^{\ast }\) and then choose \({{\varSigma }}_{i}^{\ast }\) as in (5.7).

Output: Component Gaussian distribution parameters \((\mu ^{\ast }_{i},{{\varSigma }}^{\ast }_{i},\omega ^{\ast }_{i})\), i = 1,…,r.

The sample averages \(\widehat {M}_{1}\), \(\widehat {M}_{3}\) can typically be used as good estimates for the true moments M1, M3. When the value of r is not known, it can be determined as in Remark 3.6. The performance of Algorithm 5.1 is analyzed as follows.

Theorem 5.2

Consider the d-dimensional diagonal Gaussian mixture model with parameters {(ωi,μi,Σi) : i ∈ [r]} and \(r\le \frac {d}{2}-1\). Let \(\{(\omega ^{\ast }_{i},\mu ^{\ast }_{i},{{\varSigma }}^{\ast }_{i}):i\in [r]\}\) be produced by Algorithm 5.1. If the distance \(\varepsilon :=\max \limits (\|M_{3}-\widehat {M}_{3}\|,\|M_{1}-\widehat {M}_{1}\|) \) is small enough and the tensor \(\mathcal {F}={\sum }_{i=1}^{r}\omega _{i} \mu _{i}^{\otimes 3}\) satisfies conditions of Theorem 4.2, then

$$ \|\mu_{i}-\mu^{\ast}_{i}\| = O(\varepsilon),\quad \|\omega_{i}-\omega_{i}^{\ast}\| = O(\varepsilon),\quad \|{{\varSigma}}_{i} - {{\varSigma}}^{\ast}_{i}\| = O(\varepsilon), $$

where the above constants inside O(⋅) only depend on parameters {(ωi,μi,Σi) : i ∈ [r]} and the choice of ξ in Algorithm 5.1.

Proof

For the vectors \(p_{i}:=\sqrt [3]{\omega _{i}}\mu _{i}\), we have \(\mathcal {F} = {\sum }_{i=1}^{r} p_{i}^{\otimes 3}\). Since

$$ \|(\mathcal{F}-\widehat{\mathcal{F}})_{{{\varOmega}}}\| = \|(M_{3}-\widehat{M}_{3})_{{{\varOmega}}}\| \le \varepsilon $$

and \(\mathcal {F}\) satisfies the conditions of Theorem 4.2, we know from Theorem 4.2 that \(\|\tau _{i}^{\ast } p^{\ast }_{i}-p_{i}\|=O(\varepsilon )\) for some \(\tau _{i}^{\ast }\) with \((\tau _{i}^{\ast })^{3}=1\). The constants inside O(ε) depend on the parameters of the Gaussian model and on ξ. Since the vectors pi are real, it follows that \(\|\text {Im}(\tau _{i}^{\ast } p_{i}^{\ast })\|=O(\varepsilon )\). When ε is small enough, such \(\tau _{i}^{\ast }\) is the \(\tau _{i}\) chosen in Step 2 of Algorithm 5.1 to minimize \(\|\text {Im}(\tau _{i}p_{i}^{\ast })\|\), so we have

$$ \| \hat{q}_{i}-p_{i}\|\le \|\tau_{i}p^{\ast}_{i}-p_{i}\|=O(\varepsilon), $$

where \(\hat {q}_{i}=\text {Re}(\tau _{i}p_{i}^{\ast })\) is from Step 2. The vectors \(\hat {q}_{1},\ldots , \hat {q}_{r}\) are linearly independent when ε is small. Thus, the problem (5.1) has a unique solution and the weights \(\hat {\omega }_{i} \) can be found by solving (5.1). Since \(\|M_{1}-\widehat {M}_{1}\|\le \varepsilon \) and \(\|\hat {q}_{i}-p_{i}\|=O(\varepsilon )\), we have \(\|\omega _{i} - \hat {\omega }_{i}\|=O(\varepsilon )\) (see [15, Theorem 3.4]). The mean vectors \(\hat {\mu }_{i}\) are obtained by \(\hat {\mu }_{i} = \hat {q}_{i}/\sqrt [3]{\hat {\omega }_{i}}\), so the approximation error is

$$ \|\mu_{i} - \hat{\mu}_{i}\|=\|{p}_{i}/\sqrt[3]{{\omega}_{i} }- \hat{q}_{i}/\sqrt[3]{\hat{\omega}_{i}}\| = O(\varepsilon). $$

The constants inside the above O(ε) depend on parameters of the Gaussian mixture model and ξ.

The problem (5.2) is solved to obtain \(\omega ^{\ast }_{i}\) and \(\mu _{i}^{\ast }\). Since the optimal value of (5.2) is no larger than its value at the starting point, which is \(O(\varepsilon ^{2})\), we have

$$ \left\|\widehat{M}_{1} - \sum\limits_{i=1}^{r} \omega_{i}^{\ast} \mu_{i}^{\ast}\right\| + \left\|\left(\widehat{\mathcal{F}} - \sum\limits_{i=1}^{r} \omega_{i}^{\ast} (\mu_{i}^{\ast})^{\otimes 3}\right)_{{{\varOmega}}} \right\| = O(\varepsilon). $$

Let \(\mathcal {F}^{\ast }:={\sum }_{i=1}^{r} \omega _{i}^{\ast } (\mu _{i}^{\ast })^{\otimes 3}={\sum }_{i=1}^{r}(\sqrt [3]{\omega _{i}^{\ast }}\mu _{i}^{\ast })^{\otimes 3}\), then

$$ \|(\mathcal{F}-\mathcal{F}^{\ast})_{{{\varOmega}}}\| \le \|(\mathcal{F}-\widehat{\mathcal{F}})_{{{\varOmega}}}\|+\|(\widehat{\mathcal{F}}-\mathcal{F}^{\ast})_{{{\varOmega}}}\| = O(\varepsilon). $$

Theorem 4.2 implies \(\|p_{i}- \sqrt [3]{\omega _{i}^{\ast }}\mu _{i}^{\ast }\|=O(\varepsilon )\). In addition, we have

$$ \left\|\widehat{M_{1}} - \sum\limits_{i=1}^{r} \omega_{i}^{\ast} \mu_{i}^{\ast}\right\|=\left\|\widehat{M_{1}} - \sum\limits_{i=1}^{r} (\omega_{i}^{\ast})^{2/3} \sqrt[3]{\omega_{i}^{\ast}}\mu_{i}^{\ast}\right\| = O(\varepsilon). $$

The first order moment is \(M_{1}= {\sum }_{i=1}^{r} (\omega _{i})^{2/3} p_{i}\). Since \(\|M_{1}-\hat {M}_{1}\|=O(\varepsilon )\) and \(\|p_{i}- \sqrt [3]{\omega _{i}^{\ast }}\mu _{i}^{\ast }\|=O(\varepsilon )\), it holds that \(\|\omega _{i}^{2/3}-(\omega _{i}^{\ast })^{2/3}\|=O(\varepsilon )\) by [15, Theorem 3.4]. This implies that \(\|\omega _{i}-\omega _{i}^{\ast }\|=O(\varepsilon )\), so

$$ \|\mu_{i}-\mu_{i}^{\ast}\|=\|p_{i}/\sqrt[3]{\omega_{i}}-\left( \sqrt[3]{\omega_{i}^{\ast}}\mu_{i}^{\ast}\right)/\sqrt[3]{\omega_{i}^{\ast}}\|=O(\varepsilon). $$

The constants inside the above O(⋅) only depend on parameters {(ωi,μi,Σi) : i ∈ [r]} and ξ.

The covariance matrices Σi are recovered by solving the linear least squares problems (5.6). In these least squares problems, it holds that \(\|\omega _{i}\mu _{i}-\omega _{i}^{\ast }\mu _{i}^{\ast }\|=O(\varepsilon )\) and

$$ \|\mathcal{A}-\widehat{\mathcal{A}}\|\le \|M_{3}-\widehat{M}_{3}\|+ \|\mathcal{F}-\sum\limits_{i=1}^{r} \hat{q}_{i}^{\otimes 3}\|=O(\varepsilon), $$

where tensors \(\mathcal {A}\), \(\widehat {\mathcal {A}}\) are defined in (5.3). When the error ε is small, the vectors \(\omega _{1}^{\ast }\mu _{1}^{\ast },\ldots ,\omega _{r}^{\ast }\mu _{r}^{\ast }\) are linearly independent and hence (5.6) has a unique solution for each j. By [15, Theorem 3.4], we have

$$ \|(\sigma_{ij})^{2} - (\sigma_{ij}^{\ast})^{2}\| = O(\varepsilon). $$

It implies that \(\|{{\varSigma }}_{i}-{{\varSigma }}^{\ast }_{i} \| = O(\varepsilon )\), where the constants inside O(⋅) only depend on parameters {(ωi,μi,Σi) : i ∈ [r]} and ξ. □

6 Numerical Simulations

This section gives numerical experiments for our proposed methods. The computation is implemented in MATLAB R2019b on an Alienware personal computer with an Intel(R) Core(TM) i7-9700K CPU @ 3.60GHz and 16.0 GB of RAM. The MATLAB function lsqnonlin is used to solve (4.6) in Algorithm 4.1 and the MATLAB function fmincon is used to solve (5.2) in Algorithm 5.1. We compare our method with the classical EM algorithm, which is implemented by the MATLAB function fitgmdist (MaxIter is set to 100 and RegularizationValue is set to 0.001).

First, we show the performance of Algorithm 4.1 for computing incomplete symmetric tensor approximations. For a range of dimensions d and ranks r, we generate the tensor \(\mathcal {F} = (p_{1})^{\otimes 3}+\cdots +(p_{r})^{\otimes 3}\), where each pi is randomly generated according to the Gaussian distribution in MATLAB. Then we add the perturbation \((\widehat {\mathcal {F}})_{{{\varOmega }}} = (\mathcal {F})_{{{\varOmega }}} + \mathcal {E}_{{{\varOmega }}}\), where \(\mathcal {E}\) is a randomly generated tensor, also according to the Gaussian distribution in MATLAB, scaled so that \(\|\mathcal {E}_{{{\varOmega }}}\| = \varepsilon \). After that, Algorithm 4.1 is applied to the subtensor \((\widehat {\mathcal {F}})_{{{\varOmega }}}\) to find the rank-r tensor approximation. The approximation quality is measured by the absolute error and the relative error

$$ \text{abs-error} := \|(\mathcal{F}^{\ast}-\mathcal{F})_{{{\varOmega}}}\|, \quad \text{rel-error} := \frac{\|(\mathcal{F}^{\ast}-\widehat{\mathcal{F}})_{{{\varOmega}}}\|} {\|(\mathcal{F}-\widehat{\mathcal{F}})_{{{\varOmega}}}\|}, $$

where \(\mathcal {F}^{\ast }\) is the output of Algorithm 4.1. For each case of (d,r,ε), we generate 100 random instances. The min, average, and max relative errors for each dimension d and rank r are reported in Table 1. The results show that Algorithm 4.1 performs very well for computing tensor approximations.

Table 1 The performance of Algorithm 4.1
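For reference, one instance of this experiment can be generated as in the following hedged sketch (a reconstruction of the described setup, not the script used for Table 1); the function incomplete_tensor_approx is a hypothetical stand-in for an implementation of Algorithm 4.1 and is therefore left as a comment.

```matlab
d = 20; r = 5; epsilon = 1e-2;           % one (d, r, epsilon) case (placeholders)
P = randn(d, r);                         % random vectors p_1, ..., p_r
F = zeros(d, d, d);
for i = 1:r
    pi_ = P(:, i);
    F = F + reshape(kron(pi_, kron(pi_, pi_)), d, d, d);
end
[I1, I2, I3] = ndgrid(1:d, 1:d, 1:d);
Omega = (I1 ~= I2) & (I1 ~= I3) & (I2 ~= I3);     % label set Omega
E = randn(d, d, d);
E = E * (epsilon / norm(E(Omega)));               % perturbation with norm epsilon on Omega
Fhat_Omega = F(Omega) + E(Omega);                 % perturbed subtensor, input to Algorithm 4.1
% Fstar = incomplete_tensor_approx(Fhat_Omega, Omega, r);   % hypothetical implementation
% abs_err = norm(Fstar(Omega) - F(Omega));
% rel_err = norm(Fstar(Omega) - Fhat_Omega) / norm(F(Omega) - Fhat_Omega);
```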

Second, we explore the performance of Algorithm 5.1 for learning diagonal Gaussian mixture models. We compare it with the classical EM algorithm, for which the MATLAB function fitgmdist is used (MaxIter is set to 100 and RegularizationValue is set to 0.0001). The dimensions d = 20,30,40,50,60 are tested, with three values of r for each case of d. We generate 100 random instances of \(\{(\omega _{i},\mu _{i}, {{\varSigma }}_{i}):i=1,\dots ,r\}\) for d ∈{20,30,40}, and 20 random instances for d ∈{50,60}, because the latter cases take relatively more computation time. For each instance, 10000 samples are generated. To generate the weights ω1,…,ωr, we first use the MATLAB function randi to generate a random 10000-dimensional integer vector with entries from [r]; the frequency of occurrence of each i ∈ [r] is then used as the weight ωi. For each diagonal covariance matrix Σi, its diagonal vector is the entrywise square of a random vector generated by the MATLAB function randn. Each sample is generated from one of the r component Gaussian distributions, so the samples are naturally separated into r groups. For each instance, Algorithm 5.1 and the EM algorithm are applied to fit the Gaussian mixture model to the 10000 samples. For each sample, we calculate its likelihood under each component Gaussian distribution of the estimated mixture model; the sample is classified to the i-th group if its likelihood under the i-th component is the largest. The classification accuracy is the rate at which samples are classified to the correct group. In Table 2, for each pair (d,r), we report the accuracy of Algorithm 5.1 in the first row and the accuracy of the EM algorithm in the second row. As one can see, Algorithm 5.1 performs better than the EM algorithm, and its accuracy is not adversely affected as the dimension and rank increase. Indeed, as the difference between the dimension d and the rank r increases, Algorithm 5.1 becomes more and more accurate, which is the opposite of the EM algorithm's behavior. The reason is that the gap between the number of rows and the number of columns of \(A_{ij}[\mathcal {F}]\) in (3.7) increases as d - r becomes bigger, which makes Algorithm 5.1 more robust.

Table 2 Comparison between Algorithm 5.1 and EM for simulations
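The sample generation and the accuracy computation described above can be sketched as follows (a reconstruction with placeholder sizes, not the script used for Table 2). The assignment rule uses the largest component likelihood, as in the text; the common constant -(d/2)log(2π) is omitted since it does not affect the argmax.

```matlab
d = 20; r = 3; N = 10000;
labels0 = randi(r, N, 1);                          % random labels, as generated with randi
omega = histcounts(labels0, 1:r+1)' / N;           % weights = occurrence frequencies
mu = randn(d, r);                                  % component means
sig2 = randn(d, r).^2;                             % diagonal covariance entries
Y = mu(:, labels0) + sqrt(sig2(:, labels0)) .* randn(d, N);   % samples, one per label
% Classification with an estimated model; the true parameters are reused here only
% to keep the snippet self-contained. In the experiments, (mu_est, sig2_est) come
% from Algorithm 5.1 or from fitgmdist (EM).
mu_est = mu; sig2_est = sig2;
loglik = zeros(r, N);
for i = 1:r
    Z = (Y - mu_est(:, i)) ./ sqrt(sig2_est(:, i));            % standardized residuals
    loglik(i, :) = -0.5 * sum(Z.^2, 1) - 0.5 * sum(log(sig2_est(:, i)));
end
[~, labels_est] = max(loglik, [], 1);              % assign to the most likely component
accuracy = mean(labels_est(:) == labels0(:));      % classification accuracy
```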

Last, we apply Algorithm 5.1 to texture classification. We select 8 textured images of 512 × 512 pixels from the VisTex database (Fig. 1). We use the MATLAB function rgb2gray to convert them into grayscale versions, since we only need their structure and texture information. Each image is divided into subimages of 32 × 32 pixels. We perform the discrete cosine transformation (DCT) on each block of 16 × 16 pixels with an overlap of 8 pixels. Each component of the 'wavelet-like' DCT feature is the sum of the absolute values of the DCT coefficients in the corresponding sub-block, so the dimension d of the feature vector extracted from each subimage is 13. We use blocks extracted from the first 160 subimages for training and those from the remaining 96 subimages for testing. We refer to [47] for more details. For each image, we apply Algorithm 5.1 and the EM algorithm to fit a Gaussian mixture model to the image. We choose the number of components r according to Remark 3.6. To classify the test data, we follow the Bayesian decision rule that assigns each block to the texture maximizing the posterior probability, where we assume a uniform prior over all classes [18]. The classification accuracy is the rate at which subimages are correctly classified; it is shown in Table 3. Algorithm 5.1 outperforms the classical EM algorithm in accuracy on six of the eight images.

Fig. 1 Textures from VisTex

Table 3 Classification results on 8 textures

7 Conclusions and Future Work

This paper gives a new algorithm for learning Gaussian mixture models with diagonal covariance matrices. We first give a method, based on generating polynomials, for computing incomplete symmetric tensor decompositions; it is described in Algorithm 3.4. When the input subtensor has small errors, we can similarly compute an incomplete symmetric tensor approximation, which is given by Algorithm 4.1. We have shown in Theorem 4.2 that if the input subtensor is sufficiently close to a low-rank one, the produced tensor approximation is highly accurate. The unknown parameters of Gaussian mixture models can then be recovered with the incomplete tensor decomposition method, as described in Algorithm 5.1. When the estimates of M1 and M3 are accurate, the parameters recovered by Algorithm 5.1 are also accurate. The numerical simulations demonstrate the good performance of the proposed method.

The proposed method deals with the case where the number of Gaussian components is less than one half of the dimension. How can we compute incomplete symmetric tensor decompositions when the set Ω is not of the form (1.4)? How can we learn parameters of Gaussian mixture models with more components? How can we do so when the covariance matrices are not diagonal? These are important and interesting topics for future research.