Gaussian mixture model decomposition of multivariate signals

We propose a greedy variational method for decomposing a non-negative multivariate signal as a weighted sum of Gaussians which, borrowing the terminology from statistics, we refer to as a Gaussian mixture model (GMM). Mixture components are added one at a time in two steps. In the first step, a new Gaussian atom and an amplitude are chosen based on a heuristic that aims to minimize the 2-norm of the residual. In the second step, the 2-norm of the residual is further decreased by simultaneously adjusting all current Gaussians. Notably, our method has the following features: (1) It accepts multivariate signals, i.e. sampled multivariate functions, histograms, time series, images, etc., as input. (2) It can handle general (i.e. ellipsoidal) Gaussians. (3) No prior assumption on the number of mixture components is needed. To the best of our knowledge, no previous method for GMM decomposition simultaneously enjoys all these features. Since finding the optimal atom is a non-convex problem, an important point is how to initialize each new atom. We initialize the mean at the maximum of the residual. As a motivation for this initialization procedure, we prove an upper bound, which cannot be improved by a global constant, on the distance from any mode of a GMM to the set of corresponding means. For mixtures of spherical Gaussians with common variance $\sigma^2$, the bound takes the simple form $\sqrt{n}\sigma$. We evaluate our method on one- and two-dimensional signals. We also discuss the relation between clustering and signal decomposition, and compare our method to the baseline expectation maximization algorithm.


Introduction
Mixtures of Gaussians are often used in clustering to fit a probability distribution to some given sample points. In this work we are concerned with the related problem of approximating a non-negative but otherwise arbitrary signal by a sparse linear combination of potentially anisotropic Gaussians. Our interest in this problem stems mainly from its applications in transmission electron microscopy (TEM). The output of the TEM reconstruction procedure is a 3D voxelized structure that ideally represents the electrostatic potential of the imaged specimen. Via a process known in the TEM community as coarse-graining, it is common to express this 3D structure as a linear combination of Gaussians [Kaw18,Jou16,JS16a,JS16b]. This speeds up post-processing tasks such as fitting an atomic model to the structure [Kaw18], but one can also use coarse-graining as a form of denoising [JS16b].
There are two classes of methods for GMM decomposition of multivariate signals in the literature. Methods of the first class are based on the expectation maximization algorithm that is commonly used for fitting a GMM to a point-cloud, adapting it to input data in the form of multivariate signals [Kaw18,Jou16]. The second class is the class of greedy variational methods [FÇ99,JS16a,KBGP18]. The proposed method, which belongs to the latter class, is similar to, and inspired by, both [KBGP18] and [JS16a]. It is a continuously parameterized analogue of orthogonal matching pursuit in which the 2-norm of the error is non-increasing from one iteration to the next. It is mainly motivated by the fact that previous work on decomposition of multivariate signals as GMMs either did not allow for arbitrary covariances, or required the number of Gaussian components to be set beforehand.
We complement our algorithm with a theorem (Theorem 3.4) that upper-bounds the distance from a local maximum of a GMM to the set of mean vectors. This provides theoretical support for our initialization of each new mean vector at a maximum of the residual. We remark that Theorem 3.4 could also be of interest in its own right; whereas the number of modes of Gaussian mixtures has been investigated previously [CP00,AEH17], the authors of this paper are not aware of any existing quantitative bound on the distance from a mode of a GMM to its mean vectors.
The rest of this paper is organized as follows. Section 2 contains a description of the proposed algorithm together with numerical examples. In section 3 we state and prove the aforementioned upper bound. Finally, in section 4 we provide a conclusion.

Proposed method
For $x_0 \in \mathbb{R}^n$ and $\Sigma \in \mathbb{R}^{n \times n}$ symmetric positive definite we define $g(x_0, \Sigma)$ as the Gaussian density in $n$ dimensions with mean vector $x_0$ and covariance matrix $\Sigma^2$, i.e.
$$g(x_0, \Sigma)(x) = C_\Sigma \exp\left(-\tfrac{1}{2}\,|\Sigma^{-1}(x - x_0)|^2\right),$$
where $C_\Sigma$ is a normalizing factor, ensuring that $|g(x_0, \Sigma)|_1 = 1$. Further, by a Gaussian mixture model (GMM) we mean a linear combination of the form
$$f = \sum_{m=1}^{M} a_m\, g(x_m, \Sigma_m), \qquad a_m \geq 0.$$
The proposed method takes as input a non-negative signal $d \in \mathbb{R}^{k_1 \times \cdots \times k_n}$, where $k_i$ is the number of grid points along the $i$:th variable. The output is a list of GMM parameters, i.e. it is a list $(a^*_m, x^*_m, \Sigma^*_m)_{m=1}^{M}$ of weights, mean vectors and square roots of covariance matrices. The aim of the method is firstly that the residual should have a small 2-norm. Secondly, the approximation should be sparse, i.e. the number of Gaussians $M$ in the sum should ideally be as small as possible given the 2-norm of the residual.
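As a concrete reference, the definitions above can be sketched in NumPy. The helper names `gaussian_atom` and `gmm` are our own, and the grid is assumed to be given as an $(N, n)$ array of sample points:

```python
import numpy as np

def gaussian_atom(x0, S, grid):
    """Evaluate g(x0, S) on the rows of `grid`, where S is the square root
    of the covariance matrix (covariance = S @ S)."""
    n = S.shape[0]
    Sinv = np.linalg.inv(S)
    z = (grid - x0) @ Sinv.T                  # whitened coordinates S^{-1}(x - x0)
    q = np.sum(z * z, axis=1)                 # |S^{-1}(x - x0)|^2
    C = 1.0 / ((2.0 * np.pi) ** (n / 2) * np.abs(np.linalg.det(S)))
    return C * np.exp(-0.5 * q)

def gmm(params, grid):
    """Weighted sum of Gaussian atoms: params is a list of (a, x0, S)."""
    return sum(a * gaussian_atom(x0, S, grid) for a, x0, S in params)
```

Here the normalizing factor is $C_\Sigma = ((2\pi)^{n/2}\det\Sigma)^{-1}$, which the code computes via `det`.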
In each iteration of our algorithm, a new Gaussian is added to the GMM by a procedure that corresponds to one iteration of a continuously parametrized version of matching pursuit (MP), cf. [MZ93]: The starting guess $x_0$ for the mean vector is defined to be a global maximum of $\tilde{r}$, a smoothed version of the current residual $r$. Likewise, $\Sigma_0$ is set to be a square root of the matrix of second-order moments of $\hat{r}/|\hat{r}|_1$, where $\hat{r}$ is $r$ restricted to a neighborhood of $x_0$. The initial weight $a_0$ is given by projecting the Gaussian atom defined by $(x_0, \Sigma_0)$ onto the residual, i.e. it is given as the global minimum of the convex objective $a \mapsto |r - a\, g(x_0, \Sigma_0)|_2^2$, $a \geq 0$. We found that this vanilla MP-type approach was by itself not capable of producing a good approximation. Because of this, in our method the already obtained Gaussians are updated in a way that bears some resemblance to the projection step of orthogonal MP (OMP) [PRK93]: Starting from $(a_0, x_0, \Sigma_0)$, the parameters defining the most recently added Gaussian are updated by minimizing the 2-norm of the residual. The final step of an outer iteration is to simultaneously adjust all Gaussians in the current GMM, again by minimizing the 2-norm of the residual. We propose to use the L-BFGS-B [BLNZ95] method with non-negativity constraints on the weights for all three of the aforementioned minimization problems. Our algorithm runs until some user-specified stopping criterion is met and is summarized in pseudo-code in Algorithm 1.
We now assess the space and time complexity of the proposed method in terms of the final number $M$ of Gaussians used and the dimension $n$ of the input signal. Let $N_i$ denote the number of parameters of $i$ Gaussians, so $N_i = O(in^2)$. The most demanding step of a single outer iteration of the proposed algorithm is the simultaneous adjustment of all Gaussians. If this step is done using L-BFGS-B, it incurs a cost of $O(N_i)$, both in terms of space and time [BLNZ95]. Summing over the $M$ iterations, we hence find the space and time requirements of the proposed method to be $O(M^2 n^2)$.
Algorithm 1 Greedy GMM decomposition
1: input: non-negative signal $d$
2: $r \leftarrow d$
3: GMM $\leftarrow$ empty list
4: while stopping criterion is not met do
5:  $\tilde{r} \leftarrow$ smoothing of $r$ with width $\tau_1$
6:  $x_0 \leftarrow \arg\max_y \tilde{r}(y)$
7:  $\hat{r} \leftarrow$ restriction of $r$ to the $\tau_2$ grid points closest to $x_0$
8:  $\Sigma_0 \leftarrow$ square root of the matrix of second-order moments of $\hat{r}/|\hat{r}|_1$
9:  $a_0 \leftarrow \arg\min_{a \geq 0} |r - a\, g(x_0, \Sigma_0)|_2^2$
10:  $(a, x, \Sigma) \leftarrow$ refinement of $(a_0, x_0, \Sigma_0)$ by minimizing the 2-norm of the residual
11:  append $(a, x, \Sigma)$ to GMM
12:  simultaneously adjust all Gaussians in GMM by minimizing the 2-norm of the residual
13:  $r \leftarrow d\, -$ current GMM
14: end while
15: output: GMM
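A minimal 1D NumPy/SciPy sketch of the greedy loop follows. It simplifies Algorithm 1 in two respects that are our own choices: the refinement of the newest Gaussian and the simultaneous adjustment are merged into one joint L-BFGS-B fit, and the smoothing is done with `scipy.ndimage.gaussian_filter1d`; the SNR-based stopping rule mirrors the one used in our experiments but its exact form here is an assumption:

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d
from scipy.optimize import minimize

def atom(x0, s, y):
    """1D Gaussian atom with mean x0, standard deviation s, unit L1 mass."""
    return np.exp(-0.5 * ((y - x0) / s) ** 2) / (np.sqrt(2.0 * np.pi) * s)

def model(p, y):
    """GMM with flattened parameters p = (a_1, x_1, s_1, a_2, x_2, s_2, ...)."""
    return sum(a * atom(x0, s, y) for a, x0, s in zip(p[0::3], p[1::3], p[2::3]))

def greedy_gmm_1d(d, y, tau1=10, tau2=20, snr_stop=20.0, max_atoms=10):
    params = np.empty(0)
    r = d.copy()
    while len(params) // 3 < max_atoms:
        rs = gaussian_filter1d(r, sigma=tau1 / 4.0)       # smoothed residual
        i0 = int(np.argmax(rs))                           # new mean: residual max
        x0 = y[i0]
        lo, hi = max(0, i0 - tau2), min(len(y), i0 + tau2 + 1)
        w = np.clip(rs[lo:hi], 1e-12, None)
        w /= w.sum()                                      # local mass distribution
        s0 = max(np.sqrt(np.sum(w * (y[lo:hi] - x0) ** 2)), 1e-2)  # 2nd moment
        g = atom(x0, s0, y)
        a0 = max(float(r @ g) / float(g @ g), 0.0)        # project atom onto r
        p0 = np.concatenate([params, [a0, x0, s0]])
        loss = lambda p: np.sum((d - model(p, y)) ** 2)
        bounds = [(0.0, None), (None, None), (1e-2, None)] * (len(p0) // 3)
        params = minimize(loss, p0, method="L-BFGS-B", bounds=bounds).x
        r = d - model(params, y)
        est = d - r
        snr = 10.0 * np.log10(np.var(est) / max(np.var(r), 1e-300))
        if snr >= snr_stop:
            break
    return [(params[3 * k], params[3 * k + 1], params[3 * k + 2])
            for k in range(len(params) // 3)]
```

For the 2D experiments the atom and the smoothing generalize directly; only the parameter layout changes, since each atom then carries a full square-root covariance matrix.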

Main numerical results
As proof of concept for the proposed method, we ran our algorithm on small toy examples in 1D and 2D. We refer to the examples as Experiment 1 - Experiment 3, see Figures 1-3. A clean signal $d_{\text{clean}}$ was generated by discretizing GMMs with parameters as in Table 1 to the grids $y_k = -10 + 20k$. As stopping criterion we used $\mathrm{SNR}_{\text{stop}} \geq 20$, where $\mathrm{SNR}_{\text{stop}} := 10 \log_{10} \frac{\operatorname{Var}(d_{\text{est}})}{\operatorname{Var}(d - d_{\text{est}})}$ and where $d_{\text{est}} := \sum_m a^*_m g(x^*_m, \Sigma^*_m)$. In all minimization sub-procedures we used the L-BFGS-B method with non-negativity constraints on the weights and ran it until convergence. The hyper-parameters $\tau_1$ and $\tau_2$ (introduced in lines 5 and 7 of Algorithm 1) should ideally be chosen based on the amount of noise in the data; however, we found that the exact values were not critical in our toy examples, and we used the ad hoc values $\tau_1 = 10$ and $\tau_2 = 20$. We leave a more careful sensitivity analysis with respect to these parameters to future studies. The total run-time was a few minutes on a computer with an Intel Pentium CPU running at 2.90 GHz, using $\sim 8$ GB of RAM. The final results of our experiments are tabulated in Tables 2 and 3.

Remark 2.1. The number of modes of a GMM is not necessarily equal to the number of means. Indeed, there might be more modes than means [AEH17]. Alternatively, there could be fewer modes than means: for example, in Experiment 2 a low-amplitude Gaussian and a high-amplitude Gaussian are close to each other, and there is only one mode within the vicinity of both means. Our method successfully recovered the parameters of both Gaussians. There is also a mode at the center, dominantly formed by the superposition of the tails of three anisotropic Gaussians, where our algorithm did not introduce a spurious Gaussian. Similar performance is observed in Experiment 3. We attribute this to the greedy nature of the algorithm, where a predefined number of modes is not enforced in the decomposition.

Clustering and comparison to expectation maximization
As one of the main techniques in unsupervised learning, clustering seeks to subdivide data into groups based on some similarity measure. A common approach to clustering is based on Gaussian mixtures. In this approach one is given a point-cloud $P \subset \mathbb{R}^n$ and the goal is to find a GMM $f$ such that the probability that $P$ is a sample from $f$ is maximal among all GMMs with a prescribed number of components.
There are many ways of transforming a point-cloud into a signal and vice versa. Thus one may use the method proposed in this paper for GMM-based clustering, and conversely one may use methods from clustering in order to decompose a signal into a GMM. Below we illustrate both directions of this "problem transformation" with some numerical examples.

Clustering via signal decomposition
A sample point-cloud $P$ of size $10^5$ was generated from a normalized version of the GMM in Experiment 3 above. Then a signal $d$ was generated from $P$ by computing a histogram based on the 2D grid used in the previous experiments. Results from the expectation maximization (EM) algorithm and the proposed method, applied to $P$ and $d$ respectively, are shown in Figure 6. The two methods performed similarly in terms of the likelihood of $P$ given the computed GMMs.
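The point-cloud-to-signal step can be sketched with `numpy.histogram2d`. The two-component mixture below is only a stand-in for the Experiment 3 GMM, whose actual parameters are given in Table 1:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in 2D point cloud drawn from a two-component Gaussian mixture
P = np.vstack([rng.normal([-3.0, 0.0], 1.0, size=(50000, 2)),
               rng.normal([4.0, 2.0], 0.7, size=(50000, 2))])
# Histogram on a grid over [-10, 10]^2 turns the point cloud into a signal d
edges = np.linspace(-10, 10, 101)
d, _, _ = np.histogram2d(P[:, 0], P[:, 1], bins=[edges, edges])
# d is now a non-negative 100 x 100 signal suitable for the proposed method
```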

Signal decomposition via clustering
A noisy signal $d$ was generated as in Experiment 3 described in section 2.1, except that the GMM was normalized before discretization. To $d$ we then associated a point-cloud $P$ following the methodology in [Jou16], so $P$ had $C d(y)$ points at grid point $y$, where the constant $C := 10^5 / \sum_y d(y)$ was chosen so that $P$ had roughly $10^5$ points in total. Using $P$ we estimated a GMM with the EM algorithm. The performance of EM depended on the random initialization. A typical result of EM applied to $P$, along with the result of the proposed method applied to $d$, is shown in Figure 7. We remark that a better result with EM might be obtained using some other procedure for generating the point-cloud from the given signal.
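The signal-to-point-cloud construction of [Jou16], as used here, can be sketched as follows; `signal_to_point_cloud` is a hypothetical helper name, and rounding the counts to integers is our choice:

```python
import numpy as np

def signal_to_point_cloud(d, grid, n_points=100000):
    """Associate a point cloud with a non-negative signal d sampled on `grid`
    (shape (N, n)): place round(C * d[i]) points at grid[i], with C chosen
    so that the cloud has roughly n_points points in total (cf. [Jou16])."""
    C = n_points / d.sum()
    counts = np.rint(C * d).astype(int)
    return np.repeat(grid, counts, axis=0)
```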

Location of modes of Gaussian mixtures
The main contribution of our paper is Theorem 3.4, which says that a mode of a GMM cannot lie too far away from the set of means. This theorem generalizes the 1D result that any mode lies within one standard deviation of some mean, and provides theoretical support for our way of initializing each iteration of our method. Before turning to this theorem, we need three lemmas. In the first two lemmas we compute the integrals over the unit sphere of certain polynomials, and in the third lemma we provide expressions for the gradient and Hessian of a GMM.

Lemma 3.1. For every $x \in \mathbb{R}^n$ and every non-negative integer $k$,
$$\int_{S^{n-1}} (x, y)^k \, \mathrm{d}y = C_k |x|^k,$$
where $C_k$ is a constant only depending on $k$.

Proof. WLOG assume $x \neq 0$. By the rotation invariance of the integral we may assume $x = |x| e_1$, whence
$$\int_{S^{n-1}} (x, y)^k \, \mathrm{d}y = |x|^k \int_{S^{n-1}} (e_1, y)^k \, \mathrm{d}y = C_k |x|^k,$$
where the constant $C_k$ is defined by $C_k := \int_{S^{n-1}} (e_1, y)^k \, \mathrm{d}y$.
Lemma 3.2. Let $A$ be a symmetric $n \times n$ matrix. Then
$$\int_{S^{n-1}} (y, Ay) \, \mathrm{d}y = C_2 \operatorname{tr}(A),$$
with $C_2$ as in Lemma 3.1.

Lemma 3.3. The gradient and Hessian of the Gaussian mixture model $f = \sum_{m=1}^{M} a_m\, g(x_m, \Sigma_m)$ are given by
$$\nabla f(x) = -\sum_{m=1}^{M} a_m\, \Sigma_m^{-2} (x - x_m)\, g(x_m, \Sigma_m)(x)$$
and
$$Hf(x) = \sum_{m=1}^{M} a_m \left( \Sigma_m^{-2} (x - x_m)(x - x_m)^{T} \Sigma_m^{-2} - \Sigma_m^{-2} \right) g(x_m, \Sigma_m)(x).$$

Proof. Since this is a standard result, we omit the straightforward proof.
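For completeness, the sphere integral $\int_{S^{n-1}} (y, Ay)\, \mathrm{d}y = C_2 \operatorname{tr}(A)$ for symmetric $A$, used in the proof of Theorem 3.4, follows from symmetry alone. A short derivation (our addition, with $C_2 := \int_{S^{n-1}} y_1^2 \, \mathrm{d}y$):
$$\int_{S^{n-1}} (y, Ay)\, \mathrm{d}y = \sum_{i,j=1}^{n} A_{ij} \int_{S^{n-1}} y_i y_j \, \mathrm{d}y = \sum_{i=1}^{n} A_{ii} \int_{S^{n-1}} y_1^2 \, \mathrm{d}y = C_2 \operatorname{tr}(A),$$
since $\int_{S^{n-1}} y_i y_j \, \mathrm{d}y = 0$ for $i \neq j$ (the reflection $y_i \mapsto -y_i$ is a symmetry of the sphere) and $\int_{S^{n-1}} y_i^2 \, \mathrm{d}y = \int_{S^{n-1}} y_1^2 \, \mathrm{d}y$ for all $i$.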
Theorem 3.4. Consider a Gaussian mixture model in $n$ dimensions,
$$f = \sum_{m=1}^{M} a_m\, g(x_m, \Sigma_m), \qquad a_m > 0.$$
Let $\sigma_{m,\max}$ and $\sigma_{m,\min}$ be the maximal and minimal eigenvalues of $\Sigma_m$. Let $x^*$ be a local maximum of $f$. Then there exists an index $m$ such that
$$|x^* - x_m| \leq \sqrt{n}\, \frac{\sigma_{m,\max}^2}{\sigma_{m,\min}}.$$

Proof. Since $x^*$ is a local maximum we have $Hf(x^*) \leq 0$, i.e. $(y, Hf(x^*) y) \leq 0$ for all $y \in \mathbb{R}^n$. We integrate the last inequality over the unit sphere and make use of Lemma 3.3 to conclude:
$$\sum_{m=1}^{M} a_m\, g(x_m, \Sigma_m)(x^*) \int_{S^{n-1}} \left[ (y, \Sigma_m^{-2}(x^* - x_m))^2 - (y, \Sigma_m^{-2} y) \right] \mathrm{d}y \leq 0.$$
The last inequality implies that there exists some index $m$ such that
$$\int_{S^{n-1}} (y, \Sigma_m^{-2}(x^* - x_m))^2 \, \mathrm{d}y \leq \int_{S^{n-1}} (y, \Sigma_m^{-2} y) \, \mathrm{d}y.$$
At this point we apply Lemma 3.1 (with $k = 2$) to the left-hand side and Lemma 3.2 to the right-hand side and obtain:
$$C_2\, |\Sigma_m^{-2}(x^* - x_m)|^2 \leq C_2 \operatorname{tr}(\Sigma_m^{-2}).$$
Now $C_2 > 0$, since $C_2$ is the integral of a continuous non-negative function that is not everywhere zero. Hence
$$|\Sigma_m^{-2}(x^* - x_m)|^2 \leq \operatorname{tr}(\Sigma_m^{-2}).$$
Note that
$$\operatorname{tr}(\Sigma_m^{-2}) \leq \frac{n}{\sigma_{m,\min}^2}$$
and that
$$|\Sigma_m^{-2}(x^* - x_m)| \geq \frac{|x^* - x_m|}{\sigma_{m,\max}^2},$$
and the claim of the theorem follows.
Corollary 3.5. Let $f$ be a mixture of spherical Gaussians with common variance $\sigma^2$, i.e.
$$f = \sum_{m=1}^{M} a_m\, g(x_m, \sigma I), \qquad a_m > 0.$$
If $x^*$ is a local maximum of $f$, then there is an index $m$ such that
$$|x^* - x_m| \leq \sqrt{n}\, \sigma.$$

Proof. This is an immediate consequence of Theorem 3.4.
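Corollary 3.5 is easy to check numerically. The sketch below locates a mode of a random spherical mixture with the mean-shift fixed-point iteration, which is our choice of mode-finding method here and not part of the proposed algorithm; for common variance, a critical point of $f$ satisfies $x = \sum_m w_m(x)\, x_m$ with weights proportional to $a_m e^{-|x - x_m|^2 / 2\sigma^2}$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma = 3, 0.8
means = rng.uniform(-5, 5, size=(4, n))      # mean vectors x_m
weights = rng.uniform(0.5, 2.0, size=4)      # amplitudes a_m > 0

# Mean-shift iteration: converges to a critical point (generically a mode)
x = rng.uniform(-5, 5, size=n)
for _ in range(1000):
    v = weights * np.exp(-0.5 * np.sum((x - means) ** 2, axis=1) / sigma ** 2)
    x = (v[:, None] * means).sum(axis=0) / v.sum()

# Distance from the located mode to the nearest mean
dist = np.linalg.norm(x - means, axis=1).min()
```

By Corollary 3.5, `dist` is at most $\sqrt{n}\,\sigma$.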
Proposition 3.6. The bound in Theorem 3.4 cannot be improved by a global constant, i.e. for any $\delta > 0$ there exists a GMM $f$, which is a mixture of spherical Gaussians with common variance $\sigma^2$, such that some mode $x^*$ of $f$ satisfies
$$|x^* - x_m| > \sqrt{n}\, \sigma - \delta$$
for all mean vectors $x_m$ of $f$.
Proof. We explicitly construct a family $\{f_\varepsilon\}$ of functions that satisfy the statement of this proposition. For $\varepsilon$ such that $\min(\delta, \sqrt{n}\sigma) > \varepsilon > 0$, let $f_\varepsilon$ be the $2n$-component $n$-dimensional spherical GMM with common variance $\sigma^2$, common amplitude $a$ and with means at $\pm(\sqrt{n}\sigma - \varepsilon) e_i$, for $i = 1, 2, \ldots, n$. We shall prove that $f_\varepsilon$ has a mode at the origin, and thus it has a mode at distance $\sqrt{n}\sigma - \varepsilon$ from the set of means of $f_\varepsilon$. Since all means have the same norm, $g(x_m, \sigma I)(0) = c$ for some $c > 0$ independent of $m$, and Lemma 3.3 implies
$$\nabla f_\varepsilon(0) = \frac{a c}{\sigma^2} \sum_{m=1}^{2n} x_m.$$
By symmetry $\sum_m x_m = 0$, so $f_\varepsilon$ has a critical point at the origin. Hence we are done if we prove that $Hf_\varepsilon(0) < 0$. Again by Lemma 3.3:
$$Hf_\varepsilon(0) = \frac{a c}{\sigma^4} \sum_{m=1}^{2n} \left( x_m x_m^{T} - \sigma^2 I \right) = \frac{2 a c}{\sigma^2} \left( \frac{(\sqrt{n}\sigma - \varepsilon)^2}{\sigma^2} - n \right) I.$$
Since $0 < \varepsilon < \sqrt{n}\sigma$ we have $(\sqrt{n}\sigma - \varepsilon)^2 < n\sigma^2$, hence $Hf_\varepsilon(0) < 0$.
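The construction in the proof can also be verified numerically. The sketch below checks, for $n = 2$, $\sigma = 1$ and $\varepsilon = 0.1$ (our choice of parameters), that the origin is a mode at distance $\sqrt{n}\sigma - \varepsilon$ from every mean:

```python
import numpy as np

n, sigma, eps = 2, 1.0, 0.1
r0 = np.sqrt(n) * sigma - eps                          # distance of means from 0
means = np.vstack([r0 * np.eye(n), -r0 * np.eye(n)])   # 2n means at +-r0 * e_i

def f(x):
    """f_eps with unit amplitudes; the common normalizing constant is
    omitted since it does not affect the location of modes."""
    return np.exp(-0.5 * np.sum((x - means) ** 2, axis=1) / sigma ** 2).sum()

# The origin should be a strict local maximum: f(0) exceeds f at nearby points,
# and every mean sits at distance r0 = sqrt(n)*sigma - eps from it.
x0 = np.zeros(n)
rng = np.random.default_rng(0)
perturbs = 1e-3 * rng.standard_normal((100, n))        # small random directions
```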

Conclusion
Motivated primarily by applications in TEM, we have developed a new algorithm for decomposing a non-negative multivariate signal as a sum of Gaussians with full covariances. We have tested it on 1D and 2D data. Moreover, we have also proved an upper bound for the distance of a local maximum of a GMM to the set of its mean vectors. This upper bound provides motivation for a key step in our method, namely the initialization of each new Gaussian at the maximum of the residual. Finally we remark that, while we have only tested the proposed method on functions sampled on uniform grids, it is straightforward to extend the method to handle input data in the form of multivariate functions sampled on non-uniform grids.