A new random approach for initialization of the multiple restart EM algorithm for Gaussian model-based clustering

The paper proposes a new method for initialization of the multiple restart EM algorithm for Gaussian mixture model-based clustering. The method initializes randomly both the mean vector and covariance matrix of a mixture component. In particular, the mean vector is initialized by a feature vector selected deterministically from a random subset of candidate feature vectors. The selection criterion is the maximum Mahalanobis distance from the already initialized mixture component centers. The covariance matrix of a component is initialized by randomly generating its eigenvalues and eigenvectors. In computational experiments, the used approach was compared with three other random EM initialization methods. The experiments were performed on synthetic datasets generated from the Gaussian mixtures with the different overlap characteristics, as well as on four real-life datasets. The results on synthetic data indicate that, for well separated clusters, for which the maximum pairwise overlap is not excessively high, the described method yields clusterings which correspond better to the original partitions of data, as indicated by the adjusted Rand index. The experiments on real data indicate that the performance of the method is comparable to other three methods for two smaller datasets and significantly better for two larger datasets.


Introduction
Mixture models [12,25,27] are very useful tools, widely applied in pattern recognition for modeling complex probability distributions. A finite mixture model pðxjHÞ can be expressed by a weighted sum of K components: where a m is m-th mixing proportion and p m is the probability density function of the m-th component. In (1), h m is the set of parameters defining the m-th component and H ¼ fh 1 ; h 2 ; . . .; h K ; a 1 ; a 2 ; . . .; a K g is the complete set of the parameters needed to define the mixture. H is unknown and has to be estimated in the mixture learning process. The mixing proportions must satisfy the following conditions: a m [ 0; m ¼ 1; . . .; K and X K m¼1 a m ¼ 1: The number of components K is either known a priori or has to be determined during the mixture learning process.
In the paper it is assumed that K is known. The present paper considers the most widely used class of mixture models known as the Gaussian mixture models (GMMs), in which the probability density function of the m-th component is given by: where l m and R m denote, respectively, the mean vector and covariance matrix, j Á j denotes a determinant of a matrix and d is the dimension of the feature space. for a GMM H is defined by: H ¼ fl 1 ; R 1 ; . . .; l K ; R K ; a 1 ; . . .; a K g: GMMs are widely used in such applications as data clustering [12,27], data classification [15], speaker recognition [33], image segmentation and classification [29]. Estimation of the parameters of a GMM can be performed using the maximum likelihood approach. Given a set of independent and identically distributed feature vectors X ¼ fx 1 ; x 2 ; . . .; x N g, called the learning set, the log likelihood corresponding to a K-component GMM is given by: The maximum likelihood estimate of the parameters is given by: Since the solution of this maximization problem cannot be obtained in a closed form (cf. [5]), a numerical optimization method has to be employed. Because of its simplicity, the EM (expectation-maximization) algorithm [9,24] is the most popular tool for maximum likelihood estimation of the parameters of GMMs. This procedure, however, is highly sensitive to initialization and easily gets trapped in a local optimum of (4). Therefore, the quality of the final solution is strongly dependent on the initial guess of the mixture parameters. The problem can be to some degree alleviated by performing multiple runs of the algorithm, each run starting from different initial conditions and returning the result with the highest log pðXjHÞ. This approach is called multiple restart EM (MREM).
Model-based clustering [12,27] is one of the most important applications of GMMs. The aim of clustering can be defined as dividing a set of objects into K disjoint groups, called clusters, in such a way that the objects within the same group are very similar, whereas the objects in different groups are very distinct. In applications of GMMs to clustering, it is assumed that each feature vector was generated from one of K mixture components. The goal of clustering is to identify, for each feature vector, the mixture component from which it was generated. If it is possible to estimate the mixture parameters, clustering can be performed using maximum a posteriori (MAP) rule, by allocating a feature vector to a cluster (mixture component) with the highest posterior probability. From the Bayes theorem, this probability for the mixture component m can be expressed as: Maximization of (6) is equivalent to finding the mixture index m with the highest value a m p m ðx i jh m Þ. The aim of the paper is the description and thorough evaluation of a new method for the initialization of the EM algorithm for Gaussian model-based clustering. The method, while retaining the probabilistic nature of random initialization, tries to initialize the component means by feature vectors located far away from the already initialized components. The advantage of the method compared to purely random approach is the increased probability of proper initialization of all cluster (components) centers in case of well-separated clusters, which can lead to better clustering results.
The remainder of the paper is organized as follows. In the next section, the research background related to the presented work is discussed. In Sect. 3, the EM algorithm for GMMs is described. Section 4 presents the author's approach to the initialization problem. Section 5 presents the results of computational experiments, in which the approach was compared to three other initialization methods for solving the clustering problem, using both synthetic and real data. The final section offers the concluding remarks.

Related research
The EM algorithm is a local optimization method. Because the log likelihood function (4) has numerous local maxima [25], performance of the EM algorithm is strongly dependent on the initial conditions. Many initialization methods have been proposed, but no single strategy outperforms the rivaling procedures in all cases. In this section, only the most popular approaches are mentioned.
The standard procedure for tackling the problem of the EM algorithm initialization is the multiple restart approach (MREM). In this approach the EM algorithm is run many times, each run being started with different random initial conditions. The result of the best run (in the sense of the highest final log pðXjHÞ) is returned as the final outcome. The most popular random method [25], which in this paper is called rnd-nearest, initializes the component means with randomly chosen feature vectors. After initialization of the means, the feature vectors are clustered by assigning a particular feature vector to the cluster represented by the closest mean. Next, the covariance matrix and the mixing proportion of each cluster are used as initial estimates of the parameters. A variant of this method which is known as rnd-kmeans uses the K-means algorithm [26], initialized by random feature vectors, to find component means. Next, the covariance matrices and mixing proportions are initialized using the same method as in the rnd-nearest technique. However, the K-means algorithm is in itself susceptible to the local maxima.
Another random initialization method [10,30] uses mixing proportions equal to 1=K, selects random feature vectors as component means, and employs a spherical covariance matrix equal to 0:1r 2 I, where I is the identity matrix and In the above equation, R is the covariance matrix of the learning set X. This approach is referred to as rndspherical.
In [4], an extension of MREM called emEM was proposed. The idea of emEM involves performing several short EM runs using different random starting points and a lax convergence criterion. The mixture parameters obtained by the best (in the sense of the highest log pðXjHÞ) short run are used as a starting point for a long EM run. This strategy can be improved by repeating it many times until the available CPU time is exhausted. A variant of emEM called rndEM [21] reduces the short EM phase to the evaluation of log pðXjHÞ of the random starting position.
In yet another approach to the GMM parameter estimation problem, the EM algorithm is combined with some global optimization method, e.g.. an evolutionary algorithm [1,30] or a particle swarm optimizer [7]. However, global optimizers have high computational demands and this approach is limited to moderately sized datasets. Other algorithms proposed to deal with the problem of local maxima include versions of EM with split and merge operations [34,36], a greedy learning method [35], and a component-wise method [10]. Some of these approaches try to estimate simultaneously the number of components K while searching for the optimal set of mixture parameters H.
The idea of using distant feature vectors to initialize center-based clustering algorithms is not new. In [17] it was described in the context of Lloyd's method for vector quantization, equivalent to the K-means algorithm. The original approach initialized the centroid of the first cluster with the feature vector with maximum norm and considered all feature vectors (instead of a random subset, as in the method presented in this article) as new cluster centers. For this reason, it had a deterministic nature and was unsuitable for multiple restart scenario. In [18] a probabilistic version of this method for the K-means algorithm was proposed. Since the K-means algorithm can be interpreted in the framework of model-based clustering (see e.g., [21] or [23]), the method proposed in this paper can be considered as a generalization of [18] to the cases of nonspherical and non-homogeneous covariances. The wellknown K-means?? algorithm [2] initializes cluster centroids by random feature vectors chosen with the probability proportional to the squared shortest distance to the already initialized centroids.
The present article is an extension of the conference paper [19], in which initial experimental results achieved by the method were reported.

EM algorithm for Gaussian mixture models
The EM algorithm [9,24] is an iterative method which, starting from an initial guess of mixture parameters H ð0Þ , generates a sequence of estimates H ð1Þ ; H ð2Þ ; . . .; H ðjÞ ; . . . with increasing log likelihood (i.e., log pðXjH ðjÞ Þ [ log pðXjH ðjÀ1Þ Þ. Each iteration j of the algorithm consists of two steps called the expectation step (E-step) and the maximization step (M-step). For GMMs, these steps are defined as follows [27,32]: The E-steps and M-steps are applied alternately until a convergence criterion is met. In the experiments presented in this paper, a convergence criterion based on a relative improvement of log likelihood was employed. The EM algorithm was terminated if log pðXjH ðjÞ Þ À log pðXjH ðjÀ1Þ Þ log pðXjH ðjÀ1Þ Þ \; where ( 1 was a user-defined termination threshold (in the present case ¼ 1e À 5).

Description of the proposed rnd-maxmin method
In the discussed method, equal initial mixing proportions (a ð0Þ m ¼ 1=K) are used. In the next two sections, we describe how we generate initial component means and covariances.

Component means
The rationale behind the presented approach to initialization of the EM algorithm for GMMs was to correct the shortcomings of the random initialization methods shown in To amend this shortcoming, the method devised by the authors, denoted as rnd-maxmin, initializes the component means and covariances incrementally. When selecting new feature vectors as component means, it tries to disregard the vectors located closely to the already initialized components. The method has a probabilistic nature, which allows it to be used in the MREM algorithm. The initialization procedure can be described by the following steps: 1. Choose the mean of the first component l 2. Choose t (t is a parameter of the method) random unique feature vectors from the remaining (not yet selected as component means) elements of X. Denote by X r ¼ fx r 1 ; x r 2 ; . . .; x r t g the set of the chosen vectors.
For each x r i 2 X r ; compute the minimal squared Mahalanobis distance to the already chosen component means l 1 ; . . .; l ð0Þ mÀ1 : j ; x r i Þ is given by: 3. Select as the m-th component mean the element of X r with the largest minimal squared distance: 4. Generate randomly the covariance matrix R ð0Þ m .
terminate the algorithm.
The preliminary experiments indicated that the choice of parameter t has an influence on the performance of the above initialization method. In the computational experiments reported in section 5, we have chosen value t ¼ 5 for K [ 5 and t ¼ K for K 5, which yielded good results.

Component covariances
Random generation of the covariance matrix R ð0Þ m was based on the eigenvalue decomposition. Since a covariance matrix of a non-degenerate multivariate normal distribution is positive definite, it can be expressed as: where K m is a diagonal matrix of eigenvalues with all diagonal elements positive, and Q m is an orthogonal matrix of eigenvectors. In the presented method, the eigenvalues are generated first. They are generated randomly with the following two restrictions: -The relation of the largest eigenvalue to the smallest eigenvalue cannot be greater than 10. -The sum of all eigenvalues should be equal 1 10dK trðRÞ, where tr is the trace of a matrix and R is the covariance matrix of the learning set X.
Two fulfill these two conditions we first randomly generate d positive numbers k 1 ; k 2 ; . . .; k d from the uniform distribution. Let k max be the maximal generated number. For each of the generated numbers k i ; we check if k i \0:1k max . If this condition holds we assign k i ¼ 0:1k max . Finally, we scale all the numbers by the same factor, so their sum fulfills the second condition.
In the next step, the orthogonal matrix Q m is obtained by randomly generating a square d Â d matrix (each element is generated independently from the standard normal distribution) and performing its QR decomposition [13]. The covariance matrix R ð0Þ m is then obtained using (16) We have performed some profiling experiments with our software, which confirmed the above analysis: the ratio of the runtime of our initialization method to the total runtime of the MREM was less than 1.5 %.

Experimental results
In this section, the results of the computational experiments performed on many synthetic datasets and four real-life datasets are presented. The experiments used GMMs for model-based clustering, in which, after the learning of model parameters by the MREM algorithm, the feature vectors were clustered using the MAP rule, as explained in Sect. 1. External cluster validity measure was employed to compare the partitions of data discovered in the course of clustering with the original partitions. In the case of synthetic data, the original partitions were known, because all the datasets were generated by a random number generator, making it possible to track the source (i.e., mixture component) of each feature vector. Also, the real-life datasets come with class labels, which allows to compare their original partitions with the obtained clustering results.
The external cluster validity measure used to compare the results of clustering methods was the adjusted Rand index (ARI) [16]. The expected value of ARI in the case of randomly generated partitions is 0. A higher value of ARI indicates a higher similarity between partitions; the maximum value of 1 means that two partitions are identical.
The main aim of computational experiments with synthetic data was to identify conditions in which MREM using our initialization method produces better clustering results than MREM using other common methods. We also wanted to demonstrate its performance on several real-life datasets.

The algorithms
The rnd-maxmin method described in Sect. 4 was compared with rnd-nearest, rnd-spherical, and rnd-kmeans methods outlined in Sect. 2. It is possible that, after the initialization of component means, rnd-nearest or rndkmeans methods obtain a singular component covariance matrix. This happens, for instance, when the number of objects in the initial cluster is smaller than the number of dimensions d. In this case, the used implementations of rnd-nearest and rnd-kmeans resort to rnd-spherical initialization of the covariance matrix of the component.
The four random methods were always compared on an equal runtime basis. Each of the methods was allocated the same amount of CPU time. Then, the MREM algorithm was run until the allocated time was exhausted. The result of the best (in the sense of the highest log likelihood) solution found by the MREM was considered as the outcome of the method.
It should be noted that the log likelihood for GMMs is unbounded [5]. As a consequence, the EM algorithm may converge to the boundary of the parameter space, with a covariance matrix of a component becoming arbitrarily close to singular and log pðXjHÞ approaching infinity. In the present experiments with the multiple restart version of the EM algorithm, this issue was addressed by computing, after each EM iteration, condition numbers of component covariances [10]. If the largest condition number was above a fixed large threshold, the EM run was aborted and a new one was commenced.

Hardware and software
The EM algorithm and all four random initialization methods were implemented in C?? language and compiled by the Intel C?? compiler version 14.0.1 using optimizing options (-O3 -ipo -march=core2 -fno-alias). The algorithms were run on a Dell Poweredge 1950 server with two quad-core Intel Xeon 5355 (2.66 GHz) processors and 16 GB of RAM, running Ubuntu Linux 12.04. The implementation of EM was parallelized [20] using Open-MP standard for shared memory computers, taking advantage of all eight cores of the system. Moreover, a cluster of 16 identical Dell servers was put to use, making it possible to perform up to 16 independent experiments in parallel. While the approach with multiple servers did not speed up a single EM run, it was able to speed up batches of EM runs which had to be performed.
To prove that the difference between EM initialization methods was not a random effect, nonparametric statistical tests using R software package [31] were performed.

Synthetic datasets
In the experiments with synthetic data, a generator recently proposed in [22] was employed which randomly generates Gaussian mixtures according to the user-defined overlap characteristics. The overlap x ij between two clusters i and j is defined as the sum of two misclassification probabilities x jji and x ijj where: and similarly: The overlap characteristics of mixtures obtained from the generator [22] were controlled by the two parameters: x specifying average pairwise overlap between components and x specifying maximum pairwise overlap. In the experiments, the number of components K was fixed at 20 and mixtures with dimension d 2 f2; 5; 10g were generated. For each dimension, following [22], we used x 2 f0:001; 0:01; 0:05g. For each value of x we used three values of x: a value specifying very high maximum overlap, a value specifying very low maximum overlap and an intermediate value specifying moderate maximum overlap. Thus, for each dimension d; we used nine ð x; xÞ pairs. The parameter p^determining minimal mixing proportion was set at p^¼ 0:25 Ã 1=K ¼ 0:0125. Figure 2 shows six example datasets simulated from the mixtures obtained from the generator [22] for three values of x and extreme values of x. As noticed in [22], both parameters influence overlap characteristics of generated mixtures. For instance, mixtures illustrated in Fig. 2a, b have very low average overlap ( x ¼ 0:001) between component pairs. However, in Fig.  2b with high x ¼ 0:15; the components are much better separated, except for two component pairs which mostly contribute to average overlap. Similar trends are repeated in Fig. 2c, d. Mixture components in Fig. 2e, f are both poorly separated.
The experimental setup was as follows. For each triplet ðd; x; xÞ, 50 different random mixtures were generated using MixSim software [28]. The generated mixture parameters were stored to be used as ground truth for comparison. Then, for each mixture a single dataset consisting of 6000 feature vectors was realized, and for each of the datasets a single run of MREM using a given initialization method was performed. The runtime allocated for MREM was 100 s for d ¼ 10, 50 s for d ¼ 5, and 20 s for d ¼ 2.
The feature vectors were clustered by the MAP rule using the best (in the sense of highest log pðXjHÞ) set of parameters found by MREM. Next, the partition found by cluster analysis was compared (using ARI) with the original partition of the data obtained by tracking a source (mixture) component which generated each feature vector.
The average ARI values obtained for the clustering based on ground truth mixture parameters were used as the baseline for comparison of the four EM initialization methods. For each method, the result is presented as % error relative to the ground truth mixture parameters. The % error A of the method A is computed as A ¼ ðARI T À ARI A Þ= ARI T Ã 100, where ARI T is the average (over 50 different mixtures) ARI obtained using the ground truth parameters and ARI A is average ARI obtained using mixture parameters estimated by MREM using the initialization method A. A lower value of A indicates a better performance; values close to 0 indicate that MREM running the initialization method A achieves a similar performance as clustering using the ground truth mixture parameters.
To assess the statistical significance of the differences in ARI and log likelihood, Wilcoxon signed rank test for paired data [8] was conducted in which, separately for each triplet ðd; x; xÞ, the results obtained by the rnd-maxmin method were compared with the results obtained by the best of the three remaining methods.
The results of the experiments with synthetic data for dimension d ¼ 2 and different values of x and x are shown in Table 1. A represents % error in ARI relative ground truth mixture parameters. The last row of the Table, labeled 'optimal ARI' shows the ARI obtained when clustering was performed using ground truth mixture parameters. L represents average (over 50 different mixtures) log likelihood obtained using a given initialization method. In each column, the best results for A and L are shown in italics. In the row presenting the results of the rnd-maxmin method, p and p L represent Wilcoxon signed rank test p values in statistical comparison of the rndmaxmin method with the best of the three remaining with respect to A and L, respectively. If the comparison is statistically significant at 0:05 level, the corresponding result of rnd-maxmin is shown in bold.
The results obtained for d ¼ 5 and d ¼ 10 are shown in Tables 2 and 3, respectively. The results from Tables 1-3 indicate that average overlap x alone is mostly sufficient for determining the attainable ARI, when clustering is performed using the ground truth parameters. However, both x and x influence the performance of the four initialization methods with respect to A .
The comparison of the performance of the rnd-maxmin method with the best of the remaining methods indicates that: rnd-maxmin dominates the others with respect to ARI and log likelihood when the average overlap between components is very low ( x ¼ 0:001) regardless of the value of d and x. -For situations with low average overlap ( x ¼ 0:01), our method also dominates the others if the maximum overlap between component is not very high. However, the experiments identified the clear weakness of our approach: it usually yields results worse than the other methods (and sometimes by a large margin) if the maximum overlap between components is very high. We conjecture, that in these cases our strategy of locating component means far from already initialized components fails to place two means in close vicinity.
-A similar trend is repeated for x ¼ 0:05. -The performance trends with respect to ARI and log likelihood are not identical (although they are very similar for x ¼ 0:001).
We summarize the results with synthetic data in Table 4, which shows the number of significant (at 0:05 level) wins scored by each method against the best of the other methods, separately for each value of x. Table 4 demonstrates that if we remove rnd-maxmin from the competition, there is no clear winner among the three remaining methods.

Real datasets
In this section, we demonstrate the performance of the rndmaxmin method on four real-life datasets with known class labels. Two of them: iris and thyroid are small-scale datasets available from UCI machine learning repository [3]. They are commonly used in comparisons of clustering algorithms. Two other datasets: Brodatz and pendigits are examples of more challenging problems, because of greater sample size, dimension of feature space and number of classes.
In the experiments the class information was discarded during GMM learning; it was used only to compute ARI. (Figs. 3a-6a and . MREM algorithm with each of the initialization methods was run for 200 s larger Brodatz and pendigits datasets and for 0.1 s for smaller iris and thyroid datasets. This experiment was repeated 50 times, each time starting with a different seed of the random number generator. To assess statistical significances of differences in ARI and log pðXjHÞ; the Wilcoxon rank sum test [8] was performed.

Pendigits dataset
The pendigits dataset is available from the UCI machine learning repository [3]. It describes handwritten images of ten digits. Each digit is represented by a 16-dimensional feature vector. In [30] a subset of this dataset describing the first five digits was used for GMM learning. To evaluate the learning algorithm in a more challenging setting, in the presented experiments the complete dataset (K ¼ 10) consisting of 10992 feature vectors was used. Similarly to [30], first the dimensionality was reduced using principal component analysis (PCA), retaining the first nine principal components which together captured more than 95 % of the total variance. Figure 3 illustrates the progress toward the final solution of the four random initialization methods. The curves in Fig. 3a were obtained by measuring the ARI of the best (in the sense of highest log likelihood) solution found so far by the MREM algorithm in time steps of 10 s. The curves in Fig. 3b were updated after each EM run (average time of a single EM run was about 0.2 s). Table 5 shows the comparison of the final solutions of the four random initialization methods based on 50 independent runs. The first column of the table indicates the name of the method. The second column shows the average (over 50 experiments) log likelihood obtained by the Table 4 Summary of the experiments with synthetic datasets. For each value of x and each method, we report the number of experiments (out of 9 conducted experiments) in which the method performed significantly better (at 0:05 level) than the best of the remaining methods. The numbers before ''/'' signs concern ARI; the numbers after these signs concern log likelihood Method x ¼ 0:001  The results in Fig. 3 and Table 5 indicate that the rndmaxmin method outperforms the three other random initialization methods with respect to ARI. Although the relative difference in ARI in comparison with the second best rnd-spherical is very small (about 0.5 %), it is statistically significant at the 0:05 level. The difference in log likelihood is significant as well.

Brodatz dataset
The Brodatz dataset originated from the Laboratory of Image Processing and Pattern Recognition (INPG-LTIRF). It was used in the framework of the Esprit project ELENA 1 . The original source of data was the Brodatz texture album [6]. Each object in the dataset belongs to 1 of 11 classes representing textures from the album. The objects are described by 40 features extracted using estimation of fourth-order modified moments in four orientations: 0, 45, 90 and 135 degrees [14]. Each of the 11 groups consists of 500 objects.
In this dataset there are linear dependencies between 40 features: the determinant of the covariance matrix is very near zero and the first 37 principal components capture 100 % of the total variance. Thus, this dataset is badly conditioned and a GMM cannot be estimated by the EM algorithm using the original features. To remove linear dependencies from this data, we used PCA to reduce the dimensionality to 37. Figure 4 illustrates the progress toward the final solution of the four random initialization methods.
It is evident that MREM using rnd-kmeans is easily trapped in local minima of the log likelihood function, as its performance is far worse than the performance of the other three methods. Table 6 shows the comparison of the final solutions of the four random initialization methods based on 50 independent runs.
Similarly, as in the case of pendigits dataset, the rndmaxmin method outperforms the others with respect to ARI of the clustering solution (1.9 % relative difference) and the log likelihood of estimated GMMs. The differences are significant at the 0:05 level.

Thyroid dataset
The thyroid dataset is also available from the UCI repository. It consists of the thyroid disease measurements of 215 patients from the same hospital. Each of the patients belongs to one of three groups with a known diagnosis: a group of healthy patients (150 cases), patients with hyperthyroidism (35 cases), and patients with hypothyroidism (30 cases). Each patient is characterized by the results of five laboratory tests. Figure 5 illustrates the progress toward the final solution of the four random initialization methods. The curves in Fig. 5a were obtained by measuring the ARI of the best (in the sense of highest log likelihood) solution found so far by the MREM algorithm in time steps of 0.005 s. The curves in Fig. 5b were updated after each EM run (the average time of a single EM run was about 0.002 s). Table 7 shows the comparison of the final solutions of the four random initialization methods based on 50 independent runs. Since the same partition of data was found by every method in each of the 50 runs (i.e., there was no difference between the methods), we did not perform statistical significance tests.

Iris dataset
The well-known iris dataset [11], available from the UCI repository, contains four measurements of 150 samples of three iris species. The feature vectors are divided evenly into three classes. Fig. 6 illustrates the progress toward the final solution of the four random initialization methods. Table 8 shows the comparison of the final solutions of the four random initialization methods based on 50 independent runs. The results indicate that similarly, as in the   case of the thyroid dataset, there was no discernible difference between the four methods.

Conclusions and future work
In this article, a new random method for initialization of the EM algorithm for Gaussian mixture model-based clustering was proposed. The method was compared with three other approaches. The tests were carried out using multiple restart EM algorithm (MREM) on many synthetic datasets with different overlap characteristics and four real-life datasets.
The results of experiments confirm the well-known fact that no single method outperforms the others in all cases. Our approach also has its strengths and weaknesses. Generally, it performs better (or comparable) than other methods if the maximum overlap between clusters is not very high. Otherwise, it produces clustering results worse then the competing algorithms.
Finally, it should be noted that for small-scale clustering problems, as exemplified by experiments with the iris and thyroid datasets, the performance of model-based clustering solution estimated by the MREM algorithm may be similar (if not virtually the same) irrespective of the initialization method.
There are several possible directions of future work. To correct the deficiencies of our method, we plan to devise a hybrid solution combining rnd-maxmin with other random methods. For the MREM approach, the hybrid solution could be easily obtained by alternately running the EM algorithm using two or more different initialization methods (and keeping the mixture with the highest log likelihood). We plan to apply this hybrid method to classification problems and use it to model class-conditional densities in discriminant analysis [15]. This application of GMMs is often characterized by high overlap between mixture components. In this situation, a rndmaxmin strategy alone performs poorly.
An anonymous reviewer has pointed out that there are other possibilities than Mahalanobis distance, for measuring distance between a feature vector and a component mean (for instance, the contribution of the feature vector to the log likelihood). We plan to implement different distance metrics in the rnd-maxmin method and test their performance.
Also, there are ideas to combine the described method for initialization of the EM algorithm with a global optimization method, e.g., differential evolution, similarly to that done with the K-means algorithm in [18].