Ensemble Clustering of High Dimensional Data with FastMap Projection
DOI: 10.1007/978-3-319-13186-3_43
- Cite this paper as:
- Khan I., Huang J.Z., Tung N.T., Williams G. (2014) Ensemble Clustering of High Dimensional Data with FastMap Projection. In: Peng WC. et al. (eds) Trends and Applications in Knowledge Discovery and Data Mining. PAKDD 2014. Lecture Notes in Computer Science, vol 8643. Springer, Cham
Abstract
In this paper, we propose an ensemble clustering method for high dimensional data which uses FastMap projection to generate subspace component data sets. In comparison with popular random sampling and random projection, FastMap projection preserves the clustering structure of the original data in the component data sets so that the performance of ensemble clustering is improved significantly. We present two methods to measure preservation of clustering structure of generated component data sets. The comparison results have shown that FastMap preserved the clustering structure better than random sampling and random projection. Experiments on three real data sets were conducted with three data generation methods and three consensus functions. The results have shown that the ensemble clustering with FastMap projection outperformed the ensemble clusterings with random sampling and random projection.
Keywords
Ensemble clustering · FastMap · Random sampling · Random projection · Consensus function

1 Introduction
The emergence of new application domains has resulted in very high dimensional big data, such as text data, microarray data and smart phone user behavior data. Such high dimensional data with thousands of features present a big challenge to current data mining techniques [10]. The curse of dimensionality and sparsity are two main problems, among others, that prevent many clustering algorithms from finding strongly cohesive clusters in very high dimensional big data.
Ensemble clustering [14, 15] is a new approach to clustering that integrates the results of different clusterings, generated either by running different clustering algorithms on the same original data or by clustering different data sets sampled from the original data. The ensemble clustering can achieve higher accuracy than its individual component clusterings. A large body of research has advanced this field [6]. Different ensemble clustering methods have been proposed to ensemble different types of clustering results, such as the results of one algorithm run on the same data with different parameter initializations [13], results obtained from the same data with different algorithms [1], and results of multiple component data sets generated from the same data set [8]. Subspace ensemble clustering has become a useful strategy to find robust clusters in such sparse and high dimensional data.
In high dimensional data, clusters often exist in different subspaces. Ensemble clustering based on full space clustering algorithms fails to cluster such data, and subspace ensemble clustering techniques promise to resolve this problem. Recently, two methods for generating low dimensional component data sets have been used in subspace ensemble clustering of high dimensional data. One generates low dimensional data by randomly sampling different features. The other generates low dimensional component data by using a random projection matrix to project the original high dimensional data onto a low dimensional space. We call the former the random sampling method and the latter the random projection method. Several flavours of random projection are now available [11, 16], but for ensemble clustering the basic random projection has been used [4]. Both random sampling and random projection benefit ensemble clustering of high dimensional sparse data. However, their drawback is that they cannot well preserve the clustering structure of the original data in the generated low dimensional component data sets, which increases the discrepancy of clustering structures across component data sets and thus affects the performance of ensemble clustering on high dimensional data.
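For reference, the two baseline generation methods can be sketched in a few lines of numpy; this is an illustrative sketch only, and the function names, toy dimensions and Gaussian projection matrix are our own assumptions rather than the exact setup used in the experiments:

```python
import numpy as np

def random_sampling(X, p, rng):
    # Component data set: sample p feature columns without replacement.
    cols = rng.choice(X.shape[1], size=p, replace=False)
    return X[:, cols]

def random_projection(X, p, rng):
    # Component data set: project onto a random Gaussian matrix, scaled
    # to roughly preserve pairwise distances (Johnson-Lindenstrauss style).
    R = rng.standard_normal((X.shape[1], p)) / np.sqrt(p)
    return X @ R

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 200))   # toy stand-in for high dimensional data
Y_rs = random_sampling(X, 10, rng)
Y_rp = random_projection(X, 10, rng)
```

Different draws of the sampled columns or of the projection matrix yield the different component data sets used by the ensemble.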
In this paper, we present a new low dimensional component data generation method based on FastMap [5], an algorithm that generates a low dimensional transformation of high dimensional data. Given a distance matrix of \(N\) objects, FastMap uses the well known Cosine Law to compute the coordinates of the \(N\) objects projected onto the line through two pivot objects selected from the data set. By removing the distance component along the newly generated dimension, a new distance matrix is computed. This process repeats until a \(k\) dimensional representation of the \(N\) objects is obtained. The advantage of FastMap projection over random sampling and random projection is that it better preserves the clustering structure of the original data in its generated component data sets. Thus, the performance of ensemble clustering is improved significantly.
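The FastMap procedure just described can be sketched as follows. This is an illustrative implementation, not the paper's code: pivots are chosen at random here (the randomness that later yields different component data sets), whereas Faloutsos and Lin's original heuristic picks a far-apart pivot pair, and the tolerance is our own:

```python
import numpy as np

def fastmap(D, k, rng):
    # Embed N objects into k dimensions from a distance matrix D.
    D2 = D.astype(float) ** 2              # work with squared distances
    N = D2.shape[0]
    Y = np.zeros((N, k))
    for dim in range(k):
        a, b = rng.choice(N, size=2, replace=False)
        dab2 = D2[a, b]
        if dab2 < 1e-12:                   # coincident pivots: leave coordinate 0
            continue
        # Cosine Law: coordinate of every object on the line through pivots a, b
        x = (D2[a] + dab2 - D2[b]) / (2.0 * np.sqrt(dab2))
        Y[:, dim] = x
        # remove the captured component from the residual squared distances
        D2 = np.maximum(D2 - (x[:, None] - x[None, :]) ** 2, 0.0)
    return Y

rng = np.random.default_rng(1)
X = rng.standard_normal((40, 100))
D = np.sqrt(((X[:, None] - X[None]) ** 2).sum(-1))  # Euclidean distance matrix
Y = fastmap(D, 5, rng)
```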
We propose two methods to measure how well the clustering structure of the original data is preserved in the generated component data sets. We used three real world data sets to analyze the preservation achieved by random sampling, random projection and FastMap projection. The comparison results have shown that FastMap preserved the clustering structure better than the other two methods. We also used the three data sets to conduct ensemble clustering experiments with the three component data generation methods and three consensus functions to ensemble the clustering results. The \(k\)-means algorithm was used to generate the component clusterings. The results have shown that the ensemble clustering with FastMap projection outperformed the ensemble clusterings with random sampling and random projection on all three data sets. The overall performance of FastMap was the best among the three methods.
2 Framework for Subspace Ensemble Clustering
\(\mathbf{Step 1:}\) Generate \(K\) different component data sets \(\{C_1, C_2, \dots, C_K\}\) from \(X\) using a component generation method.
\(\mathbf{Step 2:}\) Cluster the \(K\) component data sets to produce \(K\) component clusterings \(\{\pi_1, \pi_2, \dots, \pi_K\}\) independently, using one or more clustering algorithms.
\(\mathbf{Step 3:}\) Ensemble the \(K\) component clusterings into a single clustering \(\pi\) using an ensemble method called a consensus function.
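The three steps can be sketched end to end in numpy. This toy version assumes random sampling for Step 1, a minimal k-means for Step 2, and a co-association average for Step 3, with a second k-means pass standing in for the graph partitioner (METIS) that the paper actually uses; all names and parameters are illustrative:

```python
import numpy as np

def kmeans(Y, k, rng, iters=50):
    # Minimal Lloyd's k-means returning a label vector.
    C = Y[rng.choice(len(Y), size=k, replace=False)].astype(float)
    for _ in range(iters):
        labels = ((Y[:, None] - C[None]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if np.any(labels == j):
                C[j] = Y[labels == j].mean(0)
    return labels

def subspace_ensemble(X, k, K, p, rng):
    # Step 1 + Step 2: K low dimensional component data sets,
    # each clustered independently.
    parts = []
    for _ in range(K):
        cols = rng.choice(X.shape[1], size=p, replace=False)
        parts.append(kmeans(X[:, cols], k, rng))
    # Step 3: similarity-based consensus -- average co-association matrix,
    # clustered once more as a simple stand-in for a graph partitioner.
    S = np.mean([(pi[:, None] == pi[None, :]).astype(float) for pi in parts], 0)
    return kmeans(S, k, rng)

rng = np.random.default_rng(3)
X = np.vstack([rng.standard_normal((30, 100)),
               rng.standard_normal((30, 100)) + 4.0])  # two separated groups
pi = subspace_ensemble(X, k=2, K=10, p=10, rng=rng)
```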
2.1 Subspace Component Generation
In ensemble clustering of high dimensional data, we are interested in generating low dimensional component data sets that better preserve the clustering structure of the original data, so as to improve the performance of ensemble clustering on high dimensional data. Currently, random projection and random sampling are the two widely used methods for low dimensional component data generation. We review these two methods briefly below.
2.2 Component Data Clustering
Any clustering algorithm can be used to cluster a low dimensional component data set. Popular choices are \(k\)-means, subspace \(k\)-means and hierarchical clustering methods. The advantage of \(k\)-means type algorithms is their efficiency in handling large data. In this work, we used \(k\)-means. Quite often, different clustering algorithms are used to generate different component clustering results for ensemble clustering. However, there is no clear guidance on how the different clustering algorithms should be used. In practice, it is more convenient to use one clustering algorithm for ensemble clustering rather than multiple clustering algorithms.
2.3 Ensemble Component Clusterings
An ensemble method is used to combine multiple component clusterings from different component data sets into a single clustering as the final result. In ensemble clustering, the ensemble method is also called a consensus function. Several consensus functions have been proposed, with different strategies and methods to ensemble component clustering results. Below, we briefly review the three ensemble methods that were used in this work.
Similarity-Based Consensus Function. A clustering signifies a relationship between objects in the same cluster and can thus be used to establish a measure of pairwise similarity [15]. A similarity matrix is constructed for each component clustering: the element indexed by two objects in the same cluster is assigned the value 1, and the element is 0 if the two objects are in different clusters. After computing the \(K\) similarity matrices, a final matrix is obtained by averaging the corresponding cells of all similarity matrices. The METIS algorithm [9] is then applied to the resultant similarity matrix to obtain the final clustering ensemble.
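The co-association construction can be sketched in a few lines; this is a minimal illustration of the averaging step only, and the final METIS partitioning is not reproduced:

```python
import numpy as np

def coassociation(clusterings):
    # Average of K binary same-cluster indicator matrices.
    P = np.asarray(clusterings)            # shape (K, N): K label vectors
    S = np.zeros((P.shape[1], P.shape[1]))
    for pi in P:
        S += (pi[:, None] == pi[None, :])  # 1 where a pair shares a cluster
    return S / P.shape[0]

# Two clusterings of three objects: they agree on nothing fully,
# so off-diagonal similarities are fractional.
S = coassociation([[0, 0, 1], [0, 1, 1]])
```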
Hyper Graph-Based Consensus Function. In the Hyper Graph-based Consensus Function (HGPA), the ensemble problem is formulated as partitioning a hypergraph by cutting a minimal number of hyperedges [15]. The hypergraph is constructed by taking the \(N\) objects of a data set \(X\) as vertices; hyperedges of equal weight connect sets of vertices according to the \(K\) component clusterings. The HMETIS algorithm [12] is used to partition the hypergraph into unconnected components by cutting a minimum number of hyperedges.
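The hypergraph construction can be sketched as a binary incidence matrix, with one hyperedge per cluster per component clustering; the HMETIS partitioning itself is not reproduced, and the function name is our own:

```python
import numpy as np

def hyperedge_matrix(clusterings):
    # Entry (i, e) is 1 iff object i belongs to the cluster behind hyperedge e.
    H = []
    for pi in clusterings:
        pi = np.asarray(pi)
        for c in np.unique(pi):
            H.append((pi == c).astype(int))
    return np.array(H).T                   # shape (N, total number of clusters)

# Two clusterings of three objects give 2 + 2 = 4 hyperedges.
H = hyperedge_matrix([[0, 0, 1], [0, 1, 1]])
```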
3 FastMap Projection for Component Data Generation
3.1 FastMap Projection
Given \(S^{'}_{N\times N}\), we can choose a new pair of pivot objects and use (4) to compute the coordinates of the second dimension. We repeat this process \(k\) times to generate a \(k\) dimensional representation of \(X\).
We could also use Principal Component Analysis (PCA) to generate a low dimensional representation of \(X\). However, given \(X\), PCA can generate only one low dimensional data set, which makes it unsuitable for ensemble clustering, since ensemble clustering requires multiple component data sets. With FastMap, we can use a random process to select different pairs of pivot objects and thus produce different projections of the data as component data sets. Another advantage of FastMap is that it is efficient in handling large data.
3.2 Evaluation of Component Data Generation
In this section, we present two methods to evaluate component data generation for ensemble clustering, i.e., preservation of clustering structure of the original data in generated component data sets.
Intrinsic Dimensionality. Given a high dimensional data set \(X_{N\times m}\) of \(m\) dimensions and \(N\) objects, we use a component data generation function \(\varPhi (X,\theta )\) to generate a subspace data set \(Y_{N\times p}\), i.e., \(\varPhi (X,\theta )=Y_{N\times p}\), where \(\theta \) denotes the input parameters that produce different \(Y\)s from \(X\). Let \(\mathbf Y = \{Y_1,\dots ,Y_L\}\) be a set of \(L\) component data sets, all in \(p\) dimensions, and \(\mathbf D = \{D_1,\dots ,D_L\}\) the set of \(L\) distance matrices computed from \(\mathbf Y \). Given a distance matrix \(D_i\), we take the mutual distances in the upper half of \(D_i\) and plot their histogram. A large mean and a small variance of the distance distribution of \(D_i\) indicate the curse of dimensionality. We use intrinsic dimensionality to measure the curse of dimensionality of a data set, as in [2].
Definition 1
The intrinsic dimensionality of a data set in a metric space is defined as \(\rho = \mu^2/(2\sigma^2)\), where \(\mu \) and \(\sigma^2 \) are the mean and variance of its histogram of distances.
We use the intrinsic dimensionality \(\rho \) to evaluate a method \(\varPhi (X,\theta )\). For each component data set \(Y_i\), we compute \(\rho _i\). Then, we compute the average \(\bar{\rho }\) of the \(\rho _i\) over the \(L\) component data sets in \(\mathbf Y \). The smaller \(\bar{\rho }\), the better the method \(\varPhi (X,\theta )\).
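Definition 1 can be computed directly from the pairwise distances. The toy Gaussian data below is only an illustrative assumption, used to show that \(\rho\) grows with dimensionality as the distance distribution concentrates:

```python
import numpy as np

def intrinsic_dimensionality(Y):
    # rho = mu^2 / (2 sigma^2) over the upper-triangle pairwise distances.
    D = np.sqrt(((Y[:, None] - Y[None]) ** 2).sum(-1))
    d = D[np.triu_indices(len(Y), k=1)]    # upper half of the distance matrix
    return d.mean() ** 2 / (2.0 * d.var())

rng = np.random.default_rng(2)
rho_low = intrinsic_dimensionality(rng.standard_normal((200, 2)))
rho_high = intrinsic_dimensionality(rng.standard_normal((200, 50)))
```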
4 Experiments
Real world data sets.
| Data sets | #Instances | #Features | Source | #Classes |
|---|---|---|---|---|
| Internet Ad | 1000 | 1558 | Multivariate | 2 |
| GLI-85 | 85 | 22283 | Microarray | 2 |
| Orlraws10P | 100 | 10304 | Image | 10 |
4.1 Experiment Settings
For each data set, we used three methods, random sampling (RS), random projection (RP) and FastMap projection (FM), to generate component data sets. We used the \(k\)-means clustering algorithm to cluster each component data set. The number of clusters \(k\) was set to the number of classes in the data set. For ensemble clustering, we used the three consensus functions discussed in Sect. 2.3, i.e., the hyper graph based consensus function (HGPA), the similarity-based consensus function (CSPA) and the meta cluster-based consensus function (MCLA). By combining the three component data generation methods and the three consensus functions, we produced 9 ensemble clustering results for each data set. We denote these 9 ensemble clustering methods as RS-CSPA, RP-CSPA, FM-CSPA, RS-MCLA, RP-MCLA, FM-MCLA, RS-HGPA, RP-HGPA and FM-HGPA, respectively. We also show the ensemble clustering results obtained by using \(k\)-means (KM) on the original data.
In conducting the experiments for comparisons, the component data sets from the same data set were generated with the same number of dimensions by each component generation method. Each ensemble clustering was produced from 10 component clusterings which were produced with the \(k\)-means algorithm.
4.2 Evaluation Methods
We used four evaluation methods to evaluate the results of the 9 ensemble clustering methods: one unsupervised method and three supervised methods, given below.
4.3 Experimental Results
Clustering result comparison. The three column groups report results on Internet Ad, GLI-85 and Orlraws10P; each group gives CP, NMI, ARI and CA.

| Methods | CP | NMI | ARI | CA | CP | NMI | ARI | CA | CP | NMI | ARI | CA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| RS-CSPA | 171.5 | 0.01 | 0.50 | 0.54 | 561570 | 0.45 | 0.48 | 0.67 | 3711 | 0.67 | 0.81 | 0.78 |
| RP-CSPA | 163.1 | 0.51 | 0.53 | 0.62 | 542341 | 0.61 | 0.69 | 0.69 | 3686 | 0.69 | 0.83 | 0.80 |
| FM-CSPA | 145.9 | 0.55 | 0.61 | 0.74 | 529449 | 0.58 | 0.49 | 0.74 | 3011 | 0.71 | 0.84 | 0.85 |
| RS-HGPA | 271.1 | 0.01 | 0.49 | 0.53 | 581715 | 0.14 | 0.47 | 0.67 | 3616 | 0.69 | 0.83 | 0.80 |
| RP-HGPA | 202.3 | 0.24 | 0.49 | 0.52 | 567975 | 0.25 | 0.48 | 0.67 | 3624 | 0.70 | 0.84 | 0.81 |
| FM-HGPA | 202.3 | 0.68 | 0.50 | 0.54 | 525676 | 0.29 | 0.49 | 0.69 | 3523 | 0.72 | 0.85 | 0.82 |
| RS-MCLA | 171.5 | 0.02 | 0.50 | 0.54 | 570900 | 0.31 | 0.48 | 0.67 | 3619 | 0.72 | 0.85 | 0.82 |
| RP-MCLA | 184.0 | 0.62 | 0.54 | 0.65 | 547288 | 0.68 | 0.48 | 0.68 | 3608 | 0.73 | 0.86 | 0.83 |
| FM-MCLA | 143.3 | 0.59 | 0.61 | 0.77 | 522337 | 0.50 | 0.49 | 0.71 | 3575 | 0.75 | 0.87 | 0.84 |
| KM-CSPA | 170.6 | 0.47 | 0.56 | 0.68 | 614645 | 0.46 | 0.47 | 0.70 | 3798 | 0.59 | 0.81 | 0.70 |
| KM-HGPA | 202.3 | 0.47 | 0.49 | 0.54 | 645685 | 0.27 | 0.48 | 0.69 | 3705 | 0.65 | 0.84 | 0.77 |
| KM-MCLA | 149.9 | 0.48 | 0.62 | 0.71 | 608945 | 0.47 | 0.49 | 0.69 | 3755 | 0.61 | 0.85 | 0.71 |
Under the ARI evaluation, the results of ensemble clustering with FastMap projection were also better than those with the other two methods on all data sets. The majority of the best results were obtained with the FastMap method, except for one case on GLI-85; however, the difference was not very significant.
Under the CP and NMI evaluations, the ensemble clustering with FastMap projection also outperformed the ensemble clusterings with the other two methods on most data sets. The majority of the best results again occurred with the FastMap method. These results demonstrate that FastMap projection for component data set generation improves the performance of ensemble clustering of high dimensional data.
4.4 Comparisons of FastMap Projection vs. Random Sampling and Random Projection
Intrinsic dimensionality. The three column groups report results in 5-, 10- and 15-dimensional component spaces.

| Data sets | FM | RP | RS | FM | RP | RS | FM | RP | RS |
|---|---|---|---|---|---|---|---|---|---|
| Internet Ad | 3E-05 | 5E-05 | 0.526 | 2E-05 | 4E-05 | 0.485 | 1E-05 | 4E-04 | 0.390 |
| GLI-85 | 5E-11 | 2E-11 | 5E-09 | 5E-11 | 1E-11 | 2E-09 | 9E-11 | 1E-10 | 1E-10 |
| Orlraws10P | 1E-06 | 5E-05 | 1E-03 | 2E-06 | 9E-05 | 8E-03 | 1E-06 | 1E-04 | 7E-03 |
Distance preservation. The three column groups report results in 5-, 10- and 15-dimensional component spaces.

| Data sets | FM | RP | RS | FM | RP | RS | FM | RP | RS |
|---|---|---|---|---|---|---|---|---|---|
| Internet Ad | 0.681 | 0.739 | 1.812 | 0.690 | 0.901 | 1.001 | 0.671 | 0.719 | 1.009 |
| GLI-85 | 0.596 | 0.639 | 0.960 | 0.539 | 0.590 | 0.921 | 0.447 | 0.46 | 0.931 |
| Orlraws10P | 0.479 | 0.519 | 1.238 | 0.394 | 0.469 | 1.108 | 0.359 | 0.493 | 0.998 |
5 Conclusions
In this paper, we have presented the FastMap projection method for generating low dimensional component data sets for ensemble clustering. We have analyzed FastMap projection, random sampling and random projection, and demonstrated that FastMap projection better preserves the clustering structure of the original data than the other two methods. Because of this property, the ensemble clustering with FastMap projection outperformed the ensemble clusterings with the other two methods in experiments on three real world high dimensional data sets. Besides its better performance, another advantage of FastMap is that it is efficient in handling big data and flexible in component data generation.
Acknowledgment
This research is supported by Shenzhen New Industry Development Fund under Grant No. JC201005270342A.