Adapting K-Means Algorithm for Discovering Clusters in Subspaces
Subspace clustering is a challenging task in data mining. In very high-dimensional data spaces, traditional distance measures fail to differentiate the furthest point from the nearest point. To tackle this problem, we design the minimal subspace distance, which measures the similarity between two points in the subspace where they are nearest to each other. It discovers subspace clusters implicitly while measuring the similarities between points. We use the new similarity measure to adapt the traditional k-means algorithm for discovering clusters in subspaces. Clustering first with a low-dimensional minimal subspace distance detects the clusters in low-dimensional subspaces. Gradually increasing the dimension of the minimal subspace distance then refines the clusters in higher-dimensional subspaces. Our experiments on both synthetic and real data show the effectiveness of the proposed similarity measure and algorithm.
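The abstract does not give the formal definition of the minimal subspace distance, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes that the k-dimensional minimal subspace distance between two points is the Euclidean distance restricted to the k axis-parallel dimensions in which the points differ least, and that the adapted k-means simply reruns the assign/update loop while k grows. The names `minimal_subspace_distance`, `subspace_kmeans`, `dims`, and `n_iter` are hypothetical.

```python
import numpy as np


def minimal_subspace_distance(x, y, k):
    """Assumed definition: Euclidean distance over the k dimensions
    in which x and y are closest to each other (requires 1 <= k <= d)."""
    sq_diff = (np.asarray(x, dtype=float) - np.asarray(y, dtype=float)) ** 2
    # np.partition places the k smallest values in the first k slots.
    smallest = np.partition(sq_diff, k - 1)[:k]
    return float(np.sqrt(smallest.sum()))


def subspace_kmeans(X, n_clusters, dims, n_iter=20, seed=0):
    """k-means variant (illustrative only) that assigns points with the
    assumed minimal subspace distance, increasing its dimension over `dims`."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialise centers from randomly chosen data points.
    centers = X[rng.choice(len(X), size=n_clusters, replace=False)].copy()
    labels = np.zeros(len(X), dtype=int)
    for k in dims:  # e.g. [1, 2, 3]: low-dimensional first, then refine
        for _ in range(n_iter):
            # Per-dimension squared differences, shape (n_points, n_clusters, d).
            sq_diff = (X[:, None, :] - centers[None, :, :]) ** 2
            # Sum of the k smallest per-dimension terms for every point/center pair.
            dist = np.partition(sq_diff, k - 1, axis=2)[:, :, :k].sum(axis=2)
            new_labels = dist.argmin(axis=1)
            # Standard k-means update: each center becomes the mean of its members.
            for c in range(n_clusters):
                members = X[new_labels == c]
                if len(members):
                    centers[c] = members.mean(axis=0)
            if np.array_equal(new_labels, labels):
                break  # assignments stable at this dimension
            labels = new_labels
    return labels, centers
```

A call such as `subspace_kmeans(X, n_clusters=3, dims=[1, 2, 3])` would first group points that agree closely in any single dimension and then refine the grouping with two and three dimensions. The full-dimensional mean update is a simplification chosen for brevity; the paper may update centers differently.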
Keywords: Cluster Solution, Normalize Mutual Information, Conditional Entropy, Random Projection, Subspace Cluster