Space Decomposition in Data Mining: A Clustering Approach
Data mining algorithms search for interesting patterns in large amounts of data while aiming for manageable complexity and good accuracy. Decomposition methods are used to improve both criteria. Unlike most decomposition methods, which partition the dataset via sampling, this paper presents an accuracy-oriented method that partitions the instance space into mutually exclusive subsets using the K-means clustering algorithm. The basic divide-and-induce method is applied to several datasets with different classifiers, and its error rate is compared to that of the basic learning algorithm. An analysis of the results shows that the proposed method is well suited to datasets with numeric input attributes and that its performance is influenced by the dataset's size and homogeneity. Finally, a homogeneity threshold is developed that can be used to decide whether or not to decompose the dataset.
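The divide-and-induce idea described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation. The function names (`kmeans`, `fit_decomposed`, `predict`) are hypothetical, and a trivial majority-class learner stands in for whatever base classifier would be trained on each cluster. The sketch assumes numeric input attributes and a number of clusters `k` chosen in advance.

```python
import random
from collections import Counter


def squared_distance(a, b):
    """Squared Euclidean distance between two numeric tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(centers, point):
    """Index of the cluster center closest to the point."""
    return min(range(len(centers)),
               key=lambda i: squared_distance(centers[i], point))


def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd-style K-means; returns the k cluster centers."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(centers, p)].append(p)
        # Move each center to the mean of its group; keep it if the group is empty.
        centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
    return centers


def train_majority(labels):
    """Placeholder base learner (an assumption): predicts the most common label."""
    return Counter(labels).most_common(1)[0][0]


def fit_decomposed(points, labels, k, seed=0):
    """Partition the instance space with K-means, then induce one model per cluster."""
    centers = kmeans(points, k, seed=seed)
    buckets = [[] for _ in range(k)]
    for p, y in zip(points, labels):
        buckets[nearest(centers, p)].append(y)
    fallback = train_majority(labels)  # used only if a cluster ends up empty
    models = [train_majority(b) if b else fallback for b in buckets]
    return centers, models


def predict(centers, models, point):
    """Route the instance to its nearest cluster and use that cluster's model."""
    return models[nearest(centers, point)]
```

Because the clusters are mutually exclusive, each unseen instance is classified by exactly one model. Comparing the error rate of this arrangement against a single model trained on the whole dataset mirrors the evaluation the abstract describes.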
Keywords: Data Mining, Cluster Center, Unlabeled Data, Validity Index, Data Mining Algorithm