Constraint-Based Clustering in Large Databases
Constrained clustering, that is, finding clusters that satisfy user-specified constraints, is highly desirable in many applications. In this paper, we introduce the constrained clustering problem and show that traditional clustering algorithms (e.g., k-means) cannot handle it. We develop a scalable constrained-clustering algorithm that first finds an initial solution satisfying the user-specified constraints and then refines that solution by performing confined object movements under the constraints. The algorithm consists of two phases: pivot movement and deadlock resolution. For both phases, we show that finding the optimal solution is NP-hard. We then propose several heuristics and show how the algorithm scales to large data sets using the heuristic of micro-cluster sharing. Experiments demonstrate the effectiveness and efficiency of the heuristics.
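To make the "initial feasible solution, then confined movements" idea concrete, the following is a minimal illustrative sketch and not the algorithm of the paper. It assumes one simple, hypothetical constraint (every cluster must keep at least `min_size` objects) and uses a k-means-style refinement in which an object may move only if the move keeps the constraint satisfied. The function name `constrained_refine` and all of its parameters are assumptions made for illustration. Pivot movement, deadlock resolution, and micro-cluster sharing are not modeled.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))


def constrained_refine(points, k, min_size, max_iter=100, seed=0):
    """Hypothetical sketch: cluster `points` into k groups, each of which
    must always contain at least `min_size` points (assumed constraint)."""
    if min_size < 1 or k * min_size > len(points):
        raise ValueError("constraint cannot be satisfied")

    # Step 1: initial feasible solution. Dealing shuffled points round-robin
    # gives every cluster at least floor(n / k) >= min_size points.
    rng = random.Random(seed)
    order = list(range(len(points)))
    rng.shuffle(order)
    assign = [0] * len(points)
    for pos, idx in enumerate(order):
        assign[idx] = pos % k
    sizes = [assign.count(c) for c in range(k)]

    # Step 2: refinement by confined movements.
    for _ in range(max_iter):
        centers = [
            centroid([p for p, a in zip(points, assign) if a == c])
            for c in range(k)
        ]
        moved = False
        for i, p in enumerate(points):
            cur = assign[i]
            best = min(range(k), key=lambda c: sq_dist(p, centers[c]))
            closer = sq_dist(p, centers[best]) < sq_dist(p, centers[cur])
            # Move only if it improves the fit AND the source cluster
            # would still satisfy the minimum-size constraint afterwards.
            if best != cur and closer and sizes[cur] > min_size:
                assign[i] = best
                sizes[cur] -= 1
                sizes[best] += 1
                moved = True
        if not moved:
            break
    return assign


if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
           (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
    print(constrained_refine(pts, k=2, min_size=2))
```

Because moves that would violate the constraint are simply refused, this sketch can stall in a state where no single move is allowed even though a coordinated exchange would improve the clustering. The abstract's deadlock resolution phase presumably targets that kind of situation, and the sketch makes no attempt to handle it.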