Constraint-Based Clustering in Large Databases

  • Anthony K. H. Tung
  • Jiawei Han
  • Laks V.S. Lakshmanan
  • Raymond T. Ng
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 1973)

Abstract

Constrained clustering, that is, finding clusters that satisfy user-specified constraints, is highly desirable in many applications. In this paper, we introduce the constrained clustering problem and show that traditional clustering algorithms (e.g., k-means) cannot handle it. We develop a scalable constrained clustering algorithm that first finds an initial solution satisfying the user-specified constraints and then refines it through confined object movements that respect those constraints. The algorithm consists of two phases: pivot movement and deadlock resolution. For both phases, we show that finding the optimal solution is NP-hard. We therefore propose several heuristics and show how the algorithm scales to large data sets using the heuristic of micro-cluster sharing. Experiments demonstrate the effectiveness and efficiency of the heuristics.
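
The general idea of starting from a feasible solution and refining it with constraint-respecting moves can be illustrated with a small sketch. It is not the paper's algorithm: the pivot movement and deadlock resolution phases are not reproduced, the constraint type (a minimum cluster size) is an assumption chosen for illustration, and the function names `constrained_refine` and `mean` are hypothetical.

```python
from math import dist


def mean(points):
    """Coordinate-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))


def constrained_refine(points, k, min_size, max_iters=100):
    """Greedy refinement under an ASSUMED constraint: every cluster must
    keep at least `min_size` members. Returns one label per point."""
    if min_size < 1 or len(points) < k * min_size:
        raise ValueError("no feasible clustering for these parameters")

    # Initial feasible solution: round-robin assignment gives every
    # cluster at least floor(n / k) >= min_size points.
    labels = [i % k for i in range(len(points))]

    for _ in range(max_iters):
        centers = [
            mean([p for p, lab in zip(points, labels) if lab == c])
            for c in range(k)
        ]
        sizes = [labels.count(c) for c in range(k)]
        moved = False
        for i, p in enumerate(points):
            src = labels[i]
            dst = min(range(k), key=lambda c: dist(p, centers[c]))
            closer = dist(p, centers[dst]) < dist(p, centers[src])
            # Move only if it helps AND the source cluster stays feasible.
            if dst != src and closer and sizes[src] > min_size:
                labels[i] = dst
                sizes[src] -= 1
                sizes[dst] += 1
                moved = True
        if not moved:
            break
    return labels


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    print(constrained_refine(pts, k=2, min_size=2))
```

In this sketch a move that would push the source cluster below `min_size` is simply skipped, so the result can be worse than an unconstrained assignment but never violates the assumed constraint.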

Copyright information

© Springer-Verlag Berlin Heidelberg 2001

Authors and Affiliations

  • Anthony K. H. Tung (1)
  • Jiawei Han (1)
  • Laks V.S. Lakshmanan (2)
  • Raymond T. Ng (3)
  1. Simon Fraser University, Canada
  2. IIT Bombay & Concordia U
  3. University of British Columbia, Canada
