Constraint-Based Clustering in Large Databases

Tung, Anthony K. H.; Han, Jiawei; Lakshmanan, Laks V.S.; Ng, Raymond T.

doi:10.1007/3-540-44503-X_26

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 1973)

Included in the following conference series:

International Conference on Database Theory

Abstract

Constrained clustering — finding clusters that satisfy user-specified constraints — is highly desirable in many applications. In this paper, we introduce the constrained clustering problem and show that traditional clustering algorithms (e.g., k-means) cannot handle it. A scalable constraint-clustering algorithm is developed in this study which starts by finding an initial solution that satisfies user-specified constraints and then refines the solution by performing confined object movements under constraints. Our algorithm consists of two phases: pivot movement and deadlock resolution. For both phases, we show that finding the optimal solution is NP-hard. We then propose several heuristics and show how our algorithm can scale up for large data sets using the heuristic of micro-cluster sharing. By experiments, we show the effectiveness and efficiency of the heuristics.
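
The two-step idea in the abstract (start from a solution that satisfies the constraints, then refine it with movements that keep the constraints satisfied) can be illustrated with a toy sketch. This is not the authors' algorithm: the constraint used here (every cluster keeps at least `min_size` objects), the function name `constrained_refine`, and all parameters are assumptions chosen for illustration, and the sketch does not attempt the deadlock resolution phase or micro-cluster sharing.

```python
import math
import random


def centroid(points):
    """Mean of a non-empty list of equal-length tuples."""
    dim = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dim))


def constrained_refine(points, k, min_size, max_iter=100, seed=0):
    """Toy sketch (hypothetical helper, not the paper's method).

    Assumed constraint: every cluster must keep at least min_size objects.
    """
    if min_size < 1 or k * min_size > len(points):
        raise ValueError("constraint cannot be satisfied")
    rng = random.Random(seed)
    order = list(range(len(points)))
    rng.shuffle(order)

    # Initial feasible solution: deal objects round-robin, so every
    # cluster starts with at least floor(n / k) >= min_size objects.
    assign = [0] * len(points)
    for rank, idx in enumerate(order):
        assign[idx] = rank % k

    for _ in range(max_iter):
        centers = [
            centroid([points[i] for i in range(len(points)) if assign[i] == c])
            for c in range(k)
        ]
        sizes = [assign.count(c) for c in range(k)]
        moved = False
        for i, p in enumerate(points):
            best = min(range(k), key=lambda c: math.dist(p, centers[c]))
            # Confined movement: move only if the source cluster
            # still satisfies the size constraint afterwards.
            if best != assign[i] and sizes[assign[i]] > min_size:
                sizes[assign[i]] -= 1
                sizes[best] += 1
                assign[i] = best
                moved = True
        if not moved:
            break
    return assign
```

A call such as `constrained_refine(pts, k=2, min_size=3)` returns one cluster label per object. An object whose nearest center lies elsewhere stays put whenever moving it would shrink its cluster below `min_size`, which is the kind of blocked movement that a constraint-aware method has to reason about and that an unconstrained k-means update ignores.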

Author information

Authors and Affiliations

Simon Fraser University, Canada
Anthony K. H. Tung & Jiawei Han
IIT, Bombay & Concordia U
Laks V.S. Lakshmanan
University of British Columbia, Canada
Raymond T. Ng

Editor information

Editors and Affiliations

Limburg University (LUC), 3590, Diepenbeek, Belgium
Jan Van den Bussche
Department of Computer Science and Engineering, University of California, 92093-0114, La Jolla, CA, USA
Victor Vianu

About this paper

Cite this paper

Tung, A.K.H., Han, J., Lakshmanan, L.V., Ng, R.T. (2001). Constraint-Based Clustering in Large Databases. In: Van den Bussche, J., Vianu, V. (eds) Database Theory — ICDT 2001. ICDT 2001. Lecture Notes in Computer Science, vol 1973. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-44503-X_26

DOI: https://doi.org/10.1007/3-540-44503-X_26
Published: 12 October 2001
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-41456-8
Online ISBN: 978-3-540-44503-6
eBook Packages: Springer Book Archive
