Data Mining and Knowledge Discovery, Volume 14, Issue 1, pp 25–61

The complexity of non-hierarchical clustering with instance and cluster level constraints

  • Ian Davidson
  • S. S. Ravi

Abstract

Recent work has looked at extending clustering algorithms with instance-level must-link (ML) and cannot-link (CL) background information. Our work introduces δ and ε cluster-level constraints that influence inter-cluster distances and cluster composition. The addition of background information, though useful for obtaining better clustering results, raises the important feasibility question: given a collection of constraints and a set of data, does there exist at least one partition of the data set satisfying all the constraints? We study the complexity of the feasibility problem for each of the above constraints separately and also for combinations of constraints. Our results clearly delineate combinations of constraints for which the feasibility problem is computationally intractable (i.e., NP-complete) from those for which the problem is efficiently solvable (i.e., in the computational class P). We also consider the ML and CL constraints in conjunctive and disjunctive normal forms (CNF and DNF, respectively). We show that for ML constraints, the feasibility problem is intractable for CNF but efficiently solvable for DNF. Unfortunately, for CL constraints, the feasibility problem is intractable for both CNF and DNF. This effectively means that CL constraints in a non-trivial form cannot be efficiently incorporated into clustering algorithms. To overcome this, we introduce the notion of a choice-set of constraints and prove that the feasibility problem for choice-sets is efficiently solvable for both ML and CL constraints. We also present empirical results which indicate that the feasibility problem occurs extensively in real-world problems.
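
To make the feasibility question concrete, the following is a minimal sketch of a check for a plain conjunction of ML and CL constraints. It is an illustration under a simplifying assumption that is not part of the abstract: no bound is placed on the number of clusters. In that simplified setting, a satisfying partition exists exactly when no CL pair falls inside the same group of ML-connected points, since each such group can be given its own cluster. The function name `feasible_unbounded_k` and the integer point indices are hypothetical, and this is not presented as the authors' algorithm. The δ and ε constraints and the CNF, DNF, and choice-set forms are not covered.

```python
from typing import Iterable, List, Tuple

Pair = Tuple[int, int]


def find(parent: List[int], x: int) -> int:
    """Return the representative of x's set, shortening paths along the way."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x


def feasible_unbounded_k(
    n: int, must_link: Iterable[Pair], cannot_link: Iterable[Pair]
) -> bool:
    """Check whether some partition of points 0..n-1 satisfies every
    must-link and cannot-link pair.

    Assumption (ours, for illustration): the number of clusters is
    unrestricted, so each must-link component may form its own cluster.
    """
    parent = list(range(n))

    # Merge points that must share a cluster (transitive closure of ML).
    for a, b in must_link:
        ra, rb = find(parent, a), find(parent, b)
        if ra != rb:
            parent[ra] = rb

    # Infeasible exactly when a cannot-link pair lies in one ML component.
    return all(find(parent, a) != find(parent, b) for a, b in cannot_link)


if __name__ == "__main__":
    # Points 0, 1, 2 are chained by ML; separating 0 from 3 is still possible.
    print(feasible_unbounded_k(4, [(0, 1), (1, 2)], [(0, 3)]))
    # Separating 0 from 2 contradicts the ML chain 0-1-2.
    print(feasible_unbounded_k(4, [(0, 1), (1, 2)], [(0, 2)]))
```

The check runs in near-linear time in the number of constraints. It says nothing about the harder cases the abstract describes, where restrictions such as bounds on the number of clusters or richer constraint forms change the complexity.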

Keywords

Non-hierarchical clustering, Constraints, Complexity

Copyright information

© Springer Science+Business Media, LLC 2007

Authors and Affiliations

  1. Department of Computer Science, University at Albany - State University of New York, Albany, USA