Data Mining and Knowledge Discovery

, Volume 27, Issue 3, pp 344–371 | Cite as

A framework for semi-supervised and unsupervised optimal extraction of clusters from hierarchies

  • R. J. G. B. Campello
  • D. Moulavi
  • A. Zimek
  • J. Sander
Article

Abstract

We introduce a framework for the optimal extraction of flat clusterings from local cuts through cluster hierarchies. The extraction of a flat clustering from a cluster tree is formulated as an optimization problem and a linear complexity algorithm is presented that provides the globally optimal solution to this problem in semi-supervised as well as in unsupervised scenarios. A collection of experiments is presented involving clustering hierarchies of different natures, a variety of real data sets, and comparisons with specialized methods from the literature.

Keywords

Hierarchical clustering Optimal selection of clusters  Should-link and should-not-link constraints Cluster quality 

References

  1. Ahmed EB, Nabli A, Gargouri F (2012) Shacun: semi-supervised hierarchical active clustering based on ranking constraints. In: Proceedings of the 12th industrial conference on data mining. Springer, Berlin, pp 194–208Google Scholar
  2. Ankerst M, Breunig MM, Kriegel HP, Sander J (1999) Optics: ordering points to identify the clustering structure. SIGMOD Rec 28:49–60CrossRefGoogle Scholar
  3. Bade K, Nürnberger A (2006) Personalized hierarchical clustering. In: IEEE/WIC/ACM international conference on web intelligence (WI)Google Scholar
  4. Bade K, Nürnberger A (2008) Creating a cluster hierarchy under constraints of a partially known hierarchy. In: SIAM international conference on data mining (SDM), AtlantaGoogle Scholar
  5. Bade K, Hermkes M, Nürnberger A (2007) User oriented hierarchical information organization and retrieval. In: European conference on machine Learning (ECML), Corvallis, pp 518–526Google Scholar
  6. Basu S, Davidson I, Wagstaff K (eds) (2008) Constrained clustering: advances in algorithms applications and theory. CRC Press, Boca RatonGoogle Scholar
  7. Benkhalifa M, Mouradi A, Bouyakhf H (2001) Integrating wordnet knowledge to supplement training data in semi-supervised agglomerative hierarchical clustering for text categorization. Int J Intell Syst 16:929–947MATHCrossRefGoogle Scholar
  8. Blockeel H, De Raedt L, Ramon J (1998) Top-down induction of clustering trees. In: International conference on machine learning (ICML), pp 55–63Google Scholar
  9. Böhm C, Plant C (2008) Hissclu: a hierarchical density-based method for semi-supervised clustering. In: International conference on extending database technology (EDBT)Google Scholar
  10. Boudaillier E, Hébrail G (1997) Interactive interpretation of hierarchical clustering. In: Principles of data mining and knowledge discovery, LNCS, vol 1263, Springer, Heidelberg, pp 288–298Google Scholar
  11. Boudaillier E, Hébrail G (1998) Interactive interpretation of hierarchical clustering. Intell Data Anal 2:229–244CrossRefGoogle Scholar
  12. Brecheisen S, Kriegel HP, Kröger P, Pfeifle M (2004) Visually mining through cluster hierarchies. In: SIAM international conference on data mining (SDM)Google Scholar
  13. Davidson I, Ravi S (2005) Agglomerative hierarchical clustering with constraints: theoretical and empirical results. In: European conference on principles and practice of knowledge discovery in databases (PKDD)Google Scholar
  14. Davidson I, Ravi S (2009) Using instance-level constraints in agglomerative hierarchical clustering: theoretical and empirical results. Data Min Knowl Disc 18:257–282MathSciNetCrossRefGoogle Scholar
  15. Davidson I, Wagstaff KL, Basu S (2006) Measuring constraint-set utility for partitional clustering algorithms. In: European conference on principles and practice of knowledge in databases (PKDD)Google Scholar
  16. Ester M, Kriegel HP, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: International conference on knowledge discovery and data mining (KDD)Google Scholar
  17. Everitt BS, Landau S, Leese M (2001) Cluster analysis, 4th edn. Arnold, LondonGoogle Scholar
  18. Ferraretti D, Gamberoni G, Lamma E (2009) Automatic cluster selection using index driven search strategy. In: International conference of the Italian association for artificial intelligence (AI*IA)Google Scholar
  19. Frank A, Asuncion A (2010) UCI machine learning repository. http://archive.ics.uci.edu/ml. Accessed 1 Dec 2011
  20. Geusebroek JM, Burghouts G, Smeulders A (2005) The Amsterdam library of object images. Int J Comput Vis 61:103–112Google Scholar
  21. Gilpin S, Davidson I (2011) Incorporating SAT solvers into hierarchical clustering algorithms: an efficient and flexible approach. In: ACM SIGKDD international conference on knowledge discovery and data mining (KDD)Google Scholar
  22. Gupta G, Liu A, Ghosh J (2006) Hierarchical density shaving: a clustering and visualization framework for large biological datasets. In: IEEE ICDM workshop on data mining in bioinformatics (DMB)Google Scholar
  23. Gupta G, Liu A, Ghosh J (2010) Automated hierarchical density shaving: a robust automated clustering and visualization framework for large biological data sets. IEEE/ACM Trans Comput Biol Bioinformatics 7(2):223–237CrossRefGoogle Scholar
  24. Hamasuna Y, Endo Y, Miyamoto S (2012) On agglomerative hierarchical clustering using clusterwise tolerance based pairwise constraints. J Adv Comput Intell Intell Inform 16(1):174–179Google Scholar
  25. Hartigan JA (1975) Clustering algorithms. Wiley, New YorkMATHGoogle Scholar
  26. Herbin M, Bonnet N, Vautrot P (2001) Estimation of the number of clusters and influence zones. Pattern Recognit Lett 22(14):1557–1568MATHCrossRefGoogle Scholar
  27. Hinneburg A, Keim DA (1998) An efficient approach to clustering in large multimedia databases with noise. In: International conference on knowledge discovery and data mining (KDD)Google Scholar
  28. Horta D, Campello RJGB (2012) Automatic aspect discrimination in data clustering. Pattern Recognit 45:4370–4388MATHCrossRefGoogle Scholar
  29. Hubert L, Arabie P (1985) Comparing partitions. J Classif 2(1):193–218CrossRefGoogle Scholar
  30. Jain AK, Dubes RC (1988) Algorithms for clustering data. Englewood Cliffs, Prentice HallGoogle Scholar
  31. Kestler H, Kraus J, Palm G, Schwenker F (2006) On the effects of constraints in semi-supervised hierarchical clustering. In: IAPR workshop on artificial neural networks in pattern recognition (ANNPR)Google Scholar
  32. Kettenring JR (2006) The practice of cluster analysis. J Classif 23:3–30MathSciNetCrossRefGoogle Scholar
  33. Kim HJ, Lee SG (2002) An effective document clustering method using user-adaptable distance metrics. In: ACM symposium on applied computing (SAC)Google Scholar
  34. Klein D, Kamvar SD, Manning CD (2002) From instance-level constraints to space-level constraints: making the most of prior knowledge in data clustering. In: International conference on machine learning (ICML)Google Scholar
  35. Kraus JM, Palm G, Kestler HA (2007) On the robustness of semi-supervised hierarchical graph clustering in functional genomics. In: 5th International workshop on mining and learning with graphs (MLG), pp 1–4Google Scholar
  36. Kriegel HP, Kröger P, Sander J, Zimek A (2011) Density-based clustering. Wiley Interdiscip Rev Data Min Knowl Discov 1(3):231–240CrossRefGoogle Scholar
  37. Larsen B, Aone C (1999) Fast and effective text mining using linear-time document clustering. In: International conference on knowledge discovery and data mining (KDD)Google Scholar
  38. Lelis L, Sander J (2009) Semi-supervised density-based clustering. In: International conference on data mining (ICDM)Google Scholar
  39. Milligan GW, Cooper MC (1985) An examination of procedures for determining the number of clusters in a data set. Psychometrika 50(2):159–179CrossRefGoogle Scholar
  40. Miyamoto S, Terami A (2010) Semi-supervised agglomerative hierarchical clustering algorithms with pairwise constraints. In: IEEE international conference on fuzzy systems (FUZZ-IEEE), pp 1–6Google Scholar
  41. Naldi M, Campello R, Hruschka E, Carvalho A (2011) Efficiency issues of evolutionary k-means. Appl Soft Comput 11(2):1938–1952CrossRefGoogle Scholar
  42. Paulovich F, Nonato L, Minghim R, Levkowitz H (2008) Least square projection: a fast high-precision multidimensional projection technique and its application to document mapping. IEEE Trans Vis Comput Graph 14(3):564–575CrossRefGoogle Scholar
  43. Sander J, Qin X, Lu Z, Niu N, Kovarsky A (2003) Automatic extraction of clusters from hierarchical clustering representations. In: Pacific-Asia conference on knowledge discovery and data mining (PAKDD)Google Scholar
  44. Skarmeta AG, Bensaid A, Tazi N (2000) Data mining for text categorization with semi-supervised agglomerative hierarchical clustering. Int J Intell Syst 15(7):633–646MATHCrossRefGoogle Scholar
  45. Struyf J, Džeroski S (2007) Clustering trees with instance level constraints. In: European conference on machine learning (ECML), pp 359–370Google Scholar
  46. Stuetzle W (2003) Estimating the cluster tree of a density by analyzing the minimal spanning tree of a sample. J Classif 20:25–47MathSciNetMATHCrossRefGoogle Scholar
  47. Stuetzle W, Nugent R (2010) A generalized single linkage method for estimating the cluster tree of a density. J Comput Graph Stat 19(2):397–418MathSciNetCrossRefGoogle Scholar
  48. Sun H, Huang J, Han J, Deng H, Zhao P, Feng B (2010) gSkeletonClu: density-based network clustering via structure-connected tree division or agglomeration. In: IEEE international conference on data mining (ICDM)Google Scholar
  49. Tan PN, Steinbach M, Kumar V (2006) Introduction to data mining. Addison-Wesley, BostonGoogle Scholar
  50. Wagstaff KL (2002) Intelligent clustering with instance-level constraints. Ph.D. thesis, Department of Computer Science, Cornell UniversityGoogle Scholar
  51. Xiong T, Wang S, Mayers A, Monga E (2011) Semi-supervised parameter-free divisive hierarchical clustering of categorical data. In: Pacific-Asia conference on knowledge discovery and data mining (PAKDD), pp 265–276Google Scholar
  52. Yeung KY, Fraley C, Murua A, Raftery AE, Ruzzo WL (2001) Model-based clustering and data transformations for gene expression data. Bioinformatics 17(10):977–987CrossRefGoogle Scholar
  53. Yeung KY, Medvedovic M, Bumgarner R (2003) Clustering gene-expression data with repeated measurements. Genome Biol 4(5):R34Google Scholar
  54. Zhao H, Qi Z (2010) Hierarchical agglomerative clustering with ordering constraints. In: International conference on knowledge discovery and data mining (WKDD)Google Scholar
  55. Zhao Y, Karypis G (2005) Hierarchical clustering algorithms for document datasets. Data Min Knowl Discov 10:141–168MathSciNetCrossRefGoogle Scholar
  56. Zheng L, Li T (2011) Semi-supervised hierarchical clustering. In: IEEE international conference on data mining (ICDM)Google Scholar

Copyright information

© The Author(s) 2013

Authors and Affiliations

  • R. J. G. B. Campello
    • 2
    • 1
  • D. Moulavi
    • 1
  • A. Zimek
    • 1
  • J. Sander
    • 1
  1. 1.Department of Computing SciencesUniversity of AlbertaEdmontonCanada
  2. 2.Department of Computer SciencesUniversity of São PauloSão CarlosBrazil

Personalised recommendations