Abstract
We introduce a framework for the optimal extraction of flat clusterings from local cuts through cluster hierarchies. The extraction of a flat clustering from a cluster tree is formulated as an optimization problem and a linear complexity algorithm is presented that provides the globally optimal solution to this problem in semi-supervised as well as in unsupervised scenarios. A collection of experiments is presented involving clustering hierarchies of different natures, a variety of real data sets, and comparisons with specialized methods from the literature.
Similar content being viewed by others
Notes
Note that such a reduction may be even more noticeable for higher values of \(m_{ clSize }\).
In the example of Fig. 3b–d, these nodes would be virtual children of \(\mathbf{C}_1\) and \(\mathbf{C}_2\), which have been omitted for the sake of clarity.
References
Ahmed EB, Nabli A, Gargouri F (2012) Shacun: semi-supervised hierarchical active clustering based on ranking constraints. In: Proceedings of the 12th industrial conference on data mining. Springer, Berlin, pp 194–208
Ankerst M, Breunig MM, Kriegel HP, Sander J (1999) Optics: ordering points to identify the clustering structure. SIGMOD Rec 28:49–60
Bade K, Nürnberger A (2006) Personalized hierarchical clustering. In: IEEE/WIC/ACM international conference on web intelligence (WI)
Bade K, Nürnberger A (2008) Creating a cluster hierarchy under constraints of a partially known hierarchy. In: SIAM international conference on data mining (SDM), Atlanta
Bade K, Hermkes M, Nürnberger A (2007) User oriented hierarchical information organization and retrieval. In: European conference on machine Learning (ECML), Corvallis, pp 518–526
Basu S, Davidson I, Wagstaff K (eds) (2008) Constrained clustering: advances in algorithms applications and theory. CRC Press, Boca Raton
Benkhalifa M, Mouradi A, Bouyakhf H (2001) Integrating wordnet knowledge to supplement training data in semi-supervised agglomerative hierarchical clustering for text categorization. Int J Intell Syst 16:929–947
Blockeel H, De Raedt L, Ramon J (1998) Top-down induction of clustering trees. In: International conference on machine learning (ICML), pp 55–63
Böhm C, Plant C (2008) Hissclu: a hierarchical density-based method for semi-supervised clustering. In: International conference on extending database technology (EDBT)
Boudaillier E, Hébrail G (1997) Interactive interpretation of hierarchical clustering. In: Principles of data mining and knowledge discovery, LNCS, vol 1263, Springer, Heidelberg, pp 288–298
Boudaillier E, Hébrail G (1998) Interactive interpretation of hierarchical clustering. Intell Data Anal 2:229–244
Brecheisen S, Kriegel HP, Kröger P, Pfeifle M (2004) Visually mining through cluster hierarchies. In: SIAM international conference on data mining (SDM)
Davidson I, Ravi S (2005) Agglomerative hierarchical clustering with constraints: theoretical and empirical results. In: European conference on principles and practice of knowledge discovery in databases (PKDD)
Davidson I, Ravi S (2009) Using instance-level constraints in agglomerative hierarchical clustering: theoretical and empirical results. Data Min Knowl Disc 18:257–282
Davidson I, Wagstaff KL, Basu S (2006) Measuring constraint-set utility for partitional clustering algorithms. In: European conference on principles and practice of knowledge in databases (PKDD)
Ester M, Kriegel HP, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: International conference on knowledge discovery and data mining (KDD)
Everitt BS, Landau S, Leese M (2001) Cluster analysis, 4th edn. Arnold, London
Ferraretti D, Gamberoni G, Lamma E (2009) Automatic cluster selection using index driven search strategy. In: International conference of the Italian association for artificial intelligence (AI*IA)
Frank A, Asuncion A (2010) UCI machine learning repository. http://archive.ics.uci.edu/ml. Accessed 1 Dec 2011
Geusebroek JM, Burghouts G, Smeulders A (2005) The Amsterdam library of object images. Int J Comput Vis 61:103–112
Gilpin S, Davidson I (2011) Incorporating SAT solvers into hierarchical clustering algorithms: an efficient and flexible approach. In: ACM SIGKDD international conference on knowledge discovery and data mining (KDD)
Gupta G, Liu A, Ghosh J (2006) Hierarchical density shaving: a clustering and visualization framework for large biological datasets. In: IEEE ICDM workshop on data mining in bioinformatics (DMB)
Gupta G, Liu A, Ghosh J (2010) Automated hierarchical density shaving: a robust automated clustering and visualization framework for large biological data sets. IEEE/ACM Trans Comput Biol Bioinformatics 7(2):223–237
Hamasuna Y, Endo Y, Miyamoto S (2012) On agglomerative hierarchical clustering using clusterwise tolerance based pairwise constraints. J Adv Comput Intell Intell Inform 16(1):174–179
Hartigan JA (1975) Clustering algorithms. Wiley, New York
Herbin M, Bonnet N, Vautrot P (2001) Estimation of the number of clusters and influence zones. Pattern Recognit Lett 22(14):1557–1568
Hinneburg A, Keim DA (1998) An efficient approach to clustering in large multimedia databases with noise. In: International conference on knowledge discovery and data mining (KDD)
Horta D, Campello RJGB (2012) Automatic aspect discrimination in data clustering. Pattern Recognit 45:4370–4388
Hubert L, Arabie P (1985) Comparing partitions. J Classif 2(1):193–218
Jain AK, Dubes RC (1988) Algorithms for clustering data. Englewood Cliffs, Prentice Hall
Kestler H, Kraus J, Palm G, Schwenker F (2006) On the effects of constraints in semi-supervised hierarchical clustering. In: IAPR workshop on artificial neural networks in pattern recognition (ANNPR)
Kettenring JR (2006) The practice of cluster analysis. J Classif 23:3–30
Kim HJ, Lee SG (2002) An effective document clustering method using user-adaptable distance metrics. In: ACM symposium on applied computing (SAC)
Klein D, Kamvar SD, Manning CD (2002) From instance-level constraints to space-level constraints: making the most of prior knowledge in data clustering. In: International conference on machine learning (ICML)
Kraus JM, Palm G, Kestler HA (2007) On the robustness of semi-supervised hierarchical graph clustering in functional genomics. In: 5th International workshop on mining and learning with graphs (MLG), pp 1–4
Kriegel HP, Kröger P, Sander J, Zimek A (2011) Density-based clustering. Wiley Interdiscip Rev Data Min Knowl Discov 1(3):231–240
Larsen B, Aone C (1999) Fast and effective text mining using linear-time document clustering. In: International conference on knowledge discovery and data mining (KDD)
Lelis L, Sander J (2009) Semi-supervised density-based clustering. In: International conference on data mining (ICDM)
Milligan GW, Cooper MC (1985) An examination of procedures for determining the number of clusters in a data set. Psychometrika 50(2):159–179
Miyamoto S, Terami A (2010) Semi-supervised agglomerative hierarchical clustering algorithms with pairwise constraints. In: IEEE international conference on fuzzy systems (FUZZ-IEEE), pp 1–6
Naldi M, Campello R, Hruschka E, Carvalho A (2011) Efficiency issues of evolutionary k-means. Appl Soft Comput 11(2):1938–1952
Paulovich F, Nonato L, Minghim R, Levkowitz H (2008) Least square projection: a fast high-precision multidimensional projection technique and its application to document mapping. IEEE Trans Vis Comput Graph 14(3):564–575
Sander J, Qin X, Lu Z, Niu N, Kovarsky A (2003) Automatic extraction of clusters from hierarchical clustering representations. In: Pacific-Asia conference on knowledge discovery and data mining (PAKDD)
Skarmeta AG, Bensaid A, Tazi N (2000) Data mining for text categorization with semi-supervised agglomerative hierarchical clustering. Int J Intell Syst 15(7):633–646
Struyf J, Džeroski S (2007) Clustering trees with instance level constraints. In: European conference on machine learning (ECML), pp 359–370
Stuetzle W (2003) Estimating the cluster tree of a density by analyzing the minimal spanning tree of a sample. J Classif 20:25–47
Stuetzle W, Nugent R (2010) A generalized single linkage method for estimating the cluster tree of a density. J Comput Graph Stat 19(2):397–418
Sun H, Huang J, Han J, Deng H, Zhao P, Feng B (2010) gSkeletonClu: density-based network clustering via structure-connected tree division or agglomeration. In: IEEE international conference on data mining (ICDM)
Tan PN, Steinbach M, Kumar V (2006) Introduction to data mining. Addison-Wesley, Boston
Wagstaff KL (2002) Intelligent clustering with instance-level constraints. Ph.D. thesis, Department of Computer Science, Cornell University
Xiong T, Wang S, Mayers A, Monga E (2011) Semi-supervised parameter-free divisive hierarchical clustering of categorical data. In: Pacific-Asia conference on knowledge discovery and data mining (PAKDD), pp 265–276
Yeung KY, Fraley C, Murua A, Raftery AE, Ruzzo WL (2001) Model-based clustering and data transformations for gene expression data. Bioinformatics 17(10):977–987
Yeung KY, Medvedovic M, Bumgarner R (2003) Clustering gene-expression data with repeated measurements. Genome Biol 4(5):R34
Zhao H, Qi Z (2010) Hierarchical agglomerative clustering with ordering constraints. In: International conference on knowledge discovery and data mining (WKDD)
Zhao Y, Karypis G (2005) Hierarchical clustering algorithms for document datasets. Data Min Knowl Discov 10:141–168
Zheng L, Li T (2011) Semi-supervised hierarchical clustering. In: IEEE international conference on data mining (ICDM)
Acknowledgments
This study was supported by Fapesp / CNPq (Brazil) and NSERC (Canada).
Author information
Authors and Affiliations
Corresponding author
Additional information
Responsible editor: Hendrik Blockeel, Kristian Kersting, Siegfried Nijssen, Filip Zelezny.
Rights and permissions
About this article
Cite this article
Campello, R.J.G.B., Moulavi, D., Zimek, A. et al. A framework for semi-supervised and unsupervised optimal extraction of clusters from hierarchies. Data Min Knowl Disc 27, 344–371 (2013). https://doi.org/10.1007/s10618-013-0311-4
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10618-013-0311-4