A framework for semi-supervised and unsupervised optimal extraction of clusters from hierarchies

Campello, R. J. G. B.; Moulavi, D.; Zimek, A.; Sander, J.

doi:10.1007/s10618-013-0311-4

A framework for semi-supervised and unsupervised optimal extraction of clusters from hierarchies

Published: 04 April 2013

Volume 27, pages 344–371, (2013)
Cite this article

Data Mining and Knowledge Discovery Aims and scope Submit manuscript

R. J. G. B. Campello^2,1,
D. Moulavi¹,
A. Zimek¹ &
…
J. Sander¹

3720 Accesses
43 Citations
3 Altmetric
Explore all metrics

Abstract

We introduce a framework for the optimal extraction of flat clusterings from local cuts through cluster hierarchies. The extraction of a flat clustering from a cluster tree is formulated as an optimization problem and a linear complexity algorithm is presented that provides the globally optimal solution to this problem in semi-supervised as well as in unsupervised scenarios. A collection of experiments is presented involving clustering hierarchies of different natures, a variety of real data sets, and comparisons with specialized methods from the literature.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

A Comprehensive Survey of Clustering Algorithms

Article 01 June 2015

Density-Based Clustering Based on Hierarchical Density Estimates

Data clustering: application and trends

Article 27 November 2022

Notes

These methods should be distinguished from those that use hierarchical clustering and partially labeled data to categorize unlabeled data into predefined categories (semi-supervised categorization; e.g., see Skarmeta et al. 2000; Benkhalifa et al. 2001).
Note that such a reduction may be even more noticeable for higher values of \(m_{ clSize }\).
In the example of Fig. 3b–d, these nodes would be virtual children of \(\mathbf{C}_1\) and \(\mathbf{C}_2\), which have been omitted for the sake of clarity.
An alternative setting, in which both objectives in Eqs. (1) and (3) are combined into a single, mixed one, will be discussed further, in Sect. 3.2.
Notice that \(\varGamma \) has the same properties as \(S\) discussed in Sect. 2.3: it can be computed locally for each node and it is additive w.r.t. the objects in the node, so it is compatible with the aggregation operator (sum) used in the objective function in Eqs. (1) and (5).

References

Ahmed EB, Nabli A, Gargouri F (2012) Shacun: semi-supervised hierarchical active clustering based on ranking constraints. In: Proceedings of the 12th industrial conference on data mining. Springer, Berlin, pp 194–208
Ankerst M, Breunig MM, Kriegel HP, Sander J (1999) Optics: ordering points to identify the clustering structure. SIGMOD Rec 28:49–60
Article Google Scholar
Bade K, Nürnberger A (2006) Personalized hierarchical clustering. In: IEEE/WIC/ACM international conference on web intelligence (WI)
Bade K, Nürnberger A (2008) Creating a cluster hierarchy under constraints of a partially known hierarchy. In: SIAM international conference on data mining (SDM), Atlanta
Bade K, Hermkes M, Nürnberger A (2007) User oriented hierarchical information organization and retrieval. In: European conference on machine Learning (ECML), Corvallis, pp 518–526
Basu S, Davidson I, Wagstaff K (eds) (2008) Constrained clustering: advances in algorithms applications and theory. CRC Press, Boca Raton
Benkhalifa M, Mouradi A, Bouyakhf H (2001) Integrating wordnet knowledge to supplement training data in semi-supervised agglomerative hierarchical clustering for text categorization. Int J Intell Syst 16:929–947
Article MATH Google Scholar
Blockeel H, De Raedt L, Ramon J (1998) Top-down induction of clustering trees. In: International conference on machine learning (ICML), pp 55–63
Böhm C, Plant C (2008) Hissclu: a hierarchical density-based method for semi-supervised clustering. In: International conference on extending database technology (EDBT)
Boudaillier E, Hébrail G (1997) Interactive interpretation of hierarchical clustering. In: Principles of data mining and knowledge discovery, LNCS, vol 1263, Springer, Heidelberg, pp 288–298
Boudaillier E, Hébrail G (1998) Interactive interpretation of hierarchical clustering. Intell Data Anal 2:229–244
Article Google Scholar
Brecheisen S, Kriegel HP, Kröger P, Pfeifle M (2004) Visually mining through cluster hierarchies. In: SIAM international conference on data mining (SDM)
Davidson I, Ravi S (2005) Agglomerative hierarchical clustering with constraints: theoretical and empirical results. In: European conference on principles and practice of knowledge discovery in databases (PKDD)
Davidson I, Ravi S (2009) Using instance-level constraints in agglomerative hierarchical clustering: theoretical and empirical results. Data Min Knowl Disc 18:257–282
Article MathSciNet Google Scholar
Davidson I, Wagstaff KL, Basu S (2006) Measuring constraint-set utility for partitional clustering algorithms. In: European conference on principles and practice of knowledge in databases (PKDD)
Ester M, Kriegel HP, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: International conference on knowledge discovery and data mining (KDD)
Everitt BS, Landau S, Leese M (2001) Cluster analysis, 4th edn. Arnold, London
Ferraretti D, Gamberoni G, Lamma E (2009) Automatic cluster selection using index driven search strategy. In: International conference of the Italian association for artificial intelligence (AI*IA)
Frank A, Asuncion A (2010) UCI machine learning repository. http://archive.ics.uci.edu/ml. Accessed 1 Dec 2011
Geusebroek JM, Burghouts G, Smeulders A (2005) The Amsterdam library of object images. Int J Comput Vis 61:103–112
Google Scholar
Gilpin S, Davidson I (2011) Incorporating SAT solvers into hierarchical clustering algorithms: an efficient and flexible approach. In: ACM SIGKDD international conference on knowledge discovery and data mining (KDD)
Gupta G, Liu A, Ghosh J (2006) Hierarchical density shaving: a clustering and visualization framework for large biological datasets. In: IEEE ICDM workshop on data mining in bioinformatics (DMB)
Gupta G, Liu A, Ghosh J (2010) Automated hierarchical density shaving: a robust automated clustering and visualization framework for large biological data sets. IEEE/ACM Trans Comput Biol Bioinformatics 7(2):223–237
Article Google Scholar
Hamasuna Y, Endo Y, Miyamoto S (2012) On agglomerative hierarchical clustering using clusterwise tolerance based pairwise constraints. J Adv Comput Intell Intell Inform 16(1):174–179
Google Scholar
Hartigan JA (1975) Clustering algorithms. Wiley, New York
MATH Google Scholar
Herbin M, Bonnet N, Vautrot P (2001) Estimation of the number of clusters and influence zones. Pattern Recognit Lett 22(14):1557–1568
Article MATH Google Scholar
Hinneburg A, Keim DA (1998) An efficient approach to clustering in large multimedia databases with noise. In: International conference on knowledge discovery and data mining (KDD)
Horta D, Campello RJGB (2012) Automatic aspect discrimination in data clustering. Pattern Recognit 45:4370–4388
Article MATH Google Scholar
Hubert L, Arabie P (1985) Comparing partitions. J Classif 2(1):193–218
Article Google Scholar
Jain AK, Dubes RC (1988) Algorithms for clustering data. Englewood Cliffs, Prentice Hall
Kestler H, Kraus J, Palm G, Schwenker F (2006) On the effects of constraints in semi-supervised hierarchical clustering. In: IAPR workshop on artificial neural networks in pattern recognition (ANNPR)
Kettenring JR (2006) The practice of cluster analysis. J Classif 23:3–30
Article MathSciNet Google Scholar
Kim HJ, Lee SG (2002) An effective document clustering method using user-adaptable distance metrics. In: ACM symposium on applied computing (SAC)
Klein D, Kamvar SD, Manning CD (2002) From instance-level constraints to space-level constraints: making the most of prior knowledge in data clustering. In: International conference on machine learning (ICML)
Kraus JM, Palm G, Kestler HA (2007) On the robustness of semi-supervised hierarchical graph clustering in functional genomics. In: 5th International workshop on mining and learning with graphs (MLG), pp 1–4
Kriegel HP, Kröger P, Sander J, Zimek A (2011) Density-based clustering. Wiley Interdiscip Rev Data Min Knowl Discov 1(3):231–240
Article Google Scholar
Larsen B, Aone C (1999) Fast and effective text mining using linear-time document clustering. In: International conference on knowledge discovery and data mining (KDD)
Lelis L, Sander J (2009) Semi-supervised density-based clustering. In: International conference on data mining (ICDM)
Milligan GW, Cooper MC (1985) An examination of procedures for determining the number of clusters in a data set. Psychometrika 50(2):159–179
Article Google Scholar
Miyamoto S, Terami A (2010) Semi-supervised agglomerative hierarchical clustering algorithms with pairwise constraints. In: IEEE international conference on fuzzy systems (FUZZ-IEEE), pp 1–6
Naldi M, Campello R, Hruschka E, Carvalho A (2011) Efficiency issues of evolutionary k-means. Appl Soft Comput 11(2):1938–1952
Article Google Scholar
Paulovich F, Nonato L, Minghim R, Levkowitz H (2008) Least square projection: a fast high-precision multidimensional projection technique and its application to document mapping. IEEE Trans Vis Comput Graph 14(3):564–575
Article Google Scholar
Sander J, Qin X, Lu Z, Niu N, Kovarsky A (2003) Automatic extraction of clusters from hierarchical clustering representations. In: Pacific-Asia conference on knowledge discovery and data mining (PAKDD)
Skarmeta AG, Bensaid A, Tazi N (2000) Data mining for text categorization with semi-supervised agglomerative hierarchical clustering. Int J Intell Syst 15(7):633–646
Article MATH Google Scholar
Struyf J, Džeroski S (2007) Clustering trees with instance level constraints. In: European conference on machine learning (ECML), pp 359–370
Stuetzle W (2003) Estimating the cluster tree of a density by analyzing the minimal spanning tree of a sample. J Classif 20:25–47
Article MathSciNet MATH Google Scholar
Stuetzle W, Nugent R (2010) A generalized single linkage method for estimating the cluster tree of a density. J Comput Graph Stat 19(2):397–418
Article MathSciNet Google Scholar
Sun H, Huang J, Han J, Deng H, Zhao P, Feng B (2010) gSkeletonClu: density-based network clustering via structure-connected tree division or agglomeration. In: IEEE international conference on data mining (ICDM)
Tan PN, Steinbach M, Kumar V (2006) Introduction to data mining. Addison-Wesley, Boston
Wagstaff KL (2002) Intelligent clustering with instance-level constraints. Ph.D. thesis, Department of Computer Science, Cornell University
Xiong T, Wang S, Mayers A, Monga E (2011) Semi-supervised parameter-free divisive hierarchical clustering of categorical data. In: Pacific-Asia conference on knowledge discovery and data mining (PAKDD), pp 265–276
Yeung KY, Fraley C, Murua A, Raftery AE, Ruzzo WL (2001) Model-based clustering and data transformations for gene expression data. Bioinformatics 17(10):977–987
Article Google Scholar
Yeung KY, Medvedovic M, Bumgarner R (2003) Clustering gene-expression data with repeated measurements. Genome Biol 4(5):R34
Google Scholar
Zhao H, Qi Z (2010) Hierarchical agglomerative clustering with ordering constraints. In: International conference on knowledge discovery and data mining (WKDD)
Zhao Y, Karypis G (2005) Hierarchical clustering algorithms for document datasets. Data Min Knowl Discov 10:141–168
Article MathSciNet Google Scholar
Zheng L, Li T (2011) Semi-supervised hierarchical clustering. In: IEEE international conference on data mining (ICDM)

Download references

Acknowledgments

This study was supported by Fapesp / CNPq (Brazil) and NSERC (Canada).

Author information

Authors and Affiliations

Department of Computing Sciences, University of Alberta, Edmonton, AB, Canada
R. J. G. B. Campello, D. Moulavi, A. Zimek & J. Sander
Department of Computer Sciences, University of São Paulo, São Carlos, Brazil
R. J. G. B. Campello

Authors

R. J. G. B. Campello
View author publications
You can also search for this author in PubMed Google Scholar
D. Moulavi
View author publications
You can also search for this author in PubMed Google Scholar
A. Zimek
View author publications
You can also search for this author in PubMed Google Scholar
J. Sander
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to R. J. G. B. Campello.

Additional information

Responsible editor: Hendrik Blockeel, Kristian Kersting, Siegfried Nijssen, Filip Zelezny.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Campello, R.J.G.B., Moulavi, D., Zimek, A. et al. A framework for semi-supervised and unsupervised optimal extraction of clusters from hierarchies. Data Min Knowl Disc 27, 344–371 (2013). https://doi.org/10.1007/s10618-013-0311-4

Download citation

Received: 03 October 2012
Accepted: 21 March 2013
Published: 04 April 2013
Issue Date: November 2013
DOI: https://doi.org/10.1007/s10618-013-0311-4

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

A framework for semi-supervised and unsupervised optimal extraction of clusters from hierarchies

Abstract

Access this article

Similar content being viewed by others

A Comprehensive Survey of Clustering Algorithms

Density-Based Clustering Based on Hierarchical Density Estimates

Data clustering: application and trends

Notes

References

Acknowledgments

Author information

Authors and Affiliations

Corresponding author

Additional information

Rights and permissions

About this article

Cite this article

Keywords

Navigation

A framework for semi-supervised and unsupervised optimal extraction of clusters from hierarchies

Abstract

Access this article

Similar content being viewed by others

A Comprehensive Survey of Clustering Algorithms

Density-Based Clustering Based on Hierarchical Density Estimates

Data clustering: application and trends

Notes

References

Acknowledgments

Author information

Authors and Affiliations

Corresponding author

Additional information

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation