On Semi-Supervised Clustering

Bongini, Marco; Schwenker, Friedhelm; Trentin, Edmondo

doi:10.1007/978-3-319-09259-1_9

Abstract

Because it can exploit training datasets that contain both labeled and unlabeled patterns, semi-supervised learning (SSL) has received considerable attention from the community over the last decade. Several SSL approaches to data clustering have also been proposed and investigated. Unlike typical SSL setups, in semi-supervised clustering (SSC) the partial supervision is generally not available as class labels associated with a subset of the training sample. Instead, general SSC algorithms rely on additional constraints that bring some kind of a priori, weak side-knowledge into the clustering process. Significant instances are COP-COBWEB and COP k-means, HMRF k-means, seeded k-means, constrained k-means, and active fuzzy constrained clustering. This chapter surveys the major SSC philosophies, setups, and techniques. It gives the reader insight into these notions by categorizing and reviewing the major state-of-the-art approaches to SSC.
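To make the idea of constraint-based side-knowledge concrete, the following is a minimal sketch in the spirit of COP k-means, not the chapter's own formulation. It assumes the side-knowledge takes the form of pairwise must-link and cannot-link constraints between point indices; the names `violates`, `constrained_kmeans`, `must_link`, and `cannot_link` are hypothetical and chosen only for illustration.

```python
import math
import random


def violates(i, cluster, labels, must_link, cannot_link):
    """Return True if assigning point i to `cluster` breaks a pairwise constraint."""
    for a, b in must_link:
        other = b if a == i else a if b == i else None
        # A must-link partner that is already assigned has to share the cluster.
        if other is not None and labels[other] is not None and labels[other] != cluster:
            return True
    for a, b in cannot_link:
        other = b if a == i else a if b == i else None
        # A cannot-link partner must not already sit in this cluster.
        if other is not None and labels[other] == cluster:
            return True
    return False


def constrained_kmeans(points, k, must_link=(), cannot_link=(), iters=20, seed=0):
    """Greedy k-means variant that respects pairwise constraints (illustrative only)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [None] * len(points)
    for _ in range(iters):
        new_labels = [None] * len(points)
        for i, p in enumerate(points):
            # Try centers from nearest to farthest and keep the first feasible one.
            order = sorted(range(k), key=lambda c: math.dist(p, centers[c]))
            for c in order:
                if not violates(i, c, new_labels, must_link, cannot_link):
                    new_labels[i] = c
                    break
            else:
                raise ValueError(f"no feasible cluster for point {i}")
        # Recompute each center as the mean of its current members.
        for c in range(k):
            members = [points[i] for i, lab in enumerate(new_labels) if lab == c]
            if members:
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centers


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
labels, centers = constrained_kmeans(
    points, k=2, must_link=[(0, 1)], cannot_link=[(1, 2)]
)
```

No class labels are supplied here: the only supervision is the statement that points 0 and 1 belong together and that points 1 and 2 do not. The greedy assignment can fail on constraint sets that are feasible in principle, which is why the sketch raises an error rather than silently dropping a constraint.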

Notes

1. According to the statistical notion of sufficient statistics.
2. The authors introduce their algorithm as a partitional method, since it yields a flat partition of the data, corresponding to the top level of the resulting COBWEB hierarchy. Nevertheless, it is our conviction that COP-COBWEB is actually an HSSC approach, because the process is carried out hierarchically. The eventual selection of a partition from the dendrogram does not affect this; in fact, it is quite common in the hierarchical framework.
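The point of note 2, that a flat partition can be read off a hierarchy without the method ceasing to be hierarchical, can be illustrated with a small sketch. It uses plain single-linkage agglomerative clustering as a stand-in and is not COBWEB or COP-COBWEB; the function names `agglomerate` and `cut` are hypothetical.

```python
import math


def agglomerate(points):
    """Single-linkage agglomerative clustering; returns the merge history."""
    clusters = {i: [i] for i in range(len(points))}
    merges = []
    while len(clusters) > 1:
        ids = sorted(clusters)
        best = None
        for x in range(len(ids)):
            for y in range(x + 1, len(ids)):
                a, b = ids[x], ids[y]
                # Single linkage: distance between the closest pair of members.
                d = min(
                    math.dist(points[i], points[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters.pop(b)
        merges.append((a, b))
    return merges


def cut(merges, n, k):
    """Replay the first n - k merges to obtain a flat partition with k clusters."""
    clusters = {i: {i} for i in range(n)}
    for a, b in merges[: n - k]:
        clusters[a] |= clusters.pop(b)
    return sorted(sorted(c) for c in clusters.values())


points = [(0.0,), (1.0,), (10.0,), (11.0,), (30.0,)]
merges = agglomerate(points)
flat3 = cut(merges, len(points), 3)
flat2 = cut(merges, len(points), 2)
```

The hierarchy is built once by `agglomerate`; `cut` merely selects one level of it, so different values of `k` give different flat partitions of the same tree.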

Author information

Authors and Affiliations

DIISM, University of Siena, Siena, Italy
Marco Bongini & Edmondo Trentin
Institute of Neural Information Processing, University of Ulm, Ulm, Germany
Friedhelm Schwenker

Corresponding author

Correspondence to Marco Bongini.

Editor information

Editors and Affiliations

Computer Science dept., Louisiana State University Shreveport, Shreveport, Louisiana, USA
M. Emre Celebi

About this chapter

Cite this chapter

Bongini, M., Schwenker, F., Trentin, E. (2015). On Semi-Supervised Clustering. In: Celebi, M. (ed.) Partitional Clustering Algorithms. Springer, Cham. https://doi.org/10.1007/978-3-319-09259-1_9

DOI: https://doi.org/10.1007/978-3-319-09259-1_9
Published: 17 October 2014
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-09258-4
Online ISBN: 978-3-319-09259-1
eBook Packages: Engineering, Engineering (R0)
