On Semi-Supervised Clustering

Partitional Clustering Algorithms

Abstract

Because it can exploit training datasets containing both labeled and unlabeled patterns, semi-supervised learning (SSL) has received sustained attention from the community over the last decade. Several SSL approaches to data clustering have also been proposed and investigated. Unlike typical SSL setups, in semi-supervised clustering (SSC) the partial supervision is generally not available as class labels associated with a subset of the training sample. Instead, general SSC algorithms rely on additional constraints that bring some kind of a priori, weak side-knowledge to the clustering process. Significant instances include COP-COBWEB, COP k-means, HMRF k-means, seeded k-means, constrained k-means, and active fuzzy constrained clustering. This chapter surveys the major SSC philosophies, setups, and techniques. It gives the reader insight into these notions by categorizing and reviewing the major state-of-the-art approaches to SSC.
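To make the idea of constraint-based side-knowledge concrete, the following is a minimal sketch of one assignment pass in the spirit of COP k-means. It assumes the constraints take the common pairwise form of must-link and cannot-link index pairs, which the abstract does not spell out. The function names `violates` and `constrained_assign` are hypothetical and introduced only for illustration; this is not the authors' implementation.

```python
import numpy as np

def violates(i, cluster, labels, must_link, cannot_link):
    """Return True if putting point i into `cluster` breaks a pairwise constraint.

    Assumption: constraints are (a, b) index pairs; label -1 means "not yet assigned".
    """
    for a, b in must_link:
        if i in (a, b):
            other = b if i == a else a
            # A must-link partner already sits in a different cluster.
            if labels[other] != -1 and labels[other] != cluster:
                return True
    for a, b in cannot_link:
        if i in (a, b):
            other = b if i == a else a
            # A cannot-link partner already sits in this cluster.
            if labels[other] == cluster:
                return True
    return False

def constrained_assign(X, centers, must_link, cannot_link):
    """Assign each point to the nearest center that violates no constraint."""
    labels = np.full(len(X), -1)
    for i, x in enumerate(X):
        # Try centers from nearest to farthest.
        order = np.argsort(np.linalg.norm(centers - x, axis=1))
        for c in order:
            if not violates(i, int(c), labels, must_link, cannot_link):
                labels[i] = int(c)
                break
        else:
            raise ValueError(f"no feasible cluster for point {i}")
    return labels

# Toy data: two tight groups, with a cannot-link inside the first group.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
centers = np.array([[0.0, 0.0], [5.0, 5.0]])
print(constrained_assign(X, centers, must_link=[(2, 3)], cannot_link=[(0, 1)]))
```

In this toy run the cannot-link pair forces the second point away from its nearest center, which is the kind of weak supervision the abstract refers to. A full algorithm would alternate this pass with center updates until convergence.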

Notes

  1. According to the statistical notion of sufficient statistics.

  2. The authors introduce their algorithm as a partitional method, since it yields a flat partition of the data, corresponding to the top level of the resulting COBWEB hierarchy. Nevertheless, it is our conviction that COP-COBWEB is actually an HSSC approach, because the process is carried out hierarchically. The eventual selection of a partition from the dendrogram does not change this; indeed, it is quite common in the hierarchical framework.

Author information

Corresponding author

Correspondence to Marco Bongini.

Copyright information

© 2015 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Bongini, M., Schwenker, F., Trentin, E. (2015). On Semi-Supervised Clustering. In: Celebi, M. (eds) Partitional Clustering Algorithms. Springer, Cham. https://doi.org/10.1007/978-3-319-09259-1_9

  • DOI: https://doi.org/10.1007/978-3-319-09259-1_9

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-09258-4

  • Online ISBN: 978-3-319-09259-1

  • eBook Packages: Engineering, Engineering (R0)