Consensus of Clusterings Based on High-Order Dissimilarities

Abstract

Over the years, many clustering algorithms have been developed, handling different issues such as cluster shape, density, or noise. Most clustering algorithms require, implicitly or explicitly, a similarity measure between patterns. Although most of them use pairwise distances between patterns, e.g., the Euclidean distance, better results can be achieved with other measures. The dissimilarity increment is a recent high-order dissimilarity measure that uses information from triplets of nearest-neighbor patterns. The distribution of this measure (DID) was recently derived under the hypothesis of local Gaussian generative models, leading to new clustering algorithms. DID-based algorithms build upon an initial data partition, and different initializations produce different data partitions. To overcome this issue, we present a unifying approach based on a strategy that combines the partitions obtained from these different initializations. Although this yields a robust combined representation of the data, a clustering algorithm must still be selected to extract the final partition. We therefore also present a validation criterion based on the DID to select the best final partition, which consists in estimating graph probabilities for each cluster based on the DID.
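
The combination step described above can be illustrated with a small, hedged sketch in the spirit of the evidence accumulation paradigm [17]: co-associations are accumulated over the partitions produced by different initializations, and a consensus partition is extracted from the resulting matrix. The function names, the use of average linkage, and the representation of partitions as integer label arrays are illustrative assumptions, not the chapter's exact algorithm.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def coassociation_matrix(partitions):
    """Fraction of partitions in which each pair of patterns shares a cluster.

    partitions: list of 1-D integer label arrays, all of length N.
    """
    N = len(partitions[0])
    C = np.zeros((N, N))
    for labels in partitions:
        labels = np.asarray(labels)
        C += (labels[:, None] == labels[None, :]).astype(float)
    return C / len(partitions)

def consensus_partition(partitions, n_clusters):
    """Extract a consensus partition from the co-association matrix
    (average linkage is an illustrative choice)."""
    C = coassociation_matrix(partitions)
    D = 1.0 - C                          # co-association -> dissimilarity
    np.fill_diagonal(D, 0.0)
    Z = linkage(squareform(D, checks=False), method="average")
    return fcluster(Z, t=n_clusters, criterion="maxclust")

# Toy usage: three partitions of six patterns from different initializations.
parts = [np.array([0, 0, 0, 1, 1, 1]),
         np.array([0, 0, 1, 1, 2, 2]),
         np.array([1, 1, 1, 0, 0, 0])]
print(consensus_partition(parts, n_clusters=2))
```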

Notes

  1. http://archive.ics.uci.edu/ml.

  2. http://www.ai.mit.edu/people/jrennie/20Newsgroups/.

References

1. Agrawal R, Gehrke J, Gunopulos D, Raghavan P (1998) Automatic subspace clustering of high dimensional data for data mining applications. In: Haas LM, Tiwary A (eds) Proceedings of the ACM SIGMOD international conference on management of data (SIGMOD 1998). ACM Press, Seattle, WA, USA, pp 94–105

2. Aidos H, Fred A (2012) Statistical modeling of dissimilarity increments for d-dimensional data: application in partitional clustering. Pattern Recogn 45(9):3061–3071

3. Anderson TW (1962) On the distribution of the two-sample Cramér-von Mises criterion. Ann Math Stat 33(3):1148–1159

4. Ayad HG, Kamel MS (2008) Cumulative voting consensus method for partitions with variable number of clusters. IEEE Trans Pattern Anal Mach Intell 30(1):160–173

5. Ball GH, Hall DJ (1965) ISODATA, a novel method of data analysis and pattern classification. Tech. rep., Stanford Research Institute

6. Benavent AP, Ruiz FE, Martínez JMS (2006) EBEM: an entropy-based EM algorithm for Gaussian mixture models. In: 18th international conference on pattern recognition (ICPR 2006). IEEE Computer Society, Hong Kong, vol 2, pp 451–455

7. Castro RM, Coates MJ, Nowak RD (2004) Likelihood based hierarchical clustering. IEEE Trans Signal Process 52(8):2308–2321

8. Celebi ME, Kingravi HA (2012) Deterministic initialization of the k-means algorithm using hierarchical clustering. Int J Pattern Recogn Artif Intell 26(7). DOI: 10.1142/S0218001412500188

9. Celebi ME, Kingravi HA, Vela PA (2013) A comparative study of efficient initialization methods for the k-means clustering algorithm. Expert Syst Appl 40(1):200–210

10. Dimitriadou E, Weingessel A, Hornik K (2002) A combination scheme for fuzzy clustering. In: Pal NR, Sugeno M (eds) Advances in soft computing - proceedings international conference on fuzzy systems (AFSS 2002). Lecture notes in computer science, vol 2275. Springer, Calcutta, pp 332–338

11. Duflou H, Maenhaut W (1990) Application of principal component and cluster analysis to the study of the distribution of minor and trace elements in normal human brain. Chemometr Intell Lab Syst 9:273–286

12. Ester M, Kriegel HP, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: Simoudis E, Han J, Fayyad UM (eds) Proceedings of the 2nd international conference on knowledge discovery and data mining (KDD 1996). AAAI Press, Portland, Oregon, USA, pp 226–231

13. Everitt BS, Landau S, Leese M, Stahl D (2011) Cluster analysis, 5th edn. Wiley

14. Fern XZ, Brodley CE (2004) Solving cluster ensemble problems by bipartite graph partitioning. In: Brodley CE (ed) Machine learning - proceedings of the 21st international conference (ICML 2004). ACM international conference proceeding series, vol 69. ACM, Banff, Alberta, Canada

15. Figueiredo MAT, Jain AK (2002) Unsupervised learning of finite mixture models. IEEE Trans Pattern Anal Mach Intell 24(3):381–396

16. Fred A (2001) Finding consistent clusters in data partitions. In: Kittler J, Roli F (eds) Multiple classifier systems - proceedings 2nd international workshop (MCS 2001). Lecture notes in computer science, vol 2096. Springer, Cambridge, pp 309–318

17. Fred A, Jain A (2005) Combining multiple clusterings using evidence accumulation. IEEE Trans Pattern Anal Mach Intell 27(6):835–850

18. Fred A, Jain A (2008) Cluster validation using a probabilistic attributed graph. In: 19th international conference on pattern recognition (ICPR 2008). IEEE, Tampa, Florida, USA, pp 1–4

19. Fred A, Leitão J (2003) A new cluster isolation criterion based on dissimilarity increments. IEEE Trans Pattern Anal Mach Intell 25(8):944–958

20. Fred A, Lourenço A, Aidos H, Bulò SR, Rebagliati N, Figueiredo M, Pelillo M (2013) Learning similarities from examples under the evidence accumulation clustering paradigm. In: Similarity-based pattern analysis and recognition. Springer, New York, pp 85–117

21. Gowda KC, Ravi TV (1995) Divisive clustering of symbolic objects using the concepts of both similarity and dissimilarity. Pattern Recogn 28(8):1277–1282

22. Guha S, Rastogi R, Shim K (1998) CURE: an efficient clustering algorithm for large datasets. In: Haas LM, Tiwary A (eds) Proceedings of the ACM SIGMOD international conference on management of data (SIGMOD 1998). ACM Press, Seattle, Washington, USA, pp 73–84

23. Guha S, Rastogi R, Shim K (1999) ROCK: a robust clustering algorithm for categorical attributes. In: Kitsuregawa M, Papazoglou MP, Pu C (eds) Proceedings of the 15th international conference on data engineering (ICDE 1999). IEEE Computer Society, Sydney, Australia, pp 512–521

24. Hartigan JA, Wong MA (1979) Algorithm AS 136: a k-means clustering algorithm. J Roy Stat Soc C (Appl Stat) 28(1):100–108

25. Jain AK (2010) Data clustering: 50 years beyond k-means. Pattern Recogn Lett 31:651–666

26. Jain AK, Duin RPW, Mao J (2000) Statistical pattern recognition: a review. IEEE Trans Pattern Anal Mach Intell 22(1):4–37

27. Jain AK, Murty MN, Flynn PJ (1999) Data clustering: a review. ACM Comput Surv 31(3):264–323

28. Johnson NL, Kotz S, Balakrishnan N (1994) Continuous univariate distributions. Applied probability and statistics, vol 1, 2nd edn. Wiley, New York

29. Johnson SC (1967) Hierarchical clustering schemes. Psychometrika 32(3):241–254

30. Kamvar S, Klein D, Manning C (2002) Interpreting and extending classical agglomerative clustering algorithms using a model-based approach. In: Sammut C, Hoffmann AG (eds) Machine learning - proceedings of the 19th international conference (ICML 2002). Morgan Kaufmann, Sydney, Australia, pp 283–290

31. Kannan R, Vempala S, Vetta A (2000) On clusterings - good, bad and spectral. In: 41st annual symposium on foundations of computer science (FOCS 2000). IEEE Computer Society, Redondo Beach, CA, USA, pp 367–377

32. Karypis G, Han EH, Kumar V (1999) Chameleon: hierarchical clustering using dynamic modeling. Computer 32(8):68–75

33. Kaufman L, Rousseeuw PJ (2005) Finding groups in data: an introduction to cluster analysis. Wiley

34. Kuncheva LI, Hadjitodorov ST (2004) Using diversity in cluster ensembles. In: Proceedings of the IEEE international conference on systems, man & cybernetics, vol 2. IEEE, The Hague, Netherlands, pp 1214–1219

35. Lance GN, Williams WT (1968) Note on a new information-statistic classificatory program. Comput J 11:195

36. Lee JA, Verleysen M (2007) Nonlinear dimensionality reduction. Information science and statistics. Springer, New York

37. Lehmann EL, Romano JP (2005) Testing statistical hypotheses, 3rd edn. Springer texts in statistics. Springer, New York

38. Lin J (1991) Divergence measures based on the Shannon entropy. IEEE Trans Inform Theory 37(1):145–151

39. von Luxburg U (2007) A tutorial on spectral clustering. Stat Comput 17(4):395–416

40. MacKay DJ (2006) Information theory, inference, and learning algorithms, 5th edn. Cambridge University Press, Cambridge

41. MacNaughton-Smith P, Williams WT, Dale MB, Mockett LG (1964) Dissimilarity analysis: a new technique of hierarchical sub-division. Nature 202:1034–1035

42. Meila M (2003) Comparing clusterings by the variation of information. In: Schölkopf B, Warmuth MK (eds) Computational learning theory and kernel machines - proceedings 16th annual conference on computational learning theory and 7th kernel workshop (COLT 2003). Lecture notes in computer science, vol 2777. Springer, Washington, USA, pp 173–187

43. Milligan GW, Soon SC, Sokol LM (1983) The effect of cluster size, dimensionality, and the number of clusters on recovery of true cluster structure. IEEE Trans Pattern Anal Mach Intell PAMI-5(1):40–47

44. Olver FWJ, Lozier DW, Boisvert RF, Clark CW (eds) (2010) NIST handbook of mathematical functions. Cambridge University Press, Cambridge

45. Sander J, Ester M, Kriegel HP, Xu X (1998) Density-based clustering in spatial databases: the algorithm GDBSCAN and its applications. Data Min Knowl Discov 2:169–194

46. Strehl A, Ghosh J (2002) Cluster ensembles - a knowledge reuse framework for combining multiple partitions. J Mach Learn Res 3:583–617

47. Su T, Dy JG (2007) In search of deterministic methods for initializing k-means and Gaussian mixture clustering. Intell Data Anal 11(4):319–338

48. Theodoridis S, Koutroumbas K (2009) Pattern recognition, 4th edn. Elsevier Academic, Amsterdam

49. Topchy A, Jain A, Punch W (2003) Combining multiple weak clusterings. In: Proceedings of the 3rd IEEE international conference on data mining (ICDM 2003). IEEE Computer Society, Melbourne, Florida, USA, pp 331–338

50. Topchy A, Jain A, Punch W (2005) Clustering ensembles: models of consensus and weak partitions. IEEE Trans Pattern Anal Mach Intell 27(12):1866–1881

51. Ueda N, Nakano R, Ghahramani Z, Hinton GE (2000) SMEM algorithm for mixture models. Neural Comput 12(9):2109–2128

52. Vaithyanathan S, Dom B (2000) Model-based hierarchical clustering. In: Boutilier C, Goldszmidt M (eds) Proceedings of the 16th conference in uncertainty in artificial intelligence (UAI 2000). Morgan Kaufmann, Stanford, California, USA, pp 599–608

53. Wang H, Shan H, Banerjee A (2009) Bayesian cluster ensembles. In: Proceedings of the SIAM international conference on data mining (SDM 2009). SIAM, Sparks, Nevada, USA, pp 211–222

54. Wang P, Domeniconi C, Laskey KB (2010) Nonparametric Bayesian clustering ensembles. In: Balcázar JL, Bonchi F, Gionis A, Sebag M (eds) Machine learning and knowledge discovery in databases - proceedings European conference: part III (ECML PKDD 2010). Lecture notes in computer science, vol 6323. Springer, Barcelona, Spain, pp 435–450

55. Williams WT, Lambert JM (1959) Multivariate methods in plant ecology: 1. Association-analysis in plant communities. J Ecol 47(1):83–101

56. Xu R, Wunsch II D (2005) Survey of clustering algorithms. IEEE Trans Neural Network 16(3):645–678

Acknowledgements

This work was supported by the Portuguese Foundation for Science and Technology grant PTDC/EEI-SII/2312/2012.

Author information

Correspondence to Helena Aidos.

Appendix

Assume that \(X =\{ \mathbf{x}_{1},\ldots,\mathbf{x}_{N}\}\) is an l-dimensional set of patterns, with \(\mathbf{x}_{i} \sim \mathcal{N}(\boldsymbol{\mu },\varSigma )\). Also, with no loss of generality, assume that \(\boldsymbol{\mu }= 0\) and Σ is diagonal; this only involves a translation and a rotation of the data, which do not affect Euclidean distances. If \(\mathbf{x}\) denotes a sample from this Gaussian, we transform \(\mathbf{x}\) into \(\mathbf{x}^{{\ast}}\) through a process known as “whitening” or “sphering,” such that its i-th entry is given by \(x_{i}^{{\ast}}\equiv x_{i}/\sqrt{\varSigma _{ii}}\); \(x_{i}^{{\ast}}\) thus follows the standard normal distribution, \(\mathcal{N}(0,1)\). Now, it is known that the difference of samples from two univariate standard normal distributions, such as \(x_{i}^{{\ast}}- y_{i}^{{\ast}}\), follows a normal distribution with variance 2. Therefore,

$$\displaystyle{ \frac{x_{i}^{{\ast}}- y_{i}^{{\ast}}} {\sqrt{2}} \sim \mathcal{N}(0,1). }$$
(10.30)

It can be shown that the squared Euclidean distance,

$$\displaystyle{ (d^{{\ast}}(\mathbf{x}^{{\ast}}/\sqrt{2},\mathbf{y}^{{\ast}}/\sqrt{2}))^{2} =\sum _{ i=1}^{l}\left (\frac{x_{i}^{{\ast}}- y_{ i}^{{\ast}}} {\sqrt{2}} \right )^{2} \sim \chi ^{2}(l), }$$
(10.31)

i.e., follows a chi-square distribution with l degrees of freedom [28]. Thus, the probability density function for \((d^{{\ast}})^{2} \equiv (d^{{\ast}}(\cdot,\cdot ))^{2}\) is given by:

$$\displaystyle{ p(x) = \frac{2^{-l/2}} {\varGamma (l/2)} \;x^{l/2-1}\exp \left \{-\frac{x} {2}\right \},\;x \in [0,+\infty [, }$$
(10.32)

where Γ(⋅) denotes the gamma function.
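
As a quick numerical illustration of Eqs. (10.31)–(10.32) (not part of the chapter), the following sketch samples pairs from a zero-mean Gaussian with diagonal covariance, spheres them, and compares the halved squared distances against a chi-square distribution with l degrees of freedom. The dimensionality, sample size, and covariance values are arbitrary choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
l = 4                                       # dimensionality (illustrative)
Sigma = np.diag(rng.uniform(0.5, 3.0, l))   # diagonal covariance, zero mean

# Pairs of independent samples x, y ~ N(0, Sigma).
x = rng.multivariate_normal(np.zeros(l), Sigma, size=20000)
y = rng.multivariate_normal(np.zeros(l), Sigma, size=20000)

# Sphering: divide each coordinate by its standard deviation.
std = np.sqrt(np.diag(Sigma))
d2_star = np.sum(((x - y) / std) ** 2, axis=1) / 2.0   # Eq. (10.31)

# Compare with chi-square(l); the KS statistic should be small.
print(stats.kstest(d2_star, "chi2", args=(l,)))
```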

After the sphering, the transformed data has spherical symmetry in \(\mathbb{R}^{l}\). We define angular coordinates on an (l − 1)-sphere, with \(\theta _{i} \in [0,\pi [\), i = 1, …, l − 2, and \(\theta _{l-1} \in [0,2\pi [\). Define \(\mathbf{x} -\mathbf{y} \equiv \left (b_{1},b_{2},\ldots,b_{l}\right )\), where \(b_{i}\) can be expressed in terms of polar coordinates as

$$\displaystyle\begin{array}{rcl} b_{1}& =& \sqrt{2\varSigma _{11}}\,d^{{\ast}}\cos \theta _{ 1} {}\\ b_{i}& =& \sqrt{2\varSigma _{ii}}\,d^{{\ast}}\left [\prod _{ k=1}^{i-1}\sin \theta _{ k}\right ]\cos \theta _{i},i = 2,\ldots,l - 1 {}\\ b_{l}& =& \sqrt{2\varSigma _{ll}}\,d^{{\ast}}\left [\prod _{ k=1}^{l-1}\sin \theta _{ k}\right ]. {}\\ \end{array}$$

The squared Euclidean distance in the original space, \(d^{2}\), is

$$\displaystyle\begin{array}{rcl} d^{2}& =& \|\mathbf{x} -\mathbf{y}\|^{2} \\ & =& 2\left [\varSigma _{11}\cos ^{2}\theta _{ 1} +\sum _{ i=2}^{l-1}\varSigma _{ ii}\left (\prod _{k=1}^{i-1}\sin ^{2}\theta _{ k}\right )\cos ^{2}\theta _{ i} +\varSigma _{ll}\left (\prod _{k=1}^{l-1}\sin ^{2}\theta _{ k}\right )\right ]\left (d^{{\ast}}\right )^{2} \\ & \equiv & 2A(\varTheta )\left (d^{{\ast}}\right )^{2}, {}\end{array}$$
(10.33)

where \(A(\varTheta )\), with \(\varTheta = (\theta _{1},\theta _{2},\ldots,\theta _{l-1})\), is called the expansion factor. Naturally, the expansion factor depends on the angle vector Θ, and this dependence is hard to handle exactly. We therefore approximate it by a constant, namely the expected value of the true expansion factor over all angles Θ. This expected value is given by

$$\displaystyle{ \mathbb{E}[A(\varTheta )] =\int _{S^{l-1}}\left [\prod _{i=1}^{l-2}p_{\theta _{ i}}(\theta _{i})\right ]p_{\theta _{l-1}}(\theta _{l-1})A(\varTheta )\;\mathrm{d}_{S^{l-1}}V }$$
(10.34)

where the volume element is \(\mathrm{d}_{S^{l-1}}V = \left (\prod _{i=1}^{l-2}\sin ^{l-(i+1)}\theta _{i}\right )\,\mathrm{d}\theta _{1}\ldots \,\mathrm{d}\theta _{l-1}\). Since we sphered the data, we can assume for simplicity that \(\theta _{i} \sim \mathrm{Unif}([0,\pi [)\) for i = 1, …, l − 2 and that \(\theta _{l-1} \sim \mathrm{Unif}([0,2\pi [)\); then \(p_{\theta _{i}}(\theta _{i}) = 1/\pi\) and \(p_{\theta _{l-1}}(\theta _{l-1}) = 1/2\pi\). Thus, after some computations (see [2] for details), the expected value of the true expansion factor over all angles Θ is given by

$$\displaystyle{ \mathbb{E}[A(\varTheta )] = \frac{\pi ^{-l/2+1}} {2\varGamma (1 + l/2)}\mathrm{tr}(\varSigma ). }$$
(10.35)

Substituting Eq. (10.35) into Eq. (10.33), the transformation from the normalized space back to the original space is given by

$$\displaystyle{ \left (d\right )^{2} = \frac{\pi ^{-l/2+1}} {\varGamma (1 + l/2)}\mathrm{tr}(\varSigma )\,\left (d^{{\ast}}\right )^{2}. }$$
(10.36)
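
The closed form in Eq. (10.35), and hence the transformation in Eq. (10.36), can be checked numerically. The sketch below (an illustration under assumed values of l and Σ, not part of the chapter) draws the angles uniformly, applies the volume-element weight from Eq. (10.34) as an importance weight, and compares the Monte Carlo estimate of the expected expansion factor with the closed form.

```python
import numpy as np
from math import gamma, pi

rng = np.random.default_rng(0)
l = 4                                      # dimensionality (illustrative, l >= 3)
Sigma = np.diag(rng.uniform(0.5, 3.0, l))
M = 200000                                 # Monte Carlo samples

# Draw angles uniformly: theta_1..theta_{l-2} in [0, pi), theta_{l-1} in [0, 2*pi).
theta = np.concatenate([rng.uniform(0, pi, (M, l - 2)),
                        rng.uniform(0, 2 * pi, (M, 1))], axis=1)

sin2, cos2 = np.sin(theta) ** 2, np.cos(theta) ** 2
prod_sin2 = np.cumprod(sin2, axis=1)       # prod_{k<=i} sin^2(theta_k)

# Expansion factor A(Theta) from Eq. (10.33).
A = Sigma[0, 0] * cos2[:, 0]
for i in range(1, l - 1):                  # chapter's i = 2, ..., l-1 (1-based)
    A += Sigma[i, i] * prod_sin2[:, i - 1] * cos2[:, i]
A += Sigma[l - 1, l - 1] * prod_sin2[:, l - 2]

# Volume-element weight prod_i sin^{l-(i+1)}(theta_i), i = 1..l-2 (Eq. (10.34)).
powers = np.arange(l - 2, 0, -1)
weight = np.prod(np.sin(theta[:, :l - 2]) ** powers, axis=1)

mc = np.mean(A * weight)
closed_form = pi ** (-l / 2 + 1) / (2 * gamma(1 + l / 2)) * np.trace(Sigma)
print(mc, closed_form)                     # should agree up to Monte Carlo error
```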

Assume that Y = aX, with a a constant and \(p_{X}(x)\) the probability density function of X; then \(p_{Y }(y) = p_{X}(y/a)\,\mathrm{d}x/\mathrm{d}y = p_{X}(y/a) \cdot 1/a\). Applying this to Eq. (10.32), we obtain the probability density function of the squared Euclidean distance in the original space, \(\left (d\right )^{2}\). Similarly, assuming that \(Y ^{2} = X\), with \(p_{X}(x)\) the probability density function of X, we have \(p_{Y }(y) = p_{X}(y^{2})\,\mathrm{d}x/\mathrm{d}y = p_{X}(y^{2}) \cdot 2y\). Therefore, we obtain the probability density function of the Euclidean distance, \(d \equiv d(\mathbf{x},\mathbf{y})\), as

$$\displaystyle{ p(y) = 2G_{l}(\mathrm{tr}(\varSigma ))y^{l-1}\exp \left \{-C_{ l}(\mathrm{tr}(\varSigma ))y^{2}\right \},\,y \in [0,+\infty [, }$$
(10.37)

where we define

$$\displaystyle{ G_{l}(\mathrm{tr}(\varSigma )) \equiv l^{l/2}\varGamma (l/2)^{l/2-1}2^{-l}\mathrm{tr}(\varSigma )^{-l/2}\pi ^{l/2(l/2-1)} }$$
(10.38)

and

$$\displaystyle{ C_{l}(\mathrm{tr}(\varSigma )) \equiv l\varGamma (l/2)(4\mathrm{tr}(\varSigma ))^{-1}\pi ^{l/2-1}. }$$
(10.39)
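
For reference, Eqs. (10.37)–(10.39) translate directly into code. The sketch below (with illustrative values of l and tr(Σ)) evaluates the density of the Euclidean distance and checks numerically that it integrates to one over [0, +∞).

```python
import numpy as np
from scipy.special import gamma
from scipy.integrate import quad

def G_l(l, trS):
    # Eq. (10.38)
    return (l ** (l / 2) * gamma(l / 2) ** (l / 2 - 1) * 2 ** (-l)
            * trS ** (-l / 2) * np.pi ** (l / 2 * (l / 2 - 1)))

def C_l(l, trS):
    # Eq. (10.39)
    return l * gamma(l / 2) * np.pi ** (l / 2 - 1) / (4 * trS)

def pdf_distance(y, l, trS):
    # Eq. (10.37): density of the Euclidean distance in the original space.
    return 2 * G_l(l, trS) * y ** (l - 1) * np.exp(-C_l(l, trS) * y ** 2)

# Normalization check with illustrative parameter values.
l, trS = 4, 3.7
print(quad(lambda y: pdf_distance(y, l, trS), 0, np.inf)[0])   # expect ~1.0
```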

Now, the dissimilarity increment is defined as the absolute value of the difference of two Euclidean distances. Define \(d \equiv d(\mathbf{x},\mathbf{y})\) and \(d^{\prime} \equiv d(\mathbf{y},\mathbf{z})\), both following the distribution in Eq. (10.37). The probability density function of \(d_{\mbox{ inc}} = \vert d - d^{\prime}\vert\) is given by the convolution

$$\displaystyle\begin{array}{rcl} p_{d_{ \mbox{ inc}}}(w;\mathrm{tr}(\varSigma ))& =& \int _{-\infty }^{\infty }(2G_{ l}(\mathrm{tr}(\varSigma )))^{2}\,(t(t + w))^{l-1}\exp \left \{-C_{ l}(\mathrm{tr}(\varSigma ))\left (t^{2} + (t + w)^{2}\right )\right \} \\ & & \mathbf{1}_{\{t\geq 0\}}\mathbf{1}_{\{t+w\geq 0\}}\mathrm{d}t. {}\end{array}$$
(10.40)

Therefore, after some calculations (see [2] for details), the probability density function for the dissimilarity increments is given by

$$\displaystyle\begin{array}{rcl} p_{d_{ \mbox{ inc}}}(w;\mathrm{tr}(\varSigma ))& =& \frac{G_{l}(\mathrm{tr}(\varSigma ))^{2}} {2^{l-5/2}C_{l}(\mathrm{tr}(\varSigma ))^{l-1/2}}\exp \left \{-\frac{C_{l}(\mathrm{tr}(\varSigma ))} {2} \;w^{2}\right \} \\ & & \Bigg[\sum _{k=0}^{l-1}\sum _{ i=0}^{2l-2-k}(-1)^{i}w^{k+i}\binom{l - 1}{k}\binom{2l - 2 - k}{i}2^{k/2-i/2}C_{ l}(\mathrm{tr}(\varSigma ))^{k/2+i/2}\Bigg. \\ & & \left.\varGamma \left (\frac{2l - 1 - k - i} {2}, \frac{C_{l}(\mathrm{tr}(\varSigma ))} {2} \;w^{2}\right )\right ], {}\end{array}$$
(10.41)

where Γ(a, x) is the incomplete gamma function [44].
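Eq. (10.41) can be evaluated with standard numerical libraries. The sketch below (illustrative, with arbitrary l and tr(Σ)) implements it using the upper incomplete gamma function Γ(a, x), repeats the definitions of \(G_l\) and \(C_l\) from Eqs. (10.38)–(10.39) so the snippet is self-contained, and checks numerically that the density integrates to approximately one.

```python
import numpy as np
from math import comb
from scipy.special import gamma, gammaincc
from scipy.integrate import quad

def G_l(l, trS):   # Eq. (10.38)
    return (l ** (l / 2) * gamma(l / 2) ** (l / 2 - 1) * 2 ** (-l)
            * trS ** (-l / 2) * np.pi ** (l / 2 * (l / 2 - 1)))

def C_l(l, trS):   # Eq. (10.39)
    return l * gamma(l / 2) * np.pi ** (l / 2 - 1) / (4 * trS)

def upper_gamma(a, x):
    # Unregularized upper incomplete gamma function Gamma(a, x).
    return gamma(a) * gammaincc(a, x)

def pdf_dinc(w, l, trS):
    """Density of the dissimilarity increments, Eq. (10.41)."""
    G, C = G_l(l, trS), C_l(l, trS)
    total = 0.0
    for k in range(l):                     # k = 0, ..., l-1
        for i in range(2 * l - 1 - k):     # i = 0, ..., 2l-2-k
            a = (2 * l - 1 - k - i) / 2
            total += ((-1) ** i * w ** (k + i) * comb(l - 1, k)
                      * comb(2 * l - 2 - k, i) * 2 ** ((k - i) / 2)
                      * C ** ((k + i) / 2) * upper_gamma(a, C * w ** 2 / 2))
    lead = G ** 2 / (2 ** (l - 2.5) * C ** (l - 0.5)) * np.exp(-C * w ** 2 / 2)
    return lead * total

# Normalization check with illustrative parameter values.
l, trS = 3, 2.0
print(quad(lambda w: pdf_dinc(w, l, trS), 0, np.inf)[0])   # expect ~1.0
```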

Copyright information

© 2015 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Aidos, H., Fred, A. (2015). Consensus of Clusterings Based on High-Order Dissimilarities. In: Celebi, M. (eds) Partitional Clustering Algorithms. Springer, Cham. https://doi.org/10.1007/978-3-319-09259-1_10

  • DOI: https://doi.org/10.1007/978-3-319-09259-1_10

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-09258-4

  • Online ISBN: 978-3-319-09259-1
