Abstract
A measure of distance between two clusterings has important applications, including clustering validation and ensemble clustering. More generally, such a distance measure provides a way to navigate the space of possible clusterings. A normalized clustering distance, a.k.a. an agreement measure, is most often used in cluster validation, where a given clustering result is compared against the ground-truth clustering. The two most widely used clustering agreement measures are the Adjusted Rand Index (ARI) and Normalized Mutual Information (NMI). In this paper, we present a generalized clustering distance from which both of these measures can be derived. We then use this generalization to construct new measures specific to comparing the (dis)agreement of clusterings in networks, a.k.a. communities. Further, we discuss the difficulty of extending the current, contingency-based, formulations to overlapping cases, and present an alternative algebraic formulation for these (dis)agreement measures. Unlike the original measures, the new co-membership-based formulation extends easily to different settings, including overlapping clusters and clusters of inter-related data. These two extensions are particularly important in the context of finding communities in complex networks.
Notes
Refer to Aggarwal and Reddy (2014), Chapter 23 on clustering validation measures (in particular the section on external clustering validation measures); and Chapter 22 on cluster ensembles (in particular the section on measuring similarity between clustering solutions).
a.k.a. community mining; refer to Fortunato (2010) for a survey.
In other words, it should have a constant baseline, i.e. a constant expected value for the agreement between two random clusterings of the same dataset. If the baseline is not constant, an agreement value of, say, 0.7 can indicate either a strong agreement (when the baseline is 0.2) or a weak one (when the baseline is 0.6).
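The baseline issue can be seen numerically. The following sketch (illustrative code, not the paper's implementation) compares random clusterings: the mean Rand Index drifts with the number of clusters (about 0.5 for two random labels, about 0.82 for ten), while the mean Adjusted Rand Index stays near zero for both.

```python
import random
from collections import Counter
from itertools import combinations

def rand_index(a, b):
    # Fraction of data-point pairs on which the two clusterings agree
    # (the pair is either together in both, or apart in both).
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)

def adjusted_rand_index(a, b):
    # Standard pair-counting ARI computed from the contingency table.
    n = len(a)
    c2 = lambda x: x * (x - 1) / 2
    sum_ij = sum(c2(v) for v in Counter(zip(a, b)).values())
    sum_i = sum(c2(v) for v in Counter(a).values())
    sum_j = sum(c2(v) for v in Counter(b).values())
    expected = sum_i * sum_j / c2(n)
    max_index = (sum_i + sum_j) / 2
    return (sum_ij - expected) / (max_index - expected)

random.seed(0)
n, trials = 200, 50
results = {}
for k in (2, 10):
    ri = ari = 0.0
    for _ in range(trials):
        a = [random.randrange(k) for _ in range(n)]
        b = [random.randrange(k) for _ in range(n)]
        ri += rand_index(a, b) / trials
        ari += adjusted_rand_index(a, b) / trials
    results[k] = (ri, ari)
    print(f"k={k}: mean RI={ri:.2f}, mean ARI={ari:.2f}")
```

The drift in RI follows directly from the agreement probability of independent labelings, \((1/k)^2 + (1-1/k)^2\), which is why the adjustment for chance in ARI is needed.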
The well-studied inter-rater agreement indices in statistics are defined to measure the agreement between different coders, or judges, categorizing the same data. Examples are the goodness-of-fit chi-square test, the likelihood-ratio chi-square, the kappa measure of agreement, Fisher's exact test, Krippendorff's alpha, etc. (see test 16 in Cortina-Borja 2012). These statistical tests are also defined based on the contingency table, which displays the multivariate frequency distribution of the (categorical) variables.
This happens when the clusterings are identical and only the order of the clusters is permuted, i.e. the distribution of overlaps in each row/column of the contingency table has a single spike on the matched cluster and is zero elsewhere.
For example, we introduce an extension of this generalization for clusterings of nodes in graphs, a.k.a. communities, in the following section.
Which are measured based on the properties and attributes of data-points.
Refer to the supplementary materials for more information, available at: https://github.com/rabbanyk/CommunityEvaluation.
For crisp clusters (a.k.a. strict membership), \(u_{ik}\) is restricted to \(\{0, 1\}\) (1 if node i belongs to cluster k and 0 otherwise); whereas for probabilistic clusters (or soft membership), \(u_{ik}\) can be any real number in [0, 1]. Fuzzy clusters usually assume the additional constraint that the total membership of a data-point equals one, i.e. \(u_{i.} = \sum _k u_{ik} = 1\). This constraint also holds for disjoint clusters, since each data-point can belong to only one cluster.
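A small sketch (illustrative, not from the paper) of this membership-matrix notation, where rows index data-points, columns index clusters, and \(u_{ik}\) is the membership of point i in cluster k:

```python
import numpy as np

labels = [0, 0, 1, 2, 1]              # a crisp, disjoint clustering of 5 points
k = max(labels) + 1
U = np.zeros((len(labels), k))
U[np.arange(len(labels)), labels] = 1  # u_ik restricted to {0, 1}

# For disjoint crisp clusters, each row sums to one ...
assert np.all(U.sum(axis=1) == 1)

# ... which is the same constraint fuzzy partitions impose on soft memberships:
U_fuzzy = np.array([[0.7, 0.2, 0.1],
                    [0.5, 0.5, 0.0]])
assert np.allclose(U_fuzzy.sum(axis=1), 1)

# Overlapping crisp clusters drop the row-sum constraint,
# e.g. here point 0 belongs to two clusters at once:
U_overlap = np.array([[1, 1, 0],
                      [0, 0, 1]])
print(U_overlap.sum(axis=1))  # total memberships can exceed one
```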
It is also worth pointing out that in some applications, such as ensemble or multi-view clustering, we may not need the normalization and a measure of distance may suffice.
Parameters are chosen similar to the experiments by Lancichinetti and Fortunato (2009), i.e. networks with 1000 nodes, an average degree of 20, a max degree of 50, and a power-law degree exponent of \(-2\); the sizes of the communities follow a power-law distribution with exponent \(-1\) and range between 20 and 100 nodes. Results for other parameter settings, including smaller communities of 10 to 50 nodes, can be found in the supplementary materials.
Similar trends are observed for other variations of the agreement measures which can be found in the supplementary materials.
The \(\delta \) subscript indicates that the ARI is computed based on our \(\delta \)-based formulation, which is equivalent to the original ARI in this experiment, since communities are covering all nodes and non-overlapping (Identity 6).
Available at: https://github.com/rabbanyk/CommunityEvaluation.
This equality is also useful in the implementation to improve the scalability.
References
Aggarwal CC, Reddy CK (2014) Data clustering: algorithms and applications. CRC Press, Boca Raton
Albatineh AN, Niewiadomska-Bugaj M, Mihalko D (2006) On similarity indices and correction for chance agreement. J Classif 23:301–313
Anderson DT, Bezdek JC, Popescu M, Keller JM (2010) Comparing fuzzy, probabilistic, and possibilistic partitions. IEEE Trans Fuzzy Syst 18(5):906–918
Banerjee A, Merugu S, Dhillon IS, Ghosh J (2005) Clustering with Bregman divergences. J Mach Learn Res 6:1705–1749
Blondel VD, Guillaume JL, Lambiotte R, Lefebvre E (2008) Fast unfolding of communities in large networks. J Stat Mech Theory Exp 2008:P10008
Brouwer RK (2008) Extending the rand, adjusted rand and Jaccard indices to fuzzy partitions. J Intell Inf Syst 32(3):213–235
Campello RR (2010) Generalized external indexes for comparing data partitions with overlapping categories. Pattern Recogn Lett 31(9):966–975
Collins LM, Dent CW (1988) Omega: a general formulation of the rand index of cluster recovery suitable for non-disjoint solutions. Multivar Behav Res 23(2):231–242
Cortina-Borja M (2012) Handbook of parametric and nonparametric statistical procedures, 5th edn. J R Stat Soc Ser A 175(3):829–829
Cui Y, Fern X, Dy J (2007) Non-redundant multi-view clustering via orthogonalization. In: Seventh IEEE International Conference on Data Mining, 2007. ICDM 2007. pp 133–142
Dhillon IS, Tropp JA (2007) Matrix nearness problems with bregman divergences. SIAM J Matrix Anal Appl 29(4):1120–1146
Fortunato S (2010) Community detection in graphs. Phys Rep 486(35):75–174
Gregory S (2010) Finding overlapping communities in networks by label propagation. New J Phys 12(10):103018
Hubert L, Arabie P (1985) Comparing partitions. J Classif 2:193–218
Hullermeier E, Rifqi M, Henzgen S, Senge R (2012) Comparing fuzzy partitions: a generalization of the rand index and related measures. IEEE Trans Fuzzy Syst 20(3):546–556
Kulis B, Sustik MA, Dhillon IS (2009) Low-rank kernel learning with bregman matrix divergences. J Mach Learn Res 10:341–376
Lancichinetti A, Fortunato S (2009) Community detection algorithms: a comparative analysis. Phys Rev E 80(5):056117
Lancichinetti A, Fortunato S, Kertesz J (2008a) Detecting the overlapping and hierarchical community structure of complex networks. New J Phys 11(3):20
Lancichinetti A, Fortunato S, Radicchi F (2008b) Benchmark graphs for testing community detection algorithms. Phys Rev E 78(4):046110
Lancichinetti A, Radicchi F, Ramasco JJ, Fortunato S (2011) Finding statistically significant communities in networks. PLoS One 6(4):e18961
Light RJ, Margolin BH (1971) An analysis of variance for categorical data. J Am Stat Assoc 66(335):534–544
Manning CD, Raghavan P, Schütze H (2008) Introduction to information retrieval. Cambridge University Press, New York
Mcauley J, Leskovec J (2014) Discovering social circles in ego networks. ACM Trans Knowl Discov Data 8(1):4:1–4:28
McDaid A, Hurley N (2010) Detecting highly overlapping communities with model-based overlapping seed expansion. In: 2010 International Conference on Advances in Social Networks Analysis and Mining (ASONAM), IEEE, pp 112–119
McDaid AF, Greene D, Hurley N (2011) Normalized mutual information to evaluate overlapping community finding algorithms. arXiv:1110.2515
Meilă M (2007) Comparing clusterings: an information based distance. J Multivar Anal 98(5):873–895
Newman MEJ (2004) Fast algorithm for detecting community structure in networks. Phys Rev E 69(6):066133
Nielsen F, Nock R (2014) On the chi square and higher-order chi distances for approximating f-divergences. IEEE Signal Process Lett 21(1):10–13
Pons P, Latapy M (2005) Computing communities in large networks using random walks. In: Computer and Information Sciences-ISCIS 2005, Springer, pp 284–293
Quere R, Le Capitaine H, Fraisseix N, Frelicot C (2010) On normalizing fuzzy coincidence matrices to compare fuzzy and/or possibilistic partitions with the rand index. In: 2010 IEEE International Conference on Data Mining, IEEE, pp 977–982
Ronhovde P, Nussinov Z (2009) Multiresolution community detection for megascale networks by information-based replica correlations. Phys Rev E 80(1):016109
Rosvall M, Bergstrom CT (2008) Maps of random walks on complex networks reveal community structure. Proc Natl Acad Sci 105(4):1118–1123
Vinh NX, Epps J, Bailey J (2009) Information theoretic measures for clusterings comparison: is a correction for chance necessary? In: Proceedings of the 26th Annual International Conference on Machine Learning, ACM, New York, ICML ’09, pp 1073–1080
Vinh NX, Epps J, Bailey J (2010) Information theoretic measures for clusterings comparison: variants, properties, normalization and correction for chance. J Mach Learn Res 11:2837–2854
Warrens MJ (2008a) On similarity coefficients for \(2\times 2\) tables and correction for chance. Psychometrika 73:487–502
Warrens MJ (2008b) On the equivalence of Cohen’s Kappa and the Hubert-Arabie adjusted rand index. J Classif 25:177–183
Wu J, Xiong H, Chen J (2009) Adapting the right measures for k-means clustering. In: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, New York, KDD ’09, pp 877–886
Yang J, Leskovec J (2013) Overlapping community detection at scale: a nonnegative matrix factorization approach. In: Proceedings of the sixth ACM international conference on Web search and data mining, ACM, New York, pp 587–596
Zhou D, Li J, Zha H (2005) A new mallows distance based metric for comparing clusterings. In: Proceedings of the 22nd international conference on Machine learning, ACM, New York, pp 1028–1035
Additional information
Responsible editors: Joao Gama, Indre Zliobaite, Alipio Jorge, Concha Bielza.
Appendix: Proofs
1.1 Proof of Proposition 1
From the definition of Variation of information we have:
On the other hand, we have:
\(\square \)
\((*)\) \(E_j\)/\(Var_j\) denotes the average/variance of the values in the \(j^{th}\) column of the contingency table.
\((**)\) The RI is in fact proportional to the average variance of the row/column values in the contingency table, which we refer to as the conditional variance. For other forms of conditional variance for categorical data, see Light and Margolin (1971).
1.2 Proof of Corollary 1
We first show that \(0 \le \mathcal {D}_{\varphi }^\eta (U||V)\), which also yields the lower bound 0 for \(\mathcal {D}_{\varphi }^\eta (U,V)\), since \( \mathcal {D}_{\varphi }^\eta (U,V) = \mathcal {D}_{\varphi }^\eta (U||V) + \mathcal {D}_{\varphi }^\eta (V||U) \). From the superadditivity of \(\varphi \) we have:
Similarly, for the upper bound, from positivity and superadditivity we get, respectively:
1.3 Proof of Identity 1
The proof is elementary: writing out the definition for \(\varphi ={x}\log x\), we get:
\(\square \)
\((*)\) a slight change of notation, i.e. from \(\sum _{u\in U}\) to \(\sum _{i}^k\), from \(\sum _{v\in V}\) to \(\sum _{j}^r\), and from \(|u\cap v|\) to \(n_{ij}\).
\((**)\) assuming disjoint covering partitionings: \(\sum _{i}^k\sum _{j}^r n_{ij} = n\), \(\sum _{i}^k n_{ij} = n_{.j}\), and \(\sum _{j}^r n_{ij} = n_{i.}\).
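As a numeric companion to the \(\varphi = x\log x\) case, the following sketch (not the paper's code) computes mutual information directly from the contingency table \(n_{ij}\) and normalizes it by the geometric mean of the two entropies; this is one common NMI normalization, and the paper discusses several variants.

```python
import numpy as np

def nmi(N):
    # N: contingency table, N[i][j] = |u_i intersect v_j|.
    N = np.asarray(N, dtype=float)
    p_ij = N / N.sum()
    p_i = p_ij.sum(axis=1, keepdims=True)  # marginals n_i. / n
    p_j = p_ij.sum(axis=0, keepdims=True)  # marginals n_.j / n
    nz = p_ij > 0                          # 0 log 0 is taken as 0
    mi = np.sum(p_ij[nz] * np.log((p_ij / (p_i * p_j))[nz]))
    h_u = -np.sum(p_i * np.log(p_i))
    h_v = -np.sum(p_j * np.log(p_j))
    return mi / np.sqrt(h_u * h_v)

# Identical clusterings: the contingency table is diagonal up to a permutation.
print(nmi([[3, 0], [0, 2]]))  # close to 1.0
# Overlaps proportional to the marginals give zero agreement:
print(nmi([[2, 2], [2, 2]]))  # 0.0
```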
1.4 Proof of Identity 2
Similar to the previous proof from the definition we derive:
\((*), (**)\) same as previous proof. \(\square \)
1.5 Proof of Identity 3 and 4
This formula resembles the adjustment for chance in Eq. 4, where the measure being adjusted is \(\sum _{v\in V}\sum _{u\in U} \varphi (\eta _{uv})\), the upper bound used for it is \(\frac{1}{2}[\sum _{v\in V} \varphi (\eta _{.v}) + \sum _{u\in U} \varphi (\eta _{u.})]\), and the expectation is defined as:
Now if we have \(\varphi (xy) = \varphi (x)\varphi (y)\), which is true for \(\varphi (x)=x^2\), we get:
Using this expectation, if we substitute \(\varphi =x^2\) we get the \(ARI'\) of Eq. 6, and using \(\varphi =\left( {\begin{array}{c}x\\ 2\end{array}}\right) \) and the latter reformulation of E, we get the original ARI of Eq. 5, as:
\(\square \)
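The multiplicativity assumption used above can be checked numerically. The quick sketch below (illustrative only) confirms that \(\varphi (xy) = \varphi (x)\varphi (y)\) holds for \(\varphi (x)=x^2\) but fails for \(\varphi (x)=\left( {\begin{array}{c}x\\ 2\end{array}}\right) \), which is why the expectation has to be reformulated before the original ARI is recovered.

```python
from math import comb

sq = lambda x: x * x
for x, y in [(2, 3), (4, 5), (7, 2)]:
    # phi(x) = x^2 is multiplicative:
    assert sq(x * y) == sq(x) * sq(y)
    # phi(x) = binom(x, 2) is not, e.g. binom(6,2)=15 but binom(2,2)*binom(3,2)=3:
    print(x, y, comb(x * y, 2), comb(x, 2) * comb(y, 2))
```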
On the other hand for the NMI, we have:
\(\square \)
1.6 Proof of Identity 5 and 6
First we prove that in general cases we have:
where \(\Vert .\Vert ^2_F\) is the squared Frobenius norm. This holds since we have:
where \(\circ \) is the element-wise matrix product, a.k.a. the Hadamard product, and |.| is the sum of all elements of the matrix. The proof is completed by showing:
Now we can prove the identities for the case of disjoint hard clusters. Using the notation \(n_{ij} = (U^TV)_{ij}\), we have \(\Vert U^TV\Vert ^2_F = \sum _{ij} n^2_{ij}\) and:
\((*)\) with the assumption that clusters are disjoint, \( u_{ki}u_{kj}\) is non-zero only if \(i=j\)
\((**)\) with the assumption that memberships are hard, \(u_{ki}\) is either 0 or 1, and therefore \(u_{ki}= u^2_{ki}\)
\((***)\) the marginals of N give the cluster sizes in U and V, e.g. \(n_{i.} = \sum _{j} n_{ij} = \sum _{k} u_{ki}=|U_i| \)
Therefore for disjoint hard clusters we get:
The RI normalization assumes that all pairs are in disagreement, i.e. \(|{\mathbf {1}}_{n\times n}| = n^2 \), since \(\max (UU^T)=1\) and \(\max (VV^T)=1\). The ARI normalization compares \(\varDelta \) to the difference obtained when the two random variables \((UU^T)_{ij}\) and \((VV^T)_{ij}\) are independent, in which case we would have:
which is calculated by:
Since \(\varDelta = \Vert UU^T - VV^T\Vert ^2_F = \Vert UU^T\Vert ^2_F + \Vert VV^T\Vert ^2_F - 2\,Sum (UU^T \circ VV^T )\), we have \(ARI = 0\), or a normalized distance of 1, i.e. agreement no better than chance, exactly when this independence condition holds, i.e.:
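The identity derived above for disjoint hard clusters can be verified numerically. The following sanity check (a sketch, not the paper's implementation) confirms that \(\varDelta = \Vert UU^T - VV^T\Vert ^2_F = \sum _i n^2_{i.} + \sum _j n^2_{.j} - 2\sum _{ij} n^2_{ij}\), where \(N = U^TV\) is the contingency table:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, r = 30, 4, 5
# Random disjoint hard clusterings as one-hot membership matrices
# (rows index data-points, columns index clusters).
U = np.eye(k)[rng.integers(0, k, n)]
V = np.eye(r)[rng.integers(0, r, n)]

N = U.T @ V  # contingency table n_ij
# Co-membership formulation of the disagreement distance:
delta_comembership = np.linalg.norm(U @ U.T - V @ V.T, "fro") ** 2
# Contingency-table formulation via the marginals n_i. and n_.j:
delta_contingency = ((N.sum(axis=1) ** 2).sum()
                     + (N.sum(axis=0) ** 2).sum()
                     - 2 * (N ** 2).sum())
assert np.isclose(delta_comembership, delta_contingency)
print(delta_comembership)
```

Unlike the contingency form, the co-membership form on the left needs no disjointness assumption, which is what makes the extension to overlapping clusters straightforward.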
Cite this article
Rabbany, R., Zaïane, O.R. Generalization of clustering agreements and distances for overlapping clusters and network communities. Data Min Knowl Disc 29, 1458–1485 (2015). https://doi.org/10.1007/s10618-015-0426-x