Abstract
A measure of distance between two clusterings has important applications, including clustering validation and ensemble clustering. More generally, such a distance measure provides a way to navigate the space of possible clusterings. A normalized clustering distance, a.k.a. an agreement measure, is most often used in cluster validation, where a given clustering result is compared against the ground-truth clustering. The two most widely used clustering agreement measures are the Adjusted Rand Index (ARI) and Normalized Mutual Information (NMI). In this paper, we present a generalized clustering distance from which both of these measures can be derived. We then use this generalization to construct new measures specific to comparing the (dis)agreement of clusterings in networks, a.k.a. communities. Further, we discuss the difficulty of extending the current, contingency-based, formulations to overlapping cases, and present an alternative algebraic formulation for these (dis)agreement measures. Unlike the original measures, the new co-membership-based formulation extends easily to different settings, including overlapping clusters and clusters of inter-related data. These two extensions are particularly important in the context of finding communities in complex networks.
Notes
Refer to Aggarwal and Reddy (2014), Chapter 23 on clustering validation measures (in particular the section on external clustering validation measures); and Chapter 22 on cluster ensembles (in particular the section on measuring similarity between clustering solutions).
a.k.a. community mining; refer to Fortunato (2010) for a survey.
In other words, it should have a constant baseline, i.e. a constant expected value for the agreement between two random clusterings of the same dataset. If the baseline is not constant, an agreement value of, say, 0.7 can indicate either a strong agreement (when the baseline is 0.2) or a weak one (when the baseline is 0.6).
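The baseline issue can be seen numerically. The following sketch (illustrative code, not the paper's implementation) compares random clusterings: the mean Rand Index drifts with the number of clusters (about 0.5 for two random labels, about 0.82 for ten), while the mean Adjusted Rand Index stays near zero for both.

```python
import random
from collections import Counter
from itertools import combinations

def rand_index(a, b):
    # Fraction of data-point pairs on which the two clusterings agree
    # (the pair is either together in both, or apart in both).
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)

def adjusted_rand_index(a, b):
    # Standard pair-counting ARI computed from the contingency table.
    n = len(a)
    c2 = lambda x: x * (x - 1) / 2
    sum_ij = sum(c2(v) for v in Counter(zip(a, b)).values())
    sum_i = sum(c2(v) for v in Counter(a).values())
    sum_j = sum(c2(v) for v in Counter(b).values())
    expected = sum_i * sum_j / c2(n)
    max_index = (sum_i + sum_j) / 2
    return (sum_ij - expected) / (max_index - expected)

random.seed(0)
n, trials = 200, 50
results = {}
for k in (2, 10):
    ri = ari = 0.0
    for _ in range(trials):
        a = [random.randrange(k) for _ in range(n)]
        b = [random.randrange(k) for _ in range(n)]
        ri += rand_index(a, b) / trials
        ari += adjusted_rand_index(a, b) / trials
    results[k] = (ri, ari)
    print(f"k={k}: mean RI={ri:.2f}, mean ARI={ari:.2f}")
```

The drift in RI follows directly from the agreement probability of independent labelings, \((1/k)^2 + (1-1/k)^2\), which is why the adjustment for chance in ARI is needed.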
The well-studied inter-rater agreement indices in statistics are defined to measure the agreement between different coders, or judges, categorizing the same data. Examples are the goodness-of-fit chi-square test, the likelihood-ratio chi-square, the kappa measure of agreement, Fisher's exact test, Krippendorff's alpha, etc. (see test 16 in Cortina-Borja 2012). These statistical tests are also defined based on the contingency table, which displays the multivariate frequency distribution of the (categorical) variables.
This happens when the clusterings are identical and only the order of the clusters is permuted, i.e. the distribution of overlaps in each row/column of the contingency table has a single spike on the matched cluster and is zero elsewhere.
For example, we introduce an extension of this generalization for clusterings of nodes in graphs, a.k.a. communities, in the following section.
Which are measured based on the properties and attributes of data-points.
Refer to the supplementary materials for more information, available at: https://github.com/rabbanyk/CommunityEvaluation.
For crisp clusters (a.k.a. strict membership), \(u_{ik}\) is restricted to \(\{0, 1\}\) (1 if node i belongs to cluster k and 0 otherwise); whereas for probabilistic clusters (or soft membership), \(u_{ik}\) can be any real number in [0, 1]. Fuzzy clusters usually assume the additional constraint that the total membership of a data-point equals one, i.e. \(u_{i.} = \sum _k u_{ik} = 1\). This constraint also holds for disjoint clusters, since each data-point can belong to only one cluster.
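A small sketch (illustrative, not from the paper) of this membership-matrix notation, where rows index data-points, columns index clusters, and \(u_{ik}\) is the membership of point i in cluster k:

```python
import numpy as np

labels = [0, 0, 1, 2, 1]              # a crisp, disjoint clustering of 5 points
k = max(labels) + 1
U = np.zeros((len(labels), k))
U[np.arange(len(labels)), labels] = 1  # u_ik restricted to {0, 1}

# For disjoint crisp clusters, each row sums to one ...
assert np.all(U.sum(axis=1) == 1)

# ... which is the same constraint fuzzy partitions impose on soft memberships:
U_fuzzy = np.array([[0.7, 0.2, 0.1],
                    [0.5, 0.5, 0.0]])
assert np.allclose(U_fuzzy.sum(axis=1), 1)

# Overlapping crisp clusters drop the row-sum constraint,
# e.g. here point 0 belongs to two clusters at once:
U_overlap = np.array([[1, 1, 0],
                      [0, 0, 1]])
print(U_overlap.sum(axis=1))  # total memberships can exceed one
```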
It is also worth pointing out that in some applications, such as ensemble or multi-view clustering, we may not need the normalization and a measure of distance may suffice.
Parameters are chosen similar to the experiments by Lancichinetti and Fortunato (2009), i.e. networks with 1000 nodes, an average degree of 20, a max degree of 50, and a power-law degree exponent of \(-2\); the sizes of the communities follow a power-law distribution with exponent \(-1\) and range between 20 and 100 nodes. Results for other parameter settings, including smaller communities of 10 to 50 nodes, can be found in the supplementary materials.
Similar trends are observed for other variations of the agreement measures which can be found in the supplementary materials.
The \(\delta \) subscript indicates that the ARI is computed based on our \(\delta \)-based formulation, which is equivalent to the original ARI in this experiment, since communities are covering all nodes and non-overlapping (Identity 6).
Available at: https://github.com/rabbanyk/CommunityEvaluation.
This equality is also useful in the implementation to improve the scalability.
References
Aggarwal CC, Reddy CK (2014) Data clustering: algorithms and applications. CRC Press, Boca Raton
Albatineh AN, Niewiadomska-Bugaj M, Mihalko D (2006) On similarity indices and correction for chance agreement. J Classif 23:301–313
Anderson DT, Bezdek JC, Popescu M, Keller JM (2010) Comparing fuzzy, probabilistic, and possibilistic partitions. IEEE Trans Fuzzy Syst 18(5):906–918
Banerjee A, Merugu S, Dhillon IS, Ghosh J (2005) Clustering with Bregman divergences. J Mach Learn Res 6:1705–1749
Blondel VD, Guillaume JL, Lambiotte R, Lefebvre E (2008) Fast unfolding of communities in large networks. J Stat Mech Theory Exp 2008:P10008
Brouwer RK (2008) Extending the rand, adjusted rand and Jaccard indices to fuzzy partitions. J Intell Inf Syst 32(3):213–235
Campello RR (2010) Generalized external indexes for comparing data partitions with overlapping categories. Pattern Recogn Lett 31(9):966–975
Collins LM, Dent CW (1988) Omega: a general formulation of the rand index of cluster recovery suitable for non-disjoint solutions. Multivar Behav Res 23(2):231–242
Cortina-Borja M (2012) Handbook of parametric and nonparametric statistical procedures, 5th edn. J R Stat Soc Ser A 175(3):829–829
Cui Y, Fern X, Dy J (2007) Non-redundant multi-view clustering via orthogonalization. In: Seventh IEEE International Conference on Data Mining, 2007. ICDM 2007. pp 133–142
Dhillon IS, Tropp JA (2007) Matrix nearness problems with bregman divergences. SIAM J Matrix Anal Appl 29(4):1120–1146
Fortunato S (2010) Community detection in graphs. Phys Rep 486(35):75–174
Gregory S (2010) Finding overlapping communities in networks by label propagation. New J Phys 12(10):103018
Hubert L, Arabie P (1985) Comparing partitions. J Classif 2:193–218
Hullermeier E, Rifqi M, Henzgen S, Senge R (2012) Comparing fuzzy partitions: a generalization of the rand index and related measures. IEEE Trans Fuzzy Syst 20(3):546–556
Kulis B, Sustik MA, Dhillon IS (2009) Low-rank kernel learning with bregman matrix divergences. J Mach Learn Res 10:341–376
Lancichinetti A, Fortunato S (2009) Community detection algorithms: a comparative analysis. Phys Rev E 80(5):056117
Lancichinetti A, Fortunato S, Kertesz J (2008a) Detecting the overlapping and hierarchical community structure of complex networks. New J Phys 11(3):20
Lancichinetti A, Fortunato S, Radicchi F (2008b) Benchmark graphs for testing community detection algorithms. Phys Rev E 78(4):046110
Lancichinetti A, Radicchi F, Ramasco JJ, Fortunato S (2011) Finding statistically significant communities in networks. PLoS One 6(4):e18961
Light RJ, Margolin BH (1971) An analysis of variance for categorical data. J Am Stat Assoc 66(335):534–544
Manning CD, Raghavan P, Schütze H (2008) Introduction to information retrieval. Cambridge University Press, New York
Mcauley J, Leskovec J (2014) Discovering social circles in ego networks. ACM Trans Knowl Discov Data 8(1):4:1–4:28
McDaid A, Hurley N (2010) Detecting highly overlapping communities with model-based overlapping seed expansion. In: 2010 International Conference on Advances in Social Networks Analysis and Mining (ASONAM), IEEE, pp 112–119
McDaid AF, Greene D, Hurley N (2011) Normalized mutual information to evaluate overlapping community finding algorithms. arXiv:1110.2515
Meilă M (2007) Comparing clusterings: an information based distance. J Multivar Anal 98(5):873–895
Newman MEJ (2004) Fast algorithm for detecting community structure in networks. Phys Rev E 69(6):066133
Nielsen F, Nock R (2014) On the chi square and higher-order chi distances for approximating f-divergences. IEEE Signal Process Lett 21(1):10–13
Pons P, Latapy M (2005) Computing communities in large networks using random walks. In: Computer and Information Sciences-ISCIS 2005, Springer, pp 284–293
Quere R, Le Capitaine H, Fraisseix N, Frelicot C (2010) On normalizing fuzzy coincidence matrices to compare fuzzy and/or possibilistic partitions with the rand index. In: 2010 IEEE International Conference on Data Mining, IEEE, pp 977–982
Ronhovde P, Nussinov Z (2009) Multiresolution community detection for megascale networks by information-based replica correlations. Phys Rev E 80(1):016109
Rosvall M, Bergstrom CT (2008) Maps of random walks on complex networks reveal community structure. Proc Natl Acad Sci 105(4):1118–1123
Vinh NX, Epps J, Bailey J (2009) Information theoretic measures for clusterings comparison: is a correction for chance necessary? In: Proceedings of the 26th Annual International Conference on Machine Learning, ACM, New York, ICML ’09, pp 1073–1080
Vinh NX, Epps J, Bailey J (2010) Information theoretic measures for clusterings comparison: variants, properties, normalization and correction for chance. J Mach Learn Res 11:2837–2854
Warrens MJ (2008a) On similarity coefficients for \(2\times 2\) tables and correction for chance. Psychometrika 73:487–502
Warrens MJ (2008b) On the equivalence of Cohen’s Kappa and the Hubert-Arabie adjusted rand index. J Classif 25:177–183
Wu J, Xiong H, Chen J (2009) Adapting the right measures for k-means clustering. In: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, New York, KDD ’09, pp 877–886
Yang J, Leskovec J (2013) Overlapping community detection at scale: a nonnegative matrix factorization approach. In: Proceedings of the sixth ACM international conference on Web search and data mining, ACM, New York, pp 587–596
Zhou D, Li J, Zha H (2005) A new mallows distance based metric for comparing clusterings. In: Proceedings of the 22nd international conference on Machine learning, ACM, New York, pp 1028–1035
Additional information
Responsible editors: Joao Gama, Indre Zliobaite, Alipio Jorge, Concha Bielza.
Appendix: Proofs
1.1 Proof of Proposition 1
From the definition of Variation of information we have:
On the other hand, we have:
\(\square \)
\((*)\) \(E_j\)/\(Var_j\) denotes the average/variance of the values in the \(j^{th}\) column of the contingency table.
\((**)\) The RI is in fact proportional to the average variance of the row/column values in the contingency table, which we refer to as the conditional variance. For other forms of conditional variance for categorical data, see Light and Margolin (1971).
1.2 Proof of Corollary 1
We first show that \(0 \le \mathcal {D}_{\varphi }^\eta (U||V)\), which also yields the lower bound 0 for \(\mathcal {D}_{\varphi }^\eta (U,V)\), since \( \mathcal {D}_{\varphi }^\eta (U,V) = \mathcal {D}_{\varphi }^\eta (U||V) + \mathcal {D}_{\varphi }^\eta (V||U) \). From the superadditivity of \(\varphi \) we have:
Similarly, for the upper bound, from positivity and superadditivity we get, respectively:
1.3 Proof of Identity 1
The proof is elementary: writing out the definition for \(\varphi ={x}\log x\), we get:
\(\square \)
\((*)\) a slight change of notation, i.e. from \(\sum _{u\in U}\) to \(\sum _{i}^k\), from \(\sum _{v\in V}\) to \(\sum _{j}^r\), and from \(|u\cap v|\) to \(n_{ij}\).
\((**)\) assuming disjoint covering partitionings: \(\sum _{i}^k\sum _{j}^r n_{ij} = n\), \(\sum _{i}^k n_{ij} = n_{.j}\), and \(\sum _{j}^r n_{ij} = n_{i.}\).
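As a numeric companion to the \(\varphi = x\log x\) case, the following sketch (not the paper's code) computes mutual information directly from the contingency table \(n_{ij}\) and normalizes it by the geometric mean of the two entropies; this is one common NMI normalization, and the paper discusses several variants.

```python
import numpy as np

def nmi(N):
    # N: contingency table, N[i][j] = |u_i intersect v_j|.
    N = np.asarray(N, dtype=float)
    p_ij = N / N.sum()
    p_i = p_ij.sum(axis=1, keepdims=True)  # marginals n_i. / n
    p_j = p_ij.sum(axis=0, keepdims=True)  # marginals n_.j / n
    nz = p_ij > 0                          # 0 log 0 is taken as 0
    mi = np.sum(p_ij[nz] * np.log((p_ij / (p_i * p_j))[nz]))
    h_u = -np.sum(p_i * np.log(p_i))
    h_v = -np.sum(p_j * np.log(p_j))
    return mi / np.sqrt(h_u * h_v)

# Identical clusterings: the contingency table is diagonal up to a permutation.
print(nmi([[3, 0], [0, 2]]))  # close to 1.0
# Overlaps proportional to the marginals give zero agreement:
print(nmi([[2, 2], [2, 2]]))  # 0.0
```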
1.4 Proof of Identity 2
Similar to the previous proof from the definition we derive:
\((*), (**)\) same as previous proof. \(\square \)
1.5 Proof of Identity 3 and 4
This formula resembles the adjustment for chance in Eq. 4, where the measure being adjusted is \(\sum _{v\in V}\sum _{u\in U} \varphi (\eta _{uv})\), the upper bound used for it is \(\frac{1}{2}[\sum _{v\in V} \varphi (\eta _{.v}) + \sum _{u\in U} \varphi (\eta _{u.})]\), and the expectation is defined as:
Now if we have \(\varphi (xy) = \varphi (x)\varphi (y)\), which is true for \(\varphi (x)=x^2\), we get:
Using this expectation, if we substitute \(\varphi =x^2\) we get the \(ARI'\) of Eq. 6, and using \(\varphi =\left( {\begin{array}{c}x\\ 2\end{array}}\right) \) and the latter reformulation of E, we get the original ARI of Eq. 5, as:
\(\square \)
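The multiplicativity assumption used above can be checked numerically. The quick sketch below (illustrative only) confirms that \(\varphi (xy) = \varphi (x)\varphi (y)\) holds for \(\varphi (x)=x^2\) but fails for \(\varphi (x)=\left( {\begin{array}{c}x\\ 2\end{array}}\right) \), which is why the expectation has to be reformulated before the original ARI is recovered.

```python
from math import comb

sq = lambda x: x * x
for x, y in [(2, 3), (4, 5), (7, 2)]:
    # phi(x) = x^2 is multiplicative:
    assert sq(x * y) == sq(x) * sq(y)
    # phi(x) = binom(x, 2) is not, e.g. binom(6,2)=15 but binom(2,2)*binom(3,2)=3:
    print(x, y, comb(x * y, 2), comb(x, 2) * comb(y, 2))
```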
On the other hand for the NMI, we have:
\(\square \)
1.6 Proof of Identity 5 and 6
First we prove that in general cases we have:
where \(\Vert .\Vert ^2_F\) is the squared Frobenius norm. This holds since we have:
where \(\circ \) is the element-wise matrix product, a.k.a. the Hadamard product, and |.| is the sum of all elements of the matrix. The proof is completed by showing:
Now we can prove the identities for the case of disjoint hard clusters. Using the notation \(n_{ij} = (U^TV)_{ij}\), we have \(\Vert U^TV\Vert ^2_F = \sum _{ij} n^2_{ij}\) and:
\((*)\) with the assumption that clusters are disjoint, \( u_{ki}u_{kj}\) is non-zero only if \(i=j\)
\((**)\) with the assumption that memberships are hard, \(u_{ki}\) is either 0 or 1, and therefore \(u_{ki}= u^2_{ki}\)
\((***)\) the marginals of N give the cluster sizes in U and V, e.g. \(n_{i.} = \sum _{j} n_{ij} = \sum _{k} u_{ki}=|U_i| \)
Therefore for disjoint hard clusters we get:
The RI normalization assumes that all pairs are in disagreement, i.e. \(|{\mathbf {1}}_{n\times n}| = n^2 \), since \(\max (UU^T)=1\) and \(\max (VV^T)=1\). The ARI normalization compares \(\varDelta \) to the difference obtained when the two random variables \((UU^T)_{ij}\) and \((VV^T)_{ij}\) are independent, in which case we would have:
which is calculated by:
Since \(\varDelta = \Vert UU^T - VV^T\Vert ^2_F = \Vert UU^T\Vert ^2_F + \Vert VV^T\Vert ^2_F - 2\,Sum (UU^T \circ VV^T )\), we have \(ARI = 0\), or a normalized distance of 1, i.e. agreement no better than chance, exactly when this independence condition holds, i.e.:
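The identity derived above for disjoint hard clusters can be verified numerically. The following sanity check (a sketch, not the paper's implementation) confirms that \(\varDelta = \Vert UU^T - VV^T\Vert ^2_F = \sum _i n^2_{i.} + \sum _j n^2_{.j} - 2\sum _{ij} n^2_{ij}\), where \(N = U^TV\) is the contingency table:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, r = 30, 4, 5
# Random disjoint hard clusterings as one-hot membership matrices
# (rows index data-points, columns index clusters).
U = np.eye(k)[rng.integers(0, k, n)]
V = np.eye(r)[rng.integers(0, r, n)]

N = U.T @ V  # contingency table n_ij
# Co-membership formulation of the disagreement distance:
delta_comembership = np.linalg.norm(U @ U.T - V @ V.T, "fro") ** 2
# Contingency-table formulation via the marginals n_i. and n_.j:
delta_contingency = ((N.sum(axis=1) ** 2).sum()
                     + (N.sum(axis=0) ** 2).sum()
                     - 2 * (N ** 2).sum())
assert np.isclose(delta_comembership, delta_contingency)
print(delta_comembership)
```

Unlike the contingency form, the co-membership form on the left needs no disjointness assumption, which is what makes the extension to overlapping clusters straightforward.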
Cite this article
Rabbany, R., Zaïane, O.R. Generalization of clustering agreements and distances for overlapping clusters and network communities. Data Min Knowl Disc 29, 1458–1485 (2015). https://doi.org/10.1007/s10618-015-0426-x