Generalization of clustering agreements and distances for overlapping clusters and network communities

Abstract

A measure of distance between two clusterings has important applications, including clustering validation and ensemble clustering. Generally, such a distance measure provides navigation through the space of possible clusterings. Mostly used in cluster validation, a normalized clustering distance, a.k.a. agreement measure, compares a given clustering result against the ground-truth clustering. The two widely used clustering agreement measures are the adjusted Rand index and normalized mutual information. In this paper, we present a generalized clustering distance from which these two measures can be derived. We then use this generalization to construct new measures specific to comparing the (dis)agreement of clusterings in networks, a.k.a. communities. Further, we discuss the difficulty of extending the current, contingency-based formulations to overlapping cases, and present an alternative algebraic formulation for these (dis)agreement measures. Unlike the original measures, the new co-membership-based formulation is easily extended to different cases, including overlapping clusters and clusters of inter-related data. These two extensions are particularly important in the context of finding communities in complex networks.

Notes

  1.

    Refer to Aggarwal and Reddy (2014), Chapter 23 on clustering validation measures (in particular the section on external clustering validation measures); and Chapter 22 on cluster ensembles (in particular the section on measuring similarity between clustering solutions).

  2.

    a.k.a. community mining; refer to Fortunato (2010) for a survey.

  3.

    In other words, it should have a constant baseline, i.e. a constant expected value of agreement between two random clusterings of the same dataset. If the baseline is not constant, an agreement value of 0.7 can indicate either a strong agreement (when the baseline is 0.2) or a weak one (when the baseline is 0.6).

  4.

    The well-studied inter-rater agreement indices in statistics are defined to measure the agreement between different coders, or judges, on categorizing the same data. Examples are the goodness of fit, chi-square test, the likelihood chi-square, kappa measure of agreement, Fisher’s exact test, Krippendorff’s alpha, etc. (see test 16 in Cortina-Borja 2012). These statistical tests are also defined based on the contingency table, which displays the multivariate frequency distribution of the (categorical) variables.

  5.

    Which happens when the clusterings are identical, and only the order of the clusters is permuted, i.e. the distribution of overlaps in each row/column of the contingency table has a single spike on the matched cluster and is zero elsewhere.

  6.

    For example, we introduce an extension of this generalization for clusterings of nodes in graphs, a.k.a. communities, in the following section.

  7.

    Which are measured based on the properties and attributes of data-points.

  8.

    Refer to the supplementary materials for more information, available at: https://github.com/rabbanyk/CommunityEvaluation.

  9.

    Other examples of matching based agreements are the Balanced Error Rate with alignment, average F1 score, and Recall measures used in (Yang and Leskovec 2013; Mcauley and Leskovec 2014; McDaid et al. 2011).

  10.

    For crisp clusters (a.k.a. strict membership), \(u_{ik}\) is restricted to \(\{0, 1\}\) (1 if node i belongs to cluster k and 0 otherwise); whereas for probabilistic clusters (or soft membership), \(u_{ik}\) could be any real number in [0, 1]. Fuzzy clusters usually assume an additional constraint that the total membership of a data-point is equal to one, i.e. \(u_{i.} = \sum _k u_{ik} = 1\). This constraint also holds for disjoint clusters, since each data-point can only belong to one cluster.

  11.

    It is also worth pointing out that in some applications, such as ensemble or multi-view clustering, we may not need the normalization and a measure of distance may suffice.

  12.

    Parameters are chosen similarly to the experiments by Lancichinetti and Fortunato (2009), i.e. networks with 1000 nodes, average degree of 20, max degree of 50, and power law degree exponent of \(-2\); where the size of communities follows a power law distribution with exponent of \(-1\), and ranges between 20 and 100 nodes. Results for other parameter settings, including smaller communities of 10 to 50 nodes, can be found in the supplementary materials.

  13.

    Similar trends are observed for other variations of the agreement measures which can be found in the supplementary materials.

  14.

    The \(\delta \) subscript indicates that the ARI is computed based on our \(\delta \)-based formulation, which is equivalent to the original ARI in this experiment, since communities are covering all nodes and non-overlapping (Identity 6).

  15.

    Available at: https://github.com/rabbanyk/CommunityEvaluation.

  16.

    This equality is also useful in the implementation to improve the scalability.

References

  1. Aggarwal CC, Reddy CK (2014) Data clustering: algorithms and applications. CRC Press, Boca Raton

  2. Albatineh AN, Niewiadomska-Bugaj M, Mihalko D (2006) On similarity indices and correction for chance agreement. J Classif 23:301–313

  3. Anderson DT, Bezdek JC, Popescu M, Keller JM (2010) Comparing fuzzy, probabilistic, and possibilistic partitions. IEEE Trans Fuzzy Syst 18(5):906–918

  4. Banerjee A, Merugu S, Dhillon IS, Ghosh J (2005) Clustering with Bregman divergences. J Mach Learn Res 6:1705–1749

  5. Blondel VD, Guillaume JL, Lambiotte R, Lefebvre E (2008) Fast unfolding of communities in large networks. J Stat Mech Theory Exp 2008:P10008+

  6. Brouwer RK (2008) Extending the rand, adjusted rand and Jaccard indices to fuzzy partitions. J Intell Inf Syst 32(3):213–235

  7. Campello RR (2010) Generalized external indexes for comparing data partitions with overlapping categories. Pattern Recogn Lett 31(9):966–975

  8. Collins LM, Dent CW (1988) Omega: a general formulation of the rand index of cluster recovery suitable for non-disjoint solutions. Multivar Behav Res 23(2):231–242

  9. Cortina-Borja M (2012) Handbook of parametric and nonparametric statistical procedures, 5th edn. J R Stat Soc Ser A 175(3):829–829

  10. Cui Y, Fern X, Dy J (2007) Non-redundant multi-view clustering via orthogonalization. In: Seventh IEEE International Conference on Data Mining, 2007. ICDM 2007. pp 133–142

  11. Dhillon IS, Tropp JA (2007) Matrix nearness problems with Bregman divergences. SIAM J Matrix Anal Appl 29(4):1120–1146

  12. Fortunato S (2010) Community detection in graphs. Phys Rep 486(3–5):75–174

  13. Gregory S (2010) Finding overlapping communities in networks by label propagation. New J Phys 12(10):103018

  14. Hubert L, Arabie P (1985) Comparing partitions. J Classif 2:193–218

  15. Hullermeier E, Rifqi M, Henzgen S, Senge R (2012) Comparing fuzzy partitions: a generalization of the rand index and related measures. IEEE Trans Fuzzy Syst 20(3):546–556

  16. Kulis B, Sustik MA, Dhillon IS (2009) Low-rank kernel learning with Bregman matrix divergences. J Mach Learn Res 10:341–376

  17. Lancichinetti A, Fortunato S (2009) Community detection algorithms: a comparative analysis. Phys Rev E 80(5):056117

  18. Lancichinetti A, Fortunato S, Kertesz J (2008a) Detecting the overlapping and hierarchical community structure of complex networks. New J Phys 11(3):20

  19. Lancichinetti A, Fortunato S, Radicchi F (2008b) Benchmark graphs for testing community detection algorithms. Phys Rev E 78(4):046110

  20. Lancichinetti A, Radicchi F, Ramasco JJ, Fortunato S (2011) Finding statistically significant communities in networks. PLoS One 6(4):e18961

  21. Light RJ, Margolin BH (1971) An analysis of variance for categorical data. J Am Stat Assoc 66(335):534–544

  22. Manning CD, Raghavan P, Schütze H (2008) Introduction to information retrieval. Cambridge University Press, New York

  23. Mcauley J, Leskovec J (2014) Discovering social circles in ego networks. ACM Trans Knowl Discov Data 8(1):4:1–4:28

  24. McDaid A, Hurley N (2010) Detecting highly overlapping communities with model-based overlapping seed expansion. In: 2010 International Conference on Advances in Social Networks Analysis and Mining (ASONAM), IEEE, pp 112–119

  25. McDaid AF, Greene D, Hurley N (2011) Normalized mutual information to evaluate overlapping community finding algorithms. arXiv:1110.2515

  26. Meilă M (2007) Comparing clusterings: an information based distance. J Multivar Anal 98(5):873–895

  27. Newman MEJ (2004) Fast algorithm for detecting community structure in networks. Phys Rev E 69(6):066133

  28. Nielsen F, Nock R (2014) On the chi square and higher-order chi distances for approximating f-divergences. IEEE Signal Process Lett 21(1):10–13

  29. Pons P, Latapy M (2005) Computing communities in large networks using random walks. In: Computer and Information Sciences-ISCIS 2005, Springer, pp 284–293

  30. Quere R, Le Capitaine H, Fraisseix N, Frelicot C (2010) On normalizing fuzzy coincidence matrices to compare fuzzy and/or possibilistic partitions with the rand index. In: 2010 IEEE International Conference on Data Mining, IEEE, pp 977–982

  31. Ronhovde P, Nussinov Z (2009) Multiresolution community detection for megascale networks by information-based replica correlations. Phys Rev E 80(1):016109

  32. Rosvall M, Bergstrom CT (2008) Maps of random walks on complex networks reveal community structure. Proc Natl Acad Sci 105(4):1118–1123

  33. Vinh NX, Epps J, Bailey J (2009) Information theoretic measures for clusterings comparison: is a correction for chance necessary? In: Proceedings of the 26th Annual International Conference on Machine Learning, ACM, New York, ICML ’09, pp 1073–1080

  34. Vinh NX, Epps J, Bailey J (2010) Information theoretic measures for clusterings comparison: variants, properties, normalization and correction for chance. J Mach Learn Res 11:2837–2854

  35. Warrens MJ (2008a) On similarity coefficients for \(2\times 2\) tables and correction for chance. Psychometrika 73:487–502

  36. Warrens MJ (2008b) On the equivalence of Cohen’s Kappa and the Hubert-Arabie adjusted rand index. J Classif 25:177–183

  37. Wu J, Xiong H, Chen J (2009) Adapting the right measures for k-means clustering. In: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, New York, KDD ’09, pp 877–886

  38. Yang J, Leskovec J (2013) Overlapping community detection at scale: a nonnegative matrix factorization approach. In: Proceedings of the sixth ACM international conference on Web search and data mining, ACM, New York, pp 587–596

  39. Zhou D, Li J, Zha H (2005) A new mallows distance based metric for comparing clusterings. In: Proceedings of the 22nd international conference on Machine learning, ACM, New York, pp 1028–1035

Author information

Corresponding author

Correspondence to Reihaneh Rabbany.

Additional information

Responsible editors: Joao Gama, Indre Zliobaite, Alipio Jorge, Concha Bielza.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 166 KB)

Supplementary material 2 (pdf 1073 KB)

Appendix: Proofs

Proof of Proposition 1

From the definition of the variation of information (VI), we have:

$$\begin{aligned} VI(U,V)= & {} H(U) + H(V) - 2I(U,V) = 2H(U,V)-H(U)-H(V)\\= & {} {\mathbf {H(V|U)+H(U|V)}} \end{aligned}$$

On the other hand, we have:

$$\begin{aligned} RI(U,V)&\propto \frac{1}{n^2-n}\left( \sum _{i=1}^k \left[ \sum _{j=1}^r n_{ij}^2 - \left( \sum _{j=1}^r n_{ij}\right) ^2\right] + \sum _{j=1}^r \left[ \sum _{i=1}^k n_{ij}^2 - \left( \sum _{i=1}^k n_{ij}\right) ^2\right] \right) \\&\overset{*}{\propto } \sum _{i=1}^k \left[ E_j(n_{ij}^2) - E_j( n_{ij})^2\right] + \sum _{j=1}^r \left[ E_i(n_{ij}^2) - E_i(n_{ij})^2\right] \\&\overset{*}{\propto } \sum _{i=1}^k Var_j(n_{ij}) + \sum _{j=1}^r Var_i(n_{ij}) \quad \overset{**}{\propto } \quad \mathbf { Var(V|U)+ Var(U|V) } \quad \quad \end{aligned}$$

\(\square \)

\((*)\) \(E_j\)/\(Var_j\) denotes the average/variance of values in the \(j^{th}\) column of the contingency table.

\((**)\) The RI is in fact proportional to the average variance of row/column values in the contingency table, which we denote by conditional variance. For other forms of conditional variance for categorical data see Light and Margolin (1971).
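The entropy half of this proof can be checked numerically. The following is a minimal sketch in plain Python, assuming hard, disjoint labelings given as two equal-length label lists; the function names `entropy` and `vi_three_ways` are illustrative and not taken from the paper's code. It evaluates the three expressions for VI in the first display and shows that they coincide.

```python
import math
from collections import Counter


def entropy(counts, n):
    """Shannon entropy (natural log) of a distribution given as raw counts."""
    return -sum(c / n * math.log(c / n) for c in counts if c > 0)


def vi_three_ways(labels_u, labels_v):
    """Return VI(U, V) computed via the three equivalent expressions."""
    n = len(labels_u)
    n_ij = Counter(zip(labels_u, labels_v))  # contingency table cells
    n_i = Counter(labels_u)                  # row marginals (clusters of U)
    n_j = Counter(labels_v)                  # column marginals (clusters of V)

    h_u = entropy(n_i.values(), n)
    h_v = entropy(n_j.values(), n)
    h_uv = entropy(n_ij.values(), n)
    mutual_info = h_u + h_v - h_uv

    # Conditional entropies computed directly from the table cells.
    h_v_given_u = -sum(c / n * math.log(c / n_i[i]) for (i, _), c in n_ij.items())
    h_u_given_v = -sum(c / n * math.log(c / n_j[j]) for (_, j), c in n_ij.items())

    return (
        h_u + h_v - 2 * mutual_info,   # H(U) + H(V) - 2 I(U,V)
        2 * h_uv - h_u - h_v,          # 2 H(U,V) - H(U) - H(V)
        h_v_given_u + h_u_given_v,     # H(V|U) + H(U|V)
    )


a, b, c = vi_three_ways([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 2, 2])
print(a, b, c)  # three values that agree up to floating-point error
```

The conditional entropies are computed from the cells directly rather than from the joint entropy, so the agreement of the third value with the first two is a genuine check rather than a restatement.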

Proof of Corollary 1

We first show that \(0 \le \mathcal {D}_{\varphi }^\eta (U||V)\), which also gives the lower bound 0 for \(\mathcal {D}_{\varphi }^\eta (U,V)\), since \( \mathcal {D}_{\varphi }^\eta (U,V) = \mathcal {D}_{\varphi }^\eta (U||V) + \mathcal {D}_{\varphi }^\eta (V||U) \). From the superadditivity of \(\varphi \) we have:

$$\begin{aligned} \sum _{u\in U} \varphi (\eta _{uv}) \le \varphi \left( \sum _{u\in U} \eta _{uv}\right)&\implies \sum _{v\in V} \left[ \varphi \left( \sum _{u\in U} \eta _{uv}\right) - \sum _{u\in U} \varphi (\eta _{uv}) \right] \ge 0\\&\implies \mathbf {\mathcal {D}_{\varphi }^\eta (U||V) \ge 0} \end{aligned}$$

Similarly, for the upper bound, positivity and superadditivity respectively give the two inequalities below:

$$\begin{aligned} \mathcal {D}_{\varphi }^\eta (U||V) = \sum _{v\in V} \varphi \left( \sum _{u\in U} \eta _{uv}\right) - \sum _{v\in V} \sum _{u\in U} \varphi (\eta _{uv}) \le \sum _{v\in V} \varphi \left( \sum _{u\in U} \eta _{uv}\right) \le \varphi \left( \sum _{v\in V} \sum _{u\in U} \eta _{uv}\right) \ \end{aligned}$$
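The two bounds can be illustrated on a small overlap table. This is a sketch only, assuming \(\eta \) is given as a nested list of non-negative integer overlaps with one row per cluster of U and one column per cluster of V; the helper names and the example table are hypothetical. It evaluates \(\mathcal {D}_{\varphi }^\eta (U||V)\) for the two choices of \(\varphi \) used in the identities and compares it with 0 and with \(\varphi \) of the total overlap.

```python
import math


def phi_xlogx(x):
    """phi(x) = x log x, with phi(0) = 0."""
    return x * math.log(x) if x > 0 else 0.0


def phi_pairs(x):
    """phi(x) = (x choose 2)."""
    return x * (x - 1) / 2


def directed_distance(eta, phi):
    """D(U||V) = sum_v [ phi(sum_u eta_uv) - sum_u phi(eta_uv) ]."""
    total = 0.0
    for v in range(len(eta[0])):
        column = [row[v] for row in eta]
        total += phi(sum(column)) - sum(phi(x) for x in column)
    return total


def bounds_hold(eta, phi):
    """Check 0 <= D(U||V) <= phi(sum of all overlaps)."""
    upper = phi(sum(sum(row) for row in eta))
    return 0 <= directed_distance(eta, phi) <= upper


eta = [[3, 1, 0],
       [0, 2, 4]]  # hypothetical overlap table: 2 clusters in U, 3 in V
print(directed_distance(eta, phi_pairs))                          # 2.0
print(bounds_hold(eta, phi_pairs), bounds_hold(eta, phi_xlogx))   # True True
```

Only the middle column mixes two clusters of U, so it is the only column that contributes to the distance.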

Proof of Identity 1

The proof is elementary: writing out the definition for \(\varphi (x) ={x}\log x\), we get:

$$\begin{aligned} \mathcal {N}\mathcal {D}_{x\log x}^{|\cap |}(U,V)&= \frac{\sum _{v\in V} \sum _{u\in U} |u\cap v| \left[ \log \left( \sum _{u\in U} |u\cap v|\right) - \log \left( |u\cap v|\right) \right] }{\left( \sum _{v\in V} \sum _{u\in U} |u\cap v| \right) \log \left( \sum _{v\in V} \sum _{u\in U} |u\cap v| \right) }\\&\quad + \frac{\sum _{u\in U} \sum _{v\in V} |u\cap v| \left[ \log \left( \sum _{v\in V} |u\cap v|\right) - \log (|u\cap v|) \right] }{\left( \sum _{u\in U} \sum _{v\in V}|u\cap v|\right) \log \left( \sum _{u\in U} \sum _{v\in V}|u\cap v|\right) }\\&\overset{*}{=} \frac{\sum _{j}^r \sum _{i}^k n_{ij} \left[ \log \left( \sum _{i}^k n_{ij}\right) + \log \left( \sum _{j}^r n_{ij}\right) -2 \log \left( n_{ij}\right) \right] }{\left( \sum _{i}^k\sum _{j}^{r} n_{ij}\right) \log \left( \sum _{i} ^k\sum _{j}^r n_{ij}\right) }\\&\overset{**}{=} \frac{1}{\log n}\sum _{j}^r \sum _{i}^k \frac{ n_{ij} }{n} \log \left( \frac{ n_{i.} n_{.j}}{n_{ij}^2}\right) = \frac{VI(U,V)}{\log n}\quad \quad \end{aligned}$$

\(\square \)

\((*)\) slight change of notation, i.e. from \(\sum _{u\in U}\) to \(\sum _{i}^k\), \(\sum _{v\in V}\) to \(\sum _{j}^r\) and \(|u\cap v|\) to \(n_{ij}\).

\((**)\) assuming disjoint covering partitionings: \(\sum _{i}^k\sum _{j}^r n_{ij} = n\), \(\sum _{i}^k n_{ij} = n_{.j}\) and \(\sum _{j}^r n_{ij} = n_{i.}\).
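Identity 1 can be verified numerically for disjoint, covering partitions. The sketch below assumes label lists as input and uses illustrative function names; it computes the normalized distance with \(\varphi (x) = x\log x\) from the contingency counts and, separately, \(VI(U,V)/\log n\) from entropies.

```python
import math
from collections import Counter


def xlogx(x):
    return x * math.log(x) if x > 0 else 0.0


def normalized_distance_xlogx(labels_u, labels_v):
    """ND with phi = x log x and eta = |u ∩ v|, for disjoint covering partitions."""
    n_ij = Counter(zip(labels_u, labels_v))
    n_i = Counter(labels_u)
    n_j = Counter(labels_v)
    joint = sum(xlogx(c) for c in n_ij.values())
    d_u_v = sum(xlogx(c) for c in n_j.values()) - joint  # D(U||V)
    d_v_u = sum(xlogx(c) for c in n_i.values()) - joint  # D(V||U)
    return (d_u_v + d_v_u) / xlogx(len(labels_u))


def vi_over_log_n(labels_u, labels_v):
    """VI(U, V) / log n, computed from entropies."""
    n = len(labels_u)

    def h(counter):
        return -sum(c / n * math.log(c / n) for c in counter.values())

    vi = (2 * h(Counter(zip(labels_u, labels_v)))
          - h(Counter(labels_u)) - h(Counter(labels_v)))
    return vi / math.log(n)


u = [0, 0, 0, 1, 1, 1]
v = [0, 0, 1, 1, 2, 2]
print(normalized_distance_xlogx(u, v), vi_over_log_n(u, v))  # equal up to rounding
```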

Proof of Identity 2

As in the previous proof, from the definition we derive:

$$\begin{aligned} \mathcal {N}\mathcal {D}_{\left( {\begin{array}{c}x\\ 2\end{array}}\right) }^{|\cap |}(U,V)&\overset{*}{=} \frac{ \sum _{j}^r \left[ \left( \sum _{i}^k n_{ij}\right) ^2 -\sum _{i}^k n_{ij}^2 \right] + \sum _{i}^k \left[ \left( \sum _{j}^r n_{ij}\right) ^2 - \sum _{j}^r n_{ij}^2 \right] }{\left( \sum _{i}^k\sum _{j}^r n_{ij}\right) ^2-\sum _{i}^k\sum _{j}^r n_{ij}} \\&\overset{**}{=} \frac{1}{n^2-n} \left[ \sum _{j}^r (n_{.j})^2 + \sum _{i}^k (n_{i.})^2 - 2 \sum _{j}^r \sum _{i}^k n_{ij}^2 \right] = 1-RI(U,V) \end{aligned}$$

\((*), (**)\) same as previous proof. \(\square \)
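Identity 2 can be checked against a brute-force Rand index. The sketch below again assumes hard, disjoint label lists, with illustrative function names: one function evaluates the contingency-based normalized distance with \(\varphi (x) = \left( {\begin{array}{c}x\\ 2\end{array}}\right) \), the other counts agreeing pairs directly.

```python
from collections import Counter
from itertools import combinations


def pairs(x):
    """Number of unordered pairs among x items."""
    return x * (x - 1) // 2


def normalized_distance_pairs(labels_u, labels_v):
    """ND with phi = (x choose 2), computed from the contingency table."""
    joint = sum(pairs(c) for c in Counter(zip(labels_u, labels_v)).values())
    rows = sum(pairs(c) for c in Counter(labels_u).values())
    cols = sum(pairs(c) for c in Counter(labels_v).values())
    return (rows + cols - 2 * joint) / pairs(len(labels_u))


def rand_index(labels_u, labels_v):
    """Fraction of node pairs on which the two clusterings agree."""
    n = len(labels_u)
    agree = sum(
        (labels_u[a] == labels_u[b]) == (labels_v[a] == labels_v[b])
        for a, b in combinations(range(n), 2)
    )
    return agree / pairs(n)


u = [0, 0, 0, 1, 1, 1]
v = [0, 0, 1, 1, 2, 2]
print(normalized_distance_pairs(u, v), 1 - rand_index(u, v))  # both 1/3 up to rounding
```

In this example 5 of the 15 node pairs are grouped together by one clustering and separated by the other, which gives the distance of 1/3.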

Proof of Identity 3 and 4

$$\begin{aligned}&\mathcal {A}\mathcal {D}_{\varphi }^\eta = \frac{\sum _{v\in V} \varphi (\eta _{.v}) + \sum _{u\in U} \varphi (\eta _{u.}) -2 \sum _{v\in V} \sum _{u\in U} \varphi (\eta _{uv})}{ \sum _{v\in V} \varphi (\eta _{.v}) + \sum _{u\in U} \varphi (\eta _{u.}) - 2 \sum _{u\in U}\sum _{v\in V} \varphi \left( \frac{\eta _{.v}\eta _{u.}}{\sum _{u\in U}\sum _{v\in V} \eta _{uv} }\right) }\\ \Rightarrow \;&1 - \mathcal {A}\mathcal {D}_{\varphi }^{\eta }(U,V) = \frac{ \sum _{v\in V} \sum _{u\in U} \varphi (\eta _{uv})- \sum _{u\in U}\sum _{v\in V} \varphi \left( \frac{\eta _{.v}\eta _{u.}}{\sum _{u\in U}\sum _{v\in V} \eta _{uv} }\right) }{ \frac{1}{2}\left[ \sum _{v\in V} \varphi (\eta _{.v}) + \sum _{u\in U} \varphi (\eta _{u.})\right] - \sum _{u\in U}\sum _{v\in V} \varphi \left( \frac{\eta _{.v}\eta _{u.}}{\sum _{u\in U}\sum _{v\in V} \eta _{uv} }\right) } \end{aligned}$$

This formula resembles the adjustment for chance in Eq. 4, where the measure being adjusted is \(\sum _{v\in V}\sum _{u\in U} \varphi (\eta _{uv})\), the upper bound used for it is \(\frac{1}{2}[\sum _{v\in V} \varphi (\eta _{.v}) + \sum _{u\in U} \varphi (\eta _{u.})]\), and the expectation is defined as:

$$\begin{aligned} E\left[ \sum _{v\in V}\sum _{u\in U} \varphi (\eta _{uv})\right] = \sum _{u\in U}\sum _{v\in V} \varphi \left( \frac{\eta _{.v}\eta _{u.}}{\sum _{u\in U}\sum _{v\in V} \eta _{uv} }\right) \end{aligned}$$

Now if we have \(\varphi (xy) = \varphi (x)\varphi (y)\), which is true for \(\varphi (x)=x^2\), we get:

$$\begin{aligned} E\left[ \sum _{v\in V}\sum _{u\in U} \varphi (\eta _{uv})\right] = \sum _{u\in U}\sum _{v\in V} \frac{\varphi (\eta _{.v})\varphi (\eta _{u.})}{\varphi \left( \sum _{u\in U}\sum _{v\in V} \eta _{uv} \right) } = \frac{\sum _{v\in V} \varphi (\eta _{.v})\sum _{u\in U}\varphi (\eta _{u.})}{\varphi \left( \sum _{u\in U}\sum _{v\in V} \eta _{uv} \right) } \end{aligned}$$

Using this expectation, if we substitute \(\varphi =x^2\) we get the \(ARI'\) of Eq. 6, and using \(\varphi =\left( {\begin{array}{c}x\\ 2\end{array}}\right) \) and the latter reformulation of E, we get the original ARI of Eq. 5, as:

$$\begin{aligned} 1 - \mathcal {A}\mathcal {D}_{\left( {\begin{array}{c}x\\ 2\end{array}}\right) }^{|\cap |}(U,V)&=\frac{ \sum \limits _{v\in V} \sum \limits _{u\in U} \left( {\begin{array}{c}|u\cap v|\\ 2\end{array}}\right) - E\left( \sum \limits _{v\in V} \sum \limits _{u\in U} \left( {\begin{array}{c}|u\cap v|\\ 2\end{array}}\right) \right) }{ \frac{1}{2}\left[ \sum \limits _{v\in V}\left( {\begin{array}{c}\sum \limits _{u\in U} |u\cap v|\\ 2\end{array}}\right) + \sum \limits _{u\in U}\left( {\begin{array}{c}\sum \limits _{v\in V} |u\cap v|\\ 2\end{array}}\right) \right] - E\left( \sum \limits _{v\in V} \sum \limits _{u\in U} \left( {\begin{array}{c}|u\cap v|\\ 2\end{array}}\right) \right) }\\&\text {where} \quad E\left( \sum \limits _{v\in V} \sum \limits _{u\in U} \left( {\begin{array}{c}|u\cap v|\\ 2\end{array}}\right) \right) = \frac{ \sum \limits _{v\in V}\left( {\begin{array}{c}\sum \limits _{u\in U} |u\cap v|\\ 2\end{array}}\right) \sum \limits _{u\in U}\left( {\begin{array}{c}\sum \limits _{v\in V} |u\cap v|\\ 2\end{array}}\right) }{ \left( {\begin{array}{c} n\\ 2\end{array}}\right) }\\ \Rightarrow 1 - \mathcal {A}\mathcal {D}_{\left( {\begin{array}{c}x\\ 2\end{array}}\right) }^{|\cap |}(U,V)&\overset{*,**}{=} \frac{ \sum _j^r \sum _i^k \left( {\begin{array}{c}n_{ij}\\ 2\end{array}}\right) - \sum _j^r \left( {\begin{array}{c}n_{.j}\\ 2\end{array}}\right) \sum _i^k\left( {\begin{array}{c}n_{i.}\\ 2\end{array}}\right) /\left( {\begin{array}{c}n\\ 2\end{array}}\right) }{ \frac{1}{2}\left[ \sum _j^r\left( {\begin{array}{c}n_{.j}\\ 2\end{array}}\right) + \sum _i^k\left( {\begin{array}{c}n_{i.}\\ 2\end{array}}\right) \right] -\sum _j^r \left( {\begin{array}{c}n_{.j}\\ 2\end{array}}\right) \sum _i^k\left( {\begin{array}{c}n_{i.}\\ 2\end{array}}\right) /\left( {\begin{array}{c}n\\ 2\end{array}}\right) } \;= ARI(U,V)\\&(*), (**) \text { same as proof of identity 1.} \end{aligned}$$

\(\square \)
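The ARI case can be made concrete with a short sketch, assuming hard, disjoint label lists and at least one pair of nodes separated by some clustering (otherwise the denominators vanish). The function names are illustrative. The first function evaluates \(1 - \mathcal {A}\mathcal {D}\) in its distance form with the product-form expectation; the second evaluates the final contingency expression for the ARI.

```python
from collections import Counter


def pairs(x):
    return x * (x - 1) / 2


def pair_sums(labels_u, labels_v):
    """Sums of (x choose 2) over cells, rows, and columns, plus (n choose 2)."""
    s_ij = sum(pairs(c) for c in Counter(zip(labels_u, labels_v)).values())
    s_i = sum(pairs(c) for c in Counter(labels_u).values())
    s_j = sum(pairs(c) for c in Counter(labels_v).values())
    return s_ij, s_i, s_j, pairs(len(labels_u))


def one_minus_adjusted_distance(labels_u, labels_v):
    """1 - AD with phi = (x choose 2) and the product-form expectation."""
    s_ij, s_i, s_j, total = pair_sums(labels_u, labels_v)
    expected = s_i * s_j / total
    adjusted_distance = (s_j + s_i - 2 * s_ij) / (s_j + s_i - 2 * expected)
    return 1 - adjusted_distance


def ari_contingency(labels_u, labels_v):
    """ARI in its usual contingency-table form."""
    s_ij, s_i, s_j, total = pair_sums(labels_u, labels_v)
    expected = s_i * s_j / total
    return (s_ij - expected) / (0.5 * (s_i + s_j) - expected)


u = [0, 0, 0, 1, 1, 1]
v = [0, 0, 1, 1, 2, 2]
print(one_minus_adjusted_distance(u, v), ari_contingency(u, v))  # equal up to rounding
```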

On the other hand for the NMI, we have:

$$\begin{aligned}&1 - \mathcal {A}\mathcal {D}_{x\log x}^{|\cap |}(U,V) = \frac{ \sum \limits _{v\in V} \sum \limits _{u\in U} {n_{uv}}\log {n_{uv}} - E\left( \sum \limits _{v\in V} \sum \limits _{u\in U} {n_{uv}}\log {n_{uv}}\right) }{ \frac{1}{2}\left[ \sum \limits _{v\in V} {n_{.v}\log {n_{.v}}} + \sum \limits _{u\in U} n_{u.}\log {n_{u.}} \right] - E\left( \sum \limits _{v\in V} \sum \limits _{u\in U}{n_{uv}} \log {n_{uv}}\right) }\\&\quad \text {where} \; E\left( \sum \limits _{v\in V} \sum \limits _{u\in U} {n_{uv}}\log {n_{uv}}\right) = \sum _{u\in U}\sum _{v\in V} \left( \frac{n_{.v}n_{u.}}{\sum _{u\in U}\sum _{v\in V} n_{uv} }\right) \log \left( \frac{n_{.v}n_{u.}}{\sum _{u\in U}\sum _{v\in V} n_{uv} }\right) \\&\quad \Rightarrow \;1 - \mathcal {A}\mathcal {D}_{x\log {x}}^{|\cap |}(U,V) \overset{*,**}{=} \frac{ \sum _j^r \sum _i^k n_{ij}\log {n_{ij}} - \sum _i^k\sum _j^r \frac{n_{.j}n_{i.}}{n}\log {\frac{n_{.j}n_{i.}}{n}} }{ \frac{1}{2}\left[ \sum _j^r n_{.j}\log {n_{.j}} + \sum _i^k n_{i.}\log {n_{i.}}\right] - \sum _i^k\sum _j^r \frac{n_{.j}n_{i.}}{n}\log {\frac{n_{.j}n_{i.}}{n}}} \\&\quad = \frac{ n \sum _j^r \sum _i^k \frac{n_{ij}}{n}\log \frac{n_{ij}}{n} + n \log n - \sum _i^k\sum _j^r \frac{n_{.j}n_{i.}}{n} \left[ \log {\frac{n_{.j}}{n}}+\log {\frac{n_{i.}}{n}}+\log {n}\right] }{ \frac{n}{2}\left[ \sum _j^r \frac{n_{.j}}{n} \log {\frac{n_{.j}}{n}} + \sum _i^k \frac{n_{i.}}{n}\log \frac{n_{i.}}{n} + 2\log n \right] - \sum _i^k\sum _j^r \frac{n_{.j}n_{i.}}{n}\left[ \log {\frac{n_{.j}}{n}}+\log {\frac{n_{i.}}{n}}+\log {n}\right] } \\&\quad = \frac{ -H(U,V) + \log n - \sum _i^k\frac{n_{i.}}{n}\sum _j^r \frac{n_{.j}}{n} \log {\frac{n_{.j}}{n}} - \sum _j^r\frac{n_{.j}}{n}\sum _i^k \frac{n_{i.}}{n}\log {\frac{n_{i.}}{n}}-\sum _i^k\sum _j^r \frac{n_{.j}n_{i.}}{n^2}\log {n} }{ \frac{1}{2}\left[ -H(U) - H(V) \right] + \log n + \sum _i^k\frac{n_{i.}}{n}H(V) +\sum _j^r\frac{n_{.j}}{n}H(U)-\log {n} }\\&\quad = \frac{ - H(U,V) + H(V) +H(U) }{ -\frac{1}{2}\left[ H(U) + H(V) \right] + H(V) +H(U) } = \frac{I(U,V)}{\frac{1}{2}\left[ H(U) + H(V) \right] } = NMI_{sum}(U,V)\\&\quad (*), (**) \text { same as proof of identity 1.} \end{aligned}$$

\(\square \)
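The NMI derivation can be checked in the same spirit. The sketch below assumes hard, disjoint label lists with at least two clusters in one of the clusterings; the function names are illustrative. It evaluates the adjusted form with \(\varphi (x)=x\log x\) and the expectation \(\sum _i\sum _j \varphi (n_{i.}n_{.j}/n)\), and compares it with \(NMI_{sum}\) computed from entropies.

```python
import math
from collections import Counter


def xlogx(x):
    return x * math.log(x) if x > 0 else 0.0


def one_minus_adjusted_distance_xlogx(labels_u, labels_v):
    """1 - AD with phi = x log x and E = sum_ij phi(n_i. * n_.j / n)."""
    n = len(labels_u)
    n_i = Counter(labels_u)
    n_j = Counter(labels_v)
    s_ij = sum(xlogx(c) for c in Counter(zip(labels_u, labels_v)).values())
    s_i = sum(xlogx(c) for c in n_i.values())
    s_j = sum(xlogx(c) for c in n_j.values())
    expected = sum(xlogx(a * b / n) for a in n_i.values() for b in n_j.values())
    return (s_ij - expected) / (0.5 * (s_i + s_j) - expected)


def nmi_sum(labels_u, labels_v):
    """I(U,V) / ((H(U) + H(V)) / 2), computed from entropies."""
    n = len(labels_u)

    def h(counter):
        return -sum(c / n * math.log(c / n) for c in counter.values())

    h_u, h_v = h(Counter(labels_u)), h(Counter(labels_v))
    h_uv = h(Counter(zip(labels_u, labels_v)))
    return (h_u + h_v - h_uv) / (0.5 * (h_u + h_v))


u = [0, 0, 0, 1, 1, 1]
v = [0, 0, 1, 1, 2, 2]
print(one_minus_adjusted_distance_xlogx(u, v), nmi_sum(u, v))  # equal up to rounding
```

Unlike the ARI case, the expectation here keeps \(\varphi \) applied to the product of marginals, since \(x\log x\) is not multiplicative.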

Proof of Identity 5 and 6

First we prove that in general cases we have:

$$\begin{aligned} \Vert UU^T - VV^T\Vert ^2_F = \Vert U^TU\Vert ^2_F + \Vert V^TV\Vert ^2_F - 2\Vert U^TV\Vert ^2_F \end{aligned}$$

where \(\Vert .\Vert ^2_F\) is the squared Frobenius norm. This holds since we have:

$$\begin{aligned} \Vert UU^T - VV^T\Vert ^2_F&= \sum _{ij}(UU^T - VV^T)_{ij}^2\\&= \sum _{ij}(UU^T)_{ij}^2 + \sum _{ij}(VV^T)_{ij}^2 - 2\sum _{ij}(UU^T)_{ij}(VV^T)_{ij}\\&= \Vert UU^T\Vert ^2_F + \Vert VV^T\Vert ^2_F - 2 |UU^T \circ VV^T| \end{aligned}$$

where \(\circ \) is the element-wise matrix product, a.k.a. the Hadamard product, and |.| is the sum of all elements in the matrix (see Note 16). The proof is completed by showing:

$$\begin{aligned} |UU^T \circ VV^T|&= tr ((UU^T)^T VV^T) = tr (V^TUU^TV)\\ {}&= tr ( (U^TV)^T U^TV) = ||U^TV||^2_F \\ ||UU^T||^2_F&=tr((UU^T)^T UU^T) =tr (U^TU U^TU) \\ {}&=tr ( (U^TU)^T U^TU) = ||U^TU||^2_F \end{aligned}$$
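Because this identity makes no assumption on the memberships, it can be checked on overlapping and soft clusters. The following sketch uses plain nested lists as matrices, with one row per node and one column per cluster; the helper names and the example memberships are hypothetical. It evaluates both sides of the identity.

```python
def transpose(a):
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    b_t = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in b_t] for row in a]


def frob2(a):
    """Squared Frobenius norm."""
    return sum(x * x for row in a for x in row)


def frobenius_identity_sides(u, v):
    """Return (||UU^T - VV^T||_F^2, ||U^TU||_F^2 + ||V^TV||_F^2 - 2||U^TV||_F^2)."""
    u_t, v_t = transpose(u), transpose(v)
    uu, vv = matmul(u, u_t), matmul(v, v_t)
    diff = [[x - y for x, y in zip(ru, rv)] for ru, rv in zip(uu, vv)]
    lhs = frob2(diff)
    rhs = frob2(matmul(u_t, u)) + frob2(matmul(v_t, v)) - 2 * frob2(matmul(u_t, v))
    return lhs, rhs


# Hypothetical memberships for 4 nodes: U has an overlapping and a soft node,
# V has one node belonging to two clusters.
U = [[1, 0], [1, 1], [0, 1], [0.5, 0.5]]
V = [[1, 0, 0], [0, 1, 0], [0, 1, 1], [0, 0, 1]]
print(frobenius_identity_sides(U, V))  # two values that agree up to rounding
```

The left-hand side works with \(n\times n\) matrices, while the right-hand side only needs \(k\times k\), \(r\times r\) and \(k\times r\) matrices, which is the scalability benefit mentioned in Note 16.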

Now we can prove the identities for the case of disjoint hard clusters. Using the notation \(n_{ij} = (U^TV)_{ij}\), we have \(\Vert U^TV\Vert ^2_F = \sum _{ij} n^2_{ij}\) and:

$$\begin{aligned} \Vert U^TU\Vert ^2_F&= \sum _{ij} \langle U_{.i},U_{.j}\rangle ^2 = \sum _{ij}\left( \sum _k u_{ki}u_{kj}\right) ^2 \overset{*}{=} \sum _{i} \left( \sum _k u^2_{ki}\right) ^2\\&\overset{**}{=} \sum _{i} \left( \sum _k u_{ki}\right) ^2 \overset{***}{=} \sum _{i} n_{i.}^2 \end{aligned}$$

\((*)\) with the assumption that clusters are disjoint, \( u_{ki}u_{kj}\) can be non-zero only if \(i=j\)

\((**)\) with the assumption that memberships are hard, \(u_{ki}\) is either 0 or 1, therefore \(u_{ki}= u^2_{ki}\)

\((***)\) marginals of N give cluster sizes in U and V, i.e. \(n_{i.} = \sum _{j} n_{ij} = \sum _{k} u_{ki}=|U_i| \)

Therefore for disjoint hard clusters we get:

$$\begin{aligned} \Vert UU^T - VV^T\Vert ^2_F = \sum _{i} n_{i.}^2 +\sum _{j} n_{.j}^2 - 2 \sum _{ij} n^2_{ij} \end{aligned}$$
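For disjoint hard clusters, both sides of this equation reduce to simple counts, which the following sketch illustrates. It assumes label lists as input and uses illustrative function names. The left side counts ordered node pairs whose co-membership differs between the two clusterings; the right side is evaluated from the contingency table.

```python
from collections import Counter


def comembership_distance(labels_u, labels_v):
    """||UU^T - VV^T||_F^2 for hard disjoint clusters, over ordered node pairs."""
    n = len(labels_u)
    return sum(
        ((labels_u[a] == labels_u[b]) - (labels_v[a] == labels_v[b])) ** 2
        for a in range(n)
        for b in range(n)
    )


def contingency_form(labels_u, labels_v):
    """sum_i n_i.^2 + sum_j n_.j^2 - 2 sum_ij n_ij^2."""
    rows = sum(c * c for c in Counter(labels_u).values())
    cols = sum(c * c for c in Counter(labels_v).values())
    cells = sum(c * c for c in Counter(zip(labels_u, labels_v)).values())
    return rows + cols - 2 * cells


u = [0, 0, 0, 1, 1, 1]
v = [0, 0, 1, 1, 2, 2]
print(comembership_distance(u, v), contingency_form(u, v))  # 10 10
```

The value 10 is twice the number of disagreeing unordered pairs (5), since the Frobenius norm sums over ordered pairs.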

The RI normalization assumes that all pairs are in disagreement, i.e. \(|{\mathbf {1}}_{n\times n}| = n^2 \), since \(\max (UU^T)=1\) and \(\max (VV^T)=1\). The ARI normalization compares \(\varDelta \) to the difference obtained when the two random variables \((UU^T)_{ij}\) and \((VV^T)_{ij}\) are independent, in which case we would have:

$$\begin{aligned} E\left( (UU^T)_{ij}(VV^T)_{ij}\right) = E((UU^T)_{ij}) E((VV^T)_{ij}) \end{aligned}$$

which is calculated by:

$$\begin{aligned} \frac{\sum _{ij} ((UU^T)_{ij} (VV^T)_{ij} )}{n^2} = \frac{\sum _{ij} (UU^T)_{ij}}{n^2} \frac{\sum _{ij} (VV^T)_{ij}}{n^2} \end{aligned}$$

Since \(\varDelta = ||UU^T - VV^T||^2_F = ||UU^T||^2_F + ||VV^T||^2_F - 2Sum (UU^T \circ VV^T )\), where \(Sum(.)\) is the sum of all elements (written |.| above), we have \(ARI = 0\), or equivalently a normalized distance of 1 (agreement no better than chance), when this independence condition holds, i.e.:

$$\begin{aligned} Sum (UU^T \circ VV^T ) = \frac{|UU^T||VV^T|}{n^2} \end{aligned}$$
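A small worked example makes the independence condition concrete. The sketch below assumes membership matrices given as nested lists, with one row per node and one column per cluster; the function names and the example clusterings are hypothetical. It computes the observed term \(Sum(UU^T \circ VV^T)\) and the chance term \(|UU^T||VV^T|/n^2\).

```python
def comembership(m):
    """M M^T for a membership matrix M with one row per node."""
    return [[sum(x * y for x, y in zip(r1, r2)) for r2 in m] for r1 in m]


def observed_and_chance(u, v):
    """Return (Sum(UU^T ∘ VV^T), |UU^T| |VV^T| / n^2)."""
    a, b = comembership(u), comembership(v)
    n = len(u)
    observed = sum(x * y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    total_a = sum(sum(row) for row in a)
    total_b = sum(sum(row) for row in b)
    return observed, total_a * total_b / n ** 2


# Four nodes: U groups {0,1} and {2,3}; V groups {0,2} and {1,3}.
U = [[1, 0], [1, 0], [0, 1], [0, 1]]
V = [[1, 0], [0, 1], [1, 0], [0, 1]]
print(observed_and_chance(U, V))  # (4, 4.0): chance-level agreement
print(observed_and_chance(U, U))  # (8, 4.0): identical clusterings exceed chance
```

In the first case every cluster of U is split evenly across the clusters of V, so the observed co-membership overlap equals the chance term exactly. Because the computation only uses the membership matrices, the same functions accept overlapping or soft memberships.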

Cite this article

Rabbany, R., Zaïane, O.R. Generalization of clustering agreements and distances for overlapping clusters and network communities. Data Min Knowl Disc 29, 1458–1485 (2015). https://doi.org/10.1007/s10618-015-0426-x

Keywords

  • Clustering agreement
  • Cluster evaluation
  • Cluster validation
  • Network clusters
  • Community detection
  • Overlapping clusters