Communities validity: methodical evaluation of community mining algorithms
Abstract
Grouping data points is one of the fundamental tasks in data mining, commonly known as clustering when data points are described by attributes. When dealing with interrelated data, represented in the form of a graph wherein a link between two nodes indicates a relationship between them, a considerable number of approaches have been proposed in recent years for mining communities in a given network. However, little work has been done on how to evaluate community mining algorithms. The common practice is to evaluate an algorithm based on its performance on standard benchmarks for which we know the ground truth. This technique is similar to the external evaluation of attribute-based clustering methods. The other two well-studied clustering evaluation approaches are less explored in the community mining context: internal evaluation, to statistically validate the clustering result, and relative evaluation, to compare alternative clustering results. These two approaches enable us to validate communities discovered in a real-world application, where the true community structure is hidden in the data. In this article, we investigate different clustering quality criteria applied for relative and internal evaluation of clustering data points with attributes, as well as different clustering agreement measures used for external evaluation, and incorporate proper adaptations to make them applicable in the context of interrelated data. We further compare the performance of the proposed adapted criteria in evaluating community mining results in different settings through an extensive set of experiments.
Keywords
Evaluation approaches · Quality measures · Clustering evaluation · Clustering objective function · Community mining

1 Introduction
Data mining is the analysis of large-scale data to discover meaningful patterns such as groups of data records (cluster analysis), unusual records (anomaly detection) or dependencies (association rule mining), which are crucial in a very broad range of applications. It is a multidisciplinary field that involves methods at the intersection of artificial intelligence, machine learning, statistics and database systems. A recent growing trend in the data mining field is the analysis of structured/interrelated data, motivated by the natural presence of relationships between data points in a variety of present-day applications. The structures in these interrelated data are typically modelled by a graph of interconnected nodes, known as a complex network or information network. Examples of such networks are hyperlink networks of web pages, citation or collaboration networks of scholars, biological networks of genes or proteins, and trust and social networks of humans, among others.
All these networks exhibit common statistical properties, such as power law degree distribution, small-world phenomenon, relatively high transitivity, shrinking diameter and densification power laws (Leskovec et al. 2005; Newman 2010). Network clustering, a.k.a. community mining, is one of the principal tasks in the analysis of complex networks. Many community mining algorithms have been proposed in recent years: for surveys refer to Fortunato (2010), Porter et al. (2009). These algorithms evolved very quickly from simple heuristic approaches to more sophisticated optimization-based methods that are explicitly or implicitly trying to maximize the goodness of the discovered communities. The broadly used explicit maximization objective is the modularity introduced by Newman and Girvan (2004).
Although there have been many methods proposed for community mining, very little research has been done to explore evaluation and validation methodologies. Similar to the well-studied clustering validity methods in the Machine Learning field, there are three classes of approaches to evaluate community mining algorithms: external, internal and relative evaluation. The first two are statistical tests that measure the degree to which a clustering conforms to an a priori specified scheme. The third approach compares and ranks clusterings of the same dataset discovered by different parameter settings (Halkidi et al. 2001).
In this article, we investigate evaluation approaches for community mining algorithms within the same classification framework. We classify the common evaluation practices into external, internal and relative approaches and further extend these by introducing a new set of adapted criteria and measures that are adequate for community mining evaluation. More specifically, the evaluation approaches are defined based on different clustering validity criteria and clustering similarity measures. We propose the adaptations these measures require to handle comparison of community mining results. Most of the validity criteria introduced and adapted here are applied to the context of interrelated data, i.e. used for community mining evaluation, for the first time. These criteria can be used not only to measure the goodness of discovered communities, but also as objective functions to detect communities. Furthermore, we propose the adaptation of clustering similarity measures for the context of interrelated data, which has been overlooked in the previous literature. Apart from evaluation, these clustering similarity measures can also be used to determine the number of clusters in a data set or to combine different clustering results and obtain a consensus clustering (Vinh et al. 2010).
The remainder of this paper is organized as follows: in the next section, we first present some background, where we briefly introduce the well-known community mining algorithms and the related work regarding evaluation of these algorithms. We continue the background with an elaboration on the three classes of evaluation approaches incorporating the common evaluation practices. In the subsequent section, we overview the clustering validity criteria and clustering similarity measures and introduce our proposed adaptations of these measures for the context of interrelated data. Then, we extensively compare and discuss the performance of these adapted validity criteria and the properties of the adapted similarity measures, through a set of carefully designed experiments on real and synthetic networks. Finally, we conclude with a brief analysis of these results.
2 Background and related works
A community is roughly defined as a group of “densely connected” individuals that are “loosely connected” to others outside their group. A large number of community mining algorithms have been developed in the past few years, each with a different interpretation of this definition. Basic heuristic approaches mine communities by assuming that the network of interest divides naturally into some subgroups, determined by the network itself. For instance, the Clique Percolation Method (Palla et al. 2005) finds groups of nodes that can be reached via chains of k-cliques. The common optimization approaches mine communities by maximizing the overall “goodness” of the result. The most credible “goodness” objective is known as modularity Q, proposed in Newman and Girvan (2004), which considers the difference between the fraction of edges that are within the communities and the expected such fraction if the edges were randomly distributed. Several community mining algorithms for optimizing the modularity Q have been proposed, such as fast modularity (Newman 2006) and Max–Min modularity (Chen et al. 2009).
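As a small illustration (not taken from the paper), modularity Q and a modularity-maximizing partition can be computed with the `networkx` library; Zachary's karate club graph is used here only as a stand-in example network:

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities, modularity

# A small example network (Zachary's karate club, 34 nodes)
G = nx.karate_club_graph()

# Greedy modularity maximization (Clauset-Newman-Moore), ignoring edge weights
communities = greedy_modularity_communities(G)

# Newman-Girvan modularity Q of the discovered partition (unweighted)
Q = modularity(G, communities, weight=None)
print(f"{len(communities)} communities, Q = {Q:.3f}")
```

Higher Q indicates a partition whose internal edge fraction exceeds the random expectation under the configuration model.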
Although many mining algorithms are based on the concept of modularity, Fortunato and Barthélemy (2007) have shown that modularity cannot accurately evaluate small communities due to its resolution limit. Hence, any algorithm based on modularity is biased against small communities. As an alternative to optimizing modularity Q, we previously proposed the TopLeaders community mining approach (Rabbany et al. 2010), which implicitly maximizes the overall closeness of followers and leaders, assuming that a community is a set of followers congregating around a potential leader. There are many other alternative methods. One notable family of approaches mines communities by utilizing information theory concepts such as compression, e.g. Infomap (Rosvall and Bergstrom 2008), and entropy, e.g. the entropy-based method (Kenley and Cho 2011). For a survey on different community mining techniques refer to Fortunato (2010).
Fortunato (2010) shows that different community mining algorithms discover communities from different perspectives, may outperform others on specific classes of networks, and have different computational complexities. Therefore, an important research direction is to evaluate and compare the results of different community mining algorithms and select the one providing the most meaningful clustering for each class of networks. An intuitive practice is to validate the results partly by a human expert (Luo et al. 2008). However, the community mining problem is NP-complete; human expert validation is limited and based on narrow intuition rather than on an exhaustive examination of the relations in the given network, especially for large real networks. To validate the result of a community mining algorithm, three approaches are available: external evaluation, internal evaluation and relative evaluation, which are described in the following.
2.1 Evaluation approaches
2.1.1 External evaluation
External evaluation involves comparing the discovered clustering with a prespecified structure, often called the ground truth, using a clustering agreement measure such as Jaccard, Adjusted Rand Index or Normalized Mutual Information. In the case of attribute-based data, clustering similarity measures are not only used for evaluation, but also applied to determine the number of clusters in a data set or to combine different clustering results and obtain a consensus clustering, i.e. ensemble clustering (Vinh et al. 2010). In the interrelated data context, these measures are commonly used for external evaluation of community mining algorithms, where the performance of an algorithm is examined on standard benchmarks for which the true communities are known (Chen et al. 2009; Danon et al. 2005; Lancichinetti and Fortunato 2009; Orman et al. 2011). There are few, and typically small, real-world benchmarks with known communities available for external evaluation of community mining algorithms, while the current generators used for synthesizing benchmarks with built-in ground truth overlook some characteristics of real networks (Orman and Labatut 2010). Moreover, in a real-world application the interesting communities that need to be discovered are hidden in the structure of the network; thus, the discovered communities cannot be validated based on external evaluation. These facts motivate investigating the two alternative approaches: internal and relative evaluation. Before describing these evaluation approaches, we first elaborate on the synthetic benchmark generators and the studies that used the external evaluation approach.
To synthesize networks with built-in ground truth, several generators have been proposed. The GN benchmark (Girvan and Newman 2002) is the first synthetic network generator. This benchmark is a graph with 128 nodes of expected degree 16, divided into four groups of equal size, where the probabilities of the existence of a link between a pair of nodes of the same group and of different groups are z_{in} and 1 − z_{in}, respectively. However, the same expected degree for all nodes and equal-size communities do not accord with real social network properties. The LFR benchmark (Lancichinetti et al. 2008) amends the GN benchmark by considering power law distributions for degrees and community sizes. Similar to the GN benchmark, each node shares a fraction 1 − μ of its links with the other nodes of its community and a fraction μ with the other nodes of the network. However, having the same mixing parameter μ for all nodes and not satisfying the densification power laws and heavy-tailed distributions are the main drawbacks of this benchmark.
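A GN-style benchmark can be approximated with `networkx`'s planted partition generator; this sketch is not the generator used in the paper, and the particular split of the expected degree 16 into internal and external links (z_in = 12, z_out = 4) is our own choice for illustration:

```python
import networkx as nx

# GN-style benchmark: 128 nodes in 4 equal groups of 32, expected degree 16.
# Each node gets z_in links inside its group and z_out = 16 - z_in outside.
z_in, z_out = 12, 4
p_in = z_in / 31    # 31 possible within-group partners per node
p_out = z_out / 96  # 96 possible out-of-group partners per node

G = nx.planted_partition_graph(4, 32, p_in, p_out, seed=42)
```

The built-in ground truth is the four planted blocks, available via `G.graph["partition"]` in networkx's implementation.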
Apart from the many papers that used external evaluation to assess the performance of their proposed algorithms, there are recent studies specifically on the comparison of different community mining algorithms using the external evaluation approach. Gustafsson et al. (2006) compare hierarchical and k-means community mining on real networks and also on synthetic networks generated by the GN benchmark. Lancichinetti and Fortunato (2009) compare a total of a dozen community mining algorithms, where the performance of the algorithms is compared on networks generated by both the GN and LFR benchmarks. Orman et al. (2011) compare a total of five community mining algorithms on synthetic networks generated by the LFR benchmark. They first assess the quality of the different algorithms by their difference from the ground truth. Then, they perform a qualitative analysis of the identified communities by comparing their size distribution with the community size distribution of the ground truth. All these works borrow clustering agreement measures from the traditional clustering literature. In this article we overview different agreement measures and also provide an alternative measure adapted specifically for clustering of interrelated data.
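The standard agreement measures borrowed from the clustering literature are readily available in `scikit-learn`; a minimal sketch with hypothetical label vectors (one label per node):

```python
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

# Hypothetical ground-truth and discovered community labels, one per node
truth = [0, 0, 0, 1, 1, 1, 2, 2, 2]
found = [0, 0, 1, 1, 1, 1, 2, 2, 2]

ari = adjusted_rand_score(truth, found)
nmi = normalized_mutual_info_score(truth, found)
```

Both measures are invariant to permutations of the label values, so only the grouping matters, not the cluster ids.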
2.1.2 Internal and relative evaluation
Internal evaluation techniques verify whether the clustering structure produced by a clustering algorithm matches the underlying structure of the data, using only information inherent in the data. These techniques are based on an internal criterion that measures the correlation between the discovered clustering structure and the structure of the data, represented as a proximity matrix: a square matrix in which the entry in cell (i, j) is some measure of the similarity (or distance) between items i and j. The significance of this correlation is examined statistically based on the distribution of the defined criterion, which is usually not known and is estimated using the Monte Carlo sampling method (Theodoridis and Koutroumbas 2009). An internal criterion can also be considered as a quality index to compare different clusterings, which overlaps with relative evaluation techniques. The well-known modularity of Newman (2006) can be considered as such; it is used both to validate a single community mining result and to compare different community mining results (Clauset 2005; Rosvall and Bergstrom 2007). Modularity is defined as the fraction of edges within communities, i.e. the correlation of the adjacency matrix and the clustering structure, minus the expected value of this fraction computed based on the configuration model (Newman 2006). Another work that could be considered in this class is the evaluation of different community mining algorithms studied in Leskovec et al. (2010), who propose the network community profile (NCP) that characterizes the quality of communities as a function of their size. The quality of a community at each size is characterized by the notion of conductance, which is the ratio between the number of edges leaving the community and the number of edge endpoints inside it. They then compare the shape of the NCP for different algorithms over random and real networks.
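Conductance is available directly in `networkx` (as cut size over the smaller of the two edge volumes); the candidate community below is a hypothetical node set chosen only for illustration:

```python
import networkx as nx

G = nx.karate_club_graph()
S = {0, 1, 2, 3, 7, 13}  # a hypothetical candidate community

# conductance = cut(S) / min(vol(S), vol(V \ S)); lower is better
phi = nx.conductance(G, S)
```

Tighter communities have fewer boundary edges relative to their internal edge volume, hence lower conductance.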
Relative evaluation compares alternative clustering structures based on an objective function or quality index. This evaluation approach is the least explored in the community mining context. Defining an objective function to evaluate community mining is non-trivial. Aside from the subjective nature of the community mining task, there is no formal definition of the term community. Consequently, there is no consensus on how to measure the “goodness” of the communities discovered by a mining algorithm. Nevertheless, the well-studied clustering methods in the Machine Learning field are subject to similar issues, and yet an extensive set of validity criteria has been defined for clustering evaluation, such as the Davies–Bouldin index (Davies and Bouldin 1979), the Dunn index (Dunn 1974) and Silhouette (Rousseeuw 1987); for a recent survey refer to Vendramin et al. (2010). In the next section, we describe how these criteria can be adapted to the context of community mining to compare the results of different community mining algorithms. These criteria can also be used as alternatives to modularity in designing novel community mining algorithms.
3 Evaluation of community mining results
In this section, we elaborate on how to evaluate the results of a community mining algorithm based on external and relative evaluation. External evaluation of community mining results involves comparing the discovered communities with a prespecified community structure, often called the ground truth, using a clustering agreement measure, while relative evaluation ranks different alternative community structures based on an objective function, i.e. a quality index (Theodoridis and Koutroumbas 2009). To be consistent with the terms used for attribute-based data, we use clustering to refer to the result of any community mining algorithm, and partitioning to refer to the case where the communities are mutually exclusive. Note that in this study we focus only on non-overlapping community mining algorithms, which always produce disjoint communities. Thus, in the definitions of the following quality criteria and agreement measures, partitioning is used instead of clustering, which implies that they are only applicable in the case of mutually exclusive communities. In the rest of this section, we first overview relative community quality criteria and then describe different clustering agreement measures.
3.1 Community quality criteria
Here, we overview several validity criteria that can be used as relative indexes for comparing and evaluating different partitionings of a given network. All of these criteria are generalized from well-known clustering criteria, which are originally defined with the implicit assumption that data points consist of vectors of attributes. Consequently, their definitions are mostly integrated or mixed with the definition of the distance measure between data points. The commonly used distance measure is the Euclidean distance, which is not defined for graphs. Therefore, we first review different possible proximity measures that can be used in graphs. Then, we present generalizations of the criteria that can use any notion of proximity.
3.1.1 Proximity between nodes
Let A denote the adjacency matrix of the graph, and let A_{ij} be the weight of the edge between nodes n_{i} and n_{j}. The proximity between n_{i} and n_{j}, p_{ij} = p(i, j), can be computed by one of the following distance or similarity measures. The latter is more typical in the context of interrelated data; therefore, we tried to plug similarities into the relative criteria definitions. When this is not straightforward, we used the inverse of the similarity index to obtain the corresponding dissimilarity/distance. To avoid division by zero when p_{ij} is zero, ε is returned if it is a similarity and 1/ε if it is a distance, where ε is a very small number, e.g. 10^{−9}.
Shortest Path (SP) distance between two nodes is the length of the shortest path between them, which can be computed using the well-known Dijkstra's shortest path algorithm.
The corresponding distance is derived as d_{ij}^{NO} = 1 − p_{ij}^{NO}.
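Of the proximity measures above, the shortest path distance is the simplest to compute in practice; a sketch with `networkx` (hop count, ignoring edge weights, on an arbitrary example graph):

```python
import networkx as nx

G = nx.karate_club_graph()

# Shortest Path (SP) distance between two nodes; with edge weights,
# nx.shortest_path_length(G, u, v, weight="weight") runs Dijkstra instead
d_sp = nx.shortest_path_length(G, source=0, target=33)
```

For criteria that need all pairwise proximities, `nx.all_pairs_shortest_path_length(G)` computes the full distance table in one pass.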
3.1.2 Community centroid
In addition to the notion of proximity measure, most of the cluster validity criteria use averaging between the numerical data points to determine the centroid of a cluster. The averaging is not defined for nodes in a graph; therefore, we modify the criteria definitions to use a generalized centroid notion, in a way that if the centroid is set as averaging, we would obtain the original criteria definitions, but we could also use other alternative notions for centroid of a group of data points.
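One natural instantiation of such a generalized centroid is the medoid, i.e. the member node minimizing the total distance to the rest of its group; a minimal sketch (the helper `medoid` is our own illustrative function, not the paper's notation):

```python
import networkx as nx

def medoid(members, dist):
    """Generalized centroid for a graph: the member that minimizes
    the total distance to all other members of the group."""
    members = list(members)
    return min(members, key=lambda i: sum(dist(i, j) for j in members))

G = nx.karate_club_graph()
sp = dict(nx.all_pairs_shortest_path_length(G))  # precomputed SP distances
center = medoid([0, 1, 2, 3, 7], lambda i, j: sp[i][j])
```

If the data points were attribute vectors and `dist` the Euclidean distance, replacing `min` over members with the coordinate-wise mean would recover the original centroid and hence the original criteria definitions.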
3.1.3 Relative validity criteria
Here, we present our generalizations of well-known clustering validity criteria defined as quality measures for internal or relative evaluation of clustering results. All these criteria are originally defined based on distances between data points, which in all cases is the Euclidean or another inner product norm of the difference between their vectors of attributes; refer to Vendramin et al. (2010) for a comparative analysis of these criteria in the clustering context. We alter the formulae to use a generalized distance, so that we can plug in our graph proximity measures. The other alteration is generalizing the mean over data points to a general centroid notion, which can be set to averaging in the presence of attributes, and to the medoid in our case of dealing with graphs in the absence of attributes.
In a nutshell, in every criterion the average of points in a cluster is replaced with a generalized notion of centroid, and distances between data points are generalized from the Euclidean norm to a generic distance. Consider a partitioning C = {C_{1}, C_{2}, ..., C_{k}} of N data points, where \(\overline{C}\) denotes the (generalized) centroid of the data points belonging to C and d(i, j) denotes the (generalized) distance between point i and point j. The quality of C can be measured using one of the following criteria.
The original clustering formula proposed by Calinski and Harabasz (1974) for attribute vectors is obtained if the centroid is fixed to the average of the vectors of attributes and the distance to the (square of the) Euclidean distance. Here we use this formula with one of the proximity measures mentioned in the previous section; if it is a similarity measure, we either transform the similarity to its distance form and apply the above formula, or we use it directly as a similarity and invert the within/between ratio while keeping the normalization; the latter approach is distinguished in the experiments as VRC′.
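The structure of the generalized VRC (between-cluster dispersion over within-cluster dispersion, with medoids standing in for averaged centroids) can be sketched as follows; this is our own rough rendering under stated assumptions (squared shortest-path distances, medoid centroids), and the paper's exact normalization conventions may differ:

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

def vrc(communities, dist):
    """Sketch of a generalized variance ratio criterion: between-cluster
    dispersion over within-cluster dispersion, normalized by degrees of
    freedom, with medoids as the generalized centroids."""
    def medoid(members):
        return min(members, key=lambda i: sum(dist(i, j) for j in members))
    nodes = [v for c in communities for v in c]
    n, k = len(nodes), len(communities)
    g = medoid(nodes)  # overall generalized centroid
    W = B = 0.0
    for c in communities:
        m = medoid(list(c))
        W += sum(dist(i, m) ** 2 for i in c)  # within-cluster dispersion
        B += len(c) * dist(m, g) ** 2         # between-cluster dispersion
    return (B / (k - 1)) / (W / (n - k))

G = nx.karate_club_graph()
sp = dict(nx.all_pairs_shortest_path_length(G))
parts = [list(c) for c in greedy_modularity_communities(G)]
score = vrc(parts, lambda i, j: sp[i][j])
```

Larger values indicate partitionings whose clusters are compact relative to their separation, as in the original attribute-based index.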
Similar to DB, if used directly with a similarity proximity measure, we change the min to max and the final criterion becomes a minimizer instead of maximizer, which is denoted by (A)SWC′.
Again similar to DB, here also if used directly with a similarity measure, we change the max to min and consider the final criterion as a minimizer instead of maximizer, which is denoted by PBM′.
The minθ/maxθ is computed by summing the \(\varTheta\) smallest/largest distances between every two points, where \(\varTheta = \frac{1}{2}\sum_{l=1}^k{|C_l| (|C_l|-1)}.\)
C-Index can be directly used with a similarity measure as a maximization criterion, whereas with a distance measure it is a minimizer. This is also true for the two following criteria.
3.1.4 Computational complexity
The computational complexity of the different clustering validity criteria is provided in the previous work by Vendramin et al. (2010). For the adapted criteria, the time complexity of the indexes is affected by the cost of the chosen proximity measure. All the proximity measures introduced here can be computed in linear time, \(\mathcal{O}(n),\) except for A (adjacency), which is \(\mathcal{O}(1),\) NP (number of paths), which is \(\mathcal{O}(n^2),\) and IC (Icloseness), which is \(\mathcal{O}(E)\). However, for sparse graphs and using a proper graph data structure such as an incidence list, this complexity can be reduced to \(\mathcal{O}(\hat{d}),\) where \(\hat{d}\) is the average degree in the network, i.e. the average number of neighbours of a node. For example, let us revisit the formula for AR (adjacency relation): \(d^{\rm{AR}}_{ij} = \sqrt{\sum_{k \neq i,j}{ (A_{ik} - A_{jk})^2}}.\) In this formula we can change \(\sum_{k}\) to \(\sum_{k\in \aleph_i\cup\aleph_j}\), since the expression (A_{ik} − A_{jk})^{2} is zero for all other values of k, i.e. for nodes that are neighbours of neither i nor j and, therefore, have A_{ik} = A_{jk} = 0. The same trick can be applied to the other proximity measures.
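The sparse-graph trick above amounts to summing only over the union of the two neighbourhoods; a minimal sketch (the helper name `ar_distance` is ours):

```python
import math
import networkx as nx

def ar_distance(G, i, j):
    """Adjacency-relation (AR) distance, summing only over the union of
    the neighbourhoods of i and j: all other terms are zero."""
    def A(u, v):
        return G[u][v].get("weight", 1) if G.has_edge(u, v) else 0
    ks = (set(G[i]) | set(G[j])) - {i, j}  # O(d_i + d_j) instead of O(n)
    return math.sqrt(sum((A(i, k) - A(j, k)) ** 2 for k in ks))

G = nx.karate_club_graph()
d = ar_distance(G, 0, 1)
```

Restricting the sum makes each distance evaluation proportional to the two nodes' degrees rather than to the size of the whole network.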
The other cost that should be considered is the cost of computing the medoid of m data points, which is \(\mathcal{O}(p m^2),\) where p is the cost of the proximity measure. Therefore, the VRC criterion, which requires computing the overall centroid, is of order \(\mathcal{O}(p n^2).\) In contrast, the VRC for traditional clustering is linear with respect to the size of the dataset, since it uses averaging for computing the centroid, which is \(\mathcal{O}(n)\). Similarly, any other measure that requires computing all the pairwise distances will be \(\Omega(p n^2)\). This holds for the adapted Dunn index, which is of order \(\mathcal{O}(p n^2)\), because finding the minimum distance between any two clusters requires computing the distances between all pairs of nodes. Similarly, the ZIndex computes all the pairwise distances and is of order \(\mathcal{O}(p n^2).\) The same also holds for the PB. The CIndex is even more expensive, since it not only computes all the pairwise distances but also sorts them, and hence is of order \(\mathcal{O}(n^2 (p + \log n)).\) These orders (except for VRC) are in line with the computational complexities previously reported in Vendramin et al. (2010), where the cost p corresponds to the size of the feature vectors.
The adapted DB and PBM, on the other hand, require computing neither the medoid of the whole dataset nor all pairwise distances. Instead, they only compute the medoid of each cluster, which is \(\Omega(pk \hat{m}^2),\) where k is the number of clusters and \(\hat{m}\) is the average size of the clusters. Consequently, this term is added to the complexity of these criteria, giving them the order of \(\mathcal{O}(p(n+k^2+k\hat{m}^2)).\) Finally, for the silhouette criterion, the (A)SWC0 variant that uses the average distance has order \(\mathcal{O}(p n^2)\); however, the order for (A)SWC1 simplifies to \(\mathcal{O}(kp(n+\hat{m}^2))\), since it uses the distance to the centroid instead of averaging. The latter is similar to the order for modularity Q, which is \(\mathcal{O}(k(n+\hat{m}^2))\). To sum up, none of the adapted criteria is significantly superior or inferior in terms of its order; therefore, one should focus on which criterion is more appropriate according to its performance, which is demonstrated in the experiments.
3.2 Clustering agreement measures
Here, we formally review different well-studied partitioning agreement measures used in the external evaluation of clustering results. Consider two different partitionings U and V of the data points in D. There are several measures to examine the agreement between U and V, originally introduced in the Machine Learning field. These measures assume that the partitionings are disjoint and cover the dataset. More formally, suppose D consists of n data items, \(D = \{d_1,d_2,d_3,\dots,d_n\}\), and let U = {U_{1}, U_{2}, ..., U_{k}} denote the k clusters in U; then \(D = \cup_{i=1}^{k}U_{i}\) and U_{i} ∩ U_{j} = ∅ for all i ≠ j.
3.2.1 Pair counting-based measures
Given partitionings U and V, every pair of data points falls into one of four categories: M_{11} (in the same cluster in both U and V), M_{10} (same in U, different in V), M_{01} (different in U, same in V) and M_{00} (in different clusters in both).

U \ V | Same | Different |
---|---|---|
Same | M_{11} | M_{10} |
Different | M_{01} | M_{00} |
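The four pair counts can be computed directly from two label vectors; a small self-contained sketch, with the Rand index (the simplest measure built from them) shown at the end:

```python
from itertools import combinations

def pair_counts(U, V):
    """Count node pairs by whether they share a cluster in U and in V."""
    M11 = M10 = M01 = M00 = 0
    for i, j in combinations(range(len(U)), 2):
        same_u, same_v = U[i] == U[j], V[i] == V[j]
        if same_u and same_v:
            M11 += 1
        elif same_u:
            M10 += 1
        elif same_v:
            M01 += 1
        else:
            M00 += 1
    return M11, M10, M01, M00

# Example label vectors: U = {0,1,2},{3,4,5}; V = {0,1},{2,3,4,5}
U = [0, 0, 0, 1, 1, 1]
V = [0, 0, 1, 1, 1, 1]
M11, M10, M01, M00 = pair_counts(U, V)

# Rand index: fraction of pairs on which the two partitionings agree
rand = (M11 + M00) / (M11 + M10 + M01 + M00)
```

The quadratic loop is fine for illustration; in practice these counts are derived in closed form from the contingency table.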
The contingency table of U and V, where n_{ij} denotes the number of data items in both U_{i} and V_{j}:

U \ V | V_{1} | V_{2} | \(\dots\) | V_{r} | Sums |
---|---|---|---|---|---|
U_{1} | n_{11} | n_{12} | \(\dots\) | n_{1r} | n_{1.} |
U_{2} | n_{21} | n_{22} | \(\dots\) | n_{2r} | n_{2.} |
\(\vdots\) | \(\vdots\) | \(\vdots\) | \(\ddots\) | \(\vdots\) | \(\vdots\) |
U_{k} | n_{k1} | n_{k2} | \(\dots\) | n_{kr} | n_{k.} |
Sums | n_{.1} | n_{.2} | \(\dots\) | n_{.r} | n |
These pair counts have been used to define a variety of different clustering agreement measures. In the following, we briefly explain the most common pair counting measures; the reader can refer to Albatineh et al. (2006) for a recent survey.
The parameter β indicates how much more important recall is than precision. The two common values for β are 2 and 0.5; the former weights recall higher than precision, while the latter weights precision more.
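The standard F_β formula, the weighted harmonic mean of precision and recall, makes the effect of β concrete (the helper name `f_beta` is ours):

```python
def f_beta(precision, recall, beta):
    """F_beta: weighted harmonic mean of precision and recall.
    beta > 1 weights recall more; beta < 1 weights precision more."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# With low precision and high recall, beta = 2 is forgiving, beta = 0.5 is not
hi_recall = f_beta(0.5, 1.0, beta=2)
lo_recall = f_beta(0.5, 1.0, beta=0.5)
```

With β = 1 this reduces to the plain harmonic mean, the familiar F1 score.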
Vinh et al. (2010) discussed another important property that a proper clustering agreement measure should comply with: correction for chance, i.e. adjusting the agreement index so that its expected value for agreements no better than random becomes a constant, e.g. 0. As an example, suppose the agreement between a clustering and the ground truth is measured as 0.7 using an unadjusted index, i.e. a measure without a constant baseline, whose baseline may be 0.6 in one setting or 0.2 in another; this 0.7 value cannot be interpreted as strong or weak agreement without knowing the baseline.
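The effect of correction for chance is easy to demonstrate: against purely random labels, the unadjusted Rand index stays well above zero while its adjusted counterpart hovers near zero. A sketch with `scikit-learn` (the 4-cluster setup is an arbitrary example):

```python
import random
from sklearn.metrics import adjusted_rand_score, rand_score

random.seed(0)
truth = [i % 4 for i in range(200)]               # 4 balanced true clusters
noise = [random.randrange(4) for _ in range(200)]  # uniformly random labels

ri = rand_score(truth, noise)            # unadjusted: misleadingly high
ari = adjusted_rand_score(truth, noise)  # adjusted: close to 0 for random labels
```

This is why adjusted indexes such as ARI and AMI are preferred when comparing results across settings with different numbers or sizes of clusters.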
3.2.2 Graph agreement measures
The result of a community mining algorithm is a set of sub-graphs. To also take the structure of these sub-graphs into account in the agreement measure, we first define a weighted version of these measures, in which nodes with more importance affect the agreement measure more. Second, we alter the measures to directly assess the structural similarity of these sub-graphs by focusing on the edges instead of the nodes.
4 Comparison methodology and results
In this section, we first describe our experimental settings. Then, we examine the behaviour of different external indexes in comparing different community mining results. Next, we report the performance of the proposed community quality criteria in relative evaluation of communities.
4.1 Experiment settings
We have used three sets of benchmarks as our datasets: Real, GN and LFR. The Real dataset consists of four well-known real-world benchmarks: the Karate Club (weighted) by Zachary (Zachary 1977), the Sawmill Strike data-set (Nooy et al. 2004), the NCAA Football Bowl Subdivision (Girvan and Newman 2002), and Politician Books from Amazon (Krebs 2004). The GN and LFR datasets each include 10 realizations of the GN and LFR synthetic benchmarks (Lancichinetti et al. 2008), which are widely used for community mining evaluation.
For each graph in our datasets, we generate different partitionings to sample the space of all possible partitionings. To do so, given the perfect partitioning, we generate different randomized versions of the true partitioning by randomly merging and splitting communities and swapping nodes between them. The sampling procedure is described in more detail in the supplementary materials. The obtained samples cover the partitioning space, ranging from very poor to perfect partitionings.
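One step of such a randomization can be sketched as follows; this illustrates only the node-swapping step (the full procedure in the supplementary materials also merges and splits communities), and the helper name `perturb` is ours:

```python
import random

def perturb(labels, n_swaps, seed=None):
    """Randomize a ground-truth partitioning by swapping the community
    labels of randomly chosen node pairs; more swaps yield samples
    farther from the perfect partitioning."""
    rng = random.Random(seed)
    out = list(labels)
    for _ in range(n_swaps):
        i, j = rng.randrange(len(out)), rng.randrange(len(out))
        out[i], out[j] = out[j], out[i]
    return out

# Samples from near-perfect to heavily randomized, for a toy 2-community truth
truth = [0] * 16 + [1] * 16
samples = [perturb(truth, s, seed=s) for s in (0, 4, 16, 64)]
```

Evaluating an external index on such a graded set of samples is what produces the agreement-vs-perturbation curves discussed below.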
4.2 Agreement indexes experiments
Here we first examine two desired properties for general clustering agreement indexes, and then we illustrate these properties in our adapted indexes for graphs.
4.2.1 Bias of unadjusted indexes
4.2.2 Knee shape
Correlation between external indexes averaged for datasets of Fig. 3, computed based on Spearman’s Correlation
Index | ARI | Rand | NMI | VI | Jaccard | AMI | F_{β=2} |
---|---|---|---|---|---|---|---|
ARI | 1 | 0.73 ± 0.18 | 0.67 ± 0.07 | −0.80 ± 0.17 | 0.85 ± 0.08 | 0.76 ± 0.15 | 0.64 ± 0.16 |
Rand | 0.73 ± 0.18 | 1 | 0.83 ± 0.12 | −0.46 ± 0.42 | 0.41 ± 0.32 | 0.71 ± 0.11 | 0.13 ± 0.46 |
NMI | 0.67 ± 0.07 | 0.83 ± 0.12 | 1 | −0.43 ± 0.27 | 0.31 ± 0.17 | 0.93 ± 0.07 | 0.04 ± 0.10 |
VI | −0.80 ± 0.17 | −0.46 ± 0.42 | −0.43 ± 0.27 | 1 | −0.93 ± 0.02 | −0.54 ± 0.27 | −0.82 ± 0.21 |
Jaccard | 0.85 ± 0.08 | 0.41 ± 0.32 | 0.31 ± 0.17 | −0.93 ± 0.02 | 1 | 0.46 ± 0.28 | 0.90 ± 0.13 |
AMI | 0.76 ± 0.15 | 0.71 ± 0.11 | 0.93 ± 0.07 | −0.54 ± 0.27 | 0.46 ± 0.28 | 1 | 0.25 ± 0.13 |
F_{β=2} | 0.64 ± 0.16 | 0.13 ± 0.46 | 0.04 ± 0.10 | −0.82 ± 0.21 | 0.90 ± 0.13 | 0.25 ± 0.13 | 1 |
There are different ways to compute the correlation between two vectors. The classic options are the Pearson product-moment coefficient and Spearman's rank correlation coefficient. The reported results in our experiments are based on Spearman's correlation, since we are interested in the correlation of the rankings that an index provides for different partitionings, not in the actual values of that index. However, the reported results mostly agree with the results obtained using Pearson correlation, which are reported in the supplementary materials available from: http://cs.ualberta.ca/~rabbanyk/criteriaComparison.
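A small sketch of the rank-based comparison with `scipy` (the two value vectors are hypothetical criterion and agreement scores for five sampled partitionings):

```python
from scipy.stats import spearmanr

# Hypothetical criterion values and external agreement for five partitionings;
# the second and third partitionings are ranked in opposite order by the two
quality   = [0.91, 0.80, 0.64, 0.42, 0.10]
agreement = [0.95, 0.70, 0.75, 0.30, 0.05]

rho, _ = spearmanr(quality, agreement)
```

Because Spearman's ρ operates on ranks, monotonically rescaling either vector leaves ρ unchanged, which is exactly the invariance wanted when comparing indexes with different value ranges.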
4.2.3 Graph partitioning agreement indexes
Correlation between adapted external indexes on the karate and strike datasets, computed using Spearman's correlation
Index | ARI | ξ | \(\eta_{w_i=d_i}\) | \(\eta_{w_i=t_i}\) | \(\eta_{w_i=c_i}\) | NMI |
---|---|---|---|---|---|---|
ARI | 1 ± 0 | 0.571 ± 0.142 | 0.956 ± 0.031 | 0.819 ± 0.135 | 0.838 ± 0.087 | 0.736 ± 0.096 |
ξ | 0.571 ± 0.142 | 1 ± 0 | 0.623 ± 0.133 | 0.572 ± 0.169 | 0.45 ± 0.109 | 0.497 ± 0.2 |
\(\eta_{w_i=d_i}\) | 0.956 ± 0.031 | 0.623 ± 0.133 | 1 ± 0 | 0.876 ± 0.097 | 0.777 ± 0.106 | 0.787 ± 0.094 |
\(\eta_{w_i=t_i}\) | 0.819 ± 0.135 | 0.572 ± 0.169 | 0.876 ± 0.097 | 1 ± 0 | 0.848 ± 0.056 | 0.759 ± 0.107 |
\(\eta_{w_i=c_i}\) | 0.838 ± 0.087 | 0.45 ± 0.109 | 0.777 ± 0.106 | 0.848 ± 0.056 | 1 ± 0 | 0.6 ± 0.064 |
NMI | 0.736 ± 0.096 | 0.497 ± 0.2 | 0.787 ± 0.094 | 0.759 ± 0.107 | 0.6 ± 0.064 | 1 ± 0 |
In the following we compare the performance of different quality indexes, defined in Sect. 3.1, in relative evaluation of clustering results.
4.3 Quality indexes experiments
The performance of a criterion can be examined by how well it ranks different partitionings of a given dataset. More formally, consider a dataset d with a set of m different possible partitionings: P(d) = {p_{1}, p_{2}, ..., p_{m}}. The performance of criterion c on dataset d is then determined by how well its values, I_{c}(d) = {c(p_{1}), c(p_{2}), ..., c(p_{m})}, correlate with the "goodness" of these partitionings. Assuming that the true partitioning (i.e. ground truth) p^{*} is known for dataset d, the "goodness" of a partitioning p_{i} can be determined using a partitioning agreement measure a. Hence, for dataset d with the set of possible partitionings P(d), the external evaluation provides E(d) = {a(p_{1}, p^{*}), a(p_{2}, p^{*}), ..., a(p_{m}, p^{*})}, where a(p_{i}, p^{*}) denotes the "goodness" of partitioning p_{i} compared to the ground truth. The performance score of criterion c on dataset d is then the correlation between its values I_{c}(d) and the external evaluation values E(d) over the different partitionings. Finally, the criteria are ranked based on their average performance score over a set of datasets. The following procedure summarizes our comparison approach.
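The comparison procedure can be sketched as follows (illustrative helper names; ties in the rankings are ignored for brevity):

```python
def rank(v):
    """Positions of the values of v in sorted order (no tie handling)."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0] * len(v)
    for pos, i in enumerate(order):
        r[i] = pos
    return r

def spearman(x, y):
    """Spearman's rho via the squared rank differences (ties ignored)."""
    rx, ry = rank(x), rank(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

def criterion_score(criterion, agreement, samples, truth):
    """Performance of relative criterion c on one dataset: the rank
    correlation between its values I_c and the external evaluation E."""
    I_c = [criterion(p) for p in samples]           # relative evaluation
    E = [agreement(p, truth) for p in samples]      # external evaluation
    return spearman(I_c, E)
```

As a sanity check, a criterion whose ranking of the samples coincides with the external agreement scores obtains a performance score of exactly 1.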
4.3.1 Results on real-world datasets
Statistics for sample partitionings of each real-world dataset
Dataset | K^{*} | # | \(\overline{K}\) | \(\overline{\rm{ARI}}\) |
---|---|---|---|---|
Strike | 3 | 100 | 3.2 ± 1.08 ∈ [2,7] | 0.45 ± 0.27 ∈ [0.01,1] |
Polbooks | 3 | 100 | 4.36 ± 1.73 ∈ [2,9] | 0.43 ± 0.2 ∈ [0.03,1] |
Karate | 2 | 100 | 3.82 ± 1.51 ∈ [2,7] | 0.29 ± 0.26 ∈ [−0.04,1] |
Football | 11 | 100 | 12.04 ± 4.8 ∈ [4,25] | 0.55 ± 0.22 ∈ [0.16,1] |
Overall ranking of criteria on the real world datasets, based on the average Spearman’s correlation of criteria with the ARI external index, ARI_{corr}
Rank | Criterion | ARI_{corr} | Rand | Jaccard | NMI | AMI |
---|---|---|---|---|---|---|
1 | ZIndex’ TO | 0.925 ± 0.018 | 9 | 148 | 9 | 7 |
2 | ZIndex’ \(\widehat{PC}\) | 0.923 ± 0.012 | 2 | 197 | 2 | 2 |
3 | ZIndex’ \(\widehat{NPC}\) | 0.923 ± 0.012 | 3 | 198 | 1 | 1 |
4 | ZIndex’ IC2 | 0.922 ± 0.024 | 8 | 182 | 5 | 3 |
5 | ZIndex’ \(\widehat{TO}\) | 0.922 ± 0.016 | 10 | 153 | 8 | 8 |
6 | ZIndex’ \(\widehat{NPO}\) | 0.921 ± 0.014 | 6 | 204 | 3 | 4 |
7 | ZIndex’ ICV2 | 0.919 ± 0.04 | 18 | 163 | 12 | 10 |
8 | ZIndex’ PC | 0.918 ± 0.018 | 4 | 207 | 10 | 11 |
9 | ZIndex’ IC3 | 0.918 ± 0.039 | 19 | 165 | 15 | 12 |
10 | ZIndex’ \(\widehat{NOV}\) | 0.915 ± 0.014 | 11 | 213 | 6 | 9 |
11 | ZIndex’ IC1 | 0.912 ± 0.02 | 5 | 235 | 13 | 20 |
12 | ZIndex’ NPE2.0 | 0.911 ± 0.03 | 26 | 168 | 21 | 15 |
13 | ZIndex’ NOV | 0.91 ± 0.023 | 12 | 225 | 18 | 21 |
14 | ZIndex’ ICV1 | 0.91 ± 0.023 | 13 | 226 | 19 | 22 |
15 | ZIndex’ \(\widehat{NPE2.0}\) | 0.91 ± 0.025 | 23 | 184 | 22 | 19 |
16 | ZIndex’ NPL2.0 | 0.909 ± 0.02 | 24 | 202 | 14 | 13 |
17 | ZIndex’ M | 0.908 ± 0.028 | 25 | 149 | 26 | 23 |
18 | ZIndex’ ICV3 | 0.908 ± 0.057 | 29 | 176 | 28 | 25 |
19 | ZIndex’ NP2.0 | 0.907 ± 0.021 | 20 | 212 | 16 | 14 |
20 | ZIndex’ \(\widehat{NPL2.0}\) | 0.906 ± 0.022 | 21 | 216 | 17 | 17 |
21 | ZIndex’ \(\widehat{NP2.0}\) | 0.906 ± 0.022 | 22 | 217 | 20 | 18 |
22 | ZIndex’ \(\widehat{NO}\) | 0.905 ± 0.022 | 16 | 253 | 11 | 16 |
23 | ZIndex’ NO | 0.904 ± 0.034 | 7 | 250 | 23 | 31 |
24 | ZIndex’ \(\widehat{MM}\) | 0.903 ± 0.037 | 17 | 233 | 24 | 30 |
25 | CIndex SP | 0.9 ± 0.02 | 1 | 251 | 31 | 42 |
26 | ZIndex’ \(\widehat{NPL3.0}\) | 0.899 ± 0.032 | 30 | 200 | 27 | 24 |
27 | ZIndex’ \(\widehat{NP3.0}\) | 0.899 ± 0.033 | 33 | 196 | 29 | 27 |
28 | ZIndex’ \(\widehat{NPE3.0}\) | 0.899 ± 0.048 | 31 | 205 | 35 | 33 |
29 | ZIndex \(\widehat{AR}\) | 0.898 ± 0.035 | 14 | 264 | 30 | 36 |
30 | ZIndex’ NPE3.0 | 0.897 ± 0.052 | 35 | 187 | 39 | 34 |
31 | ZIndex’ NPL3.0 | 0.897 ± 0.038 | 36 | 170 | 32 | 28 |
32 | ZIndex SP | 0.895 ± 0.036 | 28 | 215 | 40 | 41 |
33 | ZIndex’ NP3.0 | 0.895 ± 0.039 | 37 | 166 | 34 | 29 |
34 | ZIndex AR | 0.895 ± 0.039 | 15 | 255 | 36 | 38 |
35 | ZIndex’ A | 0.894 ± 0.045 | 32 | 158 | 38 | 35 |
36 | ZIndex’ MD | 0.894 ± 0.048 | 34 | 179 | 33 | 32 |
37 | ZIndex’ \(\hat{A}\) | 0.891 ± 0.05 | 27 | 241 | 37 | 37 |
38 | Q | 0.878 ± 0.034 | 45 | 110 | 45 | 44 |
39 | CIndex’ NPE3.0 | 0.876 ± 0.054 | 43 | 9 | 4 | 6 |
40 | CIndex’ ICV3 | 0.869 ± 0.069 | 44 | 4 | 7 | 5 |
41 | CIndex AR | 0.864 ± 0.031 | 40 | 268 | 42 | 40 |
42 | CIndex \(\widehat{AR}\) | 0.861 ± 0.032 | 42 | 266 | 41 | 39 |
43 | CIndex’ \(\widehat{NPE3.0}\) | 0.858 ± 0.07 | 47 | 8 | 25 | 26 |
44 | ZIndex’ \(\widehat{MD}\) | 0.856 ± 0.101 | 38 | 323 | 43 | 45 |
45 | SWC0 IC1 | 0.847 ± 0.09 | 41 | 108 | 46 | 47 |
46 | SWC0 IC2 | 0.838 ± 0.092 | 49 | 11 | 50 | 49 |
47 | SWC0 NO | 0.837 ± 0.106 | 39 | 146 | 48 | 50 |
48 | SWC0 IC3 | 0.819 ± 0.104 | 57 | 7 | 58 | 52 |
49 | SWC0 NOV | 0.814 ± 0.094 | 52 | 26 | 54 | 56 |
50 | SWC0 ICV1 | 0.814 ± 0.094 | 53 | 27 | 55 | 57 |
Difficulty analysis of the results: rankings computed separately for partitionings near the optimal ground truth, medium far from it, and very far from it
Near optimal samples | ||||||
---|---|---|---|---|---|---|
Rank | Criterion | ARI_{corr} | Rand | Jaccard | NMI | AMI |
1 | ZIndex’ \(\widehat{NPC}\) | 0.851 ± 0.081 | 1 | 3 | 4 | 5 |
2 | ZIndex’ \(\widehat{PC}\) | 0.851 ± 0.081 | 2 | 4 | 3 | 3 |
3 | ZIndex SP | 0.847 ± 0.084 | 18 | 2 | 8 | 8 |
4 | ZIndex’ \(\widehat{NPO}\) | 0.845 ± 0.088 | 3 | 9 | 6 | 6 |
5 | DB ICV2 | 0.845 ± 0.065 | 30 | 1 | 31 | 30 |
6 | ZIndex’ \(\widehat{NPE3.0}\) | 0.842 ± 0.082 | 10 | 5 | 2 | 2 |
7 | ZIndex’ ICV3 | 0.839 ± 0.084 | 4 | 20 | 20 | 21 |
8 | ZIndex’ \(\widehat{NOV}\) | 0.835 ± 0.093 | 11 | 14 | 15 | 15 |
9 | ZIndex’ \(\widehat{TO}\) | 0.835 ± 0.09 | 9 | 10 | 7 | 7 |
10 | ZIndex’ \(\widehat{NPE2.0}\) | 0.834 ± 0.089 | 13 | 8 | 1 | 1 |
11 | ZIndex’ TO | 0.834 ± 0.089 | 7 | 16 | 11 | 11 |
12 | ZIndex’ IC2 | 0.834 ± 0.095 | 5 | 23 | 18 | 18 |
\(\vdots\) | ||||||
36 | ZIndex’ M | 0.763 ± 0.139 | 33 | 29 | 30 | 31 |
37 | Q | 0.762 ± 0.166 | 39 | 21 | 41 | 41 |
38 | DB ICV3 | 0.757 ± 0.126 | 37 | 35 | 38 | 36 |
39 | DB IC3 | 0.753 ± 0.176 | 35 | 36 | 39 | 39 |
40 | PB’ PC | 0.753 ± 0.289 | 45 | 26 | 71 | 71 |
Medium far samples | ||||||
1 | ZIndex’ TO | 0.775 ± 0.087 | 5 | 361 | 22 | 20 |
2 | ZIndex’ \(\widehat{TO}\) | 0.771 ± 0.091 | 6 | 386 | 19 | 17 |
3 | ZIndex’ IC3 | 0.768 ± 0.134 | 2 | 372 | 16 | 13 |
4 | ZIndex’ ICV2 | 0.766 ± 0.124 | 3 | 370 | 2 | 2 |
5 | ZIndex’ NPL3.0 | 0.762 ± 0.079 | 12 | 349 | 28 | 27 |
6 | ZIndex’ ICV3 | 0.757 ± 0.12 | 4 | 376 | 21 | 19 |
7 | ZIndex’ NP3.0 | 0.756 ± 0.085 | 15 | 354 | 29 | 28 |
8 | ZIndex’ \(\widehat{PC}\) | 0.755 ± 0.122 | 9 | 417 | 4 | 4 |
9 | ZIndex’ \(\widehat{NPC}\) | 0.755 ± 0.122 | 11 | 418 | 3 | 3 |
10 | ZIndex’ NPE2.0 | 0.753 ± 0.107 | 10 | 373 | 14 | 14 |
11 | ZIndex’ NPE3.0 | 0.746 ± 0.093 | 8 | 369 | 24 | 24 |
12 | ZIndex’ \(\widehat{NPO}\) | 0.744 ± 0.123 | 14 | 437 | 5 | 5 |
\(\vdots\) | ||||||
29 | ZIndex’ \(\widehat{MM}\) | 0.694 ± 0.168 | 31 | 458 | 40 | 32 |
30 | Q | 0.69 ± 0.151 | 58 | 70 | 79 | 72 |
\(\vdots\) | ||||||
31 | ZIndex’ A | 0.69 ± 0.144 | 34 | 366 | 58 | 54 |
46 | PB’ PC | 0.623 ± 0.06 | 112 | 28 | 200 | 157 |
Far far samples | ||||||
1 | ZIndex’ ICV2 | 0.724 ± 0.066 | 36 | 520 | 4 | 9 |
2 | ZIndex’ IC3 | 0.72 ± 0.062 | 40 | 523 | 11 | 19 |
3 | ZIndex’ ICV3 | 0.717 ± 0.059 | 47 | 511 | 23 | 25 |
4 | ZIndex’ IC2 | 0.715 ± 0.072 | 35 | 540 | 3 | 6 |
5 | ZIndex’ TO | 0.706 ± 0.064 | 49 | 519 | 16 | 14 |
6 | ZIndex’ \(\widehat{NPO}\) | 0.704 ± 0.076 | 44 | 547 | 1 | 3 |
7 | ZIndex’ \(\widehat{TO}\) | 0.704 ± 0.062 | 51 | 522 | 13 | 5 |
8 | ZIndex’ NPE2.0 | 0.701 ± 0.057 | 55 | 505 | 15 | 7 |
9 | ZIndex’ \(\widehat{NPC}\) | 0.698 ± 0.083 | 45 | 552 | 6 | 10 |
10 | ZIndex’ \(\widehat{PC}\) | 0.697 ± 0.083 | 46 | 553 | 9 | 11 |
11 | ZIndex’ \(\widehat{NPE2.0}\) | 0.688 ± 0.047 | 57 | 521 | 24 | 23 |
12 | ZIndex’ NPL2.0 | 0.688 ± 0.072 | 58 | 529 | 12 | 4 |
\(\vdots\) | ||||||
30 | ZIndex’ IC1 | 0.655 ± 0.132 | 43 | 566 | 34 | 40 |
31 | ZIndex’ \(\widehat{NO}\) | 0.651 ± 0.106 | 52 | 567 | 22 | 26 |
32 | Q | 0.643 ± 0.033 | 86 | 444 | 50 | 45 |
33 | ZIndex’ NO | 0.638 ± 0.158 | 38 | 572 | 38 | 47 |
34 | ZIndex’ MD | 0.63 ± 0.099 | 78 | 513 | 43 | 41 |
\(\vdots\) | ||||||
117 | PB’ PC | 0.372 ± 0.126 | 197 | 170 | 159 | 129 |
4.3.2 Synthetic benchmark datasets
Overall ranking and difficulty analysis of the synthetic results
Overall results | ||||||
---|---|---|---|---|---|---|
Rank | Criterion | ARI_{corr} | Rand | Jaccard | NMI | AMI |
1 | ZIndex’ ICV2 | 0.96 ± 0.029 | 5 | 32 | 3 | 3 |
2 | ZIndex’ IC3 | 0.958 ± 0.028 | 4 | 42 | 2 | 2 |
3 | ZIndex’ IC2 | 0.958 ± 0.033 | 1 | 58 | 1 | 1 |
4 | ZIndex’ \(\widehat{PC}\) | 0.953 ± 0.04 | 3 | 78 | 6 | 6 |
5 | ZIndex’ \(\widehat{NPC}\) | 0.953 ± 0.04 | 2 | 79 | 7 | 7 |
6 | ZIndex’ ICV3 | 0.953 ± 0.027 | 8 | 44 | 4 | 5 |
7 | ZIndex’ \(\widehat{NPO}\) | 0.951 ± 0.041 | 6 | 83 | 9 | 9 |
8 | ZIndex’ \(\widehat{TO}\) | 0.949 ± 0.045 | 13 | 60 | 17 | 17 |
9 | ZIndex’ \(\widehat{NOV}\) | 0.949 ± 0.042 | 7 | 90 | 8 | 8 |
10 | ZIndex’ TO | 0.948 ± 0.046 | 16 | 50 | 21 | 21 |
11 | ZIndex’ PC | 0.947 ± 0.043 | 10 | 77 | 16 | 15 |
12 | ZIndex’ \(\widehat{NPE2.0}\) | 0.947 ± 0.042 | 11 | 68 | 13 | 13 |
13 | ZIndex’ NPE2.0 | 0.946 ± 0.043 | 17 | 51 | 20 | 20 |
14 | ZIndex’ NOV | 0.941 ± 0.047 | 14 | 95 | 18 | 18 |
15 | ZIndex’ ICV1 | 0.941 ± 0.047 | 15 | 96 | 19 | 19 |
\(\vdots\) | ||||||
29 | ZIndex’ NPL3.0 | 0.895 ± 0.072 | 31 | 121 | 38 | 37 |
30 | Q | 0.893 ± 0.046 | 33 | 33 | 26 | 22 |
31 | ZIndex’ NP3.0 | 0.89 ± 0.076 | 32 | 130 | 39 | 39 |
Near optimal results | ||||||
1 | ZIndex’ IC2 | 0.826 ± 0.227 | 2 | 10 | 4 | 6 |
2 | CIndex’ ICV2 | 0.822 ± 0.132 | 7 | 1 | 11 | 7 |
3 | ZIndex’ IC3 | 0.821 ± 0.232 | 1 | 16 | 5 | 9 |
4 | CIndex’ ICV3 | 0.818 ± 0.237 | 4 | 9 | 3 | 5 |
5 | ZIndex’ ICV2 | 0.816 ± 0.232 | 3 | 18 | 7 | 10 |
6 | ZIndex’ \(\hat{A}\) | 0.813 ± 0.225 | 5 | 19 | 2 | 2 |
7 | CIndex’ IC3 | 0.8 ± 0.2 | 31 | 2 | 13 | 8 |
8 | ZIndex’ A | 0.795 ± 0.177 | 30 | 20 | 6 | 4 |
9 | ZIndex’ \(\widehat{MM}\) | 0.794 ± 0.221 | 9 | 33 | 1 | 1 |
\(\vdots\) | ||||||
206 | SWC1’ \(\widehat{NO}\) | 0.591 ± 0.179 | 225 | 194 | 244 | 233 |
207 | Q | 0.589 ± 0.161 | 222 | 198 | 138 | 110 |
Medium far results | ||||||
1 | ZIndex’ ICV2 | 0.741 ± 0.177 | 4 | 231 | 22 | 22 |
2 | ZIndex’ IC2 | 0.738 ± 0.181 | 1 | 247 | 16 | 20 |
3 | ZIndex’ IC3 | 0.728 ± 0.188 | 5 | 252 | 18 | 21 |
4 | ZIndex’ ICV3 | 0.721 ± 0.177 | 8 | 258 | 21 | 23 |
5 | ZIndex’ \(\widehat{PC}\) | 0.719 ± 0.204 | 3 | 285 | 30 | 35 |
6 | ZIndex’ \(\widehat{NPC}\) | 0.719 ± 0.204 | 2 | 286 | 31 | 36 |
7 | CIndex’ ICV3 | 0.713 ± 0.151 | 28 | 21 | 33 | 27 |
8 | ZIndex’ \(\widehat{NPO}\) | 0.709 ± 0.205 | 7 | 278 | 32 | 38 |
9 | ZIndex’ \(\widehat{TO}\) | 0.703 ± 0.216 | 12 | 240 | 42 | 48 |
10 | ZIndex’ TO | 0.702 ± 0.217 | 14 | 239 | 45 | 53 |
\(\vdots\) | ||||||
37 | Q | 0.62 ± 0.139 | 42 | 167 | 56 | 47 |
Far far results | ||||||
1 | ZIndex’ ICV2 | 0.834 ± 0.062 | 9 | 464 | 5 | 3 |
2 | ZIndex’ IC3 | 0.832 ± 0.06 | 7 | 469 | 4 | 2 |
3 | ZIndex’ TO | 0.825 ± 0.098 | 22 | 423 | 29 | 27 |
4 | ZIndex’ ICV3 | 0.823 ± 0.063 | 12 | 458 | 6 | 6 |
5 | ZIndex’ \(\widehat{TO}\) | 0.823 ± 0.096 | 18 | 446 | 27 | 25 |
\(\vdots\) | ||||||
30 | ZIndex’ M | 0.638 ± 0.151 | 31 | 537 | 9 | 4 |
31 | Q | 0.581 ± 0.155 | 95 | 368 | 69 | 32 |
32 | ZIndex SP | 0.58 ± 0.158 | 72 | 539 | 25 | 29 |
Statistics for sample partitionings of each synthetic dataset
Dataset | K* | # | \(\overline{K}\) | \(\overline{\rm{ARI}}\) |
---|---|---|---|---|
network1 | 4 | 100 | 5.26 ± 2.45 ∈ [2,12] | 0.45 ± 0.18 ∈ [0.13,1] |
network2 | 3 | 100 | 4 ± 1.7 ∈ [2,8] | 0.47 ± 0.23 ∈ [0.06,1] |
network3 | 2 | 100 | 4 ± 1.33 ∈ [2,6] | 0.36 ± 0.22 ∈ [0.07,1] |
network4 | 7 | 100 | 10.68 ± 3.3 ∈ [4,19] | 0.69 ± 0.21 ∈ [0.25,1] |
network5 | 2 | 100 | 4.68 ± 1.91 ∈ [2,9] | 0.32 ± 0.22 ∈ [−0.01,1] |
network6 | 5 | 100 | 5.98 ± 2.63 ∈ [2,14] | 0.52 ± 0.21 ∈ [0.12,1] |
network7 | 4 | 100 | 6.62 ± 2.72 ∈ [2,12] | 0.52 ± 0.22 ∈ [0.11,1] |
network8 | 5 | 100 | 5.8 ± 2.45 ∈ [2,12] | 0.55 ± 0.22 ∈ [0.15,1] |
network9 | 5 | 100 | 6.54 ± 2.08 ∈ [3,11] | 0.64 ± 0.2 ∈ [0.25,1] |
network10 | 6 | 100 | 8.88 ± 2.74 ∈ [4,15] | 0.59 ± 0.19 ∈ [0.21,1] |
Overall ranking of criteria based on AMI and Spearman's correlation on the synthetic benchmarks, generated with the same parameters as in Table 6 but with a much higher mixing parameter of 0.4
Overall results | ||||||
---|---|---|---|---|---|---|
Rank | Criterion | ARI_{corr} | Rand | Jaccard | NMI | AMI |
1 | Q | 0.854 ± 0.039 | 11 | 1 | 4 | 2 |
2 | ZIndex’ M | 0.839 ± 0.067 | 2 | 5 | 1 | 1 |
3 | ZIndex’ A | 0.813 ± 0.071 | 4 | 11 | 3 | 3 |
4 | ZIndex’ \(\widehat{MM}\) | 0.785 ± 0.115 | 1 | 63 | 2 | 4 |
5 | ZIndex’ \(\hat{A}\) | 0.767 ± 0.101 | 3 | 86 | 5 | 5 |
6 | ZIndex’ \(\widehat{PC}\) | 0.748 ± 0.19 | 5 | 108 | 7 | 7 |
7 | ZIndex’ \(\widehat{NPC}\) | 0.748 ± 0.19 | 6 | 109 | 8 | 8 |
8 | ZIndex’ \(\widehat{NPO}\) | 0.745 ± 0.191 | 7 | 110 | 9 | 9 |
9 | ZIndex’ \(\widehat{TO}\) | 0.738 ± 0.197 | 13 | 88 | 16 | 15 |
10 | ZIndex’ \(\widehat{NOV}\) | 0.738 ± 0.197 | 8 | 134 | 10 | 10 |
Near optimal results | ||||||
1 | ZIndex’ M | 0.825 ± 0.105 | 1 | 1 | 1 | 1 |
2 | ZIndex’ A | 0.8 ± 0.184 | 2 | 2 | 2 | 2 |
3 | ZIndex’ \(\widehat{MM}\) | 0.768 ± 0.166 | 3 | 4 | 3 | 3 |
4 | ZIndex’ \(\hat{A}\) | 0.76 ± 0.192 | 4 | 6 | 4 | 4 |
5 | Q | 0.72 ± 0.209 | 34 | 3 | 34 | 34 |
6 | ASWC0 \(\widehat{NPL2.0}\) | 0.719 ± 0.248 | 22 | 8 | 5 | 5 |
7 | SWC0 \(\widehat{NPL2.0}\) | 0.718 ± 0.247 | 23 | 9 | 6 | 6 |
8 | ZIndex’ \(\widehat{NPE2.0}\) | 0.714 ± 0.259 | 5 | 21 | 7 | 8 |
9 | ASWC0 SP | 0.71 ± 0.286 | 28 | 5 | 29 | 26 |
10 | ZIndex’ \(\widehat{NPL2.0}\) | 0.702 ± 0.261 | 6 | 29 | 13 | 18 |
Medium far results | ||||||
1 | Q | 0.578 ± 0.124 | 106 | 22 | 3 | 1 |
2 | CIndex’ \(\widehat{NPC}\) | 0.522 ± 0.146 | 154 | 12 | 78 | 69 |
3 | CIndex’ \(\widehat{PC}\) | 0.521 ± 0.146 | 155 | 13 | 79 | 70 |
4 | CIndex’ \(\widehat{NPO}\) | 0.519 ± 0.142 | 176 | 5 | 120 | 100 |
5 | CIndex’ \(\widehat{NOV}\) | 0.501 ± 0.14 | 209 | 4 | 142 | 135 |
6 | ZIndex’ M | 0.498 ± 0.199 | 4 | 364 | 2 | 2 |
7 | CIndex’ IC2 | 0.492 ± 0.146 | 227 | 9 | 176 | 173 |
8 | CIndex’ ICV2 | 0.483 ± 0.193 | 149 | 79 | 119 | 115 |
9 | CIndex’ IC3 | 0.478 ± 0.191 | 187 | 43 | 148 | 146 |
10 | CIndex’ TO | 0.478 ± 0.175 | 179 | 31 | 204 | 203 |
Far far results | ||||||
1 | ZIndex’ \(\widehat{PC}\) | 0.527 ± 0.169 | 61 | 501 | 5 | 4 |
2 | ZIndex’ \(\widehat{NPC}\) | 0.527 ± 0.169 | 62 | 502 | 6 | 5 |
3 | Q | 0.523 ± 0.192 | 128 | 73 | 93 | 25 |
4 | ZIndex’ M | 0.522 ± 0.121 | 77 | 465 | 8 | 2 |
5 | ZIndex’ \(\widehat{NPO}\) | 0.518 ± 0.168 | 63 | 504 | 10 | 6 |
6 | ZIndex’ \(\widehat{NOV}\) | 0.515 ± 0.166 | 60 | 518 | 11 | 7 |
7 | ZIndex’ \(\widehat{TO}\) | 0.489 ± 0.171 | 78 | 485 | 15 | 9 |
8 | ZIndex’ \(\widehat{NPE2.0}\) | 0.481 ± 0.168 | 79 | 491 | 24 | 14 |
9 | ZIndex’ \(\widehat{MM}\) | 0.48 ± 0.15 | 30 | 553 | 2 | 3 |
10 | ZIndex’ \(\widehat{NO}\) | 0.48 ± 0.17 | 43 | 552 | 7 | 8 |
In short, the relative performance of the different criteria depends on the difficulty of the network itself, as well as on how far the samples are from the ground truth. Altogether, choosing the right criterion for evaluating community mining results depends both on the application, i.e. how well separated the communities in the given network may be, and on the algorithm that produces these results, i.e. how refined the results may be. For example, if the algorithm produces high-quality results close to the optimal, modularity Q might not distinguish good from bad partitionings very well, whereas when choosing between mixed and poorly separated clusterings it is the superior criterion. Please note that these results and criteria differ from our earlier work (Rabbany et al. 2012); in particular, ZIndex is defined differently in this paper.
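For reference, the modularity Q used as the baseline criterion throughout can be computed directly from its Newman-Girvan definition; a minimal pure-Python sketch for undirected, unweighted graphs:

```python
def modularity(edges, communities):
    """Newman-Girvan modularity Q of a partition of an undirected,
    unweighted graph: Q = sum_c [ e_c/m - (d_c/2m)^2 ], where e_c is
    the number of edges inside community c and d_c its total degree.
    `edges`: list of (u, v) pairs; `communities`: dict node -> label."""
    m = len(edges)
    degree, inside, ends = {}, {}, {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
        if communities[u] == communities[v]:      # intra-community edge
            c = communities[u]
            inside[c] = inside.get(c, 0) + 1
    for node, d in degree.items():                # degree mass per community
        c = communities[node]
        ends[c] = ends.get(c, 0) + d
    return sum(inside.get(c, 0) / m - (ends[c] / (2 * m)) ** 2 for c in ends)
```

For the classic example of two triangles joined by a single edge and split into their natural communities, this returns 6/7 − 1/2 ≈ 0.357.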
5 Summary and future perspectives
In this section, we summarize the paper, elaborate on our findings and suggest some lines of work that could follow.
In this article, we examined different approaches for evaluating community mining results. In particular, we examined different external and relative measures for clustering validity and adapted them for community mining evaluation. Our main contribution is the generalization of well-known clustering validity criteria originally used as quantitative measures for evaluating the quality of clusters of data points represented by attributes. The first motivation for this generalization is to adapt these criteria to the context of interrelated data, where the only commonly used criterion for evaluating the goodness of detected communities is currently the modularity Q. Providing a more extensive set of validity criteria can help researchers better evaluate and compare community mining results in different settings. Moreover, these adapted validity criteria can further be used as objectives in designing new community mining algorithms. Unlike most of the original clustering validity criteria, which are defined specifically in terms of the Euclidean distance, our generalized formulation is independent of any particular distance measure.
In our experiments, several of these adapted criteria exhibit high performance in ranking different partitionings of a given dataset, which makes them useful alternatives to the modularity Q. In particular, the \(ZIndex\) criterion performs well almost regardless of the choice of proximity measure. This also makes \(ZIndex\) an attractive objective for finding communities, a direction we intend to investigate in future work.
Our results suggest that the performance of the different criteria, and hence their rankings, change across settings. Here we examined the effect of how well separated the communities in the ground truth are, as well as the overall distance of a clustering from the ground truth. We further observed that the quality of the different criteria is also affected by the choice of benchmarks, synthetic versus real; this difference motivates further work on more realistic synthetic generators (Aldecoa and Marin 2012). Another direction is to classify the criteria according to their performance under different network characteristics; Onnela et al. (2010) and Sallaberry et al. (2013) provide examples of network characterization.
We also compared common clustering similarity/agreement measures used in the external evaluation of community mining results. Our results confirm that the commonly used agreement measure NMI is biased in favour of a large number of communities and falls short of detecting the true number of communities compared with other measures, whereas ARI possesses both of these desirable properties. We further argued the need for modified agreement measures specific to communities, pointing out that the current clustering agreement measures completely ignore the edges. We proposed a few straightforward extensions of the agreement measures to adapt them to the context of interrelated data, including a degree-weighted variation of ARI. The resulting agreement measures are more appropriate for the external evaluation of community mining results while exhibiting the desirable qualities of ARI (i.e. adjustment for chance and detection of the true number of clusters). Our results also motivate further investigation into the properties of these extensions, as well as the examination of alternative extensions (for example, incorporating the notion of assortativity), mainly because, despite being unbiased, these extensions are not as stable as ARI when evaluating random clusterings. Another line of work following the agreement measures is investigating their application in consensus or ensemble clustering; see for example Strehl and Ghosh (2003) and Lancichinetti and Fortunato (2012).
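For concreteness, the pair-counting form of ARI, including its adjustment for chance, can be written compactly; the following is a standard implementation of the plain (unweighted) index, not the degree-weighted variation discussed above:

```python
from math import comb

def ari(labels_a, labels_b):
    """Adjusted Rand Index between two flat clusterings given as
    per-node label lists: (Index - Expected) / (Max - Expected),
    computed from the pair-counting contingency table."""
    n = len(labels_a)
    cont, a, b = {}, {}, {}
    for x, y in zip(labels_a, labels_b):
        cont[(x, y)] = cont.get((x, y), 0) + 1   # contingency cells n_ij
        a[x] = a.get(x, 0) + 1                   # row marginals a_i
        b[y] = b.get(y, 0) + 1                   # column marginals b_j
    sum_ij = sum(comb(v, 2) for v in cont.values())
    sum_a = sum(comb(v, 2) for v in a.values())
    sum_b = sum(comb(v, 2) for v in b.values())
    expected = sum_a * sum_b / comb(n, 2)        # expected index under chance
    max_index = (sum_a + sum_b) / 2
    return (sum_ij - expected) / (max_index - expected)
```

Note that ARI is invariant to relabeling of the clusters, returns 1 only for identical partitionings, and can be negative for agreement below chance level.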
As part of future work, we intend to extend the criteria and measures defined here to more general cases of community mining: overlapping communities, dynamic communities and local communities. For example, in the cluster analysis literature there are clustering algorithms and validation indexes specially designed for data involving overlapping categories. In particular, fuzzy clustering algorithms produce clustering results in which data objects may belong to multiple clusters to different degrees (Bezdek 1981; Dumitrescu and Jain 2000). To evaluate the results of such algorithms, a number of relative, internal and external fuzzy clustering validation indexes have been proposed (Campello 2010; Campello and Hruschka 2006; Collins and Dent 1988; Dumitrescu and Jain 2000; Halkidi et al. 2001; Höppner et al. 1999). Furthermore, some recent works study methods for finding and evaluating overlapping communities in the context of interrelated data (Gregory 2011; Lancichinetti et al. 2009; Rees and Gallagher 2012; Yoshida 2013).
Acknowledgments
The authors are grateful for the support from Alberta Innovates Centre for Machine Learning and NSERC. Ricardo Campello also acknowledges the financial support of Fapesp and CNPq.
References
- Albatineh AN, Niewiadomska-Bugaj M, Mihalko D (2006) On similarity indices and correction for chance agreement. J Classif 23:301–313. doi:10.1007/s00357-006-0017-z
- Aldecoa R, Marin I (2012) Closed benchmarks for network community structure characterization. Phys Rev E 85:026109
- Bezdek JC (1981) Pattern recognition with fuzzy objective function algorithms. Kluwer Academic Publishers, Norwell
- Calinski T, Harabasz J (1974) A dendrite method for cluster analysis. Commun Stat Theory Methods 3:1–27
- Campello R (2010) Generalized external indexes for comparing data partitions with overlapping categories. Pattern Recogn Lett 31(9):966–975
- Campello R, Hruschka ER (2006) A fuzzy extension of the silhouette width criterion for cluster analysis. Fuzzy Sets Syst 157(21):2858–2875
- Chen J, Zaïane OR, Goebel R (2009) Detecting communities in social networks using max-min modularity. In: SIAM international conference on data mining, pp 978–989
- Clauset A (2005) Finding local community structure in networks. Phys Rev E 72(2):026132
- Collins LM, Dent CW (1988) Omega: a general formulation of the Rand index of cluster recovery suitable for non-disjoint solutions. Multivar Behav Res 23(2):231–242
- Dalrymple-Alford EC (1970) Measurement of clustering in free recall. Psychol Bull 74:32–34
- Danon L, Díaz-Guilera A, Duch J, Arenas A (2005) Comparing community structure identification. J Stat Mech Theory Exp 2005(09):P09008. doi:10.1088/1742-5468/2005/09/P09008
- Davies DL, Bouldin DW (1979) A cluster separation measure. IEEE Trans Pattern Anal Mach Intell 1(2):224–227
- Dumitrescu D, Lazzerini B, Jain LC (2000) Fuzzy sets and their application to clustering and training. CRC Press, Boca Raton
- Dunn JC (1974) Well-separated clusters and optimal fuzzy partitions. J Cybern 4(1):95–104
- Fortunato S (2010) Community detection in graphs. Phys Rep 486(3–5):75–174
- Fortunato S, Barthélemy M (2007) Resolution limit in community detection. Proc Natl Acad Sci 104(1):36–41
- Girvan M, Newman MEJ (2002) Community structure in social and biological networks. Proc Natl Acad Sci 99(12):7821–7826
- Gregory S (2011) Fuzzy overlapping communities in networks. J Stat Mech Theory Exp 2011(02):P02017
- Gustafsson M, Hörnquist M, Lombardi A (2006) Comparison and validation of community structures in complex networks. Phys A Stat Mech Appl 367:559–576
- Halkidi M, Batistakis Y, Vazirgiannis M (2001) On clustering validation techniques. J Intell Inform Syst 17:107–145
- Höppner F, Klawonn F, Kruse R, Runkler T (1999) Fuzzy cluster analysis: methods for classification, data analysis and image recognition. Wiley, New York
- Hubert L, Arabie P (1985) Comparing partitions. J Classif 2:193–218
- Hubert LJ, Levin JR (1976) A general statistical framework for assessing categorical clustering in free recall. Psychol Bull 83:1072–1080
- Kenley EC, Cho Y-R (2011) Entropy-based graph clustering: application to biological and social networks. In: IEEE international conference on data mining
- Krebs V (2004) Books about US politics. http://www.orgnet.com/
- Lancichinetti A, Fortunato S (2009) Community detection algorithms: a comparative analysis. Phys Rev E 80(5):056117
- Lancichinetti A, Fortunato S (2012) Consensus clustering in complex networks. Sci Rep 2:336
- Lancichinetti A, Fortunato S, Kertész J (2009) Detecting the overlapping and hierarchical community structure in complex networks. New J Phys 11(3):033015
- Lancichinetti A, Fortunato S, Radicchi F (2008) Benchmark graphs for testing community detection algorithms. Phys Rev E 78(4):046110
- Leskovec J, Kleinberg J, Faloutsos C (2005) Graphs over time: densification laws, shrinking diameters and possible explanations. In: ACM SIGKDD international conference on knowledge discovery in data mining, pp 177–187
- Leskovec J, Lang KJ, Mahoney M (2010) Empirical comparison of algorithms for network community detection. In: International conference on world wide web, pp 631–640
- Luo F, Wang JZ, Promislow E (2008) Exploring local community structures in large networks. Web Intell Agent Syst 6(4):387–400
- Manning CD, Raghavan P, Schütze H (2008) Introduction to information retrieval. Cambridge University Press, New York
- Meilă M (2007) Comparing clusterings: an information based distance. J Multivar Anal 98(5):873–895
- Milligan G, Cooper M (1985) An examination of procedures for determining the number of clusters in a data set. Psychometrika 50(2):159–179
- Newman M (2010) Networks: an introduction. Oxford University Press, New York
- Newman MEJ (2006) Modularity and community structure in networks. Proc Natl Acad Sci 103(23):8577–8582
- Newman MEJ, Girvan M (2004) Finding and evaluating community structure in networks. Phys Rev E 69(2):026113
- de Nooy W, Mrvar A, Batagelj V (2004) Exploratory social network analysis with Pajek. Cambridge University Press, Cambridge
- Onnela J-P, Fenn DJ, Reid S, Porter MA, Mucha PJ, Fricker MD, Jones NS (2010) Taxonomies of networks. arXiv e-prints
- Orman GK, Labatut V (2010) The effect of network realism on community detection algorithms. In: Proceedings of the 2010 international conference on advances in social networks analysis and mining, ASONAM '10, pp 301–305
- Orman GK, Labatut V, Cherifi H (2011) Qualitative comparison of community detection algorithms. In: International conference on digital information and communication technology and its applications, vol 167, pp 265–279
- Pakhira M, Dutta A (2011) Computing approximate value of the PBM index for counting number of clusters using genetic algorithm. In: International conference on recent trends in information systems
- Palla G, Derenyi I, Farkas I, Vicsek T (2005) Uncovering the overlapping community structure of complex networks in nature and society. Nature 435(7043):814–818
- Porter MA, Onnela J-P, Mucha PJ (2009) Communities in networks. Notices of the AMS 56(9):1082–1097
- Rabbany R, Chen J, Zaïane OR (2010) Top leaders community detection approach in information networks. In: SNA-KDD workshop on social network mining and analysis
- Rabbany R, Takaffoli M, Fagnan J, Zaïane OR, Campello R (2012) Relative validity criteria for community mining algorithms. In: International conference on advances in social networks analysis and mining (ASONAM)
- Rabbany R, Zaïane OR (2011) A diffusion of innovation-based closeness measure for network associations. In: IEEE international conference on data mining workshops, pp 381–388
- Ravasz E, Somera AL, Mongru DA, Oltvai ZN, Barabási A-L (2002) Hierarchical organization of modularity in metabolic networks. Science 297(5586):1551–1555
- Rees BS, Gallagher KB (2012) Overlapping community detection using a community optimized graph swarm. Soc Netw Anal Min 2(4):405–417
- Rosvall M, Bergstrom CT (2007) An information-theoretic framework for resolving community structure in complex networks. Proc Natl Acad Sci 104(18):7327–7331
- Rosvall M, Bergstrom CT (2008) Maps of random walks on complex networks reveal community structure. Proc Natl Acad Sci 105(4):1118–1123
- Rousseeuw P (1987) Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math 20(1):53–65
- Sallaberry A, Zaidi F, Melançon G (2013) Model for generating artificial social networks having community structures with small-world and scale-free properties. Soc Netw Anal Min 3(3):597–609
- Strehl A, Ghosh J (2003) Cluster ensembles: a knowledge reuse framework for combining multiple partitions. J Mach Learn Res 3:583–617
- Theodoridis S, Koutroumbas K (2009) Cluster validity. In: Pattern recognition, chapter 16, 4th edn. Elsevier Science, London
- Vendramin L, Campello RJGB, Hruschka ER (2010) Relative clustering validity criteria: a comparative overview. Stat Anal Data Min 3(4):209–235
- Vinh NX, Epps J, Bailey J (2009) Information theoretic measures for clusterings comparison: is a correction for chance necessary? In: Proceedings of the 26th annual international conference on machine learning, ICML '09. ACM, New York, pp 1073–1080
- Vinh NX, Epps J, Bailey J (2010) Information theoretic measures for clusterings comparison: variants, properties, normalization and correction for chance. J Mach Learn Res 11:2837–2854
- Wasserman S, Faust K (1994) Social network analysis: methods and applications. Cambridge University Press, Cambridge
- Wu J, Xiong H, Chen J (2009) Adapting the right measures for k-means clustering. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining, KDD '09. ACM, New York, pp 877–886
- Yoshida T (2013) Weighted line graphs for overlapping community discovery. Soc Netw Anal Min. doi:10.1007/s13278-013-0104-1
- Zachary WW (1977) An information flow model for conflict and fission in small groups. J Anthropol Res 33:452–473