1 Introduction

Research in mining and analyzing large-scale complex networks has been boosted recently by the discovery that many complex networks extracted from natural and artificial systems share a set of non-trivial characteristics that distinguish them from pure random graphs. The basic topological characteristics of complex networks are: a low degree of separation (better known as the small-world feature [37]), a power-law distribution of node degrees [75], and a high clustering coefficient [46]. As a consequence of these basic topological features, almost all real-world complex networks exhibit a mesoscopic level of organization, called communities [58]. A community is loosely defined as a connected subgraph whose nodes are more densely linked with one another than with nodes outside the subgraph. Nodes in a community are generally supposed to share common properties or play similar roles within the network. This suggests that we can gain much insight into complex networked systems by discovering and examining their underlying communities. The semantic interpretation of a community depends on the type of the analyzed graph. In a metabolic network, a community would express a biological function in a cell [26]. In a network of transactions on an e-commerce site, a community would express a set of similar customers [6]. Considering the web as a complex network, a community would be a set of pages dealing with the same topic [20].

More importantly, since the community-level structure is exhibited by almost all studied real-world complex networks, an efficient algorithm for detecting communities would be useful to implement a pre-treatment step for a number of general complex operations such as computation distribution, huge graph visualization and large-scale graph compression [25].

A large number of algorithms have been proposed for detecting communities in complex networks. Recent survey studies on this topic can be found in [21, 66, 83]. A quick review of the scientific literature allows us to distinguish three different but related problems:

  • Disjoint community detection: The goal here is to compute a partition of the graph’s node set. A node can belong to only one community at a time. Most of the work in the area of community detection deals with this problem [21].

  • Overlapping community detection: The goal is to compute a soft clustering of the graph’s node set, where a node can belong to several communities at once [61, 64, 77, 87, 88].

  • Local community identification: The goal here is to compute the community of a given node rather than partitioning the whole graph into communities. This can be useful in different settings, namely in the area of recommender systems [5, 11, 13, 33].

Both the disjoint and the overlapping community detection problems are NP-hard [10]. Different heuristics have been proposed to compute sub-optimal partitions. The most popular methods are based on greedy optimization of a graph partition quality measure [7, 23, 73]. The most widely applied partition quality criterion is the modularity, initially introduced in [23]. However, some recent studies have pointed out serious limitations of modularity optimization-based approaches [24, 40]. These limitations have boosted the search for alternative approaches to community detection. Emerging approaches include label propagation approaches [71] and seed-centric ones [34]. The basic idea of seed-centric approaches is to select a set of nodes (i.e. seeds) around which communities are constructed. Being based on local computations, these approaches are well suited to large-scale and/or dynamic networks. One special case of seed selection is to select nodes that are likely to act as leaders of their communities [36, 76]. In this work, we propose LICOD, a general framework for implementing leader-driven community detection algorithms (LdCD hereafter). The approach we develop here is an extension of the work presented in [32]. The major enhancement is the transformation of LICOD into a framework for implementing LdCD algorithms, as described in Sect. 4. Another major new contribution concerns the evaluation process. Since LdCD algorithms are not based on maximizing an objective function (e.g. the modularity), it is unfair to use the latter criterion to compare these algorithms with popular modularity-guided approaches. One way to provide fair evaluation criteria for different community detection algorithms is task-oriented evaluation: computed communities are evaluated by how well they serve a given dependent task. In this paper, we propose using the data clustering task for that purpose. The idea is to transform classical clustering benchmarks into a community detection problem. Algorithms can then be evaluated using classical extrinsic clustering evaluation metrics [52].

To sum up, the main contributions of this paper are the following:

  • Proposing LICOD, a general framework for implementing LdCD algorithms.

  • Introducing task-oriented evaluation of community detection algorithms and providing an approach for evaluating different community detection algorithms on data clustering tasks.

The remainder of this paper is organized as follows. In Sect. 2, we provide the basic notations used in this paper. In Sect. 3, we briefly review major community detection approaches as well as evaluation approaches. The LICOD approach is detailed in Sect. 4. In Sect. 5, we describe experiments on small benchmark networks and the application of the proposed task-oriented evaluation approach; the clustering-oriented evaluation approach itself is described in Sect. 5.2. Obtained results are provided and discussed. Finally, we conclude in Sect. 6.

2 Definitions and notations

In this study, we only consider simple, unweighted, undirected graphs. A graph \(G\) is defined by a couple \(G = \langle V , E\rangle \), where \(V = \{v_1, \dots , v_n\}\) is a set of nodes (a.k.a. actors, sites, vertices) and \( E \subseteq V \times V\) is a set of links (a.k.a. ties, arcs, or relationships). We denote by \(n_G = | V |\) (resp. \(m_G = |E|\)) the number of nodes (resp. links) of graph \(G\). The set of direct neighbors of a node \( v \in V\) is given by the function \(\Gamma (v)\). The number of direct neighbors of a node is the node’s degree and is denoted by \(d_v = | \Gamma (v)|\). The density of a graph \(G\) is the ratio of the number of existing links to the number of potential links: \(d(G) = \frac{2\times m_G}{n_G\times (n_G-1)}\). We denote by \(A\) the adjacency matrix of graph \(G\). We have \(A_{ij}=1\) (resp. \(A_{ij}=0\)) if nodes \(v_i, v_j \in V\) are linked (resp. unlinked).
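As an illustration of these notations, the following minimal sketch builds the neighborhood function \(\Gamma\) from an edge list and derives the degrees and the density \(d(G)\). The helper names `neighbors` and `density` and the toy graph are assumptions made for this example only.

```python
def neighbors(edges):
    """Build Gamma(v) for a simple undirected graph given as an edge list."""
    gamma = {}
    for u, v in edges:
        gamma.setdefault(u, set()).add(v)
        gamma.setdefault(v, set()).add(u)
    return gamma


def density(gamma):
    """d(G) = 2 * m_G / (n_G * (n_G - 1))."""
    n = len(gamma)
    # Every link is counted twice when summing degrees.
    m = sum(len(nb) for nb in gamma.values()) // 2
    return 2 * m / (n * (n - 1)) if n > 1 else 0.0


# Toy graph (hypothetical): a triangle a-b-c plus a pendant node d attached to c.
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
gamma = neighbors(edges)
degrees = {v: len(nb) for v, nb in gamma.items()}  # d_v = |Gamma(v)|
```

Isolated nodes do not appear in an edge list, so in this sketch they would have to be added to `gamma` explicitly with an empty neighbor set.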

3 Related work

In this section, we provide a brief survey of the two topics related to the contributions of this paper: community detection algorithms and community evaluation approaches.

3.1 Community detection approaches

We focus in this study on approaches that aim to compute a partition, or disjoint communities of a complex network. A wide variety of different approaches have been proposed so far. Some comprehensive survey studies are provided in [21, 66, 83]. Here, we propose to classify existing approaches into four classes: Group-based approaches, network-based approaches, propagation-based approaches and seed-centric ones. Next we briefly review each of these identified classes.

3.1.1 Group-based approaches

These are approaches based on identifying groups of nodes that are highly connected or share some strong connection patterns. Some relevant connection patterns are the following:

  • High mutual connectivity: a community can be assimilated to a maximal clique or to a \(\gamma \)-quasi-clique. A subgraph \(G\) is said to be a \(\gamma \)-quasi-clique if \(d(G) \ge \gamma \). Finding a maximum clique in a graph is known to be an NP-hard problem. Generally, cliques of reduced size are used as seeds to find larger communities. An example is the clique percolation algorithm [1, 82]. Such approaches are relevant for networks that are rather dense.

  • High internal reachability: One way to relax the constraint of having cliques or quasi-cliques is to consider the internal reachability of nodes within a community. Following this, a community core can be approximated by a maximal k-clique, k-club or k-core subgraph. A k-clique (resp. k-club) is a maximal subgraph in which the longest shortest path between any two nodes (resp. the diameter) is \(\le \) k. A k-core is a maximal connected subgraph in which each node has a degree \(\ge \) k. In [86], the authors introduce the concept of k-community, defined as a connected subgraph \(G' = \langle V' \subset V, E' \subset E\rangle \) of a graph \(G\) in which for every couple of nodes \(u,v \in V'\) the following constraint holds: \(|\Gamma _G(v) \cap \Gamma _G(u)| \ge k\). The computational complexity of k-cores and k-communities is polynomial. However, these structures do not cover whole communities; they are rather used as seeds for computing communities. An additional step for adding non-clustered nodes should be provided. In [67], the authors propose to compute k-cores as a means to accelerate the computation of communities using standard algorithms, but on size-reduced graphs.
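To illustrate why k-cores can be computed in polynomial time, here is a minimal sketch of the usual peeling procedure: nodes of degree lower than k are removed repeatedly until none remains. The function name `k_core_nodes` is an assumption of this example, and the sketch returns all surviving nodes without splitting them into connected components, which the definition above would additionally require.

```python
def k_core_nodes(gamma, k):
    """Return the nodes that survive repeated removal of nodes of degree < k.

    gamma maps each node to the set of its neighbors (undirected graph).
    """
    # Work on a copy so that the caller's graph is left untouched.
    alive = {v: set(nb) for v, nb in gamma.items()}
    changed = True
    while changed:
        changed = False
        low = [v for v, nb in alive.items() if len(nb) < k]
        for v in low:
            # Detach v from its remaining neighbors, then drop it.
            for u in alive[v]:
                alive[u].discard(v)
            del alive[v]
            changed = True
    return set(alive)


# Toy graph (hypothetical): a triangle a-b-c plus a pendant node d attached to c.
gamma = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
two_core = k_core_nodes(gamma, 2)
```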

3.1.2 Network-based approaches

These approaches consider the whole connection patterns in the network. Historical approaches include classical clustering algorithms. The adjacency matrix can be used as a similarity matrix, or a topological similarity between each couple of nodes can be computed. Spectral clustering approaches [59] and hierarchical clustering approaches can then be used [70]. Usually the number of clusters to be found should be provided as an input to the algorithm. Another drawback of spectral clustering is its high computational complexity, which might be cubic in the size of the input dataset. Some distributed implementations of these approaches have been proposed to improve efficiency [85]. More popular network-based approaches are those based on optimizing a quality metric of a graph partition. Different partition quality metrics have been proposed in the scientific literature. The modularity is the most widely used one [58]. It is defined as follows. Let \(\mathcal {P} = \{C_1, \dots , C_k\}\) be a partition of the node set \(V\) of a graph. The modularity of the partition \(\mathcal {P}\) is given by:

$$\begin{aligned} Q(\mathcal {P}) = \sum _{\mathcal {C} \in \mathcal {P}} \left( e(\mathcal {C}) - a(\mathcal {C})^2 \right) \end{aligned}$$
(1)

where \(e(\mathcal {C}) = \frac{\sum _{i\in \mathcal {C}} \sum _{j \in \mathcal {C}}A_{ij}}{2 \times m_G}\) is the fraction of links inside the community \(\mathcal {C}\), and \(a(\mathcal {C}) = \frac{\sum _{i\in \mathcal {C}} \sum _{j \in V}A_{ij}}{2 \times m_G}\) is the fraction of link ends attached to nodes in \(\mathcal {C}\). The computing complexity of \(Q\) is \(\mathcal {O}(m_G)\) [23]. Some recent work has extended the definition to bipartite and multipartite graphs [18, 48, 56] and even to multiplex and dynamic graphs [39, 55]. Different heuristic approaches have been proposed for computing partitions that maximize the modularity. These can be classified into three main classes:

  • Agglomerative approaches: These implement a bottom-up approach where an algorithm starts by considering each single node as a community. Then, it iterates by merging some communities guided by some quality criterion. The Louvain algorithm [7] is a well-known example of such approaches. The algorithm is composed of two phases. First, it looks for small communities by optimizing modularity in a local way. Second, it aggregates nodes of the same community and builds a new network whose nodes are the communities. Two adjacent communities merge if the overall modularity of the obtained partition can be enhanced. These steps are repeated iteratively until a maximum of modularity is reached. The computing complexity of the approach is empirically evaluated to be \(\mathcal {O}(n \log n) \).

  • Separative approaches: These implement a top-down approach, where an algorithm starts by considering the whole network as a community. It iterates by selecting ties to remove in order to split the network into communities. Different criteria can be applied for tie selection. The Newman–Girvan algorithm is the best-known representative of this class of approaches [58]. The algorithm is based on the simple idea that a tie linking two communities should have a high betweenness centrality. This holds because an inter-community tie would be traversed by a high fraction of the shortest paths between nodes belonging to these different communities. Considering the whole graph \(G\), the algorithm iterates \(m_G\) times, cutting at each iteration the tie with the highest betweenness centrality. This builds a hierarchy of communities, the root of which is the whole graph and the leaves of which are communities composed of isolated nodes. The partition of highest modularity is returned as output. The algorithm is simple to implement and has the advantage of automatically discovering the best number of communities to identify. However, the computational complexity is rather high: \(\mathcal {O}(n^2 \cdot m + n^3 \log (n))\). This makes it prohibitive to apply to large-scale networks.

  • Other optimization approaches: Other classical optimization approaches can also be used for modularity optimization, such as genetic algorithms [31, 47, 68], evolutionary algorithms [29] or multi-objective optimization approaches [69].
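All of these heuristics need to evaluate the modularity of candidate partitions. The following minimal sketch computes \(Q(\mathcal{P})\) in a single pass over the links, following the definitions of \(e(\mathcal{C})\) and \(a(\mathcal{C})\) given above. The function name `modularity` and the toy graph are assumptions of this example, not part of any cited implementation.

```python
def modularity(edges, partition):
    """Q(P) = sum over communities C of e(C) - a(C)^2.

    edges: list of undirected links (u, v) of a simple graph.
    partition: list of disjoint node collections covering all nodes.
    """
    m = len(edges)
    community_of = {v: idx for idx, comm in enumerate(partition) for v in comm}
    inside = [0] * len(partition)    # links with both endpoints in C
    incident = [0] * len(partition)  # sum of the degrees of the nodes in C
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        incident[cu] += 1
        incident[cv] += 1
        if cu == cv:
            inside[cu] += 1
    # e(C) = 2 * inside / (2m) = inside / m and a(C) = incident / (2m).
    return sum(
        inside[c] / m - (incident[c] / (2 * m)) ** 2
        for c in range(len(partition))
    )


# Toy graph (hypothetical): two triangles joined by the single link (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
q_two_triangles = modularity(edges, [[0, 1, 2], [3, 4, 5]])
q_single_block = modularity(edges, [[0, 1, 2, 3, 4, 5]])
```

On this toy graph the partition into the two triangles scores higher than the trivial single-community partition, whose modularity is zero.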

All modularity optimization approaches make implicitly the following assumptions:

  • The best partition of a graph is the one that maximizes the modularity.

  • If a network has a community structure, then it is possible to find a precise partition with maximal modularity.

  • If a network has a community structure, then partitions inducing high modularity values are structurally similar.

Recent studies have shown that the three above-mentioned assumptions do not hold. In [24], the authors show that the modularity function exhibits extreme degeneracies: it admits an exponential number of distinct high-scoring solutions and typically lacks a clear global maximum. In [40], it has been shown that communities detected by modularity maximization have a resolution limit. These serious drawbacks of modularity-guided algorithms have boosted the search for alternative approaches. Some interesting emerging approaches are label propagation approaches [71] and seed-centric ones [34].

3.1.3 Propagation-based approaches

Even the fastest algorithm, the Louvain approach, has a computational complexity that becomes costly for very large-scale networks, which can be composed of millions of nodes, as is frequently the case for online social networks today. In addition, the studied complex networks are very dynamic. Low-complexity incremental approaches for community detection are therefore needed. Label propagation approaches constitute a first step in that direction [71, 89]. The underlying idea is simple: each node \( v \in V\) in the network is assigned a specific label \(l_v\). All nodes update their labels synchronously by selecting the most frequent label in their direct neighborhood. Formally, we have:

$$\begin{aligned} l_v = {{\mathrm{arg\,max}}}_{l} |\Gamma ^{l}(v)| \end{aligned}$$

where \(\Gamma ^{l}(v) \subseteq \Gamma (v)\) is the set of neighbors of \(v\) that have the label \(l\). Ties are broken randomly. The algorithm iterates until reaching a stable state where no more nodes change their labels. Nodes having the same label are returned as a detected community. The complexity of each iteration is \(\mathcal {O}(m)\). Hence, the overall computational complexity is \(\mathcal {O}(km)\), where \(k\) is the number of iterations before convergence. A study reported in [45] shows that the number of iterations grows logarithmically with \(n\), the size of the target network. In addition to its low computational complexity, the label propagation algorithm can readily be distributed, which allows handling very large-scale networks [62, 78, 92]. While the algorithm is very fast, it suffers from two serious drawbacks:

  • First, there is no formal guarantee of the convergence to a stable state.

  • Second, it lacks robustness, since different runs produce different partitions due to random tie breaking.

Different approaches have been proposed in the literature to cope with these two problems. Asynchronous and semi-synchronous label updating have been proposed to mitigate the problem of oscillation and improve convergence conditions [14, 71]. However, these approaches make the algorithm harder to parallelize by creating dependencies among nodes, and they increase the randomness in the algorithm, making the robustness even worse. Other approaches have been developed to handle the problem of label propagation robustness. These include balanced label propagation [81], label hop attenuation [44] and propagation preference-based approaches [49]. Another interesting way to handle the instability of label propagation approaches consists simply in executing the algorithm \(k\) times and applying an ensemble clustering approach to the obtained partitions [33, 41, 63, 74].
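The basic synchronous variant described in this section can be sketched as follows. This is a minimal illustration, not the implementation of any cited work: the function name `label_propagation`, the iteration cap `max_iter` (added because convergence is not guaranteed) and the fixed random seed are assumptions of this example.

```python
import random
from collections import Counter


def label_propagation(gamma, max_iter=100, seed=0):
    """Synchronous label propagation with random tie breaking.

    gamma maps each node to the set of its neighbors. Node identifiers
    are used as initial labels and must be mutually comparable.
    """
    rng = random.Random(seed)
    labels = {v: v for v in gamma}  # one distinct label per node
    for _ in range(max_iter):       # cap: oscillations are possible
        new_labels = {}
        for v, nbrs in gamma.items():
            if not nbrs:            # an isolated node keeps its label
                new_labels[v] = labels[v]
                continue
            counts = Counter(labels[u] for u in nbrs)
            best = max(counts.values())
            candidates = sorted(l for l, c in counts.items() if c == best)
            new_labels[v] = rng.choice(candidates)  # random tie breaking
        if new_labels == labels:    # simplified stability check
            break
        labels = new_labels         # all nodes switch at the same time
    return labels


# Toy graph (hypothetical): two disconnected triangles.
gamma = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}, 3: {4, 5}, 4: {3, 5}, 5: {3, 4}}
labels = label_propagation(gamma)
communities = {}
for node, label in labels.items():
    communities.setdefault(label, set()).add(node)
```

Changing `seed` may change the returned partition, which is the robustness issue discussed above.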

3.1.4 Seed-centric approaches

The basic idea underlying seed-centric approaches is to identify some particular nodes in the target network, called seed nodes, around which local communities can be computed [32, 65, 76]. Algorithm 1 presents the general outline of a typical seed-centric community detection algorithm. We recognize three principal steps:

  1. Seed computation.

  2. Seed local community computation.

  3. Community computation from the set of local communities computed in the previous step.

[Algorithm 1: general outline of a typical seed-centric community detection algorithm]

Leader-driven algorithms constitute a special case of seed-centric approaches. Nodes of a network are classified into two (possibly overlapping) categories: leaders and followers. Leaders represent communities. An assignment step is applied to assign follower nodes to the most relevant communities. Different algorithms apply different node classification approaches and different node assignment strategies. Three different LdCD algorithms have been proposed almost simultaneously in three different works [32, 36]. Next, we briefly present the first two cited algorithms.

In [36], the authors propose an approach directly inspired by the K-means clustering algorithm [27]. The algorithm requires as input the number \(k\) of communities to identify. This is clearly a major disadvantage of the approach, as its authors admit. First, \(k\) nodes are selected randomly as leaders. Unselected nodes are labeled as followers. Leaders and followers hence form exclusive sets. Each leader node represents a community. Each follower node is assigned to the nearest leader node. Different levels of neighborhood are allowed. If no nearby leader is found, the follower node is labeled as an outlier. When all follower nodes are handled, the algorithm computes a new set of \(k\) leaders: for each community, the most central node is selected as a leader. The process is iterated with the new set of \(k\) leaders until the computed communities stabilize. The convergence speed depends on the quality of the initially selected \(k\) leaders. Different heuristics are proposed to improve the selection of the initial set of leaders. The best approach according to experimentation is to select the \(k\) nodes that have the highest degree centrality and that share few common neighbors.

The algorithm proposed in [76] is quite similar to our approach. It starts by computing the closeness centrality of all nodes. The closeness centrality of a node \(v\) is given by the inverse of the average distance to all other nodes in the network. Leaders will be any node whose closeness centrality is less than at least one of its neighbors. This heuristic results in a huge set of leaders. The list of leaders is sorted in decreasing order of closeness centrality. The list is then parsed, assigning to each leader the direct followers that are not already assigned to another leader. At the end, leaders that are not followed by any node are assigned to the community to which the majority of their direct neighbors belong.

3.2 Community evaluation approaches

The evaluation of the performance of community detection algorithms remains an open problem in spite of the huge amount of work in this area. Existing approaches can be divided into three main types:

  1. Evaluation on networks for which a ground-truth decomposition into communities is known.

  2. Evaluation as a function of the topological features of computed communities.

  3. Task-driven evaluation.

Next, we detail these different approaches.

3.2.1 Ground-truth comparison approaches

Networks with ground-truth partitions can be obtained by one of the following ways:

  • Annotation by experts: For some networks representing real systems, experts in the system field have been able to define the community structure. Examples of such networks are given in Sect. 5.1. In general, these networks are rather small (so that they can be handled by experts) and the defined community structure is usually given by a partition of the studied graph with no overlap among the defined communities.

  • Use of network generators: The idea here is to generate artificial networks with a predefined community structure. Some early work in this area is the Girvan–Newman benchmark graph [23]. A more sophisticated generator is proposed in [42], where the user can control different parameters of the network including the size, the density, the degree distribution law, the clustering coefficient, the distribution of community sizes as well as the separability of the obtained communities. While the approach is interesting, generated networks are not guaranteed to be similar enough to the complex networks observed in real-world applications.

  • Implicit community definition: This approach is based on inferring the community structure in a graph by applying simple rules that usually take the semantics of ties into account. For example, in [90] the authors define a community in the LiveJournal social network as a group of fans of a given artist. Communities in a co-authorship network of scientific publications are taken to be the authors participating in the same venue. The relevance of the proposed rules seems questionable.

When a ground-truth community structure is available, classical external clustering evaluation indices can be used to evaluate and compare community detection algorithms. Different clustering comparison or similarity functions have been proposed in the literature [2]. In this work, we apply two widely used indices: the Adjusted Rand Index (ARI) [30] and the Normalized Mutual Information (NMI) [79].

The ARI index is based on counting the number of pairs of elements that are clustered in the same clusters in both compared partitions. Let \(P_i=\{ P_i^1, \dots , P_i^l\}\), \(P_j=\{P_j^1, \dots , P_j^k\}\) be two partitions of a set of nodes \(V\). The set of all (unordered) pairs of nodes of \(V\) can be partitioned into the following four disjoint sets:

  • \(S_{11}\) = {pairs that are in the same cluster under \(P_i\) and \(P_j\)}

  • \(S_{00}\) = {pairs that are in different clusters under \(P_i\) and \(P_j\)}

  • \(S_{10}\) = {pairs that are in the same cluster under \(P_i\) but in different ones under \(P_j\)}

  • \(S_{01}\) = {pairs that are in different clusters under \(P_i\) but in the same under \(P_j\)}

Let \( n_{ab} = |S_{ab}|, a, b \in \{0, 1\}\), be the respective sizes of the above defined sets, and let \(n = |V|\). The Rand index, initially defined in [72], is simply given by:

$$\begin{aligned} \mathcal {R}(P_i, P_j) = \frac{2\times (n_{11}+n_{00})}{n \times (n-1)} \end{aligned}$$

In [30], authors show that the expected value of the Rand Index of two random partitions does not take a constant value (e.g. zero). They proposed an adjusted version which assumes a generalized hypergeometric distribution as null hypothesis: the two clusterings are drawn randomly with a fixed number of clusters and a fixed number of elements in each cluster (the number of clusters in the two clusterings need not be the same). Then the ARI is the normalized difference of the Rand Index and its expected value under the null hypothesis. It is defined as follows:

$$\begin{aligned} \mathrm{ARI}(P_i,P_j) = \frac{ \sum \nolimits _{x=1}^l\sum \nolimits _{y=1}^k \left( \! \begin{array}{c} | P_i^x \cap P_j^y |\\ 2 \end{array} \!\right) - t_3}{\frac{1}{2}(t_1+t_2) - t_3} \end{aligned}$$
(2)

where:

$$\begin{aligned} t_1 = \sum \limits _{x=1}^ l \left( \! \begin{array}{c} |P_i^x |\\ 2 \end{array} \!\right) ,\quad t_2 = \sum \limits _{y=1}^ k \left( \! \begin{array}{c} |P_j^y |\\ 2 \end{array} \!\right) ,\quad t_3 = \frac{2t_1t_2}{n(n-1)} \end{aligned}$$

This index has expected value zero for independent clusterings and maximum value 1 for identical clusterings.
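A minimal sketch of both indices, written directly from the formulas above, may help to check them on small examples. Partitions are assumed to be given as lists of node collections over the same node set; the function names are assumptions of this example.

```python
from math import comb


def _pair_stats(p_i, p_j):
    """Return n, t1, t2 and n11 (pairs clustered together in both partitions)."""
    n = sum(len(c) for c in p_i)
    t1 = sum(comb(len(c), 2) for c in p_i)
    t2 = sum(comb(len(c), 2) for c in p_j)
    n11 = sum(comb(len(set(a) & set(b)), 2) for a in p_i for b in p_j)
    return n, t1, t2, n11


def rand_index(p_i, p_j):
    n, t1, t2, n11 = _pair_stats(p_i, p_j)
    total = comb(n, 2)           # n(n-1)/2 unordered pairs
    n00 = total - t1 - t2 + n11  # pairs separated in both partitions
    return (n11 + n00) / total


def adjusted_rand_index(p_i, p_j):
    n, t1, t2, n11 = _pair_stats(p_i, p_j)
    t3 = 2 * t1 * t2 / (n * (n - 1))
    # Undefined (division by zero) when both partitions are trivial,
    # e.g. two partitions made only of singletons.
    return (n11 - t3) / (0.5 * (t1 + t2) - t3)


# Toy partitions (hypothetical) of the node set {0, ..., 5}.
truth = [[0, 1, 2], [3, 4, 5]]
found = [[0, 1], [2, 3], [4, 5]]
ri = rand_index(truth, found)
ari = adjusted_rand_index(truth, found)
```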

Another family of partition comparison functions is based on the notion of mutual information. A partition \(P\) is assimilated to a random variable. We seek to quantify how much the uncertainty about the cluster of a randomly picked element of \(V\) in a partition \(P_j\) is reduced if we know \(P_i\). The Shannon entropy of a partition \(P_i\) is given by:

$$\begin{aligned} H(P_i) = - \sum \limits _{x=1}^l \frac{|P_i^x|}{n}\mathrm{log}_2\left( \frac{|P_i^x|}{n}\right) \end{aligned}$$

Notice that \(\frac{|P_i^x|}{n}\) is the probability that a randomly picked element from \(V\) be clustered in \(P_i^x\). The mutual information between two random variables \(X\), \(Y\) is given by the general formula:

$$\begin{aligned} \mathrm{MI}(X,Y) = H(X) + H(Y) -H(X,Y) \end{aligned}$$
(3)

This can then be applied to measure the mutual information between two partitions \(P_i\), \(P_j\). The mutual information defines a metric on the space of all clusterings and is bounded by the entropies of involved partitions. In [79], authors propose a normalized version given by:

$$\begin{aligned} \mathrm{NMI}(X, Y) = \frac{\mathrm{MI}(X,Y)}{\sqrt{H(X) H(Y)}} \end{aligned}$$
(4)

Another normalized version is also proposed in [22]. Other similar information-based indices are also proposed [52, 60].
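The normalized mutual information of Eq. (4) can be sketched in a few lines. The mutual information is computed here in its equivalent joint-probability form, \(\sum_{x,y} p(x,y)\log_2 \frac{p(x,y)}{p(x)p(y)}\), which equals \(H(X) + H(Y) - H(X,Y)\). The function names are assumptions of this example.

```python
from math import log2, sqrt


def entropy(partition, n):
    """Shannon entropy of a partition of n elements."""
    return -sum(len(c) / n * log2(len(c) / n) for c in partition if c)


def mutual_information(p_i, p_j, n):
    mi = 0.0
    for a in p_i:
        for b in p_j:
            n_ab = len(set(a) & set(b))
            if n_ab:  # empty intersections contribute nothing
                mi += n_ab / n * log2(n * n_ab / (len(a) * len(b)))
    return mi


def nmi(p_i, p_j):
    """NMI = MI / sqrt(H(P_i) * H(P_j)).

    Undefined when one partition has a single cluster (zero entropy).
    """
    n = sum(len(c) for c in p_i)
    return mutual_information(p_i, p_j, n) / sqrt(
        entropy(p_i, n) * entropy(p_j, n)
    )


# Toy partitions (hypothetical): identical, then statistically independent.
same = nmi([[0, 1, 2], [3, 4, 5]], [[0, 1, 2], [3, 4, 5]])
independent = nmi([[0, 1], [2, 3]], [[0, 2], [1, 3]])
```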

3.2.2 Topological measures for community evaluation

Two types of topological measures can be used to evaluate the quality of a computed community structure:

  • Global measures that evaluate the quality of the computed partition as a whole. The modularity \(Q\) defined in [57] (see formula 1) is the most applied measure. Other modularity measures have also been proposed [51, 54]. However, the different modularity limitations discussed earlier (see Sect. 3.1.2) hinder the utility of using it as an evaluation metric.

  • Local topological measures. A number of local topological measures have been proposed to evaluate the quality of a given community. Most are used in the context of identifying ego-centered communities [4, 11]. In [90], the authors present an interesting survey of these measures. Let \(f(C)\) be a community evaluation measure. The quality of a partition \(\mathcal {P} = \{C_1, \dots , C_k\}\) is then simply given by the average:

    $$\begin{aligned} Q(\mathcal {P} )= \frac{\sum \nolimits _{i=1}^{k} f(C_i)}{| \mathcal {P} |} \end{aligned}$$
    (5)

3.2.3 Task-driven evaluation

The principle of task-driven evaluation is the following: Let \(T\) be a task where community detection can be applied. Let \({per}(T,{Algo}_{com}^x)\) be a performance measure for the execution of \(T\) when applying the community detection algorithm \({Algo}_{{com}}^x\). We can then compare the performances of different community detection algorithms by comparing the induced \(per(T, {Algo}_{{com}}^x)\) values. In [66], the authors propose to use the recommendation task for evaluation purposes. In this work, we propose using data clustering as an evaluation task.

4 The LICOD approach

4.1 Informal presentation

The basic idea underlying the proposed algorithm is that a community is composed of two types of nodes: leaders and followers. Algorithm 2 sketches the general outline of the proposed approach. The algorithm works as follows:

  1. First, it searches for nodes in the network that are likely to be leaders in a community. Different node ranking metrics can be used to estimate the role of a node. These include the classical centrality metrics. Let \(\mathcal {L}\) be the set of identified leaders. In Algorithm 2, this step is achieved by the function \(isLeader()\) (line 3).

  2. The list \(\mathcal {L}\) is then reduced by grouping leaders that are estimated to be in the same community. This is the task of the function \(computeCommunitiesLeader()\), line 7 in Algorithm 2. Let \(\mathcal {C}\) be the set of identified communities.

  3. Each node in the network (a leader or a follower) computes its membership degree to each community in \(\mathcal {C}\). A ranked list of communities can then be obtained, for each node, where communities with the highest membership degree are ranked first (lines 9–13 in Algorithm 2).

  4. Next, each node adjusts its community membership preference list by merging it with the preference lists of its direct neighbors in the network. Different strategies borrowed from social choice theory can be applied here to merge the different preference lists. This step is iterated until the ranked lists obtained at each node stabilize. The convergence towards a stable state depends on the applied voting scheme.

  5. Lastly, each node is assigned to the top-ranked communities in its final membership preference list.

[Algorithm 2: general outline of the LICOD approach]

The local voting process is intended to ensure local homogeneity in node membership to the different communities. Notice that the algorithm is designed as a general framework that allows testing different working hypotheses: How to select leaders? How to compute community membership? And how to merge the preferences of linked nodes? Next we describe possible choices for implementing each step.

4.2 Implementation issues

The LICOD algorithm is implemented using the igraph graph analysis toolkit [15]. We give next some details about the implementation of each of the main steps of the proposed algorithm.

4.2.1 Function \(isLeader()\)

One simple idea to distinguish leaders from follower nodes is to compare nodes centralities. Actually, leader nodes are expected to have higher centrality (whatever the centrality is) than ordinary nodes. Different centrality measures can be used. In our experiments, we have tested the following two basic centralities:

  • Degree centrality (denoted \(dc\)): This is given by the proportion of nodes directly connected to the target node. Formally, the degree centrality of a node \(v\) is given by: \( \mathrm{dc}(v) = \frac{d_G(v)}{n_G -1}\). The computation complexity is \(\mathcal {O}(n_G)\).

  • Betweenness centrality \(\mathrm{BC}(v)\): This is given by the fraction of all-pairs shortest paths that pass through the target node. Formally, the betweenness centrality of a node \(v\) is given by \(\mathrm{BC}(v) = \sum _{s,t \in V} \frac{\sigma (s,t | v)}{\sigma (s,t)}\) where \(\sigma (s,t)\) is the number of shortest paths linking \(s\) to \(t\), and \(\sigma (s,t | v)\) is the number of those shortest paths that pass through a node \(v\) other than \(s\) and \(t\). The best known algorithm for computing this centrality has a computation complexity of \(\mathcal {O}(n_G \cdot m_G + (n_G)^2\mathrm{log}(n_G))\) [9].

The first centrality is a locally computed metric while the latter captures global properties of the network. A node is identified as a leader if its centrality is greater than or equal to that of a fraction \(\sigma \in [0, 1]\) of its neighbors. The rationale behind introducing the \(\sigma \) parameter is to be able to recover leaders connected to other leaders. Notice that the number of leaders depends on the value of the threshold \(\sigma \): the higher \(\sigma \) is, the fewer leaders there are.
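The leader test can be sketched as follows with the degree centrality. This is an illustrative reading of the rule above, not the igraph-based implementation: the function names, the treatment of isolated nodes and the toy graph are assumptions of this example.

```python
def degree_centrality(gamma):
    """dc(v) = d(v) / (n_G - 1) for every node of the graph."""
    n = len(gamma)
    return {v: len(nb) / (n - 1) for v, nb in gamma.items()}


def is_leader(v, gamma, centrality, sigma):
    """True if v's centrality is >= that of at least a fraction sigma
    of its neighbors."""
    nbrs = gamma[v]
    if not nbrs:
        return False  # assumption: an isolated node is not a leader
    dominated = sum(1 for u in nbrs if centrality[v] >= centrality[u])
    return dominated / len(nbrs) >= sigma


# Toy graph (hypothetical): a triangle a-b-c plus a pendant node d attached to c.
gamma = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
dc = degree_centrality(gamma)
strict_leaders = {v for v in gamma if is_leader(v, gamma, dc, sigma=1.0)}
loose_leaders = {v for v in gamma if is_leader(v, gamma, dc, sigma=0.5)}
```

On this toy graph the strict threshold keeps only the hub `c`, while the looser one also accepts `a` and `b`, which illustrates how the threshold controls the number of leaders.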

4.2.2 Function \(computecommunitiesleaders\)

Two leaders are grouped in the same community if the ratio of their common neighbors to their total number of neighbors is above a given threshold \(\delta \in [0,1]\). The pair \((\sigma , \delta )\) determines, to some extent, the number of communities detected by the algorithm.
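A possible reading of this grouping rule is sketched below. Two points are assumptions rather than statements of the original implementation: the "total number of neighbors" is taken to be the size of the union of the two neighborhoods, and grouping is made transitive through a small union-find structure. All names and the toy graph are hypothetical.

```python
from itertools import combinations


def neighbor_overlap(a, b, adj):
    """Common neighbors of a and b divided by the size of the union of their neighborhoods."""
    union = adj[a] | adj[b]
    return len(adj[a] & adj[b]) / len(union) if union else 0.0


def group_leaders(leaders, adj, delta):
    """Group leaders whose overlap is above delta; grouping is transitive (assumption)."""
    parent = {x: x for x in leaders}

    def find(x):
        # Follow parent links to the representative, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(leaders, 2):
        if neighbor_overlap(a, b, adj) > delta:
            parent[find(a)] = find(b)

    groups = {}
    for x in leaders:
        groups.setdefault(find(x), []).append(x)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical graph: leaders 0 and 1 share three of their four neighbors, leader 2 shares none.
adj = {
    0: {3, 4, 5}, 1: {3, 4, 5, 6}, 2: {7, 8},
    3: {0, 1}, 4: {0, 1}, 5: {0, 1}, 6: {1}, 7: {2}, 8: {2},
}
loose = group_leaders([0, 1, 2], adj, 0.5)
strict = group_leaders([0, 1, 2], adj, 0.9)
print(loose, strict)
```

Here leaders 0 and 1 have an overlap of 3/4, so they are merged for \(\delta = 0.5\) but kept apart for \(\delta = 0.9\).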

4.2.3 Function \(membership(v,c)\)

We propose to measure the membership degree of a node \(v\) to a community \(c\) by the inverse of one plus the length of the shortest path that links \(v\) to the nearest leader of \(c\):

$$\begin{aligned} membership(v, c ) = \frac{1}{(min_{x \in COM(c)} SPath(v,x)) + 1} \end{aligned}$$
(6)

It is easy to see that, since a shortest path has length at most \(Diameter(G)\), the previous function takes values in the range \(\left[ \frac{1}{Diameter(G)+1}, 1 \right] \). The diameter of a graph is the maximum shortest-path length over all pairs of nodes. Notice also that for a community \(c\), the membership of all its leaders is equal to \(1\).
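Equation (6) can be evaluated with a breadth-first search that stops at the first leader reached. The sketch below assumes an unweighted graph stored as an adjacency dictionary, and returns 0 when no leader of the community is reachable; both the function signature and that convention are assumptions of this example.

```python
from collections import deque


def membership(v, community_leaders, adj):
    """1 / (d + 1), where d is the hop distance from v to its nearest leader of the community."""
    targets = set(community_leaders)
    seen = {v}
    queue = deque([(v, 0)])
    while queue:
        node, dist = queue.popleft()
        if node in targets:
            return 1.0 / (dist + 1)  # BFS reaches the nearest leader first
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return 0.0  # assumption: no leader of the community is reachable from v


# Hypothetical path graph 0 - 1 - 2 - 3.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
print(membership(0, [0], path), membership(3, [0], path), membership(2, [0, 3], path))
```

On this path, a leader gets membership 1, the node three hops away from the single leader 0 gets 1/4, and node 2 gets 1/2 for a community led by nodes 0 and 3.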

4.2.4 Rank aggregation approaches

Let \(S\) be a set of elements to be ranked by a set of \(m\) rankers. We denote by \(S^{r_i}\) the ranking provided by ranker \(r_i\), so that \(\{S^{r_1}, \dots , S^{r_m}\}\) is the set of all rankings provided by the \(m\) rankers. Notice that each list \(S^{r_i}\) is a permutation of the elements of \(S\). An optimal ensemble ranking approach seeks a permutation \(\sigma \) that has the minimum number of pairwise disagreements with all input rankings \(S^{r_i}\) [3, 12, 19, 80]. The Kendall tau distance counts the pairwise disagreements between two rankings \(\pi \) and \(\sigma \) defined over the same set of elements \(S\). It is formally defined as follows:

$$\begin{aligned} \mathcal {K}(\pi ,\sigma ) = \sum \limits _{ x,y \in S } d_{\pi ,\sigma }(x,y) \end{aligned}$$
(7)

where:

$$\begin{aligned} d_{\pi ,\sigma }(x,y) = {\left\{ \begin{array}{ll} 0 & \text{if } \pi \text{ and } \sigma \text{ rank } x \text{ and } y \text{ in the same order}\\ 1 & \text{otherwise} \end{array}\right. } \end{aligned}$$
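As an illustration of Eq. (7), the following sketch counts the disagreements between two rankings given as lists ordered from best to worst. It assumes that each unordered pair \(\{x, y\}\) is counted once and that both rankings are complete permutations of the same elements.

```python
from itertools import combinations


def kendall_tau(pi, sigma):
    """Number of unordered pairs {x, y} that pi and sigma rank in opposite order."""
    pos_pi = {x: i for i, x in enumerate(pi)}
    pos_sigma = {x: i for i, x in enumerate(sigma)}
    return sum(
        1
        for x, y in combinations(pi, 2)
        # A negative product means the two rankings order x and y differently.
        if (pos_pi[x] - pos_pi[y]) * (pos_sigma[x] - pos_sigma[y]) < 0
    )


print(kendall_tau(["a", "b", "c"], ["a", "c", "b"]))
print(kendall_tau(["a", "b", "c"], ["c", "b", "a"]))
```

Swapping two adjacent elements gives a distance of 1, while fully reversing a ranking of three elements gives the maximum distance of 3.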

This problem has been extensively studied in the context of social choice algorithms [3]. Early work on this problem goes back to the time of the French Revolution, with Borda [8] and the Marquis de Condorcet [16] striving to define a fair election rule. Rank aggregation approaches can be classified into two classes: position-based approaches and order-based ones [12].

One well-known position-based method is Borda’s method [8]: a Borda score is computed for each element in the lists. For a set of complete ranked lists \(L = [L_1,L_2,L_3,\dots ,L_k]\), the Borda score of an element \(i\) in a list \(L_t\) is given by: \( B_{L_t}(i) = \mathrm{count}\{ j \,|\, L_t(j) < L_t(i)\, \& \,j \in L_t \}\). The total Borda score of an element is then: \(B(i) = \sum _{t=1}^{k} B_{L_t} (i)\). Elements are sorted according to their total Borda score, with random selection in case of ties.
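A minimal sketch of this scoring scheme follows. It assumes that \(L_t(i)\) denotes the position of element \(i\) in list \(L_t\) (0 for the top element), so that the score counts the elements placed before \(i\) and a smaller total means a better rank. For reproducibility the sketch breaks ties by the order of the first list instead of at random. The labels c1 to c3 are hypothetical.

```python
def borda_aggregate(rankings):
    """Aggregate complete rankings (best first) by total Borda score."""
    totals = {}
    for ranking in rankings:
        for position, item in enumerate(ranking):
            # position equals the number of items placed before `item` in this list
            totals[item] = totals.get(item, 0) + position
    # Smaller total = ranked higher overall. sorted() is stable, so tied items
    # keep the order of the first list (the text breaks ties at random).
    order = sorted(rankings[0], key=lambda item: totals[item])
    return order, totals


rankings = [["c1", "c2", "c3"], ["c2", "c1", "c3"], ["c1", "c3", "c2"]]
order, totals = borda_aggregate(rankings)
print(order, totals)
```

With these three lists the totals are 1, 3 and 5 for c1, c2 and c3, so the aggregated ranking is c1, c2, c3.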

Kemeny approaches are well-known order-based approaches. A Kemeny optimal aggregation [35] is an aggregation that has the minimum number of pairwise disagreements with the input rankings, as computed by the Kendall tau distance [43]. Computing an optimal Kemeny aggregation is NP-hard as soon as four input lists are aggregated. Different approximate Kemeny aggregation approaches have been proposed in the literature. The basic idea of all of them is to sort the candidate list using a standard sorting algorithm, but with a non-transitive comparison relation between candidates. This relation is the following: \(s_i\) is preferred to \(s_j\), noted \(s_i \succ s_j\), if the majority of rankers rank \(s_i\) before \(s_j\). Since the \(\succ \) relation is not transitive, different sorting algorithms will provide different rank aggregations with different properties. In [19] the authors propose a local Kemeny aggregation that applies a bubble sort algorithm. In [53] the authors propose an approximate Kemeny aggregation that applies a quick sort algorithm.
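The bubble-sort variant can be sketched as follows. The choice of the initial order and the use of a strict majority are assumptions of this example, and the function names are hypothetical. Each swap removes exactly one pair on which the current order contradicts the majority, so the loop terminates even when \(\succ \) contains cycles.

```python
def prefers(a, b, rankings):
    """True if a strict majority of rankers place a before b."""
    wins = sum(1 for r in rankings if r.index(a) < r.index(b))
    return wins > len(rankings) / 2


def local_kemeny(initial, rankings):
    """Bubble sort `initial`, swapping adjacent items when the majority prefers the later one."""
    order = list(initial)
    swapped = True
    while swapped:
        swapped = False
        for i in range(len(order) - 1):
            if prefers(order[i + 1], order[i], rankings):
                order[i], order[i + 1] = order[i + 1], order[i]
                swapped = True
    return order


rankings = [["a", "b", "c"], ["b", "a", "c"], ["b", "c", "a"]]
result = local_kemeny(["c", "a", "b"], rankings)
print(result)
```

In this example the majority prefers b to a, a to c and b to c, and the sort returns b, a, c from the arbitrary starting order c, a, b.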

4.2.5 Community assignment

A node \(v\) is assigned to the top-ranked communities in its final community preference list \(P_v^*\). As shown in lines 22–26 of Algorithm 2, a node is assigned simultaneously to all communities for which its membership is within \(\epsilon \) of its membership degree to the top-ranked community. The \(\epsilon \) threshold controls the desired degree of overlap between the identified communities. However, setting \(\epsilon \) to \(0\) may still result in overlapping communities, since a given node may have the same membership degree to different communities.
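The assignment rule can be illustrated with the short sketch below. It is not a transcription of Algorithm 2: the function name, the data layout (a ranked list plus a dictionary of membership degrees) and the community labels are assumptions made for the example.

```python
def assign_communities(ranked, membership, epsilon):
    """Keep every community whose membership is within epsilon of the top-ranked one."""
    top = membership[ranked[0]]
    return [c for c in ranked if abs(top - membership[c]) <= epsilon]


# Hypothetical preference list of one node, with its membership degrees.
ranked = ["c1", "c2", "c3"]
degrees = {"c1": 0.5, "c2": 0.5, "c3": 0.25}
print(assign_communities(ranked, degrees, 0.0))
print(assign_communities(ranked, degrees, 0.25))
```

With \(\epsilon = 0\) the node still belongs to both c1 and c2 because their membership degrees are tied, while \(\epsilon = 0.25\) also admits c3.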

5 Experimentation

5.1 Evaluation on benchmark networks

In a first experiment, we evaluate the proposed approach on a set of four widely used benchmark networks for which a ground-truth decomposition into communities is known. These networks are the following:

  • Zachary’s karate club: This network is a social network of friendships between 34 members of a karate club at a US university in the 1970s [91]. Following a dispute between the club’s administrator and the club’s instructor, the network split into two groups. The dispute ended with the instructor creating his own club and taking about half of the initial club with him. The network can hence be divided into two main communities.

  • Dolphins social network: This network is an undirected social network resulting from observations of a community of 62 dolphins over a period of 7 years [50]. Nodes represent dolphins and edges represent frequent associations between dolphin pairs occurring more often than expected by chance. Analysis of the data revealed two main groups.

  • American college football dataset: This dataset contains the network of American football games [23]. The 115 nodes represent teams and the edges represent games between two teams. The teams are divided into 12 groups containing around 8–12 teams each, and games are more frequent between members of the same group. Also, teams that are geographically close but belong to different groups are more likely to play one another than teams separated by a large distance. Therefore, in this dataset groups can be considered as known communities.

  • American political books: This is a political books co-purchasing network. Nodes represent books about US politics sold by the online bookseller Amazon.com. Edges represent frequent co-purchasing of books by the same buyers, as indicated by the “customers who bought this book also bought these other books” feature on Amazon. Books are classified into three disjoint classes: liberal, neutral or conservative. The classification was made separately by Mark Newman based on a reading of the descriptions and reviews of the books posted on Amazon.

Figure 1 shows the structure of the selected networks with real communities indicated by the color code. Table 1 gives the basic characteristics of these networks.

Fig. 1 Real community structure of the four selected benchmark networks: Zachary karate club network [91], college football network [23], US politics books network [38], dolphins social network [50]

Table 1 Basic topological characteristics of selected benchmark networks

For each network we have applied the proposed algorithm by changing the configuration parameters as follows:

  • Centrality metrics = [Degree centrality (dc), Betweenness centrality (BC)]

  • Voting method = [Borda, Local Kemeny]

  • \(\sigma \in [0.5, 0.6, 0.7, 0.8, 0.9, 1.0]\)

  • \(\delta \in [0.5, 0.6, 0.7, 0.8, 0.9, 1.0]\)

  • \(\epsilon \in [0.0, 0.1, 0.2]\)

For each configuration, we compute the NMI, the ARI and the modularity Q. Figures 2, 3, 4 and 5 show the variation of these metrics, for each dataset, as a function of \(\sigma \). We do not show the results for different values of \(\delta \) since, on these datasets, the \(\delta \) value has a negligible impact on the obtained results. The same effect was observed for the \(\epsilon \) parameter. Each figure contains four plots showing the variation of NMI, ARI and Q, one for each of the four possible configurations defined by the choice of centrality and voting method.

Fig. 2 Performance of LICOD on the Zachary karate club network as a function of \(\sigma \), in terms of NMI, ARI and the modularity \(Q\)

Fig. 3 Performance of LICOD on the American college football network as a function of \(\sigma \), in terms of NMI, ARI and the modularity \(Q\)

Fig. 4 Performance of LICOD on the American political books network as a function of \(\sigma \), in terms of NMI, ARI and the modularity \(Q\)

Fig. 5 Performance of LICOD on the dolphins social network as a function of \(\sigma \), in terms of NMI, ARI and the modularity \(Q\)

These results show that using the betweenness centrality slightly accelerates convergence toward the best attainable values. The local Kemeny voting method outperforms Borda only in the case of the football network, and gives comparable results for the US Politics network. Borda gives good results only for the Dolphins network, again when using the betweenness centrality.

Increasing \(\epsilon \) decreases the NMI and the ARI. This can be explained by the fact that a high value of \(\epsilon \) increases the degree of overlap between the obtained communities, whereas the real communities considered here are all disjoint.

The best results are obtained for \(\sigma \) around \(0.8\) to \(0.9\). This argues for the validity of introducing the \(\sigma \) threshold rather than considering only the extreme case where a node qualifies as a leader only if it has the highest centrality in its direct neighborhood. We notice that the shape of the curves differs from one network to another, which is closely related to the specificities of each network. Choosing a configuration of the proposed algorithm according to the properties of the target network is an interesting topic to address.

We also compared the results of our algorithm with results obtained by well-known algorithms: The Newman–Girvan algorithm [58], the WalkTrap algorithm [70] and the Louvain algorithm [7]. The configuration adopted for LICOD is the following: Centrality metric is betweenness centrality, Voting method is local Kemeny, \(\sigma =\delta =0.9\), and \(\epsilon =0\). Table 2 gives obtained results on the four datasets.

Table 2 Comparison of performances of different community detection algorithms

These results show that LICOD performs better than the other algorithms on both the Zachary and the US Politics networks. It also gives competitive results on the other two networks. Its lower ranking there could be explained by the absence of leaders in these two networks, which makes the community detection task more difficult.

These results also show that the best modularity does not correspond to the best decomposition into communities as measured by both NMI and ARI. For instance, the Louvain method always obtains the best modularity (even better than the modularity of the ground-truth decomposition); however, it is not ranked first according to NMI. The best results are obtained by our approach for high values of \(\sigma \).

5.2 Data clustering-driven evaluation

We propose here to use the task of data clustering to perform a task-driven evaluation of community detection algorithms. The basic idea is to transform a data clustering problem into a community detection one. Some earlier work has already applied community detection algorithms to the clustering task [17]. Figure 6 illustrates the overall approach. First, a relative neighborhood graph (RNG), as defined in [84], is constructed over the set of items to cluster. The choice of the RNG is motivated by the topological characteristics of these graphs, which are connected and sparse. To build an RNG, we first compute a matrix of pairwise distances between the items in the dataset (Fig. 7). This results in a symmetric square matrix of size \(n \times n\) where \(n\) is the number of items in the dataset. An RNG is defined by the following simple construction rule: two points \(x_i\) and \(x_j\) are connected by an edge if they satisfy the following property:

$$\begin{aligned} d(x_i,x_j) \le \max \{d(x_i,x_l),d(x_j,x_l)\},\quad \forall l \ne i,j \end{aligned}$$
(8)

where \(d(x_i,x_j)\) is the distance function. A community detection algorithm is then applied to the obtained graph to cluster the given examples. Clustering evaluation criteria can then be used to compare different algorithms.
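The construction rule of Eq. (8) translates directly into a cubic-time test over a precomputed distance matrix. The sketch below is only an illustration: the function name is hypothetical, and the three points on a line with the absolute difference as distance are an arbitrary toy input.

```python
def relative_neighborhood_graph(dist):
    """Edges (i, j) such that no third point l is strictly closer to both i and j than d(i, j)."""
    n = len(dist)
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            if all(
                dist[i][j] <= max(dist[i][l], dist[j][l])
                for l in range(n)
                if l != i and l != j
            ):
                edges.append((i, j))
    return edges


# Hypothetical data: three points on a line, compared with the absolute difference.
points = [0.0, 1.0, 3.0]
dist = [[abs(p - q) for q in points] for p in points]
edges = relative_neighborhood_graph(dist)
print(edges)
```

The two outer points are not connected because the middle point is closer to both of them than they are to each other, so the resulting graph is the path 0, 1, 2.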

Fig. 6 Applying community detection to data clustering

Fig. 7 Example of the generation of an RNG from a cloud of data points: \(\alpha \) and \(\beta \) are relative neighbors because there is no other node in the intersection of the two circles centered, respectively, at \(\alpha \) and \(\beta \) and with radius \(d(\alpha ,\beta )\)

We have tested our approach on five classical datasets publicly available from the UCI website (Footnote 1). The selected datasets are briefly described in Table 3.

Table 3 Characteristics of used datasets

We have constructed the different RNG graphs on these datasets using the classical distance functions listed in Table 4.

Table 4 Applied basic distance functions

Table 5 shows the basic topological characteristics of the obtained graphs. We can see that these graphs have some characteristics of real networks, such as a small diameter and a low density. However, the Chebyshev distance induces dense graphs, although the obtained clustering coefficient is also high. We have also obtained graphs with a relatively high transitivity.

Table 5 Topological characteristics of obtained RNG graphs

Based on these results, we have applied the community detection algorithms to the RNG graphs defined by the Cosine distance function. We apply four different community detection algorithms to these graphs: Louvain [7], the Newman–Girvan algorithm, the Walktrap algorithm and LICOD. Results are evaluated in terms of NMI and ARI, computed with respect to the real classes defined in each dataset. We also compute the modularity Q to show that it does not always reflect the true quality of the detected communities. The results given in Table 6 show that LICOD is ranked first on two datasets, wine and abalone. It gives competitive results on the other datasets.

Table 6 Performance of LICOD vs Louvain, Walktrap, Newman–Girvan algorithms

6 Conclusion

In this work, we contribute to the state of the art on community detection in complex networks by:

  • Providing a new efficient algorithm for computing (possibly overlapping) communities.

  • Proposing a new approach for qualitative community evaluation using classical data clustering tasks.

Results obtained on both small benchmark social networks and clustering problems argue for the capacity of the approach to detect real communities. Future developments we are working on include: testing the algorithm on large-scale networks, developing a fully distributed self-stabilizing version that exploits the fact that most computations are made locally, and finally adapting the approach to K-partite and multiplex networks [28].