Functional modules are groups of genes or proteins involved in common elementary biological functions. Proteins are also known to interact with each other by forming complexes, and each such complex performs an independent and discrete biological function through the interactions of its member proteins [1]. A single protein may also participate in more than one complex or functional module. Functional modules and protein complexes correspond to modules, that is, dense subgraphs within protein interaction networks (PINs), and hence can be discovered by appropriate network clustering approaches. Generally speaking, modules in PINs are highly connected subgraphs that have more internal edges than external edges. Many definitions of modules have been proposed in the literature [2], and consequently different community detection algorithms have been proposed based on these different definitions.

Module detection in PINs is a computationally hard task, and conventional clustering algorithms are not well suited for it [3, 4]. Efficient, accurate, robust, and scalable methods are therefore required for mining large PINs [5-8]. There are generally three classes of module detection approaches: 1) those based on finding cliques, which are fully connected sub-networks [9, 10]; 2) those based on detecting dense subnetworks [11, 12], not necessarily cliques; and 3) those based on uncovering the hierarchical organization of modules within PINs [13, 14]. Clique-based techniques do not scale well to large PINs, and the clique requirement is too strict in the biological sense, since proteins participating in a complex may not all interact with each other. Current density-based algorithms commonly misclassify proteins with low degree into small clusters that could instead be merged into core protein clusters [15]. Moreover, many biologically meaningful modules are ignored due to their low topological connectivity [15].

Hierarchical clustering methods based on a global metric over nodes or edges, such as betweenness centrality, are very time-consuming, and thus do not scale well to large PINs. The few hierarchical approaches based on a local metric also share the problem of classifying very low-degree vertices into separate clusters, which does not make sense biologically. Another major issue with current hierarchical clustering approaches is their inability to perform well on noisy data. This is generally the situation when clustering PIN data generated from large-scale high-throughput experiments. As discussed in [16, 17], such PIN data usually contain many false positive interactions, and hence care must be taken to deal with the sensitivity of hierarchical methods to such data.

The majority of the clustering methods proposed in the literature have focused on identifying non-overlapping communities. However, it is well recognized that complex networks contain multi-class nodes, that is, vertices belonging to many communities at once. Overlapping clustering algorithms have neither been intensively studied nor been successful at finding good subnetworks, although they first appeared three decades ago; see the extensive review of overlapping methods in [18]. Multi-functional proteins are proteins which perform several functions and interact specifically with distinct sets of protein partners, simultaneously or not, depending on the function being performed. Such proteins are thus involved in many functional modules or protein complexes, and hence it is reasonable to assume that PINs have overlapping communities, each containing some multi-functional proteins. A few successful hierarchical clustering approaches, such as the Overlapping Cluster Generator (OCG) algorithm of [19] and the Link Communities method of [20] (to cite just two), have recently been proposed with the aim of identifying overlapping protein communities as well as multi-functional proteins from PINs.

In this paper, we propose a fast agglomerative clustering technique, FAC-PIN, which addresses the issues and limitations discussed above for hierarchical algorithms. FAC-PIN is based on a local similarity pre-metric of relative vertex-to-vertex clustering value for clustering PINs in an agglomerative hierarchical manner.

Related works

Many hierarchical clustering approaches (both agglomerative and divisive) have been introduced in the literature since the original publication of [21] for clustering networks; see the excellent survey on graph clustering algorithms in [22]. We therefore present only the few methods that are directly related to our proposed agglomerative approach.

An effective hierarchical technique for clustering large networks was first proposed by [21]. The Girvan-Newman (GN) algorithm [21] first computes the edge-betweenness centrality of each edge; this is a global metric over the edges, defined as the number of shortest paths containing a given edge. GN then sorts the edges and iteratively removes those with the largest betweenness values in order to detect the communities, since such edges correspond to bridges connecting two modules, whereas low-betweenness edges are internal to modules. To increase the computational speed of GN, [23] made a simple but non-trivial modification to the computation of the modularity function used in GN. [15] defined the degree of a subnetwork $S$ as the number of edges with one endpoint inside $S$ and the other endpoint outside $S$. The degree of subnetworks was used along with the edge-betweenness values to devise an agglomerative method for module discovery. [14] developed a fast agglomerative approach for community detection based on a global centrality measure, the vertex clustering coefficient, which is defined as the ratio of the number of edges between the neighbors of a given vertex $v$ to the total number of possible edges in that neighborhood; it measures the degree of completeness of the subnetwork defined by $v$ and its neighbors [24]. [2] designed an agglomerative technique based on the clustering coefficient of an edge; the edge clustering coefficient extends the vertex clustering coefficient and is a global measure defined as the number of triangles to which a given edge $e = (u, v)$ belongs, divided by the number of triangles that might potentially include $(u, v)$. That is:

$$C_{u,v}^{(3)} = \frac{Z_{u,v}^{(3)}}{\min\{(k_u - 1), (k_v - 1)\}}, \tag{1}$$

where $k_a$ is the degree of a vertex $a$, $Z_{u,v}^{(3)}$ is the number of triangles containing edge $(u, v)$, and $\min\{(k_u - 1), (k_v - 1)\}$ is the maximal possible number of triangles containing $(u, v)$. This coefficient has been further generalized to higher-order cycles, $C_{u,v}^{(k)}$, such as squares for $k = 4$, $C_{u,v}^{(4)}$. Edges contained in few or no triangles have low clustering coefficients, and hence correspond to bridges connecting two clusters. The edge clustering coefficient assumes the existence of cycles of length $k$ in a network, which is problematic since a network can have many cycles of different lengths and the length distribution is unknown (e.g., there may be very few or very many short cycles). For this reason, [25] defined a local node similarity metric over the edges, the edge clustering value, which is not based on cycles but on the common neighbors of the two endpoints of edge $(u, v)$. The edge clustering value is defined as:

$$ECV(u, v) = \frac{|N_u \cap N_v|^2}{|N_u| \, |N_v|}, \tag{2}$$

where $N_a$ is the set of neighbors of a vertex $a$ and $|N_a|$ is its cardinality. Here, the endpoint vertices of an edge $(u, v)$ with a larger clustering value are more likely to be in the same cluster. Using the edge clustering value, [25] devised an agglomerative technique, the HC-PIN algorithm, for discovering the modules of a PIN, which is faster and more accurate than current hierarchical algorithms for network clustering. The edge clustering objectives in Equations (1) and (2) do not take into account the reliability of interactions in the presence of false positives in PIN data, and hence may yield incorrect clustering results. In this regard, [25] modified the objective of Equation (2) to account for noise in the PIN data, as

$$ECV_w(u, v) = \frac{\sum_{k \in I_{u,v}} w(u, k) \cdot \sum_{k \in I_{u,v}} w(v, k)}{\sum_{s \in N_u} w(u, s) \cdot \sum_{t \in N_v} w(v, t)}, \tag{3}$$

where $I_{u,v} = N_u \cap N_v$, and $0 \le w(a, b) \le 1$ is the weight assigned to the edge $(a, b)$, which represents the reliability of the interaction between vertices $a$ and $b$, or the probability of their interaction being a true positive. Clearly, Equation (2) is a special case of Equation (3) for a weighted undirected graph with $w(a, b) = 1$ for all edges $(a, b)$. In Equations (1)-(3), two vertices connected by an edge with a larger objective value are more likely to lie in the same module.
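To make Equations (1)-(3) concrete, the following minimal Python sketch evaluates them on a toy network. The adjacency-dictionary representation, the `frozenset`-keyed weight table, the function names, and the convention of returning 0 for Equation (1) when no triangle is possible are all illustrative assumptions; none of them comes from [2] or [25].

```python
from typing import Dict, FrozenSet, Hashable, Set

Graph = Dict[Hashable, Set[Hashable]]         # vertex -> set of neighbours
Weights = Dict[FrozenSet[Hashable], float]    # frozenset({a, b}) -> w(a, b)


def edge_clustering_coefficient(adj: Graph, u, v) -> float:
    """Equation (1): triangles through (u, v) over the maximum possible number."""
    triangles = len(adj[u] & adj[v])                       # Z^(3)_{u,v}
    max_triangles = min(len(adj[u]) - 1, len(adj[v]) - 1)
    # Assumption: define the coefficient as 0 when no triangle is possible.
    return triangles / max_triangles if max_triangles > 0 else 0.0


def ecv(adj: Graph, u, v) -> float:
    """Equation (2): |N_u ∩ N_v|^2 / (|N_u| |N_v|) for an edge (u, v)."""
    common = len(adj[u] & adj[v])
    return common ** 2 / (len(adj[u]) * len(adj[v]))


def ecv_weighted(adj: Graph, w: Weights, u, v) -> float:
    """Equation (3): weighted edge clustering value for an edge (u, v)."""
    common = adj[u] & adj[v]                               # I_{u,v}
    num = (sum(w[frozenset((u, k))] for k in common)
           * sum(w[frozenset((v, k))] for k in common))
    den = (sum(w[frozenset((u, s))] for s in adj[u])
           * sum(w[frozenset((v, t))] for t in adj[v]))
    return num / den


# Toy network (hypothetical): triangle a-b-c with a pendant vertex d on c.
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
unit_w = {frozenset((x, y)): 1.0 for x in adj for y in adj[x]}
```

With unit weights, `ecv_weighted` coincides with `ecv`, which mirrors the statement that Equation (2) is a special case of Equation (3); the pendant edge (c, d) scores 0 under both measures because its endpoints share no neighbor.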

Recently, while finalizing this manuscript, we were made aware of a hierarchical approach introduced in [20] which focuses on grouping links (i.e., edges) rather than vertices, in contrast to the existing literature, which has almost entirely focused on grouping nodes. It is well known that communities in complex networks often overlap, such that nodes belong to several groups at once, and these groups, in turn, are known to be organized into hierarchical structures. It has therefore proved difficult for node-focused community detection methods to accurately identify relevant functional modules, because of the hierarchical structures of the overlapping groups. Let $N_a^+$ denote the set consisting of node $a$ and its neighbors, and let $e_{a,b}$ denote the edge $(a, b)$. Then, by defining network communities as groups of links rather than groups of vertices, [20] proposed the following similarity function for link pairs that share a node $k$ in an undirected unweighted network

$$S(e_{u,k}, e_{v,k}) = \frac{|N_u^+ \cap N_v^+|}{|N_u^+ \cup N_v^+|}, \tag{4}$$

and applied a simple single-linkage hierarchical clustering algorithm to build a link dendrogram from Equation (4), which yields link communities with the best edge partition density. By identifying such non-overlapping link communities, [20] detected hierarchically organized node community structures with pervasive overlap.
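As an illustration of the link similarity in Equation (4), the sketch below computes it for two links sharing a node. The adjacency-dictionary representation and the function name are assumptions made for this example and are not taken from [20]; the single-linkage clustering step is not shown.

```python
from typing import Dict, Hashable, Set

Graph = Dict[Hashable, Set[Hashable]]  # vertex -> set of neighbours


def link_similarity(adj: Graph, u, v, k) -> float:
    """Equation (4): similarity of links (u, k) and (v, k) sharing node k."""
    if k not in adj[u] or k not in adj[v]:
        raise ValueError("both links must exist and share node k")
    n_u_plus = adj[u] | {u}            # N_u^+ : u together with its neighbours
    n_v_plus = adj[v] | {v}            # N_v^+
    return len(n_u_plus & n_v_plus) / len(n_u_plus | n_v_plus)


# Toy network (hypothetical): triangle a-b-c with a pendant vertex d on c.
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}

same_triangle = link_similarity(adj, "a", "b", "c")   # links (a, c) and (b, c)
across = link_similarity(adj, "a", "d", "c")          # links (a, c) and (d, c)
```

Here the two triangle links obtain the maximal similarity, while the triangle link and the pendant link obtain a lower one, so a single-linkage procedure would join the triangle links first.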

In the next section, we propose a new criterion for weighted undirected graphs, which is a modification of the relative vertex-to-vertex clustering value that we first introduced in [26] for un-weighted graphs. In [26], the unweighted criterion was applied only to the problem of detecting protein complexes in PINs [27], whereas here we apply the weighted criterion to identifying functional modules in PINs. It is a local similarity premetric that combines the ideas behind the vertex clustering coefficient, the edge clustering coefficient, and the edge clustering value; it allows us to decide when a given vertex can be included in the cluster of another vertex, and it helps address the issues discussed above.


Network modularity structure

The concept of community is qualitative rather than quantitative; that is, nodes must be more densely connected within the community than with the rest of the network. The quantitative definition of the modularity of a network is still an open debate. Here, we use the modularity quality function $Q$ introduced by the authors of [28], which is a widely used quantitative measure for evaluating the modular structure of a network. Specifically, consider an un-weighted undirected graph $G = (V, E)$ with $|V| = n$ and symmetric adjacency matrix $A = [A_{u,v}]_{n \times n}$, where $A_{u,v} = 1$ if nodes $u$ and $v$ are connected and $A_{u,v} = 0$ otherwise. Then, the modularity function $Q$ is defined as

$$Q(P_k) = \sum_{i=1}^{k} \left( e_{ii} - a_i^2 \right), \tag{5}$$

where $P_k = \{C_1, \ldots, C_k\}$ is a partition of $V$ into $k$ groups; $e_{ii} = \frac{L(C_i, C_i)}{L(V, V)}$ is the fraction of edges with both end vertices in the same community $C_i$; $a_i = \frac{L(C_i, V)}{L(V, V)}$ is the fraction of edges with at least one end vertex in community $C_i$; and $L(S_1, S_2) = \sum_{u \in S_1, v \in S_2} A_{u,v}$. Larger values of $Q$ correspond to more distinct community structures in PINs. Function $Q$ has serious resolution limits, which have been discussed at length in [22]: the size of a detected community depends on the size of the whole network, and thus the choice of partition is highly sensitive to the total number of edges in the network. A second partition scoring function $\Omega$, which seeks to improve on $Q$, has been introduced in [29] and is defined as

$$\Omega(P_k) = \sum_{i=1}^{k} e_{ii} \log a_i. \tag{6}$$

Function $\Omega$ allows for more diverse cluster sizes than function $Q$, with clusters that are neither too small nor too large; smaller values of $\Omega$ correspond to better modularity structures. A third scoring function, the modularity density function $D$ of [14], overcomes the resolution limits of $Q$ by directly including information on the number of nodes in a community. It is defined as

$$D(P_k) = \sum_{i=1}^{k} \frac{L(C_i, C_i) - L(C_i, \bar{C}_i)}{|C_i|}, \tag{7}$$

where $\bar{C}_i = V \setminus C_i$ is the set of vertices not in $C_i$. Thus, the aim of function $D$ is to optimize both the modularity and the density of a community. For weighted undirected graphs $G = (V, E)$ with weights assigned to the edges in $E$, we propose new modularity functions $Q_w$, $\Omega_w$ and $D_w$. These three functions are direct generalizations of $Q$, $\Omega$ and $D$ above, with $L(S_1, S_2)$ redefined for weighted undirected graphs as

$$L(S_1, S_2) = \sum_{u \in S_1, v \in S_2} w(u, v). \tag{8}$$

The problem of community detection is hence equivalent to searching for a $k$ and a partition $P_k$ that optimize the value of a modularity function.
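The three scoring functions can be evaluated together once $L(S_1, S_2)$ is available. The following is a minimal sketch under stated assumptions: the weight table keyed by `frozenset` pairs, the function names, and the choice to skip clusters without internal edges in the $\Omega$ sum (to avoid taking the logarithm of zero) are illustrative and not part of the original definitions. The $\Omega$ term follows Equation (6) as printed above, and $L$ sums over ordered pairs, so each undirected edge inside a set is counted twice, as in the adjacency-matrix form.

```python
import math
from typing import Dict, FrozenSet, Hashable, List, Set, Tuple

Weights = Dict[FrozenSet[Hashable], float]    # frozenset({a, b}) -> w(a, b)


def link_weight(w: Weights, s1: Set[Hashable], s2: Set[Hashable]) -> float:
    """L(S1, S2): sum of w(u, v) over ordered pairs u in S1, v in S2."""
    return sum(w.get(frozenset((u, v)), 0.0) for u in s1 for v in s2 if u != v)


def modularity_scores(w: Weights, partition: List[Set[Hashable]]) -> Tuple[float, float, float]:
    """Return (Q_w, Omega_w, D_w) for a partition of the vertex set."""
    vertices = set().union(*partition)
    total = link_weight(w, vertices, vertices)             # L(V, V)
    q = omega = d = 0.0
    for c in partition:
        inside = link_weight(w, c, c)                      # L(C_i, C_i)
        e_ii = inside / total
        a_i = link_weight(w, c, vertices) / total
        q += e_ii - a_i ** 2
        if e_ii > 0:   # assumption: clusters without internal edges add 0
            omega += e_ii * math.log(a_i)
        d += (inside - link_weight(w, c, vertices - c)) / len(c)
    return q, omega, d


# Toy network (hypothetical): two triangles joined by the bridge (3, 4).
edges = [(1, 2), (1, 3), (2, 3), (4, 5), (4, 6), (5, 6), (3, 4)]
w = {frozenset(e): 1.0 for e in edges}
q, omega, d = modularity_scores(w, [{1, 2, 3}, {4, 5, 6}])
q_all, omega_all, d_all = modularity_scores(w, [{1, 2, 3, 4, 5, 6}])
```

On this toy network the two-triangle partition scores higher than the trivial one-cluster partition under both $Q_w$ and $D_w$, and lower under $\Omega_w$, which is consistent with the stated direction of each function.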

The relative vertex-to-vertex clustering value

Consider an edge $(u, v)$ in a scale-free network such that $u$ has lower degree than $v$. We can reasonably assume that it is more likely that $u$ joined the cluster containing $v$ than that $v$ joined the cluster containing $u$. This assumption stems from the principle of preferential attachment in power-law networks, which states that a new node $u$ is more likely to attach to a high-degree node $v$ than to a low-degree node. The edge clustering coefficient $C_{u,v}^{(k)}$ of [2] and the edge clustering value $ECV(u, v)$ of [25] are similarity metrics which treat both endpoints of an edge $(u, v)$ equally, irrespective of their degrees. Another issue is that both $ECV(u, v)$ and $C_{u,v}^{(3)}$ require vertices $u$ and $v$ to be connected by an edge. This requirement is quite restrictive, and we aim to extend our approach (in the future) to the case in which the pair $(u, v)$ is not an edge while still being able to decide whether both vertices are in the same cluster. Finally, hierarchical approaches based on $ECV(u, v)$ and $C_{u,v}^{(3)}$, or on other objective functions, share the problem of classifying low-degree vertices (peripheral to dense subnetwork modules) into separate clusters rather than merging them with their neighboring modules. These criteria indicate how likely it is that $u$ and $v$ lie in the same cluster, but not which of $u$ or $v$ has likely joined the other's cluster. Let $N_a$ be the set of neighbors of a vertex $a$ in an un-weighted undirected graph $G = (V, E)$. We define $N_a^+ = N_a \cup \{a\}$ as the neighbor set of $a$ augmented with $a$ itself. Given two vertices $u$ and $v$, we define the clustering value of $u$ relative to $v$ as:

$$R(u \to v) = \frac{|N_u^+ \cap N_v^+|}{|N_u^+|}. \tag{9}$$

To account for the reliability of edges in the presence of false positive interactions in the PIN data, we modify the objective of Equation (9) to apply to weighted graphs, as follows

$$R_w(u \to v) = \frac{\sum_{a \in I_{u,v}^+;\, (u,a) \in E;\, (a,v) \in E} w(u, a)\, w(a, v)}{\sum_{b \in N_u^+;\, (u,b) \in E} w(u, b)}, \tag{10}$$

where $I_{u,v}^+ = N_u^+ \cap N_v^+$, and $0 \le w(x, y) \le 1$ is the weight assigned to the edge $(x, y)$, which represents the reliability of the interaction between vertices $x$ and $y$, or the probability of their interaction being a true positive. Clearly, Equation (9) is a special case of Equation (10) for a weighted undirected graph with $w(x, y) = 1$ for all edges $(x, y)$. For a node $a \in V$, we let $k_a = \sum_{b \in V} A_{a,b}$ be its degree. For a weighted graph, we define the weighted degree of a vertex $a$ as $\kappa_a = \sum_{b \in V} w(a, b)$, similarly to [25].

$R_w(u \to v)$, with $0 \le R_w(u \to v) \le 1$, is a similarity premetric, since it does not satisfy the axiom of symmetry and the triangle inequality but satisfies the axioms of self-similarity and maximality [30]. A vertex $u$ with a larger clustering value relative to another vertex $v$ is more likely to lie in the cluster containing $v$. In the following, we let $C(v) = (C_v \subseteq V, E_v \subseteq E)$ denote the subnetwork cluster containing $v$, and we assume $C(v)$ is a community. Below, we describe the properties of $R_w(u \to v)$.
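The asymmetry of the relative clustering value is easy to see on a small example. The sketch below implements Equation (9) directly. The weighted function is a simplified variant and an assumption of this example, not a literal transcription of Equation (10): it compares the total weight from $u$ into the shared inclusive neighborhood with the total weight from $u$ into $N_u^+$, using the hypothetical convention $w(u, u) = 1$ so that unit weights reduce to Equation (9). Function names and the data layout are likewise illustrative.

```python
from typing import Dict, FrozenSet, Hashable, Set

Graph = Dict[Hashable, Set[Hashable]]         # vertex -> set of neighbours
Weights = Dict[FrozenSet[Hashable], float]    # frozenset({a, b}) -> w(a, b)


def relative_clustering_value(adj: Graph, u, v) -> float:
    """Equation (9): R(u -> v) = |N_u^+ ∩ N_v^+| / |N_u^+|."""
    n_u_plus = adj[u] | {u}
    n_v_plus = adj[v] | {v}
    return len(n_u_plus & n_v_plus) / len(n_u_plus)


def relative_clustering_value_weighted(adj: Graph, w: Weights, u, v) -> float:
    """Simplified weighted variant (assumption, see the text above)."""
    def weight(a, b) -> float:
        # Assumed convention: a vertex is fully reliable with itself.
        return 1.0 if a == b else w[frozenset((a, b))]

    n_u_plus = adj[u] | {u}
    n_v_plus = adj[v] | {v}
    shared = sum(weight(u, a) for a in n_u_plus & n_v_plus)
    return shared / sum(weight(u, b) for b in n_u_plus)


# Toy network (hypothetical): triangle a-b-c with a pendant vertex d on c.
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
unit_w = {frozenset((x, y)): 1.0 for x in adj for y in adj[x]}

r_d_to_c = relative_clustering_value(adj, "d", "c")   # pendant relative to hub
r_c_to_d = relative_clustering_value(adj, "c", "d")   # hub relative to pendant
```

The pendant vertex d has all of its inclusive neighborhood inside that of c, whereas only half of the inclusive neighborhood of c is shared with d, so the value is not symmetric.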

Analysis of $R_w(u \to v)$

In the following, we limit our discussion to the case of un-weighted networks, though it also applies to weighted networks. To understand how the similarity premetric $R_w(u \to v)$ can be used to determine the communities in a network, we now discuss the relationship between the values $R(u \to v)$ and $R(v \to u)$ in all four possible cases of connectivity of an edge $(u, v)$. The main question we address below is: when should we merge the vertex $u$ with the current cluster $C(v)$ of $v$?

  • 1 Case $k_u = 1$. $R(u \to v) = 1$, thus it is maximal. $R(v \to u)$ is also maximal when $k_v = 1$, and hence the connected component $C = (\{u, v\}, \{(u, v)\})$ is a community. If, on the other hand, $k_v > 1$, then we have $R(u \to v) > R(v \to u)$ and therefore $u$ should be merged with the current cluster $C(v)$ of $v$ (not the other way around, which would correspond to merging $v$ with $C(u)$).

  • 2 Case $1 < k_u < k_v$. $R(u \to v) > R(v \to u)$ and $R(u \to v)$ may or may not be maximal. Vertex $u$ should be merged with $C(v)$ only when $R(u \to v) > 0.5$; that is, when more than 50% of the vertices in $N_u^+$ are in the intersection $N_u^+ \cap N_v^+$. This is a reasonable decision since the number of triangles involving the edge $(u, v)$ is $|N_u \cap N_v|$, and the edge $(u, v)$ is definitely not a "bridge" connecting two clusters when most of $u$'s neighbors form a triangle with $v$.

  • 3 Case $1 < k_v < k_u$. This is the reverse of case 2 above: thus, $u$ should not merge with $C(v)$ since $R(u \to v) < R(v \to u)$.

  • 4 Case $k_u = k_v$. $R(u \to v) = R(v \to u)$, and we should consider two possible sub-cases.

    (a) Sub-case $N_u^+ = N_v^+$. We have $R(u \to v) = R(v \to u) = 1$ since $N_u^+ = N_v^+ = N_u^+ \cap N_v^+$. Hence, $u$ should be merged with $C(v)$, given that the induced subnetwork of $G$ for $N_u^+ \cap N_v^+$ forms a community.

    (b) Sub-case $N_u^+ \ne N_v^+$. We have $R(u \to v) = R(v \to u) < 1$. In this case, $u$ should be merged with $C(v)$ only when $R(u \to v) > 0.5$.

Given an edge $(u, v)$, assume that the degrees of vertices $u$ and $v$ in $G$ are such that $k_u = k_v = d$ is (very) large and that $u$ and $v$ have no common neighbors. Then, we have $R(u \to v) = R(v \to u) = \frac{2}{1 + d} \le 0.5$, assuming $d \ge 3$, since the intersection $N_u^+ \cap N_v^+$ contains only $u$ and $v$. In this case, the induced subnetwork of $G$ for $\{u\} \cup C_v$ (or for $N_v^+$) is not a community, and likewise for $\{v\} \cup C_u$ (or for $N_u^+$). In general, considering the induced subgraph of $G$ on $N_u^+ \cup N_v^+$, we define the local betweenness value of edge $(u, v)$ as the percentage of paths from vertices in $N_u \setminus N_v$ to vertices in $N_v \setminus N_u$ going through edge $(u, v)$. Given the number of common neighbors of $u$ and $v$, $|N_u \cap N_v|$, the local betweenness of edge $(u, v)$ is thus $\lambda(u, v) = 100 \cdot \frac{1}{|N_u \cap N_v| + 1}$. Given two connected high-degree vertices $u$ and $v$, the local edge betweenness value $\lambda(u, v)$ increases as $|N_u \cap N_v|$ decreases, which corresponds to the situation in which the values $R(u \to v)$ and $R(v \to u)$ are both small (both $\le 0.5$) at the same time. Edges with high local betweenness values are likely to connect two communities, and therefore vertices $u$ and $v$ should not lie in the same community. This is not necessarily true, since the inference is based on a local measure rather than on the global edge betweenness metric defined in [21]. However, starting with correct initializations and using an appropriate node clustering mechanism, a greedy algorithm can be devised based on the faster local evaluations instead of the costly global evaluations.
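The two-hub situation described above can be checked numerically. In this minimal sketch, the helper `two_hub_graph` (a hypothetical construction for illustration) builds two adjacent vertices of degree $d$ with no common neighbors, and the two functions evaluate $R(u \to v)$ and the local betweenness $\lambda(u, v)$ as defined in the text; all names and the adjacency-dictionary layout are assumptions of this example.

```python
from typing import Dict, Hashable, Set

Graph = Dict[Hashable, Set[Hashable]]  # vertex -> set of neighbours


def relative_clustering_value(adj: Graph, u, v) -> float:
    """R(u -> v) = |N_u^+ ∩ N_v^+| / |N_u^+|."""
    n_u_plus = adj[u] | {u}
    n_v_plus = adj[v] | {v}
    return len(n_u_plus & n_v_plus) / len(n_u_plus)


def local_betweenness(adj: Graph, u, v) -> float:
    """lambda(u, v) = 100 / (|N_u ∩ N_v| + 1), in percent."""
    return 100.0 / (len(adj[u] & adj[v]) + 1)


def two_hub_graph(d: int) -> Graph:
    """Hypothetical test graph: hubs u and v of degree d, no shared neighbours."""
    adj: Graph = {"u": {"v"}, "v": {"u"}}
    for hub in ("u", "v"):
        for i in range(d - 1):          # d - 1 private leaves per hub
            leaf = f"{hub}{i}"
            adj[hub].add(leaf)
            adj[leaf] = {hub}
    return adj


d = 5
hubs = two_hub_graph(d)
r_uv = relative_clustering_value(hubs, "u", "v")
lam = local_betweenness(hubs, "u", "v")
```

For $d = 5$ the relative clustering value equals $2/(1 + d)$ in both directions and the local betweenness is at its maximum, so the edge between the hubs behaves like a bridge and neither hub would be merged into the other's cluster.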

$R(u \to v)$ is maximal when $|N_u^+| = |N_u^+ \cap N_v^+|$; that is, in either Case (1) or Case (4a) above. In either case, $u$ contributes only new internal edges to the induced subnetwork of $G$ for $C_v^+ = \{u\} \cup C_v$ (or for $C_v^+ = N_v^+$) and contributes no new external edges, and hence the induced subnetwork of $G$ for $C_v^+$ remains a community if $C_v$ (or $N_v^+$) is a community. Finally, $u$ is more likely to be in the community $C(v)$ and $v$ less likely to be in the community $C(u)$ when both $R(u \to v) > 0.5$ and $R(u \to v) \ge R(v \to u)$. Under these two conditions, $k_u \le k_v$ and $|N_u^+ \cap N_v^+| > \frac{|N_u^+|}{2}$; that is, more than 50% of the neighbors of $u$ are in the intersection, while the fraction of the neighbors of $v$ in the intersection is no larger. Since $k_u \le k_v$, the induced subnetwork of $G$ for $C_v^+ = \{u\} \cup C_v$ is a community when $N_u \cap N_v \subseteq C(v)$, with its modularity increasing with $|N_u \cap N_v|$.

Quantitative definition of module

Given the four cases above and a user-defined merging parameter $\mu$ with $0 \le \mu < 2$, the decision to merge a node $u$ with the cluster $C(v)$ of a node $v$ can be summarized in a single test covering all four cases; that is, include $u$ in $C(v)$ whenever

$$R_w(u \to v) > 0.5\mu \quad \text{and} \quad R_w(u \to v) \ge R_w(v \to u). \tag{11}$$

The communities (i.e., modules) $C$ determined by algorithms that use this merging test are such that the merging condition is satisfied for every internal edge of $C$ and not satisfied for every external edge of $C$. Given a weighted undirected graph $G = (V, E)$ and the merging parameter $\mu$, a subgraph $C \subseteq G$ is said to be a $\mu$-module if the merging condition is true for every internal edge of $C$ and false for every external edge of $C$. Different network modularity structures are obtained by varying the value of the merging parameter $\mu$.
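The merging test and the $\mu$-module definition can be sketched for un-weighted graphs as follows. Because the merging condition is directional while an edge is not, this example assumes that an edge "satisfies" the condition when it holds in at least one direction; that reading, the function names, and the adjacency-dictionary layout are assumptions of this sketch rather than statements from the text.

```python
from typing import Dict, Hashable, Set

Graph = Dict[Hashable, Set[Hashable]]  # vertex -> set of neighbours


def relative_clustering_value(adj: Graph, u, v) -> float:
    """R(u -> v) = |N_u^+ ∩ N_v^+| / |N_u^+| (un-weighted case)."""
    n_u_plus = adj[u] | {u}
    n_v_plus = adj[v] | {v}
    return len(n_u_plus & n_v_plus) / len(n_u_plus)


def should_merge(adj: Graph, u, v, mu: float = 1.0) -> bool:
    """Merging test: R(u -> v) > 0.5 * mu and R(u -> v) >= R(v -> u)."""
    r_uv = relative_clustering_value(adj, u, v)
    return r_uv > 0.5 * mu and r_uv >= relative_clustering_value(adj, v, u)


def is_mu_module(adj: Graph, module: Set[Hashable], mu: float = 1.0) -> bool:
    """Check the mu-module property for a vertex set (see assumption above)."""
    for u in module:
        for v in adj[u]:
            # Assumed reading: the edge passes if the test holds either way.
            passes = should_merge(adj, u, v, mu) or should_merge(adj, v, u, mu)
            internal = v in module
            if internal != passes:      # internal edges must pass, external must fail
                return False
    return True


# Toy network (hypothetical): two triangles joined by the bridge (3, 4).
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5}}
```

With $\mu = 1$ each triangle is a $\mu$-module, because the bridge (3, 4) has $R = 0.5$ in both directions and fails the strict inequality; lowering $\mu$ to 0.9 lets the bridge pass the test, so the triangle is no longer a $\mu$-module, which illustrates how $\mu$ changes the resulting modularity structure.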

The relative vertex clustering value $R(u \to v)$ implements the ideas behind the edge clustering coefficient $C_{u,v}^{(k)}$ of [2], since for a given vertex $v$ and a neighbor $u$ the number of triangles containing edge $(u, v)$ is exactly $|N_u \cap N_v|$, and $u$ will be included in $C(v)$ whenever most of the neighbors of $u$ (excluding $v$) are in $N_u \cap N_v$. This is also true even when $(u, v)$ is not an edge; in such a case, $|N_u \cap N_v|$ relates to the number of squares containing vertices $u$ and $v$. On the other hand, as with the edge clustering value $ECV(u, v)$ of [25], we overcome the limitations of [2] by not assuming the existence of closed loops in a network, such as triangles or higher-order loops. The relative vertex clustering values $R(u \to v)$ and $R_w(u \to v)$ also improve on $ECV(u, v)$ and $ECV_w(u, v)$, since neighbors $u$ of $v$ that have most of their neighbors forming a triangle with $v$ are considered for possible inclusion in $C(v)$. Searching for vertices $u$ that form a cluster with $v$ is also more efficient than searching for edges $(u, v)$ that make a cluster, since the number of edges is larger than the number of vertices in dense subgraphs.

The FAC-PIN algorithm

In a clustering task, we can use $R_w(u \to v)$ and $R_w(v \to u)$ to decide whether $u$ should be included in $C(v) = (C_v, E_v) \subset G = (V, E)$, the current cluster of $v$. Based on the definitions of the relative vertex-to-vertex clustering value and of quantitative network modularity, we propose a fast agglomerative, node-focused clustering algorithm named FAC-PIN, shown in Algorithm 1. The input to FAC-PIN is an undirected weighted graph; when an un-weighted graph is used, all edges $(a, b)$ are treated equally with weight $w(a, b) = 1$. The output of FAC-PIN is a collection of non-overlapping subnetwork communities.

Given a weighted undirected PIN $G = (V, E)$, we initially consider each vertex as a singleton cluster, and sort the vertices $v \in V$ into a queue $Q_V$ in non-increasing order of their weighted degrees $\kappa_v$. Then, in an iterative manner, we select the next highest-$\kappa_v$ vertex $v$ from $Q_V$ and apply the merging condition

$$R_w(u \to v) > 0.5\mu \quad \text{and} \quad R_w(u \to v) \ge R_w(v \to u)$$

on each neighbor $u \in N_v$ of $v$ in order to decide on its inclusion in the current cluster $C_v$ of $v$.

Algorithm 1 The FAC-PIN algorithm

Require: $G = (V, E)$: undirected PIN graph;

      $A_{|V| \times |V|}$: adjacency matrix;

      $W_{|V| \times |V|}$: weight matrix;

      $\mu$: merging parameter;

Ensure: $P_k = \{C_1, \ldots, C_k\}$: non-overlapping subnetwork communities

{Initialization Phase}

for all $v \in V$ do

      $C_v \leftarrow \{v\}$; {$C_v$ = cluster containing node $v$}

      $E_v \leftarrow \emptyset$;

      $\kappa_v \leftarrow \sum_{b \in V} w(v, b)$; {weighted degree of $v$}

      $C(v) \leftarrow (C_v, E_v)$; {Each vertex is a singleton cluster; $C(v)$ = subnetwork containing node $v$}

end for

{Community Detection Phase}

Sort $V$ to $Q_V$ in non-increasing order of $\kappa_v$ values;

repeat

   $v \leftarrow Q_V$; {Select highest $\kappa_v$ vertex in $Q_V$}

   $N_v \leftarrow \{u \in V \mid (u, v) \in E\}$; {Neighbor set of $v$}

   for all $u \in N_v$ not yet assigned to a cluster do

      if $R_w(u \to v) > 0.5\mu$ and $R_w(u \to v) \ge R_w(v \to u)$ then

         $C_z \leftarrow C_v \cup \{u\}$, $\forall z \in C_v \cup \{u\}$;

      end if

   end for

   $Q_V \leftarrow Q_V - v$; {Remove $v$ from $Q_V$}

until $Q_V = \emptyset$

{Compute the Partition $P_k$}

$U \leftarrow V$;

$i \leftarrow 1$;

while $U \ne \emptyset$ do

   $v \leftarrow$ randomly select a vertex from $U$;

   $C_i \leftarrow C(v)$ = the induced subgraph of $G$ for $C_v$;

   $U \leftarrow U \setminus \{u \mid C_u = C_v\}$;

   $i \leftarrow i + 1$;

end while

return $P_k \leftarrow \{C_1, \ldots, C_k\}$;

{Evaluate the Modularity of Partition $P_k$}

Modularity $\leftarrow$ $D_w(P_k)$, $Q_w(P_k)$ and $\Omega_w(P_k)$;

A neighbor $u \in N_v$ is added to the current cluster $C_v$ of $v$ when the majority of the neighbors of $u$ are in $N_u^+ \cap N_v^+$; that is, when $R(u \to v) > 0.5$ and $R_w(u \to v) \ge R_w(v \to u)$, in which case $\kappa_u \le \kappa_v$ and $|N_u^+ \cap N_v^+| > \frac{1}{2} |N_u^+|$, which for weighted graphs is equivalent to $\sum_{a \in I_{u,v}^+} w(u, a) > \frac{1}{2} \sum_{b \in N_u^+} w(u, b)$, where $I_{u,v}^+ = N_u^+ \cap N_v^+$. By gradually examining each high-degree vertex $v$ from the queue $Q_V$ and then gradually adding its un-assigned neighbors $u$ to $C_v$, FAC-PIN agglomerates all singleton clusters into $|V|$ vertex sets $C_v$. The final $k$ communities $C_i$, for $1 \le i \le k$, are the induced subgraphs of $G$ for all distinct $C_v$; in the algorithm, we made a distinction between a cluster $C_v = \{v_1, \ldots, v_n\}$, a subnetwork $C(v) = (C_v, E_v)$, and the $i$-th subnetwork $C_i$. In FAC-PIN, the merging parameter $\mu$ with $0 \le \mu < 2$ is user-defined. In particular for weighted PINs, different modularity results can be obtained by changing the value of $\mu$.

Most hierarchical methods, with the exception of the HC-PIN algorithm of [25], are based on a costly global metric for partitioning a PIN. FAC-PIN is based on the local similarity premetric $R_w(u \to v)$, which encodes useful information about the local topology around vertices $u$ and $v$, and which helps make a local decision aimed at maximizing the modularity of the final partitioning.
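The overall procedure can be sketched in Python as follows. This is an illustrative reading of Algorithm 1, not the authors' implementation (which was written in R). Three assumptions are made and marked in the comments: the weighted premetric is the simplified form that sums $w(u, a)$ over the shared inclusive neighborhood with the convention $w(u, u) = 1$; when $u$ is merged, the whole current cluster of $u$ is carried along so that the vertex sets always remain a partition; and ties in the weighted degree are broken by input order.

```python
from typing import Dict, FrozenSet, Hashable, List, Optional, Set

Graph = Dict[Hashable, Set[Hashable]]         # vertex -> set of neighbours
Weights = Dict[FrozenSet[Hashable], float]    # frozenset({a, b}) -> w(a, b)


def fac_pin(adj: Graph, w: Optional[Weights] = None, mu: float = 1.0) -> List[Set[Hashable]]:
    """Sketch of FAC-PIN; returns the non-overlapping vertex clusters."""
    def weight(a, b) -> float:
        if a == b or w is None:
            return 1.0      # assumption: w(a, a) = 1; unit weights if w is None
        return w[frozenset((a, b))]

    def r_w(u, v) -> float:
        # Assumption: simplified weighted premetric (see the text above).
        n_u_plus, n_v_plus = adj[u] | {u}, adj[v] | {v}
        shared = sum(weight(u, a) for a in n_u_plus & n_v_plus)
        return shared / sum(weight(u, b) for b in n_u_plus)

    # Initialization phase: singleton clusters and weighted degrees.
    kappa = {v: sum(weight(v, b) for b in adj[v]) for v in adj}
    cluster = {v: {v} for v in adj}
    assigned: Set[Hashable] = set()

    # Community detection phase: visit vertices by non-increasing kappa.
    for v in sorted(adj, key=kappa.get, reverse=True):
        for u in adj[v]:
            if u in assigned or cluster[u] is cluster[v]:
                continue
            r_uv = r_w(u, v)
            if r_uv > 0.5 * mu and r_uv >= r_w(v, u):
                # Assumption: carry the whole cluster of u along with u.
                merged = cluster[v] | cluster[u]
                for z in merged:
                    cluster[z] = merged
                assigned.add(u)

    # Partition phase: collect the distinct vertex sets.
    seen: Set[Hashable] = set()
    partition: List[Set[Hashable]] = []
    for v in adj:
        if v not in seen:
            partition.append(set(cluster[v]))
            seen |= cluster[v]
    return partition


# Toy network (hypothetical): two triangles joined by the bridge (3, 4).
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5}}
modules = fac_pin(adj, mu=1.0)
```

On the toy network, $\mu = 1$ separates the two triangles, whereas $\mu = 0.9$ lets the bridge endpoints merge and yields a single cluster. Passing an explicit weight table with all weights equal to 1 gives the same result as omitting it.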

Computational complexity of FAC-PIN

Given a weighted PIN $G = (V, E)$, let $n = |V|$, $m = |E|$, $\kappa_{max} = \max_{v \in V} \kappa_v$ be the maximum weighted degree in $G$, and $\kappa_{ave} = \frac{1}{n} \sum_{v \in V} \kappa_v$ be the average weighted degree in $G$. The complexity of computing $R_w(u \to v)$ is $O(\kappa_{max})$, and hence the complexity of FAC-PIN is $O(n \kappa_{ave}^2) \le O(n \kappa_{max}^2) \le O(n^3)$. PINs are power-law networks, so the majority of proteins interact with only a few proteins, and thus $\kappa_{ave}$ is generally small and can be considered a constant [25]. The CNM [23] and HC-PIN [25] methods run in $O(mh \log n)$ and $O(m \kappa_{ave}^2)$ steps, respectively, where $h$ is the depth of the dendrogram describing the network's community structure. These are currently the fastest agglomerative methods. The space complexity of the three algorithms is $O(m^2)$. The main achievement with respect to computational complexity is that the cost of FAC-PIN depends on the number of nodes rather than the number of edges, especially when $\kappa_{ave}$ is regarded as a constant in scale-free networks.

Results and discussion

We have carried out several computational experiments on nine PIN data sets from eight different species using our proposed FAC-PIN algorithm. In this section, the data sets and the evaluation methods used in our experiments are described first. Next, we discuss the effect of varying the merging parameter $\mu$ on the FAC-PIN clustering results. Then, we arbitrarily set the merging parameter to $\mu = 0.5$ and compare the clustering results of the FAC-PIN approach with those of the HC-PIN and CNM methods on the same PIN data sets; the three algorithms are compared on (i) the functional enrichment of their predicted modules, (ii) their sensitivity, specificity, and F-score, (iii) the network modularity structure of the partitioning results, and finally, (iv) their execution times.

All computational experiments were performed on an Intel machine (Core™ i5-1600 CPU at 2.400 GHz, with 8 GB RAM). The program codes were all written in R.

PIN data sets

The original un-weighted PIN data for eight distinct species were downloaded from the REACTOME database, and those for one species from the DIP database [31]. The eight PIN data sets from REACTOME, listed with their numbers of proteins and interactions in parentheses, are: B. taurus (5737, 113888), T. guttata (Finch bird, 3929, 74314), X. tropicalis (Frog, 5473, 122706), H. sapiens (Human, 8997, 34935), O. sativa (Rice, 3778, 320570), S. scrofa (Wild boar, 5303, 119920), D. rerio (Zebra fish, 8188, 274358), and S. cerevisiae-1 (Baker's yeast, 5697, 50675). The PIN data set from DIP is S. cerevisiae-2 (Baker's yeast, 4726, 15166). In all these PIN data sets, the number of edges is much larger than the number of vertices.

We also downloaded a list of protein complexes obtained from the MIPS database, which we consider as gold-standard data. We extracted the protein complexes corresponding to the S. cerevisiae-2 PIN data from the MIPS Comprehensive Yeast Genome Database (CYGD). We proceeded similarly to [29] and considered only the known complexes (i.e., not those obtained by computational means) containing at least three proteins. Since FAC-PIN generates non-overlapping clusters, we considered only known complexes which are at the bottom of the MIPS hierarchy of complexes and subcomplexes. The unconfirmed complexes, that is, those in category 550, were excluded.

Evaluation methods

In order to study and compare the performance of FAC-PIN, we downloaded the CNM code [23] and implemented the HC-PIN algorithm [25]. The two methods were applied to the same PIN data as FAC-PIN. For HC-PIN, we set the two parameters $\lambda$ and $s$ as in [25]; CNM has no parameters. Of the three algorithms, only FAC-PIN and HC-PIN can cluster weighted PINs. There are other network clustering approaches with which we could compare FAC-PIN; however, they are either not designed for clustering weighted PINs or not hierarchical agglomerative algorithms. It should be noted that [25] compared the HC-PIN algorithm with six other PIN clustering approaches on the same S. cerevisiae-2 PIN data; none of them are hierarchical and only three of them can cluster PIN data. Due to time and space limitations, we were not able to perform computational experiments comparing the FAC-PIN approach with those six other PIN clustering techniques; we leave this task as future work. In [25], HC-PIN consistently outperforms those methods in terms of (i) the functional enrichment of the identified modules, (ii) the ability to detect both small-sized and large-sized modules, (iii) the accuracy of the identified modules, (iv) the ability to predict protein complexes, and (v) clustering efficiency. Both HC-PIN and CNM are currently the fastest agglomerative methods for clustering PIN data.

Functional enrichment validations

For the functional enrichment validations, we used DAVID's functional annotation tools [32] to identify enriched biological themes, particularly GO terms, and to estimate whether the predicted modules are biologically significant. DAVID uses a set of fuzzy classification algorithms to rank modules based on co-occurrences of their constituent proteins in annotation terms and computes a P-value indicating the significance of the module with respect to GO terms. The P-value is computed using an internal EASE score [33]. We used a P-value cutoff of 0.05 to find biologically significant clusters. A smaller P-value indicates that the predicted module is more biologically significant than one with a larger P-value.

To estimate the performance of a network clustering algorithm in terms of its ability to correctly identify the functional modules within a PIN, we also compute the Recall, Precision, and F-Measure of each predicted module C as

$$\mathrm{Recall} = \frac{|C \cap F_i|}{|F_i|},$$

$$\mathrm{Precision} = \frac{|C \cap F_i|}{|C|},$$

$$\text{F-Measure} = \frac{2 \times \mathrm{Recall} \times \mathrm{Precision}}{\mathrm{Recall} + \mathrm{Precision}},$$

where C is a module predicted by the algorithm, and $F_i$ is a known GO functional category mapped to C and considered as the true prediction. Thus, the proteins in $C \cap F_i$ are the true positive predictions. Recall measures how effectively proteins with the same $F_i$ in the PIN are extracted, Precision measures how consistently proteins in the same C are annotated, and F-Measure is their harmonic mean [34]. The accuracy of the method is taken as the average F-Measure of the significant predicted modules. As in [25], we only consider predicted modules of size 3 or more.
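The three measures above can be illustrated with a minimal Python sketch. The function name `module_scores` and the toy protein identifiers are hypothetical and are not part of FAC-PIN or DAVID; the sketch only evaluates the formulas for one predicted module and one GO category, both given as sets.

```python
def module_scores(module, category):
    """Recall, Precision, and F-Measure of a predicted module C
    against a GO functional category F_i (both given as sets)."""
    module, category = set(module), set(category)
    if not module or not category:
        raise ValueError("module and category must be non-empty")
    true_positives = len(module & category)   # |C ∩ F_i|
    recall = true_positives / len(category)   # |C ∩ F_i| / |F_i|
    precision = true_positives / len(module)  # |C ∩ F_i| / |C|
    if true_positives == 0:
        # Harmonic mean is undefined when both terms are 0; report 0.
        return recall, precision, 0.0
    f_measure = 2 * recall * precision / (recall + precision)
    return recall, precision, f_measure


# Hypothetical toy data: 3 of the 4 module proteins carry the annotation,
# and the annotation covers 6 proteins in the PIN.
C = {"P1", "P2", "P3", "P4"}
F_i = {"P2", "P3", "P4", "P5", "P6", "P7"}
recall, precision, f_measure = module_scores(C, F_i)
print(recall, precision, f_measure)
```

For this toy module, Recall is 3/6, Precision is 3/4, and the F-Measure is their harmonic mean, 0.6. The accuracy of a method would then be the average of these F-Measures over its significant modules of size 3 or more.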

Protein complex validations

Protein complex validations proceed by determining the degree of overlap between the complexes identified by a network clustering algorithm and the known protein complexes; that is, we want to determine how effectively an identified module matches a known complex. We used the overlapping score function given in [12, 25, 29, 35]. The overlapping score, O(C, K), between a discovered complex C and a known complex K is defined as:

$$O(C, K) = \frac{|C \cap K|^2}{|C| \times |K|},$$

where a cluster C is considered to match a known complex K whenever O(C, K) ≥ τ, and 0 < τ ≤ 1 is the matching threshold. We have a perfect match only when O(C, K) = 1. A threshold of τ = 0.2 was used in [12, 25, 35], whereas [29] used τ = 0.25. We used τ = 0.2 in our complex validation. After computing the overlapping scores between all pairs (C, K) of discovered complexes and known complexes for the PIN, we determined the ability of the method to correctly classify the known complexes. The reason for doing this is that a given complex K1 may match many clusters but with different degrees of overlap, while another complex K2 may match a single cluster only. Hence, we calculated the Specificity, the Sensitivity, and the F-Score as our measures of accuracy here; they are defined as follows:

$$\mathrm{Sensitivity} = \frac{TP}{TP + FN},$$

$$\mathrm{Specificity} = \frac{TP}{TP + FP},$$

$$\text{F-Score} = \frac{2 \times \mathrm{Specificity} \times \mathrm{Sensitivity}}{\mathrm{Specificity} + \mathrm{Sensitivity}},$$

where TP (true positives) is the number of identified complexes C that are matched by known complexes K, FN (false negatives) is the number of known complexes that are not matched by any identified complex, and FP (false positives) is the total number of identified complexes C minus TP.
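The complex validation described above can be sketched in Python as follows. This is an illustrative sketch only: the function names and the toy complexes are hypothetical, and a cluster is counted as a true positive when it reaches the threshold τ with at least one known complex, which is our reading of the definitions above.

```python
def overlap_score(cluster, known):
    """O(C, K) = |C ∩ K|^2 / (|C| * |K|) for two non-empty protein sets."""
    cluster, known = set(cluster), set(known)
    return len(cluster & known) ** 2 / (len(cluster) * len(known))


def complex_validation(clusters, known_complexes, tau=0.2):
    """Sensitivity, Specificity, and F-Score as defined in the text."""
    # TP: identified complexes that match at least one known complex.
    tp = sum(any(overlap_score(c, k) >= tau for k in known_complexes)
             for c in clusters)
    # FN: known complexes matched by no identified complex.
    fn = sum(not any(overlap_score(c, k) >= tau for c in clusters)
             for k in known_complexes)
    # FP: identified complexes that match no known complex.
    fp = len(clusters) - tp
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tp / (tp + fp) if tp + fp else 0.0
    total = sensitivity + specificity
    f_score = 2 * specificity * sensitivity / total if total else 0.0
    return sensitivity, specificity, f_score


# Hypothetical toy data: two of three clusters match a known complex.
clusters = [{"A", "B", "C"}, {"D", "E"}, {"X", "Y", "Z"}]
known = [{"A", "B", "C", "D"}, {"D", "E", "F"}]
print(complex_validation(clusters, known, tau=0.2))
```

In this toy case the first cluster scores 9/12 = 0.75 against the first known complex and the second cluster scores 4/6 against the second, so TP = 2, FP = 1, and FN = 0. Note that the quantity the text calls Specificity, TP / (TP + FP), is the fraction of identified complexes that match a known complex.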

Modularity and efficiency analyses

All experiments in this paper were performed on an Intel Core i7-2600 machine (3.40 GHz CPU, 8 GB RAM). We compared FAC-PIN against HC-PIN and CNM in terms of the modularity of their clustering results and in terms of their computational efficiency. We ran FAC-PIN with its merging parameter set to μ = 0.5, then evaluated and reported the modularity of its resulting partition $P_k$. The execution times (in seconds) were also recorded; the PINs are sorted in increasing order of their number of proteins m.

Identification of functional modules in the S. cerevisiae-2 PINs

The computational results in this section are all generated with the merging parameter arbitrarily set to μ = 0.5 (except in Table 1) and with the modularity quality function $Q_w$.

Table 1 The effect of variation of μ on clustering S. cerevisiae-2 PINs

Effect of the merging parameter μ

Table 1 shows the effect of parameter μ on FAC-PIN clustering results. Recall that a neighbor u of v is merged with the current cluster C(v) of v whenever the test

$$R_w(u \to v) > 0.5\mu \quad \text{and} \quad R_w(u \to v) \geq R_w(v \to u)$$

is satisfied for u. Hence, the size of a cluster C(v) increases as the merging parameter μ decreases since more neighbors are being merged together with v; and therefore, the number of clusters k also decreases as the sizes of clusters increase.
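The effect of μ on cluster size can be illustrated with a small Python sketch of the merging test. The $R_w$ values below are made-up numbers rather than values computed from a PIN, the function names are hypothetical, and the second comparison is assumed to be "greater than or equal to".

```python
def should_merge(r_uv, r_vu, mu):
    """Merging test for a neighbor u of v, given precomputed
    R_w(u -> v) and R_w(v -> u). The '>=' in the second comparison
    is an assumption of this sketch."""
    return r_uv > 0.5 * mu and r_uv >= r_vu


def merged_neighbors(scores, mu):
    """Neighbors of v that pass the merging test for a given mu.
    `scores` maps each neighbor u to (R_w(u -> v), R_w(v -> u))."""
    return [u for u, (r_uv, r_vu) in scores.items()
            if should_merge(r_uv, r_vu, mu)]


# Hypothetical R_w values for four neighbors of a vertex v.
scores = {
    "u1": (0.80, 0.60),
    "u2": (0.30, 0.20),
    "u3": (0.15, 0.10),
    "u4": (0.40, 0.70),  # fails the second comparison for every mu
}
for mu in (1.0, 0.5, 0.2):
    print(mu, merged_neighbors(scores, mu))
```

With these values, one neighbor is merged at μ = 1.0, two at μ = 0.5, and three at μ = 0.2, which is the behavior described above: lowering μ admits more neighbors into C(v), so clusters grow and their number k shrinks.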

Functional enrichment of FAC-PIN modules

In Table 2, the three methods are compared in terms of the functional enrichment of their predicted modules. The P-value from DAVID's internal EASE score is computed for each predicted module C, and a P-value cutoff of 0.05 is used to find the biologically significant clusters; a module whose P-value is above this cutoff is considered insignificant. The table shows, in this order, the number (percentage) and the average size of significant predicted modules with P-values falling within the intervals <E-15, [E-15, E-10], [E-10, E-5], and [E-5, 1]. Although CNM and HC-PIN show more enriched modules in the interval <E-15, the modules with P-values falling in this range are much larger for CNM and HC-PIN than for FAC-PIN, with an average size of 439.83 for CNM and 103.1 for HC-PIN compared to 49.08 for FAC-PIN. Larger modules result in a higher number of false positives, reducing the specificity of the highly enriched modules. Figure 1 shows this trend by comparing the sizes of the modules whose enrichment P-values fall in the range <E-15. In the figure, there is a clear shift to the right in the case of CNM, indicating much larger modules. This trend is apparent in all P-value ranges (see Table 2), which indicates that CNM is the worst at predicting enrichment in small modules. HC-PIN's highly enriched modules are also large compared to those produced by FAC-PIN, but smaller than those of CNM. Finally, FAC-PIN has the lowest rate of modules failing the enrichment P-value cutoff of 0.05.

Table 2 Functional enrichment of the predicted modules comprising three or more S. cerevisiae-2 proteins; μ = 0.5
Figure 1 P-values versus sizes of modules. Comparison of the sizes of enriched modules whose P-values fall in the range <E-15.

Predicting large-sized versus small-sized modules

The P-value of a predicted module depends on its size, and hence, Table 3 and Table 4 show the accuracy of the methods respectively for predicting large and small modules.

Table 3 Performance comparison of the algorithms for predicting modules of size ≥ 20 on S. cerevisiae-2 PIN; μ = 0.5
Table 4 Performance comparison of the algorithms for predicting modules of size ≤ 6 on S. cerevisiae-2 PIN; μ = 0.5

In Table 3, we see that more than 96% of the modules predicted by each method are validated to be significant, though FAC-PIN yields a percentage slightly larger than that of HC-PIN or CNM. Although CNM gives the highest average -log P-value, it also yields the lowest average F-measure; this is due to the fact that its significant modules are much larger than those of HC-PIN and FAC-PIN, and hence, less accurate. FAC-PIN, on the other hand, predicted more accurate significant modules than HC-PIN and CNM but with the lowest average -log P-value; again, this is due to the smaller sizes of its generated modules.

In Table 4, however, FAC-PIN performed consistently better than CNM and HC-PIN in all performance measures; FAC-PIN seems to be better at producing small-sized modules.

Accuracy of FAC-PIN

Table 5 lists the accuracy of each method over all the validations of Biological Process (BP), Molecular Function (MF), and Cellular Component (CC). Table 5 further confirms our analysis of the results in Table 3 and Table 4, namely that FAC-PIN predicts smaller but more accurate significant modules.

Table 5 Performance Comparison of the accuracy of FAC-PIN, HC-PIN, and CNM on S. cerevisiae-2 PIN; μ = 0.5

Identification of functional modules in the S. cerevisiae-1 PIN

Table 6 shows, in this order: the modularity value $Q_w(P_k)$ of the generated partition $P_k$; the number $k_3$ of predicted modules with ≥ 3 proteins (and, in parentheses, the total number k); and the average size $\bar{s}$ of the modules. Next, the validation results show: the number $k_s$ of significant modules obtained overall (with the percentage of such modules in parentheses) and for each ontology class (Biological Process, Cellular Component, Molecular Function); the number of significant modules whose P-values fall within the intervals <E-15, [E-15, E-10], [E-10, E-5], and [E-5, 1]; the average $\bar{p}$ of -log P-value; and the accuracy A of each algorithm, taken as the average F-Measure of the predicted significant modules. The data set is the original unweighted S. cerevisiae-1 PIN downloaded from the REACTOME database. For this PIN, the number of modules discovered by FAC-PIN is comparable to (but still larger than) the numbers detected by HC-PIN and CNM. FAC-PIN still predicts smaller and more accurate significant modules in this S. cerevisiae-1 PIN, with a higher average -log P-value; this is consistent with our findings in the previous tables that FAC-PIN performs better due to the smaller sizes of its predicted modules.

Table 6 Functional enrichment of the predicted modules of un-weighted S. cerevisiae-1 PIN; μ = 0.5

Identification of protein complexes in the S. cerevisiae-2 PIN

Table 7 shows the Specificity, the Sensitivity, and the F-Score of the complexes identified by each method. The results are shown for the modularity scoring function $Q_w$. For HC-PIN, results are shown for two values of its parameter λ, as in [25]. The first three columns show, respectively, the number of proteins, the number of known complexes, and the average size of the known complexes in the data; columns 5, 6, and 7 are the number of discovered complexes, their average size, and the number of perfectly matched discovered complexes. In the table, we see that FAC-PIN discovers complexes whose average sizes (column 6) are closer to the average sizes of the known protein complexes (column 3), whereas the average sizes of the complexes predicted by HC-PIN and CNM are farther from them. As a consequence, FAC-PIN complexes have higher accuracy (in terms of Specificity, Sensitivity, or F-Score). In particular, we obtain a larger number of complexes perfectly matched to communities with FAC-PIN than with HC-PIN or CNM.

Table 7 Comparison of the Sensitivity, Specificity and F-Score of FAC-PIN, HC-PIN and CNM

Modularity and efficiency of FAC-PIN

Tables 8, 9, and 10 show the network modularity of the partitions obtained by the algorithms on the eight un-weighted PIN data sets downloaded from the REACTOME database, for the modularity functions $Q_w$, $\Omega_w$, and $D_w$, respectively. The aim of both objectives $Q_w$ and $\Omega_w$ is to optimize the modularity of the detected clusters (though $\Omega_w$ yields clusters that are neither too small nor too large, and therefore generates denser clusters than those from $Q_w$); the aim of $D_w$ is to optimize both the modularity and the density of the clusters.

Table 8 Network modularity quality $Q_w$ results of FAC-PIN, HC-PIN, and CNM; μ = 0.5
Table 9 Network modularity quality $\Omega_w$ results of FAC-PIN, HC-PIN, and CNM; μ = 0.5
Table 10 Network modularity density $D_w$ results of FAC-PIN, HC-PIN, and CNM; μ = 0.5

CNM is a modularity optimization algorithm designed to directly optimize the modularity quality function $Q_w$, and hence it is no surprise that it performed best with this function, as shown in Table 8. The modularity maximization process of CNM [23] yields a partitioning containing one very large cluster and many much smaller ones; this is because a node is selected for inclusion in the currently largest cluster first, so as to maximize the current $Q_w$ value. In the columns for Rice and Yeast in Table 8, we see that FAC-PIN outperforms CNM on $Q_w$; Table 11 shows a possible reason for this: the sizes $\max |C_i|$ of their largest clusters are comparable.
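As an illustration of what a modularity score measures, the following Python sketch evaluates the standard weighted Newman-Girvan modularity of a partition, which is the quantity CNM is designed to maximize. We assume here, for illustration only, that $Q_w$ has this standard form; the function name, the edge-list representation, and the toy graph are hypothetical, and self-loops are not handled.

```python
from collections import defaultdict


def newman_modularity(edges, membership):
    """Weighted Newman-Girvan modularity of a partition.

    edges: iterable of (u, v, weight) for an undirected graph without
           self-loops, each edge listed once.
    membership: dict mapping each vertex to its cluster label.
    Q = sum over clusters c of  W_in(c) / W  -  (S(c) / (2 W))^2,
    where W is the total edge weight, W_in(c) the weight of edges inside c,
    and S(c) the sum of weighted degrees of the vertices in c.
    """
    edges = list(edges)
    total = sum(w for _, _, w in edges)
    internal = defaultdict(float)   # W_in(c)
    strength = defaultdict(float)   # S(c)
    for u, v, w in edges:
        cu, cv = membership[u], membership[v]
        strength[cu] += w
        strength[cv] += w
        if cu == cv:
            internal[cu] += w
    return sum(internal[c] / total - (strength[c] / (2 * total)) ** 2
               for c in strength)


# Hypothetical toy graph: two triangles joined by a single edge.
edges = [("a", "b", 1), ("b", "c", 1), ("a", "c", 1),
         ("d", "e", 1), ("e", "f", 1), ("d", "f", 1),
         ("c", "d", 1)]
two_clusters = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_cluster = {v: 0 for v in "abcdef"}
print(newman_modularity(edges, two_clusters))
print(newman_modularity(edges, one_cluster))
```

Splitting the toy graph into its two triangles gives Q = 5/14, whereas placing all vertices in one cluster gives Q = 0. Under this measure, a partition dominated by one very large cluster can only score well if the remaining clusters capture most of the other edges.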

Table 11 Comparing cluster statistics of FAC-PIN and CNM on $Q_w$; μ = 0.5

Recall that, given a current high-degree vertex v with its cluster $C_v$, FAC-PIN merges it with all its neighbors u satisfying the merging condition

$$R_w(u \to v) > 0.5\mu \quad \text{and} \quad R_w(u \to v) \geq R_w(v \to u).$$

The first term in the merging condition guarantees that only edges (u, v) which have a low local betweenness value $\lambda(u, v) = 100 \cdot \frac{1}{|N_u \cap N_v| + 1}$ are considered for possible inclusion in the induced subgraph C(v) of $C_v$. The second term guarantees that only those neighbors u which can contribute more edges to C(v) than v contributes to C(u) are selected. Hence, FAC-PIN merges neighbors u which contribute low local betweenness edges while optimizing the density of C(v). Also, as noted earlier, the relative vertex clustering value $R_w(u \to v)$ combines the principles behind the vertex clustering coefficient of [14], the edge clustering coefficient $C^{(k)}_{u,v}$ of [2], and the edge clustering value ECV(u, v) of [25]. Since the objective of both $\Omega_w$ and $D_w$ is to seek modular partitionings containing dense clusters, we can see in Tables 9 and 10 that FAC-PIN outperformed both HC-PIN and CNM on both modularity functions: in seven out of eight PIN data sets for $\Omega_w$, and in all PIN data sets for $D_w$. In particular, for $D_w$, FAC-PIN yields much higher modularity values.

Table 12 shows the execution times (in seconds) of each algorithm on the same data sets as above, but for the modularity function $Q_w$ only. As one can see, FAC-PIN ran faster than both HC-PIN and CNM on all data sets.

Table 12 Execution times of FAC-PIN, HC-PIN, and CNM; using $Q_w$ and μ = 0.5


In this paper, we have proposed a new agglomerative clustering approach, the FAC-PIN algorithm, for detecting the communities of a given PIN, and compared our method with two fast hierarchical techniques discussed in the literature. Our approach is based on a new measure, the relative vertex-to-vertex clustering value, which helps decide whether a given vertex u should be included within the cluster of another vertex v depending on how many of its neighbors form a triangle with v. Our approach is very fast since we cluster vertices rather than edges, as is done in the compared methods. Thus, our method is appropriate for PIN data, which in general contain more interactions than proteins. More study is needed, in particular validation based on random networks, to analyze the robustness of FAC-PIN. Comparisons with other methods which are not necessarily hierarchical will also be important. Non-agglomerative clustering methods based on the relative vertex-to-vertex clustering value will be investigated. In the current version of FAC-PIN, a neighbor u is merged with a cluster $C_{v_i}$ whenever its $R_w(u \to v_i)$ value satisfies the merging condition, irrespective of whether there is another vertex $v_j$ such that $R_w(u \to v_j)$ also satisfies the condition; we therefore plan a new variant of FAC-PIN in which each node u selects the best neighbor v to be merged with. Finally, we plan to modify FAC-PIN for directed (unweighted and weighted) protein interaction networks.

As a final note, we have not performed experiments on weighted PINs. In our initial submission, we used the following weighted criterion:

$$R_w(u \to v) = \frac{\sum_{a \in I^{+}_{u,v};\ (u,a) \in E} w(u,a)}{\sum_{b \in N^{+}_{u};\ (u,b) \in E} w(u,b)}$$

One of the reviewers of the initial manuscript pointed out that this formula is incorrect since it depends only on the weights of the edges connected to node u, not on those of the edges connected to v. An important consequence of this error is that our analysis of $R_w(u \to v)$ (based on the formula above) applies to the unweighted case only and does not necessarily apply to the weighted case. We verified this, both computationally and theoretically, before attempting experiments on weighted PINs. Due to time constraints, it was not possible to complete the experiments on weighted PINs using the correct formula in Equation (10). Our plan for the immediate future is therefore to perform these experiments.