Abstract
Advances in high-throughput omics technologies have made it possible to characterize molecular interactions within and across species. Aligning and comparing molecular networks across species helps detect orthologs and conserved functional modules and provides insight into the evolutionary relationships of the compared species. However, such analyses are not trivial because of the complexity of the networks and the high computational cost. Here we develop BinAligner, a network alignment algorithm that mixes global and local strategies. Based on the hypotheses that the similarity between two vertices across networks is context dependent and that the edges and structures of subnetworks can be more informative than vertices alone, two scoring schemes, 1-neighborhood subnetwork and graphlet, are introduced to derive the scoring matrices between networks, in addition to the commonly used scoring scheme based on vertices. The alignment problem is then formulated as an assignment problem, which is solved by a combinatorial optimization algorithm such as the Hungarian method. The proposed algorithm was applied and validated by aligning the protein-protein interaction network of Kaposi's sarcoma associated herpesvirus (KSHV) with that of varicella zoster virus (VZV). Interestingly, we identified several putative functional orthologous proteins with similar functions but very low sequence similarity between the two viruses. For example, KSHV open reading frame 56 (ORF56) and VZV ORF55 are helicase-primase subunits with sequence identity 14.6%, and KSHV ORF75 and VZV ORF44 are tegument proteins with sequence identity 15.3%. These functional pairs cannot be identified if the alignment is restricted to orthologous protein pairs. In addition, BinAligner identified a conserved pathway between the two viruses, which consists of 7 orthologous protein pairs connected by conserved links. This pathway might be crucial for virus packing and infection.
Background
In systems biology, networks are widely used to represent the interactions between biological macromolecules. Several distinct types of networks have been modeled at the molecular level, such as protein-protein interaction (PPI) networks [1], gene regulatory networks [2], metabolic networks [3], and signal transduction networks [4]. Comparative analyses of these networks can facilitate the identification of conserved components across biological systems and the further inference of the biological functions of these components.
A biological network is commonly represented as an undirected graph, in which each vertex corresponds to a biomolecule, e.g. a protein, and each edge denotes an interaction between two biomolecules. Conceptually, network alignment compares and aligns the vertices of two or more networks to identify subnetwork(s) with similar vertices, which could share similar functions, similar structure, or a common evolutionary history. In recent years, with the development of high-throughput experimental techniques such as the yeast two-hybrid system [5] and co-immunoprecipitation [6], the number of available biological networks has been increasing rapidly, leading to a huge demand for efficient network alignment methods and tools. Because network alignment is in principle an NP-complete problem [7], devising reliable and fast heuristics has become one of the foremost challenges for network alignment.
A number of network alignment methods have been developed in the past decade [8–16]. Similar to sequence alignment, network alignment methods can be characterized as either global or local. Global network alignment forces the alignment to span the entire set of vertices, which can provide insightful views of similarities and differences across species at the systemic level and help identify functional orthologs. In contrast, local network alignment only identifies highly similar subnetworks, which are more likely to be functional components such as pathways.
A pioneering work on global network alignment is IsoRank [14], which adopts a philosophy similar to Google PageRank: a match between two vertices is good if the neighbors of these two vertices are also matched well. Based on this hypothesis, the global network alignment problem is transformed into an eigenvector problem. A more recent algorithm, GRAAL [15], represents the structural information of a vertex by a vector that records the potential hits of special structures called graphlets in its neighborhood. By comparing the pairwise similarity between these vectors, a global alignment based purely on graph structure is achieved. Alternatively, the global network alignment problem can be transformed into a linear or quadratic integer programming problem and solved by linear relaxation [17], Lagrangian relaxation [16] or ILOG CPLEX [18]. However, these methods either restrict the alignment to orthologous candidates by setting the score of non-orthologous pairs to −∞, or focus too much on graph structural information. As a consequence, the resulting alignments need to be further optimized.
On the other hand, previous work on PPI networks has mostly focused on local alignments. PathBLAST [8, 9] combines the idea of the BLAST E-value with PPI network information to identify highly conserved pathways and complexes. By taking into account the duplication/divergence evolutionary model of protein-protein interactions, MaWISH [10] transforms the local network alignment problem into a maximum weight induced subgraph problem and solves it in a greedy manner. Graemlin [11] identifies conserved dense subnetworks by comparing the probability that a module is under evolutionary constraints with the probability that it is under no evolutionary constraints. Similarly, by comparing a network evolutionary model with a random model, Graph Alignment [12, 13] presents a complex scoring system on orthologous pairs, non-orthologous pairs, edge matches and mismatches, based on which a local alignment algorithm is designed. These local alignment methods can lead to local optimality because they are generally restricted to subnetworks (e.g. pathways and cliques).
To overcome the limitations of current network alignment algorithms, here we propose a new mixed method for BIological Network ALIGNment, called BinAligner. To integrate both local and global network alignments, BinAligner constructs a pairwise similarity matrix between two networks based on three types of similarity scores derived from vertices (e.g. single node comparison based on sequence information), 1-neighborhood alignment (e.g. the similarity of two nodes based on the information of their first-neighbor subnetworks), and graphlets (e.g. the similarity of n-neighborhood subnetworks, n ≥ 2), which integrate information from both nodes and edges. The introduction of neighborhood subnetworks is based on the hypothesis that the similarity between two vertices across networks is context-dependent. The alignment problem is then formulated as an assignment problem, which is solved in polynomial time by a combinatorial optimization algorithm such as the Hungarian method. The proposed algorithm was applied and validated by aligning the PPI network of varicella zoster virus (VZV) with that of Kaposi's sarcoma associated herpesvirus (KSHV) [13]. BinAligner outperformed GRAAL [15], Graph Alignment [12, 13], and IsoRank [14]. By further checking the biological functions of the aligned pairs, we identified several putative functional orthologous proteins and a conserved pathway between the two viruses, which consists of seven orthologous protein pairs connected by conserved links. This pathway might be crucial for viral packing and infection.
Methods
Here we use PPI networks to illustrate our algorithm. However, the algorithm can be applied to any type of biological network.
Mathematical formulation of network alignment
A PPI network is denoted by an undirected graph G = (V, E), where each node v ∈ V represents a protein, and uv ∈ E is an edge if there is an interaction between proteins u and v. Given two PPI networks G = (V, E) and H = (U, F), a network alignment is defined to be a one-to-one mapping π between the vertex sets V and U,
\pi :V\mapsto U. \qquad (1)
The unmapped vertices are assumed to be aligned to virtual dummy vertices. Alternatively, for any i ∈ V and j ∈ U,
{\pi}_{ij}=\begin{cases}1 & \text{if } \pi (i)=j,\\ 0 & \text{otherwise.}\end{cases} \qquad (2)
So a network alignment is specified once the values of π_{ ij } are given for all i ∈ V and j ∈ U.
Usually, each network alignment π is associated with an alignment score S consisting of a node score S_{ V } and an edge score S_{ E }, which reflect how well the vertices and the edges are aligned between the two networks, respectively. Specifically, for a pair of vertices i ∈ V and j ∈ U, let α_{ ij } be their sequence similarity score; then {S}_{V}=\sum _{i\in V}\sum _{j\in U}{\alpha}_{ij}{\pi}_{ij}. Similarly, let β_{ ijkl } denote the edge score for any 4 vertices i, j, k, l such that i, k ∈ V and j, l ∈ U; then {S}_{E}={\sum}_{i,k\in V}{\sum}_{j,l\in U}{\beta}_{ijkl}{\pi}_{ij}{\pi}_{kl}. The overall alignment score is
S={S}_{V}+{S}_{E}. \qquad (3)
The objective of the global network alignment problem is to find a map π that maximizes S subject to the following constraints,
\sum _{j\in U}{\pi}_{ij}\le 1\ \ \forall i\in V,\qquad \sum _{i\in V}{\pi}_{ij}\le 1\ \ \forall j\in U,\qquad {\pi}_{ij}\in \{0,1\}. \qquad (4)
These restrictions hold because each protein i ∈ V and j ∈ U can be mapped at most once in this framework.
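The scoring just described can be sketched in a few lines of Python. This is a minimal illustration, not BinAligner's implementation: it assumes that the overall score is the sum of the node and edge scores, that π is stored as a dictionary from vertices of V to vertices of U, and that α and β are sparse dictionaries whose missing entries count as 0. The names `alignment_score` and `is_one_to_one` are hypothetical.

```python
from itertools import product


def is_one_to_one(pi):
    """Check that no vertex of U is used twice (each i in V appears once as a dict key)."""
    targets = list(pi.values())
    return len(targets) == len(set(targets))


def alignment_score(pi, alpha, beta):
    """Return S = S_V + S_E for an alignment pi (dict: vertex of V -> vertex of U).

    alpha maps (i, j) to a node score; beta maps (i, j, k, l) to an edge score.
    Missing entries are treated as 0 (an assumption of this sketch).
    """
    pairs = list(pi.items())
    # Node score: sum of alpha_ij over aligned pairs (pi_ij = 1).
    s_v = sum(alpha.get(pair, 0.0) for pair in pairs)
    # Edge score: sum of beta_ijkl over ordered pairs of aligned pairs,
    # mirroring the double sum over i, k in V and j, l in U.
    s_e = sum(
        beta.get((i, j, k, l), 0.0)
        for (i, j), (k, l) in product(pairs, repeat=2)
    )
    return s_v + s_e
```

Because the double sum runs over ordered pairs, a symmetric β contributes each matched edge twice; whether that factor is absorbed into β is a modelling choice left open here.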
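A standard way to improve the speed is to linearize the quadratic objective function by introducing binary decision variables δ_{ ijkl } = π_{ ij }π_{ kl }. Thus, an equivalent linear integer programming problem is:
\max \ \sum _{i\in V}\sum _{j\in U}{\alpha}_{ij}{\pi}_{ij}+{\sum}_{i,k\in V}{\sum}_{j,l\in U}{\beta}_{ijkl}{\delta}_{ijkl} \qquad (5)
subject to
\sum _{j\in U}{\pi}_{ij}\le 1,\quad \sum _{i\in V}{\pi}_{ij}\le 1,\quad {\delta}_{ijkl}\le {\pi}_{ij},\quad {\delta}_{ijkl}\le {\pi}_{kl},\quad {\delta}_{ijkl}\ge {\pi}_{ij}+{\pi}_{kl}-1,\quad {\pi}_{ij},{\delta}_{ijkl}\in \{0,1\}. \qquad (6)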
An appropriate scoring scheme is one of the keys to a robust and effective network alignment algorithm. Several scoring schemes exist in the literature. For instance, Graph Alignment [12, 13] rewards orthologous protein pairs and edge matches, and penalizes non-orthologous pairs and edge mismatches, with scores based on the log-ratio of the probabilities that they result from evolution or arise just by chance. Given two pairs of aligned vertices under an alignment π, say j = π(i) and l = π(k) with i, k ∈ V and j, l ∈ U, we say an edge match happens if ik ∈ E and jl ∈ F, and an edge mismatch happens if ik ∈ E and jl ∉ F, or ik ∉ E and jl ∈ F.
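The definitions of edge match and edge mismatch translate directly into a counting routine. The sketch below assumes undirected edges given as pairs and an alignment stored as a dictionary; the function name `edge_match_counts` is illustrative only.

```python
from itertools import combinations


def edge_match_counts(pi, edges_g, edges_h):
    """Count edge matches and mismatches under an alignment pi (dict: V -> U).

    An unordered pair of aligned pairs (i, j), (k, l) is a match if ik is in E
    and jl is in F, and a mismatch if exactly one of the two edges exists.
    """
    e = {frozenset(edge) for edge in edges_g}
    f = {frozenset(edge) for edge in edges_h}
    matches = mismatches = 0
    for (i, j), (k, l) in combinations(pi.items(), 2):
        in_g = frozenset((i, k)) in e
        in_h = frozenset((j, l)) in f
        if in_g and in_h:
            matches += 1
        elif in_g != in_h:
            mismatches += 1
    return matches, mismatches
```

Pairs in which neither edge exists are counted as neither a match nor a mismatch, following the definition in the text.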
Construction of similarity matrices
Similarity on nodes
We use a matrix A to denote the pairwise sequence similarity between the vertex sets V and U. In [13], a program called sequenceAlign was developed to calculate the identity score between two proteins and to identify orthologous pairs. For i ∈ V and j ∈ U, we define A_{ ij } = 1 if they are orthologs and A_{ ij } = 0 otherwise.
Similarity on 1-neighborhood subnetworks
For networks whose maximum degree is not very large, the linear integer programming method is capable of exactly aligning the 1st neighborhoods of their vertices. The 1st neighborhood of a vertex i is the induced subgraph consisting of all vertices at distance less than or equal to 1 from i and the edges between them. Let i ∈ V and j ∈ U be any two vertices in networks G = (V, E) and H = (U, F). We use N_{ ij } to denote the best alignment score for the 1st neighborhoods of i and j, given that i is aligned to j. N denotes the similarity matrix on 1-neighborhood subnetworks of G and H.
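As an illustration of the definition above, the 1st neighborhood of a vertex can be extracted from an edge list as follows. This is only a sketch of the induced-subgraph step, with a hypothetical function name; computing N_{ ij } itself would additionally require aligning two such subgraphs.

```python
def first_neighborhood(vertex, edges):
    """Return (nodes, edges) of the subgraph induced by vertex and its neighbors."""
    nodes = {vertex}
    for u, v in edges:
        if u == vertex:
            nodes.add(v)
        elif v == vertex:
            nodes.add(u)
    # Keep every edge whose two distinct endpoints both lie within distance 1.
    sub_edges = {
        frozenset((u, v))
        for u, v in edges
        if u in nodes and v in nodes and u != v
    }
    return nodes, sub_edges
```

Edges between two neighbors of the core vertex are retained, which is what distinguishes the induced subgraph from a simple star around the vertex.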
Due to the power-law nature of PPI networks, there might be a few vertices with large degrees [19], for which exact alignment of the 1st neighborhoods becomes expensive. However, we only need an alignment score, not the exact alignment. Thus, a heuristic method, such as linear or Lagrangian relaxation, is a good alternative in this scenario. In practice, these large-degree vertices play an important role in guiding the alignment. Since the 1st neighborhood alone is too greedy for representing the similarity of two vertices, we incorporate similarities on graphlets to account for higher-order neighborhoods.
Similarity on graphlets (n-neighborhood subnetworks, n ≥ 2)
The concepts of graphlet and orbit were introduced by Przulj et al. [15, 20] to measure local network similarities. A graphlet is a small connected non-isomorphic subgraph of a large network, in which the non-isomorphic positions are labeled, and a graphlet with a unique labeled position is called an orbit [20]. However, to the best of our knowledge, vertex similarity information (e.g. orthology) has not been considered in the definition of graphlets. In this framework, we explicitly incorporate the vertex similarity information by introducing two types of positions in a graphlet: (1) positions requiring vertex sequence similarity and (2) positions without this requirement. We list in Figure 1 all 76 graphlets containing 104 non-isomorphic orbits on 2 to 4 taxa under symmetry, in which positions requiring vertex sequence similarity are denoted by solid circles, and other positions by normal circles. A weight is also associated with each graphlet to reflect its chance of occurrence. Specifically, the score of a normal circle is 0 and, for consistency, the score of a solid circle and that of an edge between two solid circles are defined to be the score for an orthologous pair and the score for an edge match, respectively, as defined earlier for the 1st neighborhood alignment. For example, let α and β be the scores for an orthologous pair and an edge match; then the first graphlet has weight 2β and the 4th graphlet has weight 2(α + β). The graphlets and orbits containing only 1st neighborhood information (complete graphs) are ignored because this information has already been considered in the 1st neighborhood alignment.
Let \mathcal{O} be a set of orbits. For any o\in \mathcal{O}, we say that two networks G = (V, E) and H = (U, F) hit o at (i, j), i ∈ V and j ∈ U, if there is a local alignment A between G and H such that
(1) i is aligned to j;
(2) o is an induced subgraph of the alignment graph of A with (i, j) being placed at the labeled vertex of o.
Here the alignment graph of A is a graph such that: (1) the vertex set consists of all aligned pairs (k, l) of vertices between V and U, where (k, l) is denoted by a solid circle if k and l are orthologs and by a normal circle otherwise; (2) there is an edge between two pairs of aligned vertices (i, j) and (k, l) if ik and jl are edges in G and H, respectively. We use a vector \overrightarrow{{s}_{\mathrm{ij}}} of dimension 104 to denote the similarity of i and j on graphlets. Specifically, \overrightarrow{{s}_{\mathrm{ij}}} [k] with 1 ≤ k ≤ 104 counts the number of possible hits of the corresponding graphlet between networks G and H when i and j are fixed at position k. Since some graphlets are contained in other graphlets, only the hit of the graphlet with the highest score is counted. For example, if graphlets 1, 2 and 4 are all hit at some pair (i, j), then only the entry of \overrightarrow{{s}_{\mathrm{ij}}} for graphlet 4 is increased by one. The graphlet score B_{ ij } of a pair (i, j) is then computed as the weighted sum of the entries in \overrightarrow{{s}_{\mathrm{ij}}}. In general, we use a matrix B to denote the similarity of networks G and H on graphlets.
The three similarity matrices A, N and B are then each normalized by their largest entry. For simplicity, we still use A, N and B to denote the normalized matrices. Although A, N and B alone already reflect the similarity of each pair of vertices between networks G and H, a better alignment can sometimes be retrieved from their weighted combination C = θ_{1} ∗ A + θ_{2} ∗ N + θ_{3} ∗ B, where the parameters 0 ≤ θ_{1}, θ_{2}, θ_{3} ≤ 1 with θ_{1} + θ_{2} + θ_{3} = 1 balance the importance of vertex similarity, the 1-neighborhood, and the n-neighborhoods (n ≥ 2).
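The combination step can be sketched as follows. The example assumes that each matrix is normalized by its own largest entry and that matrices are stored as lists of rows of equal shape; `normalize` and `combine` are illustrative names rather than BinAligner functions.

```python
def normalize(matrix):
    """Divide every entry by the largest entry; an all-zero matrix is returned unchanged."""
    peak = max(max(row) for row in matrix)
    if peak == 0:
        return [row[:] for row in matrix]
    return [[x / peak for x in row] for row in matrix]


def combine(a, n, b, thetas):
    """Return C = theta1 * A + theta2 * N + theta3 * B on normalized matrices."""
    t1, t2, t3 = thetas
    if min(thetas) < 0 or abs(t1 + t2 + t3 - 1.0) > 1e-9:
        raise ValueError("thetas must be non-negative and sum to 1")
    a, n, b = normalize(a), normalize(n), normalize(b)
    return [
        [t1 * x + t2 * y + t3 * z for x, y, z in zip(row_a, row_n, row_b)]
        for row_a, row_n, row_b in zip(a, n, b)
    ]
```

Setting one θ to 1 and the others to 0 recovers the corresponding single-scheme matrix, which is how the single-scheme alignments in the Results can be expressed in this form.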
Retrieving alignments from similarity matrices
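The network alignment π is generated from the similarity matrix C by solving the following assignment problem:
\max \ \sum _{i\in V}\sum _{j\in U}{C}_{ij}{\pi}_{ij} \qquad (7)
subject to
\sum _{j\in U}{\pi}_{ij}\le 1\ \ \forall i\in V,\qquad \sum _{i\in V}{\pi}_{ij}\le 1\ \ \forall j\in U,\qquad {\pi}_{ij}\in \{0,1\}. \qquad (8)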
This assignment problem can be solved by the Hungarian method or ILOG CPLEX in polynomial time. An alternative strategy to retrieve the alignment is to first find high-scoring pairs and fix them, and then gradually expand the obtained local alignments into their close neighborhoods according to the alignment score S defined in Eqn. 3, until all the vertices are aligned. In this process, some good local alignments and a global alignment are obtained simultaneously.
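For readers who want to experiment with the first strategy, the following is a compact pure-Python version of the Hungarian method in its standard potentials-based O(n^2 m) form. It is a sketch under stated assumptions, not the code used in BinAligner: C is a rectangular list of rows with non-negative entries, and as many pairs as possible are matched, so the dummy vertices are implicit. The function names are hypothetical.

```python
def _hungarian_min(cost):
    """Minimum-cost assignment for an n x m cost matrix with n <= m.

    Returns a dict mapping each row index to its assigned column index.
    """
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    u = [0.0] * (n + 1)          # row potentials
    v = [0.0] * (m + 1)          # column potentials
    p = [0] * (m + 1)            # p[j]: row matched to column j (1-based, 0 = free)
    way = [0] * (m + 1)          # predecessor columns on the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while True:                # flip the augmenting path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
            if j0 == 0:
                break
    return {p[j] - 1: j - 1 for j in range(1, m + 1) if p[j] != 0}


def max_weight_assignment(c):
    """Maximize the sum of C_ij * pi_ij; returns a dict row index -> column index."""
    n, m = len(c), len(c[0])
    if n <= m:
        return _hungarian_min([[-x for x in row] for row in c])
    # More rows than columns: solve the transposed problem and invert the result.
    transposed = [[-c[i][j] for i in range(n)] for j in range(m)]
    return {i: j for j, i in _hungarian_min(transposed).items()}
```

Ties between equally good assignments are broken arbitrarily by this routine, so a refinement step is still needed to choose among optimal assignments.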
It is worth noting that both strategies have their advantages, and both suffer from the problem of tie-breaking. To improve the performance, after obtaining the optimal assignment score, we refine the alignment according to the score function S. Specifically, let ŝ be the optimal assignment score for equation (7). We add the new restriction \sum _{i\in V}\sum _{j\in U}{C}_{ij}{\pi}_{ij}= ŝ to restrictions (8) and re-solve the optimization problem:
\max \ S={S}_{V}+{S}_{E} \qquad (9)
subject to
\sum _{j\in U}{\pi}_{ij}\le 1\ \ \forall i\in V,\qquad \sum _{i\in V}{\pi}_{ij}\le 1\ \ \forall j\in U,\qquad {\pi}_{ij}\in \{0,1\},\qquad \sum _{i\in V}\sum _{j\in U}{C}_{ij}{\pi}_{ij}= ŝ. \qquad (10)
This refinement does not increase the running time much because the set of assignments that attain the optimal score ŝ is usually small.
Parameter optimization
A challenging problem is how to specify the parameters α_{ ij } and β_{ ijkl }, that is, the score for node pair (i, j) and that for the link pair (ik, jl) for all i, k ∈ V and j, l ∈ U. Generally, α_{ ij } is positive if protein i and j are orthologs and negative otherwise. Similarly, β_{ ijkl } is positive if ik and jl are an edge match and negative if they are an edge mismatch. The values could be evaluated by the probabilities of a node match, mismatch, edge match and mismatch by randomly aligning the two networks. A complicated scoring scheme was shown in GraphAlignment [12], which provides functions to generate reasonable parameters using Bayesian statistics on merely the information of the two networks. We adopt these parameters as follows,
and
Performance assessment of network alignment
For an alignment \pi :V\mapsto U, two measures were used to evaluate global network alignment: edge correctness [15] and orthologous percentage. The edge correctness (EC) is defined as the number of aligned edges in G = (V, E) divided by the number of edges |E| in the network. The orthologous percentage (OP) is defined as the number of aligned orthologous pairs divided by the theoretical maximum number of orthologous pairs that can be aligned. Both measures lie between 0 and 1, and larger values are better.
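Both measures are simple ratios, as the following sketch shows. It assumes that self-links have been removed, that the alignment is a dictionary from V to U, and that the known orthologous pairs are given as a set of (i, j) tuples; the function names are illustrative.

```python
def edge_correctness(pi, edges_g, edges_h):
    """Fraction of edges of G whose endpoints are aligned to an edge of H."""
    e = {frozenset(edge) for edge in edges_g}
    f = {frozenset(edge) for edge in edges_h}
    aligned = sum(
        1
        for edge in e
        if all(x in pi for x in edge) and frozenset(pi[x] for x in edge) in f
    )
    return aligned / len(e)


def orthologous_percentage(pi, orthologs, max_pairs):
    """Aligned orthologous pairs divided by the theoretical maximum max_pairs."""
    hits = sum(1 for pair in pi.items() if pair in orthologs)
    return hits / max_pairs
```

As a consistency check on the reported numbers, 45 matched links and an edge correctness of 39.1% correspond to a denominator of 115 edges, so the choice of which network supplies the denominator should be made explicit when reproducing the results.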
We also adopt the geometric random graph model, a widely used theoretical model for PPI networks [15, 20–23], to analyze the statistical significance of our edge alignments. In this model, proteins are modeled as points in a metric space and are connected by an edge if they are within a fixed, specified distance of each other. Let n_{1} = |V| and n_{2} = |U| be the numbers of nodes and m_{1} = |E| and m_{2} = |F| be the numbers of edges of the two networks, respectively. The probability P of successfully aligning k or more edges by chance is
P=1-\sum _{i=0}^{k-1}\frac{\binom{{m}_{2}}{i}\binom{p-{m}_{2}}{{m}_{1}-i}}{\binom{p}{{m}_{1}}}
where p=\frac{{n}_{2}\left({n}_{2}-1\right)}{2} [15]. Usually, P < 0.025 is considered to be statistically significant, and the smaller the P value, the more significant the alignment.
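A numerical sketch of this significance test is given below. It assumes that P is the upper tail of a hypergeometric distribution over the p possible node pairs, which is the form implied by the definition of p; the function name is hypothetical and exact integer arithmetic is used so that very small probabilities are not lost to rounding.

```python
from math import comb


def edge_alignment_pvalue(k, n2, m1, m2):
    """Probability of aligning k or more edges by chance (hypergeometric tail).

    n2: number of nodes in the network that defines the p node pairs,
    m1, m2: numbers of edges of the two networks.
    """
    p = n2 * (n2 - 1) // 2
    total = comb(p, m1)
    # Sum the tail directly instead of computing 1 - (lower part), which
    # would lose precision when the P-value is extremely small.
    tail = sum(
        comb(m2, i) * comb(p - m2, m1 - i)
        for i in range(k, min(m1, m2) + 1)
    )
    return tail / total
```

Summing the tail is algebraically the same as subtracting the lower part from 1, but it stays accurate for the very small P-values that arise when many edges are aligned.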
Finally, we evaluate the performance of an alignment by exploring the functions of the aligned proteins. A biologically good alignment should align proteins in one network to proteins with similar functions in the other, and should be able to find functional orthologs missed by other alignments. In addition, it is valuable if the alignment is capable of finding common subnetworks between the two networks, which might be conserved for important functions. However, there is no absolute criterion for comparing protein functions, as in most cases the functions of the aligned proteins are not fully known.
Results
Benchmark datasets
To validate BinAligner, we align the PPI networks of two herpesviruses: VZV, which causes chickenpox in children, and KSHV, which causes Kaposi's sarcoma. The two viruses are closely related in evolution and are both common human pathogens. Although their interactions with the human host are widely studied, there is relatively little knowledge about the interactions among the viral proteins themselves. A comparative network study could provide insights into these pathogens.
The interactions of their open reading frames (ORFs) can be found in the supplement of [24]. Similar to Berg and Lässig [13], we construct the VZV and KSHV networks by using nodes to denote ORFs and links to denote the interactions between ORFs. The two networks are shown in Figures S1 and S2 (Additional file 1). The graphs were drawn using the free software Graphviz [25].
The VZV network consists of 173 interactions and 76 ORFs, among which 19 ORFs have no interaction and there are 13 self-interactions. For convenience, we remove the isolated vertices and self-links, and denote the network by a graph G = (V, E), in which |V| = 57 and |E| = 160. Similarly, the KSHV network consists of 123 interactions and 84 ORFs, among which 34 ORFs have no interaction and there are 8 self-interactions. We denote the network by H = (U, F), in which |U| = 50 and |F| = 115. According to the orthologous table in [13], there are 25 orthologous pairs between the ORFs of V and U if we remove the isolated orthologous ORFs (see Table S1 in Additional file 1). Because several proteins have more than one ortholog, the theoretical maximum number of non-overlapping orthologous pairs in an alignment is 16.
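The preprocessing described above (dropping self-links and then isolated vertices) can be sketched as follows. The example assumes interactions are given as pairs of ORF names and treats an ORF that only interacts with itself as isolated once its self-link is removed; `clean_network` is an illustrative name.

```python
def clean_network(orfs, interactions):
    """Remove self-links and isolated vertices from an interaction list.

    Returns (nodes, edges, isolated) where edges are unordered pairs.
    """
    # Drop self-interactions and collapse duplicate or reversed pairs.
    edges = {frozenset((u, v)) for u, v in interactions if u != v}
    # Keep only vertices that still take part in at least one edge.
    nodes = {x for edge in edges for x in edge}
    isolated = set(orfs) - nodes
    return nodes, edges, isolated
```

The reported sizes are consistent with this procedure: 173 − 13 = 160 and 76 − 19 = 57 for VZV, and 123 − 8 = 115 and 84 − 34 = 50 for KSHV.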
In this study, we developed a new similarity measure called the 1-neighborhood subnetwork, introduced orthologous information into graphlets (n-neighborhood subnetworks, n ≥ 2), and integrated the neighborhood subnetwork and graphlet measures with conventional sequence similarity. To demonstrate the usefulness of the new features and examine the importance of this integrative similarity measure, we compare the alignment results derived from it with those based solely on orthology information or on graph structural information. Our results demonstrate that integrating orthologous information, the 1-neighborhood subnetwork, and the orthologous graphlet scoring scheme leads to the best performance in network alignment. Finally, BinAligner was also compared with three widely used network alignment programs: GRAAL [15], Graph Alignment [12, 13], and IsoRank [14].
Alignments of KSHV and VZV PPI networks solely based on orthologous information
By setting θ_{2} and θ_{3} to 0, BinAligner generates an alignment based solely on orthologous information. The alignment is listed in Table S2 (Additional file 1), and the alignment graph with orthologous pairs and matched edges is plotted in Figure S3 (Additional file 1). This alignment identified 16 orthologous pairs together with 45 matched links, and thus the orthologous percentage is 100% and the edge correctness is 39.1%. Although the largest possible number of 16 orthologous pairs is aligned, it seems that some of them are misaligned, because the alignment is restricted to orthologous pairs and proteins with similar function but low sequence similarity therefore cannot be aligned. For example, KSHV ORF67.5 is aligned to VZV ORF49 and KSHV ORF23 is aligned to VZV ORF25. However, by checking the functions, KSHV ORF67.5 and VZV ORF25 are homologs of the HHV-1 protein UL33 [13], VZV ORF49 is likely a myristylated tegument protein [26], and KSHV ORF23 belongs to the herpesvirus core gene UL21 family. These two pairs are evidently misaligned: ORF67.5 has several sequence orthologs, and sequence information alone cannot distinguish them. As a consequence, some important pathways conserved in the KSHV and VZV PPI networks are more likely to be broken. Thus, interaction patterns from link information seem necessary to guide the alignment of orthologous pairs when a protein has several orthologous partners (see the results in a later section). Another major limitation of alignment based on orthologous information is that it is not effective in identifying functional orthologs with low sequence similarity. In our application, it is not surprising that, apart from the orthologous pairs, this alignment failed to identify any other seemingly functional orthologous pairs, since it was generated from orthologous information only.
Alignments of KSHV and VZV PPI networks solely based on graph structural information
By setting θ_{1} and α_{ ij } to 0, the KSHV and VZV PPI networks were aligned using graph structural information only. The aligned network has 68 edges (see Figure S4 in Additional file 1). The edge correctness is 59.1%, and the P-value is about 6.2 × 10^{−44}. The details of the aligned nodes are available in Table S3 (Additional file 1). Surprisingly, no orthologous pair appears in the aligned network, so the alignment is probably not very meaningful biologically. This result suggests that, unlike for non-biological networks, additional biological domain knowledge is crucial for guiding the alignment of biological networks. Other studies have shown that purely structural alignment can be very useful for aligning non-biological networks, such as computer networks and social networks [27].
Integration of orthologous information and neighborhood subnetwork scoring scheme resulted in the best alignment performance
By combining the orthologous information, the 1-neighborhood subnetwork and graphlets, the aligned network between the KSHV and VZV PPI networks has the largest possible number of 16 orthologous pairs and 58 interactions (see Table 1). Thus, the orthologous percentage is 100%, the edge correctness is 50.4%, and the P-value for the edge alignment is about 1.0 × 10^{−32}. The sub-alignment graph illustrating the aligned orthologous pairs and matched edges is shown in Figure 2, and the entire aligned graph is available in Figure S5 (Additional file 1).
In this aligned network, there is a connected sub-alignment graph with 7 pairs of orthologous vertices connected by matched links. The pairs (KSHV/VZV) are ORF29b/ORF42, ORF67.5/ORF25, ORF60/ORF18, ORF61/ORF19, K8/ORF23, ORF69/ORF27 and ORF28/ORF1, and their functions are listed in Table 2. As the functions of all these orthologous pairs are related to virus packing and infection, this pathway might be crucial in both viruses.
In addition, BinAligner identified some putative functional orthologous pairs with low sequence similarity but similar function. For example, KSHV ORF56 is aligned to VZV ORF55; their sequence identity is 14.6%, yet functionally they are both helicase-primase subunits. Similarly, KSHV ORF75 is aligned to VZV ORF44; their sequence identity is 15.3%, yet they are both tegument proteins. KSHV ORF50 and VZV ORF4 are herpesvirus transcription factors with a sequence identity of 11.4%. These putative functional orthologous proteins cannot be identified if the alignment is restricted to orthologous protein pairs, as in some conventional methods, which confirms the importance of neighborhood similarity.
Effectiveness of sequence similarity, 1-neighborhood subnetwork and graphlet
The parameters θ_{1}, θ_{2}, and θ_{3} balance the contributions of sequence similarity and neighborhood similarities. To test their influence, we compare the numbers of aligned orthologous pairs and matched edges under different parameter settings. We test the performance of (1) a single scheme, by setting one parameter to 1 and the other two to 0, (2) a combination of two schemes, by setting the remaining parameter to 0, and (3) the combination of all three schemes. The full results are shown in Table S4 (Additional file 1), from which we chose a sub-table (see Table 3) to show the importance of the 3 schemes. From the tables, one observation for the comparison of the KSHV and VZV networks is that sequence similarity contributes most to the number of orthologous pairs aligned, whereas the 1-neighborhood subnetwork is crucial to the number of matched edges. Graphlets contribute to both, but are not as important as the other two schemes. A possible reason is that we exclude the cliques from the graphlets. The results might be different if they were added back; however, this is beyond the scope of this study.
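One way to enumerate such parameter settings is a regular grid over the simplex θ_{1} + θ_{2} + θ_{3} = 1. The sketch below is an assumption about how the settings could be generated; the source does not state a grid resolution, and `theta_grid` and `scheme_count` are illustrative names.

```python
def theta_grid(steps):
    """Yield all (theta1, theta2, theta3) on a regular grid with spacing 1/steps."""
    for a in range(steps + 1):
        for b in range(steps + 1 - a):
            yield (a / steps, b / steps, (steps - a - b) / steps)


def scheme_count(theta):
    """Number of scoring schemes that are switched on (non-zero weight)."""
    return sum(1 for t in theta if t > 0)
```

With `steps = 10` the grid contains 66 settings: 3 single-scheme corners, 27 two-scheme settings on the edges of the simplex, and 36 settings that mix all three schemes.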
Comparison with other algorithms on KSHV and VZV PPI network alignments
In this section, we compare BinAligner with three popular network alignment algorithms: IsoRank [14], GRAAL [15] and Graph Alignment [12, 13]. Performance is evaluated by the number of aligned orthologous pairs and vertices: the more orthologous pairs and the more vertices, the better the performance.
For a fair comparison, we tune the parameters of IsoRank and GRAAL as well as those of our method. IsoRank and GRAAL each have one parameter that balances the node and link contributions. For IsoRank, the parameter is optimized over the range 0 to 1 with a step of 0.01, and for GRAAL over the range 0 to 1 with a step of 0.1. The step sizes differ because the minimum step allowed in GRAAL is 0.1. Graph Alignment is parameter-free, since its program pre-assesses the parameters; however, the quality of its alignment seems to depend strongly on the initial random seed. We ran Graph Alignment 100 times and recorded only the best alignment. The comparison of all 4 methods is shown in Table 4.
Our results show that BinAligner achieved the highest numbers of orthologous protein pairs and matched link pairs. Since GRAAL and Graph Alignment aligned only 2 and 9 pairs of orthologous proteins, respectively, and also aligned considerably fewer interactions than BinAligner, we functionally compare only the alignment by BinAligner (see Table 1) with that by IsoRank (see Table S5 in Additional file 1). The two alignments share almost all orthologous pairs, except that BinAligner generates one more orthologous pair, KSHV ORFK15/VZV ORF65. ORFK15 is a signal transducing membrane protein and ORF65 is a tegument protein, which immunoprecipitated a 16-kDa protein from the membrane fraction of VZV-infected cells [28]. In contrast, IsoRank aligns KSHV ORFK15 to VZV ORF64, which is a Gene66(IRS) protein and is not expected to correspond to a signal transducing membrane protein. In addition, the functional orthologous pairs identified by BinAligner were missed by IsoRank. For example, instead of aligning KSHV ORF56 to VZV ORF55, which are both helicase-primase subunits, IsoRank aligns KSHV ORF56 to VZV ORF59, which is a uracil-DNA glycosylase. Since BinAligner also aligned 10 more matched links, we believe that our alignment is better than that by IsoRank, although the two alignments do share many aligned protein pairs.
Discussion
In this study, a pairwise similarity matrix on the vertices of two biological networks is constructed from sequence similarity, 1-neighborhood subnetworks, and graphlets with orthologous information. The philosophy is that the similarity of two nodes in different biological networks is reflected successively by their sequence similarity (their own information), the similarity of the vertices and edges linked to them, and the similarity of those indirectly linked to them. The closer the vertices and edges are to the compared core vertices, the more impact they have in reflecting the similarity of the compared vertices. To the best of our knowledge, the 1-neighborhood subnetwork and graphlets with orthologous information have not been studied in the literature. Our example illustrates that the two similarity measures, especially the 1-neighborhood subnetwork, contribute significantly to identifying a good network alignment. In addition, we removed the orthologous information and conducted an alignment based on network structure alone, which also showed the importance of 1-neighborhood subnetwork similarity in guiding a good alignment. The graphlets with orthologous information are incorporated to account for the information of farther neighborhoods. In this study, the 104 graphlet orbits were applied to consider information from up to the 3-neighborhood. The global similarity of two proteins is mostly decided by their sequence similarity and then by the proteins and interactions close to them. However, more distant proteins and interactions may still have an indirect influence, so it can be beneficial to consider this indirect information. In practice, the best alignment was achieved by combining the three similarity measures.
As with sequence alignment, the comparison of biological networks is important in guiding biological research. Although we focus on aligning two protein-protein interaction networks in this study, BinAligner could be used to align other types of biological networks, such as gene regulatory networks and metabolic networks. Local network alignment can identify functional components, such as pathways and complexes, that are conserved among different species or individuals, whereas global network alignment helps to infer the evolutionary relationships among species and can provide useful information on functional orthologs that might not be detected from sequence analysis alone. By aligning the PPI networks of KSHV and VZV, we identified a subnetwork consisting of seven orthologous protein pairs connected by matched links in the two networks. This subnetwork might be conserved because it carries functions crucial to the two herpesviruses, such as virus packing and infection. We also identified some non-orthologous pairs that share similar link patterns in each network and might be functional orthologs.
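The notion of matched links and of a conserved subnetwork connected by them can be made concrete with a small sketch. Given an alignment (a mapping from vertices of one network to vertices of the other), it lists the edges of the first network whose images are also edges of the second, then groups their endpoints into connected components. The vertex names, edge lists, and function names below are hypothetical and are not the KSHV/VZV data or BinAligner's own code.

```python
from collections import defaultdict


def matched_links(edges1, edges2, alignment):
    """Edges (u, v) of network 1 whose aligned images are linked in network 2."""
    undirected2 = {frozenset(e) for e in edges2}
    return [
        (u, v)
        for u, v in edges1
        if u in alignment
        and v in alignment
        and frozenset((alignment[u], alignment[v])) in undirected2
    ]


def conserved_components(links):
    """Group the endpoints of matched links into connected components."""
    adj = defaultdict(set)
    for u, v in links:
        adj[u].add(v)
        adj[v].add(u)
    seen, components = set(), []
    for start in sorted(adj):
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:  # iterative depth-first search
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(adj[node] - comp)
        seen |= comp
        components.append(comp)
    return components


# Toy networks with made-up vertex names.
edges1 = [("a1", "a2"), ("a2", "a3"), ("a3", "a4")]
edges2 = [("b2", "b1"), ("b2", "b3"), ("b1", "b4")]
alignment = {"a1": "b1", "a2": "b2", "a3": "b3", "a4": "b4"}

links = matched_links(edges1, edges2, alignment)
components = conserved_components(links)
print(links, components)
```

The number of matched links is one of the quantities used above to compare alignments, and a large connected component of matched links is a candidate conserved pathway.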
The current version of BinAligner is feasible only for aligning small networks with tens to hundreds of vertices. It would be useful for the accurate comparison of small biological networks, such as viral networks, and for refining subnetwork alignments within large network alignments.
However, the lack of scalability remains a major disadvantage of BinAligner. Because sequence similarity comparison and graphlet signature identification are already available for networks with thousands of vertices and edges [15], the main bottleneck of the method is generating the exact alignment scores of the 1-neighborhood subnetworks. Two factors slow down this process. First, if the two networks have n_{1} and n_{2} vertices, we need to perform n_{1} × n_{2} pairwise 1-neighborhood subnetwork alignments, a number that can be huge when both n_{1} and n_{2} are large. Because each pairwise 1-neighborhood subnetwork comparison is independent of the others, a ready solution is parallel programming. Second, owing to the power-law nature of biological networks, there may be a few vertices with large degrees [19], whose 1-neighborhood subnetworks are correspondingly large and expensive to align exactly. However, we need only an estimate of the alignment score that reflects the similarity of the 1-neighborhoods of the two compared core vertices, not the exact alignment. Thus, a heuristic method, such as linear or Lagrangian relaxation, is a good alternative in this scenario. In the future, parallel programming and heuristic alignment for comparing 1-neighborhoods in which both subnetworks have many vertices will be implemented in BinAligner.
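Because the n_{1} × n_{2} pairwise comparisons are independent, they map naturally onto a worker pool. The sketch below shows that structure using Python's standard `concurrent.futures` module. The scoring function is a deliberately trivial placeholder (a ratio of neighborhood sizes) and is an assumption for illustration, not BinAligner's 1-neighborhood alignment score; the adjacency lists and function names are likewise hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def neighborhood(adj, v):
    """Vertex set of the 1-neighborhood of v: v plus its direct neighbors."""
    return {v} | set(adj[v])


def placeholder_score(task):
    """Stand-in score: ratio of the smaller to the larger neighborhood size.

    A real implementation would align the two 1-neighborhood subnetworks here.
    """
    (u, nbh_u), (v, nbh_v) = task
    return u, v, min(len(nbh_u), len(nbh_v)) / max(len(nbh_u), len(nbh_v))


def all_pair_scores(adj1, adj2, max_workers=4):
    """Score every cross-network vertex pair; each task is independent."""
    tasks = product(
        [(u, neighborhood(adj1, u)) for u in adj1],
        [(v, neighborhood(adj2, v)) for v in adj2],
    )
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return {(u, v): s for u, v, s in pool.map(placeholder_score, tasks)}


# Toy adjacency lists with made-up vertex names.
adj1 = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
adj2 = {"x": ["y"], "y": ["x"]}

scores = all_pair_scores(adj1, adj2)
print(len(scores), scores[("b", "x")])
```

A thread pool keeps the example short and portable. For CPU-bound scoring in CPython, `ProcessPoolExecutor` from the same module would be the more likely choice, since threads do not run Python bytecode in parallel; it exposes the same `map` interface but requires a picklable top-level scoring function and a `__main__` guard.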
Conclusion
BinAligner measures node similarity between biological networks by sequence similarity, 1-neighborhood subnetwork similarity, and graphlet similarity, and then retrieves a global or local alignment from the node similarity matrix. The results of aligning the PPI networks of two herpesviruses, KSHV and VZV, show that BinAligner outperforms some existing methods by aligning more orthologous protein pairs and more protein interactions.
Availability and implementation
BinAligner is available at
References
Phizicky EM, Fields S: Protein-protein interactions: methods for detection and analysis. Microbiol Rev. 1995, 59 (1): 94-123.
Davidson E, Levin M: Gene regulatory networks. Proc Natl Acad Sci. 2005, 102 (14): 4935. 10.1073/pnas.0502024102.
Schuster FD S, Dandekar T: A general definition of metabolic pathways useful for systematic organization and analysis of complex metabolic networks. Nature Biotechnology. 2000, 18: 326-332. 10.1038/73786.
Galperin M: Bacterial signal transduction network in a genomic perspective. Environmental Microbiology. 2004, 6 (6): 552-567. 10.1111/j.1462-2920.2004.00633.x.
Fields S, Song O: A novel genetic system to detect protein-protein interactions. Nature. 1989, 340: 245-246. 10.1038/340245a0.
Aebersold R, Mann M: Mass spectrometry-based proteomics. Nature. 2003, 422: 198-207. 10.1038/nature01511.
Lathrop R: The protein threading problem with sequence amino acid interaction preferences is NP-complete. Prot Eng. 1994, 7: 1059-1068. 10.1093/protein/7.9.1059.
Kelley B, Sharan R, Karp R, Sittler T, Root D, Stockwell B, Ideker T: Conserved pathways within bacteria and yeast as revealed by global protein network alignment. Proc Natl Acad Sci. 2003, 100: 11394-11399. 10.1073/pnas.1534710100.
Kelley B, Yuan B, Lewritter F, Sharan R, Stockwell B, Ideker T: PathBLAST: a tool for alignment of protein interaction networks. Nucl Acids Res. 2004, 32: W83-W88. 10.1093/nar/gkh411.
Koyutürk M, Kim Y, Topkara U, Subramaniam S, Szpankowski W, Grama A: Pairwise alignment of protein interaction networks. J comput biol. 2006, 13: 82199.
Flannick J, Novak A, Srinivasan B, McAdams H, Batzoglou S: Graemlin: General and robust alignment of multiple large interaction networks. Genome Res. 2006, 16: 1169-1181. 10.1101/gr.5235706.
Berg J, Lässig M: Cross-species analysis of biological networks by Bayesian alignment. Proc Natl Acad Sci. 2006, 103: 10967-10972. 10.1073/pnas.0602294103.
Kolar M, Lässig M, J B: From protein interactions to functional annotation: Graph alignment in Herps. BMC Syst Biol. 2008, 2: 90. 10.1186/1752-0509-2-90.
Singh R, Xu J, Berger B: Global alignment of multiple protein interaction networks with application to functional orthology detection. Proc Natl Acad Sci. 2008, 105: 12763-12768. 10.1073/pnas.0806627105.
Kuchaiev O, Milenković T, Memisević V, Hayes W, Przulj N: Topological network alignment uncovers biological function and phylogeny. J R Soc Interface. 2010, 7: 1341-1354. 10.1098/rsif.2010.0063.
Klau G: A new graph-based method for pairwise global network alignment. BMC Bioinformatics. 2009, 10 (Suppl 1): S59. 10.1186/1471-2105-10-S1-S59.
Li Z, Zhang S, Wang Y, Zhang X, Chen L: Alignment of molecular networks by integer quadratic programming. Bioinformatics. 2008, 24: 594-596. 10.1093/bioinformatics/btm630.
Li J, Yang J, Dong L, Hu K, Li F, Grünewald S: Pairwise Alignment of Protein-Protein Interaction by Linear Programming. Acta biophysica sinica. 2010, 26: 73-79.
He X, Zhang J: Why Do Hubs Tend to Be Essential in Protein Networks?. PLoS Genet. 2006, 2: e88. 10.1371/journal.pgen.0020088.
Przulj N, Corneil D, I J: Modeling Interactome, Scale-Free or Geometric?. Bioinformatics. 2004, 20: 3508-3515. 10.1093/bioinformatics/bth436.
Higham DJ, Rašajski M, Pržulj N: Fitting a geometric graph to a protein-protein interaction network. Bioinformatics. 2008, 24 (8): 1093-1099. 10.1093/bioinformatics/btn079.
Kuchaiev O, Rasajski M, Higham D, N P: Geometric denoising of protein-protein interaction networks. PLoS Comput Biol. 2009, 5 (8): e1000454. 10.1371/journal.pcbi.1000454.
Kuchaiev O, Stevanovic A, Hayes W, N P: GraphCrunch 2: Software tool for network modeling, alignment and clustering. BMC Bioinformatics. 2011, 12: 24. 10.1186/1471-2105-12-24.
Uetz P, Dong Y, Zeretzke C, Atzler C, Baiker A, Berger B, Rajagopala S, Roupelieva M, Rose D, Fossum E, Haas J: Herpesviral protein networks and their interaction with the human proteome. Science. 2006, 311: 239-242. 10.1126/science.1116804.
Ellson J, Gansner E, Koutsofios L, North S, Woodhull G, Description S, Technologies L: Graphviz: open source graph drawing tools. Lecture Notes in Computer Science. 2001, 483-484.
Sadaoka T, Yoshiil H, Imazawa T, Yamanishi K, Mori Y: Deletion in Open Reading Frame 49 of Varicella-Zoster Virus Reduces Virus Growth in Human Malignant Melanoma Cells but Not in Human Embryonic Fibroblasts. Journal of Virology. 2007, 81: 12654-12665. 10.1128/JVI.01183-07.
Burt R: The network structure of social capital. Research in Organizational Behavior. 2000, 22:
Cohen J, Sato H, Srinivas S, Lekstrom K: Varicella-zoster virus (VZV) ORF65 virion protein is dispensable for replication in cell culture and is phosphorylated by casein kinase II, but not by the VZV protein kinases. Virology. 2001, 280 (1): 62-71. 10.1006/viro.2000.0741.
Acknowledgements
The authors would like to thank Michal Kolar for providing the data for this analysis.
Declarations
This work was partially supported by the Natural Science Foundation of China (No. 10971213) to S.G., and Department of Justice (2010DDBX0596) and National Institutes of Health (NIAID RC1AI086830) to X.F.W.
This article has been published as part of BMC Bioinformatics Volume 14 Supplement 14, 2013: Proceedings of the Tenth Annual MCBIOS Conference. Discovery in a sea of data. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/14/S14.
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
XFW and SG supervised the project. JY and JL performed the experiments, analyzed data and wrote the paper. All authors revised the paper.
Jialiang Yang, Jun Li contributed equally to this work.
Electronic supplementary material
12859_2013_6086_MOESM1_ESM.docx
Additional file 1: Supplementary file contains the legends and description of 5 supplementary figures and 5 supplementary tables. (DOCX 2 MB)
About this article
Cite this article
Yang, J., Li, J., Grünewald, S. et al. BinAligner: a heuristic method to align biological networks. BMC Bioinformatics 14 (Suppl 14), S8 (2013). https://doi.org/10.1186/1471-2105-14-S14-S8