Exact p-values for global network alignments via combinatorial analysis of shared GO terms

Network alignment aims to uncover topologically similar regions in the protein–protein interaction (PPI) networks of two or more species under the assumption that topologically similar regions tend to perform similar functions. Although there exist a plethora of both network alignment algorithms and measures of topological similarity, currently no “gold standard” exists for evaluating how well either is able to uncover functionally similar regions. Here we propose a formal, mathematically and statistically rigorous method for evaluating the statistical significance of shared GO terms in a global, 1-to-1 alignment between two PPI networks. Given an alignment in which k aligned protein pairs share a particular GO term g, we use a combinatorial argument to precisely quantify the p-value of that alignment with respect to g compared to a random alignment. The p-value of the alignment with respect to all GO terms, including their inter-relationships, is approximated using the Empirical Brown’s Method. We note that, just as with BLAST’s p-values, this method is not designed to guide an alignment algorithm towards a solution; instead, just as with BLAST, an alignment is guided by a scoring matrix or function; the p-values herein are computed after the fact, providing independent feedback to the user on the biological quality of the alignment that was generated by optimizing the scoring function. Importantly, we demonstrate that among all GO-based measures of network alignments, ours is the only one that correlates with the precision of GO annotation predictions, paving the way for network alignment-based protein function prediction.


Introduction
Network alignment aims to uncover similar network connection patterns between two or more networks under the assumption that common network topology (which may be easily observable) correlates with common function (which is more difficult to observe).Network alignment algorithms abound and W. Hayes Dept. of Computer Science, UC Irvine Tel.: +1-949-824-1753 Fax: +1-949-824-4056 E-mail: whayes@uci.edutheir number is increasing rapidly; see for example Table 4 and recent surveys [1,2,3,4,5,6,7,8] While most practitioners agree on the goal of network alignment, in order to test various algorithms against each other for the ability to recover functional similarity, one needs a way to evaluate the functional similarity uncovered by a given network alignment.Unfortunately, there are almost as many ways to evaluate an alignment as there are alignment algorithms.
One of the most common methods for evaluating the biological significance of an alignment involves using the Gene Ontology's (GO) term hierarchy [9].There are several mathematical/statistical complications that arise when attempting to evaluate an alignment using GO terms: -Most GO terms have inter-dependencies with many other GO terms via the GO hierarchy [10].-Most genes and proteins have more than one GO annotation, and it is difficult to create a measure that correctly evaluates similarity between two proteins with different sets of GO terms that only partially overlap.-Since most GO terms annotate many proteins, it is nontrivial to compute the significance of aligning a set of protein pairs while accounting for both the frequency and inter-relationships between GO terms that may appear in multiple pairs across the set of aligned pairs.-Even given just one GO term g, it is nontrivial to compute the statistical significance of the event that k protein pairs in the alignment share g.
In this paper we deal only with the last issue: given a particular global alignment between a pair of networks in which k aligned protein pairs share a specific GO term g, we compute the exact p-value that a random alignment would have k such aligned pairs.The good news is that, once an exact p-value is known for each GO term g, the Empirical Brown's Method [11] can be used to approximately account for the other complications above.
Additionally, there are non-mathematical considerations when using GO terms: protein function is ultimately determined experimentally, so there is always experimental uncertainty involved in claiming that a certain protein should be annotated with a particular GO term; molecular and cellular biology is far from being fully understood, and so the GO term hierarchy itself is in constant flux, with new GO terms introduced as completely novel functions are discovered, or GO terms being merged or split or even deleted as the functional hierarchy is re-evaluated; and different authors may disagree on which GO terms are important, reliable, etc.While these are obviously important scientific considerations, they are beyond the scope of this paper and we will not discuss them further.
2 Method: GO-term p-values by exhaustive enumeration of alignments

Network alignment and functional similarity
Given two networks G 1 , G 2 , let the node sets V 1 , V 2 represent n 1 and n 2 proteins respectively, and the edge sets E 1 , E 2 represent protein-protein interactions (PPIs).Assuming (without loss of generality) that n 1 ≤ n 2 , a pairwise global network alignment (PGNA) is a 1-to-1 mapping f : V 1 → V 2 in which every node in V 1 is mapped to exactly one node in V 2 .
Once an alignment is specified, we need to measure the extent to which functionally similar proteins have been mapped to each other.Many existing methods evaluate their alignments using GO terms from the Gene Ontology [9], and most often evaluate the functional similarity of each pair of aligned proteins independent of all the others, and then average across the pairs.While the score of each pair may be meaningful, taking an average across pairs assumes that each pair is independent of all the others.This is not true because the pairings themselves are inter-dependent via the alignment itself, which is built globally.For example, in a 1-to-1 alignment, each node in each network can appear at most once across the entire alignment, a property which destroys the independence assumption needed for a meaningful average across aligned protein pairs; we discuss this problem in more detail in §4.3.
Our solution to this problem is to look at an alignment from the viewpoint of one GO term at a time, rather than one aligned pair of proteins at a time.
To that effect, we now describe how to compute the exact p-value that exactly k aligned protein pairs share a particular GO term g.

Computing the total number of possible alignments
In the following exposition, we must discuss in great detail the combinatoric structure of a given alignment.To aid visualization, we use what I call the "Pegs and Holes" analogy: given networks G 1 , G 2 with n 1 , n 2 nodes, we imagine G 2 's nodes as n 2 identical "holes" drilled into a large board, and G 1 's nodes as n 1 identical "pegs" that can each fit into any hole.To enforce the global 1-to-1 property, there are two cases: 1. n 1 ≤ n 2 , so every peg is placed into some hole, leaving n 2 −n 1 empty holes.
There are n2 n1 ways to choose which holes to use, and n 1 !ways to place the pegs.2. n 1 > n 2 , so every hole is filled with some peg, leaving n 1 − n 2 pegs unplaced.There are n1 n2 ways to choose which pegs to place, and n 2 !ways to place them.The above two cases are symmetric and so, without loss of generality, we assume n 1 ≤ n 2 .Then, the total number of all possible alignments is The function P (•, •) of Eq. ( 1) is more commonly known as k-permutations-ofn, or P (n, k).However, P (n, k) is usually defined to be zero if n < k, whereas we will often need to compute the number of alignments when we don't know which of the two values is larger.Thus, in this paper, we will adopt a modified permutation function π(n 1 , n 2 ) as follows 2.3 Counting alignments with exactly k matches Given a particular GO term g, assume g annotates λ 1 pegs and λ 2 holes.A peg and the hole it sits in are, more technically, a pair of aligned nodes.
We say that such a pair "match" with respect to GO term g if they are both annotated with g.Let λ = min(λ 1 , λ 2 ), and λ = max(λ 1 , λ 2 ).Given a random 1-to-1 alignment, we are going to compute the probability p that exactly k pairs of aligned nodes share g.In our analogy, this means that exactly k pegs-no more, no less-that are annotated with g sit in holes that are also annotated with g.To do this, we will use a combinatorial argument to enumerate all possible PGNAs that can exist that have exactly k matches.
Given that number, we simply divide by Eq. ( 1) to get the probability that a randomly chosen alignment has exactly k matches.

Special cases
The following are special cases: , then p = 0, otherwise p > 0 is computed below.
The last case arises when λ 1 > n 2 − λ 2 , which means that there are more annotated pegs than non-annotated holes, necessitating that at least λ 1 − (n 2 − λ 2 ) annotated pegs must align with annotated holes.(Recall we are computing the probability of exactly k aligned pairs sharing g, so k too small in this case gives p = 0.) Below we describe the general case in detail.In broad outline, there are three steps: (i) create the required k matches by placing k annotated pegs into k annotated holes; (ii) arrange to place the remaining annotated pegs away from the annotated holes in order to keep k constant; (iii) place any remaining pegs (all of which are non-annotated) in any still-empty holes (some of which may be annotated).In each case we either sum, or multiply, as appropriate, the number of ways to perform the described action.In the end we have counted all the possible ways to create an alignment that has exactly k matches.

Creating exactly k matches
Out of the λ 1 pegs annotated with g, pick k ≤ λ of them; there are λ1 k ways to do this.We will place these k pegs into k holes that are also annotated with g; there are λ2 k ways to pick the holes, and k! ways to place the k pegs into the k holes.Thus, the total number of ways to match exactly k pairs of nodes that share g is From this point onward, in order to keep k constant, we are committed to creating no more matches.

Enumerating the ways to use the remaining annotated holes
To ensure that no more node pairs are matched, we need to ensure that none of the remaining (λ 1 − k) annotated pegs are placed into any of the remaining (λ 2 −k) annotated holes.Thus, each annotated hole must either remain empty, or take an non-annotated peg.There are n 1 −λ 1 available non-annotated pegs, regardless of the value of k.Pick µ of them.Since these µ pegs are all nonannotated, they can go into any unoccupied annotated hole without changing k.However, there are lower and upper bounds on what µ can be, as follows: µ can be at most µ ≡ min(n 1 −λ 1 , λ 2 −k), since n 1 −λ 1 is the total number of non-annotated pegs, and λ 2 − k is the number of available annotated holes in which to place (some of) them.
note that we have n 1 − k pegs (of both types) remaining to place, and exactly n 2 − λ 2 non-annotated holes, into which some (or all) of the pegs can be placed.By the pigeon hole principle, if (n , then some of the pegs-and they can only be non-annotated pegs-must go into annotated holes.Thus, µ-which refers only to non-annotated pegs-must be at least µ ≡ (n ; otherwise µ = 0.

Distributing the remaining pegs
For any µ ≤ µ ≤ µ, we need to count how many alignments can be built when µ non-annotated pegs are placed into the λ 2 − k available annotated holes, as well as what happens to all the remaining pegs.The process is as follows.
1.There are n1−λ1 µ ways to choose µ non-annotated pegs, and π(λ 2 − k, µ) ways to align them with the open annotated holes.To simplify notation note that n 1 , n 2 , λ 1 , λ 2 are all fixed; thus, let γ k (µ) = n1−λ1 µ π(λ 2 − k, µ). 2. Recall that there are still λ 1 − k annotated pegs to be placed, and that they must be placed into non-annotated holes, so we must "reserve" λ 1 − k non-annotated holes, which will be further accounted for below.3. Once µ annotated holes are filled with non-annotated pegs, the rest of the annotated holes must remain empty; this leaves n 1 − λ 1 − µ non-annotated pegs to go into the n 2 − λ 2 non-annotated holes.Keeping in mind the "reservation" above, there are n 2 − λ 2 − (λ 1 − k) available non-annotated holes.There are n2−λ2 λ1−k ways to choose which holes to use while reserving λ 1 − k of them, and π(n Finally, we place the remaining λ 1 − k annotated pegs into the reserved holes of the same number; there are (λ 1 − k)! ways to do this.

Summing the unmatched region of the alignment
Combining all of the above for fixed µ and then summing over all possible µ, the total number of ways that n 1 − λ 1 non-annotated pegs can be used to (partially or wholly) fill λ 2 − k annotated holes, and then use all the remaining pegs and holes in a manner consistent with keeping k constant, is

Final tally for exactly k matches
Combining Eq.s ( 3) and ( 4), the total number of alignments in which exactly k aligned node pairs share GO term g is 2. 4 The probability of an alignment with exactly k matches Eq. ( 5) counts all possible alignments in which exactly k aligned node pairs share GO term g.To get the probability p k of the same event, we divide by Eq. ( 1): where a superscript g has been added as appropriate to denote that this probability is specifically tied to GO term g.Note this refers to exactly k matches.To measure the statistical significance of m matches, we sum Eq. ( 6) for k from m to λ g .

Efficiently dealing with huge numbers
Though technically it is only an implementation detail, it is important to briefly discuss how to deal with the astronomically huge numbers involved in these calculations.Typical modern biological networks can have thousands to tens of thousands of nodes, and some GO terms annotate thousands of genes in each network.For example, in BioGRID 3.4.164that we use below, the two biggest PPI networks in terms of number of nodes are H. sapiens and A. thaliana, which contain exactly 17,200 and 9,364 unique proteins, respectively, that are involved in physical interactions.Eq. ( 1) in this case is approximately 10 38270 -an integer with over 38,000 digits in base-10, which is far above the values typically representable on modern hardware.Luckily, its logarithm is easy to represent in double precision floating point, and so all of the multiplications herein can be computed as the floating-point sum of logarithms.The sole complication is the summation in Eq. ( 4), which is a sum of values, not logarithms.We use the following trick.Given two numbers a and b, assume we have at our disposal only their logarithms, α = log(a) and β = log(b).Our goal is to estimate log(a + b).Without loss of generality, assume a ≤ b.Then, where f (x) is some function that can provide an accurate estimate of log(1+x) for any |x| ≤ 1.One must be careful because if |x| is below the machine epsilon (≈ 10 −16 in double precision), then 1 + x evaluates to 1 because x is rounded away, and a direct evaluation of the expression log(1 + x) gives zero.The solution is not hard: the built-in library function for log can evaluate log(1+x) with sufficient accuracy if |x| > 10 −6 ; for smaller values of |x|, we explicitly invoke the Taylor series, which is extremely accurate for small values of |x|.
We have tested that this method gives values for log(a + b) that are accurate to almost machine precision for any |x| ≤ 1.

Numerical Validation
Staring at C k (λ 1 , λ 2 ) in Eq. ( 5) and tracing back through the equations that define its components, it is not immediately obvious that the C k (λ 1 , λ 2 ), when summed over all possible values of k, must add up to exactly π(n 1 , n 2 ) independent of the choice of λ 1 , λ 2 .Yet if Eq. ( 5) is correct, then this must be the case since summing p k in Eq. ( 6) across all k of must give exactly 1.
In the calculation of p g k in Eq. ( 6), the values of k and g are fixed.For a fixed g, valid values of k range from zero to λ g .If our calculations are correct, then the sum across k of p g k should be exactly 1 for any fixed g, n 1 , n 2 , λ 1 , λ 2 .We tested tested this property in the following cases: 1. exhaustively for all 0 ≤ λ 1 ≤ n 1 and 0 ≤ λ 2 ≤ n 2 for all 0 ≤ n 1 ≤ n 2 ≤ 100; 2. as above but in steps of 10 in λ i and n i up to n 2 = 1, 000; 3. as above but in powers of 2 in λ i and n i up to n 2 = 32, 768; 4. several billion random quadruples of (n 1 , n 2 , λ 1 , λ 2 ) with n 2 chosen uniformly at random up to 100,000, n 1 chosen uniformly at random up to n 2 , and the λ's chosen uniformly at random up to their n value.We found in all cases that the difference from 1 of the sum over k of p g k was bounded by 10 −9 .(Keep in mind that we had access only to the logarithms of the C k ; that the actual sum across k had to be approximated term-by-term using Eq. ( 9); that the correct answer in log space is log(1) = 0; and that all operations were performing in floating point, which incurs roundoff error.)Furthermore, in any particular case, the numerical (floating-point roundoff) error will be dominated by the sum over µ in Eq. ( 4), and so we would expect the error to be smaller (ie., sum closer to 1) when there are fewer terms in Eq. ( 4).The number of terms is well-approximated by min(n 1 − λ 1 , n 2 ).Indeed, we find that if the sum was S, then the value |S − 1|/ min(n 1 − λ 1 , n 2 ) has mean ≈ 3 × 10 −14 , standard deviation ≈ 3 × 10 −13 , and was never observed to exceed 3 × 10 −12 .

Validation against random alignments of real PPI networks
We downloaded the 8 largest protein-protein interaction networks from release 3.4.164(August 2018) of BioGRID (cf.Table 1), and the GO database release of the same month.As many authors of network alignment papers do, we then split the GO database into two versions: one with all GO terms, and ones where sequence-based GO terms were disallowed.For each of the 8 2 = 28 pairs of networks and for both versions of the GO database, we generated 400 million random alignments, for a total of 22.4 billion random alignments.For each GO term g, we observed the integer frequency φ g k that g was shared by exactly k proteins when it annotated λ g 1 out of n 1 proteins in network G 1 and λ g 2 proteins out of n 2 in network G 2 .(Note that formally φ g k it has six parameters, φ g k (n 1 , n 2 , λ g 1 , λ g 2 ), though we often abbreviate it to φ g k or even just φ k or φ if context is clear.)It is a non-negative integer bounded by the number of random alignments, N = 4 × 10 8 , and dividing it by N gives an estimate of the probability that a randomly chosen alignment between G 1 and G 2 will contain exactly k aligned protein pairs that share g.
The estimated (ie., observed) probability φ g k /N can be compared to p g k of Eq. ( 6).Across the 22.4 billion random alignments, we observed 428,849 unique combinations of the six parameters g, k, n 1 , n 2 , λ g 1 , λ g 2 that formally define φ g k .Figure 1 is a scatter plot of φ g k /N for all 428,849 of them, versus the theoretical value from Eq. ( 6).The agreement is excellent.(We note that our Figure 1 is exactly analogous to Figure 1 of the paper that introduced BLAST [12], in which the authors compared their statistical model of sequence alignment to computational experiments involving random sequence alignments.) The scatter in Figure 1 increases towards the low end because events with probability near N −1 are rarely observed, and so the estimate of their probability contains significant sampling noise.In fact there is "width" to the scatter plot at all values of probability, but it is difficult to observe in Figure 1.To more clearly see the scatter, we compute the ratio of the observed to theoretical values of probability, which will have an expected value of 1 if Eq. ( 6) is an accurate and unbiased estimator of probability.Figure 2

|.
Each observed frequency φ g k (which we will henceforth abbreviate a φ) is converted to an observed probability φ/N , where N is the number of random alignments (4 × 10 8 ) per pair of networks.However, φ is also the number of samples used to create the observed probability estimate; higher φ gives a better estimate of the probability.We binned φ in powers of 2 (ie. the bin is log 2 (φ) , and for each bin plotted the mean and standard deviation of D. We see that as the number of samples increases, the ratio approaches 1 as the square root of the number of samples, consistent with sampling noise.
|1 − (φ g k /N )/p g k | across all 428,849 observed frequencies, as a function of the number of samples that gave rise to the probability estimate.We can clearly see that the ratio approaches 1 asymptotically with the square root of the number of samples, consistent with sampling noise in φ.

Comparison with a simpler Poisson model
We introduce a Poisson-based model that correctly iterates across GO terms rather than protein pairs, though it simplistic and only provides an approximate p-value.Given that GO term g annotates λ i out of n i proteins in network G i , then a randomly chosen protein u i from network G i has probability λ i /n i of being annotated by g.Thus, when a pair of proteins (u 1 , u 2 ) is independently sampled from all possible pairs of proteins in V 1 ×V 2 , the probability that they both share g is λ1 n1 λ2 n2 ; note that, at this stage, this not an approximation-the probability is exact.Since multiple independent Poisson processes have a cumulative rate which is simply the sum of their individual rates, if we choose m such pairs of nodes, each independently of all the others, then the number of pairs that share g is modeled by a Poisson distribution with mean Λ = m λ1 n1 λ2 n2 , and the probability that k such pairs share g is While this distribution correctly models the case where each protein pair is chosen independently and uniformly at random from all the others, it is only an approximation for the distribution of protein pairs that share g in a 1-to-1 network alignment, because the set of node pairs in an alignment are not independent: each pairs depends implicitly on all the others via the alignment itself, which is built globally and disallows any one node from being used more than once.
In relation to the GO term frequencies, Eq. ( 10) is a good approximation when λ 1 n 1 and λ 2 n 2 , because then the probabilities are small and each pair that shares g has only a small influence on others.However, the approximation gets worse as either of λ 1 or λ 2 increases.To demonstrate this, we took an assortment of good alignments between the 3.4.164BioGRID networks [13] which had some astronomically small p-values.Figure 3 plots the ratio of the Poisson-based p-value of Eq. ( 10) to the exact one of Eq. ( 6), as a function of λ 1 + λ 2 .As we can see, the ratio is rarely less than 1 (ie., the Poisson-based p-value is almost always greater than or equal to the exact one), but can be huge if λ 1 + λ 2 is large-meaning the Poisson model grossly underestimates the statistical significance of matching a large number of pairs that share g.

Re-evaluating a previous comparison
In [2], we introduced SANA-the Simulated Annealing Network Aligner.In that paper, we evaluated SANA's ability to optimize a wide variety of networktopology-based objective functions, with two goals: first, to demonstrate that, given any objective function F used by some other alignment algorithm A, SANA usually provided a more optimal value for F than A did itself; and (b) to compare how well each objective function F was able to produce biological  10) to the exact p-value of Eq. ( 6), across 562,000 random samples from good alignments between networks of BioGRID 3.4.164[13].As we can see, the p-value returned by the Poisson model gets progressively worse (underestimating statistical significance) as the λ values grow.Note that some points had ratios as high as 10 70 , though we truncated the vertical axis at 10 5 .Each point has been perturbed in a random direction by a distance distributed as N (0, 10 −1 ) in log space, otherwise thousands of points land at the same integer co-ordinates, making it impossible to visualize the density of points across the plane.meaningful alignments.Given that the first question was sufficiently demonstrated in [2], here we re-evaluate the second question-which topological objectives are best at recovering biologically meaningful alignments-using the exact p-value method herein, combined to one holistic value using the Empirical Brown's Method [11].(We note that since the SANA paper was published in 2017 based on work done between 2014 and 2016, the BioGRID networks used were version 3.2.101from June 2013.)Tables 2 and 3 show the results.The Resnik column was the quantity computed originally [2], while the p-value and bit-score columns are from the current work.The differences are stark.First, although the Resnik score generally correlates positively with bit-score, it does not effectively demonstrate the sometimes enormous differences in p-value between various alignments.For example, in Table 3 Table 2 Re-evaluating the same alignments that were presented in the first SANA paper [2].Each row represents a single alignment between a pair of BioGRID species (see Table 1 for abbreviations), using SANA to optimize the objective function used by the algorithm in the Objective Function column.(This was done because we had shown, in [2], that SANA optimized the objectives of the other aligners better than those aligners did themselves).GRAAL, and yet the p-values are 10 −292 and 10 −77 , respectively-a difference of over two hundred orders of magnitude in base-10, even though the Resnik scores differ by less than 1.For comparison, completely random alignments have a Resnik score of about 2-3, while perfect alignments score about 12.
Second, HubAlign's Importance objective function is the best-scoring objective function in each and every species pair, usually beating the next best alignment by astronomical amounts in p-value and bit-score.In the case of the two species for which we have the greatest amount of GO-based knowledge (SC-HS, ie., yeast and human), HubAlign scores over 76,000 bits, while the next best objective (from WAVE) scores just over 22,500 bits-ie., HubAlign's Importance-based bit score is more than 3 times larger, a difference of more than 16,000 orders of magnitude (base 10) in p-value, yet the Resnik scores differ by only 1.01.

Limitations: single vs. multiple GO terms; subsets of GO terms
The method described herein provides a p-value for a single GO term g being shared by k protein pairs in a pairwise global network alignment.While this is a useful number, it is not the end of the story.For example, if two very different network alignments both have the same p-value for a particular GO term g, our method can say nothing about with is "better" with respect to g; it would then be the user's task to look more closely to discover which alignment they prefer.
Once we compute a rigorous p-value for each GO term g that appears in both networks, computing a GO-based p-value of the entire alignment requires the multitude of "per-GO-term" p-values into a single, "holistic" GO-based p-value for the entire alignment.Doing so rigorously is a challenging problem in itself, and is well beyond the scope of this paper; to our knowledge nobody has yet worked out how to rigorously account for the issues raised in our bulleted list in §1; see for example surveys [14,15,16].In the meantime, a robust approximation to providing a single "holistic" p-value combining multiple p-values that may have complex relationships is provided by the recent Empirical Brown's Method [11], which uses the covariance matrix among all observed samples to account for inter-relationships.
Our analysis is easily adapted to evaluate network alignments based on any subset of GO terms.For example, one may wish to separately evaluate the three GO hierarchies of Biological Process (BP), molecular Function (MF), and Cellular Component (CC).Similarly, one should evaluate an alignment without the use sequence-based GO terms if sequence played any part in constructing the alignment.

The problem with ignoring GO terms close to the root of the hierarchy
A common practice [10] involves arbitrarily ignoring GO terms in the top few levels of the GO hierarchy on the assumption that, when a GO term annotates so many proteins, a protein pair that matches it has little value.A known problem with this suggestion is the definition of "top few levels": even GO terms at the same level but different regions of the GO hierarchy can have vastly different values of λ, so that it is difficult to choose which GO terms to ignore [10].While there are sometimes valid reasons for ignoring such common GO terms-such as the fact that they may be "catch-all" terms with little meaning or with very low confidence-there may be cases where ignoring them is unjustified.
From the network alignment perspective, ignoring these common GO terms has the opposite problem to that of §4.3 in that, rather than failing to penalize a bad alignment, this procedure fails to adequately reward alignments that are "good" in the following sense.Assume a GO term g annotates 10% of proteins in each network, and that these annotations are not simply low-confidence, "catch-all" GO terms.This can be a substantial number of proteins (eg., over 1700 in human and almost 700 in mouse), and such a GO term is likely to be high in the hierarchy.However, if a network alignment matches a substantially larger fraction of this plethora of pairs than is expected at random, it is a sign that large regions of functional similarity are being correctly aligned to each other, even if individual proteins are not.In other words, perhaps similar pathways are being correctly mapped to each other even if the individual proteins in the pathway are incorrectly mapped.A network alignment that successfully matches such large regions should be rewarded for doing so, but if "common" GO terms are disregarded, this won't happen.

The fallacy of averaging across pairs of aligned nodes
There is a crucially important case that is often implicitly ignored in the literature by methods that evaluate GO-based significance of alignments by evaluating all aligned protein pairs, rather than evaluating all GO terms.This case is alluded to by phrases such as "consider the GO terms shared by a pair of aligned proteins. . .".The problem is when there is a GO term g that exists in both networks (ie., λ g 1 and λ g 2 are both nonzero), but no pair of aligned proteins share it.Then the "consider..." phrase above implicitly misses the fact that g could have been shared by some aligned protein pairs, but was not. 1 Unless taken care of explicitly, the alignment evaluation fails to penalize the alignment for failing to provide any matches for GO term g.In contrast, our method is correctly penalized for such cases: any GO term g for which k = 0 but both λ 1 and λ 2 are nonzero receives the appropriate penalty of a p-value with little statistical significance.Unfortunately, since many existing publications ignore this case, many published p-values claim far more statistical significance than actually exists.
A second major problem with existing ad hoc measures is that they do not scale even remotely monotonically with statistical significance.Take the Jaccard similarity, which is the most popular according to Table 4, though it has variously been called GO Correctness or Consistency (GOC), as well as Functional Correctness/Consistency (FC).Formally, given node u ∈ V 1 aligned to v ∈ V 2 , let S u , S v be the set of GO terms annotating u, v, respectively.Then the Jaccard/GOC/FC between u and v is defined as Given this similarity across all aligned pairs of proteins, the entire alignment is given an FC value equal to the mean across all aligned pairs.It is easy to construct a scenario to demonstrate how badly the Jaccard/GOC/FC measure can lead one astray.Consider the following simple system: network G has n = 1000 nodes.Each node is annotated with one, and only one, GO term.The first 900 nodes are annotated with GO term g 0 -ie., λ g0 = 900.We will refer to these as the "common" nodes.The remaining 100 nodes are each individually annotated with their own unique GO term, with names {g 1 , g 2 , . . ., g 99 , g 100 }; thus, λ gi = 1 for all i = 1, . . ., 100.We will refer to these as the "specific" nodes.For simplicity, we will align G to itself, and assume that all 101 of the GO terms are independent, so that the p-value of the entire alignment is the product of the p-values across 101 GO terms. 2  Then, every pair of aligned nodes constitutes a cluster, and the only possible per-cluster FC scores are 0 and 1, so that the mean alignment-wide FC score is simply the fraction of node pairs that are matched, using the formal definition of "match" from Section §2.3.
In a random alignment of G to itself, each common node has a 90% chance of being aligned with another common node, so that the expected number of matched common nodes is 900 × 0.9 = 810; evaluating Eq. ( 6) we find that p 810 (1000, 1000, 900, 900) = 0.139-not statistically significant, as expected.On the other hand, each specific node has only a 0.1% chance of being aligned with its one and only match, so that in a random alignment we expect none (or very few) of the specific nodes to match.For this example, assume we do a bit better on the common nodes and match 820 of them, but match none of the specific nodes.The p-value of matching 820 common nodes is 0.0007. 2 The assumption of independence is not entirely unfounded; for example we could choose g 0 to be the Cellular Component (CC) GO term GO:0005634, which describes the location "nucleus", and choose the remainder of GO terms to be molecular functions (MF) that tend to occur only outside the nucleus.In fact, in the Sept. 2018 release of the GO term database there are over 700 MF GO terms with the following properties: (a) they annotate exactly one protein (ie., each of over 700 GO terms g has λ g = 1), and (b) for each such GO term, the one protein it annotates is not annotated with GO:0005634.The fact that over 700 such GO terms exist make our independence assumption plausible-at least in this artificial scenario.
The 100 unmatched specific nodes each have p 0 (1000, 1000, 1, 1) = 0.999, and 0.999 100 = 0.90.All told, this alignment has FC score of 0.82-making it look very good-and a p-value of about 0.0006.Now consider a second alignment with the same FC score: we will correctly match just 10 of the specific nodes, and assume the other 90 are aligned with common nodes.This leaves precisely 810 common nodes to match each other, so the FC score is (810+10)/1000 = 0.82, as above.The p-value of 810 matched common nodes is above-0.139.However, each specific node has probability only 10 −3 of matching in a random alignment, so the p-value of matching 10 of them is 10 −30 .
Thus, both alignments have a mean FC of 0.82, yet-to the nearest orderof-magnitude-the first has a p-value of only ≈ 10 −3 , while the second has a p-value of ≈ 10 −31 .From a statistical significance standpoint, the second one is-quite literally-an astronomically better alignment.
The takeaway message is that any method that evaluates functional significance across pairs of aligned proteins, rather than across GO terms, can lead to very misleading conclusions by making random alignments look just as good as excellent ones.

Appendix: brief survey of existing GO-based measures of network alignments
Table 4 presents a list of alignment papers and the measures they use to evaluate functional similarity.Without exception, all of these methods evaluate each pair of aligned nodes individually, and then take some sort of average across pairs.(Some methods are not 1-to-1 and so the "pair" of aligned nodes we discuss must be generalized to a cluster of aligned nodes, but this generalization does not negate our point.)We are aware of no existing methods that consider the alignment from the perspective of one GO term's performance globally across all clusters.Thus, all of these methods suffer the major drawbacks described in Section §4. 3.
Below is a brief description of the methods.-Jaccard / FC / GOC: A common pairwise measure is the Jaccard similarity of Eq. ( 11), often called GOC (for GO "consistency" or "correctness") or FC (functional consistency/correctness).-Common GO terms: In a PGNA, choose an integer threshold h (usually 1-5), and count how many aligned pairs have at least h GO terms in common.No effort is made to account for the annotation frequency (λ in our notation) of any GO term.-Entropy: Given a cluster of proteins S in which d GO terms {g 1 , . . ., g d } appear at least once across all the proteins in S, the entropy is defined as H(S) = − d i=1 p i log p i , where p i is the fraction of all proteins in S that are annotated with GO term g i .Entropy is always non-negative and lower values are better.The normalized entropy is N (S) = H(S)/m, where m is the number of unique GO terms in S. Alignments can then be scored using Mean Entropy (ME) or Mean Normalized Entropy (MNE), which is just the appropriate mean across all clusters S. Despite its apparent sophistication, these methods still take an average across clusters, and thus still suffer from the fallacy described in §4.3.
-Resnik: Based on Resnik's measure of semantic similarity [17,18], it was originally designed only to evaluate the similarity between two GO terms by finding their most informative common ancestor in the GO hierarchy, and using an information-theory argument to compute their common information.Later it was extended to measure similarity between gene products, such as proteins, by taking some sort of mean or maximum between the GO terms of two proteins (see, eg., [19,20,10].4 Sample of published network alignment algorithm names, with their citation, year, and the method(s) they used to evaluate functional similarity.The rows are sorted by publication year; the columns are sorted by popularity of evaluation measure.Header Legend: Jac=Jaccard Similarity (called "GOC" and "FC" by some authors); Com=number of "common" GO terms in the cluster; MNE=Mean Normalized Entropy; Res=Resnik [17,18]; Sch=Schlicker's method [19]; Enr=Enrichment of GO terms in a cluster compared to average cluster; m-sim=similarity using only GO terms with frequency (λ in our notation) less than m.

Algo
-Schlicker's method [19]: an extension of Resnik's measure tailored more closely to genes and gene products.-Enrichment: has been defined in various ways but usually measures whether the shared GO terms between a pair of aligned proteins (or more generally in a cluster) is "enriched" beyond what's expected at random.m-sim: This measure is only used by MUNK [21], which is not technically a network alignment algorithm, though it is designed to find functionally similar genes or proteins between species.It is also the only method from Table 4 that takes into account the frequency of annotation of a GO term g (λ g in our notation), by using only GO terms with λ below some threshold m.

Fig. 1
Fig. 1 Scatter plot of the observed φ k /N vs. theoretical p k probability across 22.4 billion random alignments between pairs of networks from BioGRID 3.4.164.The vertical axis depicts the observed probability of an event, which is the observed frequency φ g k (n 1 , n 2 , λ 1 , λ 2 ) divided by the number of samples N = 4 × 10 8 .The horizontal axis is the value given by Eq. (6) for the parameters of the observation.There are 428,849 observations plotted across all observed values of n 1 , n 2 , λ g 1 , λ g 2 , k.

Fig. 2
Fig. 2 Same data as Figure 1, except that, for each point, we have computed the distance D from 1 of the ratio of observed to predicted probability: D = |1 − φ g k /N p g k

Fig. 3
Fig. 3 Ratio of the Poisson-based p-value of Eq. (10) to the exact p-value of Eq. (6), across 562,000 random samples from good alignments between networks of BioGRID 3.4.164[13].As we can see, the p-value returned by the Poisson model gets progressively worse (underestimating statistical significance) as the λ values grow.Note that some points had ratios as high as 10 70 , though we truncated the vertical axis at 10 5 .Each point has been perturbed in a random direction by a distance distributed as N (0, 10 −1 ) in log space, otherwise thousands of points land at the same integer co-ordinates, making it impossible to visualize the density of points across the plane.

Table 1
The 8 largest networks of BioGRID 3.4.164,sorted by node count.
, observe that for species pair RN-CE (rat-worm), the top-scoring alignment (using HubAlign's Importance objective function) has a Resnik score of just 4.42 compared to the next best score of 3.80 from L-

Table 3
The alignments are then sorted best-to-worst by p-value (lowest first) or, equivalently, bit score (highest first).Between species-pairs, we sort by the best alignment for that species pair.This table shows all species for which the best alignment's bit-score was greater than 2,000 bits.Continuation of Table2, but for species pairs with best bit-scores less than 2,000.