Conserved rules govern genetic interaction degree across species
- First Online:
- Cite this article as:
- Koch, E.N., Costanzo, M., Bellay, J. et al. Genome Biol (2012) 13: R57. doi:10.1186/gb-2012-13-7-r57
- 10k Downloads
Synthetic genetic interactions have recently been mapped on a genome scale in the budding yeast Saccharomyces cerevisiae, providing a functional view of the central processes of eukaryotic life. Currently, comprehensive genetic interaction networks have not been determined for other species, and we therefore sought to model conserved aspects of genetic interaction networks in order to enable the transfer of knowledge between species.
Using a combination of physiological and evolutionary properties of genes, we built models that successfully predicted the genetic interaction degree of S. cerevisiae genes. Importantly, a model trained on S. cerevisiae gene features and degree also accurately predicted interaction degree in the fission yeast Schizosaccharomyces pombe, suggesting that many of the predictive relationships discovered in S. cerevisiae also hold in this evolutionarily distant yeast. In both species, high single mutant fitness defect, protein disorder, pleiotropy, protein-protein interaction network degree, and low expression variation were significantly predictive of genetic interaction degree. A comparison of the predicted genetic interaction degrees of S. pombe genes to the degrees of S. cerevisiae orthologs revealed functional rewiring of specific biological processes that distinguish these two species. Finally, predicted differences in genetic interaction degree were independently supported by differences in co-expression relationships of the two species.
Our findings show that there are common relationships between gene properties and genetic interaction network topology in two evolutionarily distant species. This conservation allows use of the extensively mapped S. cerevisiae genetic interaction network as an orthology-independent reference to guide the study of more complex species.
Gene Expression Omnibus
new end take off
Synthetic Genetic Array.
Most genes are not essential for eukaryotic life under standard laboratory conditions, which may reflect that organisms are highly buffered from genetic and environmental perturbations . However, rare combinations of singly benign genetic variation can lead to synergistic effects, such as synthetic lethality, where mutations in two genes, neither of which is lethal independently, combine to generate an inviable double-mutant phenotype . Because natural variations that distinguish two people occur relatively frequently  and complex genetic interactions may underlie most individual phenotypes , understanding the general principles that govern genetic networks may be critical for solving the genotype-to-phenotype problem and implementing personal medicine .
Recently, we tested approximately 5.4 million Saccharomyces cerevisiae gene pairs for genetic interactions, mapping an extensive network of more than 100,000 interactions by synthetic genetic array (SGA) analysis . The study discovered both negative genetic interactions, instances in which a double mutant exhibits a more extreme phenotype than the expected combined effect of the single mutants, as well as positive genetic interactions, instances in which a double mutant exhibits a less-pronounced phenotype than expected . This study revealed the distribution of genetic interactions with respect to gene function, highlighting a central role for chromatin-related, transcription, and secretory functions. Additionally, it identified several fundamental physiological and evolutionary gene properties that are significantly correlated with genetic interaction degree in the S. cerevisiae genetic interaction network . For example, we showed that the genetic interaction degree of a gene is highly correlated with single mutant fitness, such that genes with a substantial fitness defect show a large number of genetic interactions.
While genetic interactions have been the most extensively studied in the yeast S. cerevisiae, there is intense interest in developing and applying large-scale screening technologies in other species. For example, large studies have already been completed in Escherichia coli [7, 8], Schizosaccharomyces pombe [9, 10], Caenorhabditis elegans [11, 12], Drosophila melanogaster [13, 14], and human cell lines [15, 16, 17]. Although definitive comparative analysis of these networks across species would be premature given the sparsity of known interactions in species other than S. cerevisiae, there have been preliminary comparative studies. In particular, the yeast S. pombe provides an attractive setting for this analysis due to the availability of a genome-wide deletion mutant collection  and scalable technology for automated genetic analysis . Furthermore, S. cerevisiae and S. pombe are estimated to have diverged approximately 500 million years ago and display markedly different physiological properties  but share 75% of their gene content [19, 20]. The two comparative studies to date estimated approximately 30% conservation of individual negative genetic interactions, but also found substantial differences between the two species [21, 22]. These studies demonstrate the power and necessity of comparative analysis of genetic interaction networks, but have conducted only limited sampling of genetic interactions in S. pombe. The properties of these networks that are conserved across species and the rules governing their evolution remain largely open questions, making further characterization of the evolution of genetic interaction networks important.
In this study, we build predictive models capturing the relationship between gene properties and genetic interaction network connectivity, demonstrating that a small set of properties can accurately predict degree in the S. cerevisiae genetic interaction network. We show that models built from the genome-scale S. cerevisiae genetic interaction network also successfully predict genetic interaction degree of S. pombe genes, the vast majority of which have not previously been screened for interactions. Finally, we use our model to predict differences between the S. pombe and the S. cerevisiae networks, some of which may be reflective of functional rewiring and physiological differences between the species, and show that these differences are independently supported by the divergence of co-expression networks based on comparative analysis of gene expression.
Results and discussion
Modeling interaction degree in the S. cerevisiaegenetic interaction network
Pearson correlations between features and negative genetic interaction degree in S. pombe (pom) and S. cerevisiae (cer) are observed to be significant in many cases
SM fitness defect
Codon Adaptation Index
Number of domains
Number of unique domains
To this end, we applied a regression tree approach to model combinations of 16 gene features that are predictive of negative genetic interaction degree (Figure 1b). Regression trees are built by repeatedly splitting sets of training genes, according to the values of gene features, until genes are sorted into small sets that each contain genes with similar genetic interaction degrees. The hierarchy of gene sets produced is visualized as a binary tree and the final sets of genes are each associated with linear regression models that assign predictions to query genes (Figure 1b). Bootstrapped subsets of the training data were used to build an ensemble of regression trees; this use of multiple models, bootstrap aggregation, is a typical method for building a robust predictive model  (Materials and methods).
To validate our approach, we used our model to predict negative genetic interaction degree for all genes in the S. cerevisiae genetic interaction network (Figure 1c; Materials and methods). A high correlation (r = 0.80, P < 10-324) was observed between predicted and actual genetic interaction degrees of genes not used in training the models, indicating that our model accurately reflects topological features of the S. cerevisiae genetic interaction network (Figure 1c). A strength of this type of model, in addition to providing degree predictions for previously unseen genes, is that the learned tree structures highlight rules consisting of combinations of gene features that explain variation in degree (Figure 1b).
Predicting genetic interaction degree in a distantly related species
Despite the low conservation of single mutant fitness and the varying correlations between individual gene properties for orthologs, we found that relationships between S. pombe gene features and genetic interaction degree were strikingly similar to those observed in S. cerevisiae (Figure 2b, Table 1). Consistent with S. cerevisiae trends (Figure 1a, Table 1), fitness defect was the strongest predictor of S. pombe genetic interaction degree. That is, S. pombe strains with severe fitness defects often exhibit high numbers of genetic interactions. The observed trends suggested that in addition to correlation with individual gene features, higher-level combinations of features that are predictive of connectivity in the S. cerevisiae genetic interaction network  (Figure 1a) may also be informative of S. pombe genetic interaction degree.
To test this hypothesis, we built a predictive model relating the combination of available gene features to genetic interaction degree in S. cerevisiae and then applied the resulting model to predict genetic interaction degree in S. pombe (Materials and methods). Interestingly, we observed significant correlation (r = 0.51, P < 10-36) between interaction degree predicted by our model and the number of interactions previously determined  for 548 S. pombe genes (Figure 2c, left side, light blue bar).
Our ability to predict genetic interaction degree from a small set of gene-specific properties is evidence that rules governing genetic interaction network topology are conserved across a large evolutionary distance (Figure 2c). Importantly, there is no significant decrease in correlation between predicted and actual interaction degree when predictions were restricted to genes unique to S. pombe (Figure 2c, left side, dark blue bar), indicating that the model performs equally well when applied to genes lacking orthologs in the species used to learn relationships in the model.
As a baseline comparison for our cross-species predictive model, we built a model from S. pombe gene features and genetic interaction degrees instead of from S. cerevisiae data. Within-species predictions for S. pombe interaction degrees are not significantly more accurate than predictions made by the cross-species model (Figure 2c, left side, compare red and light blue bars). We also note that although a simplistic predictor that maps the degree of a S. cerevisiae gene directly to its S. pombe ortholog provides reasonable performance (Pearson correlation approximately 0.4), this strategy is out-performed by our cross-species model and is limited to conserved genes. Strikingly, the models trained on S. pombe interactions and features were also able to predict interaction degree in the S. cerevisiae network with high accuracy (Figure 2c, right side, compare red and light blue bars). In general, interaction degree predictions for S. pombe genes were weaker than S. cerevisiae interaction degree predictions, which may reflect the limited functional diversity of available S. pombe genetic interaction studies [9, 10]. Nonetheless, the ability to predict interaction degree using features measured in either yeast species is evidence that relationships between genetic interactions and fundamental physiological and evolutionary properties are generally conserved.
The strong correlation between single mutant fitness defect and negative genetic interaction degree has the unsurprising consequence that the models are considerably influenced by this feature. To explore the reliance of our model on fitness defect, we constructed two types of bootstrapped regression tree models that were trained on all features except fitness defect. The first of these additional models is trained to predict negative genetic interaction degrees and is able to successfully make both within- and cross-species predictions (Figure S1 in Additional file 1). The second model was trained to predict the residual negative genetic interaction degree that remained after subtracting degree predictions made from a regression tree model that was trained on the single feature single mutant fitness defect. The prediction of these residuals by the remaining features was also significant (Figure S2 in Additional file 1). We therefore consider the inclusion of many other features to be a worthwhile part of our model, since they capture aspects of genetic interaction degree that fitness defect alone does not.
Validating predictions with S. pombewhole-genome screens
Consistent with our results for a published dataset  (Figure 2c), we observed a significant correlation (r = 0.40, P < 10-111) between the predicted number of interactions and the total number of experimentally derived interactions observed for a given array mutant in this genome-wide deletion set. Grouping genes with the same observed degree, we found that the distributions of our predictions were reflective of actual degrees (Figure 3b). For example, the median degree percentile predicted for genes with a degree of one was approximately 0.72, while the median prediction for genes with zero interactions was approximately 0.42. Importantly, the significance of the correlation was robust to the choice of interaction cutoff and persisted for a higher-confidence, sparser network (Materials and methods).
Identifying network rewiring suggested by cross-species predictions
Although many individual genes are conserved, yeast genetic interaction networks may have undergone substantial rewiring, as only approximately 30% of the interactions are conserved . Similarly, a low conservation of genetic interactions has also been observed between S. cerevisiae and C. elegans . To examine the extent of network rewiring, we first inferred interaction degrees for the entire S. pombe genome using our cross-species model. Because the predictions do not depend on sequence orthologs (Figure 2a, c), they can be used to compare the topologies of the S. cerevisiae and S. pombe networks even though only a small fraction of the S. pombe network has been screened.
However, we also identified many examples of possible rewiring, in which a significant difference in network connectivity, observed in S. cerevisiae and inferred in S. pombe, was found for orthologous modules (Figure 4a; Figure S3 in Additional file 1; Materials and methods). These predicted-rewired groups represent complexes or biological processes that may have evolutionarily diverged in terms of their importance in the genetic interaction network, acting as hubs in one species but not in the other. In particular, we found that 11 of 65 (17%) protein complexes and 44 of 545 (8%) GO biological processes may have undergone significant rewiring (Figure 4a; Figure S3 in Additional file 1) at a level of significance expected to identify only 3 and 27 (5%) rewired modules, respectively. For example, components of the dynactin complex are hub genes in the S. cerevisiae genetic interaction network (complex average of 85th percentile; Figure 4a) whereas the orthologous genes were predicted to exhibit average connectivity in the S. pombe genetic interaction network (complex average of approximately 50th percentile; Figure 4a). Dynactin, a multi-subunit protein complex known for interacting with dynein and enabling long-range movement along microtubules (reviewed in ), has been implicated in a S. cerevisiae cell cycle checkpoint pathway that arrests cell cycle progression in response to perturbations in cell wall synthesis . A similar checkpoint has not been reported in S. pombe, suggesting that the difference in the number of genetic interactions observed across species may reflect a dynactin-specific role in monitoring S. cerevisiae cell wall integrity.
In addition to S. cerevisiae-specific genetic interaction hubs, we also identified gene groups predicted to be hubs in the S. pombe but not observed as such in the S. cerevisiae genetic network. One such case is the calcineurin-associated protein complex (Figure 4a). A difference in network connectivity might reflect a unique role for calcineurin in the regulation of bi-polar growth activation in S. pombe . Unlike an S. cerevisiae cell, which grows predominantly via an actin-dependent budding mechanism, an S. pombe cell grows in a highly polarized bi-polar manner from its two ends. Following cell division, cell growth is initiated from the old end first, and later, after completion of S phase, from the newer end that forms at the site of cell septation (referred to as new end take off, or NETO). Calcineurin has been shown to play an important role in the delay of NETO by directly dephosphorylating critical targets involved in microtubule dynamics at the site of cell growth. This mechanism is dependent on activation of Cds1 kinase, best known for its role in the intra-S phase DNA replication checkpoint . A connection between the intra-S phase checkpoint and inhibition of bipolar growth activation is so far unique to S. pombe and distinct from the checkpoint controls operating in S. cerevisiae. Additionally, calcineurin is dispensable for growth in S. cerevisiae ; in S. pombe, its deletion leads to defects in cell growth, cytokinesis, cell polarity, mating, and spindle pole body positioning, which are widespread effects consistent with its hub-like activity .
While our method of identifying rewired modules reports several statistically significant differences, we note two caveats in interpreting these results. First, since degrees of genes within functional modules may be systematically poorly predicted, our procedure may incorrectly identify modules as significantly rewired in cases where our test statistic would also have indicated that the within-species difference between predicted and observed degree was significant. Therefore, as a control, a version of this rewiring experiment that compares observed and predicted S. cerevisiae degrees will enable identification of cases that do not reflect true cross-species rewiring (Figure S4a, b in Additional file 1). Second, due to variations in the experimental protocol for measuring genetic interactions, there are differences in the media on which fitness defects were measured in S. cerevisiae and S. pombe, which may also contribute to apparent rewiring .
Functional properties of genes can be captured by many types of biological networks, so we turned to an independent dataset for confirmation of our rewiring predictions. To enable a comparative analysis of gene expression profiles across the two yeasts, we constructed a species-specific S. pombe co-expression network using a previously published approach  and large collections of publicly available expression data (Materials and methods), and obtained a previously published S. cerevisiae network . Each species' network contains 832 genes that are one-to-one orthologs between the two yeasts and connected genes are those pairs that have high co-expression values surpassing a threshold of the 95th percentile. At our selected density of 0.05, there are approximately 17,000 edges in each network. In general, we found evidence of conservation between the S. cerevisiae and S. pombe networks: co-expression edges between two genes occurred in both networks for 9.2% of the gene pairs that were co-expressed in at least one network. This is about twice the background conservation rate of approximately 4.3%, as determined through comparison to a randomized network produced by a degree-preserving procedure.
To explore the connection between genes predicted to be rewired in the genetic interaction networks and differences between the co-expression networks, rewiring predictions were overlaid on the co-expression networks. Specifically, all non-essential one-to-one orthologs were classified as either rewired or non-rewired based on our prediction of genetic interaction degree (Figure 4b). Using this rewiring labeling, we measured the conservation rate of three types of co-expression edges: co-expression edges connecting two non-rewired genes, connecting two rewired genes, and connecting rewired and non-rewired genes.
We found that co-expression edges involving predicted rewired genes are consistently less conserved than edges with exclusively non-rewired endpoints (Figure 4c), a trend that is robust over different co-expression thresholds used for network sparsification (Figure S5 in Additional file 1). For example, when genes whose degrees differ by 55 interactions or more are considered rewired, 6.9% of the co-expression relationships connecting rewired genes are conserved (107 of 1,659), in contrast to the significantly higher 10.1% of co-expression relationships that are conserved between non-rewired genes (1,238 of 12,472, Fisher's exact test P < 10-6). This trend grows stronger when considering genes that were predicted to have even larger differences between S. pombe and S. cerevisiae. This analysis independently confirms predictions of highly rewired genes between the two species and suggests that changes at the level of gene expression regulation are at least one mechanistic factor that contributes to these differences.
Although individual interactions and gene-specific properties may not be strongly conserved between species, our findings suggest that these properties influence genetic interaction networks in a similar manner. For example, while the genes important for normal growth may vary, the relationship between a gene's fitness contribution and the genetic interactions it exhibits appears to be conserved. Indeed, models trained on both S.cerevisiae- and S.pombe-derived gene properties were significantly predictive of cross-species genetic interaction degree (Figure 2c), suggesting that the general principles governing genetic interaction network structure are retained through evolution. Thus, a complete genetic interaction network for an organism such as S. cerevisiae should serve as a reference network to guide studies to uncover genetic interactions in more complex systems. Predicting specific pairwise interactions across species is of course the next (more difficult) challenge, but models that can accurately predict the variation in number of interactions across the genome provide a foundation for cross-species interaction analysis. Our results also demonstrate that integrative comparisons leveraging multiple functional genomic datasets across species may be one approach to build confidence in differential network analysis. As more data become available, both the extent and nature of network conservation should reveal how functional conservation and divergence can be recognized and utilized in distantly related species.
Materials and methods
Yeast conservation is a count of how many of 23 different species of Ascomycota fungi possess an ortholog of a given gene. This measure was first described in , and ortholog data were downloaded from . The 23 species are an expanded set of the 17 species described in the study, with the additions of Schizosaccharomyces octosporus, Schizosaccharomyces japonicus, Lodderomyces elongosporus, Candida parapsilosis, Candida tropicalis, and Candida guilliermondii.
Similar, though complementary, to yeast conservation, broad conservation is a count of how many out of a set of 86 non-yeast species possess an ortholog of a given gene. To count this, we obtained orthogroup designations from InParanoid . For each gene, we considered it to have an ortholog in another species only if it appeared in a cluster with the other species and was given a score of 1.0 by the InParanoid clustering method; that is, we considered a yeast gene to have an ortholog in species x if it was a seed gene for a gene cluster that had an orthologous cluster in species x. Although Ostlund et al.  considered 100 species, we disregarded the yeast species, since the yeast conservation measure already captures information from these species.
Codon Adaptation Index
The Codon Adaptation Index, a measure of bias in the usage of synonymous codons, was calculated with the cai tool in the EMBOSS suite . For each gene, the index is based on a comparison between codon frequencies in the gene and frequencies observed in a set of highly expressed genes; for both S. pombe and S. cerevisiae, EMBOSS included a default codon usage table that was used.
Copy number is a count of the number of paralogs a gene has. This was determined from clusters identified by the InParanoid algorithm  run on S. cerevisiae and S. pombe. All genes that appear in the same cluster were considered copies.
The protein disorder measure is the percent of unstructured residues in a gene's protein product as predicted by the Disopred2 software .
dN/dS is the ratio between nonsynonymous and synonymous mutations in coding regions of genes. For S. pombe genes, dN/dS was calculated twice, using S. japonicus, S. octosporus as out-group species, and averaged to produce a final dN/dS estimate. Orthologous protein sequences were globally aligned with EMBOSS  using default parameters. For each S. pombe gene, only the out-group ortholog that produced the highest alignment score was used for dN/dS calculations; dN/dS ratios were calculated with the PAML package's implementation of the Yang and Nielsen method for estimating substitution rates [42, 43].
Similarly, we computed the average dN/dS ratio for S. cerevisiae in comparison to the sensu strictu yeast species (Saccharomyces paradoxus, Saccharomyces bayanus and Saccharomyces mikatae). Protein sequences were aligned using MUSCLE  and dN/dS ratios were computed using PAML .
Number of domains
The number of domains for a gene is the number of regions that Pfam has identified as domains in the protein sequence of the gene. Domain matches for each protein were obtained online from the Pfam database .
Number of unique domains
Since the same domain is often repeated multiple times in a single protein, this feature modifies number of domains by counting the number of unique domains present in each protein.
This measure is a simple statistic of codon usage bias and expresses the effective number of codons used in a gene. The chips tool of EMBOSS  was used to calculate this feature.
Protein length is simply the number of amino acids in the corresponding protein.
This measure is derived from the co-expression network, the construction of which is described in its own section. The network contains a level of co-expression for all pairs of genes. We therefore sparsified the network by considering only edges between gene pairs whose co-expression levels were above the 95th percentile. The co-expression degree of a gene is the number of genes with which its co-expression value is retained in this restricted network.
We estimated the amount of variability in a gene's expression level by measuring the variance of its expression across a number of different microarray experiments, which included microarray data from different growth conditions and replicates. Within each study, we found each gene's percentile of variation. The final value assigned to each gene is its average percentile across all studies. These datasets were obtained from a number of different studies that deposited data in the Gene Expression Omnibus (GEO) . S. pombe data used in this analysis are the same as those used in construction of the S. pombe co-expression network.
S. pombe fitness defect measurements were obtained by conducting a series of control SGA experiments as described elsewhere [2, 9, 33]. Briefly, a S. pombe SGA query strain harboring a dominant drug-resistance marker (natMX4) inserted at a neutral genomic locus (h- leu1Δ::natMX4 ade6-M210 ura4-Δ18 leu1-32) was crossed against the S. pombe non-essential deletion mutant collection (h+ geneXΔ::kanMX4 ade6-M210 ura4-Δ18 leu1-32). Following mating and sporulation, haploid meiotic progeny harboring both the kanMX4 and natMX4 markers are selected and colony sizes are measured after applying standard normalization procedures. We have previously shown that colony sizes derived from these control screens reflect fitness defect of the kanMX4-marked single mutant strains that comprise the deletion mutant array. Fitness estimates were based on four control screens as described above and combined with five mutant screens (prz1, res2, SPAC1687.22c, SPCC1682.08, and SPAC6G9.14), which contained the dominant drug-resistance marker (natMX4) .
S. cerevisiae fitness defect values, defined quantitatively in , were published in  and experimental procedures are detailed in . As in the S. pombe protocol described above, SGA was used to insert a neutral query marker into mutant strains so that we could observe colony growth for each mutant in the deletion collection under the effects of only the single deletion. Fitness estimates are based on a large number of replicate screens.
Protein-protein interaction degree
The protein-protein interaction degree of each gene's protein is the number of physical interactions reported in BioGRID, version 2.0.58 . Interactions considered physical were restricted to those identified by the following terms: Affinity Capture-MS, Affinity Capture-RNA, Affinity Capture-Western, Biochemical Activity, Co-crystal Structure, Co-fractionation, Co-localization, Co-purification, Far Western, FRET, PCA, Protein-peptide, Protein-RNA, Reconstituted Complex, and Two-hybrid.
Multifunctionality is a measure of the number of GO terms that are annotated to a gene . From GeneDB  and Saccharomyces Genome Database  gene association files (download in November 2009) for S. pombe and S. cerevisiae, respectively, redundant terms - one term from pairs of terms that are considered 'alternative ids' - were removed before totaling the number of GO term annotations for each gene.
Genetic interaction degrees
Negative genetic interaction degrees of S. pombe genes were derived from interactions reported in  (Additional file 2). Only those interactions with S-scores ≤ -2.5 were considered. This dataset contains 551 genes that are involved in chromosome function; intentionally included are approximately 100 genes that participate in processes present in both S. pombe and human, but importantly, are not present in S. cerevisiae (for example, RNA interference machinery).
Negative genetic interaction degrees of S. cerevisiae genes (Additional file 3) were collected from the measurements reported in , which screened for interactions involving 3,456 array genes, 1,438 of which have S. pombe orthologs. As suggested by the authors, only negative interactions with an epsilon value of ≤ -0.08 and a P-value cutoff < 0.05 were considered. This dataset includes degree measurements for most non-essential genes.
Orthology mappings (Additional files 4 and 5) are from the InParanoid eukaryotic ortholog database . Although the InParanoid algorithm produces clusters, our analysis depends on ortholog pairs. To calculate correlations between S. cerevisiae and S. pombe for each of the gene features (Figure 2a), only genes in one-to-one orthology mappings were used. When holding out orthologs for degree prediction in a set of 'species-specific' genes (Figure 2c), all genes that had any number of orthologs were removed. Since InParanoid may not report orthologs that other algorithms have detected, we took a conservative approach by additionally removing any genes that had an ortholog in the pombe database GeneDB , which includes manually curated orthologs.
Models and evaluation
Our models are bagged regression trees that use the 16 features described above (Additional file 6). Breiman  suggests that using an ensemble of only 25 classifiers can result in nearly all improvement gains that bagging can produce over a single classifier; however, we used 100 trees because the computation required in training is relatively low and we were interested in analyzing the tree structures. Individual trees were trained by MATLAB's classregtree function, which minimizes node impurity according to mean squared error. For each tree, a bootstrap sample was used to select, with replacement, a set of training genes the same size as the set of total genes (therefore each tree is trained on approximately 63.2% of all genes) and held out genes. The final prediction for a single gene of the species used to train the model (that is, the within-species prediction) is the median of all predictions from trees for which the gene was not in the training set (Additional file 7). The final prediction for a gene of the species not used to train the model (that is, the cross-species prediction) is the median of predictions from all trees (Additional file 8).
To assess the performance of the model, we calculated the Pearson correlation coefficient between predicted and actual degrees of genes with known degrees. To estimate stability of performance, we repeated the model construction and evaluation 25 times and reported predictive ability as the mean Pearson correlation coefficient and its standard deviation across all 25 repetitions for within- and cross-species cases (Figure 2c).
S. pombegenetic interaction screens
Eight whole-genome S. pombe genetic interaction screens were completed using the method described in . The query strains were deletion mutants for each of the following genes: SPCC1682.08c, SPBC21D10.12, SPBC13E7.09, SPAC4G8.13c, SPAC3A11.13, SPAC27D7.13c, SPAC22F3.09c, SPAC16A10.07c. The resulting double mutant colonies were processed as described in . Negative interactions were derived from the scores by applying an interaction cutoff of ≤ -0.08 and a P-value cutoff of < 0.05. Degree measurements were then derived for all non-essential genes by counting the number of significant interactions across the set of eight queries (Additional file 9). Significant correlation with the predicted degrees was also observed when a stricter cutoff was applied (interaction score ≤ -0.12, P-value < 0.05 yielded a correlation r = 0.41, P-value < 10-117).
Rewiring groups and significance assessment
To make comparisons between degrees of orthologs in the genetic interaction networks of the two yeast species, we considered genetic interaction degree to be predicted percentile for all S. pombe genes, while percentiles of actual degrees were used for S. cerevisiae.
To search for groups of functionally related genes that have been rewired since the divergence of S. pombe and S. cerevisiae, we defined gene groups in two ways. The first simply grouped genes whose protein products form a complex in a set of complexes defined in  (Additional file 10). The number of proteins per complex ranges from 2 to 81, with the vast majority having 6 or fewer proteins.
The second method for making sets of functionally related genes grouped genes that share a biological process GO term annotation  (Additional file 11). We considered GO terms that are annotated to greater than 3 and fewer than 50 genes in either of the two species. Additionally, a group of S. cerevisiae genes was required to have a minimum number of two genes with known genetic interaction degrees; a group of S. pombe genes was required to have a minimum of two genes with known fitness defect. Since GO terms tend to be highly redundant, we filtered gene groups so that no pair of groups overlapped by more than 50% of either group's genes.
To determine orthologous pairs of groups that have significantly different average degrees, we calculated the difference between the median degrees of genes in each species' group, and then compared the differences to a distribution of differences produced from randomly grouped genes. We generated this background by creating groups of randomly selected genes in one species, then identifying orthologous groups in the other species composed of the selected genes' orthologs. A query gene-group pair was compared to a background containing only random gene-group pairs whose group sizes were identical to the query groups. For example, a protein complex of five individual S. cerevisiae proteins may contain four genes that have S. pombe orthologs; this query gene-group pair would be compared with a background of groups with five random S. cerevisiae genes matched with a group of four of their S. pombe orthologs.
Comparative analysis of co-expression networks
To independently validate genetic interaction degree differences across species, we performed a comparative analysis of co-expression networks of S. cerevisiae and S. pombe genes. The S. cerevisiae network was previously published  and is based on integration of a large collection of expression datasets. To construct the S. pombe network (Additional file 12), data from nine expression studies were collected from the GEO database  (Additional file 13). Genes with missing values for more than 30% of the samples were removed, and the remaining missing values in each dataset were imputed using KNNImpute . Datasets reflecting probe intensities (rather than relative ratios) were log-transformed. After processing, the nine S. pombe expression datasets were integrated as described in [34, 56]. The naive Bayes approach for dataset integration requires a gold standard set of positives, for which we used direct gene co-annotation to any term in the GO that contained between 2 and 100 genes. S. pombe gene annotations were downloaded from the GO website [26, 57] in May 2011. All analysis and integration of expression data were completed using the Sleipnir library .
We applied a 95th percentile cutoff to edges in both the S. cerevisiae and S. pombe co-expression networks, such that only the highest scoring 5% of edges were retained.
To estimate the overlap between the S. cerevisiae and S. pombe networks in the absence of biological conservation, we randomized the edges of the S. cerevisiae network and considered the background conservation to be the overlap between this randomized network and the S. pombe network. The randomizing procedure repeatedly chose two random edges and exchanged an endpoint of one edge with an endpoint of the other edge, thus maintaining the degrees of genes in the network. The number of endpoint swaps performed was 20 times the number of edges in the network, which is a sufficient number of swaps to remove the original relationships between genes.
EK is an NSF Graduate Research Fellow. EK, JB, RD and CLM are partially supported by a grant from the Minnesota Partnership for Biotechnology and Medical Genomics program, a grant from the National Institutes of Health (1R01HG005084-01A1), and the National Science Foundation (DBI0953881). EK, MC, JB, RD, CLM, BJA, and CB are also supported by a grant from the National Institutes of Health (1R01HG005853-01). MC, BJA, and CB are supported by the Canadian Institutes of Health Research (MOP-57830) and the Ontario Research Fund (GL2-01-22). GC was supported by funds from a CIHR-Operating grant. KC-R was supported by a Queen Elizabeth II Graduate Scholarship. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. The authors thank Jeff Piotrowski for providing S. pombe query strains, Karen Dowell and Matt Hibbs for providing co-expression network software, Valerie Wood for providing orthology maps, and Tahin Syed for help with image processing.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.