Background

The non-random arrangement of genes in the genome is intimately connected to the pattern of gene expression across the genome [1]. While the connection between gene location and expression has been known for some time in prokaryotes [2], a similar genome-wide connection between gene order and gene expression has relatively recently been identified in eukaryotes [3]. Clusters of physically adjacent genes that are co-expressed are now known to be common in eukaryotic genomes and have been reported in yeast [4, 5], plants [6], worms [710], fruit flies [1113], mice [1416], and humans [1620].

A number of mechanisms have been proposed to explain the existence of these co-expression clusters including the presence of duplicate genes that are in close physical proximity, shared regulatory regions, chromatin-level regulation, and common pathway or tissue regulated expression of physically clustered genes [1, 8, 21]. Similarly, a number of hypotheses have been proposed concerning the interplay between transcriptome evolution and genome organization that can explain the existence of co-expression clusters including positive selection for genomic rearrangements leading to close physical proximity of co-expressed genes [22] and purifying selection against genomic rearrangments that break-up co-expression clusters [5, 18]. The possibility that co-expression results in correlated rates of sequence evolution among cluster genes has also been proposed [23] and a recent analysis has found evidence of co-evolution of tissue specific expression of adjacent genes [24]. Although not fully resolved, these analyses have led to a clearer picture of both the pattern of co-expression clusters within species and explanations concerning why gene expression is often coordinated among physically adjacent genes.

If there is a physical clustering of coordinated gene expression within species, then it is likely that gene expression can also evolve in a coordinated manner [24]. A block of consecutive genes where expression evolves in a coordinated manner will leave an evolutionary signature that can be detected by non-zero expression correlation among neighboring genes when analyzing multiple species. Therefore, just as correlated expression profiles are used to identify co-expression among genes within species [4, 12] the same approaches can be used to analyze co-evolution of expression in gene clusters when comparing gene expression among species. Instead of analyzing the correlations in gene expression among developmental stages, environments, tissue localizations, etc. [1], we consider the correlations among mean gene expression levels estimated for each species. This approach will identify clusters of genes where the evolution of expression is not a gene independent process.

Our goal is to identify clusters of genes that show consistent patterns of coordinated expression evolution among species of Drosophila. We assayed genome-wide gene expression levels using Affymetrix GeneChip Arrays in three-day-old male adults under standardized environmental conditions [25]. The following seven species in the D. melanogaster subgroup were analyzed: D. melanogaster, D. simulans, D. sechellia, D. mauritiana, D. santomea, D. teissieri, and D. yakuba. The relationships among these species are well established (Fig. 1) and represents a taxonomic sampling that spans ~6 million years [26]. Clusters identified for these species are therefore of value for understanding how the transcriptomes of species evolve across time scales on the order of one-hundred thousand to several million years.

Figure 1
figure 1

Relationships among the seven species in the D. melanogaster subgroup that were analyzed. Times of speciation events follow estimates from [26].

A number of methods have been applied to the identification of co-expression clusters within species using microarray expression data [4, 12, 23, 2729]. The most common of these is calculation of a statistic based on the estimated correlation matrix for blocks of consecutive genes, generally the mean of N*(N-1)/2 correlations when considering N consecutive genes [12]. For the current study, we use a different statistic to summarize the correlations among N genes: the fraction of variation explained by the leading eigenvalue of the correlation matrix. This statistic describes the maximum fraction of variation that can be explained by a linear function of the original variables after scaling the variance of each variable to one. This statistic therefore provides an intuitive description of the degree to which a set of genes act as a single unit because the closer this ratio is to "1" the greater the degree that expression of the genes are completely correlated, regardless of whether the correlations among any gene pair are positive or negative. While this statistic has not been explicitly applied to the analysis of co-expression, the leading eigenvalue(s) of a correlation or covariance matrix are commonly used to summarize the structure of correlated variation [30, 31].

Our analysis shows that a considerable proportion of transcriptome evolution among species in the D. melanogaster subgroup occurs via co-evolution of expression in clustered genes. Comparison of the locations of clusters that reflect coordinated evolution of gene expression across taxa to clusters of coordinated expression within the species D. simulans demonstrated a lack of correspondence in locations. This implies that different mechanisms may be responsible for producing co-expression clusters within species and those producing co-expression clusters that evolve in a coordinated manner. We additionally analyze a number of genome organization, functional, and evolutionary aspects to identify over-(under-) representation with clusters displaying coordinated expression within D. simulans or coordinated expression evolution among the seven species. Of these, the most interesting are genes that show a signature of adaptive evolution in their coding sequences. A previous analysis of tissue co-expression within mice and humans did not find a significant positive correlation between rates of non-synonymous substitutions (K A ) and co-expressed genes [23], although the ~75 million years since humans and mice diverged likely limited the power of this analysis [32]. Here, we find that co-expression clusters that vary within D. simulans are not enriched for adaptive evolving loci. However, genes with an adaptive evolutionary signature are over-represented in clusters where expression is co-evolving among species. This result points to a connection between coordinated gene expression evolution and adaptive evolution in coding regions of genes, although the exact nature of this connection is still unknown.

Results and Discussion

Clusters of genes evince coordinated evolution of expression across the seven species (Figure 2). On all chromosomes at all scales, there were far more windows with significant coordinated expression evolution than expected at random (Table 1). Many of these significant windows were identified across multiple window sizes and likely reflect a single larger block of genes where expression evolution is coordinated [12] (we present combined significant windows at different cutoffs in Additional files 1, 2, 3, 4, 5, 6, 7, 8, 9, 10). These windows were also robust to the removal of individual species and therefore were not being driven by evolution in a specific lineage (results presented in Additional files 1, 2, 3, 4, 5, 6, 7, 8, 9, 10). In addition, there was no detectable difference between the absolute level of transcript abundance as measured by the arrays for neighboring genes where expression displays co-evolution compared to other groups of neighboring genes (p-values > 0.05 for all window sizes). Because we used the D. melanogaster as the reference genome for ordering genes along chromosomes there is the possibility that genome re-arrangements would lead to some of these significant clusters to include non-physically adjacent genes in some of the species. Our results are, however, robust to this issue as we found that breakpoints between D. simulans-D. melanogaster or D. yakuba-D. melanogaster interrupted clusters where expression is co-evolving no more frequently than expected by chance, which is consistent with a previous analysis of co-expression within species [33].

Table 1 Number of significant windows where expression is co-evolving.
Figure 2
figure 2

Sliding window heat map of p-values resulting from the clustering analysis across species projected onto the D. melanogaster genome. Approximate position on chromosomes is plotted along the x-axis and window size on the y-axis. Centromere proximal regions are indicated by yellow shading on the chromosome. The spectrum runs from highly significant p-values (red) to highly non-significant p-values (dark blue).

The use of a sliding window approach to identify co-evolution of expression in gene clusters means that tests at a given scale will be correlated with neighboring windows and these tests will also be correlated across window sizes. There is not a clear optimal approach for dealing with the multiple testing problem and the properties of strategies such as estimation of False Discovery Rates (FDRs) [34, 35] are not clear in such cases. To assess whether there was a clear genome-wide tendency for coordinated expression evolution, we therefore used a permutation approach using total number of significant tests at a given window size (2, 5, 10, and 20) as a test statistic to assess the null hypothesis that there are no more gene clusters where expression is co-evolving than we would expect at random [12]. With only seven species (samples) these tests are not expected to be particularly powerful. However, the test was still rejected at a window size of 2 (p-value < 0.02) and at a window size of 10 (p-value < 0.04) (although not for window sizes of 5 and 20) indicating that at least on genomic scales spanning 2 to 10 genes, there is genome level co-evolution of gene expression in neighboring genes.

Similar results were obtained for the analysis of within species co-expression for D. simulans (Table 2, Figure 3). Many windows on all chromosomes at all scales were significant. Interestingly, the test of a genome-wide pattern produced significant results for window sizes of 2 (p-value < 0.04) and 10 (p-value < 0.01). While this could be interpreted as an artifact of microaray design [36, 37], there is no regular spacing to the distribution of significant windows [28]. Interestingly, there was little overlap between the significant windows identified as evolving across species and being co-expressed within D. simulans (Table 3). The number of overlapping windows obtained when comparing repeated analysis of mean expression levels for all species and the number of overlapping windows for repeated analysis of the D. simulans data (i.e. non-overlap due to permutation effects) are presented for comparison. Given that many of the co-evolving expression clusters and the co-expression clusters identified within D. simulans may reflect false positives, a small fraction of overlap between these cluster types might be expected. However, even at a conservative cutoff (p-value < 0.001) the absolute number of overlapping clusters is still very low (Table 3) indicating that there is little correspondence. It therefore appears that completely different sets of genes are involved in the pattern of co-expression within species compared to those where expression evolves in a coordinated manner across species.

Table 2 Number of significant co-expression windows in D. simulans.
Table 3 Number of overlapping significant windows between the analysis of all species and within D. simulans.
Figure 3
figure 3

Heat map of p-values resulting from the sliding window analysis within D. simulans projected onto the D. melanogaster genome. Color coding follows Figure 2.

While the mechanisms underlying the existence of clusters of co-evolution of expression among species and co-expression clusters within species cannot be resolved from these data, the existence of paralogous genes in close proximity can be ruled out as the major factor for the observed pattern. Paralogous genes in close proximity may be expected to produce the evolving clusters or within species co-expression clusters as a result of shared regulatory elements and/or maintained common functions. However, paralogs may also produce the pattern by cross-hybridizing to common probes on the microarray. If the second of these possibilities can explain a considerable proportion of clusters, this could mean the observed pattern was an artifact of the microarray assay. However, we find that very few paralogous gene sets are within either evolving clusters or co-expression clusters (between 1.4%–12.0% of evolving clusters identified using window sizes 2–20 and a p-value cutoff of 0.001 contain paralogous genes and 2.1%–8.5% of co-expression clusters within D. simulans contain paralogous genes; see Additional file 9). The majority of evolving clusters and co-expression clusters cannot therefore be explained by paralogous genes.

Spatial, Adaptive, and Functional Distribution of Co-expression Clusters

To determine if co-expression or co-evolution of expression is related to the physical organization of genes within clusters, we investigated the spatial distribution of the clusters across the genome of D. melanogaster. While order of genes in clusters where expression is co-evolving are conserved across species in our analysis (see above), the physical location of these clusters relative to D. melanogaster need not be. Fortunately, the local spatial organization of genes is highly conserved across these species, which allows us to again use the heavily annotated D. melanogaster genome as a reference for our spatial analysis.

There are fewer co-evolving expression clusters in centromere proximal regions (heterochromatic centrometric regions were not included in our analysis). This was also observed for the within species co-expression clusters. Two hypotheses may explain this pattern. First, gene density tends to decline in regions proximal to the centromere [38], which may reduce the total number of gene clusters observed in these relatively gene depauperate regions. Second, centromere proximal regions have higher amounts of heterochromatin, which can dramatically affect gene expression [39, 40]. Most euchromatic gene expression is suppressed in heterochromatin and natively heterochromatic genes are typically only expressed when surrounded by heterochromatin. Therefore, local shifts over evolutionary time in the heterochromatin content – which are common in centromere proximal regions – may inhibit the formation of clusters near centromeres [39].

Most genes in the genome are physically grouped together on the chromosome as determined by the coefficient of deviation (Table 4). In contrast, clusters of genes where there is co-expression or where there is co-evolution of expression tend to be more dispersed, which suggests that co-expression is not simply a function of gene density. Nor is it the result of local recombination rate; there is no relationship between the rate of recombination in D. melanogaster and the density of clusters (analysis not shown). This conflicts with the hypothesis that lower recombination tends to evolve among co-expressed genes [24, 41]. This result may however be confounded by variation in recombination rate across the species analyzed.

Table 4 Comparisons of the spatial distributions of co-expression clusters.

In contrast to these large scale patterns, genes within evolving clusters do not show unusual local structural organization relative to the rest of the genome. For example, the genes in clusters where there is co-evolution of expression are not physically closer to each other relative to the rest of the genome (whole genome median spacing: 827 bases, whole genome mean spacing 4853 bases; cluster gene median spacing: 3573 bases; cluster gene mean spacing: 8503 bases). Likewise, there is no strand bias – genes are equally likely to be on either strand of DNA (+ strand: 118; - strand: 114). When considering pairs of genes – "window 2" clusters – there was no significant difference in the strand orientations of these pairs. Pairs of genes were essentially equally likely to both be on the same strand or on opposite strands in either the + - or -+ orientation (χ2 = 3.468, d.f. = 3, p-value = 0.3249). Nor were there significant runs of genes on the same strand among genes within the largest clusters (p-value = 0.45). There was similarly no unusual structural organization for clusters of co-expression within D. simulans compared to the rest of the genome.

However, as noted above, clusters where there is coordinated evolution of expression among species seldom correspond to clusters where there is co-expression within species. This fact suggests that the evolutionary and genetic forces affecting coordinated expression within and between species are distinct. If the evolution of co-expression clusters reflected a neutral process, we would expect the patterns of co-expression within species to be reflective of the patterns of co-expression between species. Instead, we see very different patterns within and between species. Between species, our analysis identifies groups of genes whose expression is correlated across evolutionary time. Natural selection, directional or purifying, could drive or preserve patterns of co-expression among genes. Directional selection, however, is the more likely explanation for the diversification in expression we observe across species. We tested this idea using data from Begun et al. 2007 [42]. Using polymorphism data in D. simulan s in conjunction with divergence data from D. melanogaster and D. yakuba, Begun et al. 2007 identified genes evincing recurrent directional selection using a McDonald-Krietman test [43]. These genes had normal levels of within species polymorphism, but high levels of between species divergence. We compared the frequency of genes with significant McDonald-Krietman tests (MKtest) within the clusters where there is co-evolution of expression to the whole genome empirical distribution (nominal threshold of p-value < 0.05). Genes within clusters have 27% more adaptively evolving genes than the genome average using a polarized MKtest, which does not confound evolution on two branches (p-value < 0.001; unpolarized test difference is only 4%). This result suggests that recurrent directional selection may be the evolutionary force shaping the evolution of co-evolving expression of neighboring genes. Recent work looking at a subset of the species we analyzed also suggests a tie between adaptive evolution of coding sequences and changes in gene expression [44]. The phenomena observed here may reflect that larger evolutionary process.

Our within species analysis of D. simulans identified clusters where expression is coordinated and we applied the MKtest analysis to these blocks of genes. In contrast to the among species analysis, the within D. simulans clusters lack genes evidencing recurrent adaptive amino acid evolution (nominal MKtest p-value < 0.05; polarized, 15% fewer adaptively evolving genes, p-value < 0.001; unpolarized 30% fewer adaptively evolving genes, p-value < 0.001). This paucity of significant MKtests may indicate a role for balancing selection in maintaining some, but not all, of these polymorphic within-species clusters. Regardless, distinctly different evolutionary forces appear to be operating to produce co-evolution of expression in clusters compared to co-expression clusters within species.

The seven species that we analyze are morphologically almost indistinguishable except for differences in male genitalia and in the case of D. santomea which has distinct pigmentation [26]. The co-evolving expression clusters are therefore not likely to be related to tissue specific expression of clustered genes underlying morphological divergence. We do, however, find there are more statistically over-represented GO categories involved in reproduction in clusters where there is co-evolution of expression and more genes involved in immune response for clusters evincing co-expression within species clusters (Table 5 and Additional file 9). Recurrent selection has repeatedly been shown to drive evolution of reproduction related genes, especially in males [4547]. Thus it makes sense that our co-evolving expression clusters are enriched for both adaptively evolving and reproduction related genes. Similarly, a subset of immune response genes have high levels of nucleotide polymorphism within species [4850], which is also consistent with our MKtest analysis of within species clusters.

Table 5 Over-representation of gene classes in clusters.

Conclusion

To understand how complex phenotypes evolve, we need to understand how interacting genes evolve. We have analyzed expression microarray data from species in the Drosophila melanogaster subgroup, including data from ten distinct D. simulans lines and identified blocks of genes where there is coordinated evolution of expression among species. We have also identified clusters where there is co-expression within the species D. simulans. Our work shows that coordinated evolution of expression among physically adjacent genes among species and co-expression among adjacent genes within species is common. In general, there is little correspondence between clusters where there is co-evolution of expression and those clusters where there is co-expression within species for the species analyzed. The evolutionary forces shaping these two types of clusters are clearly different. Our analysis of co-expression within a species showed no relationship between adaptive evolution at the sequence level and expression of genes within clusters. In contrast, adaptive sequence evolution is associated with those genes in clusters where expression is evolving in a coordinated manner. In sum, we find that just as expression of genes is not independent of physical location of genes within a species, the evolution of expression of many genes is not independent of the evolution of their neighbors. We also show that this result is not simply the byproduct of clusters of duplicated genes. Our analysis suggests that clusters of genes where there is co-evolution of expression may be natural units for quantifying and understanding the evolution of transcriptomes.

Methods

Data Collection

For all species, flies were raised under the standardized conditions as described in Nuzhdin et al. 2004 [25]. Under these conditions, a component of the variation in gene expression will reflect evolved differences among species. For males three days post-pupation, RNA was extracted from whole adult tissue using 20 individual flies per replicate. Assays of transcript abundance were carried out using Drosophila 1.0 Affymetrix GeneChip Arrays using the recommended protocols (Affymetrix Inc, Santa Clara, CA). Assays were carried out on three independent (biological) replicates per species for a single line of the following five additional species: D. yakuba (Tuscon Stock Center Number: 14021-0261.00), D. santomea (TSCN: 14021-0271.00), D. teissieri (TSCN: 14021-0257.00), D. mauritiana (David 105, TSCN: 14021-0241.01), D. sechellia (Roberstson, TSCN: 14021-0248.21). These data were combined with data previously collected for D. melanogaster and D. simulans under the same conditions that used the same Affymetrix GeneChip Array [25]. Analyses of the entire data set therefore included a total of 48 independent transcript assays covering seven Drosophila species in the D. melanogaster subgroup (Figure 1). Array data have been deposited in GEO repository (Series record GSE7873). Processed data after background correction and normalization without masking are available in the Additional files (Additional file 2).

Note that the samples assayed for D. melanogaster reflect an even genotypic contribution of 10 isogenic lines developed from a wild population (Winters, CA) and crossed in a round-robin design. Variation in transcript abundance in D. melanogaster therefore includes variation due to polymorphism and measurement error. For D. simulans, three replicate arrays were used to assay each of 10 round-robin crosses between 10 isogenic lines developed from the same population. Each group of three replicates for D. simulans can therefore be used as an estimate of the expression levels associated with individual (heterozygous) genotypes and the variation assayed can be attributed to measurement error. Variation in transcript abundance among these D. simulans genotypes provides an estimate of the within species genetic variation in gene expression. For all other species, variation among replicate arrays includes only measurement error. Analysis of the means of transcript abundance among these seven different species therefore provides an estimate of the variation in gene expression that has evolved among species.

Microarray Probe Masking and Normalization

The Drosophila 1.0 Affymetrix GeneChip is designed with probe sets for 14,010 locations of the D. melanogater genome with sets usually containing 14 pairs of 25-nucleotide probes. For each pair, one of the pair is a 'perfect' match (PM) to the D. melanogaster reference sequence and the other differs only at the 13th base and is the mismatch probe (MM), the latter used to account for non-specific hybridization. We expect many of the PM probes to not be perfect matches for other species. To account for these effects, we removed all probe pairs where the PM probe is not an exact sequence match to D. yakuba. D. melanogaster and D. yakuba shared a common ancestor ~6 million years ago [26] and reflect the broadest taxonomic sampling of the species analyzed. Probe homology between them makes the probes likely to be perfect matches for all species analyzed. Indeed comparison of D. yakuba to D. simulans shows that this is true for 87.4% of genes included in our analysis. For maximum power, we would mask only those probes that were divergent in a particular species, however, for several practical reasons we did not do this: 1. we found that including the D. simulans mismatches had little effect, once the D. yakuba was masked, 2. adding D. sechellia has little effect vs D. simulans, 3. as genome sequences are not available for all species in our study, we cannot make masks for all taxa. From an analysis point of view, using multiple masks is problematic as the power and precision of expression estimation would then vary across taxa. We therefore opted for a conservative approach of masking all taxa based on the most divergent genome, which should catch the most egregious cases. This approach is conservative as unmasked divergent probes typically, but not always, lead to over-estimates of expression (see below). As explained below, our analysis looks for genes showing consistent patterns of expression across taxa among expressed genes (e.g. those that appear to be above background). Including the occasional mismatch likely reduces our power to detect such a pattern, not create it. Thus, positive results in our analysis are probably robust to the spurious noise caused by mismatched probes.

To identify probes for removal (masking) we used BLAST to compare the Affymetrix target sequences from this array to identify homologous regions in release 2.0 of the D. yakuba genome. We conducted this analysis on each chromosome arm to minimize spurious matches. Contigs with the best matches to the target sequences were then compared to the Affymetrix probe sequences. Full matches were then extracted and the percent match calculated and the locations and types of mismatches noted. After probe removal, a total of 7024 probe sets were still represented on the array, with the following breakdown for number of probes removed (0–3) 85, (4–6) 396, (7–9) 1701, (10–12) 4842. Note that if (13) or (14) probes were removed, the probe set was not included in the analysis.

A number of previous analyses have found that highly expressed genes tend to evolve at a slower rate [5153]. This will result in two effects. First, we expect genes with low expression to have more probes removed and on average will therefore have higher standard errors. This will tend to reduce our chances of identifying co-expression clusters for these genes. Second, a larger proportion of genes with low expression will be removed from the analysis completely since none of the probes will be homologous between D. melanogaster and D. yakuba. If higher expressed gene are comparatively enriched for co-expression clusters within species, our analysis will be considering a subset of genes that are potentially enriched for within species clusters. Given that our permutation approach permutes among genes for which we have data this may lead to inflation of the number of clusters identified as significant at different cutoffs. However, testing the null hypothesis of the existence of co-expression clusters or clusters where there is co-evolution of expression is still valid in this case and the relative ordering of significance will not be affected. We therefore expect the clusters identified as the most significant in our analysis to reflect the most likely candidates for being true co-expression clusters or clusters where there is co-evolution of expression.

Background correction and normalization of our masked set was carried out using packages available in Bioconductor [54]. We found that a Loess smoothing followed by "mas5" and "liwong" [in Bioconductor: liwong or mas5(normalize(, method="loess"))] produced the best correction for intensity dependent trends as determined by MA plots (see Additional file 1). Results of our co-expression analysis for these two normalizations did not differ qualitatively and in the following we only present the results for MAS5. Note that the Loess smoothing has a stochastic component which results in slightly different measures for a given run. We repeated our Loess + normalization five times to assess the effects of this stochastic component. Differences when running the co-expression analysis on these were found to be minimal and we randomly chose one outcome for further analysis.

Obtaining accurate transcript abundance measures using the masking approach depends on the assumptions that using a subset of the 14 probe pairs does not dramatically bias the final measure of transcript abundance. Conversely, using a known PM probe that is diverged among species is expected to produce a biased measure. To assess the first of these assumptions, we compared the level of expression for in D. yakuba for 21 genes where all 14 probesets in D. yakuba were also PM in D. melanogaster. We analyzed the effect of removing 1, 5, 9, and 12 probes on the final expression measure using Loess + MAS5 (Figure 4A). We found that while the effect of probe removal can have a significant effect on expression (mean R2 for removal of 1, 5, 9, and 12 probes were 0.008, 0.0236, 0.1449, 0.1718, respectively), the effect is equally likely to be an increase or decrease – that is, the direction of the change in effect tends to be random (p-value >> 0.05 for sign tests). For the second assumption, we compared the gene expression levels after masking sets of 21 randomly chosen genes that had 1, 5, 9, and 12 mismatched probes when compared to D. yakuba to the results when not masking these probes (Figure 4B). The results were even more significant (mean R2 for removal probes were 0.2887, 0.3480, 0.3333, 0.4310, respectively) but biased towards producing smaller values when masking (sign test p-value << 0.01). In other words, including mismatched probes tends to bias estimates of expression upward. This is the overall trend when comparing the masked to unmasked measures for the probe sets included in the analyses described below (Figure 4C). We suspect this result occurs because a small subset of probes hybridized heterologous cRNA better than homologous cRNA. This subset of probes has a disproportionate effect on the mean because of the inability of the technology to detect differences between poorly hybridizing probes and very poorly hybridizing probes.

Figure 4
figure 4

Effects of probe masking on the estimated expression level. Each circle reflects the slope of the regression divided by mean squared error for a single probe set for a regression using all 48 samples: A) Effect of masking random probes for probe sets where all 14 probes are perfect matches for both D. melanogaster and D. yakuba, B) Effect of masking in probe sets where probes masked have diverged between D. melanogaster and D. yakuba (all other probes in a set are perfect matches for both species), C) Effect of masking for all probe sets included in the analysis.

We used Ensembl (CG/CR) annotation provided by Affymetrix and merged these with the Drosophila melanogater genome annotation (release 4.3) retaining genes with protein coding regions (CDS, protein, and gene). We used the D. melanogaster annotation to determine gene order. In total, our gene ordering included 14517 genes. After masking, we obtained transcript abundances for 7024 that were dispersed relatively evenly across the entire genome and generally tracked gene density: X: 1118, 2L: 1236, 2R: 1421, 3L: 1408, 3R: 1811, 4: 30. Our co-expression analysis was carried out on the entire ordering of 14517 genes handling missing data as described below. The genes along each chromosome were ordered based on the start coordinates of D. melanogaster and were numbered consecutively (See Additional file 3).

We performed ANOVAs to assess whether there was significant variation among the 7024 genes among species and within D. simulans. For a significance level of 0.05 we found 5344 (among) and 6825 (within) tests and at 0.001 we found 3111 and 2887, respectively. Thus, there appears to be detectably significant variation for most of the genes consider in our analysis.

Evolutionary Co-Expression Analysis

Our approach for identifying clusters where there is co-evolution of expression is closest to the analysis of correlation approach used by Spellman and Rubin 2002 with the following two differences: 1. use of an eigenvalue ratio statistic to summarize correlations within a window instead of the mean correlation, and 2. analysis of all window sizes from 2–20 instead of limiting analysis to a single window size of 10 genes. For the co-expression analysis, we calculated the mean value of each gene for a given species using the total set of the arrays used to assay that species. We then carried out a sliding window analysis of the correlations among pairs of consecutive genes, i.e. correlations for pairs of genes were considered where there were seven 'species' observations per gene. Our test statistic for each window is calculated as follows: 1. perform a principal components analysis (i.e. calculate eigenvectors) on the correlation matrix for all the genes in the window, 2. calculate the ratio of the first eigenvalue compared to the sum of all of the eigenvalues. Intuitively, this statistic reflects the fraction of variation along the first principal component. As an example of the structure of correlation captured by this statistic, consider the following correlation matrices:

[ 1.0 0.8 0.0 0.0 0.8 1.0 0.0 0.0 0.0 0.0 1.0 0.8 0.0 0.0 0.8 1.0 ] [ 1.0 0.8 0.8 0.7 0.8 1.0 0.7 0.8 0.8 0.7 1.0 0.8 0.7 0.8 0.8 1.0 ] [ 1.0 0.8 0.8 0.7 0.8 1.0 0.7 0.8 0.8 0.7 1.0 0.8 0.7 0.8 0.8 1.0 ] MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaqbaeqabeWaaaqaamaadmaabaqbaeqabqabaaaaaeaacqaIXaqmcqGGUaGlcqaIWaamaeaacqaIWaamcqGGUaGlcqaI4aaoaeaacqaIWaamcqGGUaGlcqaIWaamaeaacqaIWaamcqGGUaGlcqaIWaamaeaacqaIWaamcqGGUaGlcqaI4aaoaeaacqaIXaqmcqGGUaGlcqaIWaamaeaacqaIWaamcqGGUaGlcqaIWaamaeaacqaIWaamcqGGUaGlcqaIWaamaeaacqaIWaamcqGGUaGlcqaIWaamaeaacqaIWaamcqGGUaGlcqaIWaamaeaacqaIXaqmcqGGUaGlcqaIWaamaeaacqaIWaamcqGGUaGlcqaI4aaoaeaacqaIWaamcqGGUaGlcqaIWaamaeaacqaIWaamcqGGUaGlcqaIWaamaeaacqaIWaamcqGGUaGlcqaI4aaoaeaacqaIXaqmcqGGUaGlcqaIWaamaaaacaGLBbGaayzxaaaabaWaamWaaeaafaqabeabeaaaaaqaaiabigdaXiabc6caUiabicdaWaqaaiabicdaWiabc6caUiabiIda4aqaaiabicdaWiabc6caUiabiIda4aqaaiabicdaWiabc6caUiabiEda3aqaaiabicdaWiabc6caUiabiIda4aqaaiabigdaXiabc6caUiabicdaWaqaaiabicdaWiabc6caUiabiEda3aqaaiabicdaWiabc6caUiabiIda4aqaaiabicdaWiabc6caUiabiIda4aqaaiabicdaWiabc6caUiabiEda3aqaaiabigdaXiabc6caUiabicdaWaqaaiabicdaWiabc6caUiabiIda4aqaaiabicdaWiabc6caUiabiEda3aqaaiabicdaWiabc6caUiabiIda4aqaaiabicdaWiabc6caUiabiIda4aqaaiabigdaXiabc6caUiabicdaWaaaaiaawUfacaGLDbaaaeaadaWadaqaauaabeqaeqaaaaaabaGaeGymaeJaeiOla4IaeGimaadabaGaeGimaaJaeiOla4IaeGioaGdabaGaeyOeI0IaeGimaaJaeiOla4IaeGioaGdabaGaeyOeI0IaeGimaaJaeiOla4IaeG4naCdabaGaeGimaaJaeiOla4IaeGioaGdabaGaeGymaeJaeiOla4IaeGimaadabaGaeyOeI0IaeGimaaJaeiOla4IaeG4naCdabaGaeyOeI0IaeGimaaJaeiOla4IaeGioaGdabaGaeyOeI0IaeGimaaJaeiOla4IaeGioaGdabaGaeyOeI0IaeGimaaJaeiOla4IaeG4naCdabaGaeGymaeJaeiOla4IaeGimaadabaGaeGimaaJaeiOla4IaeGioaGdabaGaeyOeI0IaeGimaaJaeiOla4IaeG4naCdabaGaeyOeI0IaeGimaaJaeiOla4IaeGioaGdabaGaeGimaaJaeiOla4IaeGioaGdabaGaeGymaeJaeiOla4IaeGimaadaaaGaay5waiaaw2faaaaaaaa@BF96@

The value of our statistic for these matrices is 0.45, 0.825, and 0.825 respectively. Note that this statistic tends to reflect our intuition about the extent of correlation among the (in this case) four genes in the window. In the first case, gene pairs {1,2} and {3,4} are highly correlated but there is no correlation between genes in different pairs. This is therefore a case where the entire set of genes are not highly correlated compared to the case of the second matrix where every gene is highly correlated with every other gene in the window, and the statistic reflects this structure (0.825 > 0.45). Also, note that it does not matter if the correlations are negative or positive, only that the genes are acting as a correlated group, i.e. the second and third matrices have the same value of the statistic.

The co-expression analysis proceeded by calculating the eigenvalue ratio statistic for each consecutive (overlapping) window along each individual chromosome for window sizes of 2–20. Note that in most cases, missing data caused the total number of genes in a given window to be less than the window size. A direct consequence of this is that larger window sizes often included the same number of gene sets as smaller window sizes for specific windows. Each window size in the analysis should therefore not be considered as analyzing the exact number of genes of the window size but as a relative measure considering more genes on average as the window size increases.

To identify specific sets of genes that have more extreme co-expression for a given window size, we randomly permuted data within each chromosome for genes where we had data. We did this randomization 1000 times and repeated the sliding window analysis each time. We then counted the number of consecutive windows that had values of the statistic greater than the top 5%, 1%, and 0.1% of these random sets. In many cases, these groups of genes at the various cutoffs overlap. We compared the number of significant tests at these cutoffs to the expected number we would obtain if there were no true significant tests. This is the expected number of false discoveries and the ratio of these two is an estimate of the false discovery rate [34]. We present the number of significant windows (Table 1), the p-values associated with each window (Additional file 4), and merged significant windows at the 0.05 significance level (Additional file 5).

We additionally used a randomization approach to assess whether the genome-wide number of clusters where expression is co-evolving is greater than we would expect at random. We repeated the analysis and 1000 permutations for 100 random data sets, where again, we randomly re-ordered the genes within a chromosome for which we had data (i.e. we kept the missing data structure intact). We then repeated the co-expression analysis for window sizes 2, 5, 10, and 20 and counted the number of windows that had p-values less than 0.001 of the 1000 random sets of a given window size as described in the previous paragraph, i.e. our statistic used to assess significance is the number of windows identified at a given cutoff in our original data (Table 1). We used the 0.001 cutoff since this corresponds to the lowest FDR levels associated with the number of significant windows that we identified and we therefore have the greatest confidence that many of these reflect true clusters where expression is co-evolving. We did not perform this analysis for all window sizes given computational constraints.

A potential concern with this approach is that a single species with dramatically different gene expression levels may be driving the pattern across all species, i.e. not all species are evolving so co-expression is species specific. To assess this possibility we performed the sliding window analysis after removing a single species and compared the results to the analysis including all species. A dramatic difference between the number of regions identified would indicate that the overall pattern may be lineage/species specific. We determined the number of windows that were below a given cutoff in both the analysis of all species and for the analysis with one species removed. We present these results in Additional file 6. As a metric for comparison, we also determined the number of windows below cutoffs for the original analysis of all species and for a repeated analysis of all species (i.e. an additional independent 1000 permutations). Results are presented in Table 3.

An additional concern with our analysis is we are using the gene ordering of D. melanogaster. For species in the D. yakuba branch there are known to be a considerable number of rearrangements affecting gene order. For such locations, our analysis is not being performed on genes that are physically adjacent for all species. If our analysis is identifying adjacent genes that are more correlated than other gene sets, we would not expect to find highly correlated genes that are interrupted by these breakpoints more often than expected by chance. To assess this possibility, we mapped D. yakuba breakpoints to positions of the regions where expression is co-evolving. We generated this file by computationally comparing the locations of all D. yakuba genes in our array data set to the position of their D. melanogaster homolog. Anywhere that a block of D. yakuba genes shifted to a new position was recorded and marked as an inversion breakpoint. These breakpoints are essentially the same as those noted in Lemeunier and Ashburner 1976 [55]. Like Ranz et al. 2007, we find no effect of the presences of breakpoints on the presence or absence of cluster.

Within Species Co-Expression Analysis

The entire co-expression analysis was also repeated to identify co-expression domains within D. simulans. We used the D. yakuba mask file for this analysis so that the results are directly comparable to the results among species and conservative as we likely mask more genes than necessary. These data consisted of three replicate array assays for 10 crosses between D. simulans lines [56]. While expression array analysis of each cross therefore does not assess the transcript abundance variation associated with a single genotype, assay of these crosses provides a reasonable estimate of the variation observed in this D. simulans population. We performed the co-expression analysis on the mean value associated with each of the genotype crosses. We present the number of significant windows (Table 2), the p-values associated with each window (Additional file 7), and merged significant windows at the 0.05 significance level (Additional file 8). We compared these regions to those found among species by comparing the number of windows with p-values below 0.05, 0.01, and 0.001. As a metric for comparison, we also determined the number of windows below cutoffs for a repeated analysis of D. simulans (Table 3).

The evolutionary analysis of transcript abundance considers a single representative line per species with the exception of D. simulans. Thus, the observed variation between species – in principle – may only reflect the typical intra-specific variation between any two lines within species. To assess this possibility, we performed individual ANOVAs for each gene comparing the values of the D. simulans crossed-genotypes to the other six species considered as a group. We compared the distribution of p-values for this analysis to ANOVAs for the same partition where each array was assigned to a partition at random. We identified far more p-values at a given cutoff for the D. simulans vs. other species indicating that the variation we are analyzing between species is far greater than the variation we observed within the 10 combined-genotypes of D. simulans (6825 significant tests as compared to 5291 in the random assignment at 0.05, 2887 significant compared to 2022 random at 0.001).

We compared the relative locations of clusters we identified to be evolving across species and within D. simulans at the 0.05 and 0.001 cutoffs to the intra-species clusters identified in D. melanogaster by Spellman and Rubin 2002 [12]. Since Spellman and Rubin 2002 reported a set of clusters representing merged overlapping windows using a window size of 10 genes, we calculated merged windows for a window size of 10 for our among and within species analysis and compared the overlap between these windows. Note that in the Spellman and Rubin 2002 study, the co-expression among genes expressed at different developmental stages and under different environmental conditions was analyzed while in the current study, only expression at the adult stage was considered. Not surprisingly, we found little correspondence between the co-expression clusters identified within D. simulans when compared to the results of Spellman and Rubin 2002.

Additional Representation Analyses

We considered whether the existence of paralogous genes could explain the existence of clusters. Tandem pairs of duplicated genes were identified by sequentially estimating the degree of similarity of all neighboring pairs of genes on all major chromosomes using the gene ordering of FLYBASE v5 (i.e. we are using D. melanogaster as a reference). At a local scale these syntenic relationships should be preserved across taxa [33]. The first transcript of each gene (e.g. PA) was translated and pairs of proteins were then aligned with bl2seq, a version of BLAST that is optimized for pair wise alignments. Genes with 85% amino acid similarity were considered tandem duplicates. Typically, more than 90% of these genes were greater than 90% similar in amino acid sequence. We then determined the number of cases where paralogs gene pairs were located within clusters identified among species and within D. simulans (Additional file 9).

We also considered whether clusters either within D. simulans or among species tend to be located at regions of high gene density. We calculated the gene density in a co-evolving or co-expression window from the start of the first gene to the start of the last gene in that cluster. We then calculated the empirical distribution of like sized windows across that chromosome arm. The mean gene density of clusters was then compared to this distribution.

We used the coefficient of dispersion ( C D = s 2 X ¯ MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGacaGaaiaabeqaaeqabiWaaaGcbaGaem4qamKaemiraqKaeyypa0tcfa4aaSGaaeaacqWGZbWCdaahaaqabeaacqaIYaGmaaaabaGafmiwaGLbaebaaaaaaa@336D@ ), a commonly used as a measure of spatial clustering, to look at large scale spatial patterns of both genes and clusters. For genes, the distance (in nucleotides) from the end of one gene to the start of the next was calculated for each locus. For clusters, we used the distance in nucleotides from the end of one cluster to the start of the next. The mean, variance, and CD were calculated for each chromosome arm for both genes and clusters. CD much greater than one normally suggests clumping; CD less than one indicates over dispersal.

We mined FLYBASE for genetic map data and physical map data. From these data we inferred regional recombination rates. We then compared the mean recombination rate of the genes in our clusters to the genome average. There was no significant difference.

We used the D. melanogaster annotation to see if pairs of genes within clusters tended to be on the same strand as each other. For simplicity, we limited this analysis to clusters of only two genes. If shared cis-regulatory regions were critical to coordinated expression evolution, we may expect that the genes would tend to be on the same stand.

We used the DAVID analysis tool [57] to determine which gene ontology classes were over-represented in identified clusters. We did this for both clusters identified across species and within D. simulans at window size of 2 and 10 using the 0.05 and 0.001 cutoffs.

Evolutionary Analysis

Begun et al. 2007 identified genes in the D. simulans genome that deviated from the standard neutral model in a manner consistent with recurrent directional selection. They performed both polarized and unpolarized McDonald-Krietman tests [43]. Polarized tests only identify genes evolving "adaptively" along the D. simulans lineage. Unpolarized tests confound divergence across both D. melanogaster and D. simulans lineages, but may capture genes adaptively evolving in both species. Both data sets were used to see if genes within clusters – either within D. simulans or among taxa – are enriched for adaptively evolving genes. We employed a re-sampling approach to determine if one particular subset of genes was enriched (10,000 samples for each analysis). As only those genes for which sufficient polymorphism data were used Begun et al.'s analysis, only a subsample of the genome was tested (unpolarized N = 6704; polarized N = 2653). As detailed in Begun et al. 2007, if a correction for multiple testing, such as a Bonferroni correction, is used very few genes in the genome-wide comparison are significant. Thus, for the resampling analysis we set a nominal p-value threshold for the McDonald-Krietman tests; enrichment is thus defined as having a significantly greater number of McDonald-Krietman tests with p-values below this nominal threshold.