Background

The use of complete genomes for phylogenetic analysis has greatly improved our understanding of prokaryote evolution [1–3]. However, until recently, relatively few complete genome sequences were available for such analyses in eukaryotes. As this situation improves, there will be a greater demand for methods of evolutionary analysis of complete genomes. Previous whole-genome studies of eukaryotes have focused on gene and gene family presence-absence [4–7], lineage-specific gene loss [8, 9], insertion-deletion markers and introns [6, 10, 11], and other non-sequence based information. While these approaches have their advantages, previous studies have not used complete genome sequences (nucleotides and/or amino acids) for reconstructing evolutionary relationships. At the same time, the complexity of eukaryote genomes, with numerous gene duplications and losses in different lineages, has created a challenge for sequence-based phylogeny estimation. Here, we outline a conservative approach designed to utilize the wealth of evolutionary information present in complete genome sequences by identifying orthologs in multiple eukaryotes for the purpose of evolutionary analysis.

Methods for the identification of clusters of orthologs and lineage-specific paralogs have proven useful for classifying gene function and identifying cases where genes have been differentially lost or duplicated in different lineages [12–14]. However, such assemblages of data contain a mixture of orthologs, paralogs, and missing data as a result of gene loss, and are not generally suitable for large-scale phylogenetic sequence analysis of organismal evolution. Our approach for comparing multiple genome sequences involves the identification of single-copy orthologs across a number of genomes for evolutionary analysis (Figure 1). We refer to such strict (1:1) orthologs as panorthologs, in reference to their presumed "complete" orthology, in contrast to synorthologs, which contain a mixture of species divergences and gene duplication events. In other words, panorthologs are those genes (or clusters of sequences) that contain only species divergences and do not contain in-paralogs, out-paralogs, or co-orthologs [15]. On the other hand, synorthologs are those genes (or clusters of sequences) that contain species divergences and any combination of paralogy (in-paralogs and out-paralogs). While the use of panorthologs is conservative and reduces the number of usable genes or proteins, it also lowers the probability that errors will be made in confusing a species divergence with a gene duplication event. Because the ability to identify orthologs is diminished in analyses of small to moderate numbers of species or genomes, such a conservative method is appropriate in those cases. This conservative approach has been used to identify the number of shared, unduplicated proteins in Homo sapiens, Drosophila melanogaster, Caenorhabditis elegans, and Saccharomyces cerevisiae, where it was determined that such proteins perform primarily anabolic rather than catabolic functions [16].
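The distinction between panorthologs and synorthologs can be illustrated with a minimal Python sketch. It assumes a cluster is stored as a mapping from a genome label to the list of sequence identifiers assigned to that genome; this layout, the function name, and the "incomplete" label for clusters with a missing genome are illustrative assumptions, not part of the original implementation.

```python
def classify_cluster(cluster, genomes):
    """Label a cluster 'panortholog', 'synortholog', or 'incomplete'.

    cluster: dict mapping genome label -> list of sequence IDs (hypothetical layout).
    genomes: the genome labels that must all be represented.
    """
    counts = [len(cluster.get(g, [])) for g in genomes]
    if any(c == 0 for c in counts):
        return "incomplete"   # at least one genome has no member (e.g. gene loss)
    if all(c == 1 for c in counts):
        return "panortholog"  # exactly one sequence per genome
    return "synortholog"      # all genomes present, at least one with paralogs


example = {"Hsa": ["h1"], "Dme": ["d1"], "Sce": ["s1", "s2"]}
print(classify_cluster(example, ["Hsa", "Dme", "Sce"]))  # synortholog
```

Under this definition, a single extra sequence in any genome is enough to exclude a cluster from the panortholog set, which is what makes the criterion conservative.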

Figure 1

Flowchart of multigenome intersection approach (MIA). 1) Complete genomes are reciprocally compared against themselves and all other genomes with BLAST. 2) Pairwise ortholog clusters are identified using similarity scores and imported into a local database. 3) The intersection between genomes is determined by iteratively comparing sequence identification tags and retaining those clusters showing panorthology. 4) Additional genomes are added and checked as in the previous step. 5) Sequence data files are generated for evolutionary analysis.

We compare our phylogenetic results and divergence time estimates for an analysis of seventeen published eukaryote genomes with those of a previous study that assembled nuclear protein sequence data in a more conventional manner from public databases [17]. While the phylogenetic relationships between the organisms included in this study are not controversial, with the exception of the position of nematodes [18], this general approach will prove useful as more genomes, including those with questionable phylogenetic affinity, are sequenced. In addition, this approach facilitates the estimation of divergence times between organisms with numerous molecular clock methods.

Results

The number of orthologous clusters per pairwise comparison and the percentage of those clusters showing panorthology are presented in Table 1. On average, pairwise orthologous clusters contained approximately 60.3% panorthologs; exceptions include comparisons between fungi, including Encephalitozoon (average 89% panorthology), and all comparisons with Arabidopsis (average 34.6% panorthology). Comparisons within metazoans averaged 54.7% panorthology, with Mus and Rattus showing the highest number of shared transcripts (16,413 orthologous clusters; 79.2% panorthology) as expected due to their recent evolutionary divergence. Previous analyses showed approximately 12,400 panorthologs between Mus and Rattus [19]. Caenorhabditis elegans and C. briggsae, which diverged roughly 100 Ma [20], also shared a large number of transcripts (12,416 orthologous clusters; 84.4% panorthology), which is similar to a previous estimate of 12,155 panorthologs [21]. The number of orthologous clusters between Drosophila and Anopheles found here (7072; 61.3% panorthology) is also similar to a previous estimate of approximately 6130 panorthologs [22]. Pairwise comparisons with the smallest genome, the Guillardia nucleomorph, averaged ~176 orthologous clusters, but the percentage of panorthologs varied greatly, from a low of 25.1% with Arabidopsis to as high as 97.2% with Encephalitozoon.
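Because cluster counts and panorthology percentages are reported separately, the implied number of panorthologous clusters for a pair can be recovered by multiplication. The Python sketch below uses only the three pairs of figures quoted in the text; the function name is illustrative, and the results are approximations because the percentages are themselves rounded.

```python
# (orthologous clusters, % panorthology) as quoted in the text for three pairs
pairs = {
    "Mus-Rattus": (16413, 79.2),
    "C. elegans-C. briggsae": (12416, 84.4),
    "Drosophila-Anopheles": (7072, 61.3),
}


def implied_panorthologs(clusters, percent):
    """Approximate count of panorthologous clusters from a rounded percentage."""
    return round(clusters * percent / 100)


for name, (clusters, percent) in pairs.items():
    print(name, implied_panorthologs(clusters, percent))
```

This gives roughly 13,000, 10,500, and 4,300 panorthologous clusters for the three pairs, respectively.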

Table 1 Number of orthologous clusters (upper-right) and percentage panorthologs (lower-left) per pairwise comparison.

The intersection of nine metazoan genomes resulted in a large number of shared genes. Among the nine genomes, 285 panorthologs were found, totaling 97,581 amino acids. The neighbor-joining tree of that concatenation is shown in Figure 2; all nodes in this tree received 100% bootstrap support. The intersection of all seventeen eukaryote genomes included in this study resulted in three shared genes (t-complex protein delta subunit, proteasome beta type-1 subunit, and Nip7p biogenesis factor) and orthology was confirmed manually. The reconstructed trees for the three genes showed long-branch attraction errors associated with the intracellular parasite Encephalitozoon and the Guillardia nucleomorph (data not shown). This was expected because both have highly reduced genomes and high rates of substitution across many genes as a result of their current or ancestral parasitic and symbiotic lifestyles [23, 24]. For this reason, the intersection of the remaining fifteen genomes was determined, resulting in ten panorthologs. The intersection of genomes from twelve multicellular eukaryotes resulted in 63 panorthologs. The functional classifications of the panorthologs found here are similar to those identified in previous studies [14], with the most frequently represented functions being transcription, translation, replication and repair, and RNA processing. The phylogenetic trees reconstructed from the concatenated datasets both showed the expected relationships (Figure 3a and 3b) [17]. All nodes in these trees received very high bootstrap support, with only one node showing less than 95% bootstrap support (animals + fungi in Figure 3a). The long branch observed in Plasmodium (Figure 3a) may be the result of both the long evolutionary separation from the other eukaryotes included in this study, and the high (A-T) composition of the genome [25] leading to biased amino acid compositions among proteins [26].

Figure 2

Neighbor-joining tree of nine metazoan genomes, 285 panorthologs (97,581 amino acid positions, alpha = 1.28). All nodes are supported significantly (>95%) in bootstrap analyses of neighbor-joining and maximum likelihood. The arrow indicates an alternative root [6, 18].

Figure 3

Neighbor-joining trees of complete eukaryotic genome sequence analyses. (A) The intersection of fifteen eukaryotic genomes, 10 panorthologs (5094 amino acid positions, alpha = 1.01). (B) The intersection of genomes from twelve multicellular eukaryotes, 63 panorthologs (23,571 amino acid positions, alpha = 1.15). All nodes are supported significantly (>95%) in bootstrap analyses of neighbor-joining and maximum likelihood, with the exception of node indicated by an asterisk (94% with maximum likelihood) in (A).

Phylogenetic trees were also reconstructed for each panortholog to test for congruence with well-supported phylogenies from the concatenated data (see Additional file 1). We found that in most cases, the consensus values calculated from individual trees agree with the high bootstrap support of the concatenated analysis. Two exceptions were the low consensus values for the accepted close relationship between animals and fungi and the contested position of nematodes. Both taxa showed slightly longer branch lengths, and long-branch attraction artifacts [27] may be affecting the individual datasets, causing low consensus values. Also, recent empirical [28] and simulation [29] studies suggest that results from multigene analyses are more accurate when a tree is derived from a concatenated dataset of individual genes rather than a consensus of trees from multiple analyses.

Divergence times were estimated for both the 15-genome and 12-genome datasets (Table 2). Results were consistent with previous studies [17, 30–33], showing an early divergence between plants, animals, and fungi (animals/fungi vs. plants ~1670 Ma, animals vs. fungi ~1400 Ma), and a Precambrian origin for animals (~900–1100 Ma). To specifically address the deuterostome-arthropod divergence within animals, two additional datasets were assembled to maximize the number of proteins analyzed: the intersection of seven genomes (Homo, Mus, Rattus, Takifugu, Drosophila, and Anopheles; Arabidopsis as outgroup) and the intersection of four genomes (Homo, Takifugu, and Drosophila; Arabidopsis as outgroup). The seven genome intersection contained 380 panorthologs (132,190 amino acids; Figure 4a), and yielded a vertebrate-arthropod divergence time of 955 Ma. The four genome intersection contained 753 panorthologs (287,000 amino acids; Figure 4b) and yielded a vertebrate-arthropod divergence time of ~1100 Ma. Although this last estimate was derived from more than five times the data previously used, the divergence time is remarkably consistent with previous large-scale studies [17], and suggests that bilaterian animals originated hundreds of millions of years before the first fossil evidence of their existence in the Cambrian (~520 Ma). With the exception of the maximum fossil-based time estimate used in the tetrapod-actinopterygian fish calibration, the other fossil constraints used here are minimums, and therefore the resulting time estimates are minimums [34]. The agreement between our results and those of previous studies using different methods of data assembly suggests that our genome intersection approach is correctly assembling orthologs. Younger time estimates of the vertebrate-arthropod divergence have been obtained in some studies [35–37]. However, those results are problematic because they also contain estimates that are too young to be consistent with undisputed aspects of the fossil record, such as the oldest red algae (1200 Ma), green algae (1000 Ma), and stramenopiles (1000 Ma) [38–40]. Possible reasons for their inconsistency are discussed elsewhere [41].

Table 2 Bayesian divergence time estimates (± one standard deviation) for eukaryote genome datasets.

Figure 4

Neighbor-joining trees of genomes used to address deuterostome-arthropod divergence time. (A) The intersection of seven eukaryotic genomes, 380 panorthologs (132,190 amino acid positions, alpha = 1.38). (B) The intersection of four eukaryotic genomes, 753 panorthologs (287,000 amino acid positions, alpha = 1.46). All nodes are supported significantly (>95%) in bootstrap analyses of neighbor-joining and maximum likelihood.

Discussion

The complete genome sequence of an organism is essentially the maximum amount of discrete, genetically encoded information available for evolutionary analyses. However, orthology determination has been a major obstacle in the analysis of complete genomes, especially in eukaryotes, where considerable gene duplication and loss have created additional complexity. Our approach for evolutionary analysis of complete eukaryote genome sequences is both simple and fast compared with the conventional method of gene-by-gene orthology determination using similarity searches in the public databases. The results of this approach applied to a subset of the available eukaryote genomes show agreement with previous results using conventional (non-genomic) approaches. In addition, the relatively high consensus values for most nodes indicate general agreement in tree topology among individual panorthologs.

The relatively low number of common genes in our intersections of 12–17 genomes results from the combination of using panorthologs and including distantly related species. Genes are more likely to duplicate over long periods of evolutionary time, as in the time elapsed since plants separated from animals (~1600 Ma) [17, 32]. Therefore, a better approach with such distantly related species (e.g., all eukaryotes), and those groups with high levels of gene duplication and gene loss (e.g., nematodes), would be to relax the orthology criterion and include synorthologs. In that case, a representative or consensus sequence may be chosen from among in-paralogs. On the other hand, the implementation used here (panorthologs) should yield many genes in analyses of genomes from closely related species (e.g., within mammals), even if large numbers of species are used.

The use of sequence data for comparative genomics and phylogenetics has several advantages over the use of datasets based on the presence and absence or position of genes, introns and insertions. Sequence data can provide a larger number of characters for analysis, yielding hundreds of thousands of amino acid sites and more than a million nucleotide sites in some cases. Also, statistical models of sequence change are better known than those for non-sequence based data. Finally, the assembly of sequence data from complete genomes of multiple organisms facilitates not only phylogenetic and divergence time analyses, but also a diversity of other comparative evolutionary analyses requiring sequence information [42, 43].

Conclusion

Unlike previous studies of complete eukaryote genomes, here we have used a fast, conservative approach to assemble orthologous clusters of proteins for phylogenetic analysis and divergence time estimation. Our results are similar to those of previous studies that used conventional (slower) gene-by-gene data mining. We find that complete genome sequences support the close evolutionary relationship between animals and fungi, and also that molecular divergences between animals occurred approximately 400 million years before the Cambrian Explosion of fossils. Our approach will be further tested as more eukaryote genomes are sequenced.

Methods

Multigenome intersection approach for evolutionary analysis (MIA): Reciprocal BLAST [44] searches of genomes versus themselves and versus all other genomes included in the analysis were used to generate lists of pairwise similarity scores for each transcript. These scores were then used to generate orthologous clusters of proteins by first determining the "primary" ortholog pair through reciprocal best hits, then adding lineage-specific paralogs (in-paralogs) as implemented in INPARANOID [45]. The settings used here (sequence overlap cut-off 50%, group overlap cut-off 50%, in-paralog confidence cut-off 5%) were considered optimal in the sense that all closely related lineage-specific paralogs (and alternative transcripts) will be placed in the same pairwise cluster, minimizing the probability that the same gene will be represented in more than one cluster. In-paralogs are added to clusters if they are more similar to one member of the primary ortholog pair than the two primary orthologs are to each other [45].
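The two rules described above (reciprocal best hits for the primary ortholog pair, then addition of in-paralogs) can be sketched in Python as follows. This is a simplified illustration and not the INPARANOID implementation: the `scores` dictionary keyed by (query, subject) pairs, the function names, and the toy scores are assumptions, and the overlap and confidence cut-offs mentioned above are omitted.

```python
def best_hits(scores, queries, targets):
    """For each query, return the target with the highest score (first seen on ties)."""
    best = {}
    for q in queries:
        candidates = [(scores[(q, t)], t) for t in targets if (q, t) in scores]
        if candidates:
            best[q] = max(candidates, key=lambda st: st[0])[1]
    return best


def reciprocal_best_hits(scores, genome_a, genome_b):
    """Primary ortholog pairs: a's best hit in B is b, and b's best hit in A is a."""
    a_to_b = best_hits(scores, genome_a, genome_b)
    b_to_a = best_hits(scores, genome_b, genome_a)
    return [(a, b) for a, b in a_to_b.items() if b_to_a.get(b) == a]


def in_paralogs(scores, seed, own_genome, ortholog_score):
    """Same-genome sequences more similar to the seed than the seed is to its ortholog."""
    return [s for s in own_genome
            if s != seed and scores.get((seed, s), 0) > ortholog_score]


# Toy similarity scores (hypothetical values)
genome_a, genome_b = ["a1", "a2"], ["b1"]
scores = {
    ("a1", "b1"): 500, ("b1", "a1"): 480,
    ("a2", "b1"): 300, ("b1", "a2"): 290,
    ("a1", "a2"): 700, ("a2", "a1"): 700,
}
pairs = reciprocal_best_hits(scores, genome_a, genome_b)
print(pairs)                                                      # [('a1', 'b1')]
print(in_paralogs(scores, "a1", genome_a, scores[("a1", "b1")]))  # ['a2']
```

In this toy case the resulting pairwise cluster would contain a1, a2 and b1, so it would not count as a panortholog.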

Only pairwise ortholog clusters can be obtained using INPARANOID. For phylogenetic analysis, ortholog sets for a larger number of genomes (at least four) must be constructed. Therefore, we combined the pairwise ortholog clusters from groups of species using a relational database. The intersection between ortholog clusters was determined by iteratively comparing sequence identification tags and reducing the intersected clusters at each round to exclude clusters that represent relationships other than panorthology. For example, consider genomes A, B and C. First, ortholog clusters are determined for each pairwise genome comparison, which results in the cluster sets A-B, B-C and A-C. Intersection of sets A-B and B-C is obtained by searching for common sequences of genome B in the two sets and merging the two sets accordingly into an ortholog cluster set of A-B-C. This combined set is reduced to include only clusters with panorthology relationships, i.e. only clusters with one sequence for each of the genomes A, B and C are retained. Further, the combined cluster A-B-C is compared with the pairwise cluster A-C based on common sequence tags for genome C. Any cluster from the combined A-B-C set that does not agree with the sequence grouping of genomes A and C is removed. This last step serves as an important check for orthology in each iteration of the intersection procedure, and is similar to the construction of three-member COGs (clusters of orthologous groups) [46]. The steps described above were performed iteratively in order to add more species to the ortholog clusters. Any number of genomes can be intersected (tested up to seventeen here), and an outgroup can be treated as part of the intersection or added separately by using a pairwise comparison to one of the in-group taxa. All programs were written in Perl.
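The three-genome example above can be sketched in Python (the original programs were written in Perl with a relational database; this in-memory version is only an illustration). Each cluster is assumed to be a dict mapping a genome label to a set of sequence tags, and the final check is assumed to require an exact 1:1 match in the A-C set; both choices, and all function names, are assumptions.

```python
def merge_on_shared(left, right, shared):
    """Merge clusters from two sets whenever they share a sequence of genome `shared`."""
    merged = []
    for lc in left:
        for rc in right:
            if lc[shared] & rc[shared]:
                combined = {}
                for cluster in (lc, rc):
                    for genome, ids in cluster.items():
                        combined.setdefault(genome, set()).update(ids)
                merged.append(combined)
    return merged


def keep_panorthologs(clusters):
    """Retain clusters with exactly one sequence for every genome."""
    return [c for c in clusters if all(len(ids) == 1 for ids in c.values())]


def confirm_with(clusters, check, g1, g2):
    """Keep clusters whose g1/g2 members form a 1:1 cluster in `check` (assumed strict rule)."""
    return [c for c in clusters
            if any(k[g1] == c[g1] and k[g2] == c[g2] for k in check)]


# Toy pairwise cluster sets (hypothetical sequence tags)
ab = [{"A": {"a1"}, "B": {"b1"}}, {"A": {"a2"}, "B": {"b2", "b3"}}]
bc = [{"B": {"b1"}, "C": {"c1"}}, {"B": {"b2"}, "C": {"c2"}}]
ac = [{"A": {"a1"}, "C": {"c1"}}, {"A": {"a2"}, "C": {"c2"}}]

abc = keep_panorthologs(merge_on_shared(ab, bc, "B"))
abc = confirm_with(abc, ac, "A", "C")
print(abc)  # one cluster remains: a1, b1, c1
```

Adding a fourth genome D would repeat the same pattern: merge the A-B-C set with a pairwise set that shares one genome, reduce to panorthologs, and confirm against another pairwise set involving D.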

Analysis of Eukaryote Genomes: Complete protein transcripts were obtained for the following eukaryote genomes [three-letter abbreviation]: Homo sapiens [Hsa] (version 34b.2) [16, 47, 48]; Mus musculus [Mmu] (version 32.2) [48, 49]; Rattus norvegicus [Rno] (version 3b.1) [19, 48]; Takifugu rubripes [Tru] (version 3.0) [50, 51]; Ciona intestinalis [Cin] (version 1.0) [51, 52]; Drosophila melanogaster [Dme] (version 3.1) [53, 54]; Anopheles gambiae [Aga] (version 2a.2) [48, 55]; Caenorhabditis elegans [Cel] (version 120) [56, 57]; Caenorhabditis briggsae [Cbr] (version 25) [21, 57]; Saccharomyces cerevisiae [Sce] [58, 59]; Neurospora crassa [Ncr] (version 3) [60, 61]; Ashbya gossypii [Ago] (version 1.0) [62, 63]; Encephalitozoon cuniculi [Ecu] [24, 64]; Arabidopsis thaliana [Ath] (version 5.0) [65, 66]; Cyanidioschyzon merolae [Cme] [67, 68]; Guillardia theta nucleomorph [Gtn] [23, 64]; and Plasmodium falciparum 3D7 [Pfa] (version 4.1) [25, 69]. Some genome transcripts were given unique sequence identifiers to avoid redundancy when sequence tags are truncated. The intersection of these seventeen genomes was determined as described above.

Each panortholog was aligned [70] and individual datasets were concatenated. Both individual panorthologs and concatenations were analyzed using maximum likelihood [71] to determine alpha parameters for the gamma rate-heterogeneity correction. Phylogenetic trees of concatenated datasets were reconstructed using neighbor-joining (Poisson + gamma correction model) with bootstrapping (2000 replicates) [72] and maximum likelihood (JTT + gamma correction model) with 1000 puzzling steps [73]. Phylogenetic trees of individual datasets were reconstructed using maximum likelihood (Poisson + gamma correction model) [71] and a consensus tree was derived [74]. Consensus values (i.e. the proportion of trees recovering a specific node) were calculated for each dataset.
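Two of the bookkeeping steps above, concatenating the individual alignments and calculating a consensus value for a node, can be sketched in Python. The data layouts are assumptions made for illustration: an alignment is a dict mapping a taxon label to its aligned sequence, and a tree is reduced to the set of clades (frozensets of taxon labels) it contains. In practice the clades would be extracted from trees produced by the phylogenetic programs cited above.

```python
def concatenate(alignments, taxa):
    """Join per-gene alignments end to end, taxon by taxon."""
    return {t: "".join(aln[t] for aln in alignments) for t in taxa}


def consensus_value(trees, clade):
    """Proportion of trees that contain the given clade."""
    clade = frozenset(clade)
    return sum(clade in tree for tree in trees) / len(trees)


# Toy data (hypothetical sequences and trees)
gene1 = {"Hsa": "MK-L", "Mmu": "MKAL", "Dme": "MRAL"}
gene2 = {"Hsa": "GT", "Mmu": "GT", "Dme": "GS"}
print(concatenate([gene1, gene2], ["Hsa", "Mmu", "Dme"]))

trees = [
    {frozenset({"Hsa", "Mmu"})},
    {frozenset({"Hsa", "Mmu"})},
    {frozenset({"Mmu", "Dme"})},
]
print(consensus_value(trees, {"Hsa", "Mmu"}))  # two of three trees recover the clade
```

A node with a consensus value near 1 is recovered by almost every individual panortholog, whereas a low value indicates disagreement among the single-gene trees.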

Divergence times were estimated for concatenated datasets using Bayesian inference (JTT model) [75] as described previously [17]. The following fossil dates were used as constraints (minimums unless otherwise noted): Mus-Rattus 12 Ma [76], primate-rodent 65 Ma [77], tetrapod-actinopterygian fish 425 Ma (lower bound) and 495 Ma (upper bound) [78], vertebrate-urochordate 520 Ma [79], Drosophila-Anopheles 250 Ma [80], chordate-arthropod 545 Ma [77], and green algae/plants-red algae 1200 Ma [40].