Background

Traditionally, the animal body cavity (coelom) has played a major role in interpretations of metazoan evolution, from groups (e.g., flatworms) lacking a coelom to those (e.g., nematodes) with a false coelom and finally to the bulk of animal phyla having a true coelom (Coelomata) [1, 2]. There has never been complete agreement on animal phylogeny and classification, but most researchers have divided living coelomate animals into deuterostomes (echinoderms, hemichordates, urochordates, cephalochordates, and vertebrates) and protostomes (arthropods, annelids, mollusks, and other phyla) based on differences in early embryonic development. An analysis of small subunit ribosomal RNA (18S rRNA) sequences challenged this arrangement by placing acoelomate and pseudocoelomate phyla in more derived positions among the protostomes, and by further defining a clade (Ecdysozoa) of molting animals that includes arthropods and nematodes [3]. This "Ecdysozoa" hypothesis has influenced diverse fields [4] and interpretations of developmental evolution in animals [5–7]. Since its publication, evidence has appeared both for and against this hypothesis [8–15]. Knowing the branching order of the major animal lineages, especially those three with fully sequenced genomes, is of importance to diverse fields such as medical genetics, physiology, neurobiology, paleontology, and astrobiology. With a genealogy of animals, it will be easier to determine the origins and inheritance of mutations, genes, gene functions, and structures.

The three possible relationships of these animal phyla are: (I) arthropods + vertebrates, (II) arthropods + nematodes, and (III) nematodes + vertebrates. The first hypothesis corresponds to the traditional grouping Coelomata and the second corresponds to Ecdysozoa [3]. For convenience, we will use these names in reference to the two hypotheses while recognizing that this study, by necessity, involves only a subset of all animal phyla. The third hypothesis will be referred to as "hypothesis III" (Fig. 1). To test each hypothesis, sequence alignments of more than 100 nuclear proteins were assembled and subjected to a series of analyses designed to reveal biases that could result in an incorrect phylogeny.

Figure 1

The three possible relationships of vertebrates, arthropods, and nematodes.

Results and discussion

Analyses of the individual protein datasets using neighbor-joining [16] show that most (62%) support Coelomata while 25% support Ecdysozoa and 13% support hypothesis III (Supplemental Table 1, four-taxon analysis). Of the 25 proteins in which support for one of the three hypotheses is significant (≥ 95%), the results are 84% (21 proteins), 16% (4), and 0%, respectively; for those ten proteins with a highly significant (≥ 99%) topology, the results are 90% (9 proteins), 10% (1), and 0%, respectively. The four proteins showing significant support (>95%) for a hypothesis other than Coelomata were reanalyzed using maximum parsimony and maximum likelihood; bootstrap values were not significant using other methods. Such divided results are typical of single-gene analyses because of limited information (~400 amino acids), necessitating combined analysis. Coelomata was supported significantly (100% bootstrap confidence, posterior probability = 1.0) when the 100 four-taxon protein alignments were concatenated and analyzed using neighbor-joining, maximum parsimony, maximum likelihood, and Bayesian inference (Fig. 2A). Using the Shimodaira-Hasegawa (SH) test [17], this maximum likelihood topology was significantly different from the alternative hypotheses of Ecdysozoa (P < 0.001) and Hypothesis III (P < 0.001). These results agree with earlier studies involving 10–36 nuclear proteins [8, 9, 11].
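The tally of single-protein results can be illustrated with a short Python sketch. It assumes that each protein is assigned to whichever of the three topologies has the highest bootstrap support and is counted only if that support reaches the chosen threshold; the function names and the input layout are hypothetical and are not taken from the original analysis. The final call reproduces the reported percentages from the counts given in the text (21, 4, and 0 of 25 proteins).

```python
HYPOTHESES = ("Coelomata", "Ecdysozoa", "Hypothesis III")


def classify(bootstrap, threshold=95.0):
    """Return (best_hypothesis, is_significant) for one protein.

    `bootstrap` maps each hypothesis name to its bootstrap support in percent.
    The layout is an assumption made for this illustration.
    """
    best = max(HYPOTHESES, key=lambda name: bootstrap[name])
    return best, bootstrap[best] >= threshold


def tally_significant(proteins, threshold=95.0):
    """Count proteins whose best-supported topology reaches the threshold."""
    counts = dict.fromkeys(HYPOTHESES, 0)
    for bootstrap in proteins.values():
        best, significant = classify(bootstrap, threshold)
        if significant:
            counts[best] += 1
    return counts


def as_percentages(counts):
    """Convert counts to whole-number percentages of the total."""
    total = sum(counts.values())
    if total == 0:
        return dict.fromkeys(counts, 0)
    return {name: round(100 * n / total) for name, n in counts.items()}


# Counts reported in the text for proteins with significant support.
reported = {"Coelomata": 21, "Ecdysozoa": 4, "Hypothesis III": 0}
print(as_percentages(reported))
```

Raising `threshold` to 99.0 in `tally_significant` corresponds to the "highly significant" category used in the text.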

Figure 2

Phylogenetic analyses of individual and combined (concatenated) sequence alignments bearing on the position of nematodes. V = vertebrate, A = arthropod, N = nematode, P = platyhelminth. Bootstrap values (>95%) are shown for neighbor-joining, maximum parsimony, and maximum likelihood, respectively; all are indicated for the node joining Homo and Drosophila (= Coelomata). Posterior probabilities are not shown (all highlighted nodes = 1.0). (A) Four-taxon analysis of 100 combined protein alignments (44,214 amino acids), using nematode Caenorhabditis elegans (Chromadorea, Rhabditida, Rhabditoidea, Rhabditidae); the nematode branch is approximately 16% longer than the vertebrate and arthropod branches. (B) Five-taxon analysis of 100 combined proteins includes planarian EST sequences (14,041 amino acids); the nematode branch is approximately 23% longer. Other trees show different representative nematodes. (C) Brugia (Chromadorea, Spirurida, Filarioidea, Onchocercidae), based on 18 combined proteins (4598 amino acids); nematode branch = 15% longer. (D) Trichinella (Enoplea, Trichocephalida, Trichinellidae), based on 6 combined proteins (2261 amino acids); nematode branch = 24% longer than the vertebrate branch and 5% shorter than the arthropod branch. (E) Proportion of individual protein analyses supporting each of the three possible topologies with differing numbers of phyla included (4 taxa = 124 proteins, 5 taxa = 107 proteins, 6 taxa = 66 proteins, >6 taxa = 12 proteins).

To test the stability of Coelomata to taxon sampling, we included new sequences of the planarian Dugesia japonica (Phylum Platyhelminthes) in 100 five-taxon protein alignments (Supplemental Table 1, five-taxon analysis). The results, upon concatenation with this additional taxon, were unchanged (Fig. 2B): Coelomata continued to be significantly supported (≥ 98% bootstrap support, posterior probability = 1.0) and the alternative hypotheses were both rejected using the SH test (P < 0.001), although the relationships among the basal phyla (Nematoda and Platyhelminthes) could not be resolved.

Another potential bias is the specific taxa included in an analysis. For example, the original support for Ecdysozoa was obtained only with a particular genus of nematode, Trichinella, that had a short branch in the 18S rRNA tree [3]. To test this, phylogenies were constructed using different species of nematodes. The Coelomata hypothesis was significantly supported (≥ 97% bootstrap support, posterior probabilities = 1.0; Ecdysozoa rejected by SH test, P < 0.006; Hypothesis III rejected by SH test, P < 0.027) using either a genus in a different order, Brugia (18 proteins), or Trichinella (six proteins) (Fig. 2C, 2D). To further address the possibility that these results could be biased by taxon sampling, we included representatives from all available phyla for each protein. The results indicate that an increase in the number of taxa does not decrease single-protein support for Coelomata; in fact, the trend is the reverse (Fig. 2E). Simulation studies have shown that incomplete taxon sampling does not increase topological errors, and that most error is caused by limited sequence data [18].

In the initial study defining Ecdysozoa [3], rate variation was considered to be the major bias affecting the phylogenetic position of nematodes. In the 18S rRNA gene, nematodes typically have long branches indicating an increased rate of sequence change. Other nuclear genes also show this pattern, but to a lesser degree [8, 9]. Phylogenetic methods can accommodate moderate amounts of rate variation among lineages without producing an incorrect phylogeny [19]. However, if the rate of change is sufficiently large, longer branches in a phylogeny will sometimes attract one another [20]. If that happens, an ingroup species with a long branch may move to a more basal position in the tree. In analyses of the 18S rRNA gene, nematodes typically appear basal to arthropods + vertebrates. Because the use of a short branch nematode (Trichinella) resulted in a tree whereby nematodes clustered with arthropods, the basal position of nematodes in typical 18S analyses has been interpreted as long-branch attraction [3].

If nematodes cluster basally because of long-branch attraction, then the strongest support for Ecdysozoa should be obtained with the slowest evolving proteins. This was tested in an analysis of 36 nuclear proteins [8], but the results were equivocal. Therefore, we tested this suggestion with our four-taxon data set of 100 proteins, ordered by rate of evolution. Rate orders were determined in two ways: (i) nematode branch length and (ii) vertebrate-arthropod pairwise distance. The 100 proteins were grouped into concatenations of 10 proteins and 20 proteins to increase statistical resolution. The results show support for Coelomata at all rate orders, but the support is significant with the slowest evolving proteins, regardless of rate measure or number of proteins combined (Fig. 3). Concatenations of slow evolving proteins also show compositional homogeneity (pairwise disparity index test, P < 0.05) [21], suggesting the basal position of the nematode results from true phylogenetic signal and not compositional bias. Support for Coelomata was weakest with the fastest evolving proteins (which also showed compositional heterogeneity), indicating that Ecdysozoa, not Coelomata, may be the result of a rate bias, compositional bias, or other artifact.
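The rate-ordering step can be sketched in a few lines of Python. The sketch assumes that each protein has a single rate value (either the nematode branch length or the vertebrate-arthropod pairwise distance) and that groups are formed from consecutive proteins in the sorted order; the function name and the placeholder rates are hypothetical.

```python
def concatenation_groups(rates, group_size):
    """Order proteins from slowest to fastest and split them into groups.

    `rates` maps a protein name to one rate measure, for example the nematode
    branch length or the vertebrate-arthropod pairwise distance.
    """
    ordered = sorted(rates, key=rates.get)  # slowest evolving first
    return [ordered[i:i + group_size] for i in range(0, len(ordered), group_size)]


# Placeholder rates for 100 proteins (not real data).
rates = {f"protein_{i:03d}": 0.05 + 0.01 * i for i in range(100)}

groups_of_10 = concatenation_groups(rates, 10)  # ten concatenations
groups_of_20 = concatenation_groups(rates, 20)  # five concatenations
print(len(groups_of_10), len(groups_of_20))
```

Each group would then be concatenated and bootstrapped separately, so that support can be compared across the rate range.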

Figure 3

Effect of genetic distance on bootstrap support for the three hypotheses from analysis of 100 nuclear proteins with four taxa. (A, B) show bootstrap support for Coelomata; (C, D) for Ecdysozoa; (E, F) for Hypothesis III. Proteins were ordered from slowest evolving to fastest evolving based on two criteria: vertebrate-arthropod pairwise distance (diamonds) and nematode branch length (squares). Proteins were concatenated into ten groups of ten (A, C, E) and five groups of twenty (B, D, F). Graphs show rate from slowest to fastest evolving (left to right). Trend lines are indicated (solid for vertebrate-arthropod distance, dashed for nematode branch length).

To ensure that this result was not affected by mutational saturation in our data set, the mean number of variants per variable site was determined for each protein and averaged for ten groups of proteins ordered by evolutionary rate (Fig. 4A). As predicted, variable sites in the faster evolving proteins showed a higher number of variants than those in slow-evolving proteins. We also examined the minimum number of nucleotide changes required for sites where only the nematode sequence varied (Fig. 4B). Slow-evolving proteins showed a smaller number of nucleotide changes required to alter amino acid identity, while faster evolving proteins required more changes. Thus, the nematode sequences in the slow-evolving proteins do not appear to be mutationally saturated.
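One way to compute the first saturation measure is sketched below in Python. It assumes that a "variant" is a distinct amino acid observed at an alignment column and that a "variable site" is a column with more than one distinct amino acid; how gaps and ambiguous residues were handled in the original analysis is not stated, so skipping them here is an assumption, and the function name is hypothetical.

```python
def mean_variants_per_variable_site(alignment, ignore="-?X"):
    """Average number of distinct residues over the variable columns.

    `alignment` is a list of equal-length aligned protein sequences.
    Characters listed in `ignore` are skipped (an assumption of this sketch).
    """
    variant_counts = []
    for column in zip(*alignment):
        residues = {c for c in column if c not in ignore}
        if len(residues) > 1:  # variable site
            variant_counts.append(len(residues))
    if not variant_counts:
        return 0.0
    return sum(variant_counts) / len(variant_counts)


# Toy four-taxon alignment (not real data): columns 2, 3, and 4 are variable
# with 3, 2, and 2 distinct residues, respectively.
toy = ["MKTA", "MRTA", "MKSA", "MQTG"]
print(mean_variants_per_variable_site(toy))
```

In a four-taxon alignment the value for any variable site lies between 2 and 4, so a mean close to 2 indicates that most variable sites carry only a single alternative residue.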

Figure 4

Test of mutational saturation in the four-taxon data set. (A) The mean number of variants per variable site was averaged for ten groups of ten according to evolutionary rate (vertebrate-arthropod distance = diamonds, nematode branch length = squares). (B) The minimum number of nucleotide changes required for unique nematode variants was also averaged according to evolutionary rate. Trend lines are indicated (solid for vertebrate-arthropod distance, dashed for nematode branch length).

Finally, the effect of lineage-specific rate variation on support for Coelomata was tested with the use of relative rate tests. Presumably, the selective elimination of genes with long branches will increase statistical support for the correct topology. Individual proteins from the four-taxon data set were each subjected to two different relative rate tests [22, 23]. Proteins determined to be rate-constant at the typically applied stringency level (5% significance), and at two greater stringency levels (10% and 40% significance), were concatenated and bootstrap support was determined using neighbor-joining. The results of the two tests were similar. As stringency increased, 40–83% of proteins were rejected, and the relative nematode branch length (to the arthropod and vertebrate branches) dropped from 16% to 0%. However, in all cases, Coelomata remained highly significant (Fig. 5). Thus, the suggestion that a basal position of nematodes is the result of long-branch attraction [3] can be rejected.

Figure 5

Effect of rate constancy on bootstrap support for Coelomata in four-taxon analysis. Graphs show results before application of tests (left, 0-level) followed by increasing stringency (5, 10, 40% significance) of the chi-square test [22] (circles) and Z-test [23] (triangles); the 5% level is normally used. (A) number of proteins passing rate constancy at each cutoff level. (B) relative nematode branch length upon concatenation of all rate constant proteins at each level. (C) bootstrap support for Coelomata for each rate-constant concatenation.

The importance of knowing the branching order of these species is illustrated by the immediate and wide acceptance of the Ecdysozoa hypothesis and its use in tracing patterns of developmental evolution [5–7, 10]. However, in the initial analysis of 18S rRNA sequences [3], Ecdysozoa was statistically significant only when a paralinear distance method was used; three other methods did not yield significant bootstrap support. In that study, Ecdysozoa also was not significant, using any method, when the flatworm sequence was included [3]. Subsequent analyses of the 18S rRNA gene have been interpreted differently [24, 25], but none has yielded statistically significant results supporting Ecdysozoa. Moreover, the molting cuticles of arthropods (chitin) and nematodes (collagen) are not homologous [4]. The significance of other morphological characteristics bearing on the position of nematodes continues to be debated [26].

Besides the 18S rRNA evidence, other genetic evidence for the grouping of nematodes and arthropods has come from qualitative interpretations of Hox gene [10] and β-thymosin [12] evolution. In the case of Hox genes, support comes from a single posterior gene sequence (Y75B8A.1) of the nematode Caenorhabditis elegans argued to have greater amino acid similarity with posterior Hox genes of Drosophila and Priapulus [10]. Unfortunately, the Hox homeodomain is a short (60 amino acid) region with many sequence differences between these taxa. Definition of "sequence signatures" is qualitative and has not been tested statistically. In a subsequent study of nematode posterior Hox genes, other researchers were unable to determine if the simple nematode Hox cluster of six genes is an ancestral or a derived condition [13].

In the case of β-thymosin, a sequence signature also has been argued to support a grouping of Drosophila and Caenorhabditis [12]. However, it is a gene family known to have paralogs within animals, the position of introns differs between sequences from the two species, and only four other metazoan taxa were surveyed. In addition, knowing the presence or absence of a gene can be problematic without the complete genome sequence of an organism (in this case, genomes were known only in Drosophila and Caenorhabditis). Thus, although suggestive, it is too soon to judge the significance of this sequence signature. One difficulty with interpreting such qualitative evidence, including Hox gene orthology, is that almost any pattern can be found in nature if one looks. In other words, sequence signatures have not yet been surveyed systematically and objectively. In contrast, sequence evidence from randomly selected genes, analyzed phylogenetically, provides a more unbiased database amenable to statistical analysis.

Conclusions

Although it is possible that a basal position of nematodes is the result of some unknown and widespread bias not yet identified, a simpler explanation is that the grouping of nematodes with arthropods is an artifact that arose from the analysis of a single gene, 18S rRNA. The results presented here suggest caution in revising animal phylogeny from analyses of one or a few genes or sequence signatures. Although many other aspects of animal phylogeny remain unresolved, our results indicate that insects (arthropods) are genetically and evolutionarily closer to humans (vertebrates) than to nematodes.

Materials and methods

DNA sequences from Dugesia japonica [27] were used to search the public protein database (Entrez) for orthologous counterparts in Drosophila melanogaster, Caenorhabditis elegans, Homo sapiens, Arabidopsis thaliana, Saccharomyces cerevisiae, and Schizosaccharomyces pombe. When available, sequences from other major animal phyla (e.g., Mollusca, Echinodermata, Annelida) also were obtained. In addition, the database was searched for all proteins from two other nematodes, Brugia and Trichinella, with orthologs in Drosophila, Homo, and Arabidopsis, Saccharomyces, or Schizosaccharomyces. Arabidopsis was used as the primary outgroup (95 out of 100 proteins in the four-taxon analysis, 94 out of 100 proteins for the five-taxon analysis, all proteins in Brugia and Trichinella analyses); yeast was used as the outgroup when Arabidopsis sequences were unavailable or paralogous. This was because many more genes were available for rooting with the plant than with the fungus. All three kingdoms are about equidistant from each other in terms of branch lengths [9] and therefore a plant serves about equally well for rooting an animal phylogeny as does a fungus. Orthology was assessed using reciprocal BLAST searches of the public protein database; those sequences receiving high scores in each search were also analyzed phylogenetically to ensure orthology. Short (<100 amino acids) sequences were omitted.
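The reciprocity criterion behind such searches can be illustrated with a minimal Python sketch. It assumes that the search results have already been reduced to one top-scoring hit per query in each direction; the text states only that sequences received high scores in each search, so the "best hit" rule, the function name, and the identifiers below are all hypothetical. The sketch does not run BLAST and does not replace the phylogenetic check described above.

```python
def reciprocal_best_hits(a_to_b, b_to_a):
    """Return pairs (a, b) where a's top hit is b and b's top hit is a.

    `a_to_b` maps each query in species A to its top hit in species B;
    `b_to_a` is the reverse search. Both are assumed to be precomputed.
    """
    return {a: b for a, b in a_to_b.items() if b_to_a.get(b) == a}


# Hypothetical identifiers, for illustration only.
planarian_to_fly = {"Dj_001": "Dm_A", "Dj_002": "Dm_B", "Dj_003": "Dm_C"}
fly_to_planarian = {"Dm_A": "Dj_001", "Dm_B": "Dj_009", "Dm_C": "Dj_003"}

print(reciprocal_best_hits(planarian_to_fly, fly_to_planarian))
```

Pairs that fail the reciprocity check (here `Dj_002`) are candidates for paralogy and would be excluded before alignment.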

Sequences were aligned using Clustal X [28] and each alignment was visually inspected. Primary analyses of aligned protein data sets were conducted in MEGA2 [29]. Phylogenies were reconstructed using neighbor-joining [16] under a Poisson correction and a gamma distance (α = 2, or estimated from the data for combined analyses), with bootstrapping (2000 replications) for all analyses. Gamma parameters were estimated from the combined data using maximum likelihood under a Poisson correction [30] (4-taxon, α = 1.62; 5-taxon, α = 0.94, Brugia, α = 0.87; Trichinella, α = 0.66). In addition, phylogenetic analyses were conducted with maximum likelihood (JTT-F option) [31] and maximum parsimony (Max-Mini Branch & Bound option) [29] on combined data sets; in all cases they resulted in similar results (topology and significance) to the neighbor-joining analyses. Posterior probabilities of concatenated files were computed using Bayesian inference [32] (Jones model with gamma estimated from data; 10,000 generations; 4 chains with temp = 0.2). Shimodaira-Hasegawa tests [17] were performed in PAML [30] (JTT-F option, fixed gamma); p-values for each topology were recorded.
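For reference, the two amino acid distances named above have simple closed forms, sketched here in Python. The formulas are the standard Poisson correction, d = -ln(1 - p), and the gamma distance, d = α[(1 - p)^(-1/α) - 1], where p is the proportion of differing sites; that MEGA2 applied exactly these expressions, and the treatment of gapped positions, are assumptions of this sketch. The function names are hypothetical.

```python
import math


def p_distance(seq_a, seq_b, gap="-"):
    """Proportion of differing sites, skipping positions with a gap in either
    sequence (pairwise deletion is an assumption of this sketch)."""
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != gap and b != gap]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)


def poisson_distance(p):
    """Poisson-corrected distance: d = -ln(1 - p)."""
    return -math.log(1.0 - p)


def gamma_distance(p, alpha):
    """Gamma distance: d = alpha * ((1 - p) ** (-1 / alpha) - 1)."""
    return alpha * ((1.0 - p) ** (-1.0 / alpha) - 1.0)


p = p_distance("MKTAYIAKQR", "MRTAYLAKQK")  # toy sequences, 3 of 10 sites differ
print(p, poisson_distance(p), gamma_distance(p, alpha=2.0))
```

Smaller values of α (stronger rate variation among sites) give larger corrected distances for the same p, and the gamma distance approaches the Poisson-corrected distance as α becomes large.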

Rate constancy was assessed using a chi-square test [22] under increasing stringency (5, 10, 40% significance levels); p-values were recorded for each protein. A Z-test [23] was also used under increasing stringency; z-values were recorded for each protein. Proteins determined to be rate constant at different significance levels were concatenated and analyzed in MEGA2 [29]. Nematode position and evolutionary distance were determined for each concatenation. New sequences, accession numbers of sequences, and sequence alignments may be found at the Evogenomics website http://www.evogenomics.org/publications/data/nematode/.
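A three-sequence relative rate test of the chi-square type can be sketched as follows in Python. The sketch assumes the form in which sites unique to each ingroup sequence (relative to the other ingroup sequence and the outgroup) are counted as m1 and m2, and (m1 - m2)²/(m1 + m2) is compared with a chi-square distribution with one degree of freedom; whether this matches the test of [22] in every detail is an assumption, and the function names are hypothetical.

```python
import math


def relative_rate_chi_square(seq1, seq2, outgroup, gap="-"):
    """Return (m1, m2, chi2, p_value) for a three-sequence relative rate test.

    m1 counts sites where only seq1 differs; m2 counts sites where only seq2
    differs. Sites containing a gap are skipped (an assumption of this sketch).
    """
    m1 = m2 = 0
    for a, b, o in zip(seq1, seq2, outgroup):
        if gap in (a, b, o):
            continue
        if a != b and b == o:
            m1 += 1
        elif a != b and a == o:
            m2 += 1
    if m1 + m2 == 0:
        return m1, m2, 0.0, 1.0
    chi2 = (m1 - m2) ** 2 / (m1 + m2)
    # Upper tail of the chi-square distribution with one degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return m1, m2, chi2, p_value


def is_rate_constant(p_value, level):
    """Rate constancy is not rejected when the p-value exceeds the level.
    Higher levels (e.g. 0.40) are more stringent and reject more proteins."""
    return p_value > level


# Toy sequences (not real data): seq1 has four unique differences, seq2 none.
outgroup = "AAAAAAAAAAAA"
seq2 = "AAAAAAAAAAAA"
seq1 = "GGGGAAAAAAAA"
m1, m2, chi2, p_value = relative_rate_chi_square(seq1, seq2, outgroup)
print(m1, m2, chi2, round(p_value, 4))
```

Applying `is_rate_constant` at levels 0.05, 0.10, and 0.40 yields progressively smaller sets of proteins, which corresponds to the increasing stringency described above.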