Background

Comparative genomics has revealed pronounced differences in gene content across species [1]. In an early analysis of eight microbial genomes, 20–56% of the genes in a genome were shown to not have high similarity to any sequence in public databases [2]. Initially these genes were referred to as orphan genes, or ORFans, because they correspond to stretches of open reading frame in bacterial genomes that have no known relationship to other sequences. As more eukaryote genome sequences become available, the term 'lineage-specific gene' is gaining in popularity because one can specify the 'lineage specificity' of a gene to describe its phylogenetic distribution [3].

Newly evolved genes may be important for adaptation and generation of diversity [4]. For example, the protozoan parasite Cryptosporidium parvum possesses a set of nucleotide salvage genes that are unique among all apicomplexans surveyed to date [5]. Acquisition of the nucleotide salvage pathway from a proteobacterial source as well as other sources apparently facilitated loss of genes involved in de novo pyrimidine biosynthesis, rendering this parasite entirely dependent on the host for both its purines and pyrimidines. Characterization of these lineage-specific genes not only leads to a better understanding of the parasite's biology but also provides a promising therapeutic target against an important parasite, since blocking the nucleotide salvage pathway can inhibit parasite growth but not harm its human host [5].

Currently, there are several hypotheses regarding the origin of lineage-specific genes. The first model invokes the process of horizontal gene transfer, in which organisms acquire genes from other distantly related species. This mechanism can create lineage-specific genes that are not shared by closely related organisms, as in the example of nucleotide salvage enzymes in C. parvum [5]. Previous studies have shown that horizontal gene transfer is an important force for genome evolution in bacteria [6–8], unicellular eukaryotes [9], and multicellular eukaryotes [10].

The second model is based on gene duplication followed by rapid sequence divergence [11, 12]. Based on the observation that the sequence divergence rate is positively correlated with lineage specificity in a diverse set of organisms [3, 11–14], Alba and Castresana [12] proposed that newly duplicated genes may be released from selective constraint and accumulate mutations at a faster rate. While most of the mutations may be deleterious and lead to loss of function in one copy [15], it is also possible that one of the copies can acquire new functions and become a novel gene in the genome. However, whether gene duplication followed by rapid divergence is truly an important mechanism of generating lineage-specific genes is still under debate. Elhaik et al. [16] suggested that the correlation between divergence rate and lineage specificity may simply be an artifact, stemming from our inability to identify homologs of fast-evolving genes across distantly related taxa based on sequence similarity searches. A recent simulation study by Alba and Castresana [17], however, demonstrated that sequence similarity searches performed at the amino acid level can reliably detect fast-evolving genes due to the rate heterogeneity among sites.

In addition to the two main models discussed above, other explanations for the origin of lineage-specific genes have been proposed, including de novo creation from non-coding sequences [18, 19], exon-shuffling [20, 21], intracellular gene transfer between organellar and nuclear genomes [9], and differential gene loss [22]. However, the relative importance of the various forces that generate lineage-specific genes remains largely unknown.

While erroneous annotation has also been proposed as one explanation for the abundance of lineage-specific genes [23, 24], expression data [25, 26] and nucleotide substitution patterns [24, 27] suggest that many lineage-specific genes are indeed functional and not annotation artifacts. Unfortunately, understanding the biological function of these genes is difficult due to the lack of homologs in model organisms to use for functional characterization. As a result, a large percentage of the lineage-specific genes that have been identified to date are annotated as hypothetical proteins of unknown function.

In this study, we aim to characterize the lineage-specific genes in a group of unicellular eukaryotes from the phylum Apicomplexa, including several important pathogens of humans and animals. The most infamous member of this phylum is the causative agent of malaria, Plasmodium, which causes more than one million human deaths per year globally [28]. Other important lineages include Cryptosporidium, which causes cryptosporidiosis in humans and animals [29, 30]; Theileria, which causes tropical theileriosis and East Coast fever in cattle [31, 32]; and Toxoplasma, which causes toxoplasmosis in immunocompromised patients and congenitally infected fetuses [33]. The availability of genome sequences from these apicomplexan species has provided us with new and exciting opportunities to study their genome evolution. Improved knowledge of the lineage-specific genes in these important parasites can lead to a better understanding of their adaptation history and possibly identification of novel therapeutic targets.

Results

Inference of the species tree

We based our comparative analyses on a phylogenetic framework in order to infer the lineage specificity of individual genes. Among the nine species included in the data set (seven apicomplexans as well as two outgroup ciliates), we identified 83 single-copy genes that contain at least 100 alignable amino acid sites to infer the species tree (see Methods for details; a list of these 83 genes is provided in Additional file 1). Based on the concatenated alignment of these 83 genes (with 24,494 aligned amino acid sites), we inferred a species tree with strong bootstrap support (Figure 1). This tree is consistent with our prior understanding of apicomplexan relationships based on morphology and development [34], rDNA analyses [35, 36], and multigene phylogenies [37, 38].

Figure 1

The apicomplexan species tree. Maximum likelihood tree generated from the concatenated alignment of 83 single-copy genes (24,494 aligned amino acid sites). Two free-living ciliates, Paramecium tetraurelia and Tetrahymena thermophila, are included as the outgroup to root the tree. Labels above branches indicate the level of clade support inferred by 100 bootstrap replicates.

Phylogenetic distribution of orthologous genes

Using the species tree (Figure 1) as the foundation, we characterized the phylogenetic distribution of orthologous gene clusters among the apicomplexan genomes analyzed (Figure 2, Table 1). The orthologous gene identification was performed using OrthoMCL [39] based on sequence similarity searches with an additional step of Markov Clustering [40] to improve sensitivity and specificity (see Methods for details). Our results indicated that many genes are genus-specific, ranging from approximately 30% of the genes in Plasmodium and Theileria up to about 45% in Cryptosporidium.

Figure 2

Phylogenetic distribution of orthologous gene clusters. The numbers after each species name abbreviation (see Table 1) indicate the total number of annotated protein coding genes in the genome. The numbers above a branch and preceded by a '+' sign indicate the number of orthologous gene clusters that are uniquely present in all daughter lineages; the numbers below a branch and preceded by a '-' sign indicate the number of orthologous gene clusters that are uniquely absent. For example, on the internal branch that leads to the two Plasmodium species, 1,645 gene clusters contain sequences from both Pf and Pv but not any other species present on the tree. Similarly, there are 22 gene clusters that contain sequences from all species except Pf and Pv. Note that a gene cluster may contain more than one sequence from a species if paralogs are present in the genome. The levels refer to the degree of lineage specificity; genes in level 1 are shared by all species on the tree and genes in level 6 are species-specific.

Table 1 List of species name abbreviations and data sources

We selected Plasmodium falciparum and Theileria annulata for further investigations of lineage-specific genes. The asymmetrical topology of the species tree allows categorization of the genes in these two species into six levels of lineage specificity (Figure 2), yielding the highest resolution in determining the lineage specificity of a gene. The least specific genes at level 1, denoted as Pf1 for those in the P. falciparum genome and Ta1 for those in the T. annulata genome, are shared by all nine species analyzed, including two free-living ciliates; the most specific genes at level 6, denoted as Pf6 for those in the P. falciparum genome and Ta6 for those in the T. annulata genome, are species-specific. Together these six sets of genes account for 77% of annotated P. falciparum proteins (4,141/5,411) and 84% of annotated T. annulata proteins (3,191/3,795). Genes that are shared by a non-monophyletic group (e.g., shared by P. falciparum and T. annulata but are not found in any other species) are omitted from the following analyses. Additionally, the two species pairs, P. falciparum-P. vivax and T. annulata-T. parva, may have comparable divergence times in the range of approximately 80–100 million years [41, 42] such that we can directly compare the properties of their species-specific genes. Finally, within the two focal genera, P. falciparum and T. annulata have a higher level of completeness of genome assembly than their sister species and thus are better choices for determining the chromosomal location of the lineage-specific genes.
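The level assignment described above can be sketched as follows. This is an illustrative sketch rather than the code used in this study: the function names, the placeholder taxon labels (O1 through O4), and the ladder in which each rung adds a single taxon are assumptions made for the example (in the real tree a rung may add a clade of several species); only the abbreviations Pf and Pv are taken from Figure 2.

```python
def build_ladder(focal, sister, outgroups):
    """Return nested clades ordered from level 1 (most inclusive) to the focal species alone.

    `outgroups` lists successively more distant taxa (hypothetical labels here).
    """
    clades = [{focal}, {focal, sister}]
    for taxon in outgroups:
        clades.append(clades[-1] | {taxon})
    return clades[::-1]


def lineage_level(cluster_species, nested_clades):
    """Return the 1-based level whose clade exactly matches the species in a cluster.

    Paralogs are collapsed because only the set of species matters.
    Returns None for non-monophyletic distributions, which are omitted from the analyses.
    """
    present = set(cluster_species)
    for level, clade in enumerate(nested_clades, start=1):
        if present == set(clade):
            return level
    return None


# Placeholder ladder with six levels; O1..O4 are NOT real species abbreviations.
LADDER = build_ladder("Pf", "Pv", ["O4", "O3", "O2", "O1"])

print(lineage_level(["Pf", "Pf"], LADDER))   # species-specific cluster with two paralogs
print(lineage_level(["Pf", "Pv"], LADDER))   # genus-specific cluster
print(lineage_level(["Pf", "O1"], LADDER))   # non-monophyletic distribution
```

Requiring an exact match between the species in a cluster and a clade is what excludes clusters with complex distributions, such as a gene shared only by P. falciparum and T. annulata.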

Sequence divergence

The two Plasmodium species, P. falciparum and P. vivax, differ greatly in their base composition. In the coding region, P. falciparum has a (G + C) content of 24% while P. vivax has a (G + C) content of 46%. Estimates of dN (the number of nonsynonymous substitutions per nonsynonymous site) and dS (the number of synonymous substitutions per synonymous site) are not reliable due to the extreme AT-bias in the P. falciparum genome. The average dS calculated from 4,159 P. falciparum-P. vivax sequence pairs is 45.7. For this reason, we quantified sequence divergence at the amino acid level based on the protein distance calculated by TREE-PUZZLE [43]. We found that the level of sequence divergence between sister taxa is positively correlated with the lineage specificity of a gene (Figure 3). The same trend is observed in both species-pairs. Compared to the two Plasmodium species, the Theileria species-pair has a lower level of sequence divergence. Level 6 genes are not included in the sequence divergence result because they are species-specific and have no orthologous sequence in the sister species for comparison.

Figure 3

Level of amino acid sequence divergence. The five categories on the X-axis refer to the level of lineage specificity defined in Figure 2. Level 6 genes are not included because they are species-specific and have no orthologous sequence for comparison. Error bars indicate standard errors.

We identified 1,701 genes that are single copy in both Theileria species and are reasonably conserved for substitution rate analysis at the nucleotide level (i.e., dS ≤ 1). Consistent with the sequence divergence measured at the amino acid level, nucleotide substitution rates are higher in genes with higher lineage specificity (Table 2). We do not find strong evidence of any gene under positive selection (i.e., dN/dS ratio > 1, data not shown).

Table 2 Nucleotide substitution rates in Theileria

(G + C) content and relative codon bias

The average (G + C) content at the third codon position (i.e., (G+C3)) increases with lineage specificity in P. falciparum (Figure 4), suggesting that phylogenetically conserved genes are biased toward AT-rich codons in this extremely AT-rich genome. In T. annulata, the opposite trend is observed; genes with high lineage specificity have a lower (G + C) content at the third codon position (Figure 4).
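For reference, the (G + C) content at the third codon position of a single coding sequence can be computed as in the minimal sketch below. The function name gc3 is hypothetical, and the sketch assumes an in-frame coding sequence; whether the stop codon was included and how ambiguous bases were treated in this study are not specified here, so those choices are assumptions.

```python
def gc3(cds):
    """Fraction of G or C among unambiguous third-codon-position bases.

    Assumes `cds` starts in frame; trailing bases that do not form a
    complete codon are ignored, as are ambiguous bases such as N.
    """
    seq = cds.upper()
    usable = len(seq) - len(seq) % 3          # drop an incomplete final codon
    third = seq[2:usable:3]                   # every third base, starting at position 3
    counted = [base for base in third if base in "ACGT"]
    if not counted:
        raise ValueError("no unambiguous third-position bases")
    return sum(base in "GC" for base in counted) / len(counted)


# Codons ATG AAA TTT GGC have third positions G, A, T, C.
print(gc3("ATGAAATTTGGC"))
```

Averaging this value over the genes in each level, together with its standard error, would yield a summary of the kind plotted in Figure 4.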

Figure 4

(G + C) content at the third codon position. The level of lineage specificity for each calculation is as defined in Figure 2. Error bars indicate standard errors.

We used the relative codon bias developed by Karlin et al. [44] to compare the differences in codon usage between different gene sets within each species (Table 3). In both P. falciparum and T. annulata, the level 6 (i.e., species-specific) genes exhibit a high level of deviation with regard to their codon preference compared to the other gene sets (see Methods for details). In P. falciparum, the average pairwise difference in all comparisons is 0.049 and the mean pairwise difference involving Pf6 genes is 0.102 (Table 3A). In T. annulata, the average pairwise difference in all comparisons is 0.098 and the mean pairwise difference involving Ta6 genes is 0.183 (Table 3B).
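To make the measure concrete, the sketch below implements one common formulation of the relative codon bias of a gene set F with respect to a gene set G: for each amino acid, the absolute differences between the within-family codon frequencies of F and G are summed and weighted by the frequency of that amino acid in F. This is our reading of the definition in [44] and should be treated as an assumption; the function names are hypothetical, the standard genetic code is assumed, and the exact procedure used here is described in Methods.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codon_counts(cds_list):
    """Count sense codons over a list of in-frame coding sequences."""
    counts = Counter()
    for cds in cds_list:
        seq = cds.upper()
        for i in range(0, len(seq) - len(seq) % 3, 3):
            codon = seq[i:i + 3]
            # Skip stop codons and codons containing ambiguous bases.
            if CODON_TABLE.get(codon, "*") != "*":
                counts[codon] += 1
    return counts


def relative_codon_bias(set_f, set_g):
    """B(F|G) = sum over amino acids a of p_a(F) * sum over codons of a of |f - g|."""
    cf, cg = codon_counts(set_f), codon_counts(set_g)
    total_f = sum(cf.values())
    bias = 0.0
    for aa in set(AMINO) - {"*"}:
        codons = [c for c, a in CODON_TABLE.items() if a == aa]
        n_f = sum(cf[c] for c in codons)
        n_g = sum(cg[c] for c in codons)
        if n_f == 0 or n_g == 0:
            continue  # amino acid absent from one set; skipped in this sketch
        p_a = n_f / total_f
        bias += p_a * sum(abs(cf[c] / n_f - cg[c] / n_g) for c in codons)
    return bias


# Toy input: two sets that use opposite lysine codons.
print(relative_codon_bias(["AAA"], ["AAG"]))
```

Because the weights come from the amino acid frequencies of F, this quantity is not symmetric in F and G in general.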

Table 3 Relative codon bias

Functional analyses based on annotation

As expected, most of the phylogenetically conserved genes have functional annotation or have at least one identifiable protein domain (Table 4). As the phylogenetic distribution of a gene becomes more restricted, it is more likely to be annotated as a hypothetical protein. Functional analysis based on available gene annotation indicates that most conserved genes (levels 1 and 2) are responsible for basic cellular processes (e.g., DNA replication, transcription, and translation), while most genus- and species-specific genes (levels 5 and 6) are hypothetical proteins of unknown function (see Additional files 2 and 3). Despite the poor annotation of genus- and species-specific genes, 87% of level 5 genes and 72% of level 6 genes in P. falciparum have expression data available based on oligonucleotide microarrays [26]. This result suggests that most of the hypothetical proteins are real genes and not annotation artifacts.

Table 4 Characteristics of lineage-specific genes in Plasmodium falciparum

The two focal lineages in our analysis, Plasmodium and Theileria, exhibit one interesting difference in terms of the phylogenetic distribution of surface antigens. We found that surface antigens are species-specific in Plasmodium and genus-specific in Theileria. All members of the three large surface antigen protein families in P. falciparum genome, including 161 rifin, 74 PfEMP1, and 35 stevor, are found in the Pf6 list and have no ortholog in P. vivax. Of the 163 T. annulata proteins that contain FAINT, a protein domain that associates with proteins exported to the host cell [31], 116 are in the Ta5 list (i.e., shared by T. annulata and T. parva) and only 28 are in the Ta6 list (i.e., specific to T. annulata).

In P. falciparum 41% of the genus-specific proteins and 62% of the species-specific proteins contain a putative signal peptide or at least one predicted transmembrane domain (Table 4), which suggests that these proteins may be exported to the host cell or present on the surface of the parasite or its vacuole. This result is consistent with the hypothesis that lineage-specific genes in apicomplexan parasites are likely to be involved in host-parasite interactions and thus, potentially adaptation.

Chromosomal location

Analysis of chromosomal location demonstrated that most species-specific genes in P. falciparum are located near chromosome ends (see Figure 5 for one example chromosome and Additional file 4 for all 14 chromosomes). In T. annulata (see Figure 6 for one example chromosome and Additional file 5 for all four chromosomes), we observed a similar pattern that the regions adjacent to chromosome ends are devoid of the phylogenetically conserved genes (cf. Figures 5B and 6B). However, unlike the pattern found in P. falciparum, most of the species-specific genes in T. annulata (i.e., Ta6) are distributed across the entire length of chromosomes and are not enriched in the regions adjacent to chromosome ends (cf. Figures 5A and 6A).

Figure 5

Chromosomal location of genes in Plasmodium falciparum. Chromosomal location of genes on P. falciparum chromosome 10. See Additional file 4 for views of all 14 chromosomes in this species. The level of lineage specificity is as defined in Figure 2. A. View of entire chromosome 10 (MAL10). B. Close-up view of the first 200 kb of chromosome 10.

Figure 6

Chromosomal location of genes in Theileria annulata. Chromosomal location of genes on T. annulata chromosome 2. See Additional file 5 for views of all four chromosomes in this species. The level of lineage specificity is as defined in Figure 2. A. View of entire chromosome 2. B. Close-up view of the first 200 kb of chromosome 2.

To quantify the pattern of gene distribution on chromosomes, we calculated the distance of each gene to the nearest chromosome end. For each set of genes (levels 1 through 6 in each species), we utilized (1) the average distance to the nearest chromosome end and (2) the minimal distance to the nearest chromosome end (i.e., the minimal found in a given gene set) for this analysis. In P. falciparum, the average distance scales with chromosome size and the species-specific genes (i.e., Pf6) are closer to chromosome ends (Figure 7A). In contrast, minimal distance does not scale with chromosome size (Figure 7B). For all chromosomes, the minimal distances of phylogenetically conserved genes from the chromosome ends (i.e., Pf1 through Pf4) are larger than 50–100 kb. This result indicates that the regions that are occupied exclusively by genus- and species-specific genes are proportionally larger in smaller chromosomes. Consistent with this observation, three of the smallest chromosomes in P. falciparum (i.e., MAL1, MAL2, and MAL4) have many more species-specific genes than random expectation (Chi-square test d.f. = (6 gene sets -1) * (14 chromosomes - 1) = 65, P-value = 1e-12).
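The two distance summaries and the degrees of freedom quoted above can be reproduced with the short sketch below. The function names are hypothetical, and the coordinate convention (1-based, inclusive gene boundaries, with distance measured from the gene boundary closest to an end) is an assumption, since the reference point used for each gene is not stated here.

```python
from statistics import mean


def distance_to_nearest_end(start, end, chrom_length):
    """Distance in bp from a gene to the nearer chromosome end.

    Assumes 1-based inclusive coordinates with start <= end <= chrom_length.
    """
    return min(start - 1, chrom_length - end)


def summarize_gene_set(genes, chrom_length):
    """Return (average, minimal) distance to the nearest end for a list of (start, end) genes."""
    distances = [distance_to_nearest_end(s, e, chrom_length) for s, e in genes]
    return mean(distances), min(distances)


def contingency_df(n_rows, n_cols):
    """Degrees of freedom of a chi-square test on an n_rows x n_cols table."""
    return (n_rows - 1) * (n_cols - 1)


# Toy gene set on a hypothetical 1,000 bp chromosome.
print(summarize_gene_set([(101, 200), (401, 500)], 1000))
# Six gene sets by 14 chromosomes, as in the test reported above.
print(contingency_df(6, 14))
```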

Figure 7

Average and minimal distance of mapped genes to chromosome end. The level of lineage specificity is as defined in Figure 2. A. Average distance to chromosome end in Plasmodium falciparum. B. Minimum distance to chromosome end in P. falciparum. C. Average distance to chromosome end in Theileria annulata. D. Minimum distance to chromosome end in T. annulata.

In T. annulata, genes with different levels of lineage specificity have similar average distances to chromosome ends (Figure 7C). This result corroborates the visual pattern in Figure 6A that species-specific genes are distributed across the entire length of a chromosome, in contrast to the clustering near chromosome ends observed in P. falciparum (Figure 5A). For all four chromosomes in T. annulata, the regions that are adjacent to chromosome ends and devoid of phylogenetically conserved genes (i.e., Ta1 through Ta4) are approximately 20–40 kb (Figure 7D), a distance smaller than in P. falciparum. Unlike the pattern found in P. falciparum in which species-specific genes are closer to chromosome ends than genus-specific genes, genus- and species-specific genes in T. annulata (i.e., Ta5 and Ta6) have similar minimal distances in all four chromosomes (Figure 7D).

In both P. falciparum and T. annulata, genes located near chromosome ends have a higher level of sequence divergence relative to their orthologs in the sister species at the amino acid level (Figure 8). This trend is observed in genes with different levels of lineage specificity and is stronger in T. annulata.

Figure 8

Amino acid sequence divergence and chromosomal location. Plot of amino acid sequence divergence as a function of the distance to the nearest chromosome end. A. Plasmodium falciparum. B. Theileria annulata. The black lines in both panels (i.e., Pf1-5 in panel A and Ta1-5 in panel B) refer to the combined results from genes with five different levels of lineage specificity and are included as the background reference. Error bars indicate standard errors.

Discussion

We identified a pattern in which lineage-specific genes have a higher level of sequence divergence among sister species in a group of important protozoan parasites. This result is consistent with previous studies in bacteria [13], fungi [3], and animals [11, 12, 14]. Now we further confirm that this pattern also holds true in a protistan phylum, suggesting that it may be universal across much of the tree-of-life. Results from functional analyses agree with our intuitive expectation that conserved genes are involved in basic cellular functionalities and are well annotated. A large number of the lineage-specific genes (at the species level in Plasmodium and the genus level in Theileria) are found to be putative surface antigens that the parasites use to interact with their hosts. This result supports the hypothesis that lineage-specific genes may be important in adaptation [4]. In addition, the physical distance of a gene to the nearest chromosome end is correlated with the level of sequence divergence.

We found three contrasting properties of lineage-specific genes between two major apicomplexan lineages. First, families of surface antigens are species-specific in Plasmodium but genus-specific in Theileria. Second, most of the species-specific genes are located in sub-telomeric regions in P. falciparum but no such pattern exists in T. annulata. Third, the (G + C) content at the third codon position increases with lineage specificity in P. falciparum but decreases in T. annulata. Taken together, these results suggest that the mechanisms of generating lineage-specific genes and their subsequent evolutionary fates differ between apicomplexan parasite lineages.

Gene content evolution

All apicomplexan species analyzed have small genomes compared to the free-living out-group. This result is consistent with comparative genomic analyses conducted in other pathogenic bacteria and eukaryotes; extreme genome reduction is a common theme in the genome evolution of these organisms [45].

A large proportion of the genes in apicomplexans are genus-specific (Figure 2). One parsimonious explanation for this observation is that each lineage acquired a new set of genes during its evolutionary history. An alternative explanation invokes differential loss among lineages when evolving from a free-living ancestor with a relatively large genome. We found that 23% of the protein coding genes in P. falciparum and 16% in T. annulata have a complex phylogenetic distribution pattern and do not fit into a simple single gain/loss model. These results suggest that some ancestral genes in the apicomplexans may have experienced multiple independent losses during their evolutionary history. Further investigation is necessary to distinguish true gene gains from differential retention of ancestral genes.

Comparison of genes with different levels of lineage specificity

Consistent with previous studies in bacteria [13], fungi [3], and animals [11, 12, 14], we observed a pattern in which sequence divergence is higher in genes with a higher level of lineage specificity. One explanation is that phylogenetically conserved genes are often involved in fundamental cellular processes (see Results). These genes are likely to be under purifying selection that constrains the rate of sequence divergence. In support of this hypothesis, we observe that the mean dN/dS ratio among the level 1 genes in Theileria is only 0.07 (Table 2), indicating an extremely low rate of nonsynonymous substitution relative to synonymous substitution.

Based on the hypothesis that lineage-specific genes are often involved in adaptation [4], such as invasion of hosts or evasion of the immune responses, lineage-specific genes may be under positive selection and have a faster rate of sequence divergence. Our data are suggestive in this regard, as genus-specific genes exhibit higher sequence divergence than genes with lower levels of lineage specificity. Unfortunately, we cannot directly test the hypothesis that lineage-specific genes are more likely to be under positive selection using the dN/dS ratio data, because the level of sequence divergence is too high in both species pairs for such analysis. Practically all of the genes from the Plasmodium pair and approximately 1,000 genes from the Theileria pair (i.e., more than a quarter of the gene repertoire) have a dS estimate that is larger than one. Under this high level of sequence divergence, we cannot confidently estimate the substitution rate due to saturation. Better detection of positive selection in these genes requires data on genetic variation at within- and between-species levels [46, 47].

Codon bias analyses indicate that species-specific genes have a different codon preference compared to other genes in the same genome, whereas the genes with lower levels of lineage specificity are relatively similar to each other (Table 3). It is possible that species-specific genes are relatively young and have yet to adapt to the codon usage pattern of the genome. Support for this hypothesis is provided by the observation that the (G + C) content at the third codon position is much lower in the phylogenetically conserved genes in P. falciparum (Figure 4), suggesting that these 'older' genes are more biased toward GC-poor codons in this AT-rich genome. Alternatively, some species-specific genes may be subject to a different pattern of selection and thus possess different codon preference.

For the lineage-specific genes at the genus and species level that have functional annotations, many are known surface antigens. Because surface antigens are used by the parasites to interact with their hosts [48], such as adhesion to the cell surface or evasion of the host immune response, this result supports the hypothesis that (at least some) lineage-specific genes are involved in host-parasite interactions and have facilitated lineage-specific adaptation. Interestingly, surface antigens are species-specific in Plasmodium, but are genus-specific in Theileria. In addition, 62% of P. falciparum-specific genes contain a putative signal peptide or at least one predicted transmembrane domain. This result is consistent with one previous study that compared P. falciparum with three other Plasmodium species that cause rodent malaria [49]. Of the 168 P. falciparum-specific genes identified in this previous study that are not located in sub-telomeric regions, 68% are predicted to be exported to the surface of the parasites or the infected host cells.

Comparison between Plasmodium and Theileria

Previous studies suggest that the two focal species pairs have similar divergence times. The two Plasmodium species diverged about 80–100 million years ago [41] and the two Theileria species diverged about 82 million years ago [42]. Our results indicate that sequence divergence is much higher between the two Plasmodium species (Figures 1 and 3). This may be caused by the difference in nucleotide composition, since P. falciparum has a GC content of 24% while P. vivax has a GC content of 46% in the coding region. Bias in nucleotide composition has been shown to change codon usage and amino acid composition [50]. Alternatively, it is also possible that the divergence time between T. annulata and T. parva was overestimated because it was based on a simplified assumption that the synonymous substitution rate in Theileria is similar to that in Plasmodium [42].

In both P. falciparum and T. annulata, the sub-telomeric regions contain exclusively genus- or species-specific genes. Interestingly, the physical size of these regions is not correlated with chromosome size. This observation indicates that these regions are proportionally larger in smaller chromosomes and helps explain the pattern that the three small chromosomes in P. falciparum have many more species-specific genes than predicted by random expectations (see Results). In addition, genes that are located near a chromosome end have a higher level of sequence divergence in both species, regardless of their lineage specificity (Figure 8). The high evolutionary rates in sub-telomeric regions are shared by many eukaryotic lineages; high rates of inter-chromosomal recombination, local duplication, and segmental rearrangement have been reported in organisms including humans [51], yeasts [52], and plants [53].

Given the high rates of evolution in sub-telomeric regions, it may be advantageous for pathogens to have their surface antigen genes located in these evolutionary hotspots to facilitate the generation of antigenic diversity. Consistent with this hypothesis, many micro-parasites have large gene families that encode surface antigens in sub-telomeric regions (reviewed in [54]). The best-studied example is the causative agent of African trypanosomiasis, Trypanosoma brucei. The vsg gene family in T. brucei encodes variant surface glycoproteins (VSG) that form a dense coat on the outside of the parasite. In the bloodstream stage, T. brucei sequentially expresses different members of the vsg gene family, one at a time, to generate antigenic variation [55]. The positioning of vsg genes in the genome is tightly linked to regulation of expression; the actively expressed vsg is duplicated into one of the bloodstream expression sites located in the sub-telomeric regions (reviewed in [56, 57]). This homologous recombination process which involves loci that are not positional alleles is hypothesized to be important in generating genetic diversity within the gene family [54]. Although the genes encoding surface antigens in P. falciparum are not known to be duplicated into specific expression sites as observed in T. brucei, the clustering of these genes in sub-telomeric regions can facilitate inter-chromosomal recombination that increases antigenic variation [58].

We found that most of the surface antigen genes in P. falciparum are located in sub-telomeric regions, as previously noted [28]. Several studies have established the importance of genome location in the generation and maintenance of antigenic variation in P. falciparum [58, 59]. The surface antigen PfEMP1 possessed by P. falciparum is exported to the cell surface of infected erythrocytes. PfEMP1 mediates adherence of infected erythrocytes to microvascular endothelial cells, which removes them from blood circulation and allows them to avoid spleen-dependent killing [60]. A study of genetic structuring suggested that the approximately 60 copies of var genes (which encode PfEMP1) in the P. falciparum genome can be divided into three functionally diverged groups with two in sub-telomeric regions and one close to the centers of chromosomes [59]. Furthermore, the recombination rate is found to be high among members in the same functional group but low for members belonging to different groups. This recombinational hierarchy may facilitate the generation of genetic diversity within a group and promote specialization between different groups. Experimental evidence suggests that the clustering of var genes in the sub-telomeric regions is important in the epigenetic regulation of gene expression in P. falciparum [61, 62].

Given the generality of association between surface antigen genes and sub-telomeric regions in micro-parasites, it is interesting to see that T. annulata appears to be an exception to this rule. This finding may provide an explanation for the difference in host range between the two apicomplexan lineages. Because a large percentage of surface antigen genes in Plasmodium are located in sub-telomeric regions, the generation of antigenic variation may be faster in Plasmodium than in Theileria. Our results indicate that gene families encoding surface antigens in Plasmodium are highly diverged between species within the genus, whereas the two Theileria species still share most of their surface antigens and the genes encoding them are distributed across the entire lengths of chromosomes. For this reason, Plasmodium may be able to adapt to new host species at a faster rate, resulting in its much wider host range compared to Theileria; Plasmodium spp. can infect mammals, birds, and reptiles, whereas Theileria spp. are limited to ruminants [34].

Conclusion

Our results agree with previous observations in other organisms that lineage-specific genes have a higher level of sequence divergence than phylogenetically conserved genes. In addition, the two major apicomplexan lineages may have different mechanisms for generating or retaining species-specific genes. Because many lineage-specific genes in these parasites are surface antigens that interact with the host, future investigations of genome evolution in these parasites may facilitate the identification of new therapeutic or vaccine targets. Future studies that focus on improving the functional annotation of parasite genomes and on collecting genetic variation data at different phylogenetic levels will be important for our understanding of parasite adaptation and natural selection.

Methods

Data source and orthologous gene identification

The data sources of the annotated proteins are listed in Table 1. Protein domain identification was performed with HMMPFAM [63] (version 20.0). Transmembrane domain prediction [28] and gene expression data [26] of annotated Plasmodium falciparum genes were downloaded from PlasmoDB [64] (Release 5.3).

Orthologous gene clusters were identified using OrthoMCL [39] (version 1.3, April 10, 2006) with default parameter settings. The ortholog identification process in OrthoMCL is largely based on the popular criterion of reciprocal best-hits but also involves an additional step of Markov Clustering [40] to improve sensitivity and specificity. We used WU-BLAST [65] (version 2.0) for the all-against-all BLASTP similarity search step with the e-value cutoff set to 1e-15.
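The reciprocal best-hit criterion mentioned above can be illustrated with a minimal Python sketch. It is not the OrthoMCL implementation and omits the Markov Clustering step entirely. The input layout (tuples of query, subject, and e-value), the `species_of` lookup, and the function name are hypothetical, and treating the cutoff as "keep hits with e-value at or below 1e-15" is an assumption.

```python
from typing import Dict, FrozenSet, Iterable, Set, Tuple

Hit = Tuple[str, str, float]  # (query gene, subject gene, e-value); hypothetical layout


def reciprocal_best_hits(
    hits: Iterable[Hit],
    species_of: Dict[str, str],
    evalue_cutoff: float = 1e-15,
) -> Set[FrozenSet[str]]:
    """Return gene pairs from different species that are each other's best hit."""
    # best[(query, subject species)] = (e-value, subject) for the lowest e-value seen
    best: Dict[Tuple[str, str], Tuple[float, str]] = {}
    for query, subject, evalue in hits:
        if evalue > evalue_cutoff:
            continue  # assumption: hits weaker than the cutoff are discarded
        if species_of[query] == species_of[subject]:
            continue  # within-species hits are ignored in this sketch
        key = (query, species_of[subject])
        if key not in best or evalue < best[key][0]:
            best[key] = (evalue, subject)

    pairs: Set[FrozenSet[str]] = set()
    for (query, _), (_, subject) in best.items():
        back = best.get((subject, species_of[query]))
        if back is not None and back[1] == query:
            pairs.add(frozenset((query, subject)))
    return pairs


# Toy data with made-up gene names.
species = {"pf1": "Pf", "pf2": "Pf", "pf3": "Pf", "ta1": "Ta", "ta2": "Ta"}
toy_hits = [
    ("pf1", "ta1", 1e-50),
    ("ta1", "pf1", 1e-48),
    ("pf2", "ta1", 1e-20),  # ta1's best hit is pf1, so this pair is not reciprocal
    ("pf3", "ta2", 1e-5),   # fails the e-value cutoff
]
rbh_pairs = reciprocal_best_hits(toy_hits, species)
```

In this toy example only the pf1/ta1 pair is reciprocal. OrthoMCL goes further by also handling within-species relationships and clustering the resulting similarity graph.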

Phylogenetic inference

Based on the orthologous gene clustering result, we identified genes that are shared by all nine species to infer the species tree. Orthologous gene clusters that contain more than one gene from any given species were removed to avoid the complications introduced by paralogous genes in phylogenetic inference. Of the 768 orthologous gene clusters that are shared by all nine species (Figure 2), 154 clusters were single-copy in all species. For each gene, CLUSTALW [66] (version 1.83) was used for multiple sequence alignment. We enabled the 'tossgaps' option to ignore gaps when constructing the guide tree and used the default settings for all other parameters. The alignments produced by CLUSTALW were filtered by GBLOCKS [67] (version 0.91b) to remove regions that contain gaps or are highly divergent. Individual genes that had fewer than 100 aligned amino acid sites (33/154) or contained identical sequences from different taxa (38/154) after GBLOCKS filtering were eliminated from further analysis. We concatenated the alignments from the remaining 83 genes (with a total of 24,494 aligned amino acid sites) and used PHYML [68] to infer the species tree by maximum likelihood. The substitution model was set to JTT [69], with the proportion of invariable sites and the gamma distribution parameter (eight substitution categories) estimated by PHYML, and we enabled the optimization options for tree topology, branch lengths, and rate parameters. To estimate the level of support for each internal branch, we performed 100 non-parametric bootstrap replicates.
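The two exclusion criteria and the concatenation step can be sketched in Python as follows. This is an illustration only, not the scripts used in the study. It assumes each filtered alignment is held as a dictionary mapping taxon name to aligned sequence and that every alignment contains the same taxa. The function names are hypothetical.

```python
from typing import Dict, List

Alignment = Dict[str, str]  # taxon name -> aligned protein sequence (assumed layout)


def keep_alignment(aln: Alignment, min_sites: int = 100) -> bool:
    """True if the alignment is long enough and no two taxa have identical sequences."""
    seqs = list(aln.values())
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences in an alignment must have the same length")
    if length < min_sites:
        return False  # fewer than min_sites aligned sites
    return len(set(seqs)) == len(seqs)  # reject if any two taxa share a sequence


def concatenate(alignments: List[Alignment], min_sites: int = 100) -> Alignment:
    """Concatenate, per taxon, all alignments that pass keep_alignment."""
    kept = [aln for aln in alignments if keep_alignment(aln, min_sites)]
    if not kept:
        return {}
    taxa = sorted(kept[0])
    return {taxon: "".join(aln[taxon] for aln in kept) for taxon in taxa}


# Toy example with a lowered threshold so the short sequences are usable.
toy = [
    {"A": "MKTL", "B": "MKSL"},  # kept
    {"A": "MK", "B": "ML"},      # too short for min_sites=3
    {"A": "MKTL", "B": "MKTL"},  # identical sequences in two taxa
]
supermatrix = concatenate(toy, min_sites=3)
```

With the study's threshold of 100 sites, the same logic would correspond to reducing the 154 single-copy clusters to the 83 genes that were concatenated.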

Quantification of sequence divergence

The nonsynonymous and synonymous substitution rates at the nucleotide level (i.e., dN and dS) were estimated using CODEML in the PAML package [70]. We performed pairwise sequence alignment at the amino acid level using CLUSTALW [66] with default parameters for all orthologous genes that are single copy in both Plasmodium species or both Theileria species. The protein alignments were converted into the corresponding nucleotide alignments using PAL2NAL [71] (version 12). All gap positions were removed from the alignments before the substitution rate estimation by CODEML. To avoid inaccurate rate estimates caused by saturation, we excluded sequence pairs with a synonymous substitution rate (dS) greater than one.
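The gap-removal and saturation filters can be sketched in Python as shown here. This is a minimal illustration, not the pipeline used in the study. It assumes that gaps are removed as whole codon columns so the reading frame is preserved, and that rate estimates are available as a dictionary of (dN, dS) tuples per gene. Both function names are hypothetical.

```python
from typing import Dict, Tuple


def strip_gapped_codons(seq_a: str, seq_b: str, gap: str = "-") -> Tuple[str, str]:
    """Drop every codon column in which either aligned sequence contains a gap."""
    if len(seq_a) != len(seq_b) or len(seq_a) % 3 != 0:
        raise ValueError("expected two codon alignments of equal length, a multiple of 3")
    kept_a, kept_b = [], []
    for i in range(0, len(seq_a), 3):
        codon_a, codon_b = seq_a[i:i + 3], seq_b[i:i + 3]
        if gap in codon_a or gap in codon_b:
            continue  # assumption: the whole codon column is discarded
        kept_a.append(codon_a)
        kept_b.append(codon_b)
    return "".join(kept_a), "".join(kept_b)


def drop_saturated(
    estimates: Dict[str, Tuple[float, float]], max_ds: float = 1.0
) -> Dict[str, Tuple[float, float]]:
    """Keep only genes whose synonymous rate (second tuple element) is at most max_ds."""
    return {gene: (dn, ds) for gene, (dn, ds) in estimates.items() if ds <= max_ds}


# Toy data with made-up gene names and rates.
clean_a, clean_b = strip_gapped_codons("ATG---AAA", "ATGCCCAAG")
usable = drop_saturated({"g1": (0.10, 0.50), "g2": (0.30, 1.70)})
```

Here the middle codon column is dropped because one sequence is gapped there, and the hypothetical gene g2 is excluded because its dS exceeds one.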

To quantify the level of sequence divergence at the amino acid level, we used TREE-PUZZLE [43] to calculate the protein distance between orthologs in sister species. The parameters were set to the JTT substitution model [69], mixed model of rate heterogeneity with one invariable and eight Gamma rate categories, and the exact and slow parameter estimation. Orthologous sequences were first aligned using CLUSTALW [66] followed by a filtering step using GBLOCKS [67] to remove gaps and highly divergent regions before the calculation of protein distance. Five sequences (PFA0650w, PFD0105c, PFL0060w, and PFD1140w from P. falciparum and TA18345 from T. annulata) that were not reliably aligned to their ortholog in the sister species were excluded from this analysis.

Calculation of relative codon bias

The relative codon bias between sets of genes in the two focal species, P. falciparum and T. annulata, was calculated based on the method developed by Karlin et al. [44]. Briefly, the method considers two sets of genes, one focal set and one reference set, and calculates the difference between the two sets in the relative frequencies of codons within each codon family (i.e., the codons that encode the same amino acid). The theoretical maximum of the difference between two sets of genes is 2.000, but the empirical values based on biological data generally range from 0.050 to 0.300 [44, 72, 73]. This measurement differs from the conventional codon adaptation index (CAI) developed by Sharp and Li [74], in which a set of highly expressed genes is always used as the reference set. We chose the relative codon bias to measure codon preference because it can provide better resolution under certain conditions. For example, two sets of weakly expressed genes may have similar values of codon adaptation index but still possess vastly different codon preferences.
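A minimal Python sketch of this kind of measure is given below. It follows one common way of writing the relative codon bias of a focal set F against a reference set C: for each amino acid, the absolute differences in within-family codon frequencies are summed and then weighted by the amino acid frequency in F. The details are assumptions rather than a reproduction of the calculation in [44]. In particular, codon counts are pooled across genes, the standard genetic code is used, stop codons are ignored, and amino acids absent from either set are skipped. The function names are hypothetical.

```python
from collections import Counter
from itertools import product
from typing import Iterable

# Standard genetic code, codons ordered with bases T, C, A, G (third base fastest).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codon_counts(cds_list: Iterable[str]) -> Counter:
    """Pool sense-codon counts over in-frame coding sequences."""
    counts: Counter = Counter()
    for cds in cds_list:
        for i in range(0, len(cds) - len(cds) % 3, 3):
            codon = cds[i:i + 3].upper()
            if CODON_TABLE.get(codon, "*") != "*":  # skip stops and ambiguous codons
                counts[codon] += 1
    return counts


def relative_codon_bias(focal: Iterable[str], reference: Iterable[str]) -> float:
    """Sum over amino acids of p_a(F) * sum over codons |f(codon) - c(codon)|."""
    f, c = codon_counts(focal), codon_counts(reference)
    total_f = sum(f.values())
    bias = 0.0
    for aa in set(CODON_TABLE.values()) - {"*"}:
        family = [codon for codon, a in CODON_TABLE.items() if a == aa]
        n_f = sum(f[codon] for codon in family)
        n_c = sum(c[codon] for codon in family)
        if n_f == 0 or n_c == 0:
            continue  # assumption: amino acids missing from either set are skipped
        diff = sum(abs(f[codon] / n_f - c[codon] / n_c) for codon in family)
        bias += (n_f / total_f) * diff
    return bias


# Two toy sets that use entirely different alanine codons reach the maximum of 2.
print(relative_codon_bias(["GCTGCTGCT"], ["GCCGCCGCC"]))  # 2.0
print(relative_codon_bias(["GCTGCC"], ["GCTGCC"]))        # 0.0
```

The first toy comparison shows where the theoretical maximum of 2 comes from: within a codon family the frequencies in each set sum to one, so two sets with no shared codons differ by 1 + 1.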

Visualization and quantification of chromosomal location

GBROWSE [75] was used for visualization of gene distribution on chromosomes. To quantify the pattern of chromosomal location, we calculated the distance of each gene to the nearest chromosome end. For example, the P. falciparum gene PF10_0023 on chromosome MAL10 (physical size 1,694,445 bp) starts at position 99,380 and ends at 100,362. Because it lies closer to the start of the chromosome, its distance to the nearest chromosome end was calculated as 99,380 - 1 = 99,379 bp. Gene PF10_0369 on the same chromosome starts at 1,493,991 and ends at 1,496,955. It lies closer to the other end, so its distance was calculated as 1,694,445 - 1,496,955 = 197,490 bp. The orientation of a gene (i.e., whether it is on the '+' strand or the '-' strand) is ignored for distance calculation.
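The distance calculation can be written as a short Python function. This is a minimal sketch that assumes 1-based, inclusive coordinates with start less than or equal to end. The function name is hypothetical.

```python
def distance_to_nearest_end(start: int, end: int, chrom_length: int) -> int:
    """Distance in bp from a gene to the nearest chromosome end (strand ignored)."""
    if not (1 <= start <= end <= chrom_length):
        raise ValueError("expected 1 <= start <= end <= chrom_length")
    left = start - 1            # distance from the first base to the gene start
    right = chrom_length - end  # distance from the gene end to the last base
    return min(left, right)


MAL10_LENGTH = 1_694_445
print(distance_to_nearest_end(99_380, 100_362, MAL10_LENGTH))       # 99379 (PF10_0023)
print(distance_to_nearest_end(1_493_991, 1_496_955, MAL10_LENGTH))  # 197490 (PF10_0369)
```

Both printed values match the worked examples for PF10_0023 and PF10_0369 on MAL10.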