Background

Cancer/testis (CT) genes are a growing family of genes defined by a unique pattern of expression: amongst normal tissues, they are expressed only in cells of the germ line and in embryonic trophoblasts, but their gene products are also found in a significant number of malignant cancers [1]. The first CT genes were discovered because of the immune responses that they elicit in some cancer patients, and can thus be classified as CT antigens [2, 3]; systematic exploration of publicly available gene expression profiles (as documented in EST libraries, SAGE and MPSS data, and microarray experiments) uncovered a significant number of additional CT genes [4, 5], against most of which immune responses have not yet been documented. Nevertheless, all CT genes are in principle attractive targets for cancer immunotherapy, because the gonads are immunoprivileged organs and anti-CT immune responses will therefore target tumours specifically. Vaccination using peptides derived from the NY-ESO-1 (CTAG1B) and MAGEA1 CT genes has already been proven to bring clinical benefits to melanoma patients [6, 7].

CT genes comprise more than 240 members from 70 families, and can be subdivided into two broad categories based on chromosomal localization. CT-X genes are located on the X chromosome, are mostly members of gene families organized into complex direct and inverted repeats, and are expressed primarily during the spermatogonial stage of spermatogenesis [8]. Non-X CT genes are located on autosomes, are mostly single-copy genes, and are expressed primarily during the meiotic and reduction division stages of spermatogenesis [8]. Careful annotation of the sequence of the human X chromosome has revealed that as many as 10% of all genes present on the chromosome are members of known CT families [9]; further analysis of the expression patterns of genes of unknown function located in repeated regions could even increase this estimate [5]. The biological functions of most CT-X genes have not been characterized in any detail. However, evidence is emerging that the best studied of these, the MAGE genes, can act as signal transducing transcriptional modulators. Moreover, MAGE genes appear to be able to mediate proliferative signals [10–12] and a member of the GAGE family has been shown to repress apoptosis [13], thus directly contributing to the malignant phenotype when aberrantly expressed in cancer. Available data suggest that many CT genes are involved in the re-programming of the transcriptional machinery that occurs during the transition from mitotic to meiotic division during spermatogenesis. It has been suggested that a similar re-programming may be responsible for some of the phenotype of malignant cancer cells [8, 14].

There is mounting evidence that the evolutionary history of the human X chromosome is significantly different from that of autosomes. It contains a disproportionate number of tandem and interspersed segmental duplications, both direct and inverted, containing genes with a testis-specific expression pattern including many CT-X genes [9]. These duplications are unstable in the genome, and subject to copy number polymorphisms, both within the human population and between humans and chimpanzees [15, 16]. While its overall DNA sequence has diverged significantly less than that of autosomes since the speciation of humans and chimpanzees [17], a significant proportion of protein-coding genes located on the X chromosome are under higher diversifying (positive) selection than those on autosomes [18]. Genes located on the X chromosome are also the most abundant source of functional retrogenes in the primate lineage, and constitute a reservoir of genetic material for the generation of new genes and functions in this lineage, again with a bias toward testis-specific functions [19, 20].

For all of these reasons, it is of interest to trace the evolutionary history of CT genes, and particularly of the CT-X subset, and to measure the selective pressures that act on them. Many of the human CT-X genes do not have easily identifiable orthologues in the mouse, rat or dog genomes, precluding such an analysis among Eutheria using currently available genome data. For example, it has been shown that the large MAGE family of CT-X genes has expanded independently in the primate and rodent lineages [21]. The recent availability of a draft genome for the chimpanzee has made it feasible to study the evolution of the CT genes within the primate lineage. We show here that the CT genes in general and the CT-X genes in particular are under strong diversifying pressure and amongst the fastest-evolving genes in the human genome.

Results

Identification of CT gene families in chimpanzee

To date at least seventy CT gene families, many with multiple members, have been identified in human. We took the opportunity afforded by the publication of the initial sequence of the chimpanzee genome [18] to ask whether CT genes were conserved in man's closest evolutionary neighbour. To this end we assembled a list of human transcript sequences representing all CT gene families, and searched for homologous sequences in the human and chimpanzee genomes. We expected that given the relatively short time elapsed since human-chimpanzee divergence (~ 6 million years ago [17]) the human sequences would be able to detect CT gene homologues in the chimpanzee genome. Moreover, since the majority of CT genes isolated thus far were detected and characterized using transcript information via cDNA cloning protocols, performing the same search in human allowed us to identify all CT genes present in the current assembly of the human genome. We implemented a two-stage approach in order to accurately define the structure of each CT gene locus. First, we used MegaBlast [22] to search for regions homologous to the CT transcript sequences. Then we applied the SIBsim4 cDNA-to-genome alignment program (an improved version of sim4 [23]) to these regions to establish a gene structure from a locus-specific spliced alignment (see Methods). As can be seen in Table 1, almost all human CT families are found in chimpanzee, and the chromosomal locations of the CT genes in chimpanzee correspond to those in human. In terms of copy number, the biggest family, PRAME, is well represented in chimpanzee (37 genes), as are MAGEA (9 genes), CTAGE (15 genes), XAGE (12 genes) and SSX (8 genes). The number of CT genes in each family is probably underestimated because of the relatively low sequence coverage in the current version of the chimpanzee genome assembly. This is especially true for the X chromosome, where the sequence coverage is only about 2-fold [18], and where most of the human multi-gene CT families are located. Nevertheless, the current data indicate that some chimpanzee CT families (FTHL17/CT38, TSPY/CT78 and PRAME) may contain more members than in human.

Table 1 Number and chromosomal location of CT genes in human and chimpanzee

In order to investigate more closely the relatedness of CT genes in these two species, we sought putative human and chimpanzee orthologues for as many CT genes as possible, based on nucleotide sequence identity to the cognate human transcript sequence. Ninety-eight orthologous CT pairs were defined in this way (see Methods and additional file 1). The average identity of the human and chimpanzee orthologues to the human transcript sequences was 99.6% and 97.8%, respectively. Since we were interested in the characteristics of CT genes as a group, we also defined a group of human-chimpanzee orthologous non-CT control genes from chromosome X, where most of the CT genes are located, and from autosomal chromosomes 18 and 19 (see Methods). The reasons for choosing a limited set of control genes were two-fold: first, this allowed us to generate manually curated alignments of the same quality as for the CT genes, and second, it provided test and control groups of similar sizes for statistical analysis. The average identity of the human and chimpanzee control orthologues to the human transcript sequences was 99.6% and 98.7%, respectively. The finding that the chimpanzee and human CT orthologues were on average less closely related than the control orthologues (97.8% versus 98.7%; p < 2.2e-16 by a chi-squared test) suggested a possible difference in the divergence rates between the CT group and the control group. We tested this by analysing the substitution rates between human and chimpanzee ORF sequences (see below). Given the high accuracy of the human genomic sequence, the finding that the average human identity was less than 100% for both CT genes and non-CT control genes presumably reflects polymorphisms and/or sequencing errors in the original transcript sequences.

CT genes on chromosome X are evolving faster than those on other chromosomes

We estimated the divergence rates of the CT genes from pairwise sequence alignments of the human and chimpanzee orthologues using phylogenetic analysis (PAML package [24]). Mutations in a protein-coding gene can either have no effect (synonymous changes) or alter the sequence of the encoded protein (non-synonymous changes). The rate of synonymous changes (dS) indicates the background mutation frequency, while the ratio of the non-synonymous to synonymous mutation rates (dN/dS) indicates the type of evolutionary pressure acting on the gene. A dN/dS ratio value less than 1 suggests negative or purifying selection, a ratio equal to 1 suggests neutral evolution, and a ratio greater than 1 suggests positive or diversifying selection [25]. To test what type of evolutionary pressure might be acting on the CT genes, we aligned the ORFs in the human-chimpanzee orthologue pairs and used the codeml program from the PAML package [24] to estimate the dN/dS ratios. Again, for comparison purposes, the control genes were subjected to an identical procedure. Figure 1 shows the distribution of dN/dS ratios for the CT genes and controls by chromosomal location. In contrast to the control genes, which show the distribution of ratios expected if most genes are under purifying selection, CT genes located on chromosome X have an excess of ratios greater than one. At the level of individual genes, SSX1, PAGE2B, SSX4, MAGEB2, GAGE4 and CPXCR1 have rate ratios greater than 2, indicative of strong evolutionary selective pressure acting on the gene products (Table 2). CT genes located on chromosomes other than chromosome X (CT-nonX) have a distribution of ratios skewed towards lower values, suggesting that this subgroup is evolving slower than the CT-X genes. In contrast, the majority of control genes, irrespective of chromosomal location, have rate ratios less than 0.5, suggestive of purifying selection. 
In addition, the nonsynonymous substitution rates for CT genes that had no synonymous changes between human and chimpanzee were on average higher than for the controls (see additional file 2).
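The interpretation of the dN/dS ratio described above can be sketched in a few lines of Python. This is only an illustration of the decision rule, not the procedure used in the study: the function name, the tolerance used to call a ratio "equal to 1", and the example rates are all hypothetical, and a real analysis would test whether a ratio differs significantly from 1 rather than compare point estimates.

```python
def classify_selection(dn, ds, tol=1e-9):
    """Return a coarse label for the selective regime implied by dN/dS.

    dn, ds: non-synonymous and synonymous substitution rates.
    tol: hypothetical tolerance for treating the ratio as equal to 1.
    """
    if ds == 0:
        # No synonymous changes: the ratio is undefined
        # (such genes are marked with an infinity sign in Table 2).
        return "undefined"
    omega = dn / ds
    if abs(omega - 1.0) <= tol:
        return "neutral"
    return "purifying" if omega < 1.0 else "diversifying"


# Hypothetical rates, chosen only to exercise each branch.
examples = {
    "gene_a": (0.002, 0.010),  # ratio 0.2
    "gene_b": (0.030, 0.012),  # ratio 2.5
    "gene_c": (0.004, 0.0),    # no synonymous changes
}
for name, (dn, ds) in examples.items():
    print(name, classify_selection(dn, ds))
```

Genes without synonymous changes are kept separate rather than forced into a category, which mirrors their exclusion from the ratio-based analysis.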

Table 2 Nucleotide substitution rates estimated from alignments of human and chimpanzee orthologous CT ORFs
Figure 1

Distribution of dN/dS ratios for CT genes and controls. The proportion of genes in each category with ratios in intervals A-I is shown. The categories are: CT-X, CT genes on chromosome X (N = 33); CT-nonX, CT genes not on chromosome X (N = 49); Control-X, control genes on chromosome X (N = 64); Control-nonX, control genes not on chromosome X (N = 71). The intervals are: 0 ≤ A ≤ 0.25; 0.25 < B ≤ 0.5; 0.5 < C ≤ 0.75; 0.75 < D ≤ 1.0; 1.0 < E ≤ 1.25; 1.25 < F ≤ 1.5; 1.5 < G ≤ 1.75; 1.75 < H ≤ 2; 2 < I ≤ 4. Genes which had no synonymous changes (dN/dS denoted '∞' in Table 2) were omitted from the analysis.
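The binning described in the legend can be reproduced with a short Python sketch. The interval boundaries are taken from the legend; the function names and the input values are hypothetical, and ratios above 4 are simply left unbinned here because the legend defines no interval for them.

```python
import bisect
import math
from collections import Counter

# Upper bounds of intervals A-I as given in the figure legend.
UPPER_BOUNDS = [0.25, 0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 2.0, 4.0]
LABELS = "ABCDEFGHI"


def interval_label(ratio):
    """Map a dN/dS ratio to its interval label, or None if out of range."""
    if ratio < 0 or ratio > UPPER_BOUNDS[-1]:
        return None
    # bisect_left places a value equal to an upper bound in the lower
    # interval, matching the "<=" on the right-hand side of each interval.
    return LABELS[bisect.bisect_left(UPPER_BOUNDS, ratio)]


def interval_proportions(ratios):
    """Proportion of genes per interval, omitting undefined ratios."""
    finite = [r for r in ratios if r is not None and math.isfinite(r)]
    labels = [interval_label(r) for r in finite]
    labels = [label for label in labels if label is not None]
    counts = Counter(labels)
    total = len(labels)
    return {
        label: (counts[label] / total if total else 0.0)
        for label in LABELS
    }


# Hypothetical ratios; math.inf stands for a gene with dS = 0.
print(interval_proportions([0.1, 0.3, 0.3, 2.5, math.inf]))
```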

The apparent difference between the dN/dS distributions for the CT genes and the controls was assessed for significance using a nonparametric Mann-Whitney test, which indicates whether the medians of the two populations are significantly different. The difference in dN/dS values between all CT genes and all controls is highly significant with a p-value of 1.128e-11 (Table 3). Moreover, the difference between CT genes and the controls is significant whether the CT genes are located on chromosome X (p = 4.686e-10) or not (p = 1.498e-05). The distribution of dN/dS values is also significantly different for CT genes on chromosome X compared to those elsewhere (p = 2.812e-05), suggesting that there is stronger selective pressure on CT genes located on chromosome X. In contrast, there is no significant difference in the distribution of dN/dS ratios between the control genes located on chromosome X or elsewhere (p = 0.4962). Previous work has shown that the protein-coding genes on the hominid X chromosome have a higher average dN/dS value than other chromosomes [18]. Our results suggest that the CT genes contribute strongly to this difference, and thus to the rapid evolution of protein-coding genes on the X chromosome.
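For readers unfamiliar with the test, the following Python sketch shows how a two-sided Mann-Whitney comparison of two sets of dN/dS ratios can be computed. It is a simplified illustration, not the implementation used for Table 3: it relies on the normal approximation with a tie correction and no continuity correction, so its p-values may differ slightly from those reported by statistical packages, and the sample values are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values from 1, giving tied values their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    tie_sizes = []
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # positions are 0-based, ranks 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        tie_sizes.append(j - i + 1)
        i = j + 1
    return ranks, tie_sizes


def mann_whitney(x, y):
    """Return (U for sample x, two-sided p) by normal approximation."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    ranks, tie_sizes = average_ranks(list(x) + list(y))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    tie_term = sum(t ** 3 - t for t in tie_sizes) / (n * (n - 1))
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term)
    if var_u == 0:
        return u1, 1.0  # all observations identical
    z = (u1 - mean_u) / math.sqrt(var_u)
    p = math.erfc(abs(z) / math.sqrt(2))
    return u1, p


# Hypothetical dN/dS ratios for two small groups of genes.
ct_like = [1.2, 0.9, 2.4, 1.6, 0.7]
control_like = [0.1, 0.3, 0.2, 0.5, 0.4]
print(mann_whitney(ct_like, control_like))
```

Because the test works on ranks, it makes no assumption about the shape of the dN/dS distribution, which is strongly skewed in these data.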

Table 3 Significance of the differences in the distributions of dN/dS ratios between CT and control ORFs

Discussion

Several recent publications have taken advantage of the chimpanzee draft genome to identify genes that are under diversifying selection in the primate lineage ([26] and references therein). Their conclusions were concordant, in that they identified the X chromosome as containing a high number of positively selected genes, they found that positively selected genes are predominantly testis-specific, and that their functions are linked to gametogenesis as well as sensory perception and immunity against invading pathogens. Because most of these studies were performed at the whole genome level, they tended to focus on genes for which orthologues could be easily identified and pairwise alignments of coding regions generated automatically. This may explain why they failed to identify CT genes as a dominant group of positively selected genes. A review of recently published literature confirms that only a limited number of CT genes have been recognised as undergoing positive selection (Table 4). Moreover, a large proportion were identified through investigation of individual CT gene families (SPANX [27] and PRAME [28]). In the present study, we have focused on the comparison between human and chimpanzee CT genes, with an emphasis on generating high-quality manually curated data. This was made necessary by the fact that many CT genes are located within segmental duplications and hence have multiple paralogues, and that we tried to be exhaustive in our analysis of all known CT gene families. Because of the large number of gaps that remain in the current assembly of the chimpanzee genome and the relatively high stringency we imposed on the extent of the alignments, we have certainly underestimated the number of CT homologues present in the chimpanzee genome, and some of the human:chimpanzee pairs may not correspond to true orthologues. However, neither of these problems should significantly affect the main conclusions of our study.

Table 4 Reports of positive selection pressure on CT genes

Given the close evolutionary kinship between humans and chimpanzees it is not surprising that all known CT gene families are shared between the two species. On the other hand, homologues of many CT antigens have not been found outside the primate lineage so far, and the available genome data are still too sparse to track the appearance of CT gene families during mammalian evolution. Even though the data are still incomplete, it is clear that most CT gene families are undergoing copy number expansions in the primate lineage, presumably driven by non-allelic homologous recombination between segmental duplications. The best-studied CT family in this respect is SPANX, which is present as a single-copy gene in rodents and has duplicated and acquired new sub-families in the primate lineage, including at least one (SPANX-C) found to be specific to humans on the basis of its genomic position [27]. SPANX genes have been shown to have copy number polymorphisms in the human population, potentially linked to susceptibility to prostate cancer, and to undergo very rapid evolution affecting both dN and dS [29]. An elegant study of the PRAME cluster on human chromosome 1 [28] revealed the recent expansion in the human lineage of these genes via two large segmental duplications, and subsequent smaller duplications that may be polymorphic in the human population. The large MAGE family of CT antigens, which also comprises genes that do not show a CT expression pattern, has expanded in both the primate and rodent lineages, but independently [21]. Our data also show that many MAGE genes are under diversifying selection (Table 2).

By definition, CT genes are expressed in testis, and for those for which data exist, expression has been shown to be restricted to cells involved in spermatogenesis. It is believed that many CT genes are also expressed during oogenesis, but data on this process are still very sparse [30, 31]. There is abundant evidence in the literature that many genes expressed predominantly during gametogenesis, as well as those implicated in reproduction in general (e.g. those encoding proteins found in the seminal fluid or expressed predominantly in the prostate) are undergoing positive selection during evolution [32–34]. In this respect, CT genes seem to behave much like other reproductive genes.

However, the CT-X genes are a special case, in that diversifying selective pressure seems more intense on this class. It is probable that the evolutionary pressures driving changes in the encoded protein sequences and those driving the expansion of the CT-X gene families are similar. Strikingly, the X chromosome is enriched in intrachromosomal tandem segmental duplications relative to autosomes [9]. Several hypotheses have been put forward to explain why a subset of genes located on the X chromosome is evolving faster than those on autosomes [34–36]. Our data do not shed new light on this subject. However, it is interesting to note that CT-X genes contribute very significantly to the high average positive selection observed in protein-encoding genes on this chromosome, against a genomic background that is much more highly conserved than on the autosomes [17]. One may speculate that transcriptional controls on recently duplicated genes could be relaxed relative to the parental copies, thereby allowing re-expression in tumours and the partial replication in these tumours of the transcriptional changes accompanying gametogenesis.

Conclusions

Essentially all human CT families have homologues at the same chromosomal locations in the chimpanzee genome. The copy numbers in the multi-gene CT families may differ between the two species, but this cannot be assessed reliably until a high-quality assembly of the chimpanzee genome is available. On average, CT genes are under stronger positive selection than a set of randomly selected control genes. CT-X genes as a group are evolving very rapidly, not only relative to control genes on the X chromosome or on autosomes, but also relative to autosomal CT genes.

Methods

CT genes and human/chimpanzee genomic sequences

Human Reference sequence (RefSeq [37]), or GenBank (where no RefSeq was available) entries were obtained for transcripts representing all documented CT gene families in the CT Gene Database [38]. Transcript sequences were also obtained for additional candidate CT genes described in recent publications, which have not yet been added to the CT Gene Database. In some cases, multiple alternatively spliced transcript sequences from the same gene were selected to maximize sequence representation of the locus. Although PRAME has not been designated a CT gene, due to its trace level of expression in some normal adult tissues other than testis, it does exhibit the other main characteristics of CT genes, i.e. strong expression in the testis and up-regulation in various tumours, and was included in the set of CT genes selected for this study. Non-CT control genes were randomly chosen from lists of genes having a RefSeq identifier on chromosomes X, 18 (low gene density) and 19 (high gene density), generated using BioMart [39, 40]. Control genes were selected from locations distributed uniformly along the lengths of the chromosomes to average out site-specific differences in mutation rates. The human (Homo sapiens) genomic sequence used was NCBI Build Number 36 (version 1, release date 9 March 2006), obtained from the NCBI. The chimpanzee (Pan troglodytes) genomic sequence used was NCBI Build Number 2 (version 1, release date 4 October 2006), also obtained from the NCBI.

Identification of CT gene loci in human and chimpanzee

CT gene loci were identified in both human and chimpanzee based on sequence identity between the human transcript sequences and human or chimpanzee genomic sequences. We used MegaBlast [22] to identify genomic regions homologous to the RefSeq sequences and SIBsim4 [41] (an improved version of sim4 [23]) to produce high quality spliced alignments at those sites, from which locus-specific transcript sequences were generated. A gene was considered complete if the alignment contained at least 80% of the cognate transcript length or 80% of the annotated open reading frame (ORF), and had at least 85% identity to the human transcript sequence. Putative orthologues were identified as the sequences in human and chimpanzee genomes having the highest identity (and satisfying the 80% length threshold) to the same human transcript sequence. In many cases the poor quality (gaps, incorrect assembly) of the published chimpanzee genome sequence prevented us from finding a chimpanzee orthologue to the human gene. High quality sequence alignments for putative human/chimpanzee orthologues were obtained for 98 of the initial list of 135 CT genes (73%) and 153 of the 180 control genes (85%) selected randomly from chromosomes 18, 19 and X.
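The completeness and orthologue-selection criteria above can be written down as a small filter. The Python sketch below is an illustration under stated assumptions, not the pipeline used in the study: the `Hit` record, its field names and the example values are hypothetical, coverage and identity are expressed as fractions, and the 85% identity floor is assumed to apply when choosing the best hit in each genome.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One spliced alignment of a human transcript to a genomic locus."""
    locus: str
    transcript_cov: float  # fraction of the transcript length aligned
    orf_cov: float         # fraction of the annotated ORF aligned
    identity: float        # fraction identity to the human transcript


def is_complete(hit, min_cov=0.80, min_identity=0.85):
    """At least 80% of the transcript or of the ORF, and 85% identity."""
    covered = hit.transcript_cov >= min_cov or hit.orf_cov >= min_cov
    return covered and hit.identity >= min_identity


def best_hit(hits):
    """Highest-identity complete hit, or None if no hit qualifies."""
    complete = [h for h in hits if is_complete(h)]
    return max(complete, key=lambda h: h.identity, default=None)


def putative_orthologue_pair(human_hits, chimp_hits):
    """Pair the best human and chimpanzee loci for the same transcript."""
    human, chimp = best_hit(human_hits), best_hit(chimp_hits)
    if human is None or chimp is None:
        return None
    return human, chimp


# Hypothetical alignments of a single transcript.
human_hits = [Hit("hs_locus_1", 1.00, 1.00, 0.996),
              Hit("hs_locus_2", 0.95, 1.00, 0.930)]
chimp_hits = [Hit("pt_locus_1", 0.70, 0.85, 0.978),
              Hit("pt_locus_2", 0.60, 0.60, 0.990)]
print(putative_orthologue_pair(human_hits, chimp_hits))
```

In the example the second chimpanzee locus has the higher identity but fails the length threshold, so the first locus is retained; this is the situation in which assembly gaps can change which locus is called the orthologue.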

Divergence of CT genes

The genome-based transcript sequences derived from human and chimpanzee for each putative orthologous pair were aligned using clustalw (version 1.81 [42]), with gap extension penalties set to zero to allow gaps in the alignment arising from sequences missing in the chimpanzee assembly. Both sequences in the alignment were then trimmed to the extent of the human ORF based on annotation in the RefSeq or GenBank entry. Each nucleotide alignment was manually curated and revised, if necessary, to reflect the corresponding protein alignment. ORFs containing stop codons were dropped from the analysis. Rates of synonymous (dS; also known as Ks) and non-synonymous (dN; also known as Ka) substitutions between aligned ORFs were estimated using the codeml program from the PAML package [24] with the F3x4 codon frequency model (and runmode = -2 in the codeml control file). Note that incomplete codons in either the human or the chimpanzee sequence are ignored by codeml. The statistical significance of differences in the distributions of human-chimpanzee divergence rates (dN/dS) between CT genes and controls was assessed using a Mann-Whitney test (Table 3) or a Welch two-sample t-test (additional file 3) in R [43].
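The stop-codon filter mentioned above can be illustrated with a brief Python sketch. It is a minimal example under assumptions that the text does not spell out: the standard genetic code is used, the check is applied to an ORF already trimmed to the human reading frame, a stop codon in the final position is treated as the normal terminator, and codons that are incomplete or contain alignment gaps are skipped. The function name and the test sequences are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code


def has_internal_stop(orf):
    """True if an in-frame stop codon occurs before the last codon.

    Trailing bases that do not fill a whole codon are ignored, and
    codons containing gap characters never match a stop codon.
    """
    seq = orf.upper()
    usable = len(seq) - len(seq) % 3
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    return any(codon in STOP_CODONS for codon in codons[:-1])


def keep_pair(human_orf, chimp_orf):
    """Keep an orthologue pair only if neither ORF is interrupted."""
    return not (has_internal_stop(human_orf) or has_internal_stop(chimp_orf))


# Hypothetical sequences: the second chimpanzee ORF carries a premature stop.
print(keep_pair("ATGAAACCCTAA", "ATGAAACCGTAA"))  # intact pair
print(keep_pair("ATGAAACCCTAA", "ATGTAACCGTAA"))  # interrupted chimpanzee ORF
```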