Background

Humans and chimpanzees shared a common ancestor approximately 6–7 million years ago (MYA) [1]. Distinguishing characteristics, such as those relating to cognitive abilities, language, habitual upright gait, dentition, and susceptibility to malaria, are all assumed to be associated with genetic differences between these two species [2]. However, these phenotypic differences have been associated with specific human-chimpanzee sequence differences in fewer than a handful of cases. In humans, two coding changes in FOXP2 have been proposed to contribute to language acquisition [3], disruption in the MYH16 myosin heavy chain gene is proposed to have led to a reduction in masticatory muscles [4], and the pseudogenisation of a type I hair keratin has been associated with modifications in our hair keratin phenotype [5].

It is also unclear at which developmental stages, and in which tissues, such human-specific adaptations are first manifested. For example, the abnormal spindle-like microcephaly-associated (ASPM) gene has roles in mitosis, meiosis and cytokinesis, and is broadly expressed in many tissues. Yet it is a major determinant of cerebral cortical size [6] and has evolved adaptively in recent hominin evolution (reviewed in Ponting & Jackson (2005))[7]. As the signatures of recent adaptation are identified in the human genome it will be important to associate these DNA changes with molecular, cellular and physiological innovation.

With the sequencing of the human and chimpanzee genomes comes the possibility of discerning nucleotide changes that have been acquired adaptively and thus might be associated with physiological innovation [2]. Two factors, however, often confound such studies. First, the scarcity of substitutions (~1% [8, 9]) between human and chimpanzee orthologous coding sequence provides insufficient statistical power to distinguish adaptive from neutral substitutions. Second, the chimpanzee genome has been sequenced only to low coverage (4-fold statistical coverage for panTro1; see [10]). As a result, the chimpanzee genome sequence contains many gaps, sequence inaccuracies and assembly artefacts. Such problems are exacerbated in regions containing identical or almost identical tandem segments which pose particular problems for both sequencing and assembly. Juxtaposed and virtually identical sequences are frequently represented either by only single versions in genome assemblies, or are absent altogether, thereby giving rise to gaps in the assembly.

By contrast, the human genome assembly is virtually complete and is accurate to approximately one error every 10^5 bases [11]. The human sequence's high statistical coverage gives rise to an assembly that is a mosaic of contributions from multiple individuals and thus does not represent any single genome. This mosaicism is less important for single nucleotide polymorphisms (SNPs) than it is for larger-scale polymorphisms, such as copy number polymorphisms (CNPs). This is because most SNPs are selectively neutral, whereas the evidence suggests that CNPs are not [12].

Identifying sequence changes that distinguish human and chimpanzee physiology, development and behaviour is a challenge not only because of errors and polymorphisms in genome assemblies, but also because the very types of sequence differences that contribute most to these characteristics remain ill-determined. The near-identity of human and chimpanzee orthologous coding sequence led to an initial suggestion that gene expression, rather than coding sequence change, is the major contributor to our differences [13]. However, it has become clear that most of the variations between human and chimpanzee in non-coding sequence are not adaptive either [14]. Identifying adaptive substitutions, whether in coding or non-coding sequence, remains a considerable problem.

Our approach has been not to investigate single nucleotide substitutions as potential substrates of adaptation. Rather, we wish to consider larger sequence differences between human and chimpanzee genomes, namely genes which have duplicated in a lineage-specific manner in the past 6–7 MY since the last common ancestor of the two species.

To this end, we recently determined the number of synonymous (silent) mutations per synonymous site (K S ) between closely-related human genes and used this to predict the lineage-specificity of duplication events. We identified a relatively large fraction (5%) of human genes that have participated in duplication events since the last common ancestor with the rodents [11]. Gene pairs that together have accumulated few substitutions in synonymous sites (K S < 0.3) were suggested to be primate-specific. The vast majority of these paralogue pairs have accumulated even fewer silent substitutions (K S < 0.015), indicating that most human duplications occurred only in the past 3–4 million years, after the divergence of Homo and Pan lineages. It is not yet known whether these recent duplications in the mosaic human genome assembly are fixed in the human population, or instead represent CNPs, although the latter explanation now appears increasingly likely [15]. The functions of these recently-duplicated genes are not uniformly distributed. Genes involved in reproduction, chemosensation and host defence and immunity are over-represented [11]. 'Cancer Testis antigen' (CTA) genes, most of which are normally expressed in the testis but are also highly active in certain cancers [16], are another prominent category among the recently-duplicated human gene set. They are represented among a small number of gene families, including one whose founding member is PRAME ('Preferentially expressed antigen of melanoma'), a human gene that is expressed highly in a large proportion of tumours [17, 18]. In all cases, the physiological role of CTA genes in normal cells remains unclear, but their recent and extensive duplications are consistent with adaptive functions, such as chemosensation, immunity and reproduction [19]. Moreover, their specific expression in the testis and ovary argues for their involvement in the acquisition of innovative reproductive function during recent primate evolution.

CTA genes frequently have been duplicated on the human X chromosome [11, 20] which might indicate a male selective advantage in possessing these genes. A mouse PRAME-like X-linked gene is known to be expressed specifically in spermatogonia, and may perform roles in the early stages of spermatogenesis [21]. Other members of this family are clustered together on an autosome, mouse chromosome 4. Because mammalian sex chromosomes undergo inactivation in late stages of spermatogenesis, it is possible that X-linked PRAME genes may play a part early in spermatogenesis, whereas the cluster of autosomal PRAME genes functions either in later stages, or in other tissues. Indeed, one autosomal mouse PRAME-like gene is known to be expressed in both oocytes and early cleavage-stage embryos [22].

Here we describe the extraordinary recent evolution of autosomal PRAME-like gene clusters on human and chimpanzee chromosomes 1, and mouse chromosome 4. We use a molecular clock, calibrated using synonymous or intronic nucleotide substitutions, to infer the recent origin of many of these PRAME genes. This is corroborated independently by comparison with available chimpanzee genomic sequence. Our analyses confirm that these human genes have duplicated unusually rapidly within the last 3 MY, with concomitant and substantial sequence diversification resulting from adaptive evolution. We predict that the differences between human and chimpanzee PRAME genes contributed to the functional divergence along the hominin lineage.

Results

Evolutionary survey of 7 human CT-Antigen gene families

We investigated whether rapidly-duplicating members of seven CTA families have experienced rapid sequence diversification as a result of adaptive evolution. Using ENSEMBL gene predictions, we initially used codeml [23] to predict sites in their amino acid alignments that have been subject to positive selection. Only 2 of the 7 families, namely the PRAME genes and SSX-like genes, were predicted to contain positively-selected sites (posterior probabilities > 0.9 for each of three model pairs (see Methods)). Because of its large size and because of the large number (23) of positively-selected sites found in an initial analysis (using the NCBI34 genome assembly (data not shown)), we decided to perform a more comprehensive analysis of PRAME genes and pseudogenes in human, chimpanzee and mouse genomes; the human SSX-like family contains only 7 members, for which 6 positively-selected sites were predicted (data not shown).

Recent origin for the PRAME gene cluster

We then investigated whether PRAME genes, located between RefSeq genes DHRS3 and T1A-2 on HSA1, are present in orthologous locations in other vertebrates. Indeed, the mouse genome contains PRAME-like genes in its orthologous region [24], as does the rat genome. However, the orthologous regions of both dog and chicken genome assemblies possess no PRAME homologous genes, as determined by searches of these regions using TBlastn [25]. Moreover, this region of the dog genome assembly contains no clone gaps, and none of its fragment gaps is sufficiently large (≥ 2.5 kb) to accommodate any missing dog PRAME genes. As humans and rodents shared a more recent common ancestor than either humans and dogs, or humans and chickens, it thus appears likely that one or more PRAME genes were translocated into this genomic location after the divergence with the Laurasiatherian lineage containing extant carnivores (approximately 95 MYA), but before the rodent-primate split (approximately 85 MYA) [26].

Human, chimpanzee and mouse PRAME genes

Our comprehensive reprediction of human PRAME homologues from the 0.74 Mb region of HSA1 yielded a total of 22 PRAME genes and 10 pseudogenes (Figure 1). (These we number sequentially along the assembly, Homo_1, Homo_2, etc.) Each of these genes is approximately 3.0 kb long (average 3069 bases) and contains three exons (labelled A, B and C) and two introns (a and b), both with consensus (GT-AG) splice sites. The translated protein is approximately 474 amino acids in length, with the three exons having median lengths of 95, 193 and 186 amino acids.

Figure 1

Dot plot representation [63] of a 0.74 Mb region of human chromosome 1 (bases 12760000–13500000) annotated (below) according to the locations of PRAME genes (blue arrowheads) and pseudogenes (red arrowheads), approximately to scale. Gene or pseudogene orientation is indicated by arrowhead direction. PRAME gene or pseudogene numbers are provided beneath the arrowheads. Single short diagonals represent alignments of two PRAME genes or pseudogenes. Gaps in the assembly (bases 13015219–13065218 and 13302469–13352468) are indicated, on the axes, by thick black bars. Two recent segmental duplications (Homo_7–12 and 15–20, and Homo_19–25 and 26–32; see text) are highlighted in blue and pink, respectively. Regions identified by Sebat et al. [12] or by Iafrate et al. [36] as being copy number polymorphic are indicated by a yellow, or a black-and-yellow-striped, bar, respectively.

Initial predictions of chimpanzee PRAME genes, from placed and unplaced PTR1 sequence and from unmapped assembled sequence, yielded 17 candidate genes. Careful inspection, however, revealed these gene predictions to be of poor quality with many predicted genes spanning suspiciously large (>> 3 kb) genomic distances and exhibiting regions of poor sequence similarity. We believe this is a result of the sequence incompleteness and the low (4-fold) statistical coverage of the chimpanzee genome sequence in the assembly. We thus instead resorted to independently predicting each of the three chimpanzee PRAME exons. This resulted in 16 exon A, 24 exon B and 19 exon C predictions. Several of these predictions appear to be identical and could thus be redundant. Adjacent exons and introns were assembled to give 12 putative chimpanzee PRAME genes and pseudogenes (labelled Pan_1, Pan_2 etc.), all of whose introns were confirmed to contain consensus (GT-AG) splice sites. Of these predictions, only 3 appear to be full length, with 3 exons and 2 introns lacking gaps. 30 predictions contain a single exon only. 7 of the 12 sequences contain 3 stop codons and 12 frameshifts, and so might be pseudogenes. (At a suggested nucleotide substitution rate of 3 × 10^-4 and insertion/deletion error rates of approximately 2 × 10^-4 (Tarjei Mikkelsen, personal communication), we expect few if any of these disruptions to arise from sequencing errors (data not shown).)

We inferred relationships between chimpanzee exons and introns, and their human orthologous sequences, using phylogenetic trees (see below). This revealed both well assembled chimpanzee sequence, with consecutive exons and introns assigned to the same human orthologous gene, and poorly assembled sequences, manifested by short contigs, separated by gaps, in a disordered arrangement.

18 PRAME-like genes and 15 pseudogenes were predicted in the orthologous region of mouse chromosome 4. (These are numbered sequentially Mus_1, Mus_2 etc., in the same orientation as that used for the human and chimpanzee numbering scheme.) Of these, 5 (Mus_1, Mus_9, Mus_10, Mus_18 and Mus_30) have previously been investigated by Dade et al. [24], who describe them as having roles in oogenesis.

Local gene duplication

In order to visualise the chromosomal landscape of this region of HSA1, we compared its repeat-masked DNA sequence with itself using a dotplot representation (Figure 1). As befits tandemly-duplicated and highly similar sequence, a strong pattern of many diagonals was evident. Each short diagonal represents the DNA alignment of two PRAME genes or pseudogenes. The orientation of the diagonal indicates whether these two genes or pseudogenes are situated on the same, or else the opposite, strand. We observed two pairs of long diagonals (highlighted in colour in Figure 1) which represent two predicted events of segmental duplication (see below).

Human and mouse PRAME genes are monophyletic

We then were able to exploit these gene predictions from human, chimpanzee and mouse to infer the genes' evolutionary relationships and the sequential order of gene duplications. At this stage, we do not rule out that paralogous sequences have been subject to recent inter-locus gene conversion [27, 28] which may result in greater sequence similarity and, hence, an apparently more recent date of evolutionary divergence (see Discussion). Dendrograms were constructed from two types of quasi-neutral nucleotide substitution rates: K S values, either for single coding exons, or for complete coding sequence, and K I values, defined as the numbers of nucleotide substitutions per site within intronic sequence.
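As a rough illustration of a per-site substitution estimate such as K I, the sketch below computes a corrected distance between two aligned intron sequences. The text does not state which correction was applied, so the Jukes-Cantor model used here is an assumption, as are the function name and the toy sequences.

```python
import math


def jukes_cantor_distance(seq_a, seq_b):
    """Estimate substitutions per site between two aligned sequences.

    Assumption: the Jukes-Cantor model, chosen only for illustration.
    Alignment columns containing a gap or an ambiguous base are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    mismatches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in "ACGT" and b in "ACGT":
            compared += 1
            if a != b:
                mismatches += 1
    if compared == 0:
        raise ValueError("no comparable sites in the alignment")
    p = mismatches / compared  # observed proportion of differing sites
    if p >= 0.75:
        raise ValueError("sequences too divergent for this correction")
    # Correct the observed proportion for multiple hits at the same site.
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# Hypothetical 10-base alignment with one mismatch (p = 0.1).
toy_distance = jukes_cantor_distance("ACGTACGTAC", "ACGTACGTAA")
print(f"{toy_distance:.4f}")
```

At the very low divergences discussed for recent duplicates, the corrected value stays close to the raw proportion of mismatched sites; the correction matters more for older, more divergent pairs.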

A phylogenetic tree constructed from human and mouse PRAME gene K S values revealed that mouse sequences are monophyletic, as are human sequences (Figure 2). No pair of mouse and human PRAME genes thus possesses a simple 1:1 orthology relationship. This is a striking result since the vast majority (approximately 80%) of mouse genes possess a single human ortholog [29]. As predicted earlier [11], many human PRAME genes thus have arisen by duplication recently in the primate lineage. What was unexpected, however, is that all mouse, and similarly all human, PRAME sequences have arisen by duplication events that occurred since their last common ancestor, approximately 85 MYA.

Figure 2

Phylogenetic relationships of mouse and human full-length PRAME homologues, inferred using K S as a distance metric. Mouse PRAME homologues (blue lineages) are monophyletic, as are human PRAME homologues (red lineages).

Human PRAME genes have frequently and recently duplicated

Three further phylogenetic trees compared human and chimpanzee K S values from alignments of each of the three PRAME exons (Figure 3; Additional Information). Each of these trees indicates that PRAME gene sequence duplicated frequently in the terminal human branch (i.e. the lineage from the common ancestor of humans and chimpanzees to humans).

Figure 3

Phylogenetic relationships of exons A of human and chimpanzee PRAME homologues, inferred using K S as a distance metric. Phylogenetic relationships derived using alignments of exons B and C are available as Additional files 1 and 2. Homo_9 and Homo_18 are not shown, as each of these pseudogenes appears to lack exon A.

Importantly, many pairs of human PRAME genes, and their constituent exons (Figure 3) and introns (Figure 4), were found to exhibit low synonymous rates (Figure 5) that are more typical of duplications in the terminal human branch, than they are of duplications that occurred prior to the common ancestor of humans and chimpanzees, approximately 6–7 MYA [1]. Later in the manuscript we return to the issue of whether these recently-duplicated human genes are present or absent from the chimpanzee genome.

Figure 4

Phylogenetic relationships of introns a of human and chimpanzee PRAME homologues, inferred using K I as a distance metric, and a neighbour-joining tree. Percentage bootstrap support (1000 iterations) is shown on branches where the support was less than 50%. Phylogenetic relationships derived using an alignment of intron b are available as Additional file 3.

Figure 5

Scatter plot of the lowest neutral rate estimates (either K S calculated from exon, or K I for intron, alignments) for human PRAME genes and either their human paralogues (indicated in red) or their chimpanzee orthologues (indicated in black). Circles represent averages of intronic rate (K I ) estimates, whereas squares represent averages of exonic rate (K S ) estimates. The horizontal axis represents genomic location within a 0.74 Mb region of human chromosome 1 (see Figure 1). Two recent segmental duplications (Homo_7–12 and 15–20, and Homo_19–25 and 26–32; see text) are highlighted in blue and pink, respectively. The dark line represents the median K S value (3.58 × 10^-3) for human paralogues. The grey band identifies 25–75% of this median value (second and third quartiles). The blue line represents the median K S (0.011) for human-chimpanzee coding sequence [30, 31]. The exonic K S value for Homo_12 vs Homo_15 is not shown due to incongruencies in K S -derived phylogenetic trees (see text). Homo-Pan rate estimates are missing when the most closely related sequences that are available are relatively divergent (K I or K S > 0.1). These missing values are likely to reflect the incompleteness of the current chimpanzee genome assembly. Homo-Homo rate estimates are missing for 4 genes (Homo_1, 3, 5 and 13) which appear not to have duplicated recently (K I or K S > 0.1).

Assignment of human paralogues and chimpanzee orthologues

By testing for congruency among the three exon (K S ) and the two intron (K I ) trees (Figures 3 and 4; Additional Information Files 1, 2, 3) we were able to identify the closest human paralogue to each human PRAME gene. The set of assignments was found to be unambiguous and internally consistent, with one notable exception: Homo_15 and Homo_12 are almost identical in their first two exons but are divergent in their exons C. Upon closer inspection it appears that either genome assembly error has generated a chimaeric Homo_12 gene, or else its exon C and a portion of intron b have been subjected to inter-locus gene conversion with an as yet unknown PRAME homologue. Consequently, in subsequent evolutionary rate calculations, comparisons between exons C from Homo_12 and Homo_15 have been discarded. All human PRAME genes, with only 4 exceptions (Homo_1, Homo_3, Homo_5 and Homo_13), are little diverged (K S < 0.1 or K I < 0.1; Figure 5) from another human gene, and are thus part of a pair of sequence-similar paralogues which have apparently been generated by a recent gene duplication.

A similar protocol was adopted to identify chimpanzee orthologues of human PRAME exons and introns. For each human exon (or intron), we assigned as its orthologue the chimpanzee exon (or intron) with the lowest K S (or K I ) value from the tree, whilst checking to see that these values were approximately 0.011, the median K S value between chimpanzee and human orthologues [30, 31]. This process resulted in at least one orthology assignment to the exons or introns of all but 9 (Homo_4, Homo_6, Homo_10, Homo_14, Homo_17, Homo_21, Homo_25, Homo_28, and Homo_32) of the human PRAME homologues; these missing orthologues can be assumed to be present in the chimpanzee genome but absent from its current assembly. For each human orthologue, we then examined the chimpanzee genome assembly for contiguity of its assigned chimpanzee orthologous exons and introns. For example, Pan_1_A, Pan_2_B and Pan_3_C, which are the chimpanzee orthologues of the three exons of Homo_1, appear consecutively within the chimpanzee genome sequence, complete with intervening intronic sequence, and thus were assigned as a full length chimpanzee PRAME, Pan_1. Several chimpanzee orthologue exon pairs appeared not to be contiguous in the current assembly, which again indicates that considerable additional data and attention will be required to provide an accurate assembly of this region.
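The lowest-distance assignment described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the procedure actually used: the function name, the chimpanzee labels and the distance values are hypothetical, and the 0.1 cutoff is borrowed from the Figure 5 legend, where more divergent best matches are treated as missing.

```python
def assign_orthologue(distances, max_divergence=0.1):
    """Pick the least-diverged chimpanzee candidate for one human exon or intron.

    distances maps a chimpanzee sequence label to its K S (or K I) value
    against the human sequence. Returns (label, distance), or None when no
    candidate exists or the best candidate exceeds max_divergence
    (an assumed cutoff, see the lead-in).
    """
    if not distances:
        return None
    best_label = min(distances, key=distances.get)
    best_distance = distances[best_label]
    if best_distance > max_divergence:
        return None
    return best_label, best_distance


# Hypothetical distances between one human exon A and three chimpanzee exons.
candidates = {"Pan_x_A": 0.012, "Pan_y_A": 0.048, "Pan_z_A": 0.210}
print(assign_orthologue(candidates))

# A human exon whose closest chimpanzee match is too divergent gets no assignment.
print(assign_orthologue({"Pan_z_A": 0.210}))
```

A real assignment would also need the manual check described in the text, namely that the chosen value is close to the genome-wide human-chimpanzee median of about 0.011.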

Pseudogenes

Of 32 HSA1 PRAME homologues, 10 are predicted to be pseudogenes. A similar proportion of chimpanzee PRAME exons are disrupted by at least one stop codon: 19 (3 exon A, 7 exon B and 9 exon C) out of 59 chimpanzee predicted exons contain at least one such disruption. It is probable that some of these are merely sequencing or assembly errors arising from the low (4-fold) statistical coverage of the chimpanzee genome.

We can safely infer that at least three of these pseudogenes (Homo_9, Homo_13, and Homo_18) were present in the common ancestor to both human and chimpanzee simply because in each case the disruptions coincide between orthologues. Five human sequences (Homo_3, Homo_11, Homo_16, Homo_20 and Homo_27) appear to have become pseudogenes in the hominin lineage as a result of disruptions which are absent from their chimpanzee orthologues. Homo_20 and Homo_27, which differ by only two synonymous substitutions, acquired their disrupting mutation (a stop codon) only recently, since their divergence from the Homo_7 gene, within the last 1 MY (see below).

Dating segmental duplications in the human genome

The branching order of human genes, both from exon K S -based trees (Figure 3; Additional Information) and from intron K I -based trees (Figure 4; Additional Information), indicates that two large-scale duplication events occurred recently in a human ancestral genome. The most recent event appears to have been a single tandem duplication of 7 PRAME homologues to generate a pair of segments encompassing genes Homo_19–25 and genes Homo_26–32 (Figure 1; Figure 5).

We can estimate the age of this duplication using the neutral rate estimates as a molecular clock and calibrating this by the divergence time (6–7 MY) between the human and chimpanzee lineages (see Figure 3 and Additional Information). Previous large-scale studies have shown that the median K S value between human and chimpanzee orthologues is 0.011 [30, 31]. The mean K S value between the seven Homo_19–32 genes and their assigned orthologues in chimpanzee was found to be 0.00995. Divergence between these regions of HSA1 and PTR1 thus is typical of these genomes as a whole.

We expect, therefore, that pairs of human paralogues possessing K S values less than approximately 0.01 are likely to have arisen in the terminal human branch, within the past 6–7 MY, whereas human paralogues possessing K S values greater than 0.01 arose due to duplications that occurred prior to the divergence of chimpanzee and human lineages. We calculated that the seven least divergent pairs between Homo_19–25 and Homo_26–32 exhibit a mean divergence of 1.46 × 10^-3, which is nearly seven-fold lower than the chimpanzee-human divergence. This indicates an age for this duplication of approximately (1.46 × 10^-3 / 9.95 × 10^-3) × 6 ≈ 0.9 MY. A similar calculation using intronic nucleotide substitution rates K I predicts an age of 0.8 MY. As these predicted ages considerably postdate the split between chimpanzee and human lineages (6–7 MYA), the tandem duplication of Homo_19–25 and Homo_26–32 genes appears to have been a hominin-specific event.
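The molecular clock arithmetic in this paragraph can be reproduced with a short sketch. The function name is hypothetical; the K S values and the 6–7 MY calibration are those quoted in the text, and a constant neutral substitution rate is assumed throughout.

```python
def duplication_age_my(k_paralogues, k_orthologues, split_time_my):
    """Scale the species split time by the ratio of neutral divergences.

    k_paralogues: mean K S (or K I) between the duplicated human copies.
    k_orthologues: mean K S (or K I) between human and chimpanzee orthologues.
    split_time_my: calibration time for the species split, in millions of years.
    """
    if k_orthologues <= 0 or split_time_my <= 0:
        raise ValueError("calibration values must be positive")
    return k_paralogues / k_orthologues * split_time_my


# Homo_19-25 versus Homo_26-32, for both ends of the 6-7 MY calibration range.
ages = {}
for split in (6.0, 7.0):
    ages[split] = duplication_age_my(1.46e-3, 9.95e-3, split)
    print(f"split at {split:.0f} MYA -> duplication age of {ages[split]:.2f} MY")
```

With the 6 MY calibration the result is a little under 0.9 MY, in line with the estimate given above; the 7 MY calibration pushes it to roughly 1 MY, which still postdates the species split by a wide margin.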

This conclusion is reinforced by the high identity of genomic sequence between the two duplications. Genomic sequences encompassing Homo_19–25 (HSA1 bases 13132294–13272173) and Homo_26–32 (bases 13353079–13493033) PRAME genes are 99.82% identical (161 mismatched bases over ~140 kb). This 0.18% divergence is, again, almost seven-fold lower than 1.23%, the average divergence between human and chimpanzee sequence [8, 31, 32]. It is also approximately twice the average polymorphism rate (0.08%; [11, 33, 34]) between human individuals and in the human genome assembly. This most recent large-scale duplication of human PRAME genes thus appears to be recent, with respect to the human-chimpanzee divergence event, but ancient, compared with the appearance of most human polymorphisms, within approximately the last 0.10 MY [35].

A more ancient, but still apparently hominin-specific, large, segmental and inverted duplication is that of Homo_7–12 and Homo_15–20 PRAME genes (Figure 1; Figure 5). This duplication's average divergence (mean K I = 0.00275; mean K S = 0.00447) is 2.2–3.6-fold smaller than the expected divergence (≈ 0.010, see above) for human-chimpanzee comparisons, which corresponds to an estimated divergence time of between 1.7 and 2.7 MYA. These estimates again considerably postdate the chimpanzee-human divergence.

Copy number polymorphisms (CNPs)

The recent segmental duplications of human PRAME genes suggest that this region of HSA1 might contain CNPs within the human population. By querying the database of genomic variants [36] we determined that HSA1p36.21, which encompasses these PRAME genes, is one of only 11 polymorphic loci found in two large-scale CNP investigations [12, 36, 37] (see Figure 1). This implies that not only has this region undergone two large-scale duplications in ~ 3 MY, but that there have been additional, more recent, duplications which are not fixed in the human population and have not been captured in the human genome reference sequence.

Positive selection of PRAME genes

Gene duplication in a genome provides a substrate upon which selection may act. The preservation of duplicates without disruption to their open-reading frames over millennia is itself an indication that these duplicates confer a selective benefit to the host organism. More direct evidence of positive selection comes from the elevated values of the ratio of K A , the number of nonsynonymous (amino acid changing) substitutions per nonsynonymous site, to K S . After discarding closely-related sequences (K S < 0.02), the median K A /K S ratio between pairs of human PRAME genes is 0.73, and 19 gene pairs exhibit K A /K S ratios greater than 1, with a maximum value of 1.73 between Homo_6 and Homo_10. These values are considerably higher than the average ratio between human and rodent single gene orthologues (median K A /K S ~ 0.12) [29].
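The summary statistics in this paragraph (median K A /K S after discarding closely-related pairs, and the count of pairs with a ratio above 1) can be sketched as below. The function name and the toy values are hypothetical; only the K S < 0.02 exclusion rule comes from the text.

```python
from statistics import median


def summarise_ka_ks(pairs, min_ks=0.02):
    """Summarise K A / K S ratios over gene pairs.

    pairs: iterable of (K A, K S) tuples, one per gene pair.
    Pairs with K S below min_ks are discarded, since ratios built on very
    few synonymous substitutions are unreliable.
    Returns (median ratio, number of pairs with ratio > 1).
    """
    ratios = [ka / ks for ka, ks in pairs if ks >= min_ks]
    if not ratios:
        raise ValueError("no gene pairs pass the K S threshold")
    return median(ratios), sum(1 for ratio in ratios if ratio > 1)


# Hypothetical (K A, K S) values; the third pair is dropped because K S < 0.02.
toy_pairs = [(0.030, 0.040), (0.050, 0.030), (0.010, 0.005), (0.020, 0.050)]
median_ratio, pairs_above_one = summarise_ka_ks(toy_pairs)
print(f"median K A / K S = {median_ratio:.2f}; pairs above 1: {pairs_above_one}")
```

A pairwise ratio above 1 is only suggestive; the site-level codeml analysis that follows in the text is what separates positive selection from relaxed constraint.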

K A /K S values of approximately 1 might be due to positive selection of nonsynonymous nucleotide substitutions, or to reduced selective constraints due to the loss of PRAME genes' functions. In order to distinguish between these hypotheses, and to further investigate the evolution of these genes, we used codeml [38–40] to infer positive selection at single sites within multiple alignments of human or mouse PRAME genes.

Among human PRAME genes, a large number (30) of amino acid sites were identified as having been subject to positive selection. By mapping these sites to a homologous protein structure, that of porcine ribonuclease inhibitor, we observed that these sites aggregate to form a pronounced cluster on one exterior face (Figure 6). The majority of these sites would thus be available to participate in binding interactions. A similar analysis of mouse PRAME genes also demonstrated the impact of positive selection: 17 positively selected sites were identified, of which 4 coincide with such sites among human PRAME genes.

Figure 6

Structure of porcine ribonuclease inhibitor (PDB code 2BNH) with amino acid sites that are positively-selected among human and mouse PRAME proteins shown in red and blue, respectively, ((A) front view, (B) rear view).

Discussion

Our findings demonstrate an extraordinarily rapid expansion within this PRAME gene family that occurred independently in both primate and rodent lineages. Given the high conservation of gene order among chicken, dog, human and rodent genomes, we can date the origin of this cluster to between approximately 95 and 85 MYA [26]. This is because PRAME homologues are undetectable in the orthologous region of the chicken and dog genomes, but are present in syntenic portions of primate and rodent genomes. Thereafter, many episodes of gene duplication have occurred in both primate and rodent lineages.

In order to infer the most recent of these duplication events, we identified 13 pairs of human PRAME paralogues which appear to have arisen by duplication since the common ancestor with chimpanzees: their divergence is considerably less than both the expected and the observed divergence between orthologous human and chimpanzee sequence (Figure 1 and Figure 5). Using a molecular clock, and a palaeontological calibration of divergence between these two species of 6–7 MYA [1], we estimate that two large segmental duplications of PRAME genes occurred independently in the terminal human branch, within approximately the past 3 MY.

The low divergence of these human paralogues, compared with the divergence between chimpanzee and human orthologues, argues strongly that chimpanzee lacks single orthologues of many, if not all, of these human duplicated genes. Insufficient nucleotide substitutions at synonymous and intronic sites have accumulated to indicate that Homo_7–12 and Homo_15–32 genes were all present in single copies in the common ancestor of chimpanzee and human. Even when the chimpanzee genome is completed, we thus expect that chimpanzee single orthologues of these genes will not be identified.

In addition to these duplications which are apparent in the human genome assembly, it appears, from two independent studies [12, 36], that the region of human chromosome 1 (HSA1) containing these PRAME genes is copy number polymorphic. These human genes are present in different numbers among the human population, thus providing further evidence that human-specific duplications are a feature of this region. CNPs are thought not to be selectively neutral [12]. Their persistence in the human population suggests, rather, that at least a subset of CNPs may be adaptive.

In support of this hypothesis, we found that a large number (30) of codons in the HSA1 PRAME family have been subject to positive selection. 26 of these adaptive codons were confirmed using the "sitewise likelihood-ratio" (SLR) method [41] (data not shown). These sites are clustered onto one surface in a homology model of protein structure, thereby demarcating a likely surface-accessible functional site. Mouse PRAME genes also contain a large number (17) of positively selected sites, which cluster within a site equivalent to that for human PRAME genes (Figure 6).

Expansion of this PRAME gene family has occurred independently in both primate and rodent lineages. In each of these lineages, PRAME genes appear to have evolved by 'birth-and-death' processes, such as occurs for immunity genes [42]: genes both persist as duplications, and are lost by pseudogene creation. Sequence similarities between paralogues, however, could have arisen also from concerted evolution, as the result of homologous recombination, in particular, gene conversion and unequal crossing over [43]. Nevertheless, the recent origin of the PRAME progenitor gene just prior to the common ancestor of primates and rodents, and its rapid duplication thereafter, and the occurrence of CNPs in the human population, each indicates that the predominant process in this expansion has been gene duplication. Moreover, the congruency of dendrograms associated with separate exons or introns (Figures 3 and 4, and Additional files 1, 2, 3), and the tandem segmental duplications we have inferred (Figure 1; Figure 5), also argue against concerted evolution as a dominant evolutionary mechanism.

PRAME genes have arisen by rapid gene duplication and pseudogene creation, and their sequences have been subject to positive selection. Nevertheless, the adaptive advantages conferred by their duplication and sequence diversification remain unclear, as do the genes' functions in normal tissues. Their expression is often limited to testes and to a wide variety of tumours, which suggests that PRAME proteins might perform important mitotic roles in rapidly dividing cells. This is consistent with the observation that the product of a PRAME-like gene (oogenesin) in mouse accumulates in the nucleus only at the late one-cell and early two-cell stages of early embryos [22].

At least one of the mouse orthologues of these PRAME genes is expressed in spermatogonial cells [21]. Interestingly, it is known that a mutation in FGFR2 expressed in these cells confers a selective advantage, thereby leading to clonal expansion similar to that seen in tumours [44]. We suggest that similar nonsynonymous substitutions in PRAME genes might have conferred comparable benefits to these cells, and these have driven positive selection for both gene duplication and sequence change.

The evidence thus points mainly to darwinian selection in spermatogonia, and adaptive evolution of PRAME genes might therefore be thought not to have phenotypic consequences at larger anatomical scales. Nevertheless, a gene with a similar function and a similar evolutionary history to PRAME genes has been proposed to have contributed to anatomical adaptations during recent hominid evolution. The abnormal spindle-like microcephaly-associated (ASPM) gene, which has evolved rapidly in the great apes [45], has roles in both spermatogenesis and oogenesis in Drosophila [46, 47]. Mutation of ASPM in humans results in primary microcephaly, which is manifested by a greatly reduced brain size [6]. Further examination of human PRAME genes' functions should assist our understanding of the cellular and physiological consequences of their recent and rapid evolution.

Conclusion

Whatever the selective advantages conferred by the PRAME genes discussed here, it is apparent from their recent introduction to an ancestral chromosome of primates and rodents that they benefited only these lineages. Moreover, the HSA1 PRAME gene family has expanded further, in particular by two large segmental duplications in the past 3 MY and by further duplications that manifest themselves as copy number polymorphisms in the human population. These extremely rapid duplications, taken together with strong evidence for darwinian adaptation at approximately 30 sites among human PRAME genes, indicate that this family has experienced sustained episodes of positive selection during recent hominin history.

Methods

Survey of human CT-Antigen genes

We recently identified 41 human gene families that have experienced multiple gene duplications since the divergence of rodent and primate lineages [11]. Among these were 6 families, SSX-like, MAGE, GAGE, XAGE, SAGE and SPAN-X, all encoded on the X chromosome, which exhibit the tissue expression profiles of CT-antigens (CTAs). A seventh CTA family, recently duplicated in our lineage, comprises PRAME genes that are located on Homo sapiens chromosome 1 (HSA1). In an initial survey using codeml and the methods and criteria described below, only 2 of these 7 families yielded evidence of positive selection at individual sites (data not shown). Subsequently, we chose to perform more comprehensive analyses of the PRAME gene family because of its larger size (13 human Ensembl-predicted genes) and the high number of positively selected sites identified.

Prediction of human PRAME genes and exons

The amino acid sequences of 5 PRAME homologues were aligned using CLUSTALW [48]. These are genes that were identified by Ensembl [11, 49] and lie in a cluster between bases 12550000 and 13100000 of human chromosome 1 (assembly NCBI35). (Many of the remaining 8 Ensembl PRAME genes were mispredicted.) From this multiple alignment a hidden Markov model (HMM) was constructed [50]. On the basis of strong conservation of a translation-initiating methionine codon, presumed non-coding sequence upstream of this codon was discarded.

In order to ensure gene prediction fidelity and completeness, we repredicted PRAME homologous genes and pseudogenes from this region of HSA1 (bases 12550000–13100000, a region that includes the flanking non-homologous RefSeq genes DHRS3 and T1A-2). Gene prediction employed Genewise [51], the PRAME HMM and default parameters. Upon building phylogenetic trees, the predicted PRAME homologues were found to be monophyletic, and are only distantly related to other PRAME homologous genes located elsewhere on HSA1 and on HSA22 (data not shown). Pseudogenes were distinguished from genes on the basis of premature stop codons or frameshifts; pseudogenes that are functionally disrupted only by mutations outside of coding sequence are thus misassigned as genes.
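The pseudogene criterion described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the function name `classify_prediction` is hypothetical, the input is assumed to be an ungapped coding sequence beginning at the initiating ATG, and a length that is not a multiple of three is used here as a simple stand-in for a frameshift.

```python
# Illustrative sketch only; names and the frameshift proxy are assumptions.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def classify_prediction(cds: str) -> str:
    """Return 'pseudogene' or 'gene' for an ungapped coding sequence.

    Assumption: `cds` starts at the initiating ATG and is read in frame.
    A length that is not a multiple of three is treated as a frameshift.
    """
    seq = cds.upper()
    if len(seq) % 3 != 0:
        return "pseudogene"  # frameshift proxy
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    # Any stop codon before the final codon is premature.
    if any(codon in STOP_CODONS for codon in codons[:-1]):
        return "pseudogene"
    return "gene"
```

As the text notes, a test of this kind cannot detect genes disabled by mutations outside coding sequence, so such loci would still be labelled 'gene'.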

In an independent approach, we also predicted homologues of each of the three protein-coding exons of these PRAME genes within this region of HSA1 using HMMs of multiply aligned nucleotide exonic sequence. This procedure resulted in no predictions that were additional to those found using Genewise and full-length gene sequence.

Prediction of mouse PRAME genes

Mouse PRAME genes were predicted as described above for human genes, except for the use of known mouse 'PRAME-like' (PRAMEL) genes to derive the HMM used in the Genewise step. Initially a CLUSTALW alignment of three known mouse RefSeq genes (PRAMEL1 [RefSeq code: NM_031377.1], PRAMEL3 [NM_031390.1] and PRAMEL4 [NM_178248.2]) was used for the HMM query template. The 3' ends of the PRAMEL genes, however, were found to be relatively divergent. The alignment was therefore trimmed back to exclude the ends of the third, and final, exons. As a result, the alignment of mouse PRAME proteins is 58 amino acids shorter than that of human PRAME proteins. Mouse genes were predicted within the orthologous region of the mouse genome (Mus musculus chromosome 4 [MMU4]; May 2004 assembly; bases 141,850,000–142,800,000) between mouse DHRS3 and T1A-2. In total, 29 mouse PRAMEL genes and pseudogenes were predicted by this approach. An HMM was then derived from a multiple alignment of these sequences and used to query this region of MMU4 in a second round of searches. Four additional predictions were found.

Prediction of chimpanzee PRAME genes and exons

Chimpanzee (Pan troglodytes) PRAME genes were predicted as described above for human genes, using the HMM derived from the alignment of human PRAME amino acid sequences as the query to search the region lying between the orthologous DHRS3 and T1A-2 genes on chimpanzee chromosome 1 (PTR1, bases 10240000–13450000, Nov 2003 assembly). This method identified 17 full-length chimpanzee PRAME genes. This gene count was substantially lower than that for the human genome assembly, and may be a consequence of the low (four-fold) statistical coverage of the chimpanzee genome assembly. We reasoned, therefore, that additional non-full-length PRAME gene exons might be represented in the chimpanzee genome assembly. We thus predicted homologues of each of the three PRAME gene exons in this region using the protocol (described above) that was used for predicting human PRAME gene exons.

Prediction of introns

Human and chimpanzee intron sequences were identified as the sequence intervening between adjacent exons. PRAME genes contain 3 coding exons (labelled A, B, C) and 2 intervening introns (labelled a and b). For human sequence, these introns were all complete, without gaps, but for chimpanzee sequence only 9 intact copies of intron a and 5 of intron b could be identified.
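Deriving introns as the intervals between adjacent exons can be sketched as follows. This is an illustration rather than the code used in the study: the function name `introns_between` is hypothetical, and exons are assumed to be given as 1-based, inclusive (start, end) pairs on a single strand of one gene.

```python
def introns_between(exons):
    """Return (start, end) intervals lying between adjacent exons.

    Assumption: `exons` holds 1-based, inclusive (start, end) pairs for one
    gene on one strand. Abutting exons yield no intron.
    """
    ordered = sorted(exons)
    introns = []
    for (_, left_end), (right_start, _) in zip(ordered, ordered[1:]):
        if right_start > left_end + 1:
            introns.append((left_end + 1, right_start - 1))
    return introns
```

For a three-exon gene (A, B, C) this returns two intervals, corresponding to introns a and b. An intron would count as intact only if the underlying assembly has no gap within the interval, which this sketch does not check.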

Exon, intron and splice site predictions from these three species were all consistent with the gene structures apparent from available cDNAs, in particular 14 cDNAs mapped to human HSA1 bases 12769277–1349033, 29 cDNAs mapped to mouse MMU4 bases 141852757–142731257, and cDNAs from the eponymous PRAME gene on HSA22 (bases 21,215,046–21,218,065). Conservation of gene structures between mammals as diverse as human and mouse, together with conservation of splice sites, indicates that chimpanzee PRAME genes also possess an identical gene structure.

Sequence alignments

Conceptual translations of PRAME genes and pseudogenes were aligned using CLUSTALW [48] and then modified to minimise gaps (see Additional files 1, 2, 3). Stop characters were replaced by 'X'. Estimates of KA and KS for sequence pairs (see below) were calculated from cDNA sequences aligned according to these amino acid multiple alignments.
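Aligning cDNA according to an amino acid alignment amounts to threading codons onto each gapped protein row. A minimal sketch of that step is given below. The function name `thread_codons` is hypothetical, and the cDNA is assumed to contain exactly one codon per aligned residue, including residues written as 'X' in place of stops.

```python
def thread_codons(aligned_protein: str, cdna: str) -> str:
    """Back-translate one gapped protein row into a gapped codon row.

    Each non-gap residue consumes the next codon of `cdna`; each gap ('-')
    becomes '---'. A stop written as 'X' still consumes its codon.
    """
    residues = aligned_protein.replace("-", "")
    if len(cdna) != 3 * len(residues):
        raise ValueError("cDNA length must be three times the residue count")
    out = []
    pos = 0
    for char in aligned_protein:
        if char == "-":
            out.append("---")
        else:
            out.append(cdna[pos:pos + 3])
            pos += 3
    return "".join(out)
```

Applying the function to every row of the protein alignment gives a codon alignment in which gaps fall only between codons, which is the form that pairwise KA and KS estimation requires.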

Human and chimpanzee nucleotide intronic sequences were aligned using DIALIGN-2 [52].

Exonic evolutionary rates

Codeml [23] was used to conduct site-specific KA/KS analysis on the human and mouse full-length PRAME predictions. An amino acid alignment and corresponding cDNA alignment were prepared for each analysis. Identified pseudogenes were removed from the alignment because they are likely to be no longer subject to selective constraints.

The maximum likelihood approach of Yang [40] was used to predict sites in a group of cDNA sequences that have been subject to positive selection. Pairs of models were compared by calculating log likelihood values (l), which were then tested for significant differences using a likelihood ratio test. The first of each pair is a simple model in which sites are associated with KA/KS ratios between 0 and 1. The second is a more complex model that allows adaptive sites, for which ratios can be greater than 1. If the complex model indicates an estimated KA/KS ratio greater than one, and the test statistic (2Δl) exceeds the critical value of the chi-square (χ²) distribution with the appropriate degrees of freedom [53], then positive selection can be inferred. Bayesian posterior probabilities are used to predict which codons in the original data have most likely been subjected to positive selection.
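The likelihood ratio test described above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, the log likelihoods in the usage note are invented, and the chi-square tail is computed with the closed form that holds for even degrees of freedom. The source does not state the degrees of freedom it used; 2 is the value commonly applied to nested site-model pairs that differ by two parameters.

```python
import math


def chi2_sf_even_df(x: float, df: int) -> float:
    """Upper-tail probability of a chi-square variable with even df.

    For df = 2k the survival function has the closed form
    exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!.
    """
    if df <= 0 or df % 2:
        raise ValueError("closed form requires a positive even df")
    if x <= 0:
        return 1.0
    half = x / 2.0
    series = sum(half ** i / math.factorial(i) for i in range(df // 2))
    return math.exp(-half) * series


def likelihood_ratio_test(lnl_simple: float, lnl_complex: float, df: int):
    """Return (2 * delta-l, p-value) for a nested pair of models."""
    stat = 2.0 * (lnl_complex - lnl_simple)
    return stat, chi2_sf_even_df(stat, df)
```

For example, with hypothetical log likelihoods of -1000.0 for the simple model and -990.0 for the complex model, `likelihood_ratio_test(-1000.0, -990.0, 2)` gives a statistic of 20.0, well above the 5% critical value of about 5.99 for two degrees of freedom.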

The pairs of simple and complex models we used were: M0 (one-ratio) [54] versus M3 (discrete) [23]; M1 (neutral) versus M2 (selection) [55]; and M7 (beta) versus M8 (beta + ω) [23, 40]. Only non-conserved alignment positions predicted to be under positive selection with a posterior probability > 0.90 by all three codeml models were mapped onto a homologous protein structure (Figure 6).
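The requirement that a site be supported by all three complex models is a set intersection over per-model posterior probabilities. The sketch below illustrates it. The function name `consensus_sites` and the input layout (model name mapped to a dictionary of codon position and posterior probability) are assumptions, and the additional filter for non-conserved alignment positions is not implemented.

```python
def consensus_sites(posteriors_by_model, threshold=0.90):
    """Return sorted codon positions whose posterior exceeds `threshold`
    in every model.

    `posteriors_by_model` maps a model name to {codon_position: posterior}.
    """
    site_sets = [
        {site for site, prob in posteriors.items() if prob > threshold}
        for posteriors in posteriors_by_model.values()
    ]
    if not site_sets:
        return []
    return sorted(set.intersection(*site_sets))
```

With invented values such as `{"M3": {10: 0.99, 25: 0.95}, "M2": {10: 0.93, 25: 0.85}, "M8": {10: 0.97, 25: 0.99}}`, only codon 10 passes, because codon 25 falls below the threshold under M2.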

Intronic evolutionary rates

Using the DIALIGN-2 alignment of intronic sequences, we calculated their genetic distances using the TN93 nucleotide substitution model [56]. We then derived a phylogenetic tree based on the distance matrix using neighbour-joining methods (1000 bootstrap iterations). Numbers of nucleotide substitutions per intronic site between sequence pairs (KI) were then estimated using BASEML [38, 39], a maximum likelihood method, and the TN93 nucleotide substitution model. This analysis was implemented using the DAMBE package [57].

Structure

In order to gain insight into the functional relevance of positively selected sites, we searched protein sequences of known tertiary structure with PSI-BLAST [25] at NCBI, using a human PRAME sequence (Homo_7; UniProt: YA03_HUMAN) as the query and default parameters. Significant sequence similarity (E = 1 × 10⁻¹²) was found after three search iterations to porcine ribonuclease inhibitor, RNI (PDB code 2BNH). Two types of alignment guided the assignment of leucine-rich repeats (LRRs) to human PRAME sequences: first, the BLAST alignment, and second, the optimal and suboptimal alignments of PRAME sequence against the SMART [58] LRR HMM. RNI was first aligned to Homo_7 using these methods and adjusted manually, and then aligned to the full alignment of all human PRAMEs guided by the Homo_7 alignment. This allowed human PRAME positively selected sites (as identified by the method above) to be mapped to RNI residues. This procedure was also followed to align full-length mouse PRAMEL sequences against RNI. Protein tertiary structure was viewed, manipulated and annotated in Swiss-PdbViewer [59].

Evolutionary relationships

Phylogenetic relationships were deduced from dendrograms constructed from three types of neutral rate estimates: (i) KS values of pairwise alignments of full-length coding sequences; (ii) KS values of pairwise alignments of coding sequences from exons A, B or C; or (iii) KI values of pairwise alignments of intronic sequences from introns a or b. Dendrograms were constructed using code based on PHYLIP [60], which uses the Fitch-Margoliash criterion to build trees with contemporaneous tips. The dendrograms were visualised in NJplot and TreeView [61].
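The Fitch-Margoliash criterion scores a candidate tree by a weighted sum of squared differences between observed pairwise distances and the distances implied by the tree. The sketch below computes that score only and does not search for the best tree. The function name `fitch_margoliash_score` and the dictionary inputs are hypothetical, and all observed distances are assumed to be greater than zero.

```python
def fitch_margoliash_score(observed, fitted, power=2.0):
    """Weighted least-squares misfit: sum over pairs of (D - d)^2 / D^power.

    `observed` maps a pair of sequence names to its observed distance D
    (for example a pairwise KS or KI value); `fitted` maps the same pair to
    the path length d between the two tips on the candidate tree.
    Assumption: every observed distance is > 0.
    """
    total = 0.0
    for pair, d_obs in observed.items():
        total += (d_obs - fitted[pair]) ** 2 / d_obs ** power
    return total
```

A lower score indicates a tree whose path lengths fit the pairwise estimates more closely. Requiring contemporaneous tips additionally constrains every tip to lie at the same distance from the root.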

Copy number polymorphisms (CNPs)

The Database of Genomic Variants [62, 12, 36] was queried to determine whether the human PRAME region has previously been found to harbour CNPs.